Commit Graph

566 Commits

Author SHA1 Message Date
George Hotz 0f28e93224
add pickle support for pattern matchers [run_process_replay] (#6816)
* add pickle support for pattern matchers [run_process_replay]

* cleaner and all

* no closures

* fix tests

* revert that

* final

* cleaner

* python 3.8 fix

* add round trip back

* this

* waste lines on this. that's the final line count

* max print better

* more targeted fix

* regrettably add 3.8 support
2024-09-30 21:54:46 +08:00
wozeparrot 2b899164c6
no numpy (#6751) 2024-09-26 16:40:18 +08:00
wozeparrot c100f3d406
default threefry (#6116) 2024-09-25 17:45:13 +08:00
George Hotz dd575da7ee
real minimum cstyle change (#6709)
* real minimum cstyle change

* make it match

* bring back DEFINE_GLOBAL store marking writable

* bump line count to 9800

* closer

* precompute don't render

* cast/bitcast too

* smem_align

* vectorize

* more pr match

* remove that test

* less PR diff
2024-09-25 12:40:46 +08:00
George Hotz f45d178a55 hotfix: support JIT_BATCH_SIZE=0, make that the default 2024-09-25 10:36:04 +08:00
George Hotz 52e7f1c108 add new model CI 2024-09-25 10:23:06 +08:00
George Hotz b0ffe2452b bump line count to 9800 2024-09-25 09:15:30 +08:00
George Hotz de259e3f09 hotfix: add compile3 to comma CI 2024-09-23 18:25:49 +08:00
qazal e2d6e10ddf
hotfix: reset benchmarks cache for process replay (#6671) 2024-09-23 15:13:02 +08:00
chenyu 26ebb7cab4
don't use div_folding in lt_folding (#6666)
* don't use div_folding in lt_folding

valids 35 -> 13

* fails the same as before
2024-09-23 01:50:18 -04:00
chenyu da5b741656
removed valid in openpilot conv (#6619)
35 valids left
2024-09-23 00:30:18 -04:00
chenyu 1923932339
canonicalize simplex lt (#6658)
(X := a0*x0 + a1*x1 + ...) > 0 is equivalent to x0 + x1 + ... > 0 if xi >= 0 and ai > 0 for ints
2024-09-22 23:04:47 -04:00
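The equivalence stated in #6658 can be sanity-checked by brute force over small integers. This is only an illustrative sketch, not tinygrad code: the helper names `weighted_sum_positive` and `plain_sum_positive` are hypothetical, and the ranges are arbitrary small bounds chosen to keep the search fast.

```python
from itertools import product

def weighted_sum_positive(coeffs, xs):
    # a0*x0 + a1*x1 + ... > 0
    return sum(a * x for a, x in zip(coeffs, xs)) > 0

def plain_sum_positive(xs):
    # x0 + x1 + ... > 0
    return sum(xs) > 0

# Exhaustively compare both predicates for ai in 1..3 and xi in 0..3 (three terms).
mismatches = [
    (coeffs, xs)
    for coeffs in product(range(1, 4), repeat=3)
    for xs in product(range(0, 4), repeat=3)
    if weighted_sum_positive(coeffs, xs) != plain_sum_positive(xs)
]
print("mismatches with xi >= 0 and ai > 0:", len(mismatches))

# The xi >= 0 precondition matters: with a negative x the two sides can disagree.
print(weighted_sum_positive((2, 1), (1, -1)), plain_sum_positive((1, -1)))
```

With non-negative `xi` and positive `ai`, both sides are true exactly when at least one `xi` is nonzero, which is why the coefficients can be dropped.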
chenyu 5707503048
x//a<b -> x<a*b for positive a (#6622)
openpilot valids 47 -> 37
2024-09-20 04:38:47 -04:00
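The rewrite in #6622 can be checked the same way. The sketch below is an assumption-laden illustration rather than tinygrad code: it uses Python's `//`, which floors toward negative infinity, and the helper name `floordiv_lt_holds` is hypothetical. Under truncating division the identity may not hold for negative `x`, so that case is not covered here.

```python
from itertools import product

def floordiv_lt_holds(x: int, a: int, b: int) -> bool:
    # True when (x // a < b) and (x < a * b) agree for this triple.
    return (x // a < b) == (x < a * b)

# Exhaustive search over small integers with a strictly positive divisor.
counterexamples = [
    (x, a, b)
    for x, a, b in product(range(-20, 21), range(1, 8), range(-20, 21))
    if not floordiv_lt_holds(x, a, b)
]
print("counterexamples for positive a:", len(counterexamples))

# The "positive a" condition is required: a negative divisor breaks the identity.
print("holds for x=1, a=-1, b=0:", floordiv_lt_holds(1, -1, 0))
```

For integer `b`, `floor(x / a) < b` holds exactly when `x / a < b`, and multiplying both sides by a positive `a` preserves the inequality.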
chenyu b14c1bc417
UOps.RANGE is_increasing (#6615)
* UOps.RANGE is_increasing

283 -> 47 valids

* test
2024-09-20 03:14:52 -04:00
chenyu 036c2f5b26
validhack use the new style ge for upper bound valid (#6612)
also relaxed the bound check to check vmin/vmax instead of just const.
valids 482 -> 283
2024-09-19 23:45:42 -04:00
George Hotz a1a882b006
arange folding with new ge (#6604)
* arange folding with new ge

* bump allowed gated

* bump allowed speed
2024-09-19 18:01:28 +08:00
chenyu d148a62f8d
more generic simplify_valid_image_load (#6603)
use graph_rewrite to simplify the expression with narrowed variables, and check boundary conditions on a monotonically increasing function to drop valid.
2024-09-19 05:33:37 -04:00
chenyu 162ead02a9
remove LOAD where valid is an empty set (#6579)
356 -> 354 valids
2024-09-18 03:49:41 -04:00
chenyu a72d51e277
brute force VALIDHACK matching (#6575)
* brute force VALIDHACK matching

* cleanup

* 9700
2024-09-18 01:59:50 -04:00
qazal d8e5d5c663
move VIZ=1 tests to fuzzers (#6574) 2024-09-18 12:12:03 +08:00
George Hotz 67a03e72bb
remove expr_idxs [run_process_replay] (#6567)
* remove expr_idxs [run_process_replay]

* goodbye that test
2024-09-17 18:34:51 +08:00
chenyu 5fb877c78c
generic valid match criteria of #6552 (#6558)
455 -> 364 valids.
generalize `idx < image bound` to `idx < image bound + c` for some `c`
2024-09-17 02:40:36 -04:00
George Hotz 0ab06d5840
push geps through wmma (#6559)
* push geps through wmma

* update tests
2024-09-17 14:38:40 +08:00
chenyu 7c942418a1
other side of simple out of bound valid case (#6552)
462 -> 455
2024-09-16 23:57:15 -04:00
chenyu aeaf7894a7
more generic version of #6548 (#6549)
x*(-1)<0 can be generalized to x*(-1)<c, 473 -> 462 valids
2024-09-16 23:17:16 -04:00
chenyu 596f41eb46
simple drop image valid case (#6548)
* simple drop image valid case

started unit test, 530 -> 473 valids

* cleanup
2024-09-16 22:54:07 -04:00
chenyu 798be6bb74
add gated read_image count in openpilot compile2 (#6546)
530 to go
2024-09-16 21:17:00 -04:00
George Hotz cd90092f14
graph rewrite tests (#6519)
* more graph rewrite tests

* more complex test cases

* more tests

* more tests

* cleanups

* 9600 lines

* cleanups
2024-09-15 17:29:16 +08:00
qazal c5bae55ec8
new generate_dataset.sh (#6423)
* new generate_dataset.sh

* keep those there

* test: rm expected failures

* rename to extract
2024-09-09 15:13:07 +08:00
George Hotz 4b128da525 hotfix: line count to 9500 2024-09-06 09:10:03 +08:00
ignaciosica c15506fc35
[WIP] amx support as TC (#5693)
* almost working with relu, even hackable... but acc size is wrong, fix needed

* upcast based on threads, change thread size to 4x4

* revert wrongfully commented assert

* fix tc load indexing

* modify for size 8

* fix bug for size 8

* Revert "fix bug for size 8"

This reverts commit cdb3f5df85b6116e8bef10214647a9201c400655.

* Revert "modify for size 8"

This reverts commit 3ef0904bd96291c7a3a351c702fba2905c196bcc.

* good kernel with changes in lowerer

* revert "good kernel with changes in lowerer"

This reverts commit 975e2b5a4ecfe475370e88ce9db78b2d42e4c4d4.

* good kernel for relu!

* refactor lowerer changes

* add amx context var to helper

* clean up amx flag

* improve lowerer changes readability

* improve check for amx

* revert lowerer if

* add float4 type rendering for clang

* add amx definitions

* enable indexing for clang if amx

* working amx example, wrong because of dims

* almost works for float 16, need to spot using double load in amx

* cleaner render_kernel

* revert changes in simple_matmul and delete env

* add new var upcast_offset to get_optimized_ast

* change axis for axes

* invert if in rendering phi

* fix some bugs

* fix linearizer tests

* fix vec/get pat for amx

* remove clang tc if amx is disabled

* add ops_python support

* refactor into one complementary function in ops_python

* add job for EMULATE_AMX

* improve checking for AMX in UPCAST and TC extra ops

* fix lint issue

* commit before refactor into autocontained AMX

* start refactor by removing special rendering for AMX

* all ready for amx handcoded kernel

* working poc, most straightforward amx support

* avoid local opts for tc if amx

* fix merge bugs

* skip test for clang

* skip tc hand-coded opts if amx

* remove hardcoded ops_python values

* remove hardcoded sizes for amx kernel

* fix ops_python bug where dim was hard-coded

* change contract for vectorize

* working without changes in lowerer

* revert changes in gep rendering

* fix ops_python

* modify comment

* skip test if clang for different type accumulation

* move rename and bug for separate pr

* fix wrong path for test

* addmm not implemented in torch for cpu

* change struct for vector; equally slow but cleaner

* revert modified test

* simplify wmma rendering

* minor change

* noqa:501

* add length 16 for AMX

* fix vectorized half issue

* fix error

* remove comment

* change set for dedup

* split test of tensor_core_extra_ops so that cases that dont require locals run for AMX

* add amx reference

* load acc into amx registers

* fix dtype rendering and remove noqa

* moved tests change into another pr

* add real AMX job for CI and fix bug

* fix ops_python bug

* fix test class

* remove real AMX tests and fix uops_stats test

* remove wrong test

* acc folding

* hotfix: bug

* fix float4 tests for amx

* hack for fixing flops counting

* hotfix: mypy

* add flop counts test for amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_unaligned_load_amx

* nits tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-06 09:01:10 +08:00
nimlgen d22b46a2ac
qcom in benchmarks (#6337) 2024-09-02 19:59:11 +03:00
nimlgen 8e2a3fc165
raise lines count to 9300 for qcom (#6336) 2024-09-02 18:57:57 +03:00
George Hotz 365babe391
precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00
CaltropHungerton 002f60b4c3
fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192)
* fix wmma flop counting on intel, add count tests

* half

* add half gemm

* Update test.yml

* one test

* Update test_uops_stats.py

* Update test_uops_stats.py

* Update test_uops_stats.py

* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
chenyu e745e16441
remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
CaltropHungerton 38fb1e14a2
Intel XMX Tensor Core Support (#5622)
* fixed xmx demo

* i think i'm invoking the DPAS but it's slow

* compiler build arg to stop register spilling, indicated where to fix flop counter

* don't mind this

* do NOT mind me

* do not mind me

* do not view

* i will add bf16 later

* in process of figuring out tc fields

* we figured out the fields!!!

* added check for cl device vendor, added separate IntelRenderer

* remove tc thread_local_aliases

* cleaning debris before draft pr

* edits for linter

* deduping and checking device extensions

* i will find more line reductions in other places

* before merge upstream

* double grf size in compiler to fix register spilling (bandaid), device checking changes

* tc python emulation

* fixed emulation

* tests for emulated intel tensor core

* TC=0, 1 working on upstream, fixed perf

* test

* debris

* check for specialized cl device when we canonicalize device

* bf16 support, tc=3 test added

* address tests

* revert half2 loads on intel tc, cleanup

* linter

* fold_expanded revert

* lint, whitespace fix

* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too

* make line shorter, no need for noqa E501

* removed device intel

* fix python emulation

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
George Hotz e8ae9af962 bump line count to 9000. we should be here a while 2024-08-16 08:46:36 -07:00
chenyu 7d46fb0c83
load balance NV benchmark ci (#6107) 2024-08-16 10:08:08 -04:00
chenyu a41c9dd12c
test py.typed as a package (#6094)
* test py.typed as a package

* try this?

* and this

* try that?

* add this back

* cleanup
2024-08-15 11:19:08 -04:00
qazal 30035df5a4
add metal process replay back (#6068)
test this new one
2024-08-14 12:29:56 +03:00
qazal 9d2ea94fe9
temp: disable process replay on metal (#6062) 2024-08-13 16:31:55 +03:00
nimlgen 8f787785d9
fix openpilot benchmark (#6049) 2024-08-12 21:12:32 +03:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
George Hotz cfb04c67d1
run unit tests separate from others (and only once) (#6020)
* run unit tests separate from others

* ignore unit tests elsewhere
2024-08-10 11:17:56 -07:00
qazal 266afad8ed
hotfix: skip schedule capture in benchmarks (#6012) 2024-08-10 17:13:53 +03:00
qazal 24c7c41ce0
diff LazyBuffer schedules in process replay (#5996)
* start diff printing

* this should be 2

* add to process_replay.py

* enable schedule capture

* arange diff is process replay
2024-08-09 14:16:43 +03:00
George Hotz 3d445039c2 hotfix: 8800 lines for AMX+intel tc 2024-08-06 17:50:26 -07:00
chenyu adba5efc64
enable llama 2 70B in tinybox green CI (#5905)
runnable with MAX_CONTEXT=256
2024-08-04 18:48:46 -04:00
George Hotz 7348c40d9d
sampling time sync (8700 lines) (#5843)
* sampling time sync

* jitter matrix

* comment

* pass mypy

* line count
2024-08-02 14:44:35 -07:00