2309 Commits

Author SHA1 Message Date
George Hotz 3e1336957d
test arange with all opts (#5923)
* test arange with all opts

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py
2024-08-05 18:38:25 -07:00
George Hotz 5d17f54e3c
fast mnist indexing (#5921)
* fast mnist indexing

* more tests

* remove those tests, new indexing rule
2024-08-05 13:55:15 -07:00
George Hotz e81c18f494
make the arange test check correctness [run_process_replay] (#5920) 2024-08-05 13:41:06 -07:00
George Hotz 8d1c884e78
capture the const pattern in both directions (#5919)
* capture the const pattern in both directions

* add regression test
2024-08-05 12:15:38 -07:00
George Hotz 42f599870c
unroll arange is broken (#5918)
* unroll arange is broken

* fix unrolled arange

* one more test
2024-08-05 12:15:07 -07:00
qazal 70949ea7e6
test cstyle compile error for max with inline const (#5838)
* test_failure_46

* GPU=1 fails too

* add test_renderer

* add failing platforms

* nv too

* assert return value
2024-08-05 19:02:16 +03:00
qazal e0c6520138
check arange fusing with VIEW and COPY (#5912)
* check arange fusing with VIEW and COPY

* gpu and clang
2024-08-05 17:09:21 +03:00
nimlgen 590b9ebb34
hcq copy queue is optional (#5909)
* hcq copy queue is optional

* one more

* this
2024-08-05 14:03:25 +03:00
George Hotz 159ac06b5b
remove unused reduce rules + improve unparented (#5908)
* remove unused reduce rules [run_process_replay]

* this work

* those tests are meaningless now
2024-08-04 18:18:27 -07:00
George Hotz d7387d31bf
remove useless reduce cases [run_process_replay] (#5907)
* remove useless reduce cases [run_process_replay]

* do_reduce cleanup

* more cleanups + no longer supported tests

* Revert "more cleanups + no longer supported tests"

This reverts commit e9f2f6ba7061f8697a308aacdc3442fa922a77f5.

* no longer supported tests

* switch ReduceOps.SUM -> BinaryOps.ADD
2024-08-04 17:11:08 -07:00
George Hotz be8958e26b
use CONTRACT before REDUCE (#5903)
* use CONTRACT before REDUCE [run_process_replay]

* support half expand

* EXPAND GEP
2024-08-04 16:17:33 -07:00
chenyu 4a65010de8
remove CUDACPU flag in tests [run_process_replay] (#5902)
no longer used
2024-08-04 16:06:38 -04:00
qazal aad9234e52
test fused precompute_freqs_cis (#5900)
* test_precompute_freqs_cis

* tiny for ci
2024-08-04 21:01:05 +03:00
chenyu c67e9887f7
support using str to specify dtype (#5897)
* support using str to specify dtype

in Tensor creation and args into `cast` and `bitcast`, and acc_dtype

* more tests
2024-08-04 12:56:28 -04:00
qazal 4c5ef2cc4f
setitem with arange fusion 1 (#5898) 2024-08-04 16:09:21 +03:00
chenyu da61dea1b2
simple failed UOp sub symbolic test case (#5894) 2024-08-03 14:27:23 -04:00
qazal 56ef9e453e
pad reduceops to the max of each dimension (#5889)
* early verify

* pad reduceops to the max of each dim

* remove the function
2024-08-03 14:03:30 +03:00
qazal 65fa86901a
indexing fusion 2 (#5888)
* arange fusion

* kernels that fuse

* tests
2024-08-03 13:13:39 +03:00
qazal af59b2eea9
tests from the indexing fusion branch (#5886) 2024-08-03 11:56:48 +03:00
chenyu d5de44340e
UOp add mod folding (#5862)
* UOp add mod folding

* that passes now
2024-08-02 18:31:46 -04:00
chenyu 41bbd3f4c1
update UOp mod reduction patterns (#5883)
prepare generic mod folding, also some test changes from mod folding pr
2024-08-02 17:43:40 -04:00
wozeparrot acadccf344
comma benchmark (#5518) 2024-08-02 14:36:54 -07:00
Elias Wahl 4a114756f6
New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
nimlgen 2777784b91
add dependency viewer to hcq profiler (#5874)
* hcq profiler support deps

* clean up

* cleaner

* cleanup

* revert this

* linter

* mypy

* add test

* sync is strange, need to take the end

* linter + test
2024-08-02 22:07:01 +03:00
George Hotz 23e8c39288
get program fields in __post_init__ [run_process_replay] (#5878)
* get program fields in __post_init__ [run_process_replay]

* remove print
2024-08-02 09:57:12 -07:00
qazal 8611fa6c99
apply opts.extra_matcher in process replay [run_process_replay] (#5877) 2024-08-02 18:07:58 +03:00
qazal 2a791f7924
fuzz uops is simpler with List[UOp] [run_process_replay] (#5875)
* remove from fuzz_uops

* update fuzz_uops.py

* add to realize.py
2024-08-02 17:28:15 +03:00
George Hotz 877e0b4ba0
define global only has the index [run_process_replay] (#5869)
* define global only has the index [run_process_replay]

* fix that linearizer test

* fix ptx

* stupid ptx fix
2024-08-01 19:01:15 -07:00
chenyu f27f949a5d
Revert "revert some UOp IDIV bound (#5863)" (#5871)
This reverts commit 0c8d202348.
2024-08-01 21:38:31 -04:00
chenyu df138bc558
Revert "revert a mod pattern (#5864)" (#5870)
This reverts commit 5c8de2d044.
2024-08-01 20:44:26 -04:00
chenyu 1b0314d9ef
Revert "remove one more UOp mod pattern (#5865)" (#5868)
This reverts commit b03b8e18c2.
2024-08-01 20:28:35 -04:00
George Hotz d73bc85ba9
UOpGraph not in renderer or Program [run_process_replay] (#5867)
* UOpGraph not in renderer or Program [run_process_replay]

* fix some tests

* fix ptx
2024-08-01 16:20:30 -07:00
chenyu b392b8edc3
increase atol and rtol test_gemm_fp16 (#5866)
* increase atol and rtol test_gemm_fp16

made it pass with NOOPT which has larger accumulated error

* revert that
2024-08-01 19:09:58 -04:00
chenyu b03b8e18c2
remove one more UOp mod pattern (#5865)
fixed UOP_IS_SYMBOLIC=1 test_failure_40
2024-08-01 18:29:04 -04:00
chenyu 5c8de2d044
revert a mod pattern (#5864)
fixed UOP_IS_SYMBOLIC=1 linearizer failure 47
2024-08-01 17:24:26 -04:00
George Hotz 2d3c7e4d4e
some TestPickleJIT tests (#5860)
* some TestPickleJIT tests

* hotfix: print which opencl device we are using
2024-08-01 12:39:59 -07:00
chenyu 0c8d202348
revert some UOp IDIV bound (#5863)
* revert some UOp IDIV bound

breaks conv with UOP_IS_SYMBOLIC, added some conv tests in CI

* those are correct

* skip slow ones
2024-08-01 15:09:06 -04:00
George Hotz 53fcac9e80 hotfix: increase time on flaky NV test 2024-08-01 10:20:07 -07:00
qazal 26d0265d66
test schedule of LazyBuffers [run_process_replay] (#5859) 2024-08-01 19:06:29 +03:00
David Hou eb91423cb4
MLB support reshape for uneven shards (#5804)
* cleaner uneven reshape

* update test
2024-08-01 02:36:03 -07:00
David González Martínez 0f09b94c43
add failing test for second order derivatives (#5772)
* add failing test

* fix lint

* fix bad merge

* fix again

* fix test

* more minimal
2024-08-01 02:34:47 -07:00
George Hotz 9d05dfb6f4
move JIT graphing into CapturedJit (#5852)
* move JIT graphing into CapturedJit

* better

* _jit_cache

* clear inputs cleanup

* test_pickle_jit with graph + cleanup

* 0 is fine to start

* support None in bufs

* alloc real buffers

* cleaner
2024-07-31 20:48:17 -07:00
chenyu 0ec732b494
test lin fail 47 for UOP_IS_SYMBOLIC (#5853)
failed arange example with UOP_IS_SYMBOLIC
2024-07-31 23:09:22 -04:00
George Hotz c6a8395f1b
CapturedJit is fun to pickle [run_process_replay] (#5851)
* CapturedJit is fun to pickle

* export input replace
2024-07-31 17:23:01 -07:00
George Hotz 72621d9e7c
count the specials in uops [run_process_replay] (#5848)
* count the specials in uops [run_process_replay]

* cleanups
2024-07-31 14:53:18 -07:00
chenyu c2ffcf6887
remove the wrong mod UOp pattern (#5847)
don't think we are hitting it because of the stride construction, and it's wrong and not needed
2024-07-31 16:24:25 -04:00
qazal 8174c438a3
pad test_failure_45 (#5846) 2024-07-31 23:08:48 +03:00
George Hotz 8672a9db3f
add test to validate lazyops dims (#5845) 2024-07-31 12:59:38 -07:00
chenyu 4fe5b95568
fix UOp ALU bound (#5844)
* fix UOp ALU bound

root cause of resnet bug, the ALU bound is only correct for scalar, not vectorized

* it can be nan...
2024-07-31 15:19:31 -04:00
nimlgen f768935be8
add RING_ALLREDUCE_THRESHOLD (#5835)
* add RING_ALLREDUCE_THRESHOLD

* benchmark

* fixes

* fix n_gpus

* unused import

* remove debug=2
2024-07-31 16:13:09 +03:00
chenyu 2e087ca8e4
UOp bound for div negative number (#5808) 2024-07-31 02:10:23 -04:00
qazal bcbd925001
hcopts failing test for fused arange kernel (#5815)
* add failure_43

* n 45
2024-07-31 09:02:44 +03:00
qazal ed556c260e
UOps.IF rules more tests (#5831)
* init tests

* split tests

* assert multiple gates simplicity
2024-07-31 00:11:02 -04:00
David Hou 492a696d14
allow specify splits in shard, handle multiple different splits in MLB.e (#5599)
* allow specify splits in shard, handle multiple different splits in MLB.e

* line width

* linter

* don't use Device in docstring

* specify size of shards instead of boundaries

* adjust docstring for specify size of shards instead of boundaries

* don't allow splits on symbolic axis?

* just allow sint in splits_to_bounds

* add message for assert

* bounds instead of splits to save lines

* fix types

* reduce diff

* fix

* tuple

* golf :(

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-30 19:33:04 -07:00
chenyu c3da458bc3
UOp if min==max folds to CONST (#5828)
* UOp if min==max folds to CONST

* fix test
2024-07-30 22:14:22 -04:00
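The idea named in the commit title above (a value whose lower and upper bounds coincide can be replaced by a constant) can be sketched roughly as follows. This is an illustrative toy, not the repository's code: `Var`, `Const`, and `fold_if_constant` are hypothetical names, and only the `vmin`/`vmax` attribute names echo identifiers that appear elsewhere in this log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Hypothetical symbolic variable with known inclusive integer bounds."""
    name: str
    vmin: int
    vmax: int


@dataclass(frozen=True)
class Const:
    """Hypothetical constant node; its bounds are just its value."""
    value: int

    @property
    def vmin(self) -> int:
        return self.value

    @property
    def vmax(self) -> int:
        return self.value


def fold_if_constant(node):
    # If the bounds coincide, the node can only take one value,
    # so it is safe to replace it with that constant.
    if node.vmin == node.vmax:
        return Const(node.vmin)
    return node
```

For example, `fold_if_constant(Var("i", 3, 3))` yields `Const(3)`, while a variable with bounds 0..3 is returned unchanged.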
George Hotz e6879035a0
work to make GEMV fast (#5824)
* work to make GEMV fast

* half8 cast

* align struct

* fix amd

* float8 is a later problem
2024-07-30 17:41:40 -07:00
chenyu 02f0be03f2
tests on UOp div negative number and arange opts (#5825) 2024-07-30 20:06:57 -04:00
George Hotz 693990a346
swap src[2] and src[3] in load [run_process_replay] (#5821)
* swap src[2] and src[3] in load [run_process_replay]

* cleanups + bugfix

* fix ptx
2024-07-30 14:04:13 -07:00
George Hotz 17a2f74412
new style load/store folder (#5784)
* remove old index reorder

* new style folder

* works better

* dedup

* one failure

* this is fine now...

* expander_rewrite

* images broken, but all else should work

* cleanups

* make tests work with old

* fix images

* cleanups + bugfix

* minor fixes

* fix gated store folding

* flip gate_creator and expander

* fix gated store

* remove unneeded rules

* lines getting close

* line count good
2024-07-30 13:17:20 -07:00
qazal 03d866b84f
UOps.IF with rewrite rules (#5812)
* expand merge

* merge barriers

* gate_folder

* test_linearizer_failures

* this can be here

* bring the new repr back

* gate_folder2

* gate_creator is better

* gate_folder

* dedup conditions

* early gate folding

* dedup barrier

* fold noop conditions

* all consts can go away

* free lines
2024-07-30 20:50:56 +03:00
chenyu defd89e8e0
unify negative shape creation to raise ValueError (#5817)
[run_process_replay]
2024-07-30 13:42:59 -04:00
P4ssenger 6742a4789a
Add check for negative dimension in view (#5790)
* add check for negative dimension in view

* add negative dim tests

* move check to tensor level

* fix error message

* move check to view create

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-30 13:26:27 -04:00
Francis Lata ce61be16f1
clean up how preprocessed folder is defined (#5813) 2024-07-30 12:35:26 -04:00
qazal 5e827e51d2
add llama3 BEAM=2 failures to test_linearizer_failures (#5553)
* skips

* opts.device

* benchmarks

* add to test_linearizer_failures

* remove hardcoded ones

* linter

* skip cpu
2024-07-30 00:37:32 +03:00
samm393 573e0f9a48
remove float division from idiv in python_alu (#5777)
* removes float division from idiv in python_alu

* add test

* cleaner logic

* pass clang unsigned literals correctly

* suffix ULL instead of U

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-29 12:14:12 -04:00
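One likely motivation for avoiding float division in an integer divide is precision: routing large integers through a float can silently change the result. The sketch below shows a C-style (truncate toward zero) integer division built only from integer operations. The name `idiv_trunc` is hypothetical and this is not claimed to be the code from the commit above.

```python
def idiv_trunc(x: int, y: int) -> int:
    """Integer division that truncates toward zero, using only integer ops.

    Raises ZeroDivisionError when y == 0 (how the real code treats a zero
    divisor is not stated in the log, so this sketch does not guess).
    """
    q = abs(x) // abs(y)                      # magnitude of the quotient
    return -q if (x < 0) != (y < 0) else q    # restore the sign


# Why float division is risky: 2**63 - 1 is not representable as a float,
# so int(x / 1) no longer equals x.
big = 2**63 - 1
float_result = int(big / 1)
int_result = idiv_trunc(big, 1)
```

Here `int_result` equals `big`, whereas `float_result` is off by one because the float conversion rounds up to `2**63`.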
samm393 2c94316bd2
ull literal support and test (#5789)
* ull literal support and test

* missing .numpy()
2024-07-29 11:50:49 -04:00
nimlgen ab3839a80a
cleanup nv/cuda compilers (#5767)
* cleanup nv/cuda compilers

* destroy prog

* small test

* fix test

* nv ptx rewrite key

* jitlink free

* ptx is part of cuda
2024-07-29 13:50:03 +03:00
chenyu e7a14f398e
more uop_symbolic tests for divmod pairs (#5785) 2024-07-28 21:27:06 -04:00
George Hotz 76d191ab94
move consts to end of add (#5783)
* move consts to end of add

* better

* fix infinite loop
2024-07-28 17:38:57 -07:00
chenyu 71a64d8252
UOps.MUL bound when one is negative (#5781)
* UOps.MUL bound when one is negative

also one more distribute_mul rule

* don't always expand
2024-07-28 19:02:47 -04:00
qazal b775db6b60
high-level benchmark timing diff (#5776)
* high level timings

benchmark times

fix defs

* use the name map

* skip last task
2024-07-28 23:42:57 +03:00
chenyu 600a39771d
fix Tensor.arange if (stop-start) and step have different signs (#5775) 2024-07-28 14:34:10 -04:00
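The edge case named in the commit above can be illustrated with a small length computation. The sketch assumes the intended behavior matches Python's `range` (an empty result when `stop - start` and `step` have different signs); the log does not spell out what the fix does, and `arange_len` is a hypothetical helper.

```python
import math


def arange_len(start: float, stop: float, step: float) -> int:
    """Number of elements in an arange-style sequence.

    Assumption: mismatched signs of (stop - start) and step give an empty
    sequence, as with Python's built-in range.
    """
    if step == 0:
        raise ValueError("step must be nonzero")
    # A negative ceil means the step moves away from stop, so clamp to 0.
    return max(0, math.ceil((stop - start) / step))
```

With this helper, `arange_len(5, 0, 1)` and `arange_len(0, 5, -1)` both give 0, while `arange_len(5, 0, -1)` gives 5.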
David González Martínez d0fd84e617
feat: allow passing gradient to .backward() to compute vjp (#5771)
* feat: allow passing gradient to .backward() to compute vjp

* fix

* refactor

* fix trailing whitespace
2024-07-28 11:13:18 -07:00
qazal e0e7293b0a
make process replay unique in retries [run_process_replay] (#5773) 2024-07-28 20:44:15 +03:00
qazal 95dda8dadf
more unmatching vectorize/gep asserts [run_process_replay] (#5760)
* merge vectorize/gep rules [run_process_replay]

* assert dtypes

* src=

* float2=(float4.x,float4.y)
2024-07-28 15:08:54 +08:00
chenyu bfbd7c5461
more generic UOp mul mod folding (#5765) 2024-07-27 20:20:35 -04:00
chenyu 80c6475757
update test_uop_symbolic to test UOp min and max (#5764)
covers #5750, #5748, #5741
2024-07-27 19:53:21 -04:00
nimlgen ed1d784077
test profiler timer sync across devs (#5751)
* test profiler timer sync across devs

* more correct

* typo
2024-07-27 16:47:37 +03:00
qazal 3e49d86c01
process replay diffs 3 things now (#5731)
* github api infra

* process replay is 3 parts now

* parse benchmarks

* add gh_token

* complete diff

* move process replay tests

* last successful run

* add tempdir

* skip master
2024-07-27 12:52:20 +03:00
qazal 57b4a8e98d
assert process replay asserts (#5737)
* assert process replay asserts

* one ci job is fine

* test: Revert "separate process replay main loop (#5734)"

This reverts commit 94d578396f.

* mac sed needs that

* Revert "test: Revert "separate process replay main loop (#5734)""

This reverts commit e4ad7684d5472a64841a66b43bc1db7c9bbbf9e8.

* disable process replay capture

* save time

* amd is tiny

* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz f8972ace38
test flops (and allow wide ALU in UOps) [run_process_replay] (#5749)
* flops test in external_test_speed_theoretical.py

* test speed theo

* min SZMAX

* allow wide ALU for things that support it

* needed for mypy
2024-07-26 21:07:28 -07:00
George Hotz 2fde2d2914 hotfix: external_test_speed_theoretical works on 24GB 2024-07-26 18:41:52 -07:00
George Hotz 829262a5ee add external_test_speed_theoretical 2024-07-26 17:45:22 -07:00
kormann a5ede535ef
NOp field name [run_process_replay] (#5742)
* rm def name

* add field name
2024-07-26 18:45:59 -04:00
George Hotz c50e374bb6
multiple locals + get_kernel_modifier + fix valid (#5739)
* multiple locals + get_kernel_modifier + fix valid

* fix test pattern matcher
2024-07-26 15:10:10 -07:00
chenyu dc7483ee6f
UOp simple div folding (#5740)
made UOp.divides return the Optional[quotient] and used it for simple div folding
2024-07-26 17:14:32 -04:00
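The "return the quotient or nothing" shape described in the commit above can be sketched on a toy expression type. Everything here is an assumption for illustration: expressions are plain ints, variable names as strings, or tuples `("add", a, b)` / `("mul", a, b)`, and this free function `divides` is not the repository's `UOp.divides`.

```python
from typing import Optional, Union

Expr = Union[int, str, tuple]  # hypothetical toy expression type


def divides(e: Expr, c: int) -> Optional[Expr]:
    """Return e / c as an expression if c (nonzero) provably divides e, else None."""
    if isinstance(e, int):
        return e // c if e % c == 0 else None
    if isinstance(e, tuple):
        op, a, b = e
        if op == "add":
            # A sum is divisible if both terms are.
            qa, qb = divides(a, c), divides(b, c)
            if qa is not None and qb is not None:
                return ("add", qa, qb)
            return None
        if op == "mul":
            # A product is divisible if either factor is.
            qa = divides(a, c)
            if qa is not None:
                return ("mul", qa, b)
            qb = divides(b, c)
            if qb is not None:
                return ("mul", a, qb)
    return None  # bare variables: divisibility unknown


# Simple div folding: (x*6 + 4) // 2 becomes x*3 + 2 when divides() succeeds.
folded = divides(("add", ("mul", "x", 6), 4), 2)
```

Returning the quotient directly (rather than a bool) lets the caller use the result as the folded expression without recomputing it; note the explicit `is not None` checks, since a quotient of 0 is a valid answer.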
chenyu 671259417f
reuse UOp `__repr__` for NOp (#5738) 2024-07-26 16:59:55 -04:00
kormann b0c1dba299
named UOp class "NOP" [run_process_replay] (#5728)
* NOP

* fix const + simplify compile

* rm VAR for NOOP

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00
George Hotz 4df46eac67
clean up tensor cores [run_process_replay] (#5736)
* clean up tensor cores [run_process_replay]

* remove tuple(wmma_sz), self.opts.device

* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
qazal 94d578396f
separate process replay main loop (#5734)
* separate process replay main loop

* [run_process_replay]

* add kernel_changed

* test with [run_process_replay]

* revert temp [run_process_replay]
2024-07-26 21:43:08 +03:00
chenyu a4e9ebc68a
update test_uop_symbolic (#5733)
enabled more passed tests
2024-07-26 13:46:09 -04:00
chenyu 2cc55a3095
UOp simple mul add div fold (#5726) 2024-07-25 22:00:30 -04:00
chenyu 5521b6d437
UOp simple mul-add-lt fold (#5721) 2024-07-25 20:49:38 -04:00
qazal 1b53207b4f
revert isolated dags scheduling (#5724) 2024-07-25 19:45:12 -04:00
chenyu 845b0d1c9d
UOp more generic div folding (#5722)
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
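The old and new folding conditions quoted in the commit above can be compared side by side in a short sketch. `Bounds`, `fold_div_old`, and `fold_div_new` are hypothetical names; the sketch uses Python's floor division, which is monotonic, so equal quotients at both endpoints imply the same quotient across the whole range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    """Hypothetical stand-in for a value x with inclusive integer bounds."""
    vmin: int
    vmax: int


def fold_div_old(x: Bounds, c: int) -> Optional[int]:
    # Old rule: x // c folds (to 0) only if 0 <= x.vmin <= x.vmax < c.
    return 0 if 0 <= x.vmin <= x.vmax < c else None


def fold_div_new(x: Bounds, c: int) -> Optional[int]:
    # New rule: x // c folds if 0 < c and both endpoints share a quotient.
    if 0 < c and x.vmin // c == x.vmax // c:
        return x.vmin // c
    return None
```

For x in 8..11 and c = 4, the old rule gives up (`None`) while the new rule folds to 2; for x in 0..3 both rules fold to 0, so the new condition strictly generalizes the old one.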
chenyu a82815262c
more test_pattern_matcher fixups (#5714) 2024-07-25 14:12:21 -04:00
chenyu 05e02ddfb3
fixup test_pattern_matcher (#5712) 2024-07-25 13:48:52 -04:00
qazal 9ceb3a3d1f
beautiful_mnist -4.3% kernels (#5709)
* add is_complete

* partially delete forced_realized

* p2

* start

* refactor to can_group

* remove steps

* _get_inputs is nicer

* fix the cache

* cache is dict now

* rename to group
2024-07-25 20:30:49 +03:00
kormann 1e2eac755d
Fix repr upat (#5705)
* test

* fix

* x fix

* simpler

* rm extra space
2024-07-25 12:05:48 -04:00
qazal 1c992de257
hotfix: compare_schedule defaults to false (#5707) 2024-07-25 17:08:28 +03:00