Commit Graph

2309 Commits

Author SHA1 Message Date
chenyu 2e087ca8e4
UOp bound for div negative number (#5808) 2024-07-31 02:10:23 -04:00
qazal bcbd925001
hcopts failing test for fused arange kernel (#5815)
* add failure_43

* n 45
2024-07-31 09:02:44 +03:00
qazal ed556c260e
UOps.IF rules more tests (#5831)
* init tests

* split tests

* assert multiple gates simplicity
2024-07-31 00:11:02 -04:00
David Hou 492a696d14
allow specify splits in shard, handle multiple different splits in MLB.e (#5599)
* allow specify splits in shard, handle multiple different splits in MLB.e

* line width

* linter

* don't use Device in docstring

* specify size of shards instead of boundaries

* adjust docstring for specify size of shards instead of boundaries

* don't allow splits on symbolic axis?

* just allow sint in splits_to_bounds

* add message for assert

* bounds instead of splits to save lines

* fix types

* reduce diff

* fix

* tuple

* golf :(

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-30 19:33:04 -07:00
chenyu c3da458bc3
UOp if min==max folds to CONST (#5828)
* UOp if min==max folds to CONST

* fix test
2024-07-30 22:14:22 -04:00
George Hotz e6879035a0
work to make GEMV fast (#5824)
* work to make GEMV fast

* half8 cast

* align struct

* fix amd

* float8 is a later problem
2024-07-30 17:41:40 -07:00
chenyu 02f0be03f2
tests on UOp div negative number and arange opts (#5825) 2024-07-30 20:06:57 -04:00
George Hotz 693990a346
swap src[2] and src[3] in load [run_process_replay] (#5821)
* swap src[2] and src[3] in load [run_process_replay]

* cleanups + bugfix

* fix ptx
2024-07-30 14:04:13 -07:00
George Hotz 17a2f74412
new style load/store folder (#5784)
* remove old index reorder

* new style folder

* works better

* dedup

* one failure

* this is fine now...

* expander_rewrite

* images broken, but all else should work

* cleanups

* make tests work with old

* fix images

* cleanups + bugfix

* minor fixes

* fix gated store folding

* flip gate_creator and expander

* fix gated store

* remove unneeded rules

* lines getting close

* line count good
2024-07-30 13:17:20 -07:00
qazal 03d866b84f
UOps.IF with rewrite rules (#5812)
* expand merge

* merge barriers

* gate_folder

* test_linearizer_failures

* this can be here

* bring the new repr back

* gate_folder2

* gate_creator is better

* gate_folder

* dedup conditions

* early gate folding

* dedup barrier

* fold noop conditions

* all consts can go away

* free lines
2024-07-30 20:50:56 +03:00
chenyu defd89e8e0
unify negative shape creation to raise ValueError (#5817)
[run_process_replay]
2024-07-30 13:42:59 -04:00
P4ssenger 6742a4789a
Add check for negative dimension in view (#5790)
* add check for negative dimension in view

* add negative dim tests

* move check to tensor level

* fix error message

* move check to view create

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-30 13:26:27 -04:00
Francis Lata ce61be16f1
clean up how preprocessed folder is defined (#5813) 2024-07-30 12:35:26 -04:00
qazal 5e827e51d2
add llama3 BEAM=2 failures to test_linearizer_failures (#5553)
* skips

* opts.device

* benchmarks

* add to test_linearizer_failures

* remove hardcoded ones

* linter

* skip cpu
2024-07-30 00:37:32 +03:00
samm393 573e0f9a48
remove float division from idiv in python_alu (#5777)
* removes float division from idiv in python_alu

* add test

* cleaner logic

* pass clang unsigned literals correctly

* suffix ULL instead of U

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-29 12:14:12 -04:00
samm393 2c94316bd2
ull literal support and test (#5789)
* ull literal support and test

* missing .numpy()
2024-07-29 11:50:49 -04:00
nimlgen ab3839a80a
cleanup nv/cuda compilers (#5767)
* cleanup nv/cuda compilers

* destroy prog

* small test

* fix test

* nv ptx rewrite key

* jitlink free

* ptx is part of cuda
2024-07-29 13:50:03 +03:00
chenyu e7a14f398e
more uop_symbolic tests for divmod pairs (#5785) 2024-07-28 21:27:06 -04:00
George Hotz 76d191ab94
move consts to end of add (#5783)
* move consts to end of add

* better

* fix infinite loop
2024-07-28 17:38:57 -07:00
chenyu 71a64d8252
UOps.MUL bound when one is negative (#5781)
* UOps.MUL bound when one is negative

also one more distribute_mul rule

* don't always expand
2024-07-28 19:02:47 -04:00
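The MUL bound case above can be sketched with plain interval arithmetic (a minimal sketch on Python ints; `mul_bounds` is a hypothetical helper, not tinygrad's actual UOp API):

```python
from itertools import product

# bounds of x*y given bounds (vmin, vmax) for each operand: when an operand
# can be negative, the min/max must be taken over all four endpoint products,
# not just vmin*vmin and vmax*vmax
def mul_bounds(x: tuple, y: tuple) -> tuple:
    ps = [a * b for a, b in product(x, y)]
    return (min(ps), max(ps))

assert mul_bounds((2, 3), (4, 5)) == (8, 15)
assert mul_bounds((-3, 2), (4, 5)) == (-15, 10)  # a negative lower bound flips the min
```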
qazal b775db6b60
high-level benchmark timing diff (#5776)
* high level timings

benchmark times

fix defs

* use the name map

* skip last task
2024-07-28 23:42:57 +03:00
chenyu 600a39771d
fix Tensor.arange if (stop-start) and step have different signs (#5775) 2024-07-28 14:34:10 -04:00
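The sign fix above amounts to clamping the computed element count at zero (a hedged sketch of the numpy-style length formula; `arange_len` is a hypothetical helper, not tinygrad's code):

```python
import math

def arange_len(start, stop, step):
    # number of elements a numpy-style arange produces; clamps to 0 when
    # (stop - start) and step have different signs
    return max(math.ceil((stop - start) / step), 0)

assert arange_len(0, 10, 2) == 5
assert arange_len(10, 0, 2) == 0   # different signs -> empty
assert arange_len(10, 0, -3) == 4  # 10, 7, 4, 1
```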
David González Martínez d0fd84e617
feat: allow passing gradient to .backward() to compute vjp (#5771)
* feat: allow passing gradient to .backward() to compute vjp

* fix

* refactor

* fix trailing whitespace
2024-07-28 11:13:18 -07:00
qazal e0e7293b0a
make process replay unique in retries [run_process_replay] (#5773) 2024-07-28 20:44:15 +03:00
qazal 95dda8dadf
more unmatching vectorize/gep asserts [run_process_replay] (#5760)
* merge vectorize/gep rules [run_process_replay]

* assert dtypes

* src=

* float2=(float4.x,float4.y)
2024-07-28 15:08:54 +08:00
chenyu bfbd7c5461
more generic UOp mul mod folding (#5765) 2024-07-27 20:20:35 -04:00
chenyu 80c6475757
update test_uop_symbolic to test UOp min and max (#5764)
covers #5750, #5748, #5741
2024-07-27 19:53:21 -04:00
nimlgen ed1d784077
test profiler timer sync across devs (#5751)
* test profiler timer sync across devs

* more correct

* typo
2024-07-27 16:47:37 +03:00
qazal 3e49d86c01
process replay diffs 3 things now (#5731)
* github api infra

* process replay is 3 parts now

* parse benchmarks

* add gh_token

* complete diff

* move process replay tests

* last successful run

* add tempdir

* skip master
2024-07-27 12:52:20 +03:00
qazal 57b4a8e98d
assert process replay asserts (#5737)
* assert process replay asserts

* one ci job is fine

* test: Revert "separate process replay main loop (#5734)"

This reverts commit 94d578396f.

* mac sed needs that

* Revert "test: Revert "separate process replay main loop (#5734)""

This reverts commit e4ad7684d5472a64841a66b43bc1db7c9bbbf9e8.

* disable process replay capture

* save time

* amd is tiny

* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz f8972ace38
test flops (and allow wide ALU in UOps) [run_process_replay] (#5749)
* flops test in external_test_speed_theoretical.py

* test speed theo

* min SZMAX

* allow wide ALU for things that support it

* needed for mypy
2024-07-26 21:07:28 -07:00
George Hotz 2fde2d2914 hotfix: external_test_speed_theoretical works on 24GB 2024-07-26 18:41:52 -07:00
George Hotz 829262a5ee add external_test_speed_theoretical 2024-07-26 17:45:22 -07:00
kormann a5ede535ef
NOp field name [run_process_replay] (#5742)
* rm def name

* add field name
2024-07-26 18:45:59 -04:00
George Hotz c50e374bb6
multiple locals + get_kernel_modifier + fix valid (#5739)
* multiple locals + get_kernel_modifier + fix valid

* fix test pattern matcher
2024-07-26 15:10:10 -07:00
chenyu dc7483ee6f
UOp simple div folding (#5740)
made UOp.divides return the Optional[quotient] and used it for simple div folding
2024-07-26 17:14:32 -04:00
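The `Optional[quotient]` idea reads like this on plain ints (a minimal sketch; tinygrad's `UOp.divides` operates on UOps, not Python ints):

```python
from typing import Optional

def divides(x: int, c: int) -> Optional[int]:
    """Return the quotient when c exactly divides x, else None."""
    return x // c if c != 0 and x % c == 0 else None

assert divides(12, 4) == 3
assert divides(12, 5) is None  # no exact quotient, so no fold
```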
chenyu 671259417f
reuse UOp `__repr__` for NOp (#5738) 2024-07-26 16:59:55 -04:00
kormann b0c1dba299
named UOp class "NOP" [run_process_replay] (#5728)
* NOP

* fix const + simplify compile

* rm VAR for NOOP

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00
George Hotz 4df46eac67
clean up tensor cores [run_process_replay] (#5736)
* clean up tensor cores [run_process_replay]

* remove tuple(wmma_sz), self.opts.device

* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
qazal 94d578396f
separate process replay main loop (#5734)
* separate process replay main loop

* [run_process_replay]

* add kernel_changed

* test with [run_process_replay]

* revert temp [run_process_replay]
2024-07-26 21:43:08 +03:00
chenyu a4e9ebc68a
update test_uop_symbolic (#5733)
enabled more passed tests
2024-07-26 13:46:09 -04:00
chenyu 2cc55a3095
UOp simple mul add div fold (#5726) 2024-07-25 22:00:30 -04:00
chenyu 5521b6d437
UOp simple mul-add-lt fold (#5721) 2024-07-25 20:49:38 -04:00
qazal 1b53207b4f
revert isolated dags scheduling (#5724) 2024-07-25 19:45:12 -04:00
chenyu 845b0d1c9d
UOp more generic div folding (#5722)
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
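The old and new rules above can be sketched as a standalone check (a minimal sketch assuming integer bounds; `can_fold_div` is a hypothetical helper, not tinygrad's API):

```python
def can_fold_div(vmin: int, vmax: int, c: int) -> bool:
    # old rule: 0 <= vmin <= vmax < c, i.e. the quotient is always 0
    # new rule: the quotient is the same constant across the whole range,
    # so x // c can be replaced by vmin // c
    return c > 0 and vmin // c == vmax // c

assert can_fold_div(0, 3, 4)      # 0 <= x < 4  =>  x // 4 == 0 (old rule)
assert can_fold_div(5, 7, 4)      # 5..7 all have quotient 1; old rule missed this
assert not can_fold_div(3, 4, 4)  # quotients 0 and 1 differ, cannot fold
```

The old rule is the special case where the shared quotient is 0.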
chenyu a82815262c
more test_pattern_matcher fixups (#5714) 2024-07-25 14:12:21 -04:00
chenyu 05e02ddfb3
fixup test_pattern_matcher (#5712) 2024-07-25 13:48:52 -04:00
qazal 9ceb3a3d1f
beautiful_mnist -4.3% kernels (#5709)
* add is_complete

* partially delete forced_realized

* p2

* start

* refactor to can_group

* remove steps

* _get_inputs is nicer

* fix the cache

* cache is dict now

* rename to group
2024-07-25 20:30:49 +03:00
kormann 1e2eac755d
Fix repr upat (#5705)
* test

* fix

* x fix

* simpler

* rm extra space
2024-07-25 12:05:48 -04:00
qazal 1c992de257
hotfix: compare_schedule defaults to false (#5707) 2024-07-25 17:08:28 +03:00
qazal 489cda827a
more scheduler process replay tooling (#5706)
* more scheduler process replay tooling

* refactor to compare_schedule
2024-07-25 15:47:18 +03:00
qazal 4e070a2c89
start work on indexing fusion (#5590)
* start base

* the views add up

base reduceop st:
ShapeTracker(views=(View(shape=(60000, 1), strides=(1, 0), offset=0, mask=None, contiguous=True),))

top st:

ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

merged buf.st+st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

* p1

* some cleanups

* more cleanups

* one kernel

* more

* late fuse arange

* less lines

* more work

* fix st strides 1

* update test_schedule, start argmax

* test_tiny_argmax

* add FUSE_ARANGE

* more cleanup

* add utils

* reduce merging

* fix axis and fold if needed

* more fusion

* need to figure this out

* now fixing all of these

* todos+save a line

* ready for p1
2024-07-25 13:23:38 +03:00
nimlgen 08f47d7dc3
more info on failure 41 (#5704) 2024-07-25 12:14:28 +03:00
nimlgen 69d4f474d8
amd resnet pf (#5703) 2024-07-25 11:21:22 +03:00
chenyu 46e1151c02
UOp more generic mul -> mod folding (#5698) 2024-07-24 21:41:25 -04:00
chenyu 66a9c372af
UOp mod reduction (#5697) 2024-07-24 20:36:00 -04:00
chenyu 8648fb2636
UOp vmin/vmax on ADD (#5689) 2024-07-24 19:09:42 -04:00
chenyu 85710e86cb
UOps div folding (#5690)
#5689, with just div folding and new test cases
2024-07-24 14:21:44 -04:00
chenyu a7a77dfd83
UOp mul lt fold (#5677) 2024-07-24 02:49:25 -04:00
chenyu 4e85761d40
UOp mod folding (#5668) 2024-07-24 00:10:47 -04:00
George Hotz 053550c3f3
remove MERGE opt, cleanup wmma upcast (#5669)
* remove MERGE opt, cleanup wmma upcast

* upcast first

* fix broken vectorize folding rule
2024-07-23 20:43:42 -07:00
chenyu 3060e0be4f
add vmin vmax of SPECIAL (#5670)
* add vmin vmax of SPECIAL

folded stuff like (-1 < gidx0)

* flaky
2024-07-23 22:55:54 -04:00
George Hotz fa14f7b4fd
switch contract arg to match expand arg [run_process_replay] (#5667)
* switch contract arg to match expand arg [run_process_replay]

* support multiaxis contract too, it's easy

* cancel contract/expand
2024-07-23 18:08:33 -07:00
George Hotz a85493bdbe multiaxis contract test 2024-07-23 15:09:15 -07:00
George Hotz e3f00ac77d
Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand

* fix test emulated CUDA tensor cores

* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu 16c27ae400
update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
chenyu 01fe00e055
skip test_failure_39 in CI (#5660)
took more than 2 minutes in ci metal, it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu 199b3bf02b
simple UOp lt/ge folding (#5657)
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
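Bounds-based lt folding can be sketched like this (a hedged sketch; `fold_lt` is a hypothetical helper and the real rule works on UOp vmin/vmax):

```python
import math

def fold_lt(vmin: float, vmax: float, c: float):
    """Fold x < c to a constant when the bounds [vmin, vmax] of x decide it."""
    if vmax < c: return True    # every possible x is below c
    if vmin >= c: return False  # no possible x is below c
    return None                 # undecidable from bounds alone

# nothing is below -inf, so the trivial x < -math.inf always folds to False
assert fold_lt(0, 10, -math.inf) is False
assert fold_lt(0, 10, 100) is True
assert fold_lt(0, 10, 5) is None
```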
qazal b0fc5a4c6f
start scheduler process replay (#5656) 2024-07-23 20:02:51 +03:00
chenyu e210c87b4a
uop mod-mod simplification (#5650) 2024-07-23 12:33:55 -04:00
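One instance of a mod-mod rewrite (a hedged sketch; the actual rule lives in tinygrad's pattern matcher): for nonnegative `x`, `(x % a) % b` reduces to `x % b` whenever `b` divides `a`:

```python
import random

# for nonnegative x: if x = q*a + r with 0 <= r < a and a % b == 0,
# then (x % a) % b == r % b == (q*a + r) % b == x % b
def can_fold_mod_mod(a: int, b: int) -> bool:
    return a % b == 0

random.seed(0)
for _ in range(1000):
    x, b = random.randrange(10**6), random.randrange(1, 100)
    a = b * random.randrange(1, 100)  # construct a so that b divides a
    assert (x % a) % b == x % b
```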
nimlgen 1384f08cd4
hcq profile tests (#5654)
* profile tests

* fixes

* remove linter
2024-07-23 18:40:33 +03:00
qazal 5f394fc9c6
more work toward non-blocking process replay (#5653)
* non-blocking process replay

* more actionable

* test it

* revert the test

* %s/logging.warn/logging.warning
2024-07-23 14:26:31 +03:00
qazal 7cb67e6fb2
merge gated stores spec (#5652)
* test_unmerged_ifs should merge ifs

* test_tiny_gate_store

* test_merge_ifs_alt

* assert assert asserts
2024-07-23 18:53:27 +08:00
George Hotz 7c4b177e3a
add tests for uops stats (#5649)
* add tests for uops stats

* no locals skip is fine

* eh
2024-07-22 21:57:03 -07:00
chenyu 4f83da626e
uop symbolic simple mul mod (#5648) 2024-07-22 23:17:41 -04:00
chenyu f2d2afdaa4
dumb linearizer example that max is not simplified (#5644)
* dumb linearizer example that max is not simplified

this might just get fixed once basic mod simplification is done

* need local
2024-07-22 18:37:26 -04:00
chenyu 24505199fb
UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] (#5642) 2024-07-22 17:09:40 -04:00
chenyu 97b116bb1d
UOp mul div simplification (#5637)
* UOp mul div simplification

* != 0 is fine
2024-07-22 16:14:12 -04:00
nimlgen 26fc4610a0
amd more accurate cache management (#5631)
* amd more accurate cache management

* fix amd

* add memory_barrier + copies tests

* transfer test as well

* linter
2024-07-22 19:07:01 +03:00
Vyacheslav Pachkov edc58e6b6e
hcq: remove duplicate allocation of kernel args by abstracting (#5633) 2024-07-22 18:29:41 +03:00
George Hotz dc21e63bd2
test: put conv in one reduce (#4441)
* test: put conv in one reduce

* put reduce at the end

* more expand

* generic, and that expand was breaking things

* ratio

* don't undo the expand

* arg 1

* strides

* warning, for resnet

* warning removed

* disable cast

* handle cast

* op

* err, that's right

* fixup

* fix that

* a test to play with

* add double_reduces

* working up to final reshape

* fold the last reshape

* moved to schedule

* fix axis

* ci, need to bring arange back

* FUSE_CONV_BW maybe

* valid in 3.9

* test_expand_reduce_is_folded_on_different_axes

* add FUSE_CONV_BW=1

* test_fold_batchnorm_backward

* test_sgd_4convs_fuse

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-22 12:16:13 +03:00
George Hotz 386fb5e7f8
folding without UNMUL (#5628)
* folding without UNMUL

* fix failures, index_collapse

* import ReduceOps

* test_arange_4096 isn't folding
2024-07-21 20:14:44 -07:00
George Hotz 7f5282b2f5
tests if the linearizer is generating dumb code (#5611)
* tests if the linearizer is generating dumb code

* push consts to the end

* sort adds

* sorted add and mul

* this better

* simple expand/contract

* no math contract/expand
2024-07-20 20:36:32 -07:00
George Hotz b399ccd6ef
BEAM bugfix, kernels dedup now (#5617)
* BEAM bugfix, kernels dedup now

* getenv is default
2024-07-20 19:43:50 -07:00
chenyu 92e7e65712
one more test case for symbolic mod mul (#5615) 2024-07-20 17:23:06 -04:00
qazal 3ab5fe4e1b
test argmax multireduce failure (#5609) 2024-07-20 21:33:03 +08:00
chenyu b991097d41
move UPat and PatternMatcher from uopgraph.py to uops.py (#5597)
* move UPat and PatternMatcher from uopgraph.py to uops.py

towards instant UOps rewrite on UOp.alu

[run_process_replay]

* fix imports
2024-07-19 19:28:24 -04:00
George Hotz 2e617ca59e
lowerer img index (#5592) 2024-07-19 14:22:02 -07:00
nimlgen b1782e3fef
hcq refactor signal into class (#5575)
* hcq refactor signal into class

* fix amd

* amd do not use amd_signal_t

* cleanup

* signal setter

* fix linter

* docs

* more docs + types

* fix types
2024-07-19 23:23:05 +03:00
George Hotz d0ab20a5e5
careful memory counting (with tests to specify behavior) (#5587) 2024-07-19 11:37:34 -07:00
chenyu 37dd233650
always reverse global dim (#5586)
* always reverse global dim

* one more test
2024-07-19 13:58:05 -04:00
George Hotz 10be05aae5
push contract through cast to fix test_float2_acc (try 2) (#5585)
* push contract through cast to fix test_float2_acc (try 2)

* contract push only on floats
2024-07-19 10:34:43 -07:00
George Hotz 51892c8fac
Revert "push contract through cast to fix test_float2_acc (#5581)" (#5583)
This reverts commit ddda9420be.
2024-07-19 09:44:30 -07:00
George Hotz ddda9420be
push contract through cast to fix test_float2_acc (#5581)
* push contract through cast to fix test_float2_acc

* no_vectorized_alu applies to cast too
2024-07-19 09:30:26 -07:00
chenyu 3f590c3b31
some limit_dims to limit global merging (#5489)
only supports merging dims in a way that does not surpass limit, no splitting yet
2024-07-19 12:17:46 -04:00
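Merge-only dim limiting as described above can be sketched greedily (a hypothetical sketch under the stated "merge, never split" constraint; not tinygrad's actual `limit_dims`):

```python
def limit_dims(dims: tuple, limit: int) -> tuple:
    """Greedily merge adjacent dims (right to left) while the merged
    size stays within limit; dims are never split."""
    out = [dims[-1]]
    for d in reversed(dims[:-1]):
        if d * out[0] <= limit: out[0] *= d
        else: out.insert(0, d)
    return tuple(out)

assert limit_dims((2, 3, 4), 12) == (2, 12)
assert limit_dims((2, 3, 4), 100) == (24,)
assert limit_dims((7, 5), 3) == (7, 5)  # nothing merges; a dim over the limit stays
```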
George Hotz 0ad87021e2
move acc to end (#5568)
* move acc to end

* confirmed pictures are the same

* relax that

* Update test_ops.py
2024-07-19 03:06:52 -07:00
George Hotz 2de82b8a5d
remove get_lazyop_info (#5570)
* don't use get_lazyop_info more

* keep that min

* no ptx for that test
2024-07-19 03:05:33 -07:00
nimlgen 9d7edc9269
hcq rename HCQCompat -> HCQ (#5577) 2024-07-19 11:34:17 +03:00
chenyu 2b2f8ad18c
failed example of float2 acc no longer applies (#5573)
* failed example of float2 acc no longer applies

* # noqa: E501
2024-07-19 02:40:04 -04:00
qazal e7a057c20f
retire replay_schedule (#5563) 2024-07-18 23:07:02 +03:00
qazal 50aba32ea8
hotfix: don't assert process replay in master. (#5562)
This is because https://github.com/tinygrad/tinygrad/actions/runs/9996754763/job/27631802686 ran exactly when master changed state, causing the diff to assert.
if [run_process_replay] is green pre-merge it's ok.
2024-07-18 22:05:00 +03:00
George Hotz 223d9283ee
fix float4 acc by moving contracts (#5559) 2024-07-18 11:30:16 -07:00
chenyu f5af98c450
failed test case that DEFINE_ACC no longer uses float4 (#5555)
* failed test case that DEFINE_ACC no longer uses float4

* line
2024-07-18 10:55:59 -07:00
George Hotz 923e0fe0b8
fix half4 folding (#5556) 2024-07-18 10:47:39 -07:00
chenyu 12e6771209
failed test case for unrolled half4 (#5552) 2024-07-18 13:05:52 -04:00
George Hotz d1a7279605
indexing fold with casted bool (#5551)
* cast bool is where

* universal transform is wrong
2024-07-18 10:02:29 -07:00
kormann 2c4add6844
pretty print lazy op per default (#5505)
* pretty lop

* min diff

* walrus

* fix

* min diff

* simplify

* pretty helper function

* ws

* pretty uop upat

* tests

* stricter tests

* test passes

* ws

* stronger upat test

* delete print_tree

* min diff

* stricter exp test

* fix merge

* stronger uops eval test

* +readable and deep upat test

* +readable and deep upat test

* sort inv fix

* fix

* revert allowed_len
2024-07-18 09:34:08 -07:00
qazal 0ad1672d5f
fuse indexing (LazyOp creation) (#5506)
* bring FUSE_AS_ONE_KERNEL back

* operands need reshape?

* fused but arange didnt fold

* something deeply wrong

* yay, fused

* derive broadcasts

* s/input/reduce_input

* _fixup_ones proved a point

* this is what it takes

* down to 3 required reshapes:

1. output_shape
2. the second reduce merge dims
3. remove dims for above reshape

* start real reshapes

* resolve shape in the edges pre lazyop

* outputs are the same shape

* rewrite1: just the reduce

* more correct

* fuse_as_one_kernel

* closer

* this passes

* dont rerun info

* dont need these

* not needed
2024-07-18 14:09:17 +03:00
George Hotz fa7e734b49
MetaOps.KERNEL (#5543) 2024-07-17 19:41:23 -07:00
George Hotz d3b098299d
add failing regression test for image (#5540)
* add failing regression test for image

* tg type

* simpler test

* don't realize image to image casts caused issue

* simple pad
2024-07-17 17:27:18 -07:00
qazal 61ee02e93d
start multireduce lowerer work (var/std) (#5537)
* multireduce no-opts works

* passed test_var_multireduce

* cleanup

* double reduce

* extra check for range_group

* more checking for range_groups

* cleaning up debug prints

* cleanup diff

* linters

* revert kernel changes

* these are uops toposort

---------

Co-authored-by: timmy <timmy0x@proton.me>
2024-07-17 23:43:46 +03:00
Francis Lam c4eb30a04c
test/test_linearizer_failures: add a new beautiful_mnist one (#5531)
* test/test_linearizer_failures: add a new beautiful_mnist one

this one is from a DEPTH=2 fuzz_linearizer search

* add GPU to test_failure_40

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-17 16:27:04 -04:00
qazal 0259d76183
use Context only in replaying Kernel [run_process_replay] (#5535) 2024-07-18 03:46:14 +08:00
George Hotz 1a68854766
PatternMatcher add (#5532)
* PatternMatcher add [run_process_replay]

* f4 dynamic

* test_failure_36 is fixed

* fix PTX
2024-07-17 12:44:42 -07:00
qazal a7706e05f9
option to [skip_process_replay] (#5533) 2024-07-17 22:30:46 +03:00
George Hotz 1242b302fa
expand UOps with rewrite rules (#5501)
* expand UOps with rewrite rules [run_process_replay]

* progress

* much closer

* close, way less bugs

* bunch of expander tests

* fix contract

* ops tests pass

* fix barrier

* mostly passing

* bitcast in expanded ops

* support more expand merges

* all tests pass maybe

* fix empty EXPAND

* fix LIN fuzzing

* add ALL_SAME assert

* all same

* all same work

* raise CompileError

* pass fuzz linearizer

* revert whitespace

* fix nv tensor core test

* fix mypy

* bug fix

* fuzzer passes

* put tests back

* expand arg to idx
2024-07-17 10:17:50 -07:00
George Hotz 158221b36b
expand tests from uop_expander [run_process_replay] (#5524)
* expand tests from uop_expander

* more changes from the branch
2024-07-17 09:22:36 -07:00
George Hotz 42c25cc961
fix fixup_ast (#5523)
* fix fixup_ast

* these lin failures are fixed
2024-07-17 08:52:21 -07:00
nimlgen dcd462860f
elf loader (#5508)
* elf loader

* cleanup

* cleaner

* cleaner

* fixes

* revert this

* fix div 0

* fix nv

* amd fix

* fix mockgpu

* amd better?

* restore relocs for <12.4

* linter

* this is fixed now

* revert this

* process cdefines as function

* cleaner

* align

* save lines

* revert this change
2024-07-17 17:09:34 +03:00
Francis Lam 2d53abb04a
test/external/fuzz_linearizer: fix for new AST changes (#5519)
* test/external/fuzz_linearizer: fix for new AST changes

also add beautiful_mnist failures

* add CLANG and LLVM to test_failure_35 failed_platforms

* fix test_linearizer_failure names
2024-07-17 00:08:07 -04:00
chenyu 6e405b0a2b
add 0d tensor to trunc/floor/ceil/round tests (#5512)
existing trunc test passes backward but its backward is incorrect in general. added tests that would fail
2024-07-16 16:48:25 -04:00
Tobias Fischer 87a2ef2bc2
Add Interpolate Function (#5482)
* add interpolate function

* fixed linter issue

* reduced sizes in test

---------

Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2024-07-16 09:44:01 -07:00
qazal 173064c69c
(re)start multireduce in codegen/* (#5391)
* test_var_multireduce

* run verify_lazyop

* test_var_multireduce

* assert lazyop

* add test_indexing_multireduce

* arange fuses (crude)

* note: extra reshape

* start readble

* test_arange_simple

* test_arange_expanded

* test_indexing_multireduce

* cleanups

* skip ptx

* skip nv and amd ci

* skip arange expanded too

* GPU=1 is slow too in CI
2024-07-16 14:20:48 +03:00
chenyu 07ff4b7d24
test_failure_33 ast that has UOps.UNMUL after linearize (#5504)
* test_failure_33 ast that has UOps.UNMUL after linearize

* smaller
2024-07-15 22:54:23 -04:00
chenyu 63990705b5
test kernel opts case for 4 local and 4 groups (#5499)
make sure local grouped dim is correct
2024-07-15 20:09:38 -04:00
Edward Wang 9a7d5a148e
move colorize_float to helpers.py (#5490)
* add colorize_float to helpers.py

* update references
2024-07-15 11:29:03 -07:00
qazal ac08f0eb00
reshape rawbufs in test_linearizer (#5492)
* reshape rawbufs in test_linearizer

* fix helper_linearizer_ast
2024-07-15 19:14:38 +03:00
qazal ae4cb7994e
run process replay with DEBUG=0 (#5491)
* process replay with DEBUG=0

* graceful shutdown

* use and
2024-07-15 16:30:57 +03:00
Tobias Fischer e219103677
Add Pad to Pooling (#5488) 2024-07-14 21:50:20 -07:00
Tobias Fischer 5849130cbb
gather negative dim fix (#5486) 2024-07-14 20:20:53 -04:00
qazal 3c378efcb6
process replay docs improvements (#5481)
* minor cleanups

* docs and logs

* shorter

* comma

* s/print/logging.info [run_process_replay]

* use logging.warn

* process name is noise

* revert lowerer change [run_process_replay]
2024-07-15 00:09:28 +03:00
chenyu 613a1dbeed
render lidx starting with 0 (#5478)
* render lidx starting with 0

changed from
```
  int gidx0 = gid.x; /* 4096 */
  int lidx4 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx5 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx6 = lid.z; /* 2 */
```
to
```
  int gidx0 = gid.x; /* 4096 */
  int lidx0 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx1 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx2 = lid.z; /* 2 */
```

the existing one started from pre-limited global dims which skip number if there are more than 3 global dims

* don't need start_dim

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-14 16:34:04 -04:00
qazal 671779f280
limit process replay diff to ~20% of kernels (#5480)

* add changed

* env var

* more early exit

* simpler?

* Revert "Merge branch 'lidx0' into process_replay_limit"

This reverts commit cbadcfa5e9b0489a2a9c1e0b6682db9f4f554ab8, reversing
changes made to fc9bf37ee70392a4170036698ced14621783b625.

* minor cleanup

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-14 23:10:08 +03:00
chenyu f8a47608cc
test dtype.min and dtype.max (#5479)
compared with np.iinfo for integer dtype
2024-07-14 15:31:37 -04:00
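The integer check can be sketched directly against `np.iinfo` (a minimal sketch using NumPy only; tinygrad's own `dtypes.min`/`dtypes.max` are not used here):

```python
import numpy as np

# two's-complement bounds for an n-bit integer type
def int_bounds(bits: int, signed: bool) -> tuple:
    if signed: return (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    return (0, (1 << bits) - 1)

for np_dtype, bits, signed in [(np.int8, 8, True), (np.uint8, 8, False), (np.int32, 32, True)]:
    info = np.iinfo(np_dtype)
    assert int_bounds(bits, signed) == (info.min, info.max)
```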
George Hotz a9f5a764dc
make BatchNorm work for 2D and 3D (#5477)
* make BatchNorm work for 2D and 3D

* beautiful mnist shouldn't use BatchNorm2d
2024-07-14 11:39:58 -07:00
chenyu e41ab66653
use is to compare types (#5476)
new rule in latest ruff
2024-07-14 14:26:41 -04:00
nimlgen 61822d1a14
nv fix timeline signal rollover on copy queue (#5473)
* hotfix: nv rollover to 32bits

* test both queues
2024-07-14 16:06:12 +03:00
nimlgen 8835d6c49a
cleanup nv/amd program (#5449)
* cleanup nv/amd program

* fix amd

* a bit cleaner

* ugh, typo

* linter

* fix nv

* tiny thing
2024-07-14 14:08:35 +03:00
qazal 0b3a34e3b1
vectorize folding [run_process_replay] (#5470)
* test_gep_vec_fold

* remove that

* fix process replay

* lint
2024-07-14 09:41:48 +03:00
chenyu 28972418c4
s/get_linearizer/get_kernel [run_process_replay] (#5467) 2024-07-13 20:32:22 -04:00
Francis Lata 0345577032
UNet3D dataloader shared memory fix (#5465)
* create separate SharedMemory between inputs and labels

* update path check for shared mem

* clean up unit test for dataset
2024-07-13 20:26:00 -04:00
George Hotz 942c58be90
BEAM_COMPARE=2 validates the correctness of BEAM kernels (#5458)
* beam compare 2

* found issue maybe

* correct, not fail

* full rand

* less numpy

* extra simplify doesn't fix it

* reorder

* no numpy

* check in reverse

* test new tensor behavior

* better error msg
2024-07-13 13:53:43 -07:00
qazal 487ceff825
hotfix: ASSERT_PROCESS_REPLAY sometimes doesn't exist (#5456) 2024-07-13 21:15:40 +03:00
qazal 40ec9410f9
simpler process replay (#5452)
* remove check_process_replay

* that can go to the top

* add assert back

* [run_process_replay]

* checkout code [run_process_replay]

* temp [run_process_replay]

* revert temp [run_process_replay]

* ahh this is why [run_process_replay]

* revert temp [run_process_replay]
2024-07-13 19:55:06 +03:00
qazal 23b907efbb
restore process replay runs by their id (#5453) 2024-07-13 19:32:34 +03:00
George Hotz e638b0084f
smaller multitensor resnet test (#5450)
* minor improvments to matcher speed [run_process_replay]

* oh, put that back

* make fake images smaller for resnet test
2024-07-13 07:31:28 -07:00
qazal bb1a9ebf78
run process replay in parallel (#5443) 2024-07-13 11:29:36 +03:00
chenyu 3ebf569f04
relax fuzz transcend math threshold a bit (#5442)
* relax fuzz transcend math threshold a bit

* fuzz more

* fuzz 50k
2024-07-13 03:31:21 -04:00
chenyu e398734890
fuzz test transcend math (#5383)
* fuzz test transcend math

found something wrong with float64 sin reduction

```
from tinygrad import Tensor, dtypes
import numpy as np
print(Tensor([39800.0], dtype=dtypes.float64).sin().numpy())
print(Tensor([39800.0], dtype=dtypes.float32).sin().numpy())
print(Tensor([39800.0], dtype=dtypes.float16).sin().numpy())
print(np.sin(np.array([39800.0], dtype=np.float64)))
print(np.sin(np.array([39800.0], dtype=np.float32)))
print(np.sin(np.array([39800.0], dtype=np.float16)))
```

```
CLANG=1 python test.py
[0.92785633]
[0.7428573]
[-0.7705]
[0.74285722]
[0.7428572]
[-0.7705]
```

* fix test

* abs

* skip
2024-07-13 01:54:52 -04:00
hikettei 3a7262d923
[Patch] Fixed an invalid value of fp64 xlog(DBL_MIN) (#5441)
* [Patch] Removed weird NaN Handling in xlog2 resulting in different output around 1e-203

* Patch: compare the value of xlog(x) using y, allowing x <= 1e-200

* mypy

* fuzzer tests for log2

* fix tests: use approximate dbl_min, fp64 fails at nv

* update: gradually increment the scale (if y is not inf)
2024-07-13 01:11:53 -04:00