Commit Graph

5525 Commits

gswangg df44a4e861
Make vectorization of CONST explicit (#5322)
* remove test_const_vectorize_fold

* remove const folding UPat for VECTORIZE

* refactor cstyle render_const

* remove calls to dtype.scalar() in render_const

* add assert

* add vectorized const to UOp.const

* add UPat GEP-VECTORIZE-CONST -> CONST

* render_vectorize for DEFINE_ACC in cstyle

* add back missing render_cast in render_const

* generate vectorized consts as UOps for DEFINE_ACC

* update asserts for DEFINE_ACC with VECTORIZE src

* add UPats for PHI with VECTORIZE src

* use prev rendered vectorize in DEFINE_ACC render

* update DEFINE_ACC in python runtime

* update vectorized DEFINE_ACC in PTXRenderer

* rebase DEFINE_ACC changes on lowerer

* verbose rewrite of bad UPats

* simplify UOps.CONST implementation in ops_python

* update sum_collapse UPats for DEFINE_ACC-VECTORIZE

* revert linearizer to TOT

* fix DEFINE_ACC implementation in ops_python

* simplify DEFINE_ACC in cstyle

* Fix linter error

* support VECTORIZE in fold gated load/store UPat

* support VECTORIZE in other fold gated load UPats

* rewrite VECTORIZE in UPat for no input DEFINE_ACC

* simplify DEFINE_ACC render in cstyle

* make VECTORIZE rules more concise

* add more vectorize fold tests

* inline VECTORIZE-CONSTs in cstyle render

* revert VECTORIZE/GEP rule refactor

* revert cstyle render_const refactor

* inline VECTORIZE-CONSTs in cstyle render

* implicitly vectorized const rendering -> explicit

* WMMA VECTORIZE CONST process replay hacks

* VECTORIZE CONST NAN process_replay hacks

* more VECTORIZE CONST NAN hacks

* cleanup process_replay hacks

* isnan() -> not isfinite() cstyle VECTORIZE CONST

* tweak isnan and isfinite checks VECTORIZE CONST

* tweak for positive vs negative infinity VECTORIZE CONST

* add assert to PTX CONST render

* process_replay VECTORIZE CONST render parity for PTX STORE

* vmin/vmax for VECTORIZE'd CONST

* update WMMA folding rules

* add tests for WMMA VECTORIZE fold

* hack for cstyle half4 CONST zero process_replay parity

* revert PTX backend changes

* add back minimal DEFINE_ACC PTX change

* remove cstyle process_replay hacks

* remove dead code in PTX CONST render

* cleanup vmin/vmax logic for VECTORIZE'd CONSTs

* update vectorize fold tests to use DEFINE_VAR

* fix long line formatting in test

* remove unwanted merge artifact

* more vmin/vmax cleanup

* remove unnecessary asserts

* yet more vmin/vmax cleanup

* get rid of explicit VECTORIZE CONST logic in _min_max

* reuse CONST instead of creating a new one

* remove unneeded cast

* handle DType correctly in sconst

* improve readability of tests

* save a line

* save another line

* tuplize pats in src

* remove GEP-VECTORIZE pats

* add vec +0 fold

* HACK: fold only vec8 +0

* remove vectorized ALU fold hack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-08 20:59:05 +03:00
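
The GEP-VECTORIZE-CONST rule named above collapses an indexed vector of constants back into a single CONST. A toy model of that rewrite, with illustrative names rather than tinygrad's actual UOp/UPat API:

```python
from dataclasses import dataclass

# toy stand-in for a UOp node; the real tinygrad class has more machinery
@dataclass(frozen=True)
class Node:
    op: str                 # "CONST", "VECTORIZE", or "GEP"
    src: tuple = ()
    arg: object = None      # const value, or the GEP lane index

def fold_gep_vectorize_const(u: Node) -> Node:
    # GEP(VECTORIZE(CONST c0, ..., CONST cn), i) -> CONST ci
    if u.op == "GEP" and u.src and u.src[0].op == "VECTORIZE":
        lane = u.src[0].src[u.arg]
        if lane.op == "CONST": return lane
    return u

vec = Node("VECTORIZE", tuple(Node("CONST", arg=1.0) for _ in range(4)))
assert fold_gep_vectorize_const(Node("GEP", (vec,), 2)).arg == 1.0
```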
chenyu 62c77a2831
trim const in UOp div_folding (#5982)
simplify `(4*x+4*y+7)//16` to `(x+y+1)//4`.
fixed `GPU=1 UOP_IS_SYMBOLIC=1 IMAGE=2 python -m pytest test/test_ops.py -k conv`
2024-08-08 12:49:05 -04:00
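
The trim works because the non-const terms share the factor 4, so the 7 can be lowered to the nearest multiple of 4 before numerator and divisor are divided through. A brute-force check of the quoted fold, assuming non-negative operands (ranges are illustrative):

```python
# (4*x+4*y+7)//16 == (x+y+1)//4: trim 7 down to 4, then cancel the factor 4
for x in range(64):
    for y in range(64):
        assert (4*x + 4*y + 7) // 16 == (x + y + 1) // 4
```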
qazal e6d41b0ce7
hotfix: adjust test_backward_pass_diamond_model thresholds (#5981) 2024-08-09 00:20:53 +08:00
gswangg 08d22066ee
simplify ALU vmin==vmax fold (#5962) 2024-08-08 11:29:16 -04:00
Elias Wahl c9b4602854
no load in INITMLPERF (#5957) 2024-08-08 11:28:24 -04:00
nimlgen 183c4c91a3
fix non-jitted transfers in profile (#5980)
* fix transfers in profile

* fix linter

* sync to be sure everything is recorded
2024-08-08 17:58:08 +03:00
nimlgen 76eca0d27e
nv fix host mem mappings (#5979) 2024-08-08 17:03:44 +03:00
nimlgen e89eff11a6
amd raise when not supported arch (#5978) 2024-08-08 14:46:14 +03:00
George Hotz bc55c8a30e
pmatmul example + GB/s bugfix [run_process_replay] (#5974)
* pmatmul example + bugfix

* improve pmatmul

* Update real_pmatmul.py
2024-08-07 22:32:11 -07:00
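
For reference, the bandwidth arithmetic behind such a GB/s report is just bytes moved over wall time; a minimal sketch (function name and the example numbers are illustrative, not from the PR):

```python
# GB/s = bytes transferred / (seconds * 1e9)
def gb_per_s(nbytes: int, seconds: float) -> float:
    return nbytes / seconds / 1e9

# e.g. two 4096x4096 float32 inputs plus one output touched in 10 ms
assert round(gb_per_s(3 * 4096 * 4096 * 4, 0.010), 1) == 20.1
```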
George Hotz c5baa3d66b hotfix: don't run OOM test in CI 2024-08-07 22:19:29 -07:00
chenyu 859d0e4709
UOp simplify `(x+c0)*c1 -> x*c1+c0*c1` (#5973) 2024-08-07 21:25:22 -04:00
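
The rewrite is plain distributivity, so it is exact over the integers; a quick exhaustive check on small values (ranges are illustrative):

```python
# (x+c0)*c1 == x*c1 + c0*c1 for all integers
for x in range(-8, 8):
    for c0 in range(-4, 4):
        for c1 in range(-4, 4):
            assert (x + c0) * c1 == x * c1 + c0 * c1
```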
wozeparrot 97d708252a
remove realize from threefry (#5969) 2024-08-07 15:08:49 -07:00
George Hotz bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
wozeparrot d3e427c8d9
fix sqlite3 locks (#5971) 2024-08-07 14:38:19 -07:00
nimlgen cc37c99ae4
tiny hcq touchups (#5964) 2024-08-07 21:03:20 +03:00
nimlgen 8d8704af2d
fix amd exec_update for locals (#5966) 2024-08-07 21:02:56 +03:00
ignaciosica 0ddcd005f5
fix priority width and give more space for src (#5509) 2024-08-07 10:48:18 -07:00
tyoc213 0c4e9dbe71
retrieve defined opencl error codes (#5792) 2024-08-07 10:46:24 -07:00
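
Mapping raw OpenCL status integers back to their CL_* names makes failures readable; a minimal table-based sketch (only a few codes shown, values as defined in the OpenCL headers):

```python
# a few CL_* status codes; the full table in the headers is much longer
CL_ERRORS = {0: "CL_SUCCESS", -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
             -5: "CL_OUT_OF_RESOURCES", -30: "CL_INVALID_VALUE"}

def cl_errstr(code: int) -> str:
    return CL_ERRORS.get(code, f"UNKNOWN_ERROR({code})")

assert cl_errstr(-5) == "CL_OUT_OF_RESOURCES"
```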
ignaciosica 4b48f166ec
Refactor render_kernel for NV [run_process_replay] (#5965)
* start working on it

* blind test with process replay

* remove noqa:E501 refactoring make_cuda_dtype

* refactor even more but with known bug

* fix known bug with duplicated includes

* working locally

* add noqa:E501


* remove comment and move map

* fix qaz comments

* remove comment

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-07 20:36:04 +03:00
qazal d6f4a61c42
graph LBScheduleItem [run_process_replay] (#5960)
* add toposort key to LBScheduleItem

* use dedup

* graph LBScheduleItem

* make that comment beautiful again

* diff_schedule utils

* update fuzz_schedule
2024-08-07 19:59:11 +03:00
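
A toposort key over schedule items follows the standard Kahn's-algorithm shape; a generic sketch over a {node: children} dict (not the PR's actual code):

```python
from collections import deque

# Kahn's algorithm: every node is emitted before all of its children
def toposort(graph: dict) -> list:
    indeg: dict = {k: 0 for k in graph}
    for children in graph.values():
        for c in children: indeg[c] = indeg.get(c, 0) + 1
    queue, order = deque(k for k, d in indeg.items() if d == 0), []
    while queue:
        n = queue.popleft(); order.append(n)
        for c in graph.get(n, ()):
            indeg[c] -= 1
            if indeg[c] == 0: queue.append(c)
    return order

assert toposort({"a": ["b", "c"], "b": ["c"], "c": []}) == ["a", "b", "c"]
```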
George Hotz 0a8668cf30 improvements to docs 2024-08-07 09:57:24 -07:00
qazal 7677361d90
test pushing through different expands in 1 kernel (#5963)
* test pushing through different expands in 1 kernel

* realize eye

* back to test_example_matmul
2024-08-07 19:33:18 +03:00
nimlgen 564a352194
nv unify _gpu_free (#5961)
* nv unify _gpu_free

* revert this
2024-08-07 18:18:17 +03:00
Eitan Turok 39c8c9c00a
Add docs (#5942)
* init commit

* finish writing

* add to docs

* fix docs

* fix typo

* delete new line

* rename to tensor properties

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-07 07:38:51 -07:00
qazal 39dda3d042
rename prescheduled items to lsi [run_process_replay] (#5959)
* rename to lsi

* fuzz_schedule more typings

* rename fuzz_schedule
2024-08-07 14:31:50 +03:00
qazal 728b7e189e
diff_schedule tests [run_process_replay] (#5958)
* diff_schedule tests [run_process_replay]

* ok to run serial
2024-08-07 13:50:27 +03:00
chenyu a7163b80d8
lower test_transcendental fuzz test threshold for sin float64 (#5956) 2024-08-07 02:04:37 -04:00
chenyu fa3a36e576
fancier UOp div gcd folding (#5953)
combine and cancel the remaining const based on the gcd of the other terms, as SumNode does.
2024-08-07 02:04:25 -04:00
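
When the remaining const itself shares the terms' gcd, it cancels along with them; an analogous hand-made example (not from the PR) checked by brute force over non-negative operands:

```python
# (6*x+6*y+3)//9 -> (2*x+2*y+1)//3: gcd(6,6,3)=3 cancels against the divisor 9
for x in range(32):
    for y in range(32):
        assert (6*x + 6*y + 3) // 9 == (2*x + 2*y + 1) // 3
```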
chenyu aa7fd7ef74
Use `(-self).lt(-x+1)` for `UOp.ge` (#5955)
matched symbolic and fixed UOP_IS_SYMBOLIC=1 arange folding
2024-08-07 01:31:27 -04:00
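
The rewrite rests on an integer identity: negating both sides of a >= b gives -a <= -b, and over the integers -a <= -b is the same as -a < -b + 1. A quick check:

```python
# integer identity behind expressing ge in terms of lt
for a in range(-16, 16):
    for b in range(-16, 16):
        assert (a >= b) == (-a < -b + 1)
```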
George Hotz 3d445039c2 hotfix: 8800 lines for AMX+intel tc 2024-08-06 17:50:26 -07:00
George Hotz 658d58784b
embedding doesn't cast (#5952)
* embedding doesn't cast

* test the right thing

* too much annoyance with that test
2024-08-06 17:49:14 -07:00
wozeparrot 30d0cb2a82
fix: fix transcendental flakiness on exp float with 9.96875 (#5951) 2024-08-06 17:32:13 -07:00
George Hotz 3a0515ea22 hotfix: process_replay/diff_schedule.py to LBScheduleItem 2024-08-06 17:01:05 -07:00
chenyu aee737bd9e
divide by gcd in UOp div folding (#5949)
* divide by gcd in UOp div folding

`(6x+6y)//16 -> (3x+3y)//8` etc
simpler version

* only factor out const

* don't apply for unsigned

* don't need that if

* space
2024-08-06 20:00:57 -04:00
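
The quoted fold is exact because numerator terms and divisor shrink by their common factor together; a brute-force check of the example over non-negative operands (ranges are illustrative):

```python
# (6*x+6*y)//16 == (3*x+3*y)//8: divide everything, including 16, by gcd 2
for x in range(64):
    for y in range(64):
        assert (6*x + 6*y) // 16 == (3*x + 3*y) // 8
```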
George Hotz 6d1fdcfce2
don't reduce the same thing in a vector (#5950)
* don't reduce the same thing over and over

* cleaner way to write it that doesn't loop
2024-08-06 16:59:15 -07:00
qazal d5d7f4e7b8
more TestIndexing correctness asserts [run_process_replay] (#5948)
* use torch in test_mnist_val

* more asserts
2024-08-07 01:50:42 +03:00
qazal 7f062929e8
start all cached scheduler functions with buf, st [run_process_replay] (#5946)
* start all cached scheduler functions with buf, st

- [x] _recursive_group
- [x] _recursive_lazyop
- [x] _recurse_reduceops

* use dict [run_process_replay]
2024-08-07 01:24:22 +03:00
chenyu 794796256c
UOp.const_factor [run_process_replay] (#5945)
* UOp.const_factor [run_process_replay]

simplify mod and div folding

* test does not work now
2024-08-06 18:18:29 -04:00
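
For a sum of scaled terms plus a constant, the largest factor dividing the whole expression is the gcd of all the coefficients; a standalone sketch of that idea (toy representation, not tinygrad's actual UOp.const_factor signature; needs Python 3.9+ for variadic gcd):

```python
from math import gcd

# toy const_factor over an expression modeled as (coefficients, constant):
# the largest c such that expr == c * something
def const_factor(coeffs: list, const: int) -> int:
    return max(gcd(*coeffs, const), 1) if coeffs else max(abs(const), 1)

assert const_factor([4, 4], 6) == 2   # 4x+4y+6 == 2*(2x+2y+3)
```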
Elias Wahl c9862e17d4
MLPERF BERT submission scripts (#5931)
* green

* red

* fix benchmark

* log

* count train samples

* oops. 4.0 -> 4.1

* note to todo

* no pillow
2024-08-06 18:09:18 -04:00
George Hotz 73d4d51845
add LBScheduleItem type [run_process_replay] (#5944)
* add LBScheduleItem type [run_process_replay]

* minor cleanups

* fix

* fix fuzz tests

* add group cache type
2024-08-06 14:49:40 -07:00
chenyu 1dab75ae37
clean up mlperf dataloader import (#5940)
use tinygrad tqdm for dataset, and PIL Image is only needed for resnet
2024-08-06 17:10:08 -04:00
qazal 7b6496f2e6
fix the reduceops cache breaking beautiful_mnist (#5938)
* fix the reduceops cache breaking beautiful_mnist

* test_sparse_categorical_crossentropy_simple

* starting tests

* atol from test_nn

* test_sparse_categorical_crossentropy_alt

* dont use torch
2024-08-07 00:02:54 +03:00
George Hotz 1417cc8df1
can reenable that test now (#5914) 2024-08-06 13:38:21 -07:00
George Hotz 75154d7ae2
add some types to the scheduler [run_process_replay] (#5941)
* add some types to the scheduler [run_process_replay]

* set -> dedup
2024-08-06 12:23:54 -07:00
George Hotz e077bc7baf
move memory planner to realize (#5937) 2024-08-06 10:41:29 -07:00
chenyu 489575c3be
more UOp sum div with gcd tests (#5936)
* more UOp sum div with gcd tests

* one more
2024-08-06 12:50:10 -04:00
ignaciosica 81ae9fadc8
Float4 support for CLANG (#5915)
* float4 support on clang

* skip linearizer tests that require locals

* add aligned attribute
2024-08-06 07:50:12 -07:00
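
Clang exposes fixed-width vectors through its vector_size extension, which pairs naturally with an alignment attribute; a sketch of the kind of typedef a C-style renderer might emit (generator name and exact string are illustrative):

```python
# illustrative generator for clang vector typedefs with an aligned attribute,
# e.g. vec_typedef("float", 4) -> a 16-byte-aligned float4
def vec_typedef(base: str, count: int, size: int = 4) -> str:
    n = count * size
    return f"typedef {base} {base}{count} __attribute__((aligned({n}),vector_size({n})));"

assert "float4" in vec_typedef("float", 4)
```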
qazal a7db4c3ee9
show timings for DIFF_ARANGE=1 (#5935)
* show timings for DIFF_ARANGE=1

* always with DEBUG=2
2024-08-06 17:20:38 +03:00
qazal 102a8c184b
diff fused arange schedules with ARANGE_DIFF=1 (#5934)
* diff fused arange schedules with ARANGE_DIFF=1

* better llama diff
2024-08-06 16:52:26 +03:00
qazal f7761245aa
save_schedule pre toposort [run_process_replay] (#5933) 2024-08-06 15:10:01 +03:00