Commit Graph

270 Commits

Author SHA1 Message Date
George Hotz ded1b38b84
minor dtype cleanup [pr] (#7124)
* minor dtype cleanup [pr]

* use ptr() function
2024-10-17 17:41:23 +08:00
George Hotz a71bb09ec3
remove symbolic file [pr] (#7012) 2024-10-12 18:44:44 +08:00
qazal 20d3c2d113
unify UOps.SHAPETRACKER and UOps.SWIZZLE with UOps.VIEW (#6955)
* add UOps.VIEW

* update hardcoded asts

* update sops.gz
2024-10-09 02:00:17 +08:00
qazal 391497a311
schedule independent of Device [run_process_replay] (#6829) 2024-10-01 14:46:26 +08:00
George Hotz 50dd6bd951
move cmp tuple out [run_process_replay] (#6825)
* move cmp tuple out [run_process_replay]

* was unneeded
2024-10-01 10:38:28 +08:00
qazal e7fcbe1a4d
refactor test_linearizer correctness asserts (#6812) 2024-09-30 15:31:02 +08:00
qazal e0d8685c99
test_masked_upcast_wino check device buf_max (#6723) 2024-09-25 11:26:53 +08:00
George Hotz 7c38121280
load penalty (#6681)
* bias/bn loads after loops

* load penalty in fix_priority

* more generic test
2024-09-23 18:12:12 +08:00
qazal 982086f54c
UOps.VALID try 2 (#6623)
* make UOps.VALID compile

* fixable tests

* bufs dedup

* cleanup the CONST spec

* regenerate dataset with graph_rewrite

```py
def rewrite_const(const:UOp, st_src:UOp) -> UOp:
  st: ShapeTracker = st_src.arg
  return UOp(UOps.VALID, dtypes.bool, (st.to_uop(),)).where(UOp.const(const.dtype, const.arg), UOp.const(const.dtype, 0))
pm = PatternMatcher([(UPat(UOps.CONST, name="const", src=(UPat(UOps.SHAPETRACKER, name="st_src"),)), rewrite_const)])
```

* rm arg

* remove arg

* revert arg removal

This reverts commit 2c35c75c950075d38c9fb8572f14640fe8235f74.

* red test_pickle_define_var
2024-09-21 14:19:25 +08:00
George Hotz 42ba887daa
remove logic to vectorize reduces (#6536)
* remove logic to vectorize reduces

* fix tests
2024-09-16 14:04:48 +08:00
ignaciosica c447ec2190
Fix amx shape [run_process_replay] (#6524)
* fix amx shape (sz,sz,sz) -> (sz,sz,1)

* revert check
2024-09-16 09:49:55 +08:00
George Hotz bdd0c06f29
add void type to uop (#6471)
* unwrap_dtype maybe

* uopgraph stuff that hardcoded None

* test_ops passes

* dtypes.py fixups

* update test_linearizer and friends

* more ast updates

* test_beam and test_schedule too

* add void type to uop [run_process_replay]

* remove dumb casts

* start making it green

* more cast cleanups

* more cls methods to fix

* regenerate dataset

* split UOp and NOp const

* maybe that too

* fix docs

* update test_uop_symbolic

* test_verify_ast

* new sops with no diff

* meh, type_ignore is alright

* remove that assert

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-11 18:16:28 +08:00
chenyu e0d35e3657
update test_padto_sum_not_ok (#6450)
updated the setup as `exp() < -1` could be folded to False
2024-09-09 22:46:42 -04:00
qazal 935b4ddff6
use ast_const in test_linearizer asts [run_process_replay] (#6407) 2024-09-09 08:46:58 +08:00
George Hotz 86d34daac9
UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383) 2024-09-06 12:38:35 +08:00
George Hotz 66e7e51c79
Revert beam failure (#6376)
* Revert "late gate creation for STORE [run_process_replay] (#6373)"

This reverts commit c26744de9f.

* Revert "gated store rewrite to UOps.IF (#5976)"

This reverts commit 48061e8400.
2024-09-06 09:36:44 +08:00
ignaciosica c15506fc35
[WIP] amx support as TC (#5693)
* almost working with relu, even hackable... but acc size is wrong, fix needed

* upcast based on threads, change thread size to 4x4

* revert wrongfully commented assert

* fix tc load indexing

* modify for size 8

* fix bug for size 8

* Revert "fix bug for size 8"

This reverts commit cdb3f5df85b6116e8bef10214647a9201c400655.

* Revert "modify for size 8"

This reverts commit 3ef0904bd96291c7a3a351c702fba2905c196bcc.

* good kernel with changes in lowerer

* revert "good kernel with changes in lowerer"

This reverts commit 975e2b5a4ecfe475370e88ce9db78b2d42e4c4d4.

* good kernel for relu!

* refactor lowerer changes

* add amx context var to helper

* clean up amx flag

* improve lowerer changes readability

* improve check for amx

* revert lowerer if

* add float4 type rendering for clang

* add amx definitions

* enable indexing for clang if amx

* working amx example, wrong because of dims

* almost works for float 16, need to spot using double load in amx

* cleaner render_kernel

* revert chages in simple_matmul and delete env

* add new var upcast_offset to get_optimized_ast

* change axis for axes

* invert if in rendering phi

* fix some bugs

* fix linearizer tests

* fix vec/get pat for amx

* remove clang tc if amx is disabled

* add ops_python support

* refactor into one complementary function in ops_python

* add job for EMUALTE_AMX

* improve checking for AMX in UPCAST and TC extra ops

* fix lint issue

* commit before refactor into autocontained AMX

* start refactor by removing special rendering for AMX

* all ready for amx handcoded kernel

* working poc, most straightforward amx support

* avoid local opts for tc if amx

* fix merge bugs

* skip test for clang

* skip tc hand-coded opts if amx

* remove hardcoded ops_python values

* remove hardcoded sizes for amx kernel

* fix ops_python bug where dim was hard-coded

* change contract for vectorize

* working without changes in lowerer

* revert changes in gep rendering

* fix ops_python

* modify comment

* skip test if clang for different type accumulation

* move rename and bug for seperate pr

* fix wrong path for test

* addmm not implemented in torch for cpu

* change struct for vector; equally slow but cleaner

* revert modified test

* simply wmma rendering

* minor change

* noqa:501

* add length 16 for AMX

* fix vectorized half issue

* fix error

* remove comment

* change set for dedup

* split test of tensor_core_extra_ops so that cases that dont require locals run for AMX

* add amx reference

* load acc into amx registers

* fix dtype rendering and remove noqa

* moved tests change into another pr

* add real AMX job for CI and fix bug

* fix ops_python bug

* fix test class

* remove real AMX tests and fix uops_stats test

* remove wrong test

* acc folding

* hotfix: bug

* fix float4 tests for amx

* hack for fixing flops counting

* hotfix: mypy

* add flop counts test for amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_unaligned_load_amx

* nits tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-06 09:01:10 +08:00
Ian Paul 48061e8400
gated store rewrite to UOps.IF (#5976)
* Core change to gate stores in IFs

* Updates to cstyle renderer to handle IFs around STOREs

* Make uops asserts happy

* Add tests and fix newly broken tests

* make ruff happy

* make mypy happy

* Simplify renderer to have all gated stores use IF

* Revert some changes

* Make test_where_fold happy

* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out

* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE

* Re-change broken test

* Make ifs be grouped together

* get non-merged IFs working. ALl tests pass except grouping related ifs together

* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp

* Changes to uopgraph

* Simplify graph rewrite logic

* Changes to get test_padto_where_multireduce working

* Simplify uops.store renderer

* Make test_padto_where_multireduce pass but now other tests fail

* Clean up uopgraph from scrach work

* Ignore sudo IF srcs when rendering

* Attempt to fix llvm tests

* rm comment

* reduce lines

* Add line to make mypy happy :(

* llvmir fix pt 1

* Mods after rebasing to master

* Fix llvmir

* Fix ptx tests

* Fix other ptx tests

* Move changes from uops.py to ops.py

* rm uops.py

* Fix TestGateStoreRewrite tests

* Get multireduce tests working

* reset to remote branch

* Fix linearizer tests

* uop_graph test patch

* Add comment to create_gate

* hotfix: uncomment those tests

* Attempt to fix ptx tests by including whitespace inside if block

* Patch from remote tinybox. Tests passing here

* Min changes to get some ptx tests passsing

* Changes after rebase

* Exclude ifs and endifs from ptx

* IF conditional branching within ptx

* Save lines on delete_redundant_gates

* Simplify merge_gates

* rm noqa

* Remove unnecessary checks when merging gates

* Fix ops error msg

* Smarter check for if/endif in llvmir

* simplify delete redundant gates to only have 2 returns

* spacing

* Smarter check at beginning of merge_gates

* patches from comments

* Remove need for merge_gates

* include proper srcs in IF from the get-go

* test expand ifs dumb will result in 4 ifs, not 1 now

* Make tests happy

* Fix uops stats

* rm merge_gates method. Will add back in separate PR

* Spacing

* cleaner error msg

* Fix uops rendering when expanding. test_failure_43

* patch tests

* undo changes in delete_redundant_gates

* process replay attempt

* re-intro deletion of redundant gates

* fix addition of gates when they get nested in stores and loads

* patch tests

* smarter init of IF srcs when adding gate to STORE

* make ruff happy

* Resp to comment

* include all src[2]'s srcs in IF for gated store

* add reference of the storing value to the gate's src

* minor patch after rebasing

* change ptx renderer

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-06 01:05:30 +08:00
gswangg 3cf507ae7f
remove extra.ops and LazyOp support from Kernel (#6267)
* remove extra.ops and BufferOps

* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal ccb05d8baa
fixup neg tests [run_process_replay] (#6268) 2024-08-24 16:35:43 +03:00
qazal bcb2f1caa3
init REDUCE_AXIS with BinaryOps (#6256)
* REDUCE_AXIS arg with BinaryOps

* more work in kernel.py
fixup sops.gz

* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
chenyu 3fc8203475
remove NEG from handwritten ast in tests (#6234)
* remove NEG from handwritten ast in tests

* test_linearizer_failures
2024-08-22 09:06:59 -04:00
gswangg c74b318458
migrate test_linearizer.py to UOp AST, pt. 2 (#6228) 2024-08-21 22:16:11 +03:00
qazal 3b8cc5a3e0
more multireduce tests prep for neg removal [run_process_replay] (#6220) 2024-08-21 12:45:24 +03:00
qazal f03e5a4b3b
test_multireduce const has a shape (#6218) 2024-08-21 11:02:45 +03:00
gswangg 0e6f057eae
migrate test_linearizer.py to UOP AST (pt. 1) (#6150)
* migrate test_multioutput to UOP AST

* inline buf declarations

* migrate test_multireduce to UOp AST

* update test_mid_dim_multireduce to UOp AST

* update test_triple_multireduce with UOp AST

* make global definitions more concise

* update test_double_reduce_multireduce with UOp AST

* update test_multireduce_with_parallel with UOp AST

* update test_multiout_multireduce to UOp AST

* make gidx style consistent across updated tests

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-20 10:02:20 +03:00
chenyu b36a7273c6
RUF018 assignment-in-assert [run_process_replay] (#6172)
assertion should not have side effect or `-O` breaks.

initially just wanted to fix the one in rearrange, but it also made some long lines less long
2024-08-19 00:34:52 -04:00
Timmy e3d14d1ccc
Lowerer Multireduce Grouping (#6097)
* grouping changes to codegen

* linters + tests

* fix identical store issue on PTX

* comment in grouping multireduce tests

* cleaning up diff

* cleaning up diff

* comments

* linters

* hotfix: dont change kernels

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-18 19:57:51 +03:00
George Hotz 5048066e79
st_arg, never -1 [run_process_replay] (#6128) 2024-08-16 22:46:56 -07:00
George Hotz 74ee9febec
remove iter from uopgraph (#6110)
* remove iter from uopgraph

* linearize returns uops

* fix tests

* linearize in linearize

* tests fix

* touchup

* test failures
2024-08-16 15:58:29 -07:00
qazal 28c75bf2a6
merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal c23d44c779
AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
CaltropHungerton 38fb1e14a2
Intel XMX Tensor Core Support (#5622)
* fixed xmx demo

* i think i'm invoking the DPAS but it's slow

* compiler build arg to stop register spilling, indicated where to fix flop counter

* don't mind this

* do NOT mind me

* do not mind me

* do not view

* i will add bf16 later

* in process of figuring out tc fields

* we figured out the fields!!!

* added check for cl device vendor, added seperate IntelRenderer

* remove tc thread_local_aliases

* cleaning debris before draft pr

* edits for linter

* deduping and checking device extensions

* i will find more line reductions in other places

* before merge upstream

* double grf size in compiler to fix register spilling (bandaid), device checking changes

* tc python emulation

* fixed emulation

* tests for emulated intel tensor core

* TC=0, 1 working on upstream, fixed perf

* test

* debris

* check for specialized cl device when we canonicalize device

* bf16 support, tc=3 test added

* address tests

* revert half2 loads on intel tc, cleanup

* linter

* fold_expanded revert

* lint, whitespace fix

* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too

* make line shorter, no need for noqa E501

* removed device intel

* fix python emulation

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
qazal 4d38fec8c1
rename lazyops to parents [run_process_replay] (#6091) 2024-08-15 17:27:32 +03:00
ignaciosica 164ca5632e
split tensor core tests (#6041) 2024-08-12 09:42:02 -04:00
Timmy a00994b423
Lowerer Multireduce Uopgraph (#6007)
* uopgraph changes

* fixing for non-reducing ranges

* multireduce tests

* linters

* linters

* removing comments

* removing arg[1]

* linters

* prettier

* linters

* more linters

* use any instead of intersection
2024-08-12 15:16:07 +03:00
Timmy 8c99bdab08
More Multireduce Tests (#5968)
* multireduce tests

* linters

* more linters

* more linters

* seeing how it works with parallel
2024-08-08 22:04:08 +03:00
wozeparrot 97d708252a
remove realize from threefry (#5969) 2024-08-07 15:08:49 -07:00
George Hotz 1417cc8df1
can reenable that test now (#5914) 2024-08-06 13:38:21 -07:00
ignaciosica 81ae9fadc8
Float4 support for CLANG (#5915)
* float4 support on clang

* skip linearizer tests that require locals

* add aligned attribute
2024-08-06 07:50:12 -07:00
George Hotz 159ac06b5b
remove unused reduce rules + improve unparented (#5908)
* remove unused reduce rules [run_process_replay]

* this work

* those tests are meaningless now
2024-08-04 18:18:27 -07:00
George Hotz 877e0b4ba0
define global only has the index [run_process_replay] (#5869)
* define global only has the index [run_process_replay]

* fix that linearizer test

* fix ptx

* stupid ptx fix
2024-08-01 19:01:15 -07:00
chenyu 02f0be03f2
tests on UOp div negative number and arange opts (#5825) 2024-07-30 20:06:57 -04:00
George Hotz 17a2f74412
new style load/store folder (#5784)
* remove old index reorder

* new style folder

* works better

* dedup

* one failure

* this is fine now...

* expander_rewrite

* images broken, but all else should work

* cleanups

* make tests work with old

* fix images

* cleanups + bugfix

* minor fixes

* fix gated store folding

* flip gate_creator and expander

* fix gated store

* remove unneeded rules

* lines getting close

* line count good
2024-07-30 13:17:20 -07:00
George Hotz 4df46eac67
clean up tensor cores [run_process_replay] (#5736)
* clean up tensor cores [run_process_replay]

* remove tuple(wmma_sz), self.opts.device

* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
chenyu 16c27ae400
update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
George Hotz 386fb5e7f8
folding without UNMUL (#5628)
* folding without UNMUL

* fix failures, index_collapse

* import ReduceOps

* test_arange_4096 isn't folding
2024-07-21 20:14:44 -07:00
qazal 3ab5fe4e1b
test argmax multireduce failure (#5609) 2024-07-20 21:33:03 +08:00
chenyu 37dd233650
always reverse global dim (#5586)
* always reverse global dim

* one more test
2024-07-19 13:58:05 -04:00
George Hotz 10be05aae5
push contract through cast to fix test_float2_acc (try 2) (#5585)
* push contract through cast to fix test_float2_acc (try 2)

* contract push only on floats
2024-07-19 10:34:43 -07:00