* unwrap_dtype maybe
* uopgraph stuff that hardcoded None
* test_ops passes
* dtypes.py fixups
* update test_linearizer and friends
* more ast updates
* test_beam and test_schedule too
* add void type to uop [run_process_replay]
* remove dumb casts
* start making it green
* more cast cleanups
* more cls methods to fix
* regenerate dataset
* split UOp and NOp const
* maybe that too
* fix docs
* update test_uop_symbolic
* test_verify_ast
* new sops with no diff
* meh, type_ignore is alright
* remove that assert
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* Revert "late gate creation for STORE [run_process_replay] (#6373)"
This reverts commit c26744de9f.
* Revert "gated store rewrite to UOps.IF (#5976)"
This reverts commit 48061e8400.
* almost working with relu, even hackable... but acc size is wrong, fix needed
* upcast based on threads, change thread size to 4x4
* revert wrongfully commented assert
* fix tc load indexing
* modify for size 8
* fix bug for size 8
* Revert "fix bug for size 8"
This reverts commit cdb3f5df85b6116e8bef10214647a9201c400655.
* Revert "modify for size 8"
This reverts commit 3ef0904bd96291c7a3a351c702fba2905c196bcc.
* good kernel with changes in lowerer
* revert "good kernel with changes in lowerer"
This reverts commit 975e2b5a4ecfe475370e88ce9db78b2d42e4c4d4.
* good kernel for relu!
* refactor lowerer changes
* add amx context var to helper
* clean up amx flag
* improve lowerer changes readability
* improve check for amx
* revert lowerer if
* add float4 type rendering for clang
* add amx definitions
* enable indexing for clang if amx
* working amx example, wrong because of dims
* almost works for float 16, need to spot using double load in amx
* cleaner render_kernel
* revert changes in simple_matmul and delete env
* add new var upcast_offset to get_optimized_ast
* change axis for axes
* invert if in rendering phi
* fix some bugs
* fix linearizer tests
* fix vec/get pat for amx
* remove clang tc if amx is disabled
* add ops_python support
* refactor into one complementary function in ops_python
* add job for EMULATE_AMX
* improve checking for AMX in UPCAST and TC extra ops
* fix lint issue
* commit before refactor into self-contained AMX
* start refactor by removing special rendering for AMX
* all ready for amx handcoded kernel
* working poc, most straightforward amx support
* avoid local opts for tc if amx
* fix merge bugs
* skip test for clang
* skip tc hand-coded opts if amx
* remove hardcoded ops_python values
* remove hardcoded sizes for amx kernel
* fix ops_python bug where dim was hard-coded
* change contract for vectorize
* working without changes in lowerer
* revert changes in gep rendering
* fix ops_python
* modify comment
* skip test if clang for different type accumulation
* move rename and bug for separate pr
* fix wrong path for test
* addmm not implemented in torch for cpu
* change struct for vector; equally slow but cleaner
* revert modified test
* simplify wmma rendering
* minor change
* noqa:501
* add length 16 for AMX
* fix vectorized half issue
* fix error
* remove comment
* change set for dedup
* split test of tensor_core_extra_ops so that cases that dont require locals run for AMX
* add amx reference
* load acc into amx registers
* fix dtype rendering and remove noqa
* moved tests change into another pr
* add real AMX job for CI and fix bug
* fix ops_python bug
* fix test class
* remove real AMX tests and fix uops_stats test
* remove wrong test
* acc folding
* hotfix: bug
* fix float4 tests for amx
* hack for fixing flops counting
* hotfix: mypy
* add flop counts test for amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_unaligned_load_amx
* nits tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Core change to gate stores in IFs
* Updates to cstyle renderer to handle IFs around STOREs
* Make uops asserts happy
* Add tests and fix newly broken tests
* make ruff happy
* make mypy happy
* Simplify renderer to have all gated stores use IF
* Revert some changes
* Make test_where_fold happy
* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out
* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE
* Re-change broken test
* Make ifs be grouped together
* get non-merged IFs working. All tests pass except grouping related ifs together
* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp
* Changes to uopgraph
* Simplify graph rewrite logic
* Changes to get test_padto_where_multireduce working
* Simplify uops.store renderer
* Make test_padto_where_multireduce pass but now other tests fail
* Clean up uopgraph from scratch work
* Ignore pseudo IF srcs when rendering
* Attempt to fix llvm tests
* rm comment
* reduce lines
* Add line to make mypy happy :(
* llvmir fix pt 1
* Mods after rebasing to master
* Fix llvmir
* Fix ptx tests
* Fix other ptx tests
* Move changes from uops.py to ops.py
* rm uops.py
* Fix TestGateStoreRewrite tests
* Get multireduce tests working
* reset to remote branch
* Fix linearizer tests
* uop_graph test patch
* Add comment to create_gate
* hotfix: uncomment those tests
* Attempt to fix ptx tests by including whitespace inside if block
* Patch from remote tinybox. Tests passing here
* Min changes to get some ptx tests passing
* Changes after rebase
* Exclude ifs and endifs from ptx
* IF conditional branching within ptx
* Save lines on delete_redundant_gates
* Simplify merge_gates
* rm noqa
* Remove unnecessary checks when merging gates
* Fix ops error msg
* Smarter check for if/endif in llvmir
* simplify delete redundant gates to only have 2 returns
* spacing
* Smarter check at beginning of merge_gates
* patches from comments
* Remove need for merge_gates
* include proper srcs in IF from the get-go
* test expand ifs dumb will result in 4 ifs, not 1 now
* Make tests happy
* Fix uops stats
* rm merge_gates method. Will add back in separate PR
* Spacing
* cleaner error msg
* Fix uops rendering when expanding. test_failure_43
* patch tests
* undo changes in delete_redundant_gates
* process replay attempt
* re-intro deletion of redundant gates
* fix addition of gates when they get nested in stores and loads
* patch tests
* smarter init of IF srcs when adding gate to STORE
* make ruff happy
* Resp to comment
* include all src[2]'s srcs in IF for gated store
* add reference of the storing value to the gate's src
* minor patch after rebasing
* change ptx renderer
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.
* fix benchmark
* remove extra dedup
* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* remove old index reorder
* new style folder
* works better
* dedup
* one failure
* this is fine now...
* expander_rewrite
* images broken, but all else should work
* cleanups
* make tests work with old
* fix images
* cleanups + bugfix
* minor fixes
* fix gated store folding
* flip gate_creator and expander
* fix gated store
* remove unneeded rules
* lines getting close
* line count good