* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.
* fix benchmark
* remove extra dedup
* st to uops function
* lowerer
* uops reduce
* uops reduce
* acc_number correct
* reduce unroll
* complete unroll
* do upcasts
* handle multioutput
* define_accs
* fix valid
* get grouped dims
* revert lin
* minor
* fixup_ast
* group for reduce
* group works now
* all forwards pass
* all ops tests pass
* fix clang
* mypy
* lil cleanups, no image yet
* ugh, variables everywhere
* bugfix
* counters and name fix
* use symbolic, not uops
* cleanups
* Fix tests
* linearizer tests
* expands
* float4 expand load
* tests pass
* woooo, float4 test
* test ops works again
* one more lin test
* more lin tests
* bypass
* fix tests
* something like this
* const in defineacc
* uops get_reduce_acc
* move around
* allow consts in the LOAD/STORE
* each axis should only appear once, 21 failures
* 16 failures
* fix some image
* optional float4
* onnx tests
* gate the stores
* add reorder
* fix terrible skip function
* tc work
* opt add/mul merge
* fix float4 tests
* tiny tweak, 9 failing
* 7 test failures
* start tc, but i don't think this will work
* progress on tensorcores
* note
* fix ops tests
* closer on tc
* weeee...one tensor core works
* still works, more generic
* large WMMA works
* tc test passes
* use WMMA as accumulator
* basic tc tests passing
* small gemm padded works
* 4 failures
* 3 tests failing
* super barrier
* now two tests failing
* one test failing
* cleanups, add reduce to UopGraph
* remove the linearizer
* remove unused
* lil cleanups
* Lowerer everywhere
* remove test that doesn't exist now
* image indexing
* llvm fix
* fix metal
* fix image
* fix images
* might fix ptx
* fix image type mismatch
* more tests pass
* CAST -> VECTORIZE
* forgot that one
* fix TestOps.test_flip_eye_crash
* locals shouldn't be image dtype
* change fewer files
* test fix
* fix recursive expands
* touches
* MULACC support in python
* delete unneeded
* alu before contract
* bug fixes
* tests
* no var multireduce
* simpler tc
* metal works in new style
* working on AMD and METAL
* fix amd
* shot in the dark, fix amd
* something for CUDA
* CUDA WORKS from the docs
* comment
* correct merge
* cleanups + ptx fix + get_reduce_acc
* local alias isn't used anymore
* add store sanity check
* fix for AMD
* cleanups and single expand pass
* more correct with acc_cache
* tests should pass
* block on WMMA
* tests pass
* merge contract and reduce
* contractor fixes issue
* multicontract
* pre expand wmma (same as a reduce)
* expand wmma and only take one
* all expands
* comments and whitespace
* single pass rewrite
* claude cleanups
* claude cleanups
* skip those tests
* restrict that to ints
* comment
* asserts i don't expect to fail do fail
* simplest...rewrite...ever
* simplest...rewrite...ever
* add that rule back
* tests pass?
* only collapse reduce loops
* second SHL/SHR arg must be 4 bytes
* fix verify
* no SHL/SHR in ptx
* put that back
* skip them in PTX...bad tests
* PoC faster wino compile by catting consts across data expand dim
* fix fusions
* faster + golf it
* noqa 501
* implicit broadcast
* Revert "implicit broadcast"
This reverts commit 5915a9083d045ec1e6be84dcb492333325d48666.
* shorter
* shorter
* oops
* 216 upcasts is probably fine
* wino kernel count test
* test winograd number of sts
* specify device for apply_matrix mat elements