* unwrap_dtype maybe
* uopgraph stuff that hardcoded None
* test_ops passes
* dtypes.py fixups
* update test_linearizer and friends
* more ast updates
* test_beam and test_schedule too
* add void type to uop [run_process_replay]
* remove dumb casts
* start making it green
* more cast cleanups
* more cls methods to fix
* regenerate dataset
* split UOp and NOp const
* maybe that too
* fix docs
* update test_uop_symbolic
* test_verify_ast
* new sops with no diff
* meh, type_ignore is alright
* remove that assert
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.
* fix benchmark
* remove extra dedup
* add input to unit tests [run_process_replay]
* add setup [run_process_replay]
* run tests [run_process_replay]
* add cuda and amd [run_process_replay]
* run everything but BEAM=2 [run_process_replay]
* skip export_model [run_process_replay]
* fix amd CI
* add concurrency back
* fix(beam): GlobalCounters kernel count increasing when clearing l2
* fix: removed the NOSTATS var by adding do_update_stats to Tensor.realize()
* test(search): regression test for _time_program, should not increment kernel_count
* fix(test_search): unused var and now properly checking when l2 is cleared
* fix(test_search): added assert message
* fix(test_search): now testing public beam api for kcount
* ruff fixes
---------
Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
* mockgpu nv
* works
* comment that out
* fix merge
* setup gpuocelot
* install packages
* not run all of them
* passes
* fix ci
* almost
* should pass
* linter
* linter 2
* try this?
* ugh, not supported
* ci
* remove ticket from description
* better descs
* refactor to Program class
* switch to Program
* fix tests
* smaller diff
* self.p
* more tests
* fix metal test
* tests
* fix openpilot
* move that to linearizer
* p.launchdims
* search: fix edge cases on screening potential ops
won't change correctness, but will save a little Python time by
properly deduplicating potential actions
* check for de-duplication instead of exact valid actions
* refactor long line
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are really broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None