* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* loadop buffer on cpu
* works for GPU
* sort of working
* has bugs
* gpu tests pass
* fix some tests
* fix tensor cores
* fix test linearizer
* fix symbolic
* fix has_variable_shape
* non symbolic size
* disable weird test
* simple cache fix
* fix custom function
* fix kopt
* cleanups
* a bit broken on the assign
* contig check
* only buffer
* need that order
* idx
* dedup buffers
* hmm, bugfix
* fix tensor cores
* opts device
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* Context and Timing can now be used as decorators
* Using Timing decorator in quickstart.md
The time formatting is better, and it is a useful tool to learn.
Old: Time: 3.5260659999912605
New: Time: 3526.14 ms
* Updated env_vars documentation for Context
* Added test for Context decorator
* Put new import on same line as others
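
A minimal sketch of how a timing helper can work both as a context manager and as a decorator, printing milliseconds in the "New" format shown above. This is an illustration built on the standard library's `contextlib.ContextDecorator`, not the project's actual implementation; the `prefix` parameter, the `elapsed_ms` attribute, and the `work` function are assumptions made for the example.

```python
import contextlib
import time

class Timing(contextlib.ContextDecorator):
    """Hypothetical stand-in: times a block or a function and prints milliseconds."""
    def __init__(self, prefix="Time: "):
        self.prefix = prefix      # assumed parameter name, for illustration only
        self.elapsed_ms = None    # assumed attribute, filled in on exit

    def __enter__(self):
        self.start = time.perf_counter_ns()
        return self

    def __exit__(self, *exc):
        # Convert nanoseconds to milliseconds and print with two decimals.
        self.elapsed_ms = (time.perf_counter_ns() - self.start) * 1e-6
        print(f"{self.prefix}{self.elapsed_ms:.2f} ms")
        return False  # never swallow exceptions raised inside the block

# Decorator form: every call to work() is timed.
@Timing("Time: ")
def work():
    return sum(range(100_000))

result = work()

# Context-manager form: only the enclosed block is timed.
with Timing("Block: ") as t:
    sum(range(1000))
```

Inheriting from `ContextDecorator` means the same `__enter__`/`__exit__` pair serves both forms, so there is no separate wrapper to maintain.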
* new version
* fix abstractions
* try remove test
* Revert "try remove test"
This reverts commit 2fc18a9f8ed180540baf73d32b568262709822f1.
* assert_allclose
* minimize the test
* minimize the test
* minimize the test
* minimize the test
* Revert "minimize the test"
This reverts commit e0c092959636109f745d1c8a73f2db90c75fe3c1.
* Revert "minimize the test"
This reverts commit 88240551b13403b21a81765043d5736103a49293.
* Revert "minimize the test"
This reverts commit 78328a7ce27328c8bf9a325ae017cc2a4d98f65b.
* Revert "minimize the test"
This reverts commit 989523fded4319b13db047e45ad8c35c861a36aa.
* skip test inside body
* oops
* oops
* Rename FusedOps to TernaryOps
* Support ternary broadcast
* Add where llop and mlop
* Make where op work in cstyle codegen
* Don't skip test_inf_where
* Add backward path to where op
* Use bool in cstyle codegen
* Add LLVM where op
* Add numpy where op
* Add torch where op
* Simplify where mlop
* Update documentation
* Forgot a rename
* Merged relevant changes from PR #1195 onto PR #1196
* Add test to cover changes to linearizer.ast_parse for WHERE op
Without this METAL will try to use ternary op on float4 and fail
* Make where op work in wgsl backend
* Allow ternary ops to be merged
* Make mypy happy
---------
Co-authored-by: Francis Lam <flam@alum.mit.edu>
* Rename in files
* Move files
* Moved to extra/datasets as suggested
* Changes to files
* Fixed stupid mistake
---------
Co-authored-by: terafo <terafo@protonmail.com>
* fixed division by zero for fast operations
* made et closer to 0
* replace POW llop with SQRT
* updated mlops to swap SQRT and POW llops
* updated hlops to swap POW and SQRT
* added sqrt llop to cpu runtime
* added sqrt llop to cstyle codegen
* added POW llop to llvm ir codegen
* added SQRT llop to torch runtime
* moved pow from mlops to hlops
* found a better way to do reverse pow
* fixed indentation
* added SQRT llop to triton
* update docs to match new llops
* removed POW operator from assembly codegen
* added sqrt and rsqrt to pow hlop
* rewrote pow function in tensor.py
* Adjust tolerance
* Adjust for adamw
* Reduce for Adam too
* removed accidental leftover code
* removed all of accidental code
* added rsqrt test
* removed pow from mlops again
it was added back when resolving merge conflicts
---------
Co-authored-by: Jacky Lee <jla524@sfu.ca>
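
A rough sketch of the idea behind routing `pow` through `sqrt` and `rsqrt`: exponents of 0.5 and -0.5 take the square-root path, and other exponents fall back to `exp(y * log(x))`. This uses plain Python floats and the `math` module rather than tensors; `pow_sketch` is a hypothetical name, the fallback is only valid for positive `x`, and none of this is taken from the actual `tensor.py` code.

```python
import math

def pow_sketch(x: float, y: float) -> float:
    """Illustrative only: assumes x > 0 for the general fallback."""
    if y == 0:
        return 1.0
    if y == 0.5:
        return math.sqrt(x)           # sqrt special case
    if y == -0.5:
        return 1.0 / math.sqrt(x)     # rsqrt special case
    # General case for positive bases: x**y == exp(y * log(x)).
    return math.exp(y * math.log(x))

print(pow_sketch(4.0, 0.5))
print(pow_sketch(4.0, -0.5))
print(pow_sketch(2.0, 3.0))
```

Negative and zero bases would need separate handling, which this sketch deliberately leaves out.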