* zero in shape start
* no assert for that
* if output size is 0, return without exec
* tweak
* strides
* reduce over non-zero
* shrink and expand
* fix import
* test_elementwise where
* cannot reshape from size 0 to size 1
* compiled backend reduce over 0
* zeros for numpy
* reduce over 0 and keepdim resulted in 1
* reduce empty set default values
* compare with same input
* pad test case
* cat test case
* torch does not support that?
* metal indirect command buffers
* sub-1ms gpt
* metal batch exec is good
* remove whitespace
* input_replace
* fix ci
* useResources
* very simple cacheallocator
* update_stats
* fix CI
* minor
* remove that from jit
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda works
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* Change linearizer to parse CAST
* Oneliner renders for cstyle and triton
* LLVM cast and ALU implementation
* pylint fixes
* cast in gep
* remove printbufs
* use cast for post-load ops
* get rid of parse_cast
* partially supported vectorized dtypes for initial dev
* render phi as the dtype
* Revert "partially supported vectorized dtypes for initial dev"
This reverts commit 1bf1a818a3350d74314806f00f5aaacb075bdf51.
* Revert "render phi as the dtype"
This reverts commit d08cb270b42266f06e4a78b199f9937cb9dc4711.
* reenable triton tests
* no vstore_half if dtype is already half
* upcast max
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* For CUDA, get current free space from device, and retry alloc failures
* type ignore for mypy
* add init to get free mem in cuda
* Move retry logic into common lib.
Fix typo in override _get_cur_free_space
* linter error fix in test file
* Don't catch all exceptions, as that would catch KeyboardInterrupt
* fix unintended line changes
* fix test ops
* decompose the err from test_ops
* skipTest skips the entire test, we don't want that
* handle cases with the same priority
* add int16 to torch map
* fuzz linearizer transformation
* no standard normal for fp16
* work
* Interpreted start
* CPU and TORCH work
* fix MemBuffer with same idx
* id for failed kernels
* no image and variable for Interpreted
* symbolic shape
* IMAGE only for GPU
* Interpreted almost all good
* cleanup
* fix bufs_from_lin
* zero size
* some failed examples
* just Exception
* just test not pass
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* refactor unit tests for dtypes
* add missing dtypes in llvmir.py and lib.py
* skip torch tests
* webgpu
* cleaner skips
* fix llvm bool casting issue using compare
* llvm 100% passing
* llvm segfault
* TEMP decrease timeout mins to 11
debug
* add bf16 to setup
* skip half tests in cuda cpu
* check for CUDACPU instead
* add int16 to triton dtypes
* u16 for triton
* remove debug - diff is still hard to read
* derive from base class TestDType
* enhance test_upcast and downcast by running on every possible version
* dummy commit to rerun the flaky test
* skip the correct tests for CUDA
* bf16 should be skipped in the common TestDType cases
* re-enable bf16
* more consistent structure
* tiny changes to is_dtype_supported 1
* tiny changes 2
add reason
* fuzz
* fuzzer p2
* run fp32 twice
* remove duplicate fp32 run
* clang: use stdbool
* skip triton on bool casts
* merge and resolve conflicts
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test