* hopeful impl for Tensor.einsum
* satisfy mypy by having less typing. :(
* a few simple tests
* even more tests
* permute tests
* xfails for improper usage
* fix LLVM test fail
* use argfix
* more helpful error message on shape mismatch
The current yolov3 example is broken with the current implementation of fetch in the helpers. I was tempted to fix the helpers instead, but that could just as well have broken other examples.
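A minimal sketch of the kind of call the einsum work above is tested with, assuming a Tensor.einsum(formula, *tensors) entry point and checking against numpy (names and tolerances are illustrative, not the exact test code):

```python
import numpy as np
from tinygrad.tensor import Tensor

# matrix multiply expressed as an einsum, compared against numpy's reference
a = np.random.rand(3, 4).astype(np.float32)
b = np.random.rand(4, 5).astype(np.float32)
out = Tensor.einsum("ij,jk->ik", Tensor(a), Tensor(b)).numpy()
np.testing.assert_allclose(out, np.einsum("ij,jk->ik", a, b), atol=1e-5)
```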
* add bf16 test support
this model takes me almost a minute to download though:
https://huggingface.co/TinyPixel/Llama-2-7B-bf16-sharded/resolve/main/pytorch_model-00001-of-00014.bin?download=true: 100%|█████████████████████████████| 981M/981M [00:40<00:00, 24.2MB/s]
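For context on what bf16 support involves: a bfloat16 value is just the top 16 bits of a float32, so widening is a shift plus a reinterpret. A rough numpy illustration (not the code in this change):

```python
import numpy as np

def bf16_to_fp32(raw: np.ndarray) -> np.ndarray:
    # bf16 keeps the sign, exponent, and top 7 mantissa bits of fp32,
    # so shifting the 16-bit pattern into the high half recovers a float32
    assert raw.dtype == np.uint16
    return (raw.astype(np.uint32) << 16).view(np.float32)

bits = np.array([0x3F80, 0x4000, 0xC040], dtype=np.uint16)  # 1.0, 2.0, -3.0
print(bf16_to_fp32(bits))  # [ 1.  2. -3.]
```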
* ensure we first load if it is bitcast to avoid taking the address of an rvalue
* tiny bf16 in the cloud
skip GPU
* should skip torch
lint
* Revert "ensure we first load if it is bitcast to avoid taking the address of an rvalue"
This reverts commit b86a28ab84bc1173764b2d480218e8de41a32390.
* break the kernel
* skip LLVM and GPU in CI
* skip CUDA
* universal test cast
* disable div
* midcast fixup
* add 64-bit types
* hack maximum
* use Metal precise::sin instead of default
This is because the default sin function uses single-precision math: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf#page=164
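For a feel of the error scale involved: at single precision the argument itself only carries about 7 significant digits, so even a perfect sin of the rounded argument drifts from a float64 reference, and a fast single-precision sin adds its own error on top. An illustrative numpy check (not the CI test):

```python
import numpy as np

x = 12345.678
# drift caused by rounding the argument to float32 alone
drift = abs(np.sin(np.float64(np.float32(x))) - np.sin(x))
print(drift)  # on the order of 1e-4 for arguments of this size
```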
* LLVM code_for_op support for var_dtype
* comment out maximum for now with a TODO explaining it
* Revert "hack maximum"
This reverts commit d170048c5fc029eab41f8472dd53f44c448370a1.
* make the comment more specific
* slightly more forgiving
* ok does this fail in all backends?
* weird, it's only Metal CI
* add graph
* skip sin of nan for CUDACPU
This is only happening in the CUDACPU runtime and not CUDA itself. https://github.com/tinygrad/tinygrad/actions/runs/7128973726/job/19412000385#step:16:36
* METAL and CUDACPU behave differently in overflows with numpy running on CI
* that skip is wrong
* skip fp16 tests on LLVM similar to test_dtype
original commit that skipped LLVM in CI: 1826ff6b89
* remove all of sin from CUDACPU
* limit range of values in CUDACPU and METAL CI
* Revert "use Metal precise::sin instead of default"
This reverts commit d960094d4a22fe69a9b6cb23ff7cd88e86a3c675.
* change atol and rtol for Metal sin
* METAL CI is more imprecise
* cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* enable test_index and test_advancedindex with pretty diff
* removed contig
* created set_ helper function
* comment change
* del empty line
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* jit graph split
* update
* that's fine, not all buffers are there now
* use logarithmic tho, seems good
* no keep it simple
* add test
* simplify
* split graph when jit item cannot be graphed
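The splitting idea, very roughly (hypothetical helper names; not tinygrad's actual JIT code): consecutive items that can be captured in a graph get batched, and anything that can't runs as its own piece.

```python
# illustrative sketch only: batch consecutive graphable JIT items and keep
# non-graphable items as standalone pieces executed outside the graph
def split_for_graph(jit_items, is_graphable):
    pieces, batch = [], []
    for item in jit_items:
        if is_graphable(item):
            batch.append(item)
        else:
            if batch:
                pieces.append(batch)
                batch = []
            pieces.append([item])
    if batch:
        pieces.append(batch)
    return pieces
```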
* dtypes alu test
* those types don't exist in torch
* floats
* more tests
* disable those
* a couple unary tests
* skip float16 tests in CI for GPU
* fix LLVM bool add True+True=1+1=2 which truncates to False in native LLVM
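The behaviour being matched, illustrated with numpy: bool addition saturates to True, while a native 1-bit LLVM add wraps 1+1 back around to 0, i.e. False.

```python
import numpy as np

print(np.array(True) + np.array(True))  # True: numpy saturates bool addition
print((1 + 1) % 2)                      # 0: what a 1-bit integer add wraps to
```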
* remove hardcoded float for LLVM ALU fns
* less sensitive atol for fp32: 1e-10 is flaky and sometimes failed even with the merge commit for non-fp32 math reverted; nothing has changed in our kernels for fp32.
* return on overflows
* fix CUDA exp2
* compute results of op regardless of bounds in a python backend
* skip fp16 in GPU and CUDACPU
* fuzz a smaller range in the float_midcast_int32 test
I sampled this and we overflow ~70% of the time.
Because numpy behaves differently on different devices for overflows, and Metal seems to do the same, I'm opting to eliminate the non-determinism here.
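A small numpy illustration of the non-determinism being avoided (the exact value you get back is platform-dependent, which is the point):

```python
import numpy as np

x = np.float32(3e9)          # well above int32 max (2147483647)
print(x.astype(np.int32))    # overflowing cast: result differs across platforms
```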
* remove CUDA exp2 overload it's already there now
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* bitcast renderers
* fast llama load
* make it one kernel
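Roughly what "one kernel" means here, sketched with the Tensor API (method and dtype names are assumptions, not the exact code in this change): the bf16 widening from the earlier numpy example can be expressed on-device as a cast, a multiply by 2^16, and a bitcast, so it fuses into a single kernel instead of a numpy round-trip.

```python
from tinygrad.tensor import Tensor
from tinygrad.helpers import dtypes

# illustrative only: widen a buffer of raw bf16 bit patterns (as uint16)
# to float32 entirely on-device
def bf16_bits_to_fp32(raw_u16: Tensor) -> Tensor:
    return (raw_u16.cast(dtypes.uint32) * (1 << 16)).bitcast(dtypes.float32)
```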
* regression testing p1: re-enable test_dtype for all backends
fix GPU
* regression testing p2: fuzz all possible cases against numpy
remove hardcoded tests since the fuzzer covers them
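The shape of the fuzzing, roughly (dtype list, import paths, and helpers are assumptions about the current layout, not the exact test): cast small samples through every dtype pair and require agreement with numpy's cast.

```python
import itertools
import numpy as np
from tinygrad.tensor import Tensor
from tinygrad.helpers import dtypes

dts = [dtypes.uint8, dtypes.int8, dtypes.int32, dtypes.float32]
for src, dst in itertools.product(dts, dts):
    a = np.array([0, 1, 2, 3], dtype=src.np)
    out = Tensor(a).cast(dst).numpy()
    np.testing.assert_equal(out, a.astype(dst.np))
```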
* define ushort
* fix indent, probably need flake8 back for CI to catch
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>