* generic rendering of half and bf16
hotfix
* fix uops + regression test
* fix the test for metal's half4
* uop.uop fixup
* mypy with --strict-equality, fix ops_gpu
* set metal fast math default to 0 (disabled)
It's a correctness fix because we use inf and nan. Let's see how much slower it is.
* skip failed onnx tests
* tmp DISABLE_COMPILER_CACHE=1 in metal benchmark
* Revert "tmp DISABLE_COMPILER_CACHE=1 in metal benchmark"
This reverts commit 22267df38099acbf949aefdb6a5911ebc3a31984.
* env var METAL_FAST_MATH to disable fastmath for metal
Use this to test the impact of fast math. You might need to disable the compiler cache with DISABLE_COMPILER_CACHE=1.
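A minimal sketch of how such a flag could be wired up, assuming tinygrad's `getenv` helper and the PyObjC Metal bindings used by the Metal backend; the exact call site and default in ops_metal may differ:

```python
# Illustrative only: reading METAL_FAST_MATH and applying it to the Metal
# compiler options. getenv reads an int-valued env var with a default.
import Metal  # pyobjc Metal framework, macOS only
from tinygrad.helpers import getenv

def make_compile_options():
  options = Metal.MTLCompileOptions.new()
  # default shown as 0 (disabled) per the note above; flip to 1 to measure the
  # speed impact of fast math, ideally together with DISABLE_COMPILER_CACHE=1.
  options.setFastMathEnabled_(bool(getenv("METAL_FAST_MATH", 0)))
  return options
```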
* onnx test that fails with fast math enabled:
METAL_FAST_MATH=0 DISABLE_COMPILER_CACHE=1 NOOPT=1 python -m pytest -n=auto test/external/external_test_onnx_backend.py -k test_MaxPool3d_stride_padding_cpu
Fully UNROLLing the first_reduce should not change the number of
local_dims.
Fully UNROLLing a GROUP dim should reduce the number of
group_for_reduces by one.
Also changed group_for_reduces to be a count as the axis number
isn't used anywhere (they are always the first reduce dims).
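A toy, self-contained sketch of these invariants; the class below is a stand-in for the real kernel/Linearizer bookkeeping, not its actual API:

```python
# Toy model: unrolling a reduce dim removes it from the reduce dims without
# touching local_dims; unrolling a GROUP dim also drops one grouped reduce,
# so group_for_reduces (now a plain count) goes down by one.
class ToyKernel:
  def __init__(self, local_dims: int, reduce_dims: int, group_for_reduces: int):
    self.local_dims, self.reduce_dims, self.group_for_reduces = local_dims, reduce_dims, group_for_reduces

  def unroll(self, is_group_dim: bool):
    self.reduce_dims -= 1
    if is_group_dim: self.group_for_reduces -= 1

k = ToyKernel(local_dims=2, reduce_dims=2, group_for_reduces=1)
k.unroll(is_group_dim=False)      # fully UNROLL the first_reduce
assert k.local_dims == 2          # number of local_dims unchanged
k.unroll(is_group_dim=True)       # fully UNROLL a GROUP dim
assert k.group_for_reduces == 0   # one fewer grouped reduce
```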
* ops_python: add HIP tensor core mock and refactor METAL (see sketch below)
* Add tests to CI
* add DEBUG=2 to full tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
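A hedged sketch of what the tensor core mock boils down to: a warp-wide fused multiply-accumulate on a small fixed tile, done with plain numpy. The tile size and the `wmma_mock` name are illustrative; the real HIP/METAL fragment layouts differ.

```python
import numpy as np

def wmma_mock(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
  # D = A @ B + C on an 8x8 tile (illustrative size)
  assert a.shape == b.shape == c.shape == (8, 8)
  # accumulate in float32 even when the inputs are half, like the hardware does
  return a.astype(np.float32) @ b.astype(np.float32) + c.astype(np.float32)

a = np.random.rand(8, 8).astype(np.float16)
b = np.random.rand(8, 8).astype(np.float16)
out = wmma_mock(a, b, np.zeros((8, 8), dtype=np.float32))
np.testing.assert_allclose(out, a.astype(np.float32) @ b.astype(np.float32), rtol=1e-3)
```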
* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp
* test_gemm passes
* metal gemm test pass
* works on big gemm
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style
* add gated load support to PYTHON (see sketch below)
* out of bounds error message
* cleaner
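A hedged sketch of gated load semantics in the PYTHON emulator; the `gated_load` name and the error message are illustrative, not the actual ops_python code:

```python
# Each lane loads memory[idx] only when its gate is true; gated-off lanes take
# the alternative value and never touch the (possibly invalid) address.
from typing import List, Sequence

def gated_load(memory: Sequence[float], idxs: Sequence[int],
               gates: Sequence[bool], alts: Sequence[float]) -> List[float]:
  out: List[float] = []
  for idx, gate, alt in zip(idxs, gates, alts):
    if not gate:
      out.append(alt)
      continue
    if not 0 <= idx < len(memory):
      raise IndexError(f"out of bounds load: idx {idx} not in [0, {len(memory)})")
    out.append(memory[idx])
  return out

# lanes 0 and 2 load; lanes 1 and 3 are gated off, so their bad indices are fine
print(gated_load([1., 2., 3., 4.], [0, 99, 2, -1], [True, False, True, False], [0.]*4))
# -> [1.0, 0.0, 3.0, 0.0]
```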
Run on TORCH since it's the fastest backend on CI.
Caught a bug in multinomial, and updated the behavior of fancy indexing and gather to move the indices Tensor to the same device as self (see the sketch below).
Left some long lines in conv2d and wino; no E501 violations elsewhere in tensor.
Three areas still need general readability improvements: getitem and gather, conv2d and wino, and pow.
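A hedged usage sketch of the indices-device change; the device name below is an example and depends on your setup:

```python
from tinygrad.tensor import Tensor

# Indices passed via fancy indexing (or gather) no longer have to live on the
# same device as the tensor; they are moved to self.device first.
x = Tensor([10, 20, 30, 40])            # on the default device
idx = Tensor([2, 0], device="CPU")      # indices on a different device (example name)
print(x[idx].numpy())                   # indices moved to x.device -> [30 10]
```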
Fix var when correction is too big; it seems to only work when the input size is 0, though.
torch can output -inf in var when correction is too big, which does not make sense.
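A toy numpy illustration of the degenerate case; the exact behavior chosen by the fix may differ, and returning nan below is just one reasonable choice:

```python
# var = sum((x - mean)^2) / (N - correction); when correction >= N the
# denominator is <= 0, so any finite-looking answer (like -inf) is misleading.
import numpy as np

def var_with_correction(x: np.ndarray, correction: int) -> float:
  denom = x.size - correction
  if denom <= 0: return float("nan")   # not enough samples to apply the correction
  return float(((x - x.mean()) ** 2).sum() / denom)

print(var_with_correction(np.array([1.0, 2.0, 3.0]), correction=1))  # 1.0 (sample variance)
print(var_with_correction(np.array([1.0, 2.0, 3.0]), correction=5))  # nan
```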
* fix Tensor.mean to compute the mean correctly when 0-length axes are selected (example below)
* add a regression test
* rename the sum variable to sum_t to avoid a conflict with the built-in function
* refactor Tensor.mean to have fewer lines
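A hedged regression-style example for the 0-length-axis case; the expected value is whatever the fix settles on, and a real test would compare against numpy's result for the same reduction:

```python
from tinygrad.tensor import Tensor

t = Tensor.ones(2, 0, 3)     # the middle axis has length 0
out = t.mean(axis=1)         # mean over zero elements used to be computed wrongly
print(out.shape)             # (2, 3)
print(out.numpy())           # compare against np.ones((2, 0, 3)).mean(axis=1)
```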
* skip the MULACC opt if all the src buffers of the mul op are const buffers
* add noqa directive for long test
* unskip the MULACC opt
* ensure that a_axes at least includes the summation axes in order to perform np.einsum correctly (see sketch below)
* add regression test for mulacc op
* compute a_slices using a_axes
* refactor the helper function to retrieve axes and slices for nonzero strides as well as summation axes
* include a regression test that exercises the behaviour indirectly
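A hedged numpy illustration of the einsum constraint the fix enforces; variable names like a_axes follow the commit messages, and this is not the backend code itself:

```python
# The subscripts built from a_axes must mention every summation axis, or
# np.einsum cannot express the intended contraction.
import numpy as np

a = np.random.rand(2, 3)           # subscripts "ik"; k is the summation axis
b = np.random.rand(2, 3)

ok = np.einsum("ik,ik->i", a, b)   # == (a * b).sum(axis=1)
assert np.allclose(ok, (a * b).sum(axis=1))

try:
  np.einsum("i,ik->i", a, b)       # a_axes dropped the summation axis k
except ValueError as e:
  print("einsum rejects it:", e)   # operand has more dims than its subscripts cover
```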