It's recommended to use __getnewargs__ to supply the args of classes that use __new__ when unpickling.
It's preferred because it does not change the __new__ behavior itself.
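A minimal sketch of the pattern (Point is a made-up class; the pickle behavior is standard):

```python
import pickle

class Point(tuple):
    def __new__(cls, x, y):
        return super().__new__(cls, (x, y))
    def __getnewargs__(self):
        # pickle calls this to learn which args to pass to __new__ when
        # unpickling; without it, loads() calls Point.__new__(Point) with
        # no args and raises a TypeError
        return (self[0], self[1])

p = pickle.loads(pickle.dumps(Point(1, 2)))
assert isinstance(p, Point) and p == (1, 2)
```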
* do not truncate float64 precision
* use l suffix to try to avoid overload confusion
* long line, ruff bloats the function otherwise
* fmt
* remove long double suffix (l), it's sufficient to have the float32 (f) suffix to avoid function overload ambiguity; add test showcasing rtol=1e-12 precision increase, the test fails without the renderer changes
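Roughly what that precision difference looks like (numpy standing in for the rendered C code):

```python
import numpy as np

# the same computation in float64 vs routed through float32, mimicking a
# constant that was rendered with an f suffix in an otherwise f64 kernel
x64 = np.float64(1.0) / np.float64(3.0)
x32 = np.float32(1.0) / np.float32(3.0)
print(abs(x64 - np.float64(x32)))  # ~9.9e-9, nowhere near rtol=1e-12
```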
* use more reasonable test values, same as test_int_to_float_unary_func
* disable test for CUDACPU, it does not support half and segfaults on some operations per the dtypes_alu test
* disable test for HIP, renderer does not support f64 precision
* do not use noqa E501, break up condition
* remove cpu and torch backends
* don't copy to cpu
* use clang instead of cpu
* multitensor gathers on the first device
* clang is cpu + use default
* fixup
* bugfix
* remove float cast
* cast scalars to the correct value at creation time
* cast scalar in the correct place
* wrong, use y_dtype
* make consts have a unique cache key (see the sketch after this list)
* add cast_scalar back
* test_load_cache_const_bufs
* add bool dtype
* test_const_dtype
* fix linters
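A toy sketch of why the dtype has to be part of the const cache key (illustrative only, not tinygrad's actual key layout):

```python
# without the dtype in the key, 2 (int) and 2.0 (float) collide, since
# 2 == 2.0 and hash(2) == hash(2.0) in Python
cache = {("CONST", 2.0): "f32_buf"}
cache[("CONST", 2)] = "i32_buf"   # silently overwrites the float entry
assert len(cache) == 1            # false cache hit

# including the dtype keeps the entries distinct
cache = {("CONST", 2.0, "float32"): "f32_buf"}
cache[("CONST", 2, "int32")] = "i32_buf"
assert len(cache) == 2
```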
* generic rendering of half and bf16
hotfix
* fix uops + regression test
* fix the test for metal's half4
* uop.uop fixup
* mypy with --strict-equality, fix ops_gpu
* set metal fast math default to 0 (disabled)
It's a correctness fix because we use inf and nan. Let's see how slow it is
* skip failed onnx tests
* tmp DISABLE_COMPILER_CACHE=1 in metal benchmark
* Revert "tmp DISABLE_COMPILER_CACHE=1 in metal benchmark"
This reverts commit 22267df38099acbf949aefdb6a5911ebc3a31984.
* env var METAL_FAST_MATH to disable fastmath for metal
use this to test the impact of fast math. might need to disable the compiler cache with DISABLE_COMPILER_CACHE=1
* failed onnx test with fast math
METAL_FAST_MATH=0 DISABLE_COMPILER_CACHE=1 NOOPT=1 python -m pytest -n=auto test/external/external_test_onnx_backend.py -k test_MaxPool3d_stride_padding_cpu
Fully UNROLLing the first_reduce should not change the number of
local_dims.
Fully UNROLLing a GROUP dim should reduce the number of
group_for_reduces by one.
Also changed group_for_reduces to be a count as the axis number
isn't used anywhere (they are always the first reduce dims).
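A toy model of that bookkeeping (pure illustration, not the linearizer's real data structures):

```python
class ToyKernel:
    def __init__(self, local_dims, reduce_dims, group_for_reduces):
        self.local_dims = local_dims                # local dim sizes
        self.reduce_dims = reduce_dims              # reduce dim sizes
        self.group_for_reduces = group_for_reduces  # now a plain count

    def full_unroll(self, axis, grouped):
        self.reduce_dims.pop(axis)       # the unrolled dim leaves the shape
        if grouped:                      # only a GROUP dim gives back
            self.group_for_reduces -= 1  # a grouped reduce

k = ToyKernel(local_dims=[16, 2], reduce_dims=[8, 4], group_for_reduces=1)
k.full_unroll(0, grouped=False)   # first_reduce: local_dims untouched
assert len(k.local_dims) == 2
k.full_unroll(0, grouped=True)    # GROUP dim: group_for_reduces drops by one
assert k.group_for_reduces == 0
```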
* ops_python: add HIP tensor core mock and refactor METAL
* Add tests to CI
* add DEBUG=2 to full tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp (see the sketch after this list)
* test_gemm passes
* metal gemm test pass
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style
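The core idea as a standalone sketch (warp_alu and the lane layout are made up; the real ops_python emulator is more involved):

```python
import operator

# a warp is a list of per-lane scalar values; an ALU uop maps elementwise
# across the lanes, which is all a SIMT warp does for scalar ops
def warp_alu(op, *src_warps):
    return [op(*lanes) for lanes in zip(*src_warps)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
assert warp_alu(operator.add, a, b) == [11.0, 22.0, 33.0, 44.0]
assert warp_alu(operator.mul, a, b) == [10.0, 40.0, 90.0, 160.0]
```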
run on TORCH since it's the fastest backend on CI.
this caught a bug in multinomial, and updates the behavior of fancy indexing and gather to move the indices Tensor to the same device as self.
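A toy sketch of the device rule (Toy and .to() are stand-ins, not tinygrad's API):

```python
from dataclasses import dataclass

@dataclass
class Toy:
    data: list
    device: str
    def to(self, device):
        return Toy(self.data, device)
    def gather(self, index):
        # move the indices onto self.device before indexing, as described
        if index.device != self.device:
            index = index.to(self.device)
        return Toy([self.data[i] for i in index.data], self.device)

x = Toy([10, 20, 30], "GPU:0")
idx = Toy([2, 0], "CPU")
assert x.gather(idx).device == "GPU:0"  # result stays on self's device
```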
fix var when the correction is too big. it seems to only work when the input size is 0, though.
torch can output -inf in var when the correction is too big, which does not make sense.
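Why a too-big correction is nonsense, worked out by hand:

```python
# sample variance divides by (N - correction); once correction >= N the
# denominator is nonpositive and the "variance" comes out negative
xs = [1.0, 2.0, 3.0]
n, correction = len(xs), 5
mean = sum(xs) / n
ss = sum((x - mean) ** 2 for x in xs)  # 2.0
print(ss / (n - correction))           # -1.0, an impossible variance
```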
* fix Tensor.mean to compute the mean correctly when 0-length axes are selected
* add a regression test
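The failure mode, shown with numpy for reference:

```python
import numpy as np

# reducing over a 0-length axis sums nothing and divides by zero, so mean
# has to special-case it rather than propagate a bogus value
x = np.empty((0, 4), dtype=np.float32)
print(x.sum(axis=0))   # [0. 0. 0. 0.] -- the sum itself is fine
print(x.mean(axis=0))  # [nan nan nan nan], with a RuntimeWarning
```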
* rename sum variable to sum_t to avoid conflict with the built-in function
* refactor Tensor.mean to have fewer lines
* skip mulacc opt if all src buffers of the mul op are const buffers
* add noqa directive for long test
* unskip MULACC opt
* ensure that a_axes at least includes summation axes in order to perform np.einsum correctly (illustrated after this list)
* add regression test for mulacc op
* compute a_slices using a_axes
* refactor helper function to retrieve axes and slices for nonzero strides as well as summation axes
* include a regression test that uses and to test the behaviour indirectly
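What including the summation axes means in np.einsum terms (a toy contraction, not the helper's real code):

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(3, 4)
# the summation axis 'j' must be among the axes handed to einsum for the
# inputs, and absent from the output, for the reduction to happen at all
out = np.einsum("ij,jk->ik", a, b)
assert np.allclose(out, a @ b)
```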
* PoC faster wino compile by catting consts across data expand dim
* fix fusions
* faster + golf it
* noqa E501
* implicit broadcast
* Revert "implicit broadcast"
This reverts commit 5915a9083d045ec1e6be84dcb492333325d48666.
* shorter
* shorter
* oops
* 216 upcasts is probably fine
* wino kernel count test
* test winograd number of sts
* specify device for apply_matrix mat elements
* shrink MLB on sharded axis
use a onehot structure to store the real partition. the goal is an unsynced batchnorm2d that can be run on multigpu for training.
draft version in https://github.com/chenyuxyz/tinygrad/pull/109
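The onehot idea in isolation (a numpy stand-in, not the actual multi-tensor code):

```python
import numpy as np

# each device owns one contiguous slice of the sharded axis; a per-device
# onehot mask records the real partition so stats can stay unsynced
bounds = [(0, 2), (2, 4), (4, 6)]  # shard bounds across 3 devices
masks = [np.array([float(lo <= i < hi) for i in range(6)]) for lo, hi in bounds]
assert np.allclose(sum(masks), 1.0)  # every index owned by exactly one device
```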
* SYNCBN flag
* test unclean shrinks
* UnsyncedBatchNorm reuses BatchNorm
* more robust pad arg check
* better types
* more tests!
* 6 gpus in benchmark
* disable slow GPUS=6 benchmark
- removed noop a=0
- fixed integer div test
- added test for both python expression and Tensor method call
- reordered for consistency and added some spaces