* `global_load` and `global_store` using buffer dtype
* `UOps.PHI` in all dtypes
* `UOps.ALU` in all dtypes
* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes
* -- endof implementation --
+tiny lint changes
* these tests require the fp16 extention
you can run them locally to confirm they're green: (GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)
`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`
skip the new test_linearizer_failures in CI GPU because of the fp16 extention
This passes on a real GPU since the extention is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`
see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)
* these tests fail in CI due to segfaults and CPU crashes
To confirm they're green locally, you can run the following commands:
1. For the tests skipped in test_ops.py (note: CLANG is very slow)
`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`
2. For the ONNX tests skipped in CLANG:
```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```
3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific
`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`
* Revert "these tests fail in CI due to segfaults and CPU crashes"
This reverts commit 15db57014381a4449d563526ac6c870e36257658.
* merge with cleanup-vectorized-hip-renders
* barely working HIP P1, ALU ops need a refactor?
* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) and half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) but can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:
```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```
* update the sign definition to avoid division by zero in all dtypes
* diff cleanup p1: why were these in the diff anyways
* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!
add ALU ops overloads for HIP
this will make HIP max work
handle mod
Revert "handle mod"
This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.
update max to use hmax
add HIP GEP render logic
enable CIFAR fp16 benchmark
test ops for HIP
back to store as float because this only works for float4 grouping right now
test_ops for hip!!
always sign
* back to the sign we had before because we cant do a backward pass on a Less node
* remove old hacks
HIP compiling test_ops in CI takes ~9 mins, not doing it for now
new HIP ALUs
* reduce accs done right
* refactor to function
* no device hacks
hacks p2
the other way
* LLVM ALU ops
half, float and double are all float
update max
* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool
* cleanup LLVM wrong code
* dummy change for the CUDA install glitch
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* remove pytest marks
* test more stuff
* fine revert some
* add that mark back
* skip that
* hmm LLVM does not work on ubuntu
* too slow on CUDA CI
* dup test
* ops_gpu is go
* fix size 0
* fix image, and add more tests
* nerf openpilot test, doesn't test thneed
* run the schedule
* better
* oops, new inputs
* delete pyopencl
* Update ops_gpu.py
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* rebalance
* balance
* parallel apt-get for all
* .local/lib/python3.11/site-packages
* what is user doing
* is that path right
* Update test.yml
* okay where are you
* site-packages
* image support weird loads
* umm, that was always wrong
* openpilot compile fails with a weird error
* image test passes
* we have valids now
* clean that up
* no more required opts
* add fastvits test, fix bug
* minor cleanups
* hip amd compilation
* gate the test properly
* cleanup unused import
* remove superfluous numpy conversion
* add SpeedyNet tests (f32 [passes] & f16 [fails])
* make CI verbose (error log from hip compiler)
* test the real ops_hip
* Merge branch 'tinygrad:master' into ci/hip-compilation
* fix CI
* cleanup
* really fix CI
* Fix CI Three: the refixening
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* hip amd compilation
* gate the test properly
* cleanup unused import
* remove superfluous numpy conversion
* add SpeedyNet tests (f32 [passes] & f16 [fails])
* make CI verbose (error log from hip compiler)
* test the real ops_hip
* Merge branch 'tinygrad:master' into ci/hip-compilation
* fix CI
* cleanup
* really fix CI
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* i love the walrus so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
* need env.
* ugh, wrong path
* replace get_child
* force rebuild of ocelot
* SzymonOzog gpuocelot
* delete that
* downgrade that
* non parallel
* force rebuild
* use llvm
* nauto
* less mem maybe
* print test
* helper_test_exception skip CUDACPU
* helper_test_exception
* shippable
* very close
* remove comment
* negative strides working
* almost everything passes
* calculate offset with list comprehension
* some cleanup
* got disk load working
* review suggestions
* fix after merge
* overlap working
* did it
* clean
* fixed disk load
* lint
* mypy
* removed as_strided
* trying without simplify
* added back simplify
* make sure expanding to smaller shape
* cleanup
* removed comment
* removed env file
* trying whisper test again
* onnx test sqlite issue
* working on test
* finished test
* eliminate unnecessary shrink-then-pad
* don't shrink buffer
* added strides check
* added to ci under linters
* switch issue
* allow symbolic stride
* removed .env
* isinstance
* adjust strides for double expand
* cleanup
* needed to add type hint for mypy
* set pythonpath
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* wmma: refactor tensor cores using existing local dims
* optimizer: fix bad rebase and break after one late local
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>