The current yolov3 example is broken with the current implementation of fetch in the helpers. I was tempted to fix the helpers instead, but that could just as well have broken other examples.
* add bf16 test support
this model takes me almost a minute to download though:
https://huggingface.co/TinyPixel/Llama-2-7B-bf16-sharded/resolve/main/pytorch_model-00001-of-00014.bin?download=true: 100%|█████████████████████████████| 981M/981M [00:40<00:00, 24.2MB/s]
* ensure we first load if it is bitcast to avoid taking the address of an rvalue
* tiny bf16 in the cloud
skip GPU
* should skip torch
lint
* Revert "ensure we first load if it is bitcast to avoid taking the address of an rvalue"
This reverts commit b86a28ab84bc1173764b2d480218e8de41a32390.
* break the kernel
* skip LLVM and GPU in CI
* skip CUDA
* universal test cast
* disable div
* midcast fixup
* add 64-bit types
* hack maximum
* use Metal precise::sin instead of default
This is because the default sin function uses single-precision math: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf#page=164
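The precision gap behind this change can be reproduced with numpy alone (a standalone sketch, not Metal or tinygrad code): evaluating sin in float32 only matches a float64 reference to float32 precision, which is why the sin tolerances get loosened a few commits later in this series.

```python
import numpy as np

# Evaluate sin at single precision and compare against a double-precision
# reference over a moderate range.
xs = np.linspace(-10, 10, 10001)
f32 = np.sin(xs.astype(np.float32)).astype(np.float64)
f64 = np.sin(xs)
err = np.max(np.abs(f32 - f64))

# The error sits at float32 scale (~1e-7), far above float64 eps, so test
# tolerances for a single-precision sin must budget for it.
assert 1e-12 < err < 1e-5
```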
* LLVM code_for_op support for var_dtype
* comment out maximum for now with a TODO explaining it
* Revert "hack maximum"
This reverts commit d170048c5fc029eab41f8472dd53f44c448370a1.
* make the comment more specific
* slightly more forgiving
* ok does this fail in all backends?
* weird, it's only Metal CI
* add graph
* skip sin of nan for CUDACPU
This is only happening in the CUDACPU runtime and not CUDA itself. https://github.com/tinygrad/tinygrad/actions/runs/7128973726/job/19412000385#step:16:36
* METAL and CUDACPU behave differently in overflows with numpy running on CI
* that skip is wrong
* skip fp16 tests on LLVM similar to test_dtype
original commit that skipped LLVM in CI: 1826ff6b89
* remove all of sin from CUDACPU
* limit range of values in CUDACPU and METAL CI
* Revert "use Metal precise::sin instead of default"
This reverts commit d960094d4a22fe69a9b6cb23ff7cd88e86a3c675.
* change atol and rtol for Metal sin
* METAL CI is more imprecise
* cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* enable test_index and test_advancedindex with pretty diff
* removed contig
* created set_ helper function
* comment change
* del empty line
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* jit graph split
* update
* that's fine, not all buffers are there now
* use logarithmic though, seems good
* no keep it simple
* add test
* simplify
* split graph when jit item cannot be graphed
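The splitting described above can be sketched as a simple partition: consecutive graphable items are batched, and anything that cannot be graphed gets its own run. This is an illustrative helper with hypothetical names, not tinygrad's actual jit code.

```python
from typing import Callable, List

def split_graphable(items: List, graphable: Callable[[object], bool]) -> List[List]:
    """Partition a schedule into runs: consecutive graphable items are batched
    together; each non-graphable item becomes its own single-item run."""
    runs: List[List] = []
    batch: List = []
    for it in items:
        if graphable(it):
            batch.append(it)
        else:
            if batch:
                runs.append(batch)
                batch = []
            runs.append([it])
    if batch:
        runs.append(batch)
    return runs

# example: letters stand in for graphable items, digits for non-graphable ones
print(split_graphable(list("ab1cd"), str.isalpha))  # [['a', 'b'], ['1'], ['c', 'd']]
```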
* dtypes alu test
* those types don't exist in torch
* floats
* more tests
* disable those
* a couple unary tests
* skip float16 tests in CI for GPU
* fix LLVM bool add: True+True = 1+1 = 2, which truncates to False in native LLVM
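The bug can be sketched in plain Python: in a 1-bit integer, 1 + 1 wraps to 0, so the sum has to be computed in a wider type and compared against zero before narrowing back to bool. Both functions below are illustrations of the behavior, not the actual LLVM backend code.

```python
def bool_add_i1(a: bool, b: bool) -> bool:
    # buggy: a native 1-bit add wraps, so 1 + 1 = 2 truncates to 0 (False)
    return bool((int(a) + int(b)) & 1)

def bool_add_widened(a: bool, b: bool) -> bool:
    # fixed: add in a wider type, then test against zero
    return (int(a) + int(b)) != 0

print(bool_add_i1(True, True))       # False -- the truncation bug
print(bool_add_widened(True, True))  # True
```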
* remove hardcoded float for LLVM ALU fns
* less sensitive atol for fp32; 1e-10 is flaky and sometimes failed even if you revert the merge commit for non-fp32 math; nothing has changed in our kernels for fp32.
* return on overflows
* fix CUDA exp2
* compute results of op regardless of bounds in a python backend
* skip fp16 in GPU and CUDACPU
* fuzz a smaller range in the float_midcast_int32 test
I sampled this and we overflow ~70% of the time.
Because numpy behaves differently on different devices for overflows, and Metal seems to do the same, I'm opting to eliminate the non-determinism here.
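Why limiting the range helps: a float-to-int32 cast is only well defined when the value fits in int32; out-of-range casts are undefined behavior in C, so METAL, CUDACPU, and numpy on CI can all disagree. A small numpy sketch of the in-range casts the restricted fuzz range sticks to:

```python
import numpy as np

# In-range float32 -> int32 casts are deterministic everywhere:
# truncation toward zero, exactly as a C cast.
assert np.float32(1000.5).astype(np.int32) == 1000
assert np.float32(-7.9).astype(np.int32) == -7

# Out-of-range casts (e.g. np.float32(1e10) to int32) are undefined behavior
# in C and give backend-dependent results, so the test avoids sampling them.
```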
* remove CUDA exp2 overload it's already there now
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* bitcast renderers
* fast llama load
* make it one kernel
* regression testing p1: re-enable test_dtype for all backends
fix GPU
* regression testing p2: fuzz all possible cases against numpy
remove hardcoded tests since the fuzzer covers them
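A fuzzer of this shape can be sketched generically: sample random in-range inputs, run the op under test, and compare against a float64 numpy reference. This is an illustrative harness with hypothetical names; the real test drives tinygrad ops rather than a plain callable.

```python
import numpy as np

def fuzz_against_numpy(op, np_op, dtype, n=1000, seed=0):
    """Compare an elementwise binary op against a float64 numpy reference on
    random in-range inputs; returns the max absolute error."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(-10, 10, n).astype(dtype)
    ys = rng.uniform(-10, 10, n).astype(dtype)
    got = op(xs, ys).astype(np.float64)
    want = np_op(xs.astype(np.float64), ys.astype(np.float64))
    return float(np.max(np.abs(got - want)))

# sanity check: float32 add matches the float64 reference to float32 precision
assert fuzz_against_numpy(np.add, np.add, np.float32) < 1e-5
```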
* define ushort
* fix indent, probably need flake8 back for CI to catch
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add some helpers
* I think it should all work..
* fixed get_set_tensor
* done
* del import
* bye bye typing
* style
* remove empty lines lol
* deleted dtype arg
* del trailing space
* new getitem
* go
* add temporary simple tests
* better
* comments
* WOW that took a while
* save 1 line lol
* work
* still need to add comprehensive tests, but i think getitem looks nice :D
* GIMME GREEN CI CHECKMARK PLS
* try..
* k idk
* added tests for errors
* fixed small hack
* added tests
* almost good
* try no contig?
* yay no more contig + comments and spacing
* finishing touches (comments)
* revert regex unittests lol
* add suggested change
* oops I fell asleep yesterday
* handle reshape of contiguous subparts with explicit mask
* remove the add/remove ones logic in reshape
* accommodate ones in accumulate logic
* make multiply commutative
* fix linting
* make mypy happy
* add test for commutative mul
* merge dimensions in shape_strides for 1 range masks
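Merging adjacent dimensions when their strides line up is a standard shape/stride trick; a simplified standalone sketch of the idea behind `to_shape_strides` (later `_merge_dims`), ignoring masks and using a hypothetical signature:

```python
def merge_dims(shape, strides):
    """Merge adjacent (size, stride) pairs when the outer dimension steps
    exactly over the inner one (outer_stride == inner_size * inner_stride).
    Size-1 dimensions fold into their neighbor."""
    out = [(shape[0], strides[0])]
    for size, stride in zip(shape[1:], strides[1:]):
        psize, pstride = out[-1]
        if size == 1:
            continue  # ones never add a real axis
        if psize == 1 or pstride == size * stride:
            out[-1] = (psize * size, stride)  # mergeable: collapse into one dim
        else:
            out.append((size, stride))
    return out

# a contiguous (2, 3, 4) tensor collapses to a single dimension of 24
print(merge_dims((2, 3, 4), (12, 4, 1)))  # [(24, 1)]
```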
* add offsets for merging
* fix linting
* add back explicit 1 reshapes
* fix mypy errors
* fix accumulate by including state
* include non-zero stride dimension in acc
* small cleanup
* more compact to_shape_strides
* more logical cleanup
* compress more
* compress reshape mask
* adding some comments
* small bug fix
* improve test coverage
* remove explicit add remove ones
* small bug in test
* enable test_reshape_splitting_combining
* small fix
* 10 lines less to_shape_strides
* shorten reshape mask
* some more cleanup
* more cleanup
* introduce some symbols for compactness
* more symbols
* even cleaner
* use fewer symbols; it had become less readable
* remove merge_views from view.reshape
* change to_shape_strides to _merge_dims
* improve readability
* fix corner case
* cleanup
* better handling of 1 <= Variable('i',1,10) & new_dim = Variable('i',1,10)
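The simplification in question is decidable from variable bounds alone: `1 <= Variable('i', 1, 10)` always holds because the lower bound is already 1. A minimal toy sketch of bound-based comparison folding (not tinygrad's actual symbolic module; note Python routes `1 <= i` through `i.__ge__(1)`):

```python
class Variable:
    """Toy symbolic integer with inclusive bounds [vmin, vmax]."""
    def __init__(self, name, vmin, vmax):
        self.name, self.vmin, self.vmax = name, vmin, vmax

    def __ge__(self, c):
        # fold the comparison when the bounds decide it outright;
        # otherwise stay symbolic (represented here by None)
        if self.vmin >= c: return True
        if self.vmax < c: return False
        return None

i = Variable('i', 1, 10)
print(1 <= i)   # True: the lower bound already guarantees it
print(i >= 11)  # False: above the upper bound
print(i >= 5)   # None: depends on the runtime value
```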
* rewrite _reshape_mask for readability
* fix white space
* add comment
* nice shorthands for readability
* add proof in docs
* small nit
---------
Co-authored-by: chenyu <chenyu@fastmail.com>