Commit Graph

3056 Commits

Author SHA1 Message Date
George Hotz 3635540ddb
shorter line (#2733) 2023-12-12 15:34:17 -08:00
nimlgen ede7971ada
save some lines (#2731)
* remove unused mem_cached var

* one more
2023-12-12 15:26:27 -08:00
chenyu 00b611c156
simplify type promotion - remove weak types (#2730) 2023-12-12 16:12:57 -05:00
Nguyen Nguyen Phuong 07cf45e133
fix cuda matmul (#2725) 2023-12-12 07:59:31 -08:00
chenyu ef6e942a23
dtype promotion helpers (#2724)
* dtype promotion helpers

* better tests

* space
2023-12-11 23:14:23 -05:00
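
To make the promotion idea concrete, here is a minimal, hypothetical sketch of rank-based dtype promotion in the spirit of #2724 and #2730; the names `DTYPE_RANK` and `promote` are illustrative only and are not tinygrad's actual helpers or its exact promotion rules.

```python
# Hypothetical rank-lattice promotion sketch; not tinygrad's real helpers or rules.
DTYPE_RANK = {"bool": 0, "int8": 1, "int16": 2, "int32": 3, "int64": 4,
              "float16": 5, "float32": 6, "float64": 7}

def promote(a: str, b: str) -> str:
  # the result dtype is the higher-ranked (least upper bound) of the two operands
  return a if DTYPE_RANK[a] >= DTYPE_RANK[b] else b

assert promote("int32", "float16") == "float16"
assert promote("bool", "int8") == "int8"
```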
Christopher Mauri Milan 0232db294d
fix tolist issue (#2723) 2023-12-11 19:14:00 -08:00
chenyu 4075208127
some dtype creation spec test cases (#2722) 2023-12-11 19:33:49 -05:00
Guy Leroy ee9e1d3662
Extend available types for `safe_save` (#2720)
* Extend available types to save with

* Linter fix
2023-12-11 14:50:35 -08:00
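
For context, a hedged usage sketch of the safetensors helpers this PR extends; the exact `safe_save`/`safe_load` signatures at this commit may differ slightly.

```python
# Usage sketch, assuming the tinygrad.nn.state safetensors API of this era.
from tinygrad.tensor import Tensor
from tinygrad.nn.state import safe_save, safe_load

state = {"weight": Tensor.ones(4, 4), "bias": Tensor.zeros(4)}
safe_save(state, "/tmp/model.safetensors")        # write a dict of Tensors to disk
restored = safe_load("/tmp/model.safetensors")    # read it back as a dict of Tensors
assert restored["weight"].shape == (4, 4)
```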
George Hotz b5fd160b39 hotfix: increase rtol on simple_matmul 2023-12-11 10:10:29 -08:00
Gregor Kikelj 4feaaa27aa
ensure shrink is valid (#2717) 2023-12-11 09:58:43 -08:00
qazal a43bc78804
fix dtypes helpers for integers (#2716)
* scalar

* maybe do this instead

* Revert "scalar"

everything is a scalar

* add tests in test_dtype

* fuzz testing + fix unsigned ints

* fuzz everything
2023-12-11 09:28:19 -08:00
nimlgen bc3c4ce50b
cuda set context before sync (#2715)
* cuda set context before sync

* no helper
2023-12-11 09:26:53 -08:00
Ivan Vnučec 8d206f6bfd
fix help message (#2705)
llama -> mixtral
2023-12-10 22:04:35 -08:00
George Hotz 59ab3675a3
faster mixtral + green for new kernels (#2701)
* green for new kernels

* track ram
2023-12-10 19:04:58 -08:00
chenyu 2ee6f689c5
simpler einsum (#2700) 2023-12-10 21:24:44 -05:00
George Hotz b01e3907a1 mixtral touch up: two lines 2023-12-10 17:21:49 -08:00
George Hotz b3982187d1
Mixtral Example (#2691)
* mixtral

* simpler

* global counters

* simpler

* weights arg
2023-12-10 17:18:31 -08:00
George Hotz 0fd44259cd
bf16 fix + cleanups from mixtral (#2698)
* bf16 fix + cleanups from mixtral

* generic bf16 cast
2023-12-10 16:31:52 -08:00
Davi Silva 7fbebb3df6
Implement einsum (#2686)
* hopeful impl for Tensor.einsum

* satisfy mypy by having less typing. :(

* a few simple tests

* even more tests

* permute tests

* xfails for improper usage

* fix LLVM test fail

* use argfix

* more helpful error message on shape mismatch
2023-12-10 15:56:01 -08:00
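
A short usage sketch of the `Tensor.einsum` added in #2686; the formula-first calling convention mirrors numpy-style einsum notation, though the exact signature at this commit may differ.

```python
# Hedged einsum usage sketch for tinygrad's Tensor.einsum.
from tinygrad.tensor import Tensor

a, b = Tensor.rand(2, 3), Tensor.rand(3, 4)
matmul = Tensor.einsum("ij,jk->ik", a, b)   # contract over the shared j axis
transp = Tensor.einsum("ij->ji", a)         # pure permutation, no reduction
print(matmul.shape, transp.shape)           # (2, 4) (3, 2)
```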
chenyu 181b0970b5
slightly better extra/to_movement_ops dedups (#2695) 2023-12-10 11:05:44 -05:00
chenyu ef18d79faa
remove noop from to_movement_ops (#2693) 2023-12-10 00:50:24 -05:00
chenyu 2d0e38e201
fix jit input_rawbuffers check wrt consts (#2689)
* fix jit input_rawbuffers check wrt consts

* .numpy()
2023-12-09 15:59:03 -05:00
geohotstan 67ff2b2b18
Formatted test_indexing (#2688)
* added tensor.clone() for more correct cloning behavior

* some work and randint issue

* formatted

* final cleanups

* oops, bug fix
2023-12-09 11:38:36 -05:00
chenyu 1e7823e1f5
combine GROUP and GROUPTOP to a single block (#2687) 2023-12-09 01:19:32 -05:00
chenyu 0fb1d47aa0
two linearizer fuzzer failed test case for webgpu (#2685)
* add a linearizer fuzzer failed for webgpu

* CI specific
2023-12-08 22:52:34 -05:00
chenyu fae5394845
validate llama output (#2681)
* validate llama output

* does not work with quantize
2023-12-08 16:42:01 -05:00
nickovaras 182d067407
Update yolov3.py (#2680)
The yolov3 example is broken with the current implementation of fetch in the helpers. I was tempted to fix the helpers instead, but that could just as well have broken other examples.
2023-12-08 12:59:38 -08:00
qazal 73b067f5ce
Bitcast p2 bfloat16 tests + clang fix (#2635)
* add bf16 test support

this model takes me almost a minute to download though:

https://huggingface.co/TinyPixel/Llama-2-7B-bf16-sharded/resolve/main/pytorch_model-00001-of-00014.bin?download=true: 100%|█████████████████████████████| 981M/981M [00:40<00:00, 24.2MB/s]

* ensure we first load if it is bitcast to avoid taking the address of an rvalue

* tiny bf16 in the cloud

skip GPU

* should skip torch

lint

* Revert "ensure we first load if it is bitcast to avoid taking the address of an rvalue"

This reverts commit b86a28ab84bc1173764b2d480218e8de41a32390.

* break the kernel

* skip LLVM and GPU in CI

* skip CUDA
2023-12-08 10:30:10 -08:00
qazal a29538a094
green more dtypes tests (#2656)
* universal test cast

* disable div

* midcast fixup

* add 64-bit types

* hack maximum

* use Metal precise::sin instead of default

This is because the default sin function uses single-precision math: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf#page=164

* LLVM code_for_op support for var_dtype

* comment out maximum for now with a TODO explaining it

* Revert "hack maximum"

This reverts commit d170048c5fc029eab41f8472dd53f44c448370a1.

* make the comment more specific

* slightly more forgiving

* ok does this fail in all backends?

* weird its only Metal CI

* add graph

* skip sin of nan for CUDACPU

This is only happening in the CUDACPU runtime and not CUDA itself. https://github.com/tinygrad/tinygrad/actions/runs/7128973726/job/19412000385#step:16:36

* METAL and CUDACPU behave differently in overflows with numpy running on CI

* that skip is wrong

* skip fp16 tests on LLVM similar to test_dtype

original commit that skipped LLVM in CI 1826ff6b89

* remove all of sin from CUDACPU

* limit range of values in CUDACPU and METAL CI

* Revert "use Metal precise::sin instead of default"

This reverts commit d960094d4a22fe69a9b6cb23ff7cd88e86a3c675.

* change atol and rtol for Metal sin

* METAL CI is more imprecise

* cleanup

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-12-08 10:29:20 -08:00
George Hotz 4164d0ebbd
multitensor start (#2676)
* multitensor work

* early gen fixes the tests

* atol for flaky test
2023-12-07 17:07:05 -08:00
Ahmed Harmouche 4b01839774
support vals on WebGPU, run more tests (#2668)
* Vals on webgpu, run more tests

* Skip slow tests, run symbolic ops tests

* Balance out tests
2023-12-07 16:45:21 -08:00
geohotstan d02ff21f1a
enable test_index and test_advancedindex (#2648)
* enable test_index and test_advancedindex with pretty diff

* removed contig

* created set_ helper function

* comment change

* del empty line

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-07 19:44:39 -05:00
George Hotz 00d9eda961
FROM -> COPY, move vars_from_ast (#2675) 2023-12-07 16:32:30 -08:00
chenyu 51af99367f
fix fuzz_linearizer using new device Buffer (#2674) 2023-12-07 19:21:47 -05:00
nimlgen 650117a8f6
split large jit into several graphs (#2650)
* jit graph split

* update

* that's fine, not all buffers are there now

* use logarithmic tho, seems good

* no keep it simple

* add test

* simplify

* split graph when jit item cannot be graphed
2023-12-07 10:58:25 -08:00
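
A minimal sketch of the splitting idea in #2650, under the assumption that captured jit items are chunked into separate graph launches and that any item which cannot be graphed gets its own segment; this is illustrative, not tinygrad's TinyJit code.

```python
# Illustrative-only chunking logic; the names and the threshold are assumptions.
def split_into_graphs(jit_items, can_graph, max_per_graph=64):
  graphs, current = [], []
  for item in jit_items:
    if not can_graph(item):
      if current: graphs.append(current); current = []
      graphs.append([item])            # item that cannot be graphed runs standalone
      continue
    current.append(item)
    if len(current) >= max_per_graph:  # start a new graph once the current one is full
      graphs.append(current); current = []
  if current: graphs.append(current)
  return graphs
```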
qazal 29f2653d8d
add graph (#2670) 2023-12-07 10:53:31 -08:00
chenyu 539b00a645
move llama getenv("JIT") from models to examples (#2671)
The Transformer class has a jit param, so we should use that in the caller
2023-12-07 12:43:22 -05:00
chenyu fd21eced74
reduce gpt2 kernel count in test_real_world (#2663) 2023-12-06 21:57:04 -05:00
chenyu 371005cb2d
use one kvcache tensor in gpt2 instead of two separate caches (#2662)
* use one kvcache tensor in gpt2

* test case

* is None

* better test cases
2023-12-06 20:59:17 -05:00
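
An illustrative sketch of the cache layout change in #2662, assuming the single tensor carries a leading axis of size 2 (keys at index 0, values at index 1); the shapes and helper below are hypothetical, not the gpt2 example's exact code.

```python
# Single kv-cache tensor sketch; layout and helper are assumptions for illustration.
from tinygrad.tensor import Tensor

bs, max_ctx, n_heads, head_dim = 1, 128, 12, 64
kv_cache = Tensor.zeros(2, bs, max_ctx, n_heads, head_dim)  # one tensor instead of two

def read_cache(kv_cache, end_pos):
  keys   = kv_cache[0, :, :end_pos]   # slice keys back out for attention
  values = kv_cache[1, :, :end_pos]
  return keys, values
```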
George Hotz 5a7b2ff1b2
masked shapetrackers (#2657) 2023-12-06 11:22:26 -08:00
chenyu b931a20882
minor shapetracker cleanup (#2652) 2023-12-06 11:43:52 -05:00
qazal c704a77ca0
green dtypes ALU tests (#2617)
* dtypes alu test

* those types don't exist in torch

* floats

* more tests

* disable those

* a couple unary tests

* skip float16 tests in CI for GPU

* fix LLVM bool add True+True=1+1=2 which truncates to False in native LLVM

* remove hardcoded float for LLVM ALU fns

* less sensitive atol for fp32; 1e-10 is flaky and sometimes failed even if you revert the merge commit for non-fp32 math, and nothing has changed in our kernels for fp32.

* return on overflows

* fix CUDA exp2

* compute results of op regardless of bounds in a python backend

* skip fp16 in GPU and CUDACPU

* fuzz a smaller range in the float_midcast_int32 test

I sampled this and we overflow ~70% of the time.
Because numpy behaves differently on different devices for overflows, and Metal seems to do the same, I'm opting to eliminate the non-determinism here.

* remove CUDA exp2 overload it's already there now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-12-06 08:15:46 -08:00
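
One of the fixes above concerns boolean addition: under Python/numpy semantics True + True is truthy, but adding two i1 values in native LLVM wraps modulo 2 and reads as False. A quick illustration of the expected versus wrapped behavior:

```python
# Expected bool-add semantics vs. what a raw 1-bit add would produce.
import numpy as np

py_sum  = True + True                       # 2: Python bools are ints
np_sum  = np.bool_(True) + np.bool_(True)   # True: numpy keeps the result boolean
wrapped = (1 + 1) % 2                       # 0: an i1 add wraps, which reads as False
print(py_sum, np_sum, wrapped)
```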
Amrit Sahu 71d989b476
adding test to cover #2644 failure (#2645) 2023-12-06 11:00:30 -05:00
Ahmed Harmouche 50dcd532d5
Get all WEBGPU test_ops passing (#2646)
* Get all WEBGPU tests passing

* Custom render store is not needed in wgsl
2023-12-06 07:40:37 -08:00
chenyu 0978c24b8e
fast gpt2 embedding with variable bs=1 (#2596) 2023-12-05 23:01:17 -05:00
chenyu 229ada5fe5
Gpt2 benchmark with HALF and BEAM (#2636)
* benchmark gpt2 with half and beam

* BEAM=4

* optional validation

* green is good

* we care
2023-12-05 22:15:16 -05:00
George Hotz a73579919f mlx benchmark, a lil slower than tg 2023-12-05 19:00:43 -08:00
Oleg Rybalko 7c427d738c
don't apply padding on script call (#2585)
* don't apply padding on script call

* no need for a new param because the batch_size value can be used for the check

* fixed argument naming
2023-12-05 16:34:10 -08:00
George Hotz 9d7ead84e1 hotfix: no need for model cache in examples/coder.py 2023-12-05 16:27:36 -08:00
qazal be09cc87c1
Bitcast support / fast bf16 load (#2011)
* bitcast renderers

* fast llama load

* make it one kernel

* regression testing p1: re-enable test_dtype for all backends

fix GPU

* regression testing p2: fuzz all possible cases against numpy

remove hardcoded tests since the fuzzer covers them

* define ushort

* fix indent, probably need flake8 back for CI to catch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-05 16:19:28 -08:00
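
For context on the bitcast vs. cast distinction (and the fast bf16 load trick) behind #2011, here is a small numpy sketch; the kernels tinygrad actually generates are not shown, and the bf16 bit pattern below is just an illustrative constant.

```python
# A cast converts values; a bitcast reinterprets the same bytes at equal width.
import numpy as np

x = np.array([1.0, -2.5], dtype=np.float32)
casted  = x.astype(np.int32)   # value conversion: [ 1, -2]
bitcast = x.view(np.int32)     # byte reinterpretation: raw IEEE-754 bit patterns

# bf16 -> float32 "fast load": bfloat16 is the top 16 bits of a float32, so pad
# the stored bits into the high half of a 32-bit word and bitcast.
bf16_bits = np.array([0x3FC0], dtype=np.uint16)              # bf16 pattern for 1.5
f32 = (bf16_bits.astype(np.uint32) << 16).view(np.float32)   # -> [1.5]
print(casted, bitcast, f32)
```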