Commit Graph

303 Commits

Author SHA1 Message Date
George Hotz 03a6bc59c1
move autogen to runtime/autogen (#3254) 2024-01-26 12:44:19 -08:00
George Hotz a3869ffd46
move gpuctypes in tree (#3253)
* move gpuctypes in tree

* fix mypy

* regex exclude

* autogen sh

* mypy exclude

* does that fix it

* fix mypy

* add hip confirm

* verify all autogens

* build clang2py

* opencl headers

* gpu on 22.04
2024-01-26 12:25:03 -08:00
chenyu bc92c4cc32
onnx Einsum, CumSum, DepthToSpace, SpaceToDepth (#3252)
* onnx Einsum, CumSum, DepthToSpace, SpaceToDepth

Einsum inner product and `...` are not supported

* --durations=20
2024-01-26 10:47:53 -05:00
George Hotz aa0d1b6330 hotfix: don't use noqa: E702 that's just dumb 2024-01-24 20:01:00 -08:00
chenyu 2088937206
run full hlb_cifar training in tinybox ci (#3145)
* run full hlb_cifar training in tinybox ci

single gpu ~89 seconds

* time that
2024-01-15 23:59:20 -05:00
chenyu e078e2d060
add half @ half to mac benchmark (#3103) 2024-01-12 16:38:41 -05:00
chenyu 93e3f952aa
use BEAM=2 instead of BEAM=4 in cuda ci gpt2 (#3089)
BEAM=2 is faster and less search time. investigating why BEAM2+BEAM4 is slower than BEAM2 alone
2024-01-11 13:21:06 -05:00
chenyu 7f9590d357
hotfix disable flaky mac runner wino cifar (#3087) 2024-01-11 11:57:05 -05:00
jxdv ef3aa6d7fb
update gh actions (#3033)
* update checkout actions

* update upload artifact

* update setup python

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-01-09 17:52:22 -08:00
chenyu 1d730b8853
remove ACCUM_FP32 in simple_matmul.py (#3045)
* remove ACCUM_FP32 in simple_matmul.py

accumate for half inputs is always in float

* move test llama compile speed to metal
2024-01-08 17:37:57 -05:00
George Hotz 50754f1494
add caches there (#3042)
* add caches there

* no curl
2024-01-08 13:02:16 -08:00
George Hotz c5a941d466
webgl backend in extra (#3041)
* WebGL WIP

* 84% of ops passing test

* tests passing 100%

* Cleanup, refactor

* Shave off some lines

* Work on dtypes

* TestOps at 100% again

* Efficient net shaders compile in browser webgl2

* Compile all efficientnet shaders in browser

* Create empty textures for tensor buffers

* Run program. Up next weight loading

* Exported WebGL model working

* Add tests, refactor

* Explicit cast alu for GLSL

* Fix CI tests

* WebGL efficientnet demo

* Compile and run yolov8 in browser

* Fix imports

* Simplify yolo compile

* Fix bool*bool and cast cmplt to float

* More tests

* Do std tests pass on CI?

* Skip std tests on CI

* Remove explicit_cast_alu hack, and solve it in code_for_op

* Move to new dtype-less alloc api

* Remove local size hack: optimize local_size only if device has local

* Remove glsl.py, and move content to cstyle

* dont_use_locals in opts

* Fix dtype tests

* type_map in CStyleLanguage

* Make core changes smaller, cleaner, refactor export_model and demo

* Skip pad_slice

* Simplify: render_const, render_conditional

* solve bool alu for other binops, cleaner ops_webgl

* Fix noopt hack

* Remove some skipIfs

* WebGL image hack

* type_names is a better name

* global_max

* Fix dtype import

* Fix type_names -> type_map

* Fix lint

* Remove webgpu, back to 5k lines (#3040)

* remove webgpu

* max 5000 lines

* revert those to master

* retain that cstyle

---------

Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
2024-01-08 09:29:13 -08:00
George Hotz 8cbcd1b342
Remove webgpu, back to 5k lines (#3040)
* remove webgpu

* max 5000 lines
2024-01-08 09:10:07 -08:00
George Hotz 60abc62a3f
fast hip read (#3014)
* fast hip read

* hip read faster

* fix tests

* to_mv

* simplify

* bump to 6k lines
2024-01-05 10:33:13 -08:00
chenyu 2b6670d2ea
separate entry for HALF hlb_cifar10 in benchmark (#3010) 2024-01-04 13:24:10 -05:00
George Hotz a0c7cb2564 hotfix: create weights dir in local tg checkout 2024-01-03 14:14:33 -08:00
George Hotz fc36a7d669 tinygrad weights 2024-01-03 14:09:28 -08:00
George Hotz 0be0f2f745 remove stable diffusion test on tinymac 2024-01-03 13:18:24 -08:00
George Hotz 753a7ecc05
Hip driver (#2992)
* start hip driver

* fix hip llama

* make HIP default if we can

* don't change those
2024-01-03 12:53:47 -08:00
Yixiang Gao ea3bc2f509 remove wino benchmark for now 2024-01-03 10:46:43 -08:00
Yixiang Gao 5663dd46b6 Merge branch 'master' of github.com:tinygrad/tinygrad into cifar_fp16 2024-01-03 10:11:46 -08:00
Yixiang Gao 7f1802cd50 update benchmark 2024-01-03 09:09:34 -08:00
George Hotz f494b9d463
simple multitensor API (#2903)
* simple multitensor API

* test multitensor

* mt work

* new api

* copies

* all but data parallel

* allreduce there

* works, but axis sharded

* fix all mt tests

* features/multi

* work

* backprop

* fix tests

* tests passing

* mt progress

* cleanups

* less lines

* tensor cleanup

* save more lines

* mypy passes

* fix tests

* skip for cuda too

* bump download cache
2024-01-02 17:49:44 -08:00
George Hotz dbe4a1a914
switch CI to tiny8 (#2984)
* switch CI to tiny8

* no copyin for disk

* Revert "no copyin for disk"

This reverts commit eb46b7e93da4a650d8125020c38f44d1f8f2c86e.

* rocm 6 broke llama

* rename it
2024-01-02 16:40:25 -08:00
Yixiang Gao 54cdba57e7 mend 2024-01-02 14:21:06 -08:00
Yixiang Gao 26303d181b re-enable half cifar benchmarks 2024-01-02 14:16:35 -08:00
George Hotz 17f0c3006b hotfix: do stable diffusion first on mac 2024-01-01 15:38:25 -08:00
chenyu e53b96fdbb
fix TC=2 tensor core op test (#2951)
* print DEBUG for TC=2 in CI

* enable TC=2

* no need to check src type

* LOAD has side effect

* don't push any local buffer

* update comment

* and BARRIER
2023-12-29 21:39:49 -05:00
George Hotz 7da2325dc7
get_lazyops() -> lazyops (#2884)
* get_lazyops() -> lazyops

* don't compare empty mem
2023-12-20 18:04:49 -08:00
George Hotz 1765849937
new lazy, benchmark (#2878)
* lazy rewrite, try 2

* min fix tests

* pass contig test

* put broken pads back

* move that to realize

* no contig child fixes array packing

* so wrong

* now that's correct

* base children

* fix bind issues

* disable to_image_idx

* fix tests

* that failure shouldn't break other tests

* more fixes

* fix torch

* skip failing tests in CI

* 1e-7

* half is broken

* 1e-6 margin of error
2023-12-20 14:33:21 -08:00
George Hotz ca59054463
fix shapetracker math (#2861)
* proper test

* all st math good now

* fix real_strides bug
2023-12-19 22:17:34 -08:00
chenyu 1231ec5a02
run the sz.py line count at the end of linter ci (#2857) 2023-12-19 16:33:12 -05:00
George Hotz 6617dcf095
move graph to runtime, check line count with sz.py (#2842)
* move graph to runtime, check line count with sz.py

* oops, didn't save

* dtype aliases

* restore comment, REALCOUNT
2023-12-18 20:30:06 -08:00
George Hotz 80f53245e8
shapetracker add and invert (#2828)
* invert (broken)

* decent invert

* shapetracker invert works

* plus is meh, invert is good

* support invert mask

* a few more invert tests

* shapetracker math invert test
2023-12-18 16:03:27 -08:00
chenyu 73cadfbb3c
Remove pytest markers (#2831)
* remove pytest marker

* fix some, skip some

* tweak

* fix

* skip slow

* skip more
2023-12-18 18:53:28 -05:00
chenyu 4e2a92cee1
run HALF GPT2 in nvidia benchmark in addition to HALF/BEAM (#2811)
easier to separate the issue between HALF and BEAM when it failed
2023-12-17 02:24:55 -05:00
George Hotz 051402625e
remove pushing contig + fix linearizer bug (#2798)
* remove that logic

* fix test, move LOADs

* fix repeat issue on LLVM

* with_phi
2023-12-16 09:36:31 -08:00
George Hotz c6eb618013
tests from new lazy branch (#2774)
* tests from new lazy branch

* fix lin 11

* that was needed

* doesn't fail

* mark

* meant that

* llvm passes
2023-12-14 23:06:39 -08:00
chenyu a044125c39
validate stable diffusion for seed 0 (#2773)
* validate stable diffusion for seed 0

the closest false positive i can get is with the setup and one less step. dist = 0.0036
same setup with fp16 has dist=5e-6.
so setting validation threshold to 1e-4 should be good

* run with --seed 0
2023-12-15 00:07:09 -05:00
Ahmed Harmouche 4b01839774
support vals on WebGPU, run more tests (#2668)
* Vals on webgpu, run more tests

* Skip slow tests, run symbolic ops tests

* Balance out tests
2023-12-07 16:45:21 -08:00
George Hotz 00d9eda961
FROM -> COPY, move vars_from_ast (#2675) 2023-12-07 16:32:30 -08:00
Ahmed Harmouche 50dcd532d5
Get all WEBGPU test_ops passing (#2646)
* Get all WEBGPU tests passing

* Custom render store is not needed in wgsl
2023-12-06 07:40:37 -08:00
chenyu 229ada5fe5
Gpt2 benchmark with HALF and BEAM (#2636)
* benchmark gpt2 with half and beam

* BEAM=4

* optional validation

* green is good

* we care
2023-12-05 22:15:16 -05:00
George Hotz 35b5e95097
parallel beam search (#2610)
* better print

* fix beam search with vars

* cleanups

* parallel is not default

* restore that

* bugfix

* cleanups

* bugfix
2023-12-05 10:09:45 -08:00
George Hotz bbeba8ec85
use default dict for external_model_benchmark (#2592)
* device default

* Device.DEFAULT

* half max for cuda

* CUDA_INCLUDE_PATH

* closer to working

* cuda fixups

* Update ops_cuda.py
2023-12-03 15:25:43 -08:00
George Hotz bc012f26b9 hotfix, disable model inference benchmark on NVIDIA 2023-12-03 13:52:41 -08:00
qazal 4380ccb169
Non fp32 math (#2264)
* `global_load` and `global_store` using buffer dtype

* `UOps.PHI` in all dtypes

* `UOps.ALU` in all dtypes

* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes

* -- endof implementation --
+tiny lint changes

* these tests require the fp16 extention

you can run them locally to confirm they're green: (GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)

`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`

skip the new test_linearizer_failures in CI GPU because of the fp16 extention

This passes on a real GPU since the extention is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`

see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)

* these tests fail in CI due to segfaults and CPU crashes

To confirm they're green locally, you can run the following commands:

1. For the tests skipped in test_ops.py (note: CLANG is very slow)

`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`

2. For the ONNX tests skipped in CLANG:

```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```

3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific

`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`

* Revert "these tests fail in CI due to segfaults and CPU crashes"

This reverts commit 15db57014381a4449d563526ac6c870e36257658.

* merge with cleanup-vectorized-hip-renders

* barely working HIP P1, ALU ops need a refactor?

* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) and half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) but can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:

```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```

* update the sign definition to avoid division by zero in all dtypes

* diff cleanup p1: why were these in the diff anyways

* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!

add ALU ops overloads for HIP

this will make HIP max work

handle mod

Revert "handle mod"

This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.

update max to use hmax

add HIP GEP render logic

enable CIFAR fp16 benchmark

test ops for HIP

back to store as float because this only works for float4 grouping right now

test_ops for hip!!

always sign

* back to the sign we had before because we cant do a backward pass on a Less node

* remove old hacks

HIP compiling test_ops in CI takes ~9 mins, not doing it for now

new HIP ALUs

* reduce accs done right

* refactor to function

* no device hacks

hacks p2

the other way

* LLVM ALU ops

half, float and double are all float

update max

* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool

* cleanup LLVM wrong code

* dummy change for the CUDA install glitch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-03 13:45:49 -08:00
chenyu 1ac958a058
update pytest marks and CI test filters (#2587)
* remove pytest marks

* test more stuff

* fine revert some

* add that mark back

* skip that

* hmm LLVM does not work on ubuntu

* too slow on CUDA CI

* dup test
2023-12-03 15:20:44 -05:00
George Hotz 5068e99d18
refactor to remove extra kernel params (#2563)
* refactor to have compiled kernel

* bugfixes

* docs/beautiful.py

* revert that

* fix tests
2023-12-02 00:32:25 -08:00
George Hotz 27481b9206
Switch ops_gpu -> gpuctypes (#2532)
* ops_gpu is go

* fix size 0

* fix image, and add more tests

* nerf openpilot test, doesn't test thneed

* run the schedule

* better

* oops, new inputs

* delete pyopencl

* Update ops_gpu.py
2023-12-01 22:30:21 -08:00