Commit Graph

3961 Commits

Author SHA1 Message Date
George Hotz 68ca4d4276
split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz da07f31fd4 hotfix: remove bf16 test entirely 2024-03-26 20:50:27 -07:00
George Hotz 0d5845fb5b hotfix: jit is flaky on mac 2024-03-26 20:44:05 -07:00
George Hotz 150ea2eb76
create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
George Hotz 629cbc5587
only abstractions 2 (#3947) 2024-03-26 20:02:18 -07:00
chenyu 77589bc7a5
rename Scalar to ConstType and cast_scalar to as_const (#3946)
prereq cleanup to make const arg same python type as dtype
2024-03-26 22:39:58 -04:00
uuuvn d6d902afe9
wtf (#3944) 2024-03-26 17:49:28 -07:00
Francis Lam 5530b0cbed
fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942)
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage

* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures

* clean up naming and use set
2024-03-26 16:25:24 -04:00
chenyu 8df6587c41
hotfix 97.3 for beautiful_mnist (#3941) 2024-03-26 15:02:53 -04:00
chenyu b1e3817e18
correctly handle Tensor.rand when default_float = bf16 (#3940)
always casting to float32 makes default half slow
2024-03-26 14:56:16 -04:00
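A minimal usage check of the behavior this commit targets, assuming `dtypes.default_float` is settable as in current tinygrad; this is a sketch, not the test added in #3940:
```
from tinygrad import Tensor, dtypes

dtypes.default_float = dtypes.bfloat16   # what a bf16 default float selects
x = Tensor.rand(4, 4)                    # should come out as bfloat16, not a float32 upcast
assert x.dtype == dtypes.bfloat16
dtypes.default_float = dtypes.float32    # restore the default
```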
chenyu f6ff76be21
check only upcast int amount in upcasted_axis (#3938)
fixed typing and fixed #3932
2024-03-26 12:54:57 -04:00
nimlgen e2d6f76723
_alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
nimlgen 739f47eb0f
check on cuEventSynchronize (#3933) 2024-03-26 16:14:38 +03:00
George Hotz 778d17fbd3
intel matmul (#3830)
* almost right

* intel xmx
2024-03-25 22:37:20 -07:00
chenyu ef537672bf
bf16 support in metal (#3929)
it runs if the device GPU supports bfloat. updated CI benchmark too
2024-03-25 23:17:36 -04:00
chenyu 72d617a37d
opencl on OSX does not support fp16 extension (#3931)
running `GPU=1 python -m pytest -rA test/test_dtype.py::TestHalfDtype::test_casts_from` on mac would fail.
2024-03-25 19:50:17 -04:00
Arseny Kapoulkine cb6e7b57a6
examples: Fix parameter bandwidth accounting for quantized LLama (#3930)
Instead of assuming every parameter is 2 bytes, just add up tensor sizes
in bytes
2024-03-25 18:41:05 -04:00
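A sketch of the accounting change described above, assuming tinygrad's `get_parameters` helper and the usual `numel`/`itemsize` attributes; not the exact code in #3930:
```
from tinygrad.nn.state import get_parameters

def param_bytes(model) -> int:
  # sum real per-tensor sizes; a quantized int8 weight counts 1 byte, not 2
  return sum(t.numel() * t.dtype.itemsize for t in get_parameters(model))
```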
chenyu 4ecd5789ab
#include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00
Arseny Kapoulkine 514c43201d
Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites the initially correct .shared with .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
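A toy illustration of the provenance idea in #3916; `UOp`, `vin`, and the `DEFINE_*` strings here are schematic stand-ins for tinygrad's uop graph, not the real classes:
```
from dataclasses import dataclass, field

@dataclass
class UOp:
  op: str                       # e.g. "DEFINE_LOCAL", "DEFINE_GLOBAL", "ALU", "CONST"
  vin: tuple = field(default_factory=tuple)

def find_define(u):
  # walk back through ALU inputs until the defining buffer op is found
  if u.op.startswith("DEFINE_"): return u
  for v in u.vin:
    if (d := find_define(v)) is not None: return d
  return None

# address = DEFINE_LOCAL + const offset: must render as ld.shared, not ld.global
addr = UOp("ALU", (UOp("DEFINE_LOCAL"), UOp("CONST")))
d = find_define(addr)
space = ".shared" if d is not None and d.op == "DEFINE_LOCAL" else ".global"
assert space == ".shared"
```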
chenyu d651835ef5
verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
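A sketch of the kind of CI check described above; the function name is illustrative, and the 97.5 target comes from the PR notes:
```
def check_eval_acc(test_acc: float, target: float = 0.975) -> None:
  # fail the benchmark run if evaluation accuracy regresses below the verification target
  assert test_acc >= target, f"eval accuracy {test_acc:.4f} below target {target}"
```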
chenyu dc508022a9
clean up clang src header (#3925)
don't need to define int64 and uchar
2024-03-25 15:18:35 -04:00
uuuvn 2080325e8d
output_buffer isn't used anymore (#3919) 2024-03-25 16:03:56 +03:00
nimlgen f2a9ea4ea9
lru allocator for copyin host buffers (#3918)
* lru allocator for copyin host buffers

* linter happy
2024-03-25 15:57:18 +03:00
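A toy sketch of the caching idea (not tinygrad's actual LRUAllocator): keep freed host staging buffers in a size-keyed free list so repeated copyins can reuse them:
```
from collections import defaultdict

class HostBufferCache:
  def __init__(self):
    self.free = defaultdict(list)          # size -> list of reusable buffers
  def alloc(self, size: int) -> bytearray:
    return self.free[size].pop() if self.free[size] else bytearray(size)
  def release(self, buf: bytearray) -> None:
    self.free[len(buf)].append(buf)        # park it for reuse instead of freeing
```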
George Hotz e0e234bf94 hotfix, str compare version for cuda 2024-03-24 20:35:24 -07:00
Arseny Kapoulkine 715850aef9
Fix sm89 PTX=1 compilation (#3915)
* Fix sm89 PTX=1 compilation

The minimum PTX version that supports sm89 is 7.8 (same version also
supports sm90); without this ptxas fails when running tinygrad with
PTX=1 on RTX 4090.

* Use int(arch[3:]) for forward compat with SM10.0 if that happens
2024-03-24 20:32:29 -07:00
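A sketch of the version selection described in #3915; the `int(arch[3:])` trick is from the PR, while the fallback version number is an assumption:
```
def ptx_version(arch: str) -> str:
  # arch strings look like "sm_89"; sm_89 and newer need at least PTX ISA 7.8
  return "7.8" if int(arch[3:]) >= 89 else "7.5"   # 7.5 fallback is assumed here

assert ptx_version("sm_89") == "7.8" and ptx_version("sm_100") == "7.8"
```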
chenyu 83f39a8ceb
env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
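A usage sketch of the env var added in #3902; the accepted value spelling (e.g. `HALF`) is an assumption based on the commit notes:
```
# e.g.  DEFAULT_FLOAT=HALF python3 examples/beautiful_mnist.py
from tinygrad import Tensor, dtypes

print(dtypes.default_float)        # reflects DEFAULT_FLOAT, float32 if unset
x = Tensor.rand(2, 2)              # new float tensors pick up that default dtype
assert x.dtype == dtypes.default_float
```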
George Hotz 03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
qazal d8fafca13a
assign regression (#3907)
* infra

* track mutations

* assign levels

* add seen back

* add test

* infra 2.0

* add assign targets

* dont need levels

* delete

* Update test_assign.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-24 15:12:31 -07:00
Szymon Ożóg 2d0bfdf01c
ptx cleanup (#3893) 2024-03-24 14:54:45 -07:00
chenyu 2e39f57594
move lines around in ops_python wmma (#3911) 2024-03-24 17:14:26 -04:00
Patrick Tsai e27129a798
Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
chenyu 10673d1447
tiny search cleanup (#3910)
* tiny search cleanup

removed some `assert isinstance(dev, Compiled)` and lines

* remove import
2024-03-24 14:20:55 -04:00
wozeparrot 9a9cac58f9
add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
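For reference, a minimal sketch of the classic LARS trust-ratio update that #3750 brings into `nn`, written against tinygrad Tensors; this is illustrative math, not the optimizer class added there (momentum and skip lists are omitted):
```
from tinygrad import Tensor

def lars_step(w: Tensor, g: Tensor, lr: float, wd: float = 1e-4, trust: float = 0.001) -> Tensor:
  w_norm = w.square().sum().sqrt()
  g_norm = g.square().sum().sqrt()
  # layer-wise trust ratio: scale the step by ||w|| / (||g|| + wd*||w||)
  ratio = trust * w_norm / (g_norm + wd * w_norm + 1e-12)
  return w - lr * ratio * (g + wd * w)
```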
chenyu 8c8b57fd5f
cleanup ops python (#3908)
i just want to merge lars!
2024-03-24 11:36:31 -04:00
chenyu 2c69888654
include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
chenyu e22d78b3d2
training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
Francis Lam 0145366323
wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
sekstini 7c3632fd1e
add --minimal flag to nvrtc (#3899) 2024-03-23 16:38:31 -07:00
chenyu a2b2597fc2
replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issues since it does not have `.name`.
also more robust if there are lang-specific type overrides
2024-03-23 19:25:48 -04:00
chenyu 24d004a89b
hotfix check ckpts before writing achieved model (#3901)
this killed tinybox green run
2024-03-23 17:16:38 -04:00
chenyu 4d566f12b1
touchup einsum (#3900)
don't need rhs_letters
2024-03-23 16:46:39 -04:00
Alejandro F Queiruga 556dcfb8f2
Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
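A quick check of the behavior #3895 fixes, assuming the `Tensor.einsum(formula, *operands)` form and using numpy as the reference:
```
import numpy as np
from tinygrad import Tensor

a, b = Tensor.rand(2, 3), Tensor.rand(3, 4)
out = Tensor.einsum("ij,jk->ki", a, b)                  # output spec permutes the result
ref = np.einsum("ij,jk->ki", a.numpy(), b.numpy())
np.testing.assert_allclose(out.numpy(), ref, atol=1e-5)
```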
nimlgen 4e18dd78d3
faster program start in llvm (#3897) 2024-03-23 15:20:15 +03:00
George Hotz 46a3501cec
nv ioctl sniffer (#3892)
* nv ioctl sniffer

* unused import

* Update __init__.py

* that work

* that fix it
2024-03-23 00:29:30 -07:00
chenyu 18e0cef14d
cheap less lines in ptx (#3890)
enough to merge lars
2024-03-23 01:12:31 -04:00
George Hotz f0c4e06ffd
fix cuda sync (#3888) 2024-03-22 19:02:30 -07:00
chenyu 2d3ce53348
touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou fc11808a79
initialize Tensor grad same type as self (#3613)
* initialize Tensor grad same type as self

* also test different default float

* check dtype + try/finally

* don't test_gradient_dtype if f16 is not supported

* fix bad merge

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
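A small check of the property #3613 establishes, assuming a backend with float16 support:
```
from tinygrad import Tensor, dtypes

t = Tensor([1.0, 2.0], dtype=dtypes.half, requires_grad=True)
t.sum().backward()
assert t.grad is not None and t.grad.dtype == dtypes.half   # grad matches self, not the default float
```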
Francis Lam 8db7a6bbcc
debug: add optional detailed BEAM_LOG logging (#3883)
* debug: add optional detailed BEAM_LOG logging

show uop count, compile and run times for each candidate in search

also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts

* fix linter
2024-03-22 19:23:31 -04:00
chenyu f7f67e0cc5
simple fix llama shard with quantize (#3882)
copy scale on all devices for now. naive sharding does not work because scale needs expand to really save memory.

70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.

`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`

13B on 6 GPUs uses 47 GB vs. 34 GB quantized
2024-03-22 18:15:37 -04:00