Commit Graph

1584 Commits

Author SHA1 Message Date
chenyu c4c243f79d
update test_uops _equal to use assert_allclose (#3981)
it handles nan
2024-03-28 22:14:45 -04:00
reddyn12 9b5e15db6e
Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisions

* rm file

* rename file

* use Tensor.einsum and mke default model 370M

* Cleaned code and made a comparision test

* Simplyfy pull request. Only has 1 mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the logic for conv_state assign messes up. This can be fixed when padding the tokenized array to min length of 4. I padded using the empty string token, but idk if proper practice is to use the PAD token

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
chenyu 1fa0351acb
fix DEFINE_ACC invalid_value to have same type as localtype (#3980) 2024-03-28 19:21:17 -04:00
chenyu b47f6cebb2
LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
George Hotz 42b9d999ea
Buffer isn't always allocated (#3974)
* buffer alloc

* allocate

* missing allocates

* last one
2024-03-28 13:33:47 -07:00
chenyu bfcaa2f70e
assert `__setitem__` if used other than disk (#3972)
* assert `__setitem__` if used other than disk

* that is not implemented
2024-03-28 12:16:38 -04:00
Francis Lam 7c5729a3bd
wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
George Hotz 60639cccac hotfix: RuntimeError for assign 2024-03-27 11:18:48 -07:00
qazal 9fb573d73c
DAG cycle asserts (#3955)
* assert cycles

* these are cycle errors

* flip to positive
2024-03-27 11:09:59 -07:00
geohotstan bd3a7d068c
correct device for validation test in model benchmark CI (#3960)
* fix tests

* add clang back for only metal

* change the name to reflect CLANG being ran

* add back cuda
2024-03-27 13:40:06 -04:00
chenyu 6c7df1445b
enforce UOps.CONST arg has python type based on dtype (#3952)
added an assert in uops, remove the cast in renderer
2024-03-27 01:41:38 -04:00
George Hotz 68ca4d4276
split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz 150ea2eb76
create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
Francis Lam 5530b0cbed
fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942)
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage

* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures

* clean up naming and use set
2024-03-26 16:25:24 -04:00
nimlgen e2d6f76723
_alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
chenyu 72d617a37d
opencl on OSX does not support fp16 extension (#3931)
running `GPU=1 python -m pytest -rA test/test_dtype.py::TestHalfDtype::test_casts_from` on mac would fail.
2024-03-25 19:50:17 -04:00
chenyu 4ecd5789ab
#include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00
Arseny Kapoulkine 514c43201d
Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites initially correct .shared wth .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
chenyu 83f39a8ceb
env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
George Hotz 03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
qazal d8fafca13a
assign regression (#3907)
* infra

* track mutations

* assign levels

* add seen back

* add test

* infra 2.0

* add assign targets

* dont need levels

* delete

* Update test_assign.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-24 15:12:31 -07:00
Patrick Tsai e27129a798
Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
wozeparrot 9a9cac58f9
add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
chenyu 2c69888654
include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
Francis Lam 0145366323
wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
chenyu a2b2597fc2
replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override
2024-03-23 19:25:48 -04:00
Alejandro F Queiruga 556dcfb8f2
Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
chenyu 2d3ce53348
touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou fc11808a79
initialize Tensor grad same type as self (#3613)
* initialize Tensor grad same type as self

* also test different default float

* check dtype + try/finally

* don't test_gradient_dtype if f16 is not supported

* fix bad merge

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
Francis Lam 8db7a6bbcc
debug: add optional detailed BEAM_LOG logging (#3883)
* debug: add optional detailed BEAM_LOG logging

show uop count, compile and run times for each candidate in search

also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts

* fix linter
2024-03-22 19:23:31 -04:00
George Hotz 54dc48aa47
fix assign (#3878)
* fix assign

* remove terrible optimizer hack

* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam 5587594a00
fuzz_linearizer: add --ast and --file params to read kernels (#3877)
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu c5467e5bd6
diverse test value in test_dtype DATA based on dtype (#3864)
* diverse test value in test_dtype DATA based on dtype

* eh fix typo

* that too?

* PTX does not support i8 and s8

* skip that

* unused line

* pus the hack back

* remove that
2024-03-22 14:22:06 -04:00
George Hotz 86ee36e697
preschedule all (#3875) 2024-03-22 11:20:06 -07:00
Szymon Ożóg d8c3f1894a
Use UOpGraph in test (#3876) 2024-03-22 14:12:38 -04:00
qazal fe6ceff15f
proposal: multioutput JIT spec (#3856)
* corealize JIT

* requirements
2024-03-21 21:28:30 -07:00
uuuvn 6729f20aab
Ring allreduce try 2 (#3852)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

* RING=2 and use ContextVar

* DEBUG >= 2 and change name

* linter

* type

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
Francis Lam 3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors (#3866) 2024-03-21 18:58:10 -04:00
chenyu e50b7abe4f
diversed buf inputs based on dtype in fuzz_linearizer (#3863) 2024-03-21 16:23:11 -04:00
chenyu 30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:12:27 -04:00
chenyu 33dd99acf4
remove helper_add_store from test_linearizer_failures (#3860) 2024-03-21 12:53:31 -04:00
chenyu 6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one (#3859)
if the kernel is an assign `a += 1`, the rawbufs[0] is updated twice and gives false compare_error
2024-03-21 11:36:05 -04:00
nimlgen 85691c8e20
fix hsa sync issue (#3847)
* fix hsa sync issue

* linter
2024-03-21 04:00:30 +03:00
chenyu f271cd682b
user _resolve_dim in argmax (#3846)
also added comment of the behavior if there are multple, and more tests
2024-03-20 20:17:30 -04:00
Francis Lam 6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones (#3829)
* search: add BEAM_VERIFY option to validate search results

refactor fuzz_linearizer comparison to allow it to be used in for
BEAM_VERIFY in device.py

* search: fix to verify the beam_search result and not the fastest

* search: fix typing and clean up

* device: remove imports from test and add LOGKERN options

LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness

* fix example in verify_kernel.py

* cleanup fixes

* fix to use f-strings
2024-03-20 19:22:08 -04:00
chenyu 519336cfea
factor out partial in SumNode div int (#3841)
* factor out partial in SumNode div int

* div not rem

* space
2024-03-20 16:34:33 -04:00
George Hotz 8cb5215885
Revert "Ring allreduce in multitensor (#3000)" (#3840)
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn c5bf9e4c96
Ring allreduce in multitensor (#3000)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu 455f7bea9b
test example from half resnet that idx has number outside of int32 (#3838)
* test example from half resnet that idx has number outside of int32

* ruff
2024-03-20 13:44:20 -04:00
chenyu d17900bc45
use int32 instead of default_int in simplify_phi_loops (#3828)
* use int32 instead of default_int in simplify_phi_loops

indices are in int32 now and is separated from buffer dtype. fix #3823

* return early if not supported

* it's not that

* why is it failing for RHIP
2024-03-19 17:49:58 -04:00