Commit Graph

nimlgen 16e31f7f0d
init multidevice cuda graph (#3858)
* init multidevice cuda graph

* cuda just works!

* clean

* linter happier

* linters happy

* update transfer inputs

* do not change free

* useless check for cuda

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
George Hotz 0c197b9cf3 hotfix: hip bfloat formatting 2024-03-22 11:52:05 -07:00
George Hotz 54dc48aa47
fix assign (#3878)
* fix assign

* remove terrible optimizer hack

* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam 5587594a00
fuzz_linearizer: add --ast and --file params to read kernels (#3877)
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu c5467e5bd6
diversify test values in test_dtype DATA based on dtype (#3864)
* diversify test values in test_dtype DATA based on dtype

* eh fix typo

* that too?

* PTX does not support i8 and s8

* skip that

* unused line

* put the hack back

* remove that
2024-03-22 14:22:06 -04:00
George Hotz 86ee36e697
preschedule all (#3875) 2024-03-22 11:20:06 -07:00
Szymon Ożóg d8c3f1894a
Use UOpGraph in test (#3876) 2024-03-22 14:12:38 -04:00
chenyu 1c51d586ea
replace raise Exception with specific errors (#3874) 2024-03-22 12:32:21 -04:00
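A quick illustration of the kind of cleanup this commit makes (hypothetical code, not the actual diff): a bare `raise Exception` can only be caught with a blanket handler, while a specific error class documents intent and can be caught selectively.

```python
# hypothetical before/after, not the actual diff
def load_weights(path):                        # before: opaque, catch-all only
    raise Exception(f"weights not found: {path}")

def load_weights_after(path):                  # after: precise, catchable
    raise FileNotFoundError(f"weights not found: {path}")
```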
nimlgen 8ef5490ec8
cuda transfer + async copyin (#3873) 2024-03-22 09:01:37 -07:00
Szymon Ożóg 624bc89910
PTX - implement float4, ptr arithmetic and other speed improvements (#3775)
* ptx float4 implementation

* remove from cache when trimming uops

* Gate for float4

* Linting fix

* disable test reasonable time for ptx

* import getenv

* Update uops.py

* linter

* Add div test for half

* upcast if op does not support operation

* fix offset

* Run only if dtype supported

* zero out registers when accessing by pred + cleanup

* Remove trailing whitespace

* revert

* spacing fix

* move cache clearing outside loop

* did this suddenly start working?

* unused import removed

* Remove cast

* Use pattern matching

* linting

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 08:54:02 -07:00
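The "upcast if op does not support operation" bullet above refers to a standard fallback: when the target ISA (here PTX) lacks a native instruction for a dtype, such as half-precision divide, compute in a wider dtype and cast back. A hedged numpy sketch of the pattern, not the tinygrad renderer code:

```python
import numpy as np

NATIVE_DIV = {np.float32, np.float64}   # assumed: no native half divide

def div(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    if a.dtype.type not in NATIVE_DIV:
        # upcast to float32, do the op, cast back to the original dtype
        return (a.astype(np.float32) / b.astype(np.float32)).astype(a.dtype)
    return a / b

x = np.array([1.0, 2.0], dtype=np.float16)
print(div(x, x + 1))   # computed in float32, returned as float16
```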
George Hotz f4055439dc
don't include hip common (#3851)
* don't install hip common

* only that

* Revert "only that"

This reverts commit 85f22015d98d2775641cb9c7851fe595bdc97d29.

* less

* needed

* sep comgr

* header file

* 6.0.2

* update hsa

* hsakmt

* Revert "hsakmt"

This reverts commit d3a118078ed1c032f31abddb9d30cf6c13fc4f5e.
2024-03-22 08:50:50 -07:00
qazal 4a27ce6ec9
tiny version of amd_hip_bfloat16 (#3868)
* add src_dtype

* add maker

* add bfloat16

* simpler
2024-03-22 08:37:30 -07:00
chenyu 82ce60e172
use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark (#3870)
a smaller first batch saves about 0.05 ms per token: 1.75 ms/tok on a local 3090
2024-03-22 00:40:06 -04:00
qazal fe6ceff15f
proposal: multioutput JIT spec (#3856)
* corealize JIT

* requirements
2024-03-21 21:28:30 -07:00
Francis Lam a26090d404
search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
chenyu dca69df197
hotfix: use DEBUG >= 3 for allreduce message (#3869) 2024-03-21 23:40:44 -04:00
uuuvn 6729f20aab
Ring allreduce try 2 (#3852)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

* RING=2 and use ContextVar

* DEBUG >= 2 and change name

* linter

* type

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
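A pure-Python simulation of the classic ring allreduce this PR implements (hypothetical names, not the tinygrad API): a reduce-scatter phase followed by an allgather phase, each taking n-1 steps, with every device sending 1/n of the buffer to its ring neighbor per step.

```python
import numpy as np

def ring_allreduce(data):
    # data: one equal-length vector per simulated device
    n = len(data)
    chunks = [np.array_split(np.asarray(d, dtype=float), n) for d in data]
    # reduce-scatter: after n-1 steps, device i owns the full sum of chunk (i+1) % n
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for i, c, payload in sends:        # device i -> device i+1 on the ring
            chunks[(i + 1) % n][c] += payload
    # allgather: circulate each fully reduced chunk to every device
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] = payload
    return [np.concatenate(c) for c in chunks]

out = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
assert all((o == [111, 222, 333]).all() for o in out)
```

Each device transfers roughly 2*(n-1)/n of the buffer in total, so per-link traffic stays nearly constant as GPUs are added, which is what makes the "GB/s that make sense" numbers scale.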
Francis Lam 3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors (#3866) 2024-03-21 18:58:10 -04:00
chenyu bc482729d0
lower hlb_cifar acc to 93.3 (#3865)
ran 30 runs and the lowest I saw was 93.35; lowered to 93.3 for now.

maybe re-enable EMA later if it reduces variance
2024-03-21 17:58:53 -04:00
chenyu e50b7abe4f
diversified buf inputs based on dtype in fuzz_linearizer (#3863) 2024-03-21 16:23:11 -04:00
chenyu c40f78499f
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:23:37 -04:00
chenyu 30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:12:27 -04:00
chenyu 33dd99acf4
remove helper_add_store from test_linearizer_failures (#3860) 2024-03-21 12:53:31 -04:00
chenyu 6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one (#3859)
if the kernel is an assign `a += 1`, rawbufs[0] is updated twice and gives a false compare_error; see the sketch below
2024-03-21 11:36:05 -04:00
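A minimal sketch of the false positive this commit fixes (illustrative Python, not the fuzzer code): an assign kernel mutates its own output buffer, so running the baseline and the real kernel on the same rawbufs makes the second run start from already-updated data.

```python
# stand-in for an assign kernel that increments its output in place
def run_kernel(rawbufs):
    rawbufs[0] = [x + 1 for x in rawbufs[0]]   # acts like `a += 1`
    return rawbufs[0]

bufs = [[0, 0, 0]]
baseline = run_kernel(bufs)   # bufs[0] is now [1, 1, 1]
real = run_kernel(bufs)       # second run starts from updated data: [2, 2, 2]
assert baseline != real       # false compare_error; nothing is actually wrong

# the fix: allocate a fresh output buffer for each run
baseline = run_kernel([[0, 0, 0]])
real = run_kernel([[0, 0, 0]])
assert baseline == real
```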
nimlgen b78352b423
do not create structs every call in CUDAProgram (#3855)
* do not create structs in cuda

* fix graph

* linter

* do not exec twice

* fix graph
2024-03-21 17:51:40 +03:00
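A hedged sketch of the pattern behind this commit (hypothetical names, not the CUDA runtime code): allocate the ctypes argument struct once in `__init__` and only update its fields per call, instead of constructing a new struct on every launch.

```python
import ctypes

class LaunchArgs(ctypes.Structure):   # hypothetical launch-arg layout
    _fields_ = [("ptr", ctypes.c_void_p), ("n", ctypes.c_int32)]

class Program:
    def __init__(self):
        self.args = LaunchArgs()              # allocated once at init
    def __call__(self, ptr: int, n: int):
        self.args.ptr, self.args.n = ptr, n   # cheap field updates per call
        return self.args                      # would be handed to the launch
```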
nimlgen e5745c1a0d
fix nan on multi-gpu cuda (#3854) 2024-03-21 15:21:55 +03:00
Anurag Lamsal 4e0819e40b
fix the benchmark not printing in the handcode resnet50 opt example (#3850) 2024-03-21 00:55:31 -04:00
nimlgen 85691c8e20
fix hsa sync issue (#3847)
* fix hsa sync issue

* linter
2024-03-21 04:00:30 +03:00
chenyu f271cd682b
use _resolve_dim in argmax (#3846)
also added a comment on the behavior if there are multiple maxima, and more tests; see the sketch below
2024-03-20 20:17:30 -04:00
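For context, a dim-resolving helper like `_resolve_dim` canonicalizes a possibly-negative dim argument to a 0..ndim-1 index. A stand-in sketch with the usual semantics (assumed, not the actual tinygrad implementation):

```python
# assumed semantics of a _resolve_dim-style helper, not the actual code
def resolve_dim(dim: int, ndim: int) -> int:
    if not -ndim <= dim < ndim:
        raise IndexError(f"dim {dim} out of range for a {ndim}-d tensor")
    return dim + ndim if dim < 0 else dim

assert resolve_dim(-1, 4) == 3
assert resolve_dim(2, 4) == 2
```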
chenyu 5c4cf62d2c
fix View.pad arg type (#3845)
close #3779
2024-03-20 19:36:02 -04:00
Francis Lam 6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones (#3829)
* search: add BEAM_VERIFY option to validate search results

refactor fuzz_linearizer comparison to allow it to be used for
BEAM_VERIFY in device.py

* search: fix to verify the beam_search result and not the fastest

* search: fix typing and clean up

* device: remove imports from test and add LOGKERN options

LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness

* fix example in verify_kernel.py

* cleanup fixes

* fix to use f-strings
2024-03-20 19:22:08 -04:00
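A hedged sketch of the BEAM_VERIFY idea (hypothetical names, not the tinygrad search code): before trusting a beam-search winner, run it against the unoptimized kernel on the same inputs and fall back if the outputs diverge.

```python
# validate a beam-search winner against the unoptimized kernel
def beam_verify(candidate, baseline, make_inputs, atol=1e-4):
    bufs = make_inputs()
    want, got = baseline(*bufs), candidate(*bufs)
    ok = all(abs(a - b) <= atol for a, b in zip(want, got))
    return candidate if ok else baseline   # fall back on mismatch

fast = beam_verify(lambda x: [v * 2 for v in x],   # "optimized" kernel
                   lambda x: [v + v for v in x],   # reference kernel
                   lambda: ([1.0, 2.0, 3.0],))
```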
chenyu 9d1d08fbb0
show llama bandwidth with timing (#3844) 2024-03-20 17:19:15 -04:00
chenyu 7ff47e45a1
cifar TARGET_EVAL_ACC_PCT=93.5 (#3843) 2024-03-20 16:56:51 -04:00
qazal 92c5067439
conceptual small refactor (#3842) 2024-03-20 16:46:14 -04:00
chenyu 519336cfea
factor out partial in SumNode div int (#3841)
* factor out partial in SumNode div int

* div not rem

* space
2024-03-20 16:34:33 -04:00
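The identity behind factoring a divisible partial sum out of an integer division: for k > 0, (k*a + b) // k == a + b // k under floor division. A quick property check (illustrative, not the SumNode code):

```python
# for k > 0, (k*a + b) // k == a + b // k under Python floor division
import random

for _ in range(10_000):
    k = random.randint(1, 100)
    a, b = random.randint(-1000, 1000), random.randint(-1000, 1000)
    assert (k * a + b) // k == a + b // k
```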
George Hotz 8cb5215885
Revert "Ring allreduce in multitensor (#3000)" (#3840)
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn c5bf9e4c96
Ring allreduce in multitensor (#3000)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu 455f7bea9b
test example from half resnet where idx has a value outside of int32 (#3838)
* test example from half resnet where idx has a value outside of int32

* ruff
2024-03-20 13:44:20 -04:00
chenyu 727de5ba1e
llama 7B on 3090 benchmark (#3837)
* llama 7B on 3090 benchmark

* symlink llama
2024-03-20 12:48:22 -04:00
qazal 9452994201
add a better error message for resnet training (#3836)
* add a better error message

* assert

* use FileNotFoundError
2024-03-20 09:22:15 -07:00
chenyu 47b9cc2dfe
use float32 for rand buffer in test_beam_search and test in metal (#3831) 2024-03-19 23:22:58 -04:00
chenyu d17900bc45
use int32 instead of default_int in simplify_phi_loops (#3828)
* use int32 instead of default_int in simplify_phi_loops

indices are in int32 now and are separated from the buffer dtype. fixes #3823

* return early if not supported

* it's not that

* why is it failing for RHIP
2024-03-19 17:49:58 -04:00
nimlgen 2d54e4d747
clean up hsa driver (#3818)
* clean up driver

* remove returns
2024-03-20 00:17:41 +03:00
chenyu 99cbc24390
use dtypes.int32 as return dtype for functions that return indices (#3827)
behavior matches jax. It's fine to have a tensor longer than the max int8 value even if we set default int to int8; see the sketch below
2024-03-19 17:06:57 -04:00
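An illustrative check of the reasoning (numpy, not tinygrad code): index values are bounded by tensor length, not by the element dtype, so an index dtype that followed the default int would silently wrap for tensors longer than 127 elements.

```python
import numpy as np

# the *values* fit in int8, but a valid index (200) does not
data = np.zeros(300, dtype=np.int8)
data[200] = 1
idx = int(np.argmax(data))
assert idx == 200
assert np.array(idx).astype(np.int8) == -56   # int8 storage would wrap it
```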
chenyu fa1921ec7d
move test_dtype tests to test dtype and output value (#3826) 2024-03-19 16:31:27 -04:00
Francis Lam 131bbb6563
test_linearizer_failure: add failure 27 from a gpt2 kernel (#3825)
* test_linearizer_failure: add failure 27 from a gpt2 kernel

found during a full fuzz test of applied_opts combos to a
depth of 4 on the gpt2 kernels w/o GROUPTOP.

added additional examples to failure 26 that don't have GROUPTOP

* add other platform failure
2024-03-19 16:29:50 -04:00
nimlgen 3fb13ff892
HIP -> HSA in docs/env_vars (#3824) 2024-03-19 22:53:33 +03:00
Francis Lam 9851e2c3b9
test_linearizer_failure: add failure 26 from a gpt2 kernel (#3821)
found during a full fuzz test of all applied_opts combos to a
depth of 3 on the gpt2 kernels
2024-03-19 13:19:54 -04:00
Patrick Tsai b436c9792f
Fix factoring bug (O(n) arange related) (#3817)
* Factoring bug

* Another one in case

* It works now so change tests back

* large arange cumsum optimization

* More cleanup

* symbolic no factor div test

* name change

* Rename test

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-19 11:49:42 -04:00
chenyu e12bc85014
use BS=128 and BS=768 for resnet benchmark (#3815)
50% more hcopt perf with this one weird trick
2024-03-18 23:49:55 -04:00