nimlgen
16e31f7f0d
init multidevice cuda graph ( #3858 )
...
* init multidevice cuda graph
* cuda just works!
* clean
* linter happier
* linters happy
* update transfer inputs
* do not change free
* useless check for cuda
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
George Hotz
0c197b9cf3
hotfix: hip bfloat formatting
2024-03-22 11:52:05 -07:00
George Hotz
54dc48aa47
fix assign ( #3878 )
...
* fix assign
* remove terrible optimizer hack
* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam
5587594a00
fuzz_linearizer: add --ast and --file params to read kernels ( #3877 )
...
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu
c5467e5bd6
diverse test value in test_dtype DATA based on dtype ( #3864 )
...
* diverse test value in test_dtype DATA based on dtype
* eh fix typo
* that too?
* PTX does not support i8 and s8
* skip that
* unused line
* put the hack back
* remove that
2024-03-22 14:22:06 -04:00
George Hotz
86ee36e697
preschedule all ( #3875 )
2024-03-22 11:20:06 -07:00
Szymon Ożóg
d8c3f1894a
Use UOpGraph in test ( #3876 )
2024-03-22 14:12:38 -04:00
chenyu
1c51d586ea
replace raise Exception with specific errors ( #3874 )
2024-03-22 12:32:21 -04:00
nimlgen
8ef5490ec8
cuda transfer + async copyin ( #3873 )
2024-03-22 09:01:37 -07:00
Szymon Ożóg
624bc89910
PTX - implement float4, pointer arithmetic and other speed improvements ( #3775 )
...
* ptx float4 implementation
* remove from cache when trimming uops
* Gate for float4
* Linting fix
* disable test reasonable time for ptx
* import getenv
* Update uops.py
* linter
* Add div test for half
* upcast if the dtype does not support the operation
* fix offset
* Run only if dtype supported
* zero out registers when accessing by pred + cleanup
* Remove trailing whitespace
* revert
* spacing fix
* move cache clearing outside loop
* did this suddenly start working?
* unused import removed
* Remove cast
* Use pattern matching
* linting
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 08:54:02 -07:00
George Hotz
f4055439dc
don't include hip common ( #3851 )
...
* don't install hip common
* only that
* Revert "only that"
This reverts commit 85f22015d98d2775641cb9c7851fe595bdc97d29.
* less
* needed
* sep comgr
* header file
* 6.0.2
* update hsa
* hsakmt
* Revert "hsakmt"
This reverts commit d3a118078ed1c032f31abddb9d30cf6c13fc4f5e.
2024-03-22 08:50:50 -07:00
qazal
4a27ce6ec9
tiny version of amd_hip_bfloat16 ( #3868 )
...
* add src_dtype
* add maker
* add bfloat16
* simpler
2024-03-22 08:37:30 -07:00
chenyu
82ce60e172
use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark ( #3870 )
...
smaller first batch saves about 0.05 ms per token. 1.75ms / tok on local 3090
2024-03-22 00:40:06 -04:00
qazal
fe6ceff15f
proposal: multioutput JIT spec ( #3856 )
...
* corealize JIT
* requirements
2024-03-21 21:28:30 -07:00
Francis Lam
a26090d404
search: change to use "spawn" and limit the number of tasks per child ( #3862 )
...
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
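For context on the "spawn" change above, here is a minimal sketch of the multiprocessing pattern it describes, using only the standard library; the worker function and pool sizes are placeholders, not the actual search code.

```python
# Minimal sketch of the pattern in #3862: use the "spawn" start method and cap
# tasks per worker so child processes are re-created instead of accumulating
# state across many tasks. Worker and numbers are illustrative placeholders.
import multiprocessing

def compile_and_time(candidate):
    # stand-in for compiling and timing one kernel candidate
    return candidate * 0.001

if __name__ == "__main__":  # required with "spawn": children re-import this module
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=4, maxtasksperchild=16) as pool:
        times = pool.map(compile_and_time, range(64))
    print(min(times))
```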
chenyu
dca69df197
hotfix: use DEBUG >= 3 for allreduce message ( #3869 )
2024-03-21 23:40:44 -04:00
uuuvn
6729f20aab
Ring allreduce try 2 ( #3852 )
...
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
* RING=2 and use ContextVar
* DEBUG >= 2 and change name
* linter
* type
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
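The ring allreduce entry above is an algorithm change, so a toy sketch may help: plain Python lists stand in for per-device tensors, and the two phases (reduce-scatter, then all-gather) show why per-device traffic stays roughly constant as the device count grows. This is illustrative only, not the multitensor implementation.

```python
# Toy ring allreduce (the idea behind #3852), with plain lists standing in for
# per-device buffers. Each device's vector is split into n chunks; n-1
# reduce-scatter steps sum one chunk per device, then n-1 all-gather steps
# circulate the reduced chunks, so each device sends/receives O(size) total.
def ring_allreduce(data):
    n, size = len(data), len(data[0])
    assert size % n == 0, "toy version: length must divide evenly"
    csz = size // n
    chunks = [[d[j*csz:(j+1)*csz] for j in range(n)] for d in data]
    # reduce-scatter: after n-1 steps, device i owns the full sum of chunk (i+1) % n
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i+1) % n][c] = [a + b for a, b in zip(chunks[(i+1) % n][c], chunks[i][c])]
    # all-gather: pass the reduced chunks around the ring until everyone has all of them
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i+1) % n][c] = list(chunks[i][c])
    return [[x for chunk in dev for x in chunk] for dev in chunks]

print(ring_allreduce([[1, 2], [10, 20]]))  # [[11, 22], [11, 22]]
```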
Francis Lam
3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors ( #3866 )
2024-03-21 18:58:10 -04:00
chenyu
bc482729d0
lower hlb_cifar acc to 93.3 ( #3865 )
...
ran 30 runs and the lowest I saw was 93.35, so lowered to 93.3 for now.
maybe re-enable EMA later if it reduces variance
2024-03-21 17:58:53 -04:00
chenyu
e50b7abe4f
diversified buf inputs based on dtype in fuzz_linearizer ( #3863 )
2024-03-21 16:23:11 -04:00
chenyu
c40f78499f
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:23:37 -04:00
chenyu
30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4
remove helper_add_store from test_linearizer_failures ( #3860 )
2024-03-21 12:53:31 -04:00
chenyu
6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one ( #3859 )
...
if the kernel is an assign like `a += 1`, rawbufs[0] is updated twice and gives a false compare_error
2024-03-21 11:36:05 -04:00
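The note above is the whole bug; a tiny sketch makes it concrete. Numpy arrays stand in for raw device buffers here, so treat it as an illustration rather than the fuzzer's actual code.

```python
# Why running an assign kernel twice on the same output buffer gives a false
# mismatch (#3859), with a numpy array standing in for rawbufs[0].
import numpy as np

def assign_kernel(buf):          # stand-in for a kernel lowered from `a += 1`
    buf += 1

shared = np.zeros(4)
assign_kernel(shared)            # baseline run -> all ones
assign_kernel(shared)            # candidate run on the SAME buffer -> all twos
print(shared)                    # [2. 2. 2. 2.], no longer matches the saved baseline of ones

baseline = np.zeros(4); assign_kernel(baseline)    # fresh output per run
candidate = np.zeros(4); assign_kernel(candidate)
assert np.array_equal(baseline, candidate)         # now the comparison is fair
```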
nimlgen
b78352b423
do not create structs every call in CUDAProgram ( #3855 )
...
* do not create structs in cuda
* fix graph
* linter
* do not exec twice
* fix graph
2024-03-21 17:51:40 +03:00
nimlgen
e5745c1a0d
fix nan on multi-gpu cuda ( #3854 )
2024-03-21 15:21:55 +03:00
Anurag Lamsal
4e0819e40b
fix the benchmark not printing in the handcode resnet50 opt example ( #3850 )
2024-03-21 00:55:31 -04:00
nimlgen
85691c8e20
fix hsa sync issue ( #3847 )
...
* fix hsa sync issue
* linter
2024-03-21 04:00:30 +03:00
chenyu
f271cd682b
use _resolve_dim in argmax ( #3846 )
...
also added a comment on the behavior when there are multiple maxima, and more tests
2024-03-20 20:17:30 -04:00
chenyu
5c4cf62d2c
fix View.pad arg type ( #3845 )
...
close #3779
2024-03-20 19:36:02 -04:00
Francis Lam
6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones ( #3829 )
...
* search: add BEAM_VERIFY option to validate search results
refactor fuzz_linearizer comparison to allow it to be used for
BEAM_VERIFY in device.py
* search: fix to verify the beam_search result and not the fastest
* search: fix typing and clean up
* device: remove imports from test and add LOGKERN options
LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness
* fix example in verify_kernel.py
* cleanup fixes
* fix to use f-strings
2024-03-20 19:22:08 -04:00
chenyu
9d1d08fbb0
show llama bandwidth with timing ( #3844 )
2024-03-20 17:19:15 -04:00
chenyu
7ff47e45a1
cifar TARGET_EVAL_ACC_PCT=93.5 ( #3843 )
2024-03-20 16:56:51 -04:00
qazal
92c5067439
conceptual small refactor ( #3842 )
2024-03-20 16:46:14 -04:00
chenyu
519336cfea
factor out partial in SumNode div int ( #3841 )
...
* factor out partial in SumNode div int
* div not rem
* space
2024-03-20 16:34:33 -04:00
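As I read the title above, the simplification rests on a simple integer identity; the check below is plain Python, not tinygrad's symbolic SumNode code.

```python
# The integer identity behind factoring a divisible part out of a sum before a
# floor division (as in #3841, to my reading): (d*q + r) // d == q + r // d
# for non-negative integers, so the d*q term can be pulled out exactly.
d = 4
for q in range(5):
    for r in range(12):   # r may be larger than d; the identity still holds
        assert (d*q + r) // d == q + r // d
print("ok")
```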
George Hotz
8cb5215885
Revert "Ring allreduce in multitensor ( #3000 )" ( #3840 )
...
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96
Ring allreduce in multitensor ( #3000 )
...
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu
455f7bea9b
test example from half resnet where idx has a value outside of int32 ( #3838 )
...
* test example from half resnet where idx has a value outside of int32
* ruff
2024-03-20 13:44:20 -04:00
chenyu
727de5ba1e
llama 7B on 3090 benchmark ( #3837 )
...
* llama 7B on 3090 benchmark
* symlink llama
2024-03-20 12:48:22 -04:00
qazal
9452994201
add a better error message for resnet training ( #3836 )
...
* add a better error message
* assert
* use FileNotFoundError
2024-03-20 09:22:15 -07:00
chenyu
47b9cc2dfe
use float32 for rand buffer in test_beam_search and test in metal ( #3831 )
2024-03-19 23:22:58 -04:00
chenyu
d17900bc45
use int32 instead of default_int in simplify_phi_loops ( #3828 )
...
* use int32 instead of default_int in simplify_phi_loops
indices are in int32 now and are separated from the buffer dtype. fix #3823
* return early if not supported
* it's not that
* why is it failing for RHIP
2024-03-19 17:49:58 -04:00
nimlgen
2d54e4d747
clean up hsa driver ( #3818 )
...
* clean up driver
* remove returns
2024-03-20 00:17:41 +03:00
chenyu
99cbc24390
use dtypes.int32 as return dtype for functions that return indices ( #3827 )
...
behavior matches jax. It's fine to have a tensor with more elements than int8 can index even if we set default int to int8
2024-03-19 17:06:57 -04:00
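A small hedged sketch of the behavior described above, assuming the tinygrad Tensor API of this period and that default_int is a plain assignable attribute (the commit's "set default int to int8" suggests it is); intended as illustration, not an exact test.

```python
# Sketch of #3827's behavior (hedged: assumes `from tinygrad import Tensor, dtypes`
# works and dtypes.default_int is assignable): index-returning ops like argmax
# come back as int32 regardless of the default int, matching jax, so index
# values are not limited to the default int's range.
from tinygrad import Tensor, dtypes

dtypes.default_int = dtypes.int8        # deliberately tiny default int
idx = Tensor([3.0, 1.0, 7.0, 2.0]).argmax()
print(idx.dtype, idx.item())            # expected: dtypes.int32 2
```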
chenyu
fa1921ec7d
move test_dtype tests to test dtype and output value ( #3826 )
2024-03-19 16:31:27 -04:00
Francis Lam
131bbb6563
test_linearizer_failure: add failure 27 from a gpt2 kernel ( #3825 )
...
* test_linearizer_failure: add failure 27 from a gpt2 kernel
found during a full fuzz test of applied_opts combos to a
depth of 4 on the gpt2 kernels w/o GROUPTOP.
added additional examples to failure 26 that don't have GROUPTOP
* add other platform failure
2024-03-19 16:29:50 -04:00
nimlgen
3fb13ff892
HIP -> HSA in docs/env_vars ( #3824 )
2024-03-19 22:53:33 +03:00
Francis Lam
9851e2c3b9
test_linearizer_failure: add failure 26 from a gpt2 kernel ( #3821 )
...
found during a full fuzz test of all applied_opts combos to a
depth of 3 on the gpt2 kernels
2024-03-19 13:19:54 -04:00
Patrick Tsai
b436c9792f
Fix factoring bug (O(n) arange related) ( #3817 )
...
* Factoring bug
* Another one in case
* It works now so change tests back
* large arange cumsum optimization
* More cleanup
* symbolic no factor div test
* name change
* Rename test
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-19 11:49:42 -04:00
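One bullet above mentions the large-arange cumsum optimization; the arithmetic relationship it leans on is small enough to show directly (numpy here is just for illustration).

```python
# arange expressed as a cumulative sum of ones (shifted by one), which, as the
# "large arange cumsum optimization" bullet suggests, is the form the symbolic
# factoring has to collapse into a closed-form index expression instead of an
# O(n) sum per element. Illustration only.
import numpy as np

n = 8
assert np.array_equal(np.cumsum(np.ones(n, dtype=np.int32)) - 1, np.arange(n))
print("arange(n) == cumsum(ones(n)) - 1 holds for n =", n)
```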
chenyu
e12bc85014
use BS=128 and BS=768 for resnet benchmark ( #3815 )
...
50% more hcopt perf with this one weird trick
2024-03-18 23:49:55 -04:00