Commit Graph

4213 Commits

Author SHA1 Message Date
David Hou ac9464f47a
allow specify number of beam workers (#4292) 2024-04-25 10:44:43 -04:00
qazal 74a1be88f5
test reduce graph permutations (#4291) 2024-04-25 11:34:44 +03:00
George Hotz 0f0627bc60 add mnist tutorial 2024-04-25 16:08:32 +08:00
chenyu d31e220cbf
add mlperf-logging to setup.py mlperf (#4289) 2024-04-24 23:34:34 -04:00
nimlgen 6b8a85939d
fix lds size for amd (#4287) 2024-04-24 22:54:42 +03:00
chenyu c11bad766d
prepare mlperf submission (#4270)
* prepare mlperf submission

* 28min compile and 3h53m

* red 30 minute compile and 56 TFLOPS
2024-04-24 13:19:31 -04:00
Szymon Ożóg c606a0ba6f
Docs link fix (#4286)
* Update quickstart.md

* Update README.md

* Update quickstart.md

* Update README.md
2024-04-24 12:54:43 -04:00
chenyu c1fbacb182
resnet benchmarks use DEFAULT_FLOAT=HALF (#4285)
also update LR default to scale based on 1536 (the BS we are submitting)
2024-04-24 12:10:57 -04:00
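The LR note above follows the common linear scaling rule (LR proportional to global batch size, with 1536 as the reference from the commit message). A minimal sketch — the function name and the example base LR are hypothetical, not values from the repo:

```python
def scale_lr(base_lr: float, batch_size: int, base_batch_size: int = 1536) -> float:
    # linear scaling rule: LR grows in proportion to the global batch size,
    # using BS=1536 (the submission batch size) as the reference point
    return base_lr * batch_size / base_batch_size

# hypothetical base LR of 8.0 at BS=1536; halving the batch halves the LR
print(scale_lr(8.0, 768))  # 4.0
```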
Szymon Ożóg 002a14088e
Ptx store gate cast to bool (#4284)
* Cast gate to bool

* Update

* Add PTX fuzzing to benchmark
2024-04-24 11:43:44 -04:00
George Hotz dbe3e1d548
or true fixes ci (#4283)
* or true fixes ci

* all with two pipes
2024-04-24 20:48:26 +08:00
qazal 53853e6d08
save the schedule graph in SAVE_SCHEDULE (#4248)
* save the schedule graph with assigns

* extend graph
2024-04-24 12:08:51 +03:00
George Hotz acb32e1766 hotfix: PM4 supports timing 2024-04-24 08:38:59 +00:00
George Hotz ad28fdecb1
si.inputs+outputs -> bufs (#4279) 2024-04-24 15:12:34 +08:00
chenyu 8401de9922
resnet benchmark return early in eval (#4278)
only do a few eval steps to compile, and skip the second epoch when doing beam + benchmark. saves 2 minutes
2024-04-24 00:55:01 -04:00
George Hotz 38f97aa0fe
rename rawbufs to bufs in ExecItem (#4274) 2024-04-24 11:27:27 +08:00
George Hotz 60e3aa5cb1
more docs (#4271)
* more work on docs

* CompilerOptions is dataclass
2024-04-24 10:52:42 +08:00
chenyu 6637ecc5fe
use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 (#4269)
we want to have different BEAM values for resnet train and eval. global JITBEAM cannot do this. added the flag to change beam behavior at cnt=0 (so it behaves the same by default with or without TinyJit); for cnt=1 it uses the existing BEAM.value.

Also updated the context var BEAM in resnet to be outside of TinyJit. saves about 3 minutes of compile time
2024-04-23 18:59:43 -04:00
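The cnt-gated behavior described in this commit can be sketched as follows. This is a hypothetical illustration of the flag's logic using plain env-var reads, not tinygrad's actual implementation (which uses ContextVars):

```python
import os

def beam_value(jit_cnt: int) -> int:
    # jit_cnt=0 is the first, non-jitted run: IGNORE_JIT_FIRST_BEAM disables
    # BEAM there, so train and eval can pick different BEAM values outside TinyJit
    if jit_cnt == 0 and os.getenv("IGNORE_JIT_FIRST_BEAM", "0") == "1":
        return 0
    # jit_cnt>=1 (and jit_cnt=0 without the flag): use the regular BEAM value
    return int(os.getenv("BEAM", "0"))

os.environ["BEAM"] = "2"
os.environ["IGNORE_JIT_FIRST_BEAM"] = "1"
print(beam_value(0), beam_value(1))  # 0 2
```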
nimlgen f3b4dff7c9
KFDProgram -> AMDProgram (#4268) 2024-04-24 00:29:50 +03:00
geohotstan 17328ded7d
setitem no return value (#4266)
* no ret value and just force contiguous

* ok revert contiguous stuff

* actually do force it contiguous

* revert again lol

* add simple regression test

* add assert for MLB

* guess we're contiguous everything from now on

* lol ugly af empty return...

* don't change order cuz i don't get disk
2024-04-23 16:28:14 -04:00
Elias Wahl 3a48773f1a
BERT dataloader (#4252)
* add dataloader

* comment
2024-04-23 13:44:49 -04:00
Elias Wahl 69341144ba
Wikipedia preprocessing script (#4229)
* Preprocessing script

* short seq prob

* comments + env vars

* Add preprocessing reference. Add test

* lint fix + add eval test support

* whitespaces

* point to commit

* comment

* rename

* better comments
2024-04-23 10:28:01 -04:00
chenyu 759b4f41c3
few more KFD -> AMD (#4262)
benchmark gemm and default_parallel
2024-04-23 10:15:37 -04:00
Szymon Ożóg 6c25f1abf7
Optimize ptx loops (#4263)
* Optimize PTX loops

* Update assembly.py
2024-04-23 12:20:14 +04:00
George Hotz 967638f0d5
update docs, remove corealize (#4264)
* update docs, remove corealize

* handle 0 line count

* tensor schedule
2024-04-23 12:05:29 +04:00
George Hotz 9b7efa72ea hotfix: skip 0 line count files in sz.py 2024-04-23 11:56:03 +04:00
George Hotz acf4ba5c9f
method cache respects beam option (#4261)
* method cache respects beam option

* cleanup get_runner
2024-04-23 09:00:41 +04:00
George Hotz 9a95781d51
renamed (#4260) 2024-04-23 09:00:28 +04:00
George Hotz 2ae4f45272
WIP PM4 Support (#4110)
* pm4 kernel launch works

* disable USE_THREAD_DIMENSIONS

* add kernel code

* work on real pm4

* pm4 signal

* same

* gate pm4

* hcq tests pass

* ops passes

* pm4 is closer

* pm4 debug (#4165)

* start debug tests passing

* prg

* smth

* hdp flush

* cleaner 1

* do not need this

* logs not need

* small things

* linter

* remove AQL

* test hcq

* fix tests

* it's subtracting, it shouldn't be -1

* pm4 changes (#4251)

* not need this anymore

* sdma signal with non atomic

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
Francis Lam 3f6c7ca8bf
test: fix test_tensor_core_padded on CUDA and add to benchmarks (#4258)
* test: fix test_tensor_core_padded on CUDA and add to benchmarks

* fix linter

* run both tests in one call
2024-04-22 23:22:11 -04:00
Francis Lam a90de3b574
search: add additional 7 factors to the action space (#4256)
also bump the DB version after the padded TC merge
2024-04-22 19:14:23 -04:00
chenyu de2b1fb468
update adding_new_accelerators doc (#4255)
mlops -> function, and removed some old ops
2024-04-22 18:50:19 -04:00
Francis Lam bbb0ad4800
wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
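The core idea of the PADTO widening in this commit is rounding an axis up to the tensor-core tile size so the TC path still applies. A minimal sketch of that rounding — the function name is hypothetical, and real padded-TC codegen must also mask out the padded region:

```python
def pad_to(dim: int, tc_dim: int) -> int:
    # round dim up to the next multiple of the tensor-core tile size,
    # e.g. a 30-wide axis gets padded to 32 so 16-wide TC tiles fit
    return ((dim + tc_dim - 1) // tc_dim) * tc_dim

print(pad_to(30, 16), pad_to(32, 16))  # 32 32
```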
George Hotz 9e53d6cffa hotfix: 8000 lines 2024-04-22 20:58:16 +04:00
nimlgen e6227bdb15
nv driver (#4044)
* start

* fix err 93

* gpu

* ioctl mappings

* alloc like cuda

* semaphores

* wait for semaphores value

* start ops_nv

* very simple kernels work

* init several gpus

* qmd dumper

* dirty, but most of kernels work

* always all test_ops

* progress, more tests, stable

* test_ops passes, gpt2 works

but with a big fifo, wrap of the fifo doesn't work; i think it's something coherency related

* need better sync

* fix sync

* alloc2

* all tests pass!

* cleanup 1

* cleanup

* multigpu, simple transfer

* fix sync

* correct init

* nv_gpu autogen + sync bug fix

* clean extra/nv_gpu_driver

* p2p

* clean up

* remove old gen

* small fixes

* cleanup

* cleanup 2

* small fixes

* bigger queue size

* cleanups

* wait

* fixed signals for devs

* fix hang + parallel beam

* small fixes

* detect when local memory is big in kernel

* correct assert

* small fixes

* correct tls size est

* one va space

* less lines

* shorter

* save 2 lines

* save some lines

* remove type ignores

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-22 19:50:20 +04:00
qazal 77a3780005
assert reduce recompute (#4250) 2024-04-22 16:12:39 +03:00
qazal a9bc7c1c49
unify assign tests (#4247) 2024-04-22 11:01:15 +03:00
chenyu 37f8be6450
resnet print epoch ops and mem in benchmark (#4244)
* resnet print epoch ops and mem in benchmark

also added a flag to optionally disable resetting jitted steps

* real per epoch stats
2024-04-21 18:32:31 -04:00
Micah Zoltu 7bc862767c
Improves error message when CUDA module fails to load. (#4243) 2024-04-21 11:10:14 -04:00
wozeparrot 4c99d49c4d
some docstrings (#4201)
* feat: create and data access docstrings

* fix: linter

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-21 16:34:08 +04:00
chenyu 30fc1ad415
remove TODO: remove explicit dtypes after broadcast fix in stable_diffusion (#4241)
this is done
2024-04-21 00:31:24 -04:00
chenyu a1940ced77
remove the assign hack in whisper (#4240)
no longer needed, the commented test case was removed too
2024-04-20 23:56:44 -04:00
chenyu 3f126c7664
fix examples vits / converstion.py (#4239)
it was passing a const numpy array into Tensor.arange
2024-04-20 23:29:12 -04:00
chenyu 31c9d9a228
fix test_linearizer tc opt tests for bf16 (#4237)
bf16 tc has larger rtol
2024-04-20 11:51:50 -04:00
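The commit body notes bf16 needs a larger rtol: bf16 keeps fp32's 8-bit exponent but only 7 explicit mantissa bits, so adjacent values near 1.0 are 2**-7 apart. A small stdlib illustration of why a looser tolerance is needed (the tolerance values here are illustrative, not the ones used in the test suite):

```python
import math

# spacing between adjacent bf16 values near 1.0 is 2**-7 ~= 0.0078,
# vs ~1.2e-7 for fp32, so bf16 comparisons need a much looser rtol
a, b = 1.0, 1.0 + 2**-7  # a one-ulp-at-bf16 difference

print(math.isclose(a, b, rel_tol=1e-4))  # False: an fp32-level tolerance rejects it
print(math.isclose(a, b, rel_tol=1e-2))  # True: a bf16-level tolerance accepts it
```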
chenyu f1d9d0a151
cleanup external_test_opt (#4234)
no more OPT=2 or OPT=3, check strict number of kernels, enabled tests now that fusion works
2024-04-20 04:00:08 -04:00
David Hou dc4b1af09c
more realistic edge behavior for resnet benchmark (#4231)
* more realistic edge behavior for resnet benchmark

* schedule_step

* realize all parameters ahead of time

* don't save setup and misc schedules
2024-04-19 20:07:46 -04:00
David Hou f6eea03749
SAVE_SCHEDULE as contextvar (#4230) 2024-04-19 18:51:57 -04:00
qazal 2094b3b327
graph ScheduleItems (#4224)
* graph schedules

* add logging

* inplace

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-19 16:17:11 +04:00
George Hotz cd88afc98b
datasets isn't a feature + filter docstrings (#4228)
* datasets isn't a feature

* filter docstrings in sz
2024-04-19 16:16:10 +04:00
George Hotz b9570d6100
clean up update stats (#4226)
* WIP: clean up update stats

* line savings now

* fix graphs

* fix tests

* tighter prints

* remove extra jit=false

* debug=2 means wait

* that won't update stats

* still wait
2024-04-19 15:41:30 +04:00
qazal 1c87e5dbf6
fuzz schedule context vars (#4223)
* fuzz schedule context vars

* fuzz unique toposorts

* merge ground truth with the rest

* Revert "merge ground truth with the rest"

This reverts commit 1f3463bb57794859e164d2e66a4bf9cc4b03e5ca.

* readability>

* can override
2024-04-19 13:16:25 +03:00