Commit Graph

4182 Commits

Francis Lam bbb0ad4800
wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
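For context, PADTO here pads a loop axis up to the tensor-core tile size so that shapes which don't divide evenly can still use TC. A minimal sketch of the rounding involved (the 16-wide tile is illustrative, not tinygrad's exact TC geometry):

```python
def pad_to_multiple(n: int, tile: int) -> int:
    # round n up to the next multiple of tile (a no-op if already aligned)
    return ((n + tile - 1) // tile) * tile

# a matmul axis of 100 padded for a hypothetical 16-wide tensor-core tile
M, tile = 100, 16
padded = pad_to_multiple(M, tile)
waste = (padded - M) / padded  # fraction of the padded work that is thrown away
```

The search gating on TC_OPT >= 2 makes sense given that `waste`: padding trades wasted lanes for tensor-core throughput, which is only sometimes a win.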
George Hotz 9e53d6cffa hotfix: 8000 lines 2024-04-22 20:58:16 +04:00
nimlgen e6227bdb15
nv driver (#4044)
* start

* fix err 93

* gpu

* ioctl mappings

* alloc like cuda

* semaphores

* wait for semaphores value

* start ops_nv

* very simple kernels work

* init several gpus

* qmd dumper

* dirty, but most of kernels work

* always all test_ops

* progress, more tests, stable

* test_ops passes, gpt2 works

but with big fifo, wrap of fifo doesn't work, i think it's something coherency related

* need better sync

* fix sync

* alloc2

* all tests pass!

* cleanup 1

* cleanup

* multigpu, simple transfer

* fix sync

* correct init

* nv_gpu autogen + sync bug fix

* clean extra/nv_gpu_driver

* p2p

* clean up

* remove old gen

* small fixes

* cleanup

* cleanup 2

* small fixes

* bigger queue size

* cleanups

* wait

* fixed signals for devs

* fix hang + parallel beam

* small fixes

* detect when local memory is big in kernel

* correct assert

* small fixes

* correct tls size est

* one va space

* less lines

* shorter

* save 2 lines

* save some lines

* remove type ignores

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-22 19:50:20 +04:00
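The fifo-wrap bug mentioned above is a classic ring-buffer hazard: indices must wrap modulo the queue size while the producer/consumer counters keep advancing. A toy model of that wrap behavior (not the driver's actual command queue):

```python
class RingFIFO:
    """Minimal ring buffer with wrap-around, a toy model of the kind of
    command fifo whose wrap the commit log describes debugging."""
    def __init__(self, size: int):
        self.buf = [None] * size
        self.size, self.head, self.tail = size, 0, 0  # head/tail grow monotonically
    def push(self, item):
        assert self.head - self.tail < self.size, "fifo full"
        self.buf[self.head % self.size] = item  # physical slot wraps via modulo
        self.head += 1
    def pop(self):
        assert self.head > self.tail, "fifo empty"
        item = self.buf[self.tail % self.size]
        self.tail += 1
        return item
```

On real hardware the extra wrinkle is that the wrapped slot's contents must be visible to the GPU before the doorbell advances, which is why a coherency issue shows up exactly at the wrap point.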
qazal 77a3780005
assert reduce recompute (#4250) 2024-04-22 16:12:39 +03:00
qazal a9bc7c1c49
unify assign tests (#4247) 2024-04-22 11:01:15 +03:00
chenyu 37f8be6450
resnet print epoch ops and mem in benchmark (#4244)
* resnet print epoch ops and mem in benchmark

also added a flag to optionally disable reset jitted steps

* real per epoch stats
2024-04-21 18:32:31 -04:00
Micah Zoltu 7bc862767c
Improves error message when CUDA module fails to load. (#4243) 2024-04-21 11:10:14 -04:00
wozeparrot 4c99d49c4d
some docstrings (#4201)
* feat: create and data access docstrings

* fix: linter

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-21 16:34:08 +04:00
chenyu 30fc1ad415
remove TODO: remove explicit dtypes after broadcast fix in stable_diffusion (#4241)
this is done
2024-04-21 00:31:24 -04:00
chenyu a1940ced77
remove the assign hack in whisper (#4240)
no longer needed, the commented test case was removed too
2024-04-20 23:56:44 -04:00
chenyu 3f126c7664
fix examples vits / conversation.py (#4239)
it was passing a const numpy array into Tensor.arange
2024-04-20 23:29:12 -04:00
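The failure class here is passing a 0-d numpy array where a range-like API expects a plain Python int. A minimal sketch of the bug and the fix (names are illustrative, not the example's actual code):

```python
import numpy as np

stop = np.array(5)        # a 0-d numpy array, not a Python int
# arange-style APIs generally want plain scalars; extract one explicitly
n = int(stop.item())
seq = list(range(n))
```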
chenyu 31c9d9a228
fix test_linearizer tc opt tests for bf16 (#4237)
bf16 tc has larger rtol
2024-04-20 11:51:50 -04:00
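bf16 keeps only 8 mantissa bits, so accumulated matmul error is much larger than fp32 and the correctness check needs a looser relative tolerance. A sketch of dtype-dependent comparison (the tolerance values are illustrative, not tinygrad's exact test settings):

```python
import numpy as np

# looser rtol for narrower mantissas; values here are assumptions for illustration
RTOL = {"float32": 1e-5, "float16": 1e-3, "bfloat16": 1e-2}

def close_for_dtype(a, b, dtype_name: str) -> bool:
    # np.allclose checks |a - b| <= atol + rtol * |b| elementwise
    return np.allclose(a, b, rtol=RTOL[dtype_name])
```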
chenyu f1d9d0a151
cleanup external_test_opt (#4234)
no more OPT=2 or OPT=3; check strict number of kernels; enabled tests now that fusion works
2024-04-20 04:00:08 -04:00
David Hou dc4b1af09c
more realistic edge behavior for resnet benchmark (#4231)
* more realistic edge behavior for resnet benchmark

* schedule_step

* realize all parameters ahead of time

* don't save setup and misc schedules
2024-04-19 20:07:46 -04:00
David Hou f6eea03749
SAVE_SCHEDULE as contextvar (#4230) 2024-04-19 18:51:57 -04:00
qazal 2094b3b327
graph ScheduleItems (#4224)
* graph schedules

* add logging

* inplace

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-19 16:17:11 +04:00
George Hotz cd88afc98b
datasets isn't a feature + filter docstrings (#4228)
* datasets isn't a feature

* filter docstrings in sz
2024-04-19 16:16:10 +04:00
George Hotz b9570d6100
clean up update stats (#4226)
* WIP: clean up update stats

* line savings now

* fix graphs

* fix tests

* tighter prints

* remove extra jit=false

* debug=2 means wait

* that won't update stats

* still wait
2024-04-19 15:41:30 +04:00
qazal 1c87e5dbf6
fuzz schedule context vars (#4223)
* fuzz schedule context vars

* fuzz unique toposorts

* merge ground truth with the rest

* Revert "merge ground truth with the rest"

This reverts commit 1f3463bb57794859e164d2e66a4bf9cc4b03e5ca.

* readability

* can override
2024-04-19 13:16:25 +03:00
George Hotz d99b512084
llm.c timing (#4219)
* add timing info

* fix malloc

* 8s with beam
2024-04-19 12:43:21 +04:00
qazal 43841a32b7
Merge pull request #4222 from Qazalin/fuzz-multi0
Tunable multi output fusion
2024-04-19 08:07:45 +03:00
qazal b2fe3884fc
Merge branch 'master' into fuzz-multi0 2024-04-19 07:56:26 +03:00
qazal abb10c83cd tunable multi output fusion 2024-04-19 07:44:31 +03:00
chenyu a1133beb80
KFD GEMM (#4221)
added to benchmark CI and fixed duplicated filenames between cuda and ptx
2024-04-19 00:43:18 -04:00
chenyu 3f3af0fb85
test_linearizer_failures 29 passes now (#4215)
TC + PADTO fixed
2024-04-18 19:49:23 -04:00
Elias Wahl 2ecd61e3e2
monkey patching (#4214) 2024-04-18 19:20:52 -04:00
Francis Lam 126826afc8
linearizer: refactor to define accs with potentially TC-modified idxs (#4211) 2024-04-18 15:31:06 -04:00
George Hotz 39b60a25f0
more llm c work (#4207)
* more llm c work

* print nicely

* fake load pretrained

* select warmups

* output c code
2024-04-18 22:20:44 +04:00
chenyu f7416916df
update resnet hparams based on BS=1632 RCP (#4210)
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_4.0.0/rcps_resnet.json
2024-04-18 12:01:46 -04:00
George Hotz fa57c3e7ce
continue llm.c (#4190)
* continue llm.c

* export more

* progress on llm.c

* simpler optim, names work
2024-04-18 10:57:54 +04:00
geohotstan 269a58d5fa
tolist to return multidimensional list (#4192)
* lol does this work

* some more changes

* a tiny note

* rename a variable

* add test for data const and add TODO comment

* make type correct

make type correct
2024-04-18 07:43:10 +04:00
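Returning a multidimensional list from `tolist` amounts to recursively splitting flat data by the leading shape dimension. A self-contained sketch of that recursion (this is the general technique, not tinygrad's actual implementation):

```python
def to_nested_list(flat, shape):
    """Rebuild a nested Python list from flat data and a shape tuple,
    the kind of recursion a multidimensional tolist performs."""
    if not shape:                      # 0-d: a bare scalar
        return flat[0]
    stride = len(flat) // shape[0]     # elements per sub-list along axis 0
    return [to_nested_list(flat[i * stride:(i + 1) * stride], shape[1:])
            for i in range(shape[0])]
```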
Francis Lata 3644077a42
[MLPerf][UNet3D] Add DICE loss + metrics (#4204)
* add DICE loss and metrics

* update dice to include reference implementation's link

* remove unused imports

* remove unnecessary test file and update pred + label for metrics and losses test

* add tests to CI + add exclusion of mlperf_unet3d

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-17 20:09:33 -04:00
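The DICE loss is built on the Dice coefficient, 2|X∩Y| / (|X|+|Y|), which measures overlap between prediction and label masks. A minimal numpy sketch of the math (the epsilon and soft-overlap form are common conventions, not necessarily the reference implementation's exact details):

```python
import numpy as np

def dice_score(pred, label, eps=1e-6):
    # soft Dice: 2|X∩Y| / (|X|+|Y|); eps guards against empty masks
    pred, label = np.asarray(pred, dtype=float), np.asarray(label, dtype=float)
    inter = (pred * label).sum()
    return (2.0 * inter + eps) / (pred.sum() + label.sum() + eps)

def dice_loss(pred, label):
    # perfect overlap -> loss 0; no overlap -> loss near 1
    return 1.0 - dice_score(pred, label)
```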
chenyu cd801a15f3
scipy.signal.gaussian -> scipy.signal.windows.gaussian (#4205)
fixed unet3d model_eval, will add to CI after merging new dice loss
2024-04-17 19:15:37 -04:00
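The rename is needed because scipy moved its window functions into the `scipy.signal.windows` namespace. For reference, the window itself is simple enough to write in plain numpy; a sketch of the same formula the scipy function documents:

```python
import numpy as np

def gaussian_window(M: int, std: float) -> np.ndarray:
    # w[n] = exp(-0.5 * ((n - (M-1)/2) / std)**2), centered on the window midpoint
    n = np.arange(M) - (M - 1) / 2.0
    return np.exp(-0.5 * (n / std) ** 2)
```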
Elias Wahl 6eef8ee22a
Wikipedia download script for MLPerf BERT training (#4202)
* wikipedia download script

* add link

* checksum ValueError

* ops
2024-04-17 16:34:57 -04:00
qazal f75020a903
minimal diff for multioutput reduce pairs (#4030)
* simple fusion

* compiler cache patch

* Revert "compiler cache patch"

This reverts commit fa180495974456a1748a64865c4d329eae0a55e9.

* Revert "Revert "compiler cache patch""

This reverts commit 57f8d41f985ac8acfff997136024b0b43577f195.

* delete that

* early sort

* teeny renames

* spec

* .empty is great

* delete sort

* Update test_schedule.py

* this is one kernel now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-17 10:55:44 -04:00
George Hotz 8564e28a1b
new memory scheduler with explicit refcounts (#4198)
* new memory scheduler with explicit refcounts

* move central memory planner

* typo + use central memory planner in openpilot

* cleanups

* include lb_refcount in pickle

* replace PlaceHolder with memory planner

* cleaner
2024-04-17 08:46:47 +04:00
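A refcount-style memory planner frees a buffer's slot once its last use has passed, letting later buffers reuse it. A toy interval-based sketch of that idea (this models the technique, not tinygrad's actual planner):

```python
def plan_slots(lifetimes):
    """lifetimes: dict name -> (first_use, last_use) step indices.
    Returns name -> slot id, reusing a slot once its previous owner's
    lifetime has ended."""
    events = sorted(lifetimes.items(), key=lambda kv: kv[1][0])
    free, slots, live = [], {}, []   # live holds (last_use, slot) pairs
    next_slot = 0
    for name, (first, last) in events:
        # release slots whose owners are dead before this buffer is created
        still = []
        for end, slot in live:
            (free if end < first else still).append(slot if end < first else (end, slot))
        live = [x for x in still]
        if free:
            slot = free.pop()
        else:
            slot = next_slot
            next_slot += 1
        slots[name] = slot
        live.append((last, slot))
    return slots
```

With lifetimes `{"a": (0, 1), "b": (0, 2), "c": (2, 3)}`, buffer `c` reuses `a`'s slot because `a` is dead by step 2, while `b` keeps its own.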
Francis Lam c91b7b1739
test: add fuzz_matmul and better debugging for simple_matmul (#4199)
also show unoptimized shape in verify_kernel
2024-04-16 23:40:31 -04:00
qazal ba8602612b
Fuzz all permutations of schedule (#4136)
* simple toposort

* fuzzer

* init in_degree

* move to tests

* same seed

* configure paths

* internal graph

* compare LazyBuffers

* simpler

* simple graph

* assign works

* simpler

* fix JIT

* upstream ci

* move ci

* fix the path

* DEBUG=1

* limit max paths

* launch a cmp kernel

* Revert "launch a cmp kernel"

This reverts commit 791c6089922fa7d800456f28fc167842f188ac7e.

* exec ground truth

* better perf

* copy ground truth once

* gpu allclose ast try1

* Revert "gpu allclose ast try1"

This reverts commit 1f82103af3a7bfedb9f858b6c58b0b94f1c7e6b0.

* prerealized bufs freezing

* teeny cleanups

* reuse Buffers

* Revert "reuse Buffers"

This reverts commit a71de94b035bd5ceb1ec257f6b2529b166bcd30b.

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-17 05:03:21 +04:00
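Fuzzing "all permutations of schedule" means enumerating every topological ordering of the schedule DAG and checking each one computes the same result. A self-contained sketch of the enumeration via backtracking Kahn's algorithm (the graph encoding is an assumption, not tinygrad's schedule representation):

```python
def all_toposorts(graph):
    """Enumerate every topological ordering of a DAG given as
    {node: set(of predecessor nodes)} -- the space such a fuzzer walks."""
    indeg = {n: len(preds) for n, preds in graph.items()}
    succ = {n: [] for n in graph}
    for n, preds in graph.items():
        for p in preds: succ[p].append(n)
    order, out = [], []
    def rec():
        ready = [n for n in graph if indeg[n] == 0]
        if not ready:
            if len(order) == len(graph): out.append(order[:])
            return
        for n in ready:                       # branch on every ready node
            indeg[n] = -1; order.append(n)
            for s in succ[n]: indeg[s] -= 1
            rec()
            for s in succ[n]: indeg[s] += 1   # undo and try the next choice
            indeg[n] = 0; order.pop()
    rec()
    return out
```

Note the count explodes combinatorially, which is why the commit log mentions limiting max paths.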
nimlgen 4ed6b42a8a
fix kernargs check in kfd (#4194) 2024-04-17 00:44:50 +03:00
David Hou 97d846dd67
in forced_realize, unchase last op if it is upcast (#4185)
* in forced_realize, unchase last op if it is upcast

* start on test

* flesh out test

* more test

* comment

* comment out parallel reduce test

* reorder

* unused
2024-04-16 17:15:17 -04:00
Francis Lam e9c1616b27
logging: change LOGKERN to LOGKERNS to match LOGOPS (#4193)
also add printing of ast and applied_opts during verify_kernel
to more easily debug errors if they come up
2024-04-16 16:08:32 -04:00
David Hou 7fb220a567
touchup resnet_layer_bench (#4191) 2024-04-16 14:43:00 -04:00
David Hou 1dbf3b2b19
Benchmarks for individual resnet layers (#4182)
* resnet individual layer benchmarks!

* small

* 1 and 2

* mem_used

* no ci

* better conv print

* defaults

* prints

* adjust

* adjust

* adjust

* benchmark only one layer example

* tensor.training, zero_grad, sum instead of mean, last mem, last kernel count

* default jitcnt=1

* scale flops/kernels with jitcnt

* add note about jitcnt memory

* touchup
2024-04-16 13:53:18 -04:00
George Hotz d49d4324a3
update docs (#4189) 2024-04-16 16:07:02 +04:00
George Hotz 55ae73e951
Replicate llm.c in tinygrad (#4179)
* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* test tolist

* simple fix for onnx test failures (#4186)

* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* bump line count to 7500

* simplest fix

* safenumpy tolist for now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

---------

Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>
2024-04-16 15:40:48 +04:00
George Hotz b6e7243bfa hotfix: skip slow pre-commit test 2024-04-16 11:48:43 +04:00
George Hotz cda0010020 hotfix: docs-legacy 2024-04-16 11:06:56 +04:00
George Hotz 8f749ae0eb
New docs are in mkdocs (#4178)
* start mkdocs

* simple docs for tensor

* more docs

* move those back

* more docs

* copy markdown extensions

* docs legacy

* docs building workflow

* fix showcase links

* only that?

* install tinygrad

* add docs to setup.py

* Delete examples/llm.c/data
2024-04-16 10:59:51 +04:00
chenyu aa093efa43
fix handcode_resnet50_opt flops count (#4184) 2024-04-15 22:13:45 -04:00
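Flops counting for conv/matmul-style kernels reduces to counting multiply-accumulates. A sketch of the standard formula (illustrative; not the script's exact accounting, which also handles jitted iteration counts as the layer benchmark commits note):

```python
def matmul_flops(M: int, N: int, K: int, jitcnt: int = 1) -> int:
    # an (M,K) @ (K,N) matmul performs M*N*K multiply-accumulates,
    # conventionally counted as 2*M*N*K flops; counters scale with jitcnt
    return 2 * M * N * K * jitcnt
```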
chenyu d5b67c1ca3
log resnet TRAIN_BEAM / EVAL_BEAM (#4181)
also run eval in benchmark mode if either one is positive
2024-04-15 19:29:08 -04:00