Commit Graph

2786 Commits

Yixiang Gao 902f00b095
adding cuda TC headers (#2165)
* split cuda to renderer and add headers for tc

* fix TritonRenderer

* remove unused import
2023-10-27 14:25:59 -10:00
David Hou 7f4f925385
fix hip del on compile fail (#2163)
* fix hip del on compile fail

* the test doesn't actually work
2023-10-27 11:38:07 -10:00
Francis Lam 8cf0bb9351
optimizer: simplify GROUP and LOCAL to have one of each (#2162)
* optimizer: simplify GROUP and LOCAL to have one of each

Now that tensor cores only use LASTLOCAL, we can simplify to use
only that op everywhere.

The only use of GROUP is in the matvec hand-coded opts, and since it
doesn't make a performance difference we switch to using only the top
behavior.

Also adds additional asserts to prevent tensor core dims from
being altered which causes bad kernels to be generated.

* search: remove duplicated actions
2023-10-27 11:37:44 -10:00
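The asserts mentioned in #2162 guard against optimizer actions touching axes that a tensor core has already claimed. A minimal sketch of that kind of guard, using hypothetical `Kernel`/`Opt` names rather than tinygrad's actual internals:

```python
# Hypothetical sketch: reject optimizer actions that would alter axes
# reserved by a tensor core, which would otherwise generate bad kernels.
from dataclasses import dataclass

@dataclass(frozen=True)
class Opt:
    op: str    # e.g. "LOCAL", "GROUP", "UPCAST"
    axis: int
    amt: int

class Kernel:
    def __init__(self, tensor_core_dims=()):
        self.tensor_core_dims = set(tensor_core_dims)  # axes owned by the tensor core
        self.applied_opts = []

    def apply_opt(self, opt: Opt):
        assert opt.axis not in self.tensor_core_dims, \
            f"axis {opt.axis} is owned by the tensor core and cannot be altered"
        self.applied_opts.append(opt)

k = Kernel(tensor_core_dims=(2, 3))
k.apply_opt(Opt("LOCAL", 0, 16))    # fine
# k.apply_opt(Opt("LOCAL", 2, 8))   # would raise: axis 2 belongs to the tensor core
```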
George Hotz e0201922e3
Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
will bc0829b677
Fix llama json loading (#2160) 2023-10-27 10:21:56 -10:00
nimlgen 8d41b3eb3f
beam=16 makes gpt2 gpu-time < 5ms on 3090 (#2154) 2023-10-27 10:21:27 -10:00
nimlgen 5204864eca
init cudagraph (#2153)
* init cudagraph

* linter happy

* print warning when cuda graph creation failed
2023-10-27 16:19:50 -04:00
chenyu 9215bccb41
Tensor.uniform set default to standard uniform (#2158)
* Tensor.uniform set default to standard uniform

* clean up test to reuse function
2023-10-27 16:15:30 -04:00
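For context on #2158: after this change, Tensor.uniform without explicit bounds should sample from the standard uniform distribution on [0, 1). A small usage sketch (the low/high keyword names are assumed from tinygrad's usual signature):

```python
from tinygrad.tensor import Tensor

# default is now standard uniform: values in [0, 1)
a = Tensor.uniform(2, 3)

# other ranges still work by passing the bounds explicitly
b = Tensor.uniform(2, 3, low=-1.0, high=1.0)

print(a.numpy().min() >= 0.0, a.numpy().max() < 1.0)  # expected: True True
```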
Roelof van Dijk 36ab04ae35
perf: lazyop as dataclass (#1603)
* perf: lazyop as dataclass

fix: linter

fix: restore eq

* use builtin methods, buffers to property to allow freezing

* fix: reduce diff

* fix: can't freeze due to KOPT tests, mypy

* fix: explicit hash

* can freeze if tests are fixed

* fix: typo

---------

Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 17:54:30 -04:00
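The shape of the change in #1603 (an immutable op node as a frozen dataclass with an explicit hash, so it can be deduped and used as a dict key) looks roughly like the sketch below; the field names are illustrative, not the real LazyOp definition:

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)
class LazyOpSketch:
    op: str                      # illustrative fields, not the real LazyOp layout
    src: Tuple[Any, ...] = ()
    arg: Any = None

    # eq=True (the default) gives structural __eq__; frozen=True makes instances
    # immutable; an explicit __hash__ lets the node key dicts for dedup/caching
    def __hash__(self):
        return hash((self.op, self.src, self.arg))

a = LazyOpSketch("ADD", ("x", "y"))
b = LazyOpSketch("ADD", ("x", "y"))
assert a == b and hash(a) == hash(b)
cache = {a: "compiled kernel"}   # works because the node is hashable
```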
chenyu 0ca0e9ee5e
exclude ast with variables from beam search (#2140)
* exclude ast with variables from beam search

* test that

* add to CI
2023-10-25 16:35:29 -04:00
Szymon Ożóg a52b420fb3
switch ocelot back to main repo (#2147)
* return to ocelot main branch

* cd before checkout
2023-10-25 15:14:26 -04:00
George Hotz 12dd165d38 add WINO/HALF/HIP to AMD benchmark 2023-10-25 13:22:45 -04:00
Francis Lam bf3490cdf9
wmma: refactor tensor cores using existing local dims (#2097)
* wmma: refactor tensor cores using existing local dims

* optimizer: fix bad rebase and break after one late local

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 13:10:46 -04:00
wozeparrot c29653605e
hip multigpu training (#1878)
* feat: move to hip

* feat: special path for RawBufferTransfer

* feat: initial rawbuffertransfer

* feat: hip ipc

* feat: working hip ipc

* feat: need to base device without args

* feat: close mem handle

* feat: modified test

* feat: more multihip stuff

* clean: cleanup

* feat: cleaner

* feat: don't crash

* feat: test more

* clean: way cleaner hip wrapper

* feat: barrier

* feat: barrier

* feat: this breaks stuff

* feat: we can use empty here

* feat: maybe fix tests

* feat: maybe fix tests again?

* fix: probably fix tests

* feat: no waiting here

* feat: wait here

* feat: much larger test

* feat: need to sync here

* feat: make this async

* feat: no waiting!

* feat: cut here

* feat: sync copy

* feat: random imports

* feat: much cleaner world

* feat: restore this

* feat: restore this

* clean: cleanup

* feat: set this
2023-10-24 17:35:53 -04:00
nimlgen 2e89fd264f
Refactor hipgraph (#2141)
* refactor hip graph

* linter happy

* happy linter
2023-10-24 15:45:56 -04:00
nimlgen e21bf776c8
fix debug=1 llama/gpt2 timings (#2143) 2023-10-24 15:45:00 -04:00
chenyu 4444e6d4b3
stable diffusion < 324ms (#2129)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var
2023-10-24 14:56:12 -04:00
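REDUCEOP_SPLIT_THRESHOLD from #2129 is an environment variable; a minimal sketch of how such a knob is typically read and used, assuming a default of 32768 and a simplified split rule rather than the exact tinygrad logic:

```python
from tinygrad.helpers import getenv

# assumed default; the real value and logic live in the reduce-splitting code
REDUCEOP_SPLIT_THRESHOLD = getenv("REDUCEOP_SPLIT_THRESHOLD", 32768)

def should_split_reduce(reduce_size: int) -> bool:
    # split one big sum into two stages only when the reduced dimension is
    # large enough for the extra kernel launch to pay off
    return reduce_size >= REDUCEOP_SPLIT_THRESHOLD
```

Overriding it is then just a matter of setting the variable in the environment before running, e.g. REDUCEOP_SPLIT_THRESHOLD=65536.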
George Hotz cea2bc7964
Add dictionary keys to reduce db size (#2131)
* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood
2023-10-24 10:49:22 -04:00
chenyu d5e2fdea22
remove shelve from handcode_resnet50_opt.py (#2139) 2023-10-24 10:37:30 -04:00
imaolo 228b310478
align cpu buffer before copy into cl buffer (#2135) 2023-10-23 21:04:35 -04:00
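The idea in #2135's title is a general one: make sure the host numpy buffer starts at an aligned address before copying it into an OpenCL buffer. A sketch of the technique, assuming a 64-byte alignment target (the actual figure in the PR may differ):

```python
import numpy as np

def aligned_copy(a: np.ndarray, alignment: int = 64) -> np.ndarray:
    # allocate `alignment` bytes of slack, then carve out a view whose data
    # pointer is a multiple of `alignment` and copy the source into it
    buf = np.empty(a.nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    out = buf[offset:offset + a.nbytes].view(a.dtype).reshape(a.shape)
    out[:] = a
    return out

x = np.arange(24, dtype=np.float32)
y = aligned_copy(x)
assert y.ctypes.data % 64 == 0 and np.array_equal(x, y)
```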
imaolo 6ee0435263
added from unaligned np test (#2134) 2023-10-23 11:38:57 -04:00
George Hotz 3c56c181f6 string formatting 25 -> 30 to fit 2023-10-22 10:57:34 -07:00
George Hotz 6dc8eb5bfd
universal disk cache (#2130)
* caching infra for tinygrad

* non str key

* fix linter

* no shelve in beam search

* beam search caching

* check tensor cores with beam too

* pretty print

* LATEBEAM in stable diffusion
2023-10-22 10:56:57 -07:00
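A rough sketch of what a universal disk cache of this shape can look like: a sqlite-backed key/value store keyed by a table name and a string key. The table layout and the diskcache_put/diskcache_get names here are assumptions for illustration, not necessarily the helpers #2130 added:

```python
import os, pickle, sqlite3

CACHE_DB = os.path.expanduser("~/.cache/example_cache.db")  # assumed location

def _conn():
    os.makedirs(os.path.dirname(CACHE_DB), exist_ok=True)
    conn = sqlite3.connect(CACHE_DB)
    conn.execute("CREATE TABLE IF NOT EXISTS cache "
                 "(table_name TEXT, key TEXT, val BLOB, PRIMARY KEY (table_name, key))")
    return conn

def diskcache_put(table: str, key: str, val):
    with _conn() as c:
        c.execute("REPLACE INTO cache VALUES (?, ?, ?)", (table, key, pickle.dumps(val)))
    return val

def diskcache_get(table: str, key: str):
    row = _conn().execute("SELECT val FROM cache WHERE table_name=? AND key=?",
                          (table, key)).fetchone()
    return pickle.loads(row[0]) if row else None

# e.g. cache a beam-search result keyed by some string form of the kernel ast
diskcache_put("beam_search", "some_ast_key", {"opts": ["LOCAL(0,16)"], "time_us": 42.0})
print(diskcache_get("beam_search", "some_ast_key"))
```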
Francis Lam ace6b2a151
optimizer: add test for correctness of opts (#2124)
* optimizer: add test for correctness of opts

Also added OptOps.UPCASTMID to constrain valid axes for opts with
group_for_reduce.

* llvm: fix LinearizerOptions to correctly set has_shared=False

* optimizer: remove premature test scaffold for TC opts

* search: fix the action space
2023-10-22 08:02:22 -07:00
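The point of the correctness test in #2124 is that applying optimizer actions must never change a kernel's numerical result. A minimal sketch of that idea, where run_linearizer_with_opts is a hypothetical stand-in for the real test harness:

```python
import numpy as np

def check_opts_correctness(run_linearizer_with_opts, ast, opts_to_try):
    # `run_linearizer_with_opts(ast, opts)` is assumed to compile and run the
    # kernel for `ast` with the given actions applied, returning a numpy array
    baseline = run_linearizer_with_opts(ast, opts=[])
    for opts in opts_to_try:
        out = run_linearizer_with_opts(ast, opts=opts)
        np.testing.assert_allclose(out, baseline, rtol=1e-4, atol=1e-4,
                                   err_msg=f"opts {opts} changed the kernel's result")
```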
George Hotz abeba8f1fc
optimization: get actions in CI (#2125)
* get actions in CI

* actually run the test

* pythonpath
2023-10-20 12:22:01 -07:00
qazal 14625721e9
minor triton casting refactor (#2118)
* minor refactor

* render_cast taking an x like cstyle

* fix fmt strings

* tl.where

* fix alu render

* use dtype

* newline eof

* better diff
2023-10-20 12:11:55 -07:00
George Hotz cb508e6923
uops graphing + phi (#2120)
* uops graphing

* add_phi_node

* less phi nodes

* where graph uops should live

* naming

* move it to external

* fix triton yolo

* fix clang and preserve behavior
2023-10-19 22:26:28 -07:00
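Background on the "phi" in #2120: when a loop accumulator is lowered to SSA-style uops, a PHI node is what merges the value entering the loop with the value produced by the previous iteration. A small Python sketch of the general SSA idea (not tinygrad's actual uop structure):

```python
# The Python loop has one variable `acc`, but in SSA form every assignment gets
# a fresh name, so the loop header needs phi(acc0, acc2) to select between the
# initial value and the value carried from the previous iteration.
def reduce_sum(xs):
    acc = 0.0                 # SSA: acc0 = 0.0
    for x in xs:              # SSA loop header: acc1 = phi(acc0, acc2)
        acc = acc + x         # SSA: acc2 = acc1 + x
    return acc                # code after the loop uses the phi'd value

assert reduce_sum([1.0, 2.0, 3.0]) == 6.0
```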
20kdc bedd028061
waifu2x vgg7: testcase, auto-RGBA->RGB, function to grab pretrained models, training "fix" (#2117) 2023-10-19 22:07:15 -07:00
Szymon Ożóg e0b2bf46b4
Improve triton generated code quality (#2119) 2023-10-19 22:06:19 -07:00
qazal 36d4001b4f
add test coverage for search (#2104)
* add test coverage for search

* only in compiled backends

* dont use device.default in decorator

* time_til is the other way around xd
2023-10-19 17:06:47 -07:00
Szymon Ożóg 7268b3c6fb
make triton not write to disk (#2116) 2023-10-18 23:06:47 -07:00
David Hou 95e17ff0d4
fix wino mask upcast calculation (#2057)
* fix wino mask upcast calculation

* add tests for wino upcast hcopt

* add info to note

* real world wino hcopt test

* wino backward test

* whitespace
2023-10-18 16:54:48 -07:00
George Hotz 5cfec59abc
hlb cifar touchups (#2113)
* types and cnt and EVAL_STEPS

* eval time + always print eval
2023-10-18 16:26:15 -07:00
chenyu 5d5921d2c8
small doc env update (#2112) 2023-10-18 14:49:25 -07:00
George Hotz 4526891db7
parallel apt (#2111) 2023-10-18 14:49:00 -07:00
George Hotz 87b714b8cb split test_conv2d 2023-10-18 14:00:50 -07:00
George Hotz 15da96f393
print test durations and add speed (#2107)
* print test durations

* decrease sizes to increase speed

* faster

* GPU/CLANG onnx in separate runner

* test split, move ONNX CPU CI

* simpler tests

* simpler uops test

* faster

* less cuda apt

* running ninja install

* apt install

* split fancy indexing
2023-10-18 13:46:42 -07:00
George Hotz e2a1c2aaa6 force ruff reinstall 2023-10-18 11:40:46 -07:00
George Hotz 0d2b3a9d33 full path for ruff 2023-10-18 11:27:49 -07:00
George Hotz 8940c89d13
tests: remove 2 runners, make cache reliable (#2106)
* remove 2 runners

* device.DEFAULT printing

* explain rebuild

* disable ocelot rebuild

* try again to fix workflow

* this? fix cache hash

* force no rebuild

* fix pylint
2023-10-18 11:10:41 -07:00
George Hotz b3afe0106b
typo, src printing, and no verbose on triton (#2105) 2023-10-18 09:44:36 -07:00
20kdc 967a88a505
examples/waifu2x: Cleanup waifu2x vgg7 model format (now uses safetensors) (#2082) 2023-10-18 09:20:11 -07:00
George Hotz 881fd7c141
add mops to graph, refactor IMAGE (#2100)
* add mops to graph, refactor IMAGE

* no reshape pushing

* add todo

* fix openpilot model alt

* push reshapes reduces kernels in new op

* IMAGE=2 is a first class citizen now
2023-10-17 21:27:51 -07:00
George Hotz 2498802b46
fix beam search for llvm, this needs tests (#2101) 2023-10-17 20:09:42 -07:00
wozeparrot 4d1e59abfd
fix: only when distributed (#2102) 2023-10-17 20:09:04 -07:00
Sean D'Souza 999c95ea29
fix: hlb cifar types (#2099) 2023-10-17 19:23:50 -07:00
George Hotz 9b1c3cd9ca hlb_cifar: support EVAL_STEPS=1000, print when dataset is shuffled 2023-10-18 01:11:08 +00:00
Ahmed Harmouche 2b5ea7d9cb
Fix output Float32Array size in webgpu export (#2096) 2023-10-17 15:28:19 -07:00
Umut Zengin 01b98b7f42
MulNode.__lt__ rule (#2086)
* Added the rule

* Added tests

* flake8

* self.b == -1 shortcut
2023-10-17 13:18:35 -07:00
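For context on #2086: the rule lets a symbolic comparison of the form (x*b) < c fold into a plain bound on x. The arithmetic identity behind it, shown standalone rather than through tinygrad's symbolic classes:

```python
# For integer x and b > 0:  x*b < c  <=>  x < ceil(c / b)
import math

def fold_mul_lt(b: int, c: int) -> int:
    assert b > 0
    return math.ceil(c / b)   # the new right-hand bound on x

# x*4 < 13 folds to x < 4 (since x*4 < 13  <=>  x <= 3)
assert fold_mul_lt(4, 13) == 4
# x*4 < 12 folds to x < 3
assert fold_mul_lt(4, 12) == 3
```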
Szymon Ożóg f76fbd23e9
cleanup triton (#2092)
* Revert "disable flaky triton test"

This reverts commit 1e15fdaee7.

* Update test.yml

* check if has shared for matvec

* disable ocelot cache for triton

* disable ocelot cache

* disable ocelot cache

* pass shared to triton uops tests

* temporary debugs for CI crash

* Revert "temporary debugs for CI crash"

This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542.

* Revert "triton isn't tested, and allows this refactor (#2007)"

This reverts commit dea8bb0938.

* add runtime_args to every renderer, move triton local size override to runtime args

* Add binary to args, correct type returned

* update to new loops

* Update test.yml

* cleanup triton
2023-10-17 12:49:44 -07:00