Commit Graph

2711 Commits

Author SHA1 Message Date
George Hotz 7103b716c4
merge kernel and optimizer (#2200)
* merge kernel and optimizer

* linearize is reentrant

* move global/local size

* clean up linearizer copy

* remove unneeded lin copies

* stop linearizing twice

* oops, that should be None
2023-11-01 15:20:01 -07:00
George Hotz 33bb650e94
use mad in opencl (#2198)
Co-authored-by: Comma Device <device@comma.ai>
2023-11-01 10:40:08 -07:00
George Hotz c8b6a811ea
no locals as opt action (#2196)
* switch barrier, add clear_l2

* no locals can be searched

* revert barrier

* fix ci

* put it there
2023-11-01 09:47:44 -07:00
Comma Device 2e9982fe2d fastvits example that's 10% faster 2023-10-31 21:48:23 -07:00
George Hotz 8ba7ced7f9
extract const if it's const (#2193)
* extract const if it's const

* fix if statement

* fast math issue

* fix graphing and casting

* disable flaky copyout test
2023-10-31 18:52:35 -07:00
George Hotz b245f1307e
add exp2 (#2192) 2023-10-31 17:48:42 -07:00
qazal e2428b63a6
external (#2191) 2023-10-31 13:57:24 -07:00
Elias Wahl 7e8c5f1a0f
Modernize setup.py (#2187)
* Added pyproject.toml

* Pin onnx
2023-10-31 13:55:45 -07:00
nimlgen 8c07c73a9b
Fix cl map buffer (#2190)
* fix gpu enqueue_map_buffer out of space

* add test
2023-10-31 12:02:46 -07:00
George Hotz c59ea32f90 prevent over unrolling in optimzer 2023-10-31 11:45:18 -07:00
George Hotz 5aaa8a0cc1 fix shape 2023-10-31 11:36:19 -07:00
George Hotz a27c9f9de5
openpilot compile2 (#2189)
* try compile2

* pass to thneed

* fix tanh onnx
2023-10-31 11:08:58 -07:00
qazal be5f185ac0
Higher test coverage for dtypes (#2156)
* refactor unit tests for dtypes

* add missing dtypes in llvmir.py and lib.py

* skip torch tests

* webgpu

* cleaner skips

* fix llvm bool casting issue using compare

* llvm 100% passing

* llvm segfault

* TEMP decrease timeout mins to 11

debug

* add bf16 to setup

* skip half tests in cuda cpu

* check for CUDACPU insetad

* add int16 to triton dtypes

* u16 for triton

* remove debug - diff is still hard to read

* derive from base class TestDType

* enhance test_upcast and downcast by running on every possible version

* dummy commit to rerun the flakey test

* skip the correct tests for CUDA

* bf16 should be skipped in the common TestDType cases

* re-enable bf16

* more consistent structure

* tiny changes to is_dtype_supported 1

* tiny changes 2

add reason

* fuzz

* fuzzer p2

* run fp32 twice

* remove duplicate fp32 run

* clang: use stdbool

* skip triton on bool casts

* merge and resolve conflicts
2023-10-30 22:38:42 -07:00
forcefieldsovereign f294bdd681
fixed imports (#2185) 2023-10-30 22:07:17 -07:00
Akshay Kashyap 018bd29e37
Enable Multi-Output Export (#2179)
* Enable Multi-Output Export

* Add test

* Update examples and lint

* fix padding

* test ops

* dummy commit to rerun test

* revert cuda lint

* Enforce tuple/list of tensors

* subscripted generics

* put back webgpu test

* Re-enable WebGPU Efficientnet test
2023-10-30 18:42:26 -07:00
qazal a7439af786
Fix llvm int->bool cast (#2164)
* add to ir

* add test case

* minimize diff

* todo

* enable fast math

* added both False and True case
2023-10-30 15:28:23 -07:00
George Hotz 94cf652b6b don't use locals applies to GROUP also 2023-10-30 13:56:43 -07:00
George Hotz 5cc536bcc0 don't use locals applies to LASTLOCAL 2023-10-30 13:53:42 -07:00
chenyu 3c88af5071
use unique table name for each disk_cache test (#2184) 2023-10-30 13:49:49 -07:00
George Hotz 608e3ee800
fix no locals search and search both (#2171)
* fix no locals search and search both

* pretty print

* nolocals default no other search
2023-10-30 10:22:50 -07:00
George Hotz 194e4ad6f8
Revert "optimizer: simplify GROUP and LOCAL to have one of each (#2162)" (#2182)
This reverts commit 8cf0bb9351.
2023-10-30 10:22:26 -07:00
Ahmed Harmouche 95f7183c3a
Reenable global, local limiting (#2095) 2023-10-30 10:17:23 -07:00
chenyu 8548b20b23
fix codellama params and repeat_kv (#2181) 2023-10-30 10:16:26 -07:00
George Hotz c7f4dd6cb0 CACHELEVEL for smaller caches 2023-10-28 07:26:03 -10:00
chenyu 6c58bf3e9c
in time_linearizer, allocate a scratch buffer if output buffer is also input (#2152)
* in time_linearizer, allocate a scratch buffer if output buffer is also input

* move scratch buffer creation outside search
2023-10-28 07:17:41 -10:00
Yixiang Gao 902f00b095
adding cuda TC headers (#2165)
* split cuda to renderer and add headers for tc

* fix TritonRenderer

* remove unused import
2023-10-27 14:25:59 -10:00
David Hou 7f4f925385
fix hip del on compile fail (#2163)
* fix hip del on compile fail

* the test doesn't actually work
2023-10-27 11:38:07 -10:00
Francis Lam 8cf0bb9351
optimizer: simplify GROUP and LOCAL to have one of each (#2162)
* optimizer: simplify GROUP and LOCAL to have one of each

Now that tensor cores only use LASTLOCAL, we can simplify to use
only that op everywhere.

The only use of GROUP is in matvec hand-coded opts and it doesn't
make a performance difference so switching to use only the top
behavior.

Also adds additional asserts to prevent tensor core dims from
being altered which causes bad kernels to be generated.

* search: remove duplicated actions
2023-10-27 11:37:44 -10:00
George Hotz e0201922e3
Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
will bc0829b677
Fix llama json loading (#2160) 2023-10-27 10:21:56 -10:00
nimlgen 8d41b3eb3f
beam=16 makes gpt2 gpu-time < 5ms on 3090 (#2154) 2023-10-27 10:21:27 -10:00
nimlgen 5204864eca
init cudagraph (#2153)
* init cudagraph

* linter happy

* print warning when cuda graph creation failed
2023-10-27 16:19:50 -04:00
chenyu 9215bccb41
Tensor.uniform set default to standard uniform (#2158)
* Tensor.uniform set default to standard uniform

* clean up test to reuse function
2023-10-27 16:15:30 -04:00
Roelof van Dijk 36ab04ae35
perf: lazyop as dataclass (#1603)
* perf: lazyop as dataclass

fix: linter

fix: restore eq

* use builtin methods, buffers to property to allow freezing

* fix: reduce diff

* fix: can't freeze due to KOPT tests, mypy

* fix: explicit hash

* can freeze if tests are fixed

* fix: typo

---------

Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 17:54:30 -04:00
chenyu 0ca0e9ee5e
exclude ast with variables from beam search (#2140)
* exclude ast with variables from beam search

* test that

* add to CI
2023-10-25 16:35:29 -04:00
Szymon Ożóg a52b420fb3
switch ocelot back to main repo (#2147)
* return to ocelot main branch

* cd before checkout
2023-10-25 15:14:26 -04:00
George Hotz 12dd165d38 add WINO/HALF/HIP to AMD benchmark 2023-10-25 13:22:45 -04:00
Francis Lam bf3490cdf9
wmma: refactor tensor cores using existing local dims (#2097)
* wmma: refactor tensor cores using existing local dims

* optimizer: fix bad rebase and break after one late local

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 13:10:46 -04:00
wozeparrot c29653605e
hip multigpu training (#1878)
* feat: move to hip

* feat: special path for RawBufferTransfer

* feat: initial rawbuffertransfer

* feat: hip ipc

* feat: working hip ipc

* feat: need to base device without args

* feat: close mem handle

* feat: modified test

* feat: more multihip stuff

* clean: cleanup

* feat: cleaner

* feat: don't crash

* feat: test more

* clean: way cleaner hip wrapper

* feat: barrier

* feat: barrier

* feat: this breaks stuff

* feat: we can use empty here

* feat: maybe fix tests

* feat: maybe fix tests again?

* fix: probably fix tests

* feat: no waiting here

* feat: wait here

* feat: much larger test

* feat: need to sync here

* feat: make this async

* feat: no waiting!

* feat: cut here

* feat: sync copy

* feat: random imports

* feat: much cleaner world

* feat: restore this

* feat: restore this

* clean: cleanup

* feat: set this
2023-10-24 17:35:53 -04:00
nimlgen 2e89fd264f
Refactor hipgraph (#2141)
* refactor hip graph

* linter happy

* happy liner
2023-10-24 15:45:56 -04:00
nimlgen e21bf776c8
fix debug=1 llama/gpt2 timings (#2143) 2023-10-24 15:45:00 -04:00
chenyu 4444e6d4b3
stable diffusion < 324ms (#2129)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var
2023-10-24 14:56:12 -04:00
George Hotz cea2bc7964
Add dictionary keys to reduce db size (#2131)
* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood
2023-10-24 10:49:22 -04:00
chenyu d5e2fdea22
remove shelve from handcode_resnet50_opt.py (#2139) 2023-10-24 10:37:30 -04:00
imaolo 228b310478
align cpu buffer before copy into cl buffer (#2135) 2023-10-23 21:04:35 -04:00
imaolo 6ee0435263
added from unaligned np test (#2134) 2023-10-23 11:38:57 -04:00
George Hotz 3c56c181f6 string formatting 25 -> 30 to fit 2023-10-22 10:57:34 -07:00
George Hotz 6dc8eb5bfd
universal disk cache (#2130)
* caching infra for tinygrad

* nons tr key

* fix linter

* no shelve in beam search

* beam search caching

* check tensor cores with beam too

* pretty print

* LATEBEAM in stable diffusion
2023-10-22 10:56:57 -07:00
Francis Lam ace6b2a151
optimizer: add test for correctness of opts (#2124)
* optimizer: add test for correctness of opts

Also added OptOps.UPCASTMID to constrain valid axes for opts with
group_for_reduce.

* llvm: fix LinearizerOptions to correctly not has_shared

* optimizer: remove premature test scaffold for TC opts

* search: fix the action space
2023-10-22 08:02:22 -07:00
George Hotz abeba8f1fc
optimization: get actions in CI (#2125)
* get actions in CI

* actually run the test

* pythonpath
2023-10-20 12:22:01 -07:00