Commit Graph

624 Commits

Author SHA1 Message Date
qazal 99ee2ec37a
Refactor code_for_op to accept a dtype (#2555)
* update cstyle renderers to take a dtype in code_for_op

* implement NEG for bools in LLVM

* update triton

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-01 22:05:28 -08:00
George Hotz 4c984bba7e
bump version to 0.8.0, clean CI, remove requests (#2545)
* bump version to 0.8.0, clean CI, remove requests

* why was that even there
2023-12-01 10:42:50 -08:00
nimlgen badc97f824
hip & cuda to gpuctypes (#2539)
* cuda with gpuctypes

* hip gpuctypes

* graphs

* rename + linter happy

* use cpu_time_execution

* no ji in build_kernel_node_params

* remove hip_wrapper

* hip fix

* no arc

* small changes

* no clean module in cudacpu
2023-12-01 09:25:27 -08:00
chenyu 7fec966b5e
bye bye NOOP (#2534)
* bye bye NOOP

* SIN

* NEG
2023-11-30 23:10:35 -08:00
Matthias Kronberg 5394a05b9d
Fix: Get item from ndarray before casting to int (#2525)
Directly casting is deprecated and will error in the future.
2023-11-30 18:34:31 -08:00
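
A minimal generic sketch of the NumPy deprecation behind this fix (illustrative only, not the repo's actual call site): NumPy deprecates converting an array with ndim > 0 straight to a Python scalar, so the element is pulled out first.

    import numpy as np

    arr = np.array([3])
    # int(arr) on an ndim > 0 array emits a DeprecationWarning on newer NumPy and will error later;
    # get the scalar out explicitly, then cast
    val = int(arr.item())
    print(val)  # -> 3
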
George Hotz 2c363b5f0b
new style device (#2530)
* cpu tests pass

* torch works

* works

* metal works

* fix ops_disk

* metal jit works

* fix openpilot

* llvm and clang work

* fix webgpu

* docs are rly broken

* LRU works on metal

* delete comment

* revert name to ._buf. LRU only on Compiled

* changes

* allocator

* allocator, getting closer

* lru alloc

* LRUAllocator

* all pass

* metal

* cuda

* test examples

* linearizer

* test fixes

* fix custom + clean realize

* fix hip

* skip tests

* fix tests

* fix size=0

* fix MOCKHIP

* fix thneed

* copy better

* simple

* old style metal copy

* fix thneed

* np reshape

* give cuda a device
2023-11-30 17:07:16 -08:00
Davi Silva ddeec24fa8
Cleanup & fix llama.py (#2524)
* docs, cleanup crap

* comma AI

* fix 70B

* this is why lexical scope exists
2023-11-30 16:00:17 -05:00
George Hotz 6707f2588e
use copyin (#2500)
* it's always copyin

* all RawBuffer are RawBufferCopyIn

* cleanups

* this fixes it

* requirements='C'

* more correct
2023-11-29 09:34:00 -08:00
George Hotz 5629fc368c
Use Buffer.STORE at the end of ASTs (#2494)
* work

* store broken

* interpreteds work

* this passes

* symbolic cpu

* fix tests

* fix opt tests

* images fail

* fix InterpretedFlopCounter

* stupid hack for images
2023-11-28 20:11:37 -08:00
Jake 5588922884
Update cuda_matmul.py (#2495) 2023-11-28 19:46:01 -08:00
George Hotz d87a246439
move to new cached fetch (#2493)
* move to new cached fetch

* extra.utils is over

* loads

* bump download cache

* bump timeout
2023-11-28 17:36:55 -08:00
George Hotz ab5d14d4ba
MEM -> LOAD (#2492)
* MEM -> LOAD

* keep legacy working
2023-11-28 16:46:37 -08:00
George Hotz 3f137b134a jax parallel matmul example 2023-11-28 13:48:11 -08:00
Davi Silva 186ac77ec3
Update hip_matmul.py (#2480) 2023-11-27 18:36:19 -08:00
George Hotz 9e07824542
move device to device.py (#2466)
* move device to device.py

* pylint test --disable R,C,W,E --enable E0611

* fix tests
2023-11-27 11:34:37 -08:00
George Hotz 7170a9a057
coder.py can write and run code (#2439)
* wip mistral

* coder

* touchups

* cleanups

* mistral cleanups

* clean up cache create

* download the weights, fix tests

* fix llama loading

* global fixup

* clean up all

* move llama model

* cleanups

* Revert "cleanups"

This reverts commit a71c5d59eb86290634a258704d8bab2378b8d63d.

* fine, leave it
2023-11-25 12:27:54 -08:00
George Hotz 8ff2e13550
From teeny (#2426)
* changes from teenygrad work

* support not supporting ImageDType/PtrDType

* fixups from teeny
2023-11-24 12:50:56 -08:00
nimlgen e68aebfff9
bring hip graph back (#2385)
* bring hip graph back

* share with metal

* fix linter

* remove hasattrs

* Update ops_hip.py

* hip wrapper does not use _buf

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-24 07:53:44 -08:00
George Hotz 12023b6824
onnx ops cleanup (#2413)
* onnx ops cleanup

* revert those
2023-11-23 18:39:49 -08:00
George Hotz 095e2ced61
add name support to fetch (#2407)
* add name support

* use fetch in gpt2

* remove requests from main lib, networkx also optional

* umm, keep that assert

* updates to fetch

* i love the walrus so much

* stop bundling mnist with tinygrad

* err, https

* download cache names

* add DOWNLOAD_CACHE_VERSION

* need env.

* ugh, wrong path

* replace get_child
2023-11-23 14:16:17 -08:00
George Hotz 0505c5ea50
remove force_wait, refactor to graph (#2405)
* remove force_wait

* refactor

* get rid of stupid ASTRunner

* fix del in diskbuffer

* BufferOps.FROM_UNDERLYING

* put offset in the rawbuffer

* fix bugs

* use exec
2023-11-23 12:46:07 -08:00
George Hotz 4f8f0ac139
minor cleanups, remove dead files (#2398)
* minor cleanups, remove dead files

* s.name

* use disk

* pytest passes on mac
2023-11-23 09:01:50 -08:00
George Hotz 66c75f30c6
remove triton (#2396) 2023-11-23 07:40:59 -08:00
chenyu 8798d120bb
autopad shapetracker for BEAM (#2375)
* autopad shapetracker for BEAM

* OptOps.PADTO

* skip that test for now

* correct padding reduce axis

* just 32

* avoid more than double the FLOPs

* cleanups

* test case

* no support for triton and llvm yet

* typos

* symbolic shape would not work

* cannot PADTO with MAX kernel

* advance db version

* no breaking change - don't advance db version

* is triton just python?

* Revert "is triton just python?"

This reverts commit 17e776c25587615e33a3634c2fb0bb8591ce65d4.

* Revert "Revert "is triton just python?""

This reverts commit 6c434c01e1c4b0ea0431ec18632cd859fb3cf260.

* support llvm

* is it really passing in CI only?

* update tests

* oh triton test passed

* simpler

* revert that, with a test

* check if st are the same

* Revert "check if st are the same"

This reverts commit d2a5eac110a5da1af82a2728c883779ef69c3cad.

* update the db version

* rebase artifact
2023-11-22 21:05:25 -05:00
qazal 0eda545946
dtypes.float.vec(sz) (#2386)
* replace all _dtypen with dtype.vec(n)

fix: print works

* conceptual refactor of cstyle render_load logic

* linearizer GEP is explicit that its dtype is the scalar version of localtype

* vectorized global_store and load don't need a conditional
2023-11-22 17:43:14 -08:00
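
A small usage sketch of the API named in this commit title; the import path is an assumption for this era of the codebase and may have moved since.

    from tinygrad.helpers import dtypes  # assumed import location for the dtypes namespace

    float4 = dtypes.float.vec(4)   # vectorized dtype, replacing the old dtypes._float4 style names
    print(float4)
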
George Hotz cbb8486779
ResNet training changes (update benchmark) (#2390)
* default arg for chunk

* bring back to_

* good changes

* new set

* unused hash

* fix optim

* new torch loader

* fix test lr scheduler
2023-11-22 17:41:12 -08:00
wozeparrot abbcc7aefa
missed cleanup from cache_id removal (#2376) 2023-11-21 01:03:43 -05:00
George Hotz a0890f4e6c
move fetch to helpers (#2363)
* switch datasets to new fetch

* add test_helpers

* fix convnext and delete old torch load
2023-11-19 12:29:51 -08:00
chenyu d7d078c7f9
Node.vars() returns a set and properly dedup (#2356)
* dedup RedNode.vars()

* vars returns a set

* fix more vars

* unused import

* update to_movement_ops

* comment
2023-11-18 17:44:52 -05:00
George Hotz 40246d35bc
ops_shm removed (#2351)
* ops_shm removed

* buf.cast

* err, forgot those
2023-11-18 11:41:58 -08:00
George Hotz c7b38b324b
A beautiful MNIST training example (#2272)
* beautiful mnist

* beautiful mnist example

* from tinygrad import Tensor

* more beautiful

* the jit is super core tinygrad

* globalcounters reset on jit run

* symlinks and exclude

* beautiful_cartpole

* evaluate is its own function

* no symlinks

* more beautiful

* jit reset for double speed

* type hinting for JIT

* beautiful_mnist gets 98%

* beautiful_mnist < 4s with BEAM=2

* better cartpole

* use actor critic

* zero_grad got lost

* delete double relu

* stable cartpole with PPO

* beautiful_cartpole is more beautiful

* REPLAY_BUFFER

* beautiful stuff typechecks

* None support in shape

* hp tuning
2023-11-17 19:42:43 -08:00
chenyu d2c0035c73
add back as_strided, move rebuilt mops to extra (#2344)
* add back as_strided, move rebuilt mops to extra

* negative stride for ops_cpu

* Revert "negative stride for ops_cpu"

This reverts commit a13b6815ac31478d31ae71c26f4d4e4d274bf155.

* skip that

* style
2023-11-17 14:34:30 -05:00
George Hotz 652d2de256
wow how did i think that was okay (#2339) 2023-11-16 21:21:11 -08:00
chenyu 822d6e6f18
Simpler mops verify (#2325)
* rewrite the to_movement_ops check using symbolic

* tweak
2023-11-15 21:47:18 -05:00
forcefieldsovereign b64738e1d6
Remove AS_STRIDED from shapetracker (#2216)
* very close

* remove comment

* negative strides working

* almost everything passes

* calculate offset with list comprehension

* some cleanup

* got disk load working

* review suggestions

* fix after merge

* overlap working

* did it

* clean

* fixed disk load

* lint

* mypy

* removed as_strided

* trying without simplify

* added back simplify

* make sure expanding to smaller shape

* cleanup

* removed comment

* removed env file

* trying whisper test again

* onnx test sqlite issue

* working on test

* finished test

* eliminate unnecessary shrink-then-pad

* don't shrink buffer

* added strides check

* added to ci under linters

* switch issue

* allow symbolic stride

* removed .env

* isinstance

* adjust strides for double expand

* cleanup

* needed to add type hint for mypy

* set pythonpath
2023-11-15 15:50:17 -05:00
geohotstan 3c5a51fb3a
aaaaaaa finally (#2310) 2023-11-15 07:12:38 -08:00
George Hotz 4f7b1ac0d2
cleanups before interpreted jit (#2306)
* jit mnist

* InterpretedFlopCounter doesn't rely on Interpreted

* allocator for cpu and torch

* types for exec_ast

* fix type issues

* fix onnx, remove print

* always self.from_underlying
2023-11-14 21:44:25 -08:00
nimlgen 4e0d47533e
beam works with var vals (#2296)
* beam works with var vals

* test passes now

* better comment

* linter happy
2023-11-14 13:03:19 -05:00
George Hotz 0cbf6c1811
move things, clean up extra (#2292)
* move things

* idk why pylint needs that now

* delete unused
2023-11-13 20:18:40 -08:00
George Hotz b1f7f29525
metal indirect command buffers (#2285)
* metal indirect command buffers

* sub 1ms gpt

* metal batch exec is good

* remove whitespace

* input_replace

* fix ci

* useResources

* very simple cacheallocator

* update_stats

* fix CI

* minor

* remove that from jit
2023-11-13 17:58:26 -08:00
rodfer 53c5baa8b6
add dilation to avg_pool2d (#2270)
* add dilation to avg_pool2d

* avg_pool_fix

* avg_pool_fix

* woo

* oops

* force it correct

---------

Co-authored-by: rodfer0x80 <rodfer0x80@proton.me>
Co-authored-by: zibokapi <zibokapi@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-13 08:47:56 -08:00
valar 123ea051e6
refactor/ci: delete many `# type: ignore` (#2281)
* refactor/ci: delete many `# type: ignore`

* replace `axis.__class__ is int` with `isinstance(axis, int)` to make mypy happy
* add `--warn-unused-ignores` to mypy flag

refs #2240

* ci: move `--warn-unused-ignores` flag to mypy config

refs #2240
2023-11-12 11:04:20 -08:00
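
For reference, why the `isinstance(axis, int)` spelling is preferred over `axis.__class__ is int` (plain Python, nothing repo-specific):

    axis = 1
    assert isinstance(axis, int)          # idiomatic check, and the form mypy reasons about
    assert axis.__class__ is int          # works here, but...
    assert isinstance(True, int)          # ...bool is an int subclass: isinstance still matches,
    assert not (True.__class__ is int)    # while the identity check silently rejects it
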
geohotstan b853e9bb8c
Onnx 1.15.0 gogogo (#2217)
* lol

* lol

* add GELULULULUL

* onnx 1.50

* fuk torch bool neg

* exclude regex tests

* exclude dequantizelinear for now

* is sunny in philly

* damn it affinegrid

* fixed auto_pad VALID

* skip 0 shape tests

* add temporary cast in Reduces

* tests should pass now

* added comments and cleanup

* try moving dequantizelinear to onnx.py

* fixed dequantizedlinear?

* cleanup

* try?

* float16 segfaults LLVM CI..???

* cleanup comments

* pin to 1.50.0

* remove use of -np.inf cuz numpy is kill

* 1.50? lol I'm actually retarded

* thx for review, muhbad

* moved Gelu higher up
2023-11-10 15:36:48 -08:00
chenyu a753c8e071
examples of new GPT2 and JIT change (#2261)
* var_vals are global

* working with global ish

* better

* fix export model

* fix tests

* better kv cache

* does it run?

* use where for kvmask

* fix excessive var_vals

* fix import

* how does multigpu use this?

* llama kinda work

* faster and simpler

* cleanup

* fix conversation mode

* test cleanups

* fix one more test

* test cleanup

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-11-10 15:07:02 -05:00
George Hotz 80bf0b8586
proper wmma (#2245)
* proper wmma

* hip cast

* bugfixes

* bugfix

* that bug is fixed

---------

Co-authored-by: George Hotz <george@tinygrad.org>
2023-11-09 15:15:18 -08:00
wozeparrot 4c44d1344b
feat: remove cache_id (#2236) 2023-11-08 08:09:21 -08:00
Rory Clear 553688f12a
update metal matmul and matvec for compile api (#2238) 2023-11-08 08:08:35 -08:00
George Hotz 2f7aab3d13
move optimize_local_size (#2221)
* move optimize_local_size

* interpret_ast
2023-11-05 21:00:52 -08:00
chenyu f582ec56d5
Replace (getenv("CI", "") != "") with helpers.CI (#2213) 2023-11-03 15:20:44 -07:00
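
The pattern here is computing the flag once at import time instead of repeating the env check; a rough sketch of the idea (the real `helpers.getenv`/`CI` definitions in the repo may differ):

    import os

    CI = os.getenv("CI", "") != ""   # evaluate once; call sites just test helpers.CI

    if CI:
        print("running in CI, using the reduced test matrix")
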
George Hotz f17bc16f46
simple runtime args (#2211)
* simple runtime args

* fix some tests

* fix abstractions and triton

* fix search
2023-11-03 12:31:29 -07:00
George Hotz ddbc6eecaf
some refactors in the realization (#2206)
* some refactors

* delete old kernel search
2023-11-02 19:51:28 -07:00
George Hotz 03cf0afa4f
move all to compile api (#2203)
* move metal+clang to compile api

* all to the new style

* remove binary arg

* fix triton

* fixup tests

* fix clang

* diskcache is generic

* __wrapped__

* compile_gpu

* fix thneed

* keep the src in the ASTRunner

* lib

* move compile_gpu

* compile_gpu in device

* put compiler in astrunner

* test reverts

* triton compiler

* ugh, that too
2023-11-01 23:01:32 -07:00
George Hotz 8932816816
remove arm64, caching for cuda (#2201)
* remove arm64, caching for cuda

* caching in llvm

* switch cache_compiled to new cache

* fix clang

* caching for metal

* fix pylint

* cleanups

* perf_counter and binary
2023-11-01 18:44:00 -07:00
George Hotz 7103b716c4
merge kernel and optimizer (#2200)
* merge kernel and optimizer

* linearize is reentrant

* move global/local size

* clean up linearizer copy

* remove unneeded lin copies

* stop linearizing twice

* oops, that should be None
2023-11-01 15:20:01 -07:00
George Hotz 33bb650e94
use mad in opencl (#2198)
Co-authored-by: Comma Device <device@comma.ai>
2023-11-01 10:40:08 -07:00
Comma Device 2e9982fe2d fastvits example that's 10% faster 2023-10-31 21:48:23 -07:00
George Hotz 8ba7ced7f9
extract const if it's const (#2193)
* extract const if it's const

* fix if statement

* fast math issue

* fix graphing and casting

* disable flaky copyout test
2023-10-31 18:52:35 -07:00
George Hotz 5aaa8a0cc1 fix shape 2023-10-31 11:36:19 -07:00
George Hotz a27c9f9de5
openpilot compile2 (#2189)
* try compile2

* pass to thneed

* fix tanh onnx
2023-10-31 11:08:58 -07:00
forcefieldsovereign f294bdd681
fixed imports (#2185) 2023-10-30 22:07:17 -07:00
Akshay Kashyap 018bd29e37
Enable Multi-Output Export (#2179)
* Enable Multi-Output Export

* Add test

* Update examples and lint

* fix padding

* test ops

* dummy commit to rerun test

* revert cuda lint

* Enforce tuple/list of tensors

* subscripted generics

* put back webgpu test

* Re-enable WebGPU Efficientnet test
2023-10-30 18:42:26 -07:00
chenyu 6c58bf3e9c
in time_linearizer, allocate a scratch buffer if output buffer is also input (#2152)
* in time_linearizer, allocate a scratch buffer if output buffer is also input

* move scratch buffer creation outside search
2023-10-28 07:17:41 -10:00
George Hotz e0201922e3
Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
chenyu 0ca0e9ee5e
exclude ast with variables from beam search (#2140)
* exclude ast with variables from beam search

* test that

* add to CI
2023-10-25 16:35:29 -04:00
wozeparrot c29653605e
hip multigpu training (#1878)
* feat: move to hip

* feat: special path for RawBufferTransfer

* feat: initial rawbuffertransfer

* feat: hip ipc

* feat: working hip ipc

* feat: need to base device without args

* feat: close mem handle

* feat: modified test

* feat: more multihip stuff

* clean: cleanup

* feat: cleaner

* feat: don't crash

* feat: test more

* clean: way cleaner hip wrapper

* feat: barrier

* feat: barrier

* feat: this breaks stuff

* feat: we can use empty here

* feat: maybe fix tests

* feat: maybe fix tests again?

* fix: probably fix tests

* feat: no waiting here

* feat: wait here

* feat: much larger test

* feat: need to sync here

* feat: make this async

* feat: no waiting!

* feat: cut here

* feat: sync copy

* feat: random imports

* feat: much cleaner world

* feat: restore this

* feat: restore this

* clean: cleanup

* feat: set this
2023-10-24 17:35:53 -04:00
nimlgen 2e89fd264f
Refactor hipgraph (#2141)
* refactor hip graph

* linter happy

* happy liner
2023-10-24 15:45:56 -04:00
George Hotz cea2bc7964
Add dictionary keys to reduce db size (#2131)
* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood
2023-10-24 10:49:22 -04:00
George Hotz 6dc8eb5bfd
universal disk cache (#2130)
* caching infra for tinygrad

* non-str key

* fix linter

* no shelve in beam search

* beam search caching

* check tensor cores with beam too

* pretty print

* LATEBEAM in stable diffusion
2023-10-22 10:56:57 -07:00
George Hotz abeba8f1fc
optimization: get actions in CI (#2125)
* get actions in CI

* actually run the test

* pythonpath
2023-10-20 12:22:01 -07:00
Sean D'Souza 999c95ea29
fix: hlb cifar types (#2099) 2023-10-17 19:23:50 -07:00
Ahmed Harmouche 2b5ea7d9cb
Fix output Float32Array size in webgpu export (#2096) 2023-10-17 15:28:19 -07:00
Szymon Ożóg 4bef1591f0
Disable ocelot cache + fix matvec in triton (#2010)
* Revert "disable flaky triton test"

This reverts commit 1e15fdaee7.

* Update test.yml

* check if has shared for matvec

* disable ocelot cache for triton

* disable ocelot cache

* disable ocelot cache

* pass shared to triton uops tests

* temporary debugs for CI crash

* Revert "temporary debugs for CI crash"

This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542.

* Revert "triton isn't tested, and allows this refactor (#2007)"

This reverts commit dea8bb0938.

* add runtime_args to every renderer, move triton local size override to runtime args

* Add binary to args, correct type returned

* update to new loops

* Update test.yml
2023-10-17 10:33:32 -07:00
geohotstan 5ed630204b
Add ONNX to CI for other backends (#2069)
* some cleanup

* move continue back

* more more more

* added to CI

* try

* try intentionally break some tests

* wtf

* del True for test

* yay tests broke, now pls no break

* try AGAIN

* gahy

* lol

* try

* move over constant

* moved over MORE

* move shrink over

* trailing lines

* try CUDA CI

* try again

* boom

* oops

* improved comments

* try: disable some flags and disable CUDA

* try breaking tests

* traceback has too much info so add --tb=no

* revert forced CI failure

* add comments and del unused imports

* oooooooo using regular debug try enable tb

* intentionally break tests

* added tb back. Maybe not too verbose

* strip whitespace

* missed something

* Shape op int32 -> int64

* oops missed something

* add some types

* get rid of crazy 1 liners in pad op

* actually test Split this time LOL

* strip that whitespace
2023-10-17 09:33:54 -07:00
George Hotz 1bf4aef0f5
fix image dtype cmp (#2089)
* fix image dtype cmp

* print that with debug 3
2023-10-16 17:52:38 -07:00
George Hotz a7b18ac325
try beam search on device (#2085)
* try beam search on device

* fix beam with nolocals

* ops too

---------

Co-authored-by: Comma Device <device@comma.ai>
2023-10-16 12:52:42 -07:00
George Hotz c36d306606
KOPT is over, BEAM is upstream (#2071)
* create cache for q learning

* make linter happy

* global beam

* where it belongs

* bugfix

* ditch the kopt, use the beam

* faster lin and DEBUG=2 okay

* remove kopt, move search to features
2023-10-16 09:46:03 -07:00
George Hotz 5472a14544
openpilot compile2 (#1977)
* start compile2

* tweak

* why are there two more kernels?

* minor cleanups

* don't break onnx tests

* add __metadata__ support to safetensors

* no early realize in onnx

* cleanups

* bugfix

* clean up image type, add optimize

* opt to match old

* try that

* opt work

* run compile2

* optimizer

* prt more

* prerealize

* imp

* NOLOCALS works

* no locals means no locals

* support fractional globals

* all locals welcome

* int that

* cleanups

* show gemv regression

* clean up diff

* use idx for the cond

* nolocals

---------

Co-authored-by: Comma Device <device@comma.ai>
2023-10-15 20:39:46 -07:00
George Hotz 49bcfec383
0s in the action space (#2070)
* 0s in the action space

* simpler

* skip duplicate actions
2023-10-14 11:22:48 -07:00
George Hotz 4124cf1df5
cleanup tensor cores, expose exclude local upcast (#2064)
* expose exclude_local_upcast

* convert apply tensor cores to ops

* update comment

* put LOCAL back to what it was, BEAM is better that way
2023-10-14 09:21:03 -07:00
George Hotz 90c777d815
remove apply_auto_opt (#2063) 2023-10-13 07:44:14 -07:00
George Hotz 6f1810af2d
with unroll, the action space goes from 161 -> 127 (#2060)
* with unroll, the action space goes from 161 -> 127

* more reliable instrumentation

* beam search is so op

* beam bugfix
2023-10-12 20:52:23 -07:00
George Hotz c5edb3c374
train value net, improve API, add BCE (#2047)
* api cleanups, BCE losses

* valuenet

* fixup examples

* learning okay

* add valuenet runner

* net improvements

* net improvements

* 40% win rate
2023-10-12 07:56:38 -07:00
George Hotz 0ba629c7b9
add world dataset (#2045) 2023-10-11 15:54:30 -07:00
George Hotz 0c3b6f13a8
Latest opt (#2044)
* split out actions

* rl algorithm
2023-10-11 15:46:14 -07:00
George Hotz 41bfeb2c1e
start work on auto opt (#2034)
* start work on auto opt

* lin failure

* not beating hcopt

* greedy

* timing is fast

* codegen.search

* greedy search in handcode_opt

* track running gflops

* clean up those files

* no failure
2023-10-11 12:54:53 -07:00
chenyu 1c980517c5
s/var_vals_from_ast/vars_from_ast (#2038) 2023-10-10 20:21:55 -07:00
George Hotz f139060103
Rewrite hand coded opt with action space (#2030)
* tests passing

* hand coded opt with new abstractions

* simpler opts

* split out tensor cores
2023-10-10 07:38:38 -07:00
George Hotz 16ca8410f8
op logger + replay (#2021)
* logops

* fix dtype printing

* needs inf

* ops dataset

* minor improvements

* 12k kernels

* opt can compile

* graph flops
2023-10-08 15:10:18 -07:00
George Hotz 8db92bd060 fix tvm gemm example 2023-10-08 05:57:41 -07:00
Francis Lam dece9958f8
wmma: clean up to make WMMA arg order consistent (#2014)
also add cache defeat to extra/gemm/simple_matmul.py
2023-10-07 17:45:40 -07:00
George Hotz 6ee9cae44f don't extract CIFAR every time / use the cache 2023-10-07 12:33:50 -07:00
George Hotz dea8bb0938
triton isn't tested, and allows this refactor (#2007)
* triton isn't tested

* cuda buffer
2023-10-07 07:29:59 -07:00
Roelof van Dijk 26fcc8dff6
fix: remove runtime imports (#1982)
fix: import what is used

probably monkeypatched

fix: import

revert selective import
2023-10-07 05:23:08 -07:00
George Hotz f54959e5cd
move print tree into graph (#2003)
* move print tree into graph

* add winograd profiling test

* change pre-commit to run ruff first
2023-10-07 04:39:21 -07:00
Ahmed Harmouche 2114dc13d1
Allow multi-input model export (#1995)
* Allow multi-input model export

* Add model export unit test

* Fix efficientnet compilation

* Only run model export test on JIT supported devices

* Skip export model test if not EXPORT_SUPPORTED_DEVICE
2023-10-07 04:13:34 -07:00
George Hotz ffa33d743a
good changes from openpilot_compile2 (#2000)
* good changed from openpilot_compile2

* float32 image type was wrong

* cleaner way to write that + a test
2023-10-06 13:33:24 -07:00
Francis Lam 0ba75c4370
optimizer: add matvec optimizations (#1972)
* optimizer: add matvec optimizations

* renderer: fix alignment of shared memory in opencl
2023-10-04 14:16:27 -07:00
George Hotz de5d603ec1
corealize + remove realize from lazybuffer (#1968)
* corealize + remove realize from lazybuffer

* fix multigpu

* fix graph
2023-10-04 10:59:31 -07:00
nimlgen 2ea1dd3e87
no process() in Linearizer (#1966)
* no process() in Linearizer

* more process() clean up
2023-10-04 07:18:42 -07:00
George Hotz 717451a244
Revert "optimizer: add matvec optimizations (#1753)" (#1959)
This reverts commit f520323054.
2023-10-03 00:28:42 -07:00
Francis Lam f520323054
optimizer: add matvec optimizations (#1753)
* optimizer: add matvec optimizations

* Update optimizer.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-03 00:01:59 -07:00
David Hou 8e9db88474
expand after expr_idxs in Linearizer.global_load (#1818)
* small changes

* expand in terms of substitute, directly expand g_idxs g_valid

* delete expand_ops

* don't compare using hash

* any instead of in

thanks gijskoning

Co-authored-by: Gijs Koning <gijs-koning@live.nl>

* support tc

* testing code

* no more create_rednode

* maxsize none in view/node

* oops

* undo

* typing

* oops

* oops

* lmao

* lmao

* add expand multi test

* Node.iter_idxs

* type

* type

* delete checks!

* clean up a little?

* expand_idx in symbolic

* un-golf

* play around with types >.>

* test_substitute and also remove an incorrect test?

* get rid of range

* Update symbolic.py

* split out view cache change

* split out flat components change

* reduce diff

* reduce diff

* add some float4 tests

* fix

---------

Co-authored-by: Gijs Koning <gijs-koning@live.nl>
2023-09-29 10:33:34 -07:00
Francis Lam f445e056ed
wmma: add test and tensor core shape (#1925) 2023-09-28 18:04:28 -07:00
Yixiang Gao 094d3d71be
with Tensor.train() (#1935)
* add with.train

* remove the rest TODOs

* fix pyflake

* fix pyflake error

* fix mypy
2023-09-28 18:02:31 -07:00
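
Usage sketch of the context manager this PR adds, assuming it simply toggles `Tensor.training` for the duration of the block:

    from tinygrad.tensor import Tensor

    assert not Tensor.training
    with Tensor.train():
        assert Tensor.training   # dropout/batchnorm behave in training mode inside the block
    assert not Tensor.training   # flag is restored on exit
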
George Hotz c36d0e3bd8 tvm import hook 2023-09-28 09:24:32 -07:00
George Hotz adab724caa
schedule2, keep the tests working with small changes (#1932)
* lazy cleanups

* ast functions take in LazyOps

* op instead of self.op

* _base for mops

* fix contiguous

* start schedule

* test_schedule

* fix openpilot

* more tests

* bugfix and test skip

* work

* make sure things get freed

* fix zerosized tensors

* fix failing test

* fix ceil and friends

* fix openpilot

* disable training

* disable test collectives
2023-09-28 09:14:43 -07:00
nimlgen 45f02393f0
HipGraph support (#1880)
* init hip graph

* optimize args update

* cache symbolic in jit

* remove NOSTAT

* init BasicBatchExecutor

* symbolic infer cache per jit instance

* basicbatchexec is default for compiled

* batch_exec is taken from ASTRunner

* no infer cache

* batched execution of hip graph

* add comment about hip graph batches

* readable hip graph
2023-09-24 20:14:36 +08:00
Szymon Ożóg 58296c079d
Make Triton work again (#1547)
* Move ops_triton to runtime and remove errors from deprecated code

* Remove deprecated AST Kernel

* Remove deprecated buffer

* Add TritonProgram

* Triton Buffer

* Use RawCUDABuffer

* triton_compile

* Added new parameter

* pass _buf to program

* remove deprecated include

* Added triton tests

* Deprecated includes removed

* remove double print

* Disable float4 support

* Disable float4 support

* variable load fix

* Track local size

* Add pycuda to triton dependencies

* Merge test.yml

* install cuda packages for testing

* merge double package install

* remove emulated from triton tests

* upscale local index to power of 2 and add masking

* cuda envs

* Add TernaryOps

* ConstOp loading

* proper function name

* remove deprecated variables

* get global program from name

* const ops match local shape

* Enable test_nn

* remove deprecated import

* fix linter error

* Add wait logic

* Add local size override

* accumulate local shapes instead of using max shape

* Merge triton tests into global tests

* fix envs in testing

* Old testing routine

* split file into renderer and program

* remove print and starting whitespace

* pretty ptx print on debug 5

* linter errors

* ignore triton saturation tests

* ignore test example

* remove pytorch cpu extra index

* Add triton to existing testing routine

* use triton tests

* disable cuda backend in triton tests

* use cudacpu in tests

* print used device

* Print device default

* Remove print

* ensure we are running triton backend

* update variable signatures

* update dtypes for load

* infinity render fixed

* limit global size

* negative infinity now properly rendered

* split chain with parentheses for and node

* Add option to disable shared memory, disable for triton

* missing import

* Properly index and mask conditional load

* use mask only if not loading a block pointer

* nan support

* fix symbolic tests to include chain split

* proper masking for stores

* Implemented bool dtype

* Add mod

* fix loads for variables with valid range

* merge triton with cuda runtime

* merge from master

* run triton tests with cuda

* Correct target when running from triton

* conftest with triton compiler config

* use triton nightly

* verbose tests for triton

* capture stdout

* fix function depth when exiting multiple loops

* add render valid function for readability

* fix mask for local loops

* add _arg_int32 datatype

* fix dims for conditional loads

* enable non float stores

* correct variable dtypes

* fix type for arg_int32

* remove junk

* Added get max function for range based var.max

* remove deprecated code

* Fix triton ptxas path

* Fix testing for CI

* clamp local size by max local size instead of always running max

* Disable matmul test in triton cpu

* rerun tests

* Disable broken test in triton cpu

* whitespace removed

* rerun tests again

* Disable TestSymbolicOps for triton

* update to new uops

* linter fix

* ignore test/extra

* linting fix

* Update tinygrad/renderer/triton.py

Co-authored-by: Gijs Koning <gijs-koning@live.nl>

* remove deprecated line

* quotes type fix

* linter

* Remove unnecessary lines

* UnaryOps.NEG

* dont define constants

* Linting fix

* Disable tests that are broken in ocelot

* remove trailing whitespace

* reduce line count

* linting fix

* update to new uast

* New looping style

* Update to new uast

* make AST runner work with triton

* linting fix

* set renderer var for testing

* disable local for ocelot

* reenable all tests for ocelot

* Pass shared to cuda

* Don't group if the backend doesn't support shared mem

* use working gpuocelot branch

* enable all tests

* enable local for ocelot

* cleanup

* Update test.yml

* update cache key

* reenable test symbolic and extra

* Update test.yml

* Revert "Update test.yml" (rerun tests)

This reverts commit 98c0630ee5da4379e5c6b2437a5145fe87058c35.

* Revert "fix symbolic tests to include chain split"

This reverts commit 22a9a4c9cd14d23735e6540c8d90ee005ac4ea17.

* Revert "split chain with parentheses for and node"

This reverts commit 7499a7004ef4db785d0cd05cf292fdeff65ca90d.

* use global size from linearizer

* rename newvar to dtype to match other renderers

* join program start lines

* simplify code that adds axis to local dims

* assign r[u] in ssa

* We no longer need to replace target in src

* we no longer need to cast indices to int by hand

* Update triton.py(rerun tests)

* Update triton.py(rerun tests)

* Update triton.py(rerun tests)

---------

Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-09-23 14:17:12 +08:00
qazal d0e752003d
fixes (#1893) 2023-09-22 07:20:27 +08:00
wozeparrot 009a99a0b1
feat: way cleaner hip wrapper (#1895) 2023-09-22 07:20:03 +08:00
kormann 864746d6aa
polish print_tree (#1868)
* fix

* isinstance
2023-09-21 11:13:10 +08:00
chenyu 3ec301c2d7
apply view.py patch (#1844) 2023-09-10 17:32:15 -07:00
kormann 7ac65a93b4
utils.printtree (#1816)
* utils.printtree

* linter compliance

* rename to print_tree
2023-09-07 23:08:57 -07:00
George Hotz 4613c9e77c
add tvm example, formatting (#1813)
* add tvm example

* no realize
2023-09-07 11:50:41 -07:00
Pavol Rusnak 52a92bf95d
use class Foo: instead of class Foo(): (#1797)
* use class Foo: instead of class Foo():

* add ruff linter, copy settings from .flake8 to ruff.toml
2023-09-06 12:20:25 -07:00
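
The style change itself is purely cosmetic; both spellings define the same class (plain Python):

    class Foo:      # preferred: no redundant empty parentheses
        pass

    class Bar():    # equivalent, but the redundant parentheses are what this cleanup removes
        pass

    assert Foo.__bases__ == Bar.__bases__ == (object,)
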
geohotstan 9af5645ba3
onnx full passing (#1076)
* 1

* 83 failed

* learning how git works

* lol idk

* zero shape aaaa

* space lol

* aaa

* test check

* haha

* fixed gather

* 73 failing

* 71 failing

* 68 failing

* added some debug

* fking resize

* lol

* 62 failing

* 58 failing, fucking did nearest resize hell yeah

* clean up

* 56 failing

* janitor duty

* lol

* 53 failing

* hi mom

* 50 failing

* added linear interp, but coord_trans is wrong

* did lin interpolation woohoo

* 43 failing

* 40 failing

* temporary Gather fix

* 39 failing

* fixed slice onnxver<10

* 37 failing

* 35 failing

* excluded tests that use float64

* 32 failing with hacks

* added _batchnorm() for 3D 5D batchnorm, 29 failing

* changed ALLOWED_KERNEL_COUNT from 199 to 207

* added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit

* support Round op

* added storage_order/indices maxpool, 27 failing

* support maxunpool, 25 failures

* support Gradient, 23 failures

* merged new where

* added Adam

* cleanups

* added Momentum and Nesterov Momentum

* added Adagrad

* support sequence_type, 20 failing

* ugh git

* I give up on cubic interp :D, 9 failing

* sexy 1 liner gather, much improved, wow

* polished gather to make it shine bright like a diamond

* clean 1 liner for gather

* improved readability of gather

* uhh

* clean up

* more clean up

* WHITEspace

* implemented SoftmaxCrossEntropyLoss op

* added comments and cleaned up if statements

* update

* thank based wozeparrot for pow and new GatherElements

* CPU and TORCH all pass | cast float64 -> float32 for all fromCPU()

* _nearest_gather() failing on yolo

* reverted ops_cpu change and added assert in Resize

* added comments for resize for multiple channels

* oops

* merge

* test

* switched np.pad to Tensor.pad for constant padding

* gah

* gah2

* sexy reflect pad with movementops -> add

* delete commented out lines

* edge mode pad sexy as well

* trying out model_benchmark

* revert gitignore change lol

* init

* Revert "init"

This reverts commit 682bf2073a8b4eca111596c67cf6ebd79f59e585.

* wrote cast workaround for CPU, CPU and TORCH all pass

* wrote cast workaround for CPU, CPU and TORCH all pass

* skipped tests w/ 0 shape for METAL and GPU

* excluded tests for CLANG, CPU, TORCH, CLANG pass

* fixed hacky ConvTranspose

* gotta figure out autopad

* UOps.STORE support cast bool -> float

* small fix for fast gather

* reverted 0 shape skipped tests

* oops missed a file

* added comment

* fixed slice op hack

* First commit to pr

* More trig ops

* More trig ops

* format

* isinf support

* More ops

* changed onnx_ops to use our new gather :D

* Det op bug fix

* rebase

* fixed some tests

* det broken and slow

* fixed compress to use new gather

* implemented argmax argmin

* support variable types in type_proto

* support Upsample and Identity sequence

* we support float64 now and tinygrad supports automatic broadcasting

* added EyeLike op

* resize does support multiple channels now actually

* yolov8 onnx runs successfully

* added batch size 1

* oops

* finally fixed type_proto I think

* fixed some llvm bugs

* del whitespaces

* added ZenginU Format PR

* test

* oops

* added float64 exclude tests back

* more skipped tests

* try

* ok openpilot pass

* flake8 pass

* woooooohooo

* revert external_model_benchmark changes

* perf tested gather

* removed promote types from ops_cpu

* numerical errors from 1681 is fixed

---------

Co-authored-by: ZenginU <umutzengin00@gmail.com>
2023-09-05 13:23:32 -07:00
George Hotz 56abe04e4b
disable assembly (#1755) 2023-09-04 09:41:20 -07:00
wozeparrot bf05534c6e
hip multidevice (#1728)
* feat: hip multidevice support + p2p

* feat: default device
2023-09-01 06:46:13 -07:00
Karan Handa a8aa13dc91
[ready] Replacing os with pathlib (#1708)
* replace os.path with pathlib

* safe convert dirnames to pathlib

* replace all os.path.join

* fix cuda error

* change main chunk

* Reviewer fixes

* fix vgg

* Fixed everything

* Final fixes

* ensure consistency

* Change all parent.parent... to parents
2023-08-30 10:41:08 -07:00
nimlgen 1c0449e190
add cache collector (#1595)
* init cache collector

* add test_cache_collector.py

* switch GlobalCounters.cache to CacheCollector

* init jit models test

* jitted SD

* add debug msg to print loaded bufs count

* moved cache collctor to jit

* clearer SD

* no double device import
2023-08-28 19:59:55 -07:00
George Hotz a6d842af7a
move device to ops (#1646)
* move device to ops

* mlops types

* 2 lines
2023-08-23 08:30:17 -07:00
George Hotz 718ced296c
move state to nn/state (#1619) 2023-08-22 07:36:24 -07:00
Umut Zengin f720682beb
np.argmax to Tensor.argmax (#1608)
* to tensor argmax

* removed keepdim

* training update
2023-08-21 15:22:29 -07:00
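
A small sketch of the replacement, assuming `Tensor.argmax` mirrors the numpy signature as the PR title suggests:

    from tinygrad.tensor import Tensor
    import numpy as np

    x = Tensor([[1.0, 5.0, 3.0], [2.0, 0.0, 4.0]])
    print(np.argmax(x.numpy(), axis=1))   # old: round-trip through numpy -> [1 2]
    print(x.argmax(axis=1).numpy())       # new: stays a Tensor op -> [1 2]
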
Yixiang Gao 4d54afb6df
sparse cat cross entropy (#1597)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs

* fix training loss

* add device
2023-08-21 14:14:54 -07:00
George Hotz 2e60920317
Revert "sparse cat cross entropy (#1591)" (#1596)
This reverts commit f0ee850e98.
2023-08-21 10:04:26 -07:00
Yixiang Gao f0ee850e98
sparse cat cross entropy (#1591)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs
2023-08-21 09:56:41 -07:00
Yixiang Gao 8d6662a741
.cpu().numpy() -> .numpy() (#1594)
* .cpu().numpy() -> .numpy()

* restore ops_torch

* restore test_speed_v_torch
2023-08-21 09:53:29 -07:00
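
What the rename means at call sites, assuming `.numpy()` now performs the device-to-host copy itself:

    from tinygrad.tensor import Tensor

    t = Tensor([1.0, 2.0, 3.0]) * 2
    print(t.numpy())   # replaces the old t.cpu().numpy() round-trip -> [2. 4. 6.]
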
George Hotz e464442adf
WMMA for 7900XTX (#1563)
* go

* hip no LRU

* work

* works

* 16 TFLOPS

* 29 TFLOPS

* 30 TFLOPS

* never mind, it's 60 TFLOPS

* fix metal WMMA

* put hip alloc back
2023-08-19 09:07:23 -07:00
chenyu ae39cf84ab
Symbolic Shape JIT main PR (#1353)
* Symbolic Shape JIT

update tests

2 variables symbolic ops, adding more tests

test passing

cleanup

* more test cases

* single flag

* review update

* jit attention one piece

* realize

* symbolic_jit test for cuda

* old artifact

* works with cuda gpu but failed ci

* CUDACPU
2023-08-18 14:39:55 -07:00
wozeparrot 50decf0d45
train cifar using multigpu (#1529)
* feat: train cifar using multigpu

* feat: split eval batch across 5

* feat: cleaner allreduce

* feat: 93.88%

* feat: cleaner batch chunking from bert

* feat: cleaner grad sync

* feat: tinygrad argmax

* feat: make it work with different gpu counts

* feat: move some stuff into the normal __init__

* feat: autodetect gpu count

* feat: move import inside
2023-08-18 09:35:44 -07:00
wozeparrot 15150d60c4
fix: small fix for lru on hip (#1567) 2023-08-18 09:18:38 -07:00
Ethan Sorrell cb62911f6b
PTX Reintegration and Passing Tests (#1512)
* move assembly, assembly_ptx

* successful but broken rendering of ptx asm

* clear ins before render asm

* slightly less broken :')

* we needed thread syncs

* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half

* Fix runtime_args for gpuocelot

* our casts were flipped on both ends

* more casting

* add ternary where op

* dealing with storing/loading bool

* add test for casting to bool from negative

* Fix args.valid on ConstOp

* add to CI, TODO: fix runtime_args for test_uops

* fix placement of runtime_args to work with lazy.Device

* undo ci changes so I can push

* fix lints

* start cleanup and fix things we broke fixing lints

* add checks for PTX specifc asm instructions

* revert added test -- doesn't pass on llvm

* skip tests for underflow,overflow

* another fix for how we're setting runtime args

* Less broken cleanup

* add to CI

* add more env variables for ci test

* fix ci to install pycuda for ptx

* ci: copy cuda test command

* cleanup

* assert to make sure we're actually running ptx in ci

* remove test assert

* move is_ptx arg

* move assembly, assembly_ptx back to extras

* fix imports

* initial merge fixes

* clear registers, fix UOps.LOAD with invalid value

* draft merge fixes

* remove prints

* quick lint and merge fixes

* cleanup

* remove PTXProgram wrapper

* final cleanup

* temp change for ci rerun

* ci rerun

* rollback ISA version
2023-08-16 16:20:20 -07:00
JaSpa99 491e85597a
Run onnx commavq model (#1537)
* try to run commavq

* fix 0 dim, start implementing new ops

- Implement EmbedLayerNormalization
- Implement Attention

* SkipLayerNormalization and FastGelu

* use original torch model, cast inputs

* fix some ops:

- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu

* cleanup onnx_ops.py

* add validation option to benchmark

* cleanup imports

* add checks incase onnx2torch implements ops in future

* run onnx instead of original torch

* just skip gpu on m1

* reactivate the other models

* check for strange params & squash whitespace

* cleanup

* fix causal mask Attention

* Range doesn't need int cast

* embedding vocab_counter same dtype as input

* no need to cast

* always validate, fix PosixPath ort

---------

Co-authored-by: George Hotz <george@comma.ai>
2023-08-16 12:24:40 -07:00
George Hotz f8109b830c
promote assembly to the main codebase (#1544)
* promote assembly to the main codebase

* not namedtuple
2023-08-14 22:47:45 -07:00
Steven Anderson 93a36c3659
Arm (#1421)
* testing new memops

* better debugging

* testing padded conv

* branching with load

* refactoring a bit

* first try

* fixing bugs

* fixing some

* eq

* eq2

* do not use x's

* working

* fixing imm

* getting things working

* refactor

* pow not working

* working except one

* refactor: one store mem

* refactor: global load

* refactor: imm

* refactor: cleaning

* fixing big offsets

* refactor with ci

* try ci

* typo

* another typo

* ubuntu default

* forgot git

* do i need git?

* missing packages

* adding python-dev

* with cache?

* buildx action

* buildx name issue?

* maybe now?

* python3

* newline warning

* maybe now

* i actually need this

* ci should work now

* improved caching

* fixing cache

* maybe now it will cache

* this

* testing cache

* trying again

* load

* missing platform

* caching gha

* testing cache

* full testing

* typo

* now?

* why

* adding checkout back

* bad formatting

* fixing convention issues

* supporting python

* adding CI flag

* testing all

* better comments

* adding debugging

* takes 12x longer

* does it output progress now?

* ignore models for speed

* fixing merge

* excluding conv_transpose2d

* only 2 test cuz is to slow

* another approach

* let's see

* faster duh

* my bad

* T_T

* typo

* sup

* with output?

* comment test

* comment test

* comment test

* :?

* no comment

* with cache

* back to normal

* testing that ci works

* back to passing

* trying again

* does it create another entry

* does it create another entry?

* build local

* hey

* Revert "excluding conv_transpose2d"

This reverts commit cc7348de03033e032f47d69caff174e2f1a7bfea.

* does it cache if done before?

* does it cache?

* done

* adding test ops

* bad formatting

* no need for this

* working static mem

* sum 1d

* add ndim

* better reg import

* fix stack

* back to np

* working except for softmax

* 5 failing

* no progress

* remove keystone

* remove keystone

* testops passing

* cleanups

* more cleanup

* typo

* ci

* ci2

* cond import

* ci3

* ci4

* ci4

* ci5

* ci5

* ci6

* alignment

* test all

* correct test

* err read_unmapped

* passing test

* ignore for speed

* ignore for speed

* ci7

* cleanup

* remove docker

* fixing merge

* fixing bugs

* add skipload for const ops

* comments

* First merge to master: Renderer

* fix emulation

* passing all tests arm64

* cleaning

* fix handcoded binary

* cleaning

* fix errs

* fix runtime arg binary

* clean git diff

* fix and clean

* fixing metal test

* cleaning

* fix metal test

* ci ~8 min

* fix pylint and clang

* cache the files in ops_clang

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-08-14 19:29:30 -07:00
Szymon Ożóg 330fb7b1a3
Print more meaningfull hip error messages (#1530) 2023-08-12 07:16:20 -07:00
wozeparrot 29d5801387
distributed collectives (#1519)
* feat: world

* feat: tests

* feat: no more backwards

* feat: recv into

* feat: whoops

* feat: test in ci

* feat: some debug logging

* feat: workflow naming

* feat: need to set pythonpath

* feat: just send to same device

* feat: allreduce

* feat: test

* feat: need contiguous

* feat: test in ci

* feat: exit with correct code

* feat: don't need that

* feat: opencl wait_for just doesn't work

* feat: synchronize on out

* feat: try?

* feat: try again?

* feat: add extra realizes

* feat: print

* feat: seed

* feat: tol

* feat: test ones and zeros

* feat: remove print

* feat: are you just flaky

* feat: separate scatter and gather?

* feat: just try synchronizing

* feat: remove print again

* feat: bring back difference

* feat: no sync

* feat: revert that

* feat: back to wait_for

* fix: typo
2023-08-11 10:22:07 -07:00
wozeparrot 7e7c9001e9
distributed world (#1481)
* feat: world

* feat: tests

* feat: no more backwards

* feat: recv into

* feat: whoops

* feat: test in ci

* feat: some debug logging

* feat: workflow naming

* feat: need to set pythonpath

* feat: just send to same device
2023-08-10 10:00:51 -07:00
George Hotz c417cd3c97
fast HIP gemm -> 100 TFLOPS (#1476)
* fast HIP gemm

* wmma

* correct b

* fix spilling

* 60 TFLOPS

* 64 TFLOPS

* 65 TFLOPS
2023-08-09 06:54:15 -07:00
Yixiang Gao 6480a1a180
CIFAR 94.03% (#1340)
* add disk_tensor

* fix jit

* new baseline before whitening

* whitening through torch

* whitening done currently at 91.65%

* 91.99%

* clean up mixup and 92.3%

* clean up 92.30%

* 92.49% before searching for new hyper-parameters

* fix CI

* fix white space

* add whitening init in test

* refactor, update hyperpara, 92.72%

* converting whitening to tinygrad operation

* update CI kernels count for CIFAR

* add pad reflect

* add random crop 92.53%

* update hyperpara 93%

* 93.15% on docker container, need to refactor the assignment for hyper param

* print out weights and bias to be separated

* bias/non-bias params separated

* fix whitespace

* clean up

* refactor hyper-param with dict

* refactor lr schedular params

* fix whitespace

* fix cross entropy loss

* fix whitespace

* move opt hyp to hyp dict

* minor fixup

* adjust model, loss scaling

* 92.74% while using half of compute as before

* update hyp for cutmix

* random shuffle during batches

* clean up

* updating the model

* update ConvGroup

* disable gradients for batchnorm layer weights

* whitespace

* 93.92%

* clean up

* finally 94%!

* rewrite whitening to remove dependency on torch

* whitespace

* remove dependency on torch, 93.91%

* back to 94.03%

* clean up

* update test_real_world
2023-08-08 15:13:24 -07:00
George Hotz d24f936501
just cmplt (#1493)
* just cmplt

* fix maximum

* don't save, there's no backward

* ugh, no slot either

* eq is a scam
2023-08-08 13:58:10 -07:00
Roelof van Dijk 0ce7511110
fix: is not use with a literal (#1487)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-08 07:35:30 -07:00
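
For reference, the Python rule behind this fix: `is`/`is not` compare identity, and newer CPython warns when they are used against a literal, so value comparisons should use `==`/`!=` (generic example, not the repo's exact line):

    x = "CI"
    print(x != "")         # value comparison: correct
    print(x is not None)   # identity comparison: the legitimate use of `is not`
    # `x is not ""` relies on string interning and raises a SyntaxWarning on CPython 3.8+
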
Diogo 4dc8595069
simple exporting models (#1344)
* unified exporting

* json exporting

* ignore more

* simplified buffer export

* added dtypes

* added assert

* swift example

* fix tests

* linter

* remove whitespace

* fixed tests

* remove swift example

* remove unintended changes

* allow callable models to be used

* whitespace

* more readable json export

* name change

* whitespace

* whitespace
2023-08-01 09:35:48 -07:00
David Hou 3300d0aeaf
syncthreads before wmma (#1389)
(venv) chaos@tiny3:~/tinygrad$ KX=2 KY=2 N=2048 python extra/gemm/hip_matmul.py
   4194304    289.60 us, would be  59322.55 GFLOPS matmul, 173.80 GB/s
2023-07-31 17:05:49 -07:00
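
For context, the numbers in that log line follow from the standard matmul cost model (2*N^3 FLOPs for an N-by-N matmul); a quick sanity check, with the byte count assuming three N-by-N buffers of 4-byte elements:

    N, t = 2048, 289.60e-6            # size and measured time from the log line above
    print(N * N)                      # 4194304 output elements
    print(2 * N**3 / t / 1e9)         # ~59323 GFLOPS, matching the printed figure
    print(3 * N * N * 4 / t / 1e9)    # ~173.8 GB/s under the 4-bytes-per-element assumption
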
George Hotz 37fa7e96fb
Revert "update editorconfig, enforce via CI (#1343)" (#1380)
This reverts commit da2efecbe2.
2023-07-31 10:35:50 -07:00
Pavol Rusnak da2efecbe2
update editorconfig, enforce via CI (#1343)
* update editorconfig to set unix-style newlines and trim whitespace

* add editorconfig github action to the CI

* fix whitespace
2023-07-30 18:44:30 -07:00
Cole Sutyak 2d4e182294
change fetch to allow for local file selection (#1309) 2023-07-23 15:00:16 -04:00
Jacob Pradels b112edd2c3
Add pylint trailing whitespace rule (#1314) 2023-07-21 13:37:55 -04:00
madt2709 d2c1e8409a
Update arange to be (start, stop, step) (#1308) 2023-07-21 00:27:23 -04:00
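
Usage after the change, assuming the signature now matches Python's `range`/numpy's `arange` argument order:

    from tinygrad.tensor import Tensor

    print(Tensor.arange(2, 10, 3).numpy())   # start=2, stop=10, step=3 -> [2 5 8] (dtype may vary)
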
wozeparrot 37cc33269a
cl fixes for multigpu (#1276)
* feat: opencl fixes for multigpu usage

* clean: who needs this import anyways
2023-07-18 19:59:30 -07:00
George Hotz ab3d281a6e
Refactor MemOps (#1256)
* metal tests pass locally

* define global

* refactor DEFINE_GLOBAL

* move assembly out. it isn't tested

* fix llvm
2023-07-17 16:36:33 -07:00
Stan 91f797cd52
Moved mkdir in `utils.download_file` to diff line (#1249)
* Moved mkdir to diff line

.mkdir does not return the actual directory being created.

* use walrus operator to simplify
2023-07-16 00:30:46 -07:00
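
The underlying Python detail: `Path.mkdir()` returns `None`, so it cannot be chained onto the expression that builds the path; the walrus operator keeps the one-liner while still naming the path (generic sketch, not the repo's `download_file`):

    from pathlib import Path

    (fp := Path("/tmp/example/weights.bin")).parent.mkdir(parents=True, exist_ok=True)
    print(fp)   # fp is still the file path; mkdir ran on its parent directory
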
Yixiang Gao a8f2c16f8e
add contiguous (#1246) 2023-07-15 08:36:34 -07:00
George Hotz 67e34b356a
good stuff from tensor cores branch (#1199) 2023-07-08 16:58:26 -07:00
Jacky Lee e0c2ae8984
Update file paths (#1179) 2023-07-07 18:41:58 -07:00
George Hotz b8dfbba703 hip_matmul: f16 gemm 2048x2048 gets 36 TFLOPS 2023-07-08 00:35:45 +00:00
Stan 69d33cab0d
Fix: auto create parent dir when downloading file (#1173)
* Fix: auto create parent dir when downloading file

also removed duplicate import `os`

* Added test for auto parent dir creation when downloading file
2023-07-07 13:40:29 -07:00
terafo aa60feda48
Fix naming conflict with huggingface datasets (#1161)
* Rename in files

* Move files

* Moved to extra/datasets as suggested

* Changes to files

* Fixed stupid mistake

---------

Co-authored-by: terafo <terafo@protonmail.com>
2023-07-07 10:43:44 -07:00
Stan 9b6e57eccd
helpers.py: improved test coverage + exception handling (#1165)
* Fixes + improved test coverage for helpers.py

- added exception handling in `proc`; previously, if an exception was thrown, the thread would hang
- made `_early_exec_process` catch any Exception; previously, if an exception was thrown before the process was started, it would hang the thread

* Made `_early_exec_process` catch any Exception

 Otherwise, if an exception was thrown before the process was started, it would hang the thread. For example, a type error for an argument passed to `subprocess.check_output`.

* Fixed `from tinygrad.helpers import Timing` import

oops, for some reason my IDE cleaned that import from extra/helpers.

* Fixed import in llama.py

Another one that I skipped by accident, mybad

* Extracted a class for tests of early exec

* Normalize line endings, Windows uses \r\n

* Made `cross_process` not a daemon
2023-07-07 10:26:05 -07:00
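
The hang described above is the classic worker-process failure mode: if the child dies before putting anything on the queue, the parent blocks forever on `q.get()`. A generic sketch of the fix (function and names here are illustrative, not the repo's `_early_exec_process`):

    import multiprocessing as mp

    def _worker(q, fn, args):
        try:
            q.put((True, fn(*args)))   # normal path: ship the result back
        except Exception as e:
            q.put((False, e))          # failure path: ship the exception instead of hanging the parent

    def run_in_child(fn, *args):
        q = mp.Queue()
        p = mp.Process(target=_worker, args=(q, fn, args))
        p.start()
        ok, result = q.get()           # always receives something because the worker catches everything
        p.join()
        if not ok: raise result
        return result
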
Kunwar Raj Singh 8391648822
Over 90% on CIFAR with examples/hlb_cifar10.py (#1073)
* fix eval, lr decay, best eval

* 82.27

* 82.64

* 82.79, reproducable

* add lr sched, 85.26

* 87.42

* 87.94

* 87.42

* tta with flip

* training flip aug

* refactor

* using Tensor for LR is faster

* 89.5

* refactor, flip only train set

* 90.01

* 90.64

* eval jit

* refactor

* only JIT model

* fix eval JIT

* fix eval JIT

* 90.82

* STEPS=900 reaches 90.22

* TTA envvar

* TTA default 0

* fully jit training

* refactor optim

* fix sched

* add label smoothing

* param changes

* partial gelu

* OneCycle with pause

* gelu maybe works

* 90.12

* remove pause lr

* maybe fix lr schedulers

* scheduler test passing

* comments

* try mixup

* shuffle!

* add back the missing last eval

* fix shuffle bugs

* add mixup prob

* fix mixup prob

* 90.19

* correct mixup

* correct mixup

* correct mixup

* 90.24

* 90.33

* refactor, add type hints

* add gradient clipping

* maybe fix test

* full JIT

* back to relu for now

* pass mixup prob as param

* add typehints

* maybe CI works

* try erf gelu

* CI, types

* remove useless import/

* refactor optim

* refactor optim

* try leakyrelu

* try celu

* gelu

* 90.67

* remove grad clip

* remove grad clip tests

* revert params

* add test for OneCycleLR

* 90.62

* fix eval timing

* fix eval timing again

* so where i calculate mixup_prob matters

---------

Co-authored-by: Kunwar Raj Singh <kunwar31@pop-os.localdomain>
2023-07-06 20:46:22 -07:00
Eli Frigo 801564f31b
Remove POW llop and add SQRT llop (#1104)
* fixed division by zero for fast operations

* made et closer to 0

* replace POW llop with SQRT

* updated mlops to swap SQRT and POW llops

* updated hlops to swap POW and SQRT

* added sqrt llop to cpu runtime

* added sqrt llop to cstyle codegen

* added POW llop to llvm ir codegen

* added SQRT llop to torch runtime

* moved pow from mlops to hlops

* found a better way to do reverse pow

* fixed indentation

* added SQRT llop to triton

* update docs to match new llops

* removed POW operator from assembly codegen

* added sqrt and rsqrt to pow hlop

* rewrote pow function in tensor.py

* Adjust tolerance

* Adjust for adamw

* Reduce for Adam too

* removed accidental leftover code

* removed all of accidental code

* added rsqrt test

* removed pow from mlops again

it was added back when resolving merge conflicts

---------

Co-authored-by: Jacky Lee <jla524@sfu.ca>
2023-07-05 18:07:58 -07:00
Reza Rezvan d1356cac27
Fix: Jacobian tests [WIP] (#1126)
* Fix: Jacobian tests; num_jacobian either bugged or not accurate enough;

* Fix: Jacobian tests;

* Fix: Gradcheck;
2023-07-05 15:36:22 -07:00
Mehmet Kuzucu c3173ff281
Add return statement to the train function (#1135)
add a return statement to the train function in order to provide access to the losses and accuracies lists
2023-07-05 08:13:38 -07:00
George Hotz 2f968f8547 ignore cloudpickle type for local mypy 2023-07-04 13:51:20 -07:00
Daniel Hipke b4ce23e4b8
Make cross_process use cloudpickle (#1118)
* fix syntax issues in imagenet_download.py

* use cloudpickle in cross_process to make it work in Python 3.9+

* add cross_process test

* prevent unpickling on every function call

* add cloudpickle to setup.py

* add support for args/kwargs
2023-07-04 00:47:34 -07:00
Anselm Coogan a22aad7d32
Use generators instead of lists in `any`s and `all`s (#1111)
* Use generators in any(..) instead of lists for better best-case

* Use generators in all(...) instead of lists

* enable R1729 in .pylintrc

* revert import sorting

---------

Co-authored-by: Anselm Coogan <anselm@scandit.com>
2023-07-03 16:06:06 -07:00
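
The point of the change in plain Python: a list comprehension materializes every element before `any()`/`all()` sees the first one, while a generator lets them short-circuit.

    nums = range(10_000)
    assert any([n > 5 for n in nums])   # builds a 10,000-element list first
    assert any(n > 5 for n in nums)     # stops as soon as n == 6 is seen
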
Frank Pinnola 2071e53da8
Handle broadcast flag on gemm (#1103) 2023-07-02 22:15:07 -07:00
Rob Grossman c8ddc34368
include missing queue in thneed load (#1095) 2023-07-02 12:33:59 -07:00
George Hotz e234bf2298 hip matmul : add K support 2023-06-28 19:54:33 +00:00
George Hotz 0e93b9642a hip matmul 2023-06-28 19:21:01 +00:00
George Hotz 6ec0a24706 imagenet eval in 1 min 28 sec 2023-06-28 04:23:26 +00:00
George Hotz 9c6e507518 move accel into extra 2023-06-23 16:38:15 -07:00
Diogo 57d3aa76a5
Windows & Ubuntu CLANG CI support (#1011)
* matrix strategy

* push env to GITHUB_ENV

* use printf instead of echo

* use temp helper function for cross os paths

* use path join

* switched to using temp helper function

* skip test on windows due to memory limit

* small fix

* removed semi

* touchups

* clean up

* seperate tests

* test changes to test_utils on windows

* small refactor

* more cleanups

* undo helpers change

* only skip if in CI and WINDOWS
2023-06-19 09:33:24 -07:00
Alex Wang 3d63c71e27
HIP backend (#750)
* llama works for HIP backend

* Use hipMemcpyAsync; Less lines of code

* Remove unused code

* Refactor

* Add comments; hipDeviceSynchronize

* HIP over GPU; Remove PyHIP dependency

* Cleanups

* Fix mypy check

* Merge master; Dump assembly code
2023-06-18 11:35:57 -07:00
Casey Primozic 805eef10dd
Add tensorflow GEMM benchmark script (#1000)
* Modelled closely after the existing torch benchmark script but just adapted slightly for tensorflow
2023-06-18 10:57:45 -07:00
Diogo d2b837c1d9
Adds floor/ceil (#989)
* floor ceil impl

* control casting in numpy
2023-06-17 10:56:21 -07:00
George Hotz fe71282ba1
faster RDNA assembly backend (#990)
* fast asm

* torch gemm
2023-06-16 12:06:38 -07:00
George Hotz ba56ee6020
RDNA assembly backend ($1000 bounty) (#787)
* Revert "Revert "ops rdna""

This reverts commit 0400315078.

* Revert "Revert "writing 2""

This reverts commit 325a3bf2cf.

* no dump

* 2x 2

* simple asm

* local size

* sub

* lil work

* support args != 3

* assembler work

* generate that

* ptx assembler

* begin index renderer

* max

* ptx loops

* gemms work

* valid works

* asm working a bit more

* close

* passing all ops tests

* ptx is a codegen only, not a backend

* ptx

* float16 support

* rdna goes here

* install types

* make amd disassemble

* ansilen for pretty print

* fix ptx log2/exp2

* assemblyinstruction

* new asm

* working gemm

* fix cmp

* more passing

* mod

* ptx works again

* rdan3 add works

* log exp

* sin is sin 2pi

* fix types

* progress

* loops work

* rdna xyz

* better addressing

* cleanups

* handle exception in early process

* div support

* rdna float4

* locals work

* fix neg index

* cast

* smaller diff

* yaml

* import only if selected

* fromimport

* types

* this all needs rewriting

* a few more
2023-06-16 09:33:18 -07:00
Yahya Lmallas 804c45b5fc
FIX: Can't pickle local object (#979)
_early_exec_process is a local function that is defined within the scope of another function; it should be global
2023-06-14 12:32:17 -07:00
Steven Anderson e54b6c5e7f
One hot (#972)
* passing with 1d indices

* passing all test

* cleanup

* using safe_numpy for scalar
2023-06-12 10:13:29 -07:00
Diogo 2d4370b487
Adds tril & triu support (#936)
* triu & tril support

* lint and kernel count error

* switched shape indicies

* larger shape tests

* reverted numpy removal until #942 is resolved
2023-06-09 22:13:20 -07:00
Steven Anderson c0e558b77c
Test nllloss (#958)
* works but slow

* work with NC and NCd1 it still slow

* refactor

* support for k dimensions

* without numpy
2023-06-09 09:00:29 -07:00
Diogo 6b1280f01c
fixes to Onnx ops LayerNormalization/Prelu and added OptionalHasElement/OptionalGetElement (#956)
* prelu and where casting

* typing for safe_numpy

* optional

* get rid of tracing in ci

* cleanup and resolved layernorm issues

* removed debug print
2023-06-08 16:09:19 -07:00
Diogo 666d151f8a
Onnx slice fixups (#952)
* resolved some slice test errors and added some more debugging logs

* use same device in cumsum

* increased float priority

* onnx debug ouput match input
2023-06-07 19:44:30 -07:00
M4tthewDE 664d6cc7e5
Implement onnx MeanVarianceNormalization (#943) 2023-06-06 10:28:19 -07:00
Steven Anderson 079ea217a3
fix test_pow_type - autocasting for Pow with inputs of diff type (#937) 2023-06-05 15:22:35 -07:00
M4tthewDE 70f12fdb57
Fix wrong op version being used if versions equal (#934) 2023-06-05 07:45:10 -07:00
Steven Anderson 79613eb83e
Test min (#932)
* fix __neg__ defaulting to float32 due to 0.0

* fixed __neg__ always defaulting to float32

* fixed openpilot (OpenCL) Test
2023-06-05 00:03:30 -07:00
George Hotz fbf17f0031 intel benchmark matmul gets 60 TFLOPS? 2023-06-04 17:01:50 +00:00
Steven Anderson 657e642e3a
Fixed test suite for Clip (#912)
* Fixed test suite for Clip

* fixed issue with clip when taking large negative numbers as min

* Remove typings
2023-06-04 09:01:01 -07:00
George Hotz afd0be8a9c intel example 2023-06-04 06:43:09 +00:00
George Hotz ed1963b899
Fast DiskTensor to other Tensor (#916)
* make disktensors fast

* loading

* loader for sd and llama
2023-06-03 12:25:41 -07:00
George Hotz 791530045d
Refactor LoadOps (#910)
* test

* work

* upd test

* loadops

* cleanups

* real ones

* remove LazyNumpyArray

* fix assign test

* remove range

* np.require

* llama uses arange kernels

* no caching consts

* fix enet

* torch load support

* tests cleanup

* fix shufflenet

* fix image

* fix torch_load test
2023-06-03 09:40:43 -07:00
Steven Anderson 513aeb2f66
Fixed all ConstantOfShape test suite (#907) 2023-06-02 11:26:40 -07:00
Steven Anderson 301f7b54c6
ConstantOfShape ONNX test fixed. (#890)
* ConstantOfShape ONNX test fixed.

* removed redundant if statement

* value is optional and should default to a float32 tensor with value of 0

* fixed: default parameters are created at function definition, bad for mutable objects.
2023-06-02 07:34:25 -07:00
kposborne2 ae83e9844c
add output_padding to transposed conv (#875) 2023-06-01 00:03:22 -07:00
Friedrich Carl Eichenroth 740304ef9d
Small Onnx Parser Improvements (#885)
* wip

* rename onnx_version to onnx_model_version

* add type

* add types

* small cleanup

* revert some changes from before

* add todo

* dumb fix
2023-06-01 00:01:01 -07:00
Marcello Fuschi 3924aae8ed
Fix ONNX dropout and unify the implementation (#857)
* Fix ONNX dropout and unify the implementation

* Use tensor rand method for dropout

* Change approach for RNG in ONNX Dropout

* Fix style

* Test legacy RNG seeding

* Remove the necessity for legacy RNG in Tensor class
2023-05-31 07:40:47 -07:00
skobsman 2e393f7ef2
InstanceNormalization ONNX test fixed. (#870) 2023-05-30 16:07:44 -07:00
Friedrich Carl Eichenroth f91f28d9e2
fix a bunch of tests (#856) 2023-05-29 17:48:26 -07:00