711 Commits

Author SHA1 Message Date
George Hotz 94599c0637
fixup ast in kernel to be MetaOps.SINK [run_process_replay] (#5424)
* fixup ast in kernel to be MetaOps.SINK [run_process_replay]

* fix tests

* fix more tests
2024-07-12 14:01:03 -07:00
uuuvn 3cb94a0a15
Rename tinygrad/runtime/driver to support (#5413) 2024-07-12 11:06:42 -07:00
wozeparrot a02b38c0ac
download openimages by running it (#5396) 2024-07-11 16:06:13 -07:00
wozeparrot fa873df9c1
bring tinychat more inline with tinyos' version (#5358) 2024-07-10 13:13:52 -07:00
George Hotz c13da83f12
tests from lowerer branch (#5339)
* tests from lowerer branch

* Update test_image_dtype.py

* Update test_image_dtype.py

* Update test_image_dtype.py
2024-07-08 21:23:19 -07:00
nimlgen 51d6f372e4
nv get classes based on device (#5325)
* nv get classes

* support in mockgpu

* choose sm based on gpu

* fix

* fix

* fix arch
2024-07-08 18:25:05 +03:00
Tobias Fischer 0c3a35e5c2
Stable Diffusion v2 Inference (#5283)
* model implementation

* clip fix, more qol options
2024-07-03 22:47:10 -04:00
chenyu b2c3a28a5e
nn.RMSNorm (#5272)
the norm itself has no significant value to add to Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
Tobias Fischer 8c9c1cf62f
Pulled CLIP and UNet into Separate Files (#5253)
* pulled clip and unet into separate files

* reference cleanup, lru cache fix

* better pool indexing
2024-07-01 22:33:01 -04:00
nimlgen 57e89645cd
hcq spec test (#5226)
* start hcq spec test

* more test

* fixes

* run on amd as well

* test amdgpu exec

* fix amd

* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
George Hotz 14980f79dd hotfix: unbreak llama 2024-06-30 15:27:54 -07:00
George Hotz 3df47bc21e
OpenELM + repeat_interleave (#5234)
* start writing openelm

* progress...hit bug

* repeat_interleave support

* gqa

* add rotary embedding

* spp

* i think it runs correctly

* broken

* output is good now

* cleanups

* no io_uring on android
2024-06-30 15:18:39 -07:00
nimlgen dd7eef7d71
libc defs to autogen (#5217)
* libc defs to autogen

* amd import libc

* linter

* better a bit

* remove comment, check this

* not hardcoded path
2024-06-29 14:37:33 +03:00
qazal 3e56c8422c
remu err handling (#5208)
* add error handling

* use pre release

* minor

* works
2024-06-28 13:15:18 +03:00
reddyn12 f1c7944c44
Fix batchnorm shapes for resnet.load_pretrained (#5167)
* Fix batchnorm shapes

* make it general reshape
2024-06-26 18:44:10 -04:00
nimlgen 69f116a7e1
nv/amd profiler (#4718)
* nv/amd profiler

* fix

* fix

* profile copies

* profile logger

* fixes

* more fixes

* less lines and fixes

* fixes

* some linter

* back sync, no related change

* fix gpu2cpu time def

* simpler

* linter

* linter

* docs

* add add_event api
2024-06-23 17:10:12 +03:00
chenyu e356807696
tinytqdm.set_description and tinytrange (#5101) 2024-06-22 14:45:06 -04:00
chenyu 8080298739
s/tinytqdm/tqdm (#5103)
except in unit test where tqdm is imported
2024-06-22 14:18:26 -04:00
chenyu e468601226
update llama attention casting (#5096)
* update llama attention casting

updated scaled_dot_product_attention middle cast and removed hard-coded half in llama attention.

* fix that
2024-06-22 10:57:17 -04:00
chenyu 8bd6cb9511
update llama model RMSNorm casting (#5095)
following the original implementation, cast back to input dtype before multiplying weight. slightly faster
https://github.com/meta-llama/llama/blob/main/llama/model.py
2024-06-21 23:02:04 -04:00
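
The casting order described in the RMSNorm commit above can be sketched as follows. This is a minimal illustration, not tinygrad's or Meta's code: the helper names `to_half` and `rms_norm` are hypothetical, Python floats stand in for the higher-precision compute dtype, and the input dtype is assumed to be half precision.

```python
import math
import struct

def to_half(v: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", v))[0]

def rms_norm(x, weight, eps=1e-6):
    # Normalize in full precision (Python floats stand in for float32 here).
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # Cast back to the (assumed half-precision) input dtype first ...
    normed = [to_half(v * inv_rms) for v in x]
    # ... and only then multiply by the weight.
    return [n * w for n, w in zip(normed, weight)]
```

The dtype of the final product is not modeled here; the sketch only shows that the cast happens before the weight multiply.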
chenyu 0c857ae2d6
some onnx_ops cleanups (#5094) 2024-06-21 22:01:32 -04:00
nimlgen fb1bf48cfe
io_uring for copies from disk (#5035)
* exp uring

* fixes and old version

* nv

* cleaner

* cmp vs aio

* fix

* no lib

* fix nv

* linter

* disk_speed_test now runs default

* fixes

* uring -> io_uring

* linter happy

* get_temp_buf comment added

* tiny nits

* put wait back

* test runs everywhere

* remove consts

* remove mmap consts

* do not require iouring to run test, they are generic
2024-06-21 11:36:51 +03:00
chenyu f6d6760f71
don't cast tuple to list before creating Tensor (#5071)
Tensor constructor supports creating from tuple now
2024-06-20 13:32:56 -04:00
chenyu e2c5054bdd
update resnet.load_from_pretrained (#5040) 2024-06-18 16:29:22 -04:00
chenyu a3ed4176c8
use tinytqdm in active tests and examples (#5038)
* use tinytqdm in active tests and examples

stress test this before 0.9.1

* no set_description
2024-06-18 16:01:19 -04:00
Junjun Dong c8cd6e725c
Remove BinaryOps.SUB. Replace SUB by ADD and NEG in all tests. Regenerate dataset (#4977)
* feat: remove BinaryOps.SUB

* remove SUB in test_early_end_local

* regenerate dataset. remove SUB in test_linearizer_*

* reenable overflow tests

* simplify tensor.sub function by returning a+(-b)

* remove whitespaces

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-18 09:06:13 -04:00
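
The "returning a+(-b)" rewrite from the commit above can be sketched with a toy expression type. `Op` and `Expr` are hypothetical names for illustration only and do not mirror tinygrad's classes; the point is that subtraction needs no op of its own once ADD and NEG exist.

```python
from enum import Enum, auto

class Op(Enum):
    ADD = auto()
    NEG = auto()

class Expr:
    """Toy expression node: a leaf holds a value, other nodes hold an op."""
    def __init__(self, op=None, srcs=(), value=None):
        self.op, self.srcs, self.value = op, srcs, value

    def __add__(self, other):
        return Expr(Op.ADD, (self, other))

    def __neg__(self):
        return Expr(Op.NEG, (self,))

    def __sub__(self, other):
        # No SUB op: subtraction is expressed as a + (-b).
        return self + (-other)

    def eval(self):
        if self.op is None:
            return self.value
        if self.op is Op.NEG:
            return -self.srcs[0].eval()
        return self.srcs[0].eval() + self.srcs[1].eval()
```

For example, `Expr(value=5) - Expr(value=3)` builds an ADD node whose second source is a NEG node.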
chenyu 67e8df4969
remove numpy from dtype (#4969)
replaced all dtype.np with _to_np_dtype defined in tensor.py.

after this, the only numpy usages are (1) Tensor(np.ndarray), (2) construct .numpy() output, (3) numpy random buffer
2024-06-14 15:38:45 -04:00
George Hotz 9823752397
make uops.add private (#4950)
* make uops.add private

* modernize all tests
2024-06-14 03:23:25 -07:00
Jhenner Tigreros dc9e9e4363
Convert BinaryOps.DIV to UnaryOps.RECIP and BinaryOps.IDIV (#4887)
* Create UnaryOps.RECIP and BinaryOps.IDIV and changing uses of BinaryOps.DIV

* Delete unused import

* Add cstyle renderer

* Fix formatting text

* Fix test error due to bad implementation of renderer

* Add PTX support

* Add RECIP to LLVMIR

* Remove BinaryOps.DIV from symbolic test

* Change some test and fix C floor division

* Change references to DIV for the RECIP or IDIV

* Add mimic idiv for symbolic test

* Restore floor

* Mimic idiv

* cast to int

* Fix some test and renderer

* Remove DIV for render nodes

* Resolve issue with div

* Add TestRenderer

* Fix test

* fix error

* Fix PAD test

* Fix div implementation

* Remove DIV

* Add upcast to rshift, due to use of MUL and RECIP on DIV

* Fix linter

* Remove complete BinaryOps.DIV

* Fix lint

* Fix some test

* Revert mul modification

* Fix tests

* Fix CLANG for uops

* Revert IDIV function

* Minor fix

* modify pattern matching rule to support nan

* Fix UNSAFE_PADS_OPS to add UnaryOps.RECIP

* Remove const folding for IDIV and fix PTX

* Complete remove IDIV from extra

* Remove test_div from TestFloatUOps due to test on recip

* Fix linearizer

* fix

* Fix test_22

* Fix llvm

* Apply trunc function for llvmlit

* use floor instead of trunc

* Use correct type

* Generate new fuzz db

* Fix rshift, do not cast to float to support idiv

* Return upcast=false to rshift

* Add to unsafepad BinaryOps.IDIV

* Remove RECIP override for CUDA

* add atol / rtol for the test

* Remove cast to int on IDIV

* Regenerate sops

* delete sops.gz

* regenerate

* regenerate

* regenerate

* Reduce margins

* pass atol and rtol as parameters for _test_metrics

* regenerated dataset

* Regenerate

* Remove duplicated

* Revert changes on extra

* Remove changes extra and NOQA for test

* Remove E501

* Remove and change line

* Remove E501

* Fix atan2

* Revert import and E501

* Remove E501

* Add hrcp to half ops

* Remove 1 of hrcp

* Remove last DIV and add type check on uops for IDIV

* Fix new tests

* Fix tests and custom function

* Regenerate dataset

* Regenerate dataset

* Revert dataset

* Change generate dataset script

* Remove line

* Change IDIV, type checker validate if x,y and z are int

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-06-14 02:43:46 -07:00
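
The float side of the DIV removal above ("use of MUL and RECIP on DIV") can be sketched as below. The function names `recip` and `fdiv` are hypothetical, and the integer IDIV path is deliberately left out because its rounding behavior is not stated in the log.

```python
def recip(x: float) -> float:
    # Stand-in for a unary reciprocal op.
    return 1.0 / x

def fdiv(a: float, b: float) -> float:
    # a / b rewritten as a * recip(b): one multiply plus one reciprocal.
    return a * recip(b)
```

Because the rewrite rounds twice (once in the reciprocal, once in the multiply), results may differ from a direct division in the last bits, which fits the commit's bullet about adding atol / rtol to a test.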
George Hotz e63701fbd4
RDNA3 assembly support (#3637)
* amazing that i can use comgr for this

* compile empty kernel

* cleanups

* tiny_add compiles

* ugh

* more work

* put that in extra
2024-06-13 09:09:24 +02:00
nimlgen fd071ba27e
amd mockgpu correct timer resolution (#4942)
* amd mockgpu correct timer resolution

* test it
2024-06-13 10:07:34 +03:00
Elias Wahl d2e3c391e8
Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
nimlgen 58cf6eaba9
add missing dir level for amd mockgpu (#4911) 2024-06-11 18:35:04 +02:00
nimlgen 654a8b9ef7
retire hsa (#4885)
* retire hsa

* EMULATE_AMD
2024-06-09 11:33:03 +03:00
Nik 085c0bbf6b
add mlperf train subset of openimages (#4841) 2024-06-05 10:10:11 -04:00
Elias Wahl 04e237328b
Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
chenyu 3afc914617
CMPEQ -> CMPNE and make it safe to pad (#4818)
* CMPNE

* new dataset
2024-06-03 18:02:15 -04:00
nimlgen 7384ee08a0
amd cleanup sdma (#4796)
* amd cleanup sdma

* faster enqueue for sdma

* typo

* remove commented lines

* fix overrun check

* flushhdp better command
2024-06-01 17:06:44 +03:00
nimlgen bd2e7c8b31
amd registers from file (#4778)
* amd registers from file

* remove comments

* linter

* no off
2024-05-31 18:48:57 +03:00
chenyu e614b7c696
docs: showcase remove mnist_gan and add conversation.py (#4757)
fixed both examples, and i think it's better to show conversation
2024-05-28 11:09:26 -04:00
nimlgen 50e95b8212
nv qmd sync (#4740)
* qmd sync

* better hcq

* mockgpu support chain qmd

* fix mockgpu & linter
2024-05-27 18:51:30 +03:00
nimlgen c87b066b66
optimize nv sync (#4729)
* optimize nv sync

* sdma signal without wfi

* nv mockgpu support

* sep change
2024-05-25 23:10:41 +03:00
chenyu 31358cbea5
change Tensor.stack to method (#4719) 2024-05-24 17:04:19 -04:00
qazal c170ddceaf
fix commavq benchmark (#4712)
* fix _slice and assert explicit device

* with _slice
2024-05-24 19:40:57 +03:00
chenyu 47aba47f64
update Torch.gather api (#4692)
* update Torch.gather api

gather(self, dim, index) to match torch

* fix that
2024-05-22 21:54:06 -04:00
chenyu 792a494eb8
fix various examples (#4691)
* fix examples that used ax1 and ax2 for transpose

* fix that

* update those
2024-05-22 20:43:21 -04:00
chenyu 225dcab3be
prepend `_` to broadcast_shape and deepwalk (#4683)
* prepend `_` to broadcast_shape and deepwalk

internal only

* that too
2024-05-22 16:39:05 -04:00
chenyu ae861325ce
update llama sample for mac 32 input buffer limit (#4662)
set default sampling params to function call to 0, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
wozeparrot b144d4b460
new llama3 example (#4576) 2024-05-19 22:42:23 -07:00
nimlgen daf57af3eb
move tc to renderers (#4631)
* move tc to renderers

* missed import

* fix typo

* fix

* fix imports

* remove from tests

* fix 4607

* nv emulate timestamp

* time is int

* correct time
2024-05-18 00:36:29 +03:00