Commit Graph

800 Commits

wozeparrot dc2617bffd
feat: use more correct reg for local dims (#6048) 2024-08-12 11:15:37 -07:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
wozeparrot d269bc95fa
faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
George Hotz bc55c8a30e
pmatmul example + GB/s bugfix [run_process_replay] (#5974)
* pmatmul example + bugfix

* improve pmatmul

* Update real_pmatmul.py
2024-08-07 22:32:11 -07:00
George Hotz bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
wozeparrot 5808e8a30f
mockgpu remu changes (#5925) 2024-08-05 19:26:58 -07:00
wozeparrot 6740a0a6a0
hip_ioctl changes (#5917) 2024-08-05 11:58:38 -07:00
chenyu 996ff0c135
pow(2) -> square in RMSNorm [run_process_replay] (#5901)
reads nicer in metadata
2024-08-04 14:21:31 -04:00
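A quick sketch of the equivalence behind this change, assuming the public tinygrad Tensor API:

```python
from tinygrad import Tensor

x = Tensor.randn(4, 8)
# square() computes the same values as pow(2), but it lowers to a plain
# x*x multiply, so the generated kernel metadata reads nicer.
assert (x.square() - x.pow(2)).abs().max().item() < 1e-6
```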
Elias Wahl 4a114756f6
New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
nimlgen 34168a64e3
optimize nv profiler (#5856)
* nv profiler fix

* cleanup hcq a bit

* fixes

* fix

* typo

* all signals put timestamp

* a bit cleaner

* merge fields

* type

* import

* tiny fix
2024-08-01 23:57:45 +03:00
Vyacheslav Pachkov 610e454132
fix opencl_ioctl on comma (#5814)
- remove unused code
- add CP_REG_TO_MEM opcode
- fix parse_cmd_buf for more than one command object by correcting an offset
- fix memory mappings for memory allocated with KGSL_MEMFLAGS_USE_CPU_MAP.
KGSL_MEMFLAGS_USE_CPU_MAP: if set on call and return, the returned GPU
address will be 0; calling mmap() then sets the GPU address.
So there are no IOCTL_KGSL_GPUOBJ_INFO ioctls for that type of memory,
which resulted in a crash right after get_mem.
2024-07-30 20:44:06 -07:00
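A hypothetical sketch of the mapping fix described above (hook names and the flag value are illustrative, not the real opencl_ioctl code):

```python
# Hypothetical sketch: with KGSL_MEMFLAGS_USE_CPU_MAP the alloc ioctl returns
# gpuaddr == 0, so the GPU address must be recorded at mmap() time instead.
KGSL_MEMFLAGS_USE_CPU_MAP = 1 << 28  # illustrative value

pending_cpu_map_ids: set[int] = set()
mappings: dict[int, int] = {}  # gpuaddr -> size

def on_gpuobj_alloc(obj_id: int, flags: int, gpuaddr: int, size: int):
  if flags & KGSL_MEMFLAGS_USE_CPU_MAP:
    pending_cpu_map_ids.add(obj_id)  # no usable gpuaddr yet, and no GPUOBJ_INFO will come
  else:
    mappings[gpuaddr] = size

def on_mmap(obj_id: int, addr: int, size: int):
  if obj_id in pending_cpu_map_ids:
    mappings[addr] = size  # mmap() fixes the GPU address for CPU_MAP allocations

def get_mem(addr: int) -> int:
  # previously crashed for CPU_MAP memory because it was never registered
  return next(size for base, size in mappings.items() if base <= addr < base + size)
```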
David Hou 9a485f36e4
shard kvcache (#5830) 2024-07-30 20:29:54 -07:00
George Hotz 4e89d45513 hotfix: put contiguous back in llama 2024-07-30 18:43:48 -07:00
George Hotz 21c5e8e1b7
extreme llama speed, 57.34 tok/s (#5827)
* extreme llama speed

* mergable
2024-07-30 18:32:09 -07:00
George Hotz e6879035a0
work to make GEMV fast (#5824)
* work to make GEMV fast

* half8 cast

* align struct

* fix amd

* float8 is a later problem
2024-07-30 17:41:40 -07:00
Francis Lata ce61be16f1
clean up how preprocessed folder is defined (#5813) 2024-07-30 12:35:26 -04:00
chenyu 471b188d79
fix mypy errors in latest mypy (#5794)
* fix mypy errors in latest mypy

mypy has stricter partial and api arg checks now

* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
nimlgen ea27ec4cd0
nv switch classlist_v2 to classlist (#5763)
* nv switch classlist_v2 to classlist

* support in mockgpu

* fix mockgpu
2024-07-28 20:24:42 +03:00
chenyu 3686b6726a
move GraphException to jit.py (#5744)
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
George Hotz 489a5b99a5 hotfix: triton_nv_matmul touchups 2024-07-24 23:24:29 +00:00
George Hotz bf24be4c8c triton gets 163 TFLOPS on 4090 2024-07-24 18:32:29 +00:00
George Hotz 4d47968580
fix acc folding for NV tensor cores (#5658)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
nimlgen 08a9c0ae5e
hcq cache invalidation for beam (#5630)
* nv full cache invalidation

* the same command on amd

* linter

* fix amd

* nv no hardcoded consts

* beam default
2024-07-22 18:13:17 +03:00
George Hotz 6c6d74d922
parallel mcts (#5626)
* start work on parallel mcts

* compile was linearizing twice

* typing + more early stopping

* fix compiler error
2024-07-21 14:53:23 -07:00
George Hotz ef179087a4
mcts exit condition wasn't right, also use it with BEAM>=100 (#5619)
* mcts exit condition wasn't right, also use it with BEAM>=100

* mcts touchups

* clean up sample
2024-07-21 10:16:47 -07:00
George Hotz 0f67ef4674
mcts graph and dedup support (#5618)
* mcts graph and dedup support

* usable graph

* mcts colors

* C=4 seems better

* C=3 even better

* sample_tree

* backprop is external function

* late expand to match algo
2024-07-20 23:29:14 -07:00
chenyu eddc5bcfd7
MCTS tweaks (#5616)
MCTS 500 is competitive with BEAM=8 on resnet on M1 Max.
- increment trial times even with compiled error and runtime error.
- use best time of children as the node value.
2024-07-20 19:45:59 -07:00
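A minimal sketch of the node-value tweak (illustrative, not the actual tinygrad search code):

```python
import math

class Node:
  def __init__(self, parent=None):
    self.parent, self.best_t, self.n = parent, math.inf, 0

def backprop(node, tm: float):
  # tm is math.inf for compile or runtime errors: the trial still counts,
  # but a failure can never become the node's value.
  while node is not None:
    node.n += 1
    node.best_t = min(node.best_t, tm)  # node value = best time seen in its subtree
    node = node.parent
```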
George Hotz 1113e47f96 print best in MCTS + light up the winner in hcopt 2024-07-20 09:39:36 -07:00
George Hotz ac99ecd94e
use statistics.median for timing (#5606) 2024-07-20 08:37:32 -07:00
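For context, a minimal version of median-based timing (illustrative, not the repo's exact helper):

```python
import statistics, time

def timing(fn, runs=10) -> float:
  ts = []
  for _ in range(runs):
    st = time.perf_counter()
    fn()
    ts.append(time.perf_counter() - st)
  return statistics.median(ts)  # robust to one-off scheduler hiccups, unlike the mean
```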
George Hotz 06e336bccb
mcts search (#5598)
* mcts search

* mcts cleanups

* mcts cleanup

* random shuffle children order

* mcts in handcode_opt

* src and remove_node

* debug 3 to print ast

* print the type

* mcts in extra
2024-07-19 21:38:39 -07:00
Tobias Fischer 72da3fe7e6
added clip vision model (#5595)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-19 18:35:51 -04:00
George Hotz fa7e734b49
MetaOps.KERNEL (#5543) 2024-07-17 19:41:23 -07:00
Francis Lam 2d53abb04a
test/external/fuzz_linearizer: fix for new AST changes (#5519)
* test/external/fuzz_linearizer: fix for new AST changes

also add beautiful_mnist failures

* add CLANG and LLVM to test_failure_35 failed_platforms

* fix test_linearizer_failure names
2024-07-17 00:08:07 -04:00
Tobias Fischer 85d4ca7caa
FID Inception Model (#5516)
* added model impl

* minor cleanups

* extracted weights loading into from_pretrained

* reorganized model for better weight loading

* removed lru cache for state dict loading
2024-07-16 23:12:03 -04:00
chenyu 28972418c4
s/get_linearizer/get_kernel [run_process_replay] (#5467) 2024-07-13 20:32:22 -04:00
George Hotz 03c2dc8bd7
lowerer is kernel [run_process_replay] (#5437) 2024-07-12 18:50:55 -07:00
chenyu 00813a92a0
update Tensor.eye api to match torch (#5433)
* update Tensor.eye api to match torch

input is n for nrows and optional m for ncols

* space

* fix onnx
2024-07-12 20:25:12 -04:00
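Usage sketch of the updated API:

```python
from tinygrad import Tensor

print(Tensor.eye(3).numpy())     # 3x3 identity
print(Tensor.eye(2, 3).numpy())  # 2 rows, 3 cols, matching torch.eye(n, m)
```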
George Hotz 870dc8c350
s/Linearizer/Lowerer [run_process_replay] (#5428) 2024-07-12 15:54:07 -07:00
George Hotz 6707c778d0
scheduleitem is not Tuple [run_process_replay] (#5425)
* scheduleitem is not Tuple [run_process_replay]

* fix tests

* fix op + fuzzers

* fix mop test
2024-07-12 15:13:19 -07:00
George Hotz 94599c0637
fixup ast in kernel to be MetaOps.SINK [run_process_replay] (#5424)
* fixup ast in kernel to be MetaOps.SINK [run_process_replay]

* fix tests

* fix more tests
2024-07-12 14:01:03 -07:00
uuuvn 3cb94a0a15
Rename tinygrad/runtime/driver to support (#5413) 2024-07-12 11:06:42 -07:00
wozeparrot a02b38c0ac
download openimages by running it (#5396) 2024-07-11 16:06:13 -07:00
wozeparrot fa873df9c1
bring tinychat more inline with tinyos' version (#5358) 2024-07-10 13:13:52 -07:00
George Hotz c13da83f12
tests from lowerer branch (#5339)
* tests from lowerer branch

* Update test_image_dtype.py

* Update test_image_dtype.py

* Update test_image_dtype.py
2024-07-08 21:23:19 -07:00
nimlgen 51d6f372e4
nv get classes based on device (#5325)
* nv get classes

* support in mockgpu

* choose sm based on gpu

* fix

* fix

* fix arch
2024-07-08 18:25:05 +03:00
Tobias Fischer 0c3a35e5c2
Stable Diffusion v2 Inference (#5283)
* model implementation

* clip fix, more qol options
2024-07-03 22:47:10 -04:00
chenyu b2c3a28a5e
nn.RMSNorm (#5272)
the norm itself isn't valuable enough to add as a Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
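Usage sketch, assuming the module follows the usual RMSNorm definition with a default eps of 1e-6:

```python
from tinygrad import Tensor, nn

x = Tensor.randn(2, 8)
out = nn.RMSNorm(8)(x)
# equivalent by hand: x / sqrt(mean(x^2) + eps), scaled by a learned weight
manual = x * (x.square().mean(axis=-1, keepdim=True) + 1e-6).rsqrt()
```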
Tobias Fischer 8c9c1cf62f
Pulled CLIP and UNet into Separate Files (#5253)
* pulled clip and unet into separate files

* reference cleanup, lru cache fix

* better pool indexing
2024-07-01 22:33:01 -04:00
nimlgen 57e89645cd
hcq spec test (#5226)
* start hcq spec test

* more test

* fixes

* run on amd as well

* test amdgpu exec

* fix amd

* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
George Hotz 14980f79dd hotfix: unbreak llama 2024-06-30 15:27:54 -07:00
George Hotz 3df47bc21e
OpenELM + repeat_interleave (#5234)
* start writing openelm

* progress...hit bug

* repeat_interleave support

* gqa

* add rotary embedding

* spp

* i think it runs correctly

* broken

* output is good now

* cleanups

* no io_uring on android
2024-06-30 15:18:39 -07:00
nimlgen dd7eef7d71
libc defs to autogen (#5217)
* libc defs to autogen

* amd import libc

* linter

* better a bit

* remove comment, check this

* not hardcoded path
2024-06-29 14:37:33 +03:00
qazal 3e56c8422c
remu err handling (#5208)
* add error handling

* use pre release

* minor

* works
2024-06-28 13:15:18 +03:00
reddyn12 f1c7944c44
Fix batchnorm shapes for resnet.load_pretrained (#5167)
* Fix batchnorm shapes

* make it general reshape
2024-06-26 18:44:10 -04:00
nimlgen 69f116a7e1
nv/amd profiler (#4718)
* nv/amd profiler

* fix

* fix

* profile copies

* profile logger

* fixes

* more fixes

* less lines and fixes

* fixes

* some linter

* back sync, no related change

* fix gpu2cpu time def

* simpler

* linter

* linter

* docs

* add add_event api
2024-06-23 17:10:12 +03:00
chenyu e356807696
tinytqdm.set_description and tinytrange (#5101) 2024-06-22 14:45:06 -04:00
chenyu 8080298739
s/tinytqdm/tqdm (#5103)
except in unit test where tqdm is imported
2024-06-22 14:18:26 -04:00
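A usage sketch for these two tqdm commits, assuming the renamed helpers are exported from tinygrad.helpers:

```python
from tinygrad.helpers import tqdm, trange

for i in (t := trange(100)):
  t.set_description(f"step {i}")  # added by #5101

for x in tqdm([1, 2, 3]):         # "tinytqdm" is just "tqdm" after #5103
  pass
```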
chenyu e468601226
update llama attention casting (#5096)
* update llama attention casting

updated scaled_dot_product_attention middle cast and removed hard-coded half in llama attention.

* fix that
2024-06-22 10:57:17 -04:00
chenyu 8bd6cb9511
update llama model RMSNorm casting (#5095)
following the original implementation, cast back to the input dtype before multiplying by the weight. slightly faster
https://github.com/meta-llama/llama/blob/main/llama/model.py
2024-06-21 23:02:04 -04:00
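An illustrative sketch of the casting order (an assumed helper, not the repo's exact code):

```python
def rms_norm_cast(x, weight, eps=1e-6):
  xf = x.float()  # compute the norm in float32 for stability
  normed = xf * (xf.square().mean(axis=-1, keepdim=True) + eps).rsqrt()
  return normed.cast(x.dtype) * weight  # cast back to input dtype before the weight mul
```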
chenyu 0c857ae2d6
some onnx_ops cleanups (#5094) 2024-06-21 22:01:32 -04:00
nimlgen fb1bf48cfe
io_uring for copies from disk (#5035)
* exp uring

* fixes and old version

* nv

* cleaner

* cmp vs aio

* fix

* no lib

* fix nv

* linter

* disk_speed_test now runs default

* fixes

* uring -> io_uring

* linter happy

* get_temp_buf comment added

* tiny nits

* put wait back

* test runs everywhere

* remove consts

* remove mmap consts

* do not require iouring to run test, they are generic
2024-06-21 11:36:51 +03:00
chenyu f6d6760f71
don't cast tuple to list before creating Tensor (#5071)
Tensor constructor supports creating from tuple now
2024-06-20 13:32:56 -04:00
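Usage sketch:

```python
from tinygrad import Tensor

t = Tensor((1, 2, 3))  # works directly now; no list(...) conversion needed first
```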
chenyu e2c5054bdd
update resnet.load_from_pretrained (#5040) 2024-06-18 16:29:22 -04:00
chenyu a3ed4176c8
use tinytqdm in active tests and examples (#5038)
* use tinytqdm in active tests and examples

stress test this before 0.9.1

* no set_description
2024-06-18 16:01:19 -04:00
Junjun Dong c8cd6e725c
Remove BinaryOps.SUB. Replace SUB by ADD and NEG in all tests. Regenerate dataset (#4977)
* feat: remove BinaryOps.SUB

* remove SUB in test_early_end_local

* regenerate dataset. remove SUB in test_linearizer_*

* reenable overflow tests

* simplify tensor.sub function by returning a+(-b)

* remove whitespaces

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-18 09:06:13 -04:00
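The tensor-level simplification from this PR, in sketch form:

```python
# subtraction no longer needs a dedicated SUB op: a - b lowers to ADD plus NEG
def sub(a, b):
  return a + (-b)
```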
chenyu 67e8df4969
remove numpy from dtype (#4969)
replaced all dtype.np with _to_np_dtype defined in tensor.py.

after this, the only numpy usages are (1) Tensor(np.ndarray), (2) construct .numpy() output, (3) numpy random buffer
2024-06-14 15:38:45 -04:00
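Roughly, the replacement helper looks like this (a sketch from the description; the exact code lives in tensor.py):

```python
import numpy as np

def _to_np_dtype(dtype):
  # dtypes no longer carry a .np attribute; map via the struct format char instead
  return np.dtype(dtype.fmt).type if dtype.fmt is not None else None
```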
George Hotz 9823752397
make uops.add private (#4950)
* make uops.add private

* modernize all tests
2024-06-14 03:23:25 -07:00
Jhenner Tigreros dc9e9e4363
Convert BinaryOps.DIV to UnaryOps.RECIP and BinaryOps.IDIV (#4887)
* Create UnaryOps.RECIP and BinaryOps.IDIV and changing uses of BinaryOps.DIV

* Delete unused import

* Add cstyle renderer

* Fix formatting text

* Fix test error due to bad implementation of renderer

* Add PTX support

* Add RECIP to LLVMIR

* Remove BinaryOps.DIV from symbolic test

* Change some test and fix C floor division

* Change references to DIV for the RECIP or IDIV

* Add mimic idiv for symbolic test

* Restore floor

* Mimic idiv

* cast to int

* Fix some test and renderer

* Remove DIV for render nodes

* Resolve issue with div

* Add TestRenderer

* Fix test

* fix error

* Fix PAD test

* Fix div implementation

* Remove DIV

* Add upcast to rshift, due to use of MUL and RECIP on DIV

* Fix linter

* Remove complete BinaryOps.DIV

* Fix lint

* Fix some test

* Revert mul modification

* Fix tests

* Fix CLANG for uops

* Revert IDIV function

* Minor fix

* modify pattern matching rule to support nan

* Fix UNSAFE_PADS_OPS to add UnaryOps.RECIP

* Remove const folding for IDIV and fix PTX

* Complete remove IDIV from extra

* Remove test_div from TestFloatUOps due to test on recip

* Fix linearizer

* fix

* Fix test_22

* Fix llvm

* Apply trunc function for llvmlit

* use floor instead of trunc

* Use correct type

* Generate new fuzz db

* Fix rshift, do not cast to float to support idiv

* Return upcast=false to rshift

* Add to unsafepad BinaryOps.IDIV

* Remove RECIP override for CUDA

* add atol / rtol for the test

* Remove cast to int on IDIV

* Regenerate sops

* delete sops.gz

* regenerate

* regenerate

* regenerate

* Reduce margins

* pass atol and rtol as parameters for _test_metrics

* regenerated dataset

* Regenerate

* Remove duplicated

* Revert changes on extra

* Remove changes extra and NOQA for test

* Remove E501

* Remove and change line

* Remove E501

* Fix atan2

* Revert import and E501

* Remove E501

* Add hrcp to half ops

* Remove 1 of hrcp

* Remove last DIV and add type check on uops for IDIV

* Fix new tests

* Fix tests and custom function

* Regenerate dataset

* Regenerate dataset

* Revert dataset

* Change generate dataset script

* Remove line

* Change IDIV, type checker validate if x,y and z are int

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-06-14 02:43:46 -07:00
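In spirit, the lowering after this change looks like the sketch below (illustrative; the real rules live in the uop rewrite code):

```python
def div(a: float, b: float) -> float:
  return a * (1.0 / b)  # float DIV becomes MUL by UnaryOps.RECIP

def idiv(a: int, b: int) -> int:
  return a // b  # integer DIV becomes its own BinaryOps.IDIV, no float round-trip
  # (rounding semantics differ between Python's // and C-style division;
  # this only sketches the float/int split)
```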
George Hotz e63701fbd4
RDNA3 assembly support (#3637)
* amazing that i can use comgr for this

* compile empty kernel

* cleanups

* tiny_add compiles

* ugh

* more work

* put that in extra
2024-06-13 09:09:24 +02:00
nimlgen fd071ba27e
amd mockgpu correct timer resolution (#4942)
* amd mockgpu correct timer resolution

* test it
2024-06-13 10:07:34 +03:00
Elias Wahl d2e3c391e8
Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
nimlgen 58cf6eaba9
add missing dir level for amd mockgpu (#4911) 2024-06-11 18:35:04 +02:00
nimlgen 654a8b9ef7
retire hsa (#4885)
* retire hsa

* EMULATE_AMD
2024-06-09 11:33:03 +03:00
Nik 085c0bbf6b
add mlperf train subset of openimages (#4841) 2024-06-05 10:10:11 -04:00
Elias Wahl 04e237328b
Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
chenyu 3afc914617
CMPEQ -> CMPNE and make it safe to pad (#4818)
* CMPNE

* new dataset
2024-06-03 18:02:15 -04:00
nimlgen 7384ee08a0
amd cleanup sdma (#4796)
* amd cleanup sdma

* faster enqueue for sdma

* typo

* remove commented lines

* fix overrun check

* flushhdp better command
2024-06-01 17:06:44 +03:00
nimlgen bd2e7c8b31
amd registers from file (#4778)
* amd registers from file

* remove comments

* linter

* no off
2024-05-31 18:48:57 +03:00
chenyu e614b7c696
docs: showcase remove mnist_gan and add conversation.py (#4757)
fixed both examples, and i think it's better to show conversation
2024-05-28 11:09:26 -04:00
nimlgen 50e95b8212
nv qmd sync (#4740)
* qmd sync

* better hcq

* mockgpu support chain qmd

* fix mockgpu & linter
2024-05-27 18:51:30 +03:00
nimlgen c87b066b66
optimize nv sync (#4729)
* optimize nv sync

* sdma signal without wfi

* nv mockgpu support

* sep change
2024-05-25 23:10:41 +03:00
chenyu 31358cbea5
change Tensor.stack to method (#4719) 2024-05-24 17:04:19 -04:00
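Usage sketch of the new method form:

```python
from tinygrad import Tensor

a, b = Tensor([1, 2]), Tensor([3, 4])
s = a.stack(b, dim=0)  # method form after this change; the old API was Tensor.stack([a, b])
```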
qazal c170ddceaf
fix commavq benchmark (#4712)
* fix _slice and assert explicit device

* with _slice
2024-05-24 19:40:57 +03:00
chenyu 47aba47f64
update Torch.gather api (#4692)
* update Torch.gather api

gather(self, dim, index) to match torch

* fix that
2024-05-22 21:54:06 -04:00
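Usage sketch of the updated signature:

```python
from tinygrad import Tensor

t = Tensor([[1, 2], [3, 4]])
idx = Tensor([[0, 0], [1, 0]])
print(t.gather(1, idx).numpy())  # dim comes first, matching torch.gather -> [[1 1] [4 3]]
```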
chenyu 792a494eb8
fix various examples (#4691)
* fix examples that used ax1 and ax2 for transpose

* fix that

* update those
2024-05-22 20:43:21 -04:00
chenyu 225dcab3be
prepend `_` to broadcast_shape and deepwalk (#4683)
* prepend `_` to broadcast_shape and deepwalk

internal only

* that too
2024-05-22 16:39:05 -04:00
chenyu ae861325ce
update llama sample for mac 32 input buffer limit (#4662)
set default sampling params to 0 for the function call, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
wozeparrot b144d4b460
new llama3 example (#4576) 2024-05-19 22:42:23 -07:00
nimlgen daf57af3eb
move tc to renderers (#4631)
* move tc to renderers

* missed import

* fix typo

* fix

* fix imports

* remove from tests

* fix 4607

* nv emulate timestamp

* time is int

* correct time
2024-05-18 00:36:29 +03:00
nimlgen 10cf8e459b
hcq update queue in place (#4626)
* do not self wait in hcq

* faster enqueue

* comments

* tests

* linter

* fix typo
2024-05-17 22:18:20 +03:00
nimlgen eb9689336e
nv mockgpu (#4600)
* mockgpu nv

* works

* comment that out

* fix merge

* setup gpuocelot

* install packages

* not run all of them

* passes

* fix ci

* almost

* should pass

* linter

* linter 2

* try this?

* ugh, not supported

* ci

* remove ticket from description

* better descs
2024-05-15 23:46:08 +03:00
Ahmed Harmouche 662bca8134
Split UnaryOps.CAST into CAST and BITCAST (#4487)
* Separate cast and bitcast

* Fix lint

* No more arg[0]

* Revert "No more arg[0]"

This reverts commit dee6911335513f092fe2cbb9684e8a9d26aad964.

* CAST/BITCAST arg is the dtype only, no more tuple

* No image bitcast, regenerate dataset

* Small fixes
2024-05-15 11:43:31 -04:00
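The distinction in a small sketch, assuming the public Tensor API:

```python
from tinygrad import Tensor, dtypes

x = Tensor([1.0])
print(x.cast(dtypes.int32).numpy())     # CAST converts the value: [1]
print(x.bitcast(dtypes.int32).numpy())  # BITCAST reinterprets the bits: [1065353216] (0x3f800000)
```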
George Hotz ff64bcab69
move graph/search to engine (#4596) 2024-05-14 23:12:59 -07:00
George Hotz fd02ab1e8b
move disassemblers and openpilot (#4592)
* move disassemblers and openpilot

* delete junk

* put that in pre-commit

* fixup readme
2024-05-14 19:30:02 -07:00
chenyu a65c8de735
move .half() llama freq_cis to the end of sin and cos (#4587)
otherwise the arange contains inf if either dim or context length exceeds half.max
2024-05-14 15:00:18 -04:00
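A minimal repro sketch of the overflow, assuming current tinygrad casting semantics:

```python
from tinygrad import Tensor

# float16 tops out at 65504, so casting a long-context arange to half too early
# overflows; casting after sin/cos avoids this.
print(Tensor.arange(70000).half().max().numpy())  # inf
```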
nimlgen 9b02aef45a
remove rhip (#4579)
* remove rhip

* remove hip runner
2024-05-14 17:58:19 +03:00
nimlgen 2131556c2c
amd mockgpu (#4535)
* start mock amd gpu

* virt files

* cleaner

* init ci

* small fixes

* linter

* better?

* ugh

* linter

* fix

* disable some

* run shorter

* fixes

* add hcq test

* fix

* fix cmd revert
2024-05-14 14:28:04 +03:00
chenyu da10cf0be1
extra/threefry.py for mem usage (#4533)
for now it needs 8N mem to generate size N rand
2024-05-11 13:46:44 -04:00
chenyu 8a0fb3d765
delete old extra/autopad.py (#4532) 2024-05-11 13:06:10 -04:00
George Hotz 2f970a4fc2
all realize 2 (#4527)
* all realize 2

* tests fixup

* fix more tests

* fix openpilot

* fix tests

* unneeded
2024-05-10 22:43:09 -07:00