Commit Graph

4862 Commits

chenyu 5eee974b2a
construct Tensor from python list/tuple directly (#4947)
* construct Tensor from python list/tuple directly

no numpy. annoying that half memoryview is a 3.12 feature...

* simpler, and test

* flat already

* simpler

* cute

* 10% faster

* 5%
2024-06-14 11:36:05 -04:00
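The commit above builds a Tensor from nested python lists/tuples without going through numpy. A minimal stdlib-only sketch of that idea (flatten, then pack into a buffer) might look like the following; `flatten` and `to_buffer` are hypothetical names, not tinygrad's actual API:

```python
import struct

def flatten(data):
    # Recursively flatten nested lists/tuples into a flat list of scalars.
    if isinstance(data, (list, tuple)):
        return [x for item in data for x in flatten(item)]
    return [data]

def to_buffer(data, fmt="f"):
    # Pack the flattened values into raw bytes with the stdlib only (no numpy).
    # fmt "f" is float32; "e" would be float16 -- struct has long supported "e",
    # but memoryview.cast to half only landed in Python 3.12, which is the
    # annoyance the commit message refers to.
    flat = flatten(data)
    return memoryview(struct.pack(f"{len(flat)}{fmt}", *flat)), len(flat)

buf, n = to_buffer([[1, 2], [3, 4]])
```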
geohotstan 90332eb529
Getitem pin None dimension (#4960)
* fix

* remove torch out of bounds test

* 1 more test case
2024-06-14 10:48:59 -04:00
qazal 2eeddf1a46
IF ends with STORE, RANGE ends with PHI [run_process_replay] (#4953) 2024-06-14 16:00:32 +03:00
George Hotz d5a92b9b83
sort the axis in reduce op [run_process_replay] (#4956) 2024-06-14 05:16:05 -07:00
George Hotz 14189bca68
graph_dedup function [run_process_replay] (#4955) 2024-06-14 04:24:37 -07:00
George Hotz 63a8add2c2
move uops add logic to linearize (#4952)
* move logic to linearize

* idk how this should work

* empty
2024-06-14 03:52:37 -07:00
qazal 7e32b8c930
refactor generic UOps.END* insertion (#4951)
* merge loops children

* rename to scope_children

* refactor ends

* merge with ends [run_process_replay]
2024-06-14 13:42:41 +03:00
George Hotz 9823752397
make uops.add private (#4950)
* make uops.add private

* modernize all tests
2024-06-14 03:23:25 -07:00
Jhenner Tigreros dc9e9e4363
Convert BinaryOps.DIV to UnaryOps.RECIP and BinaryOps.IDIV (#4887)
* Create UnaryOps.RECIP and BinaryOps.IDIV and changing uses of BinaryOps.DIV

* Delete unused import

* Add cstyle renderer

* Fix formatting text

* Fix test error due to bad implementation of renderer

* Add PTX support

* Add RECIP to LLVMIR

* Remove BinaryOps.DIV from symbolic test

* Change some test and fix C floor division

* Change references to DIV for the RECIP or IDIV

* Add mimic idiv for symbolic test

* Restore floor

* Mimic idiv

* cast to int

* Fix some test and renderer

* Remove DIV for render nodes

* Resolve issue with div

* Add TestRenderer

* Fix test

* fix error

* Fix PAD test

* Fix div implementation

* Remove DIV

* Add upcast to rshift, due to use of MUL and RECIP on DIV

* Fix linter

* Remove complete BinaryOps.DIV

* Fix lint

* Fix some test

* Revert mul modification

* Fix tests

* Fix CLANG for uops

* Revert IDIV function

* Minor fix

* modify pattern matching rule to support nan

* Fix UNSAFE_PADS_OPS to add UnaryOps.RECIP

* Remove const folding for IDIV and fix PTX

* Complete remove IDIV from extra

* Remove test_div from TestFloatUOps due to test on recip

* Fix linearizer

* fix

* Fix test_22

* Fix llvm

* Apply trunc function for llvmlit

* use floor instead of trunc

* Use correct type

* Generate new fuzz db

* Fix rshift, do not cast to float to support idiv

* Return upcast=false to rshift

* Add to unsafepad BinaryOps.IDIV

* Remove RECIP override for CUDA

* add atol / rtol for the test

* Remove cast to int on IDIV

* Regenerate sops

* delete sops.gz

* regenerate

* regenerate

* regenerate

* Reduce margins

* pass atol and rtol as parameters for _test_metrics

* regenerated dataset

* Regenerate

* Remove duplicated

* Revert changes on extra

* Remove changes extra and NOQA for test

* Remove E501

* Remove and change line

* Remove E501

* Fix atan2

* Revert import and E501

* Remove E501

* Add hrcp to half ops

* Remove 1 of hrcp

* Remove last DIV and add type check on uops for IDIV

* Fix new tests

* Fix tests and custom function

* Regenerate dataset

* Regenerate dataset

* Revert dataset

* Change generate dataset script

* Remove line

* Change IDIV, type checker validate if x,y and z are int

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-06-14 02:43:46 -07:00
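The PR above replaces `BinaryOps.DIV` with `a * RECIP(b)` for floats and a dedicated `IDIV` for integers. A pure-Python sketch of the rewrite (not tinygrad's actual ops, just the arithmetic it encodes), including the floor-vs-truncate caveat behind the "Fix C floor division" bullet:

```python
def recip(x: float) -> float:
    # UnaryOps.RECIP analogue: reciprocal as a single unary op.
    return 1.0 / x

def fdiv(a: float, b: float) -> float:
    # Float BinaryOps.DIV rewritten as MUL + RECIP.
    return a * recip(b)

def idiv(a: int, b: int) -> int:
    # BinaryOps.IDIV analogue. Python's // floors, while C-style integer
    # division truncates toward zero; the two differ for negative operands,
    # which is exactly the kind of mismatch the PR had to fix.
    return a // b
```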
SnakeOnex f87ba6016a
tqdm total=0 fix (#4939)
* fixes

* fixes

* removed auto loop closing

* one line shorter
2024-06-14 02:31:59 -07:00
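The tqdm `total=0` fix above guards a progress bar against an empty total. A hypothetical sketch of the failure mode (the real tqdm code differs): computing the completed fraction divides by `total`, so `total=0` needs a special case:

```python
def progress_fraction(n: int, total: int) -> float:
    # Avoid ZeroDivisionError when a bar is created with total=0:
    # treat an empty bar as already complete.
    return n / total if total else 1.0
```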
nimlgen 225f792330
amd hdp flush regs are on seg2 (#4925) 2024-06-14 01:42:23 +03:00
nimlgen 4bfd1904f6
nv do not modify prg's qmd (#4948) 2024-06-14 01:15:40 +03:00
chenyu 845c10bc28
add Node to _broadcasted type annotation (#4946) 2024-06-13 14:10:56 -04:00
chenyu 287d3c3b84
support list, tuple input in dtypes.from_py (#4945)
* support list, tuple input in dtypes.from_py

and used it to infer dtype from python list and tuple in Tensor constructor.

* fix tests
2024-06-13 13:38:06 -04:00
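The commit above teaches `dtypes.from_py` to infer a dtype from nested lists/tuples. A simplified pure-Python sketch of that inference, promoting bool -> int -> float; the function name, string dtypes, and the empty-container default are all assumptions, not tinygrad's real behavior:

```python
def dtype_from_py(x):
    # Infer one dtype for (possibly nested) python data.
    if isinstance(x, (list, tuple)):
        ranks = {"bool": 0, "int": 1, "float": 2}
        dts = [dtype_from_py(v) for v in x] or ["float"]  # empty defaults to float (assumption)
        return max(dts, key=lambda d: ranks[d])
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(x, bool): return "bool"
    if isinstance(x, int): return "int"
    if isinstance(x, float): return "float"
    raise TypeError(f"unsupported type {type(x)}")
```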
chenyu 7aecea4f56
support creating Tensor from python tuple (#4944)
added a small fuzzer to test data with mixed tuple and list of numbers matched with numpy
2024-06-13 12:18:37 -04:00
chenyu 74586bc339
fix getitem with leading None (#4943)
i think all None handling can be unified and remove the calc_dim in advanced indexing
2024-06-13 11:23:40 -04:00
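The behavior being fixed above: each leading `None` in a getitem should simply prepend a size-1 axis before the advanced index applies. The numpy reference behavior (which torch matches, per the follow-up commit #4936 below) can be checked directly:

```python
import numpy as np

t = np.arange(12).reshape(3, 4)
# Two leading Nones prepend two size-1 axes; the array index then
# selects rows 1 and 2 along the original first axis.
out = t[None, None, np.array([1, 2])]
```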
George Hotz e63701fbd4
RDNA3 assembly support (#3637)
* amazing that i can use comgr for this

* compile empty kernel

* cleanups

* tiny_add compiles

* ugh

* more work

* put that in extra
2024-06-13 09:09:24 +02:00
nimlgen fd071ba27e
amd mockgpu correct timer resolution (#4942)
* amd mockgpu correct timer resolution

* test it
2024-06-13 10:07:34 +03:00
chenyu fae08c4d48
fix Tensor.triu / Tensor.tril with boolean input (#4941)
`where(self, 0)` incorrectly upcasted the output. `where(self, False)` is correct but looks unnatural, so added a cast at the end. Pattern matcher can fold the cast into where branches
2024-06-12 20:16:13 -04:00
chenyu cc90b3ef9f
simpler Tensor.gather (#4940)
get rid of some confusing transpose and permute, and the if condition on dim. Saved a kernel for each dim != 0 case in test by removing the dangling transpose at the end
2024-06-12 19:42:40 -04:00
George Hotz fa00ef66fd
Update README.md 2024-06-13 00:29:19 +02:00
chenyu eb0f5b5660
failed test case for getitem with leading Nones (#4936)
* failed test case for getitem with leading Nones

torch matched numpy so tinygrad is incorrect.
another repro
```
t = np.arange(12).reshape((3, 4))
print(t[None, None, np.array([1, 2])])

t = torch.arange(12).reshape((3, 4))
print(t[None, None, torch.tensor([1, 2])].numpy())

t = Tensor.arange(12).reshape(3, 4)
print(t[None, None, Tensor([1, 2])].numpy())
```

* # noqa
2024-06-12 16:19:42 -04:00
Elias Wahl d2e3c391e8
Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
chenyu a21ea165bc
skip linearizer test_failure_22 on llvm (#4937)
getting flaky recently
2024-06-12 16:03:38 -04:00
chenyu 27903c5ed5
minor minor Tensor.__getitem__ cleanup (#4934)
more consistent variable names and update comments before next minor cleanup that touches logic
[run_process_replay]
2024-06-12 15:08:18 -04:00
chenyu 5e6336edda
minor Tensor.gather cleanup (#4933)
`permarg[i]` is just `i`, and break the big return into two lines.
[run_process_replay]
2024-06-12 13:57:28 -04:00
Timmy 720c700a8a
Multireduce-Kernels: Linearizer Changes and Tests (#4259)
* basic tests

* cleanup

* pylint

* ruff

* use define acc as a proxy for rendered reductions

* use define acc as a proxy for rendered reductions

* recursive reduceop rendering via ast_parse

* linters + cleanup

* fixing late buf loading

* plus linters

* removing extra line

* linters

* does this break ci?

* added tests and if add end change

* typo in add_ends

* linters

* removing comments

* allow endifs to be inserted before the end of the graph

* find add ENDIF before next BARRIER

* removing tests with manual ENDIF + linters

* specifically the next barrier after the store of the local result

* Revert "specifically the next barrier after the store of the local result"

This reverts commit b288a5c3cec4114480cdb835a8d0ad01aac49519.

* keeping up to date

* linters + merge changes

* cleaning up old bad decisions

* linters and opts

* merged linearizer tests

* fixing merge issues

* removing the big ugly uop test (functionality tested end-to-end by test_linearizer additions)

* small diff fixes

* updating linearizer to work without uops.add( ... cachable)

* linters

* comment in multireduce tests

* skipping tests without locals

* full tests

* linters

* load_cache[key] fix for multiple accs

* linters

* assert only one reduceop

* fix loop_scope test to actually cause an issue

* self.load_cache[key] key for DEFINE_ACC changed to use a string to make sure each acc is unique

* updated tests

* fixing merge

* removing debug prints

* complete merge fix

* linters

* diff cleanup

* adding tests in

* give each reduce its own local buffer

* gpu=1 changes

* store and load locals with upcasting

* modifying test?

* make multireduce_netsted_local_upcast test match single reduce shapes

* removing todo

* cleaning up the diff

* unroll test

* unroll and upcast tests

* fix gpu

* seq and self.load_cache[key] cleaning

* linters

* padto works

* merge fixes

* fixes

* add skips for amd

* linters + seq

* cleaning & more tests

* softmax tests

* linters

* [run_process_replay]

* add new tests back

This reverts commit 19dec22e0178bca711719cee3e79f327c9e69c12.

* more hardcoded -1s

* fix ptx

* Fix name for loop in ptx

* cleaning up the diff

* cleaning up the uops diff

* nv ci is too slow

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-06-12 13:29:43 -04:00
Nicklas Boman 6e86472cd6
fix typing for test to run in py38 (#4930) 2024-06-12 13:22:30 -04:00
chenyu 1326f29e24
fix Tensor.gather shape checking criteria (#4932)
it's fine if `self.shape[d] >= index.shape[d]` for all `d != dim`, not for all `d`
2024-06-12 13:10:14 -04:00
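The relaxed gather criterion above (`self.shape[d] >= index.shape[d]` only for `d != dim`) can be sketched with numpy, whose `take_along_axis` has the same semantics as a dim-0 gather, `out[i][j] = x[index[i][j]][j]`; the helper name is hypothetical:

```python
import numpy as np

def gather_shape_ok(self_shape, index_shape, dim):
    # It's fine if self.shape[d] >= index.shape[d] for every d != dim;
    # along dim itself the index may be longer or shorter than self.
    return all(s >= i for d, (s, i) in enumerate(zip(self_shape, index_shape)) if d != dim)

x = np.array([[1, 2], [3, 4], [5, 6]])            # shape (3, 2)
idx = np.array([[0, 2], [1, 0], [2, 1], [0, 0]])  # shape (4, 2): longer along dim=0
out = np.take_along_axis(x, idx, axis=0)          # out[i][j] = x[idx[i][j]][j]
```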
qazal 898430c004
more typing in linearizer uoping utils (#4929)
* type check everything

* idxs will be uops
2024-06-12 11:00:02 -04:00
George Hotz 828c98d5c4 add slides from code europe to docs 2024-06-12 14:35:08 +02:00
George Hotz 9a3c1e4a17
fix mul div failure (#4928) 2024-06-12 13:58:46 +02:00
George Hotz 11a03cbbf5
don't use uops.add while constructing (#4913)
* don't use uops.add while constructing

* rebase

* bugfixes

* have to use BFS

* prove it's late

* simpler uop symbolic test (why we did this)

* use dict, not set
2024-06-12 13:31:34 +02:00
qazal d894acbb50
remove hardcoded -1s referencing late reduce (#4926) 2024-06-12 04:50:15 -04:00
qazal b833a112ba
allocate shared memory per block (#4924)
* define temp

* use idx

* cleaner [run_process_replay]
2024-06-12 03:43:10 -04:00
George Hotz ca4ccddcd6 docsfix: nn.Tensor -> Tensor 2024-06-12 09:18:32 +02:00
wozeparrot 3d13c23bfa
llama3 `--download_model` (#4922) 2024-06-11 22:59:59 -07:00
chenyu f902af4f0b
increase metal ci test timeout to 20 minutes (#4920)
make it less annoying for now
2024-06-11 18:45:51 -04:00
chenyu fdbb4305cb
skip unsupported dtype in fuzz_linearizer (#4917)
resolve issues in #4887. dataset generated on ubuntu but metal does not support double
2024-06-11 18:18:21 -04:00
qazal 7f3d9e6d94
revert hsa autogen removal (#4914)
* Revert "only install comgr in AMD CI (#4909)"

This reverts commit 7f03420d05.

* rocm-llvm only removal
2024-06-11 12:55:45 -04:00
nimlgen 58cf6eaba9
add missing dir level for amd mockgpu (#4911) 2024-06-11 18:35:04 +02:00
chenyu b886d250fb
improve test_dropout_on_shard (#4912)
tested some basic property, also minor formatting for a few Tensor.training setups
2024-06-11 11:36:02 -04:00
qazal 7f03420d05
only install comgr in AMD CI (#4909)
* test

* delete hsa autogen
2024-06-11 06:19:33 -04:00
George Hotz 35e53c0809
add sharded arange test (#4908) 2024-06-11 10:58:33 +02:00
chenyu 798ea61377
widen test_ops [low, high] and more strict atol (#4906)
default [low, high] changed from [-1.5, 1.5] to [-2, 2] (except tan).
dropped several explicit atol if it's unnecessarily larger than default 1e-6.
tested on mac, tinybox red / green
2024-06-10 20:47:09 -04:00
chenyu 97b05f567e
revert the .detach() in layernorm (#4904)
* revert the .detach() in layernorm

it's only correct in LayerNorm where input is the data, and not correct in GroupNorm and InstanceNorm that reused layernorm.
Added backward tests for weights, bias and input for these norms.

* bigger atol for llvm

* relax backward more
2024-06-10 18:02:05 -04:00
qazal 8b5bcf309a
process replay in all of CI (#4884) 2024-06-10 14:49:29 -04:00
George Hotz 9715a7193a
replace set with dedup (#4901) 2024-06-10 18:20:38 +02:00
chenyu c8cd637236
test case for Tensor.var reducing over size = 1 axis (#4902)
backward failed when correction >= reducing n
2024-06-10 12:11:39 -04:00
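The failure mode behind the test case above: variance with correction divides by `n - correction`, which is zero when reducing over a size-1 axis with the default `correction=1`. A pure-Python sketch of the arithmetic with a hypothetical guard (the actual tinygrad fix may differ):

```python
def var(xs, correction=1):
    # Corrected variance: sum((x - mean)^2) / (n - correction).
    # With n == 1 and correction == 1 the denominator is 0, which is what
    # broke the backward pass; clamp it here as an illustrative guard.
    n = len(xs)
    mean = sum(xs) / n
    denom = max(n - correction, 0) or 1  # hypothetical guard, not the real fix
    return sum((x - mean) ** 2 for x in xs) / denom
```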
chenyu c0fb7eee09
cleanup lazy const fold for binary (#4900)
removed pylint: disable=possibly-used-before-assignment
[run_process_replay]
2024-06-10 10:46:58 -04:00