chenyu
7afca52796
replace pow in LAMB by tracking b1**t and b2**t per step ( #4582 )
...
* replace pow in LAMB by tracking b1**t and b2**t per step
* remove t, add [self.b1_t, self.b2_t] to return
* adam has one less kernel
2024-05-14 13:08:22 -04:00
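The first bullet points at a general trick: instead of launching a pow to recompute b1**t and b2**t at every step t, the optimizer keeps the running products as state and advances each with a single multiply, which is how adam ends up with one less kernel. A minimal sketch of that bookkeeping (illustrative class and names, not tinygrad's actual LAMB code):

```python
# hypothetical sketch: track beta powers incrementally instead of calling pow each step
class BetaTracker:
  def __init__(self, b1=0.9, b2=0.999):
    self.b1, self.b2 = b1, b2
    self.b1_t, self.b2_t = 1.0, 1.0  # b1**0 and b2**0 before the first step

  def step(self):
    # one multiply per beta replaces pow(b1, t) and pow(b2, t)
    self.b1_t *= self.b1
    self.b2_t *= self.b2
    # Adam-style bias correction then uses 1/(1 - b1_t) and 1/(1 - b2_t)
    return self.b1_t, self.b2_t
```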
nimlgen
9b02aef45a
remove rhip ( #4579 )
...
* remove rhip
* remove hip runner
2024-05-14 17:58:19 +03:00
Szymon Ożóg
5eb81ff764
Fix speed compare script ( #4581 )
...
* Fix speed compare script
* Update speed_compare_cuda_ptx.py
* Update speed_compare_cuda_ptx.py
* Remove unused function
2024-05-14 17:47:03 +03:00
nimlgen
2131556c2c
amd mockgpu ( #4535 )
...
* start mock amd gpu
* virt files
* cleaner
* init ci
* small fixes
* linter
* better?
* ugh
* linter
* fix
* disable some
* run shorter
* fixes
* add hcq test
* fix
* fix cmd revert
2024-05-14 14:28:04 +03:00
geohotstan
089eeec271
setitem in-place operator tests ( #4577 )
...
* tests and error
* rename to in-place
* add a note
* more comments
* more comments
* disable folded advanced setitem tests for now
2024-05-14 01:28:02 -04:00
chenyu
0fa57b8ce9
raise error if setitem tensors have requires_grad ( #4575 )
...
* raise error if setitem tensors have requires_grad
working on supporting this; this first step properly raises an error
* NotImplementedError
2024-05-13 18:56:47 -04:00
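Per the bullets above, the unsupported case now fails loudly instead of silently breaking autograd. A hedged sketch of the expected behavior, assumed from the commit message rather than copied from the test:

```python
from tinygrad import Tensor

t = Tensor.ones(4, requires_grad=True)
try:
  t[0] = 2.0  # in-place setitem on a tensor that requires grad
except NotImplementedError as e:
  print("not supported yet:", e)
```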
Filip Brzek
f7d08bd454
feat: add acc_dtype to einsum ( #4571 )
2024-05-13 14:02:07 -04:00
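Assuming acc_dtype on einsum follows the same convention as on sum and matmul (accumulate in a wider dtype than the inputs), usage would look roughly like this; the exact call shape is an assumption, not documented API:

```python
from tinygrad import Tensor, dtypes

a = Tensor.rand(8, 16, dtype=dtypes.half)
b = Tensor.rand(16, 4, dtype=dtypes.half)
# assumed: accumulate the k-contraction in float32 even though the inputs are half
out = Tensor.einsum("ik,kj->ij", a, b, acc_dtype=dtypes.float32)
```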
Szymon Ożóg
d97d5a7689
Optimize PTX gated loads index calculation ( #4304 )
...
* WIP but working
* Cleanup
* Remove float4 pred and alt
* Cleanup
* this is somehow slowing it down
* Simplify
* add define var to ignore when optimizing gates
* Update assembly.py
* Test for optimizing gated loads
* Cleanup
* Fix NEG needed before if
* Remove unused parameters
* Update assembly.py
* Fix for cachable gone
---------
Co-authored-by: oz <oz@oz-MS-7B86.NAT.gliwice.vectranet.pl>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-05-13 10:14:01 -07:00
qazal
c67b70ca67
small scheduler refactor ( #4569 )
...
* outputs
* consistent
* more style
* doesn't need tuple
2024-05-13 10:47:39 +03:00
qazal
77aa8659f5
use assign_targets in LazyOp creation ( #4568 )
...
* start
* correct error
* this is possible
* document it
2024-05-13 10:24:35 +03:00
qazal
b0fa97e176
assert error detail in test_assign ( #4567 )
...
* use regex assert
* that shouldn't raise
2024-05-13 09:56:05 +03:00
chenyu
25ec40ca93
cleanup dtype of tensor creation from list ( #4566 )
2024-05-13 02:47:41 -04:00
qazal
4e1135a0bc
assign buffer read/write tests ( #4565 )
...
* simple tests
* more tests
2024-05-13 09:43:36 +03:00
George Hotz
b660f60125
all uops are now cachable ( #4564 )
...
* all uops are now cachable
* cachable is gone
2024-05-12 22:34:35 -07:00
George Hotz
02327b8adf
simple stuff from new_uops branch ( #4563 )
2024-05-12 22:18:05 -07:00
ziereis
f53a23d21e
Test for optim assertion ( #4558 )
...
* add test for assertion
* whitespace
* restore state
---------
Co-authored-by: Thomas Ziereis <thomas.ziereis@web.de>
2024-05-12 14:21:28 -07:00
wozeparrot
d7670f8141
quantized llama multilazybuffer fix ( #4557 )
2024-05-12 14:19:21 -07:00
ziereis
bcee4743ce
fix error message ( #4556 )
...
* fix error message
* typo
* add suggestion to fix error
---------
Co-authored-by: Thomas Ziereis <thomas.ziereis@web.de>
2024-05-12 12:35:51 -07:00
chenyu
01a0c1a948
slightly faster nf4 llama ( #4542 )
2024-05-12 14:24:42 -04:00
qazal
4c232dc0ae
refactor LoadOps scheduling ( #4553 )
...
* refactor
* op -> lop
2024-05-12 12:59:24 +03:00
qazal
3da152f0fe
scheduler docs 2 ( #4551 )
...
* docs
* delete cleanups
2024-05-12 12:15:39 +03:00
wozeparrot
e07c7668b3
nf4 llama ( #4540 )
2024-05-11 22:22:34 -07:00
George Hotz
7a26bdac65
move scheduleitem to schedule.py ( #4541 )
...
* move scheduleitem to schedule.py
* don't need that type checking anymore
2024-05-11 21:13:04 -07:00
George Hotz
508e8a6666
add cpu objdump to LLVM/CLANG ( #4537 )
2024-05-11 14:28:44 -07:00
chenyu
bed70b130c
mlperf bert getenv-able EVAL_STEP_FREQ ( #4534 )
2024-05-11 14:36:56 -04:00
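Making a training knob getenv-able generally means reading it once through tinygrad's helper; a sketch of the pattern, with a made-up default value:

```python
from tinygrad.helpers import getenv

# assumed default of 1000; override at launch, e.g. EVAL_STEP_FREQ=500 python train.py
EVAL_STEP_FREQ = getenv("EVAL_STEP_FREQ", 1000)
```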
George Hotz
328b083e66
lil profiling script
2024-05-11 11:02:44 -07:00
chenyu
da10cf0be1
extra/threefry.py for mem usage ( #4533 )
...
for now it needs 8N mem to generate a size-N rand
2024-05-11 13:46:44 -04:00
chenyu
8a0fb3d765
delete old extra/autopad.py ( #4532 )
2024-05-11 13:06:10 -04:00
chenyu
04a4980a51
touchup bert script ( #4531 )
...
small adjustments: remove the duplicated training setting and stop the script once the target is hit
2024-05-11 13:02:02 -04:00
qazal
4871476a1e
move copy kernel out of schedule ordering ( #4530 )
...
* delete from sorting
* move the logic
2024-05-11 14:44:44 +03:00
qazal
2fb564c125
multi reduce linearizer tests start ( #4529 )
...
* test_end_local
* test_early_end_local
* todos
* mean+std
* skip no locals
2024-05-11 14:06:40 +03:00
qazal
3cba22920f
test_linearizer_correctness ( #4458 )
...
* test helper
* uops asserts
* cleanup args
* nits
2024-05-11 13:02:08 +03:00
qazal
b3d9fd48d0
infra for testing linearizer correctness ( #4528 )
...
* refactor outbufs
* delete helper
2024-05-11 12:10:33 +03:00
George Hotz
2f970a4fc2
all realize 2 ( #4527 )
...
* all realize 2
* tests fixup
* fix more tests
* fix openpilot
* fix tests
* unneeded
2024-05-10 22:43:09 -07:00
wozeparrot
d2c347fc74
faster gather for bert ( #4526 )
2024-05-10 22:28:48 -07:00
George Hotz
922e6e056a
hotfix: fix docs
2024-05-10 21:51:35 -07:00
George Hotz
347a3acb37
add renderer class ( #4524 )
...
* add renderer class
* tests pass
* fix pylint
* fix tensor cores
2024-05-10 21:40:02 -07:00
chenyu
b00b6b16f0
fix TRAIN_BEAM and Tensor.training for mlperf bert ( #4525 )
...
also hard-coded the bert model config instead of looking it up from a file
2024-05-11 00:18:36 -04:00
chenyu
7fab8c9e17
add symbolic mean test cases in test_symbolic_ops and test_symbolic_jit ( #4523 )
...
* add symbolic mean test cases in test_symbolic_ops and test_symbolic_jit
2d symbolic mean in jit does not quite work; the order of the variable inputs is not deterministic?
* skip
2024-05-10 23:19:55 -04:00
George Hotz
827058f030
update tests get_runner ( #4522 )
2024-05-10 20:09:22 -07:00
George Hotz
a0448ff595
use copy kernel in schedule ( #4520 )
...
* use copy kernel in schedule
* imports
2024-05-10 15:30:33 -07:00
chenyu
b15e2309bd
verbose error message in getitem ( #4519 )
...
* verbose error message in getitem
still hard to understand, but at least it prints what it's trying to expand
* sure
* :
2024-05-10 17:25:41 -04:00
George Hotz
d438d5698d
bring buffer back to device ( #4517 )
2024-05-10 11:22:31 -07:00
qazal
a2b707a3eb
scheduler comments 1 ( #4515 )
2024-05-10 20:44:28 +03:00
George Hotz
4eef1ee9bf
move renderer into options ( #4514 )
...
* move renderer into options
* fix tests
* renders are functions
2024-05-10 10:01:51 -07:00
George Hotz
7c630a9a53
hotfix: fix llama spacing + fix hcq
2024-05-10 15:10:13 +00:00
George Hotz
58e7256ce9
restore hcq graph ( #4513 )
...
* Reapply "hcq graph (#4380 )" (#4512 )
This reverts commit 06c1e7498e.
* bring back hcq graph
2024-05-10 07:45:05 -07:00
George Hotz
06c1e7498e
Revert "hcq graph ( #4380 )" ( #4512 )
...
This reverts commit 84a2e2b8c1.
2024-05-10 07:18:09 -07:00
nimlgen
84a2e2b8c1
hcq graph ( #4380 )
...
* start hcq graph
* hack-fix sync on amd
* nv
* fix nv
* multigraph
* fixes
* temp fix for graph
* this is not needed
* fix
* cleaner
* linter
* fix none
* faster cuda copy
* faster amd copy
* temp nv fixes
* alloc on gpu
* exp: faster amd
* Revert "exp: faster amd"
This reverts commit 2e4cfd1f7d8a33634c50fb5655cff1b40269d28c.
* revert, unrelated
* not in this pr
* linter
2024-05-10 07:15:12 -07:00
qazal
2b7ab60584
dfs fusion ( #4491 )
...
* use continue
* simplify
* flip
* track r
* derive forced_realize
* scheduler needs comments
2024-05-10 17:00:48 +03:00