nimlgen
58cf6eaba9
add missing dir level for amd mockgpu ( #4911 )
2024-06-11 18:35:04 +02:00
nimlgen
654a8b9ef7
retire hsa ( #4885 )
* retire hsa
* EMULATE_AMD
2024-06-09 11:33:03 +03:00
Nik
085c0bbf6b
add mlperf train subset of openimages ( #4841 )
2024-06-05 10:10:11 -04:00
Elias Wahl
04e237328b
Refactor to class style ( #4804 )
2024-06-04 14:08:31 -07:00
chenyu
3afc914617
CMPEQ -> CMPNE and make it safe to pad ( #4818 )
* CMPNE
* new dataset
2024-06-03 18:02:15 -04:00
nimlgen
7384ee08a0
amd cleanup sdma ( #4796 )
* amd cleanup sdma
* faster enqueue for sdma
* typo
* remove commented lines
* fix overrun check
* better hdp flush command
2024-06-01 17:06:44 +03:00
nimlgen
bd2e7c8b31
amd registers from file ( #4778 )
* amd registers from file
* remove comments
* linter
* no off
2024-05-31 18:48:57 +03:00
chenyu
e614b7c696
docs: showcase remove mnist_gan and add conversation.py ( #4757 )
fixed both examples, and I think it's better to show conversation
2024-05-28 11:09:26 -04:00
nimlgen
50e95b8212
nv qmd sync ( #4740 )
* qmd sync
* better hcq
* mockgpu support chain qmd
* fix mockgpu & linter
2024-05-27 18:51:30 +03:00
nimlgen
c87b066b66
optimize nv sync ( #4729 )
* optimize nv sync
* sdma signal without wfi
* nv mockgpu support
* sep change
2024-05-25 23:10:41 +03:00
chenyu
31358cbea5
change Tensor.stack to method ( #4719 )
2024-05-24 17:04:19 -04:00
qazal
c170ddceaf
fix commavq benchmark ( #4712 )
* fix _slice and assert explicit device
* with _slice
2024-05-24 19:40:57 +03:00
chenyu
47aba47f64
update Torch.gather api ( #4692 )
* update Torch.gather api
gather(self, dim, index) to match torch
* fix that
2024-05-22 21:54:06 -04:00
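The torch-matching `gather(self, dim, index)` rule from this commit can be sketched with numpy's `take_along_axis`, which applies the same indexing (an illustration only, not tinygrad's implementation):

```python
import numpy as np

# torch-style gather with dim=1: out[i][j] = a[i][index[i][j]]
a = np.array([[1, 2],
              [3, 4]])
index = np.array([[0, 0],
                  [1, 0]])
out = np.take_along_axis(a, index, axis=1)
print(out.tolist())  # [[1, 1], [4, 3]]
```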
chenyu
792a494eb8
fix various examples ( #4691 )
* fix examples that used ax1 and ax2 for transpose
* fix that
* update those
2024-05-22 20:43:21 -04:00
chenyu
225dcab3be
prepend `_` to broadcast_shape and deepwalk ( #4683 )
* prepend `_` to broadcast_shape and deepwalk
internal only
* that too
2024-05-22 16:39:05 -04:00
chenyu
ae861325ce
update llama sample for mac 32 input buffer limit ( #4662 )
set default sampling params in function call to 0, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
wozeparrot
b144d4b460
new llama3 example ( #4576 )
2024-05-19 22:42:23 -07:00
nimlgen
daf57af3eb
move tc to renderers ( #4631 )
* move tc to renderers
* missed import
* fix typo
* fix
* fix imports
* remove from tests
* fix 4607
* nv emulate timestamp
* time is int
* correct time
2024-05-18 00:36:29 +03:00
nimlgen
10cf8e459b
hcq update queue in place ( #4626 )
* do not self wait in hcq
* faster enqueue
* comments
* tests
* linter
* fix typo
2024-05-17 22:18:20 +03:00
nimlgen
eb9689336e
nv mockgpu ( #4600 )
* mockgpu nv
* works
* comment that out
* fix merge
* setup gpuocelot
* install packages
* not run all of them
* passes
* fix ci
* almost
* should pass
* linter
* linter 2
* try this?
* ugh, not supported
* ci
* remove ticket from description
* better descs
2024-05-15 23:46:08 +03:00
Ahmed Harmouche
662bca8134
Split UnaryOps.CAST into CAST and BITCAST ( #4487 )
* Separate cast and bitcast
* Fix lint
* No more arg[0]
* Revert "No more arg[0]"
This reverts commit dee6911335513f092fe2cbb9684e8a9d26aad964.
* CAST/BITCAST arg is the dtype only, no more tuple
* No image bitcast, regenerate dataset
* Small fixes
2024-05-15 11:43:31 -04:00
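The CAST/BITCAST split above mirrors the usual distinction between value conversion and bit reinterpretation; numpy's `astype` vs `view` shows the same difference (illustrative only, not tinygrad code):

```python
import numpy as np

x = np.array([1.0], dtype=np.float32)

# CAST: convert the value; the bit pattern changes.
cast = x.astype(np.int32)   # value 1

# BITCAST: reinterpret the same 32 bits; the value changes.
bitcast = x.view(np.int32)  # 0x3F800000, the IEEE-754 bits of 1.0f
```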
George Hotz
ff64bcab69
move graph/search to engine ( #4596 )
2024-05-14 23:12:59 -07:00
George Hotz
fd02ab1e8b
move disassemblers and openpilot ( #4592 )
* move disassemblers and openpilot
* delete junk
* put that in pre-commit
* fixup readme
2024-05-14 19:30:02 -07:00
chenyu
a65c8de735
move .half() llama freq_cis to the end of sin and cos ( #4587 )
otherwise arange has inf if either dim or context length exceeds half.max
2024-05-14 15:00:18 -04:00
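The inf this commit avoids comes from float16's finite range: its largest finite value is 65504, so casting a long arange to half overflows. A numpy sketch of the failure mode (not the llama code itself):

```python
import numpy as np

# float16's largest finite value is 65504; larger indices round to inf.
idx = np.arange(70000, dtype=np.float32)
half_idx = idx.astype(np.float16)
print(np.isinf(half_idx).any())  # True
```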
nimlgen
9b02aef45a
remove rhip ( #4579 )
* remove rhip
* remove hip runner
2024-05-14 17:58:19 +03:00
nimlgen
2131556c2c
amd mockgpu ( #4535 )
* start mock amd gpu
* virt files
* cleaner
* init ci
* small fixes
* linter
* better?
* ugh
* linter
* fix
* disable some
* run shorter
* fixes
* add hcq test
* fix
* fix cmd revert
2024-05-14 14:28:04 +03:00
chenyu
da10cf0be1
extra/threefry.py for mem usage ( #4533 )
for now it needs 8N mem to generate size N rand
2024-05-11 13:46:44 -04:00
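For context, threefry is a counter-based RNG: element i is a pure function of (key, counter=i), so values can be generated in parallel and reproduced exactly. A toy counter-based mixer (a splitmix64-style finalizer, NOT the real threefry rounds) sketches the idea:

```python
MASK = 0xFFFFFFFFFFFFFFFF

def counter_rand(key: int, counter: int) -> int:
    """Toy counter-based RNG: output depends only on (key, counter)."""
    x = (counter * 0x9E3779B97F4A7C15 + key) & MASK
    # splitmix64-style bijective finalizer: odd multiplies + xorshifts
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK
    x ^= x >> 31
    x = (x * 0x94D049BB133111EB) & MASK
    x ^= x >> 31
    return x

# same (key, counter) always reproduces the same value; different
# counters can be evaluated independently (and thus in parallel)
```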
chenyu
8a0fb3d765
delete old extra/autopad.py ( #4532 )
2024-05-11 13:06:10 -04:00
George Hotz
2f970a4fc2
all realize 2 ( #4527 )
* all realize 2
* tests fixup
* fix more tests
* fix openpilot
* fix tests
* unneeded
2024-05-10 22:43:09 -07:00
wozeparrot
d2c347fc74
faster gather for bert ( #4526 )
2024-05-10 22:28:48 -07:00
George Hotz
347a3acb37
add renderer class ( #4524 )
* add renderer class
* tests pass
* fix pylint
* fix tensor cores
2024-05-10 21:40:02 -07:00
George Hotz
d438d5698d
bring buffer back to device ( #4517 )
2024-05-10 11:22:31 -07:00
George Hotz
1e843d495e
cleaning up search with Program ( #4500 )
* cleaning up search
* fix tests
* test fix
* minor compiler cleanup
2024-05-09 19:01:53 -07:00
George Hotz
c9e84ed0da
refactor to Program class ( #4476 )
* refactor to Program class
* switch to Program
* fix tests
* smaller diff
* self.p
* more tests
* fix metal test
* tests
* fix openpilot
* move that to linearizer
* p.launchdims
2024-05-09 17:29:07 -07:00
Francis Lam
c8595a9655
update sops.gz, fix tests and add new linearizer test ( #4437 )
* update sops.gz, fix tests and add new linearizer test
* remove METAL CI skip for test_failure_22
* re-add skip to METAL CI to test_failure_22
2024-05-05 17:31:25 -04:00
George Hotz
12be536c06
Clang graph ( #4424 )
* clang graph runner
* render_dtype
* name it ClangGraph
* JIT=2
* JIT=2 goes there
* JIT as context var
2024-05-05 09:54:12 -07:00
George Hotz
cb7289f9c9
remove clang program header ( #4422 )
* remove clang program header
* proper max
* bools are numbers
* fix compile enet
2024-05-04 08:38:01 -07:00
chenyu
22376e53b7
resnet mlperf logging ( #4361 )
* resnet mlperf logging
* cropping too much?
2024-05-02 00:00:04 -04:00
George Hotz
8bcf533a84
gitignore open-images-v6TEST
2024-05-01 13:55:38 +00:00
Elias Wahl
27613dd881
MLPerf BERT: Main training loop ( #4288 )
* BERT language modeling head + trunc normal initializers
* add train loop + helpers
* shuffle in dataloaders + slight changes in main loop
* beam change
* Minor changes
* random.shuffle
* HParam update
* Use deque for dataloader
* wandb bert project name
* half fixes
* BENCHMARK + remove epoch
* cast + print()
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-29 14:35:27 -04:00
geohotstan
bf412aeb80
use tolist instead of numpy for extracting parameters in onnx ( #4333 )
* still some numpy left
* all pass
* oops indent
* fix up safe_python
* to_python_const
2024-04-29 10:48:20 -04:00
Francis Lata
bb849a57d1
[MLPerf] UNet3D dataloader ( #4343 )
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
2024-04-28 22:34:18 -04:00
chenyu
82d0ed3cf3
cap default dataset wikipedia max_workers to 32 ( #4345 )
max_workers=64 OOMs on tinybox
2024-04-28 21:55:21 -04:00
geohotstan
bc36940c28
fix ( #4319 )
2024-04-28 16:29:04 +08:00
chenyu
5ae252ae83
use at least float32 for optim.lr ( #4297 )
* use at least float32 for optim.lr
when doing mixed precision training (float32 weight, default_float=half), still use float32 to store lr.
it would have been upcast later in the actual weight update, but would have lost precision.
this improved resnet convergence significantly
* undo type annotation
2024-04-25 14:42:28 -04:00
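The precision argument above is easy to quantify: float16 has a 10-bit mantissa, so a typical lr stored in half picks up a relative error orders of magnitude larger than in float32 (illustrative numpy check; the lr value 3e-4 is an assumption, not from the commit):

```python
import numpy as np

lr = 3e-4  # an assumed, typical learning rate
rel_err_half = abs(float(np.float16(lr)) - lr) / lr
rel_err_float = abs(float(np.float32(lr)) - lr) / lr
# storing lr in half perturbs it far more than float32 does
print(rel_err_half > 100 * rel_err_float)  # True
```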
George Hotz
38f97aa0fe
rename rawbufs to bufs in ExecItem ( #4274 )
2024-04-24 11:27:27 +08:00
nimlgen
f3b4dff7c9
KFDProgram -> AMDProgram ( #4268 )
2024-04-24 00:29:50 +03:00
Elias Wahl
69341144ba
Wikipedia preprocessing script ( #4229 )
* Preprocessing script
* short seq prob
* comments + env vars
* Add preprocessing reference. Add test
* lint fix + add eval test support
* whitespaces
* point to commit
* comment
* rename
* better comments
2024-04-23 10:28:01 -04:00
George Hotz
9a95781d51
renamed ( #4260 )
2024-04-23 09:00:28 +04:00
George Hotz
2ae4f45272
WIP PM4 Support ( #4110 )
* pm4 kernel launch works
* disable USE_THREAD_DIMENSIONS
* add kernel code
* work on real pm4
* pm4 signal
* same
* gate pm4
* hcq tests pass
* ops passes
* pm4 is closer
* pm4 debug (#4165 )
* start debug tests passing
* prg
* smth
* hdp flush
* cleaner 1
* do not need this
* logs not need
* small things
* linter
* remove AQL
* test hcq
* fix tests
* it's subtracting, it shouldn't be -1
* pm4 changes (#4251 )
* not need this anymore
* sdma signal with non atomic
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00