655 Commits

Author SHA1 Message Date
nimlgen 9b02aef45a
remove rhip (#4579)
* remove rhip

* remove hip runner
2024-05-14 17:58:19 +03:00
nimlgen 2131556c2c
amd mockgpu (#4535)
* start mock amd gpu

* virt files

* cleaner

* init ci

* small fixes

* linter

* better?

* ugh

* linter

* fix

* disable some

* run shorter

* fixes

* add hcq test

* fix

* fix cmd revert
2024-05-14 14:28:04 +03:00
chenyu da10cf0be1
extra/threefry.py for mem usage (#4533)
for now it needs 8N mem to generate size N rand
2024-05-11 13:46:44 -04:00
chenyu 8a0fb3d765
delete old extra/autopad.py (#4532) 2024-05-11 13:06:10 -04:00
George Hotz 2f970a4fc2
all realize 2 (#4527)
* all realize 2

* tests fixup

* fix more tests

* fix openpilot

* fix tests

* unneeded
2024-05-10 22:43:09 -07:00
wozeparrot d2c347fc74
faster gather for bert (#4526) 2024-05-10 22:28:48 -07:00
George Hotz 347a3acb37
add renderer class (#4524)
* add renderer class

* tests pass

* fix pylint

* fix tensor cores
2024-05-10 21:40:02 -07:00
George Hotz d438d5698d
bring buffer back to device (#4517) 2024-05-10 11:22:31 -07:00
George Hotz 1e843d495e
cleaning up search with Program (#4500)
* cleaning up search

* fix tests

* test fix

* minor compiler cleanup
2024-05-09 19:01:53 -07:00
George Hotz c9e84ed0da
refactor to Program class (#4476)
* refactor to Program class

* switch to Program

* fix tests

* smaller diff

* self.p

* more tests

* fix metal test

* tests

* fix openpilot

* move that to linearizer

* p.launchdims
2024-05-09 17:29:07 -07:00
Francis Lam c8595a9655
update sops.gz, fix tests and add new linearizer test (#4437)
* update sops.gz, fix tests and add new linearizer test

* remove METAL CI skip for test_failure_22

* re-add skip to METAL CI to test_failure_22
2024-05-05 17:31:25 -04:00
George Hotz 12be536c06
Clang graph (#4424)
* clang graph runner

* render_dtype

* name it ClangGraph

* JIT=2

* JIT=2 goes there

* JIT as context var
2024-05-05 09:54:12 -07:00
George Hotz cb7289f9c9
remove clang program header (#4422)
* remove clang program header

* proper max

* bools are numbers

* fix compile enet
2024-05-04 08:38:01 -07:00
chenyu 22376e53b7
resnet mlperf logging (#4361)
* resnet mlperf logging

* cropping too much?
2024-05-02 00:00:04 -04:00
George Hotz 8bcf533a84 gitignore open-images-v6TEST 2024-05-01 13:55:38 +00:00
Elias Wahl 27613dd881
MLPerf BERT: Main training loop (#4288)
* BERT language modeling head + trunc normal initializers

* add train loop + helpers

* shuffle in dataloaders + slight changes in main loop

* beam change

* Minor changes

* random.shuffle

* HParam update

* Use deque for dataloader

* wandb bert project name

* half fixes

* BENCHMARK + remove epoch

* cast + print()

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-29 14:35:27 -04:00
geohotstan bf412aeb80
use tolist instead of numpy for extracting parameters in onnx (#4333)
* still some numpy left

* all pass

* oops indent

* fix up safe_python

* to_python_const
2024-04-29 10:48:20 -04:00
Francis Lata bb849a57d1
[MLPerf] UNet3D dataloader (#4343)
* add support for train/val datasets for kits19

* split dataset into train and val sets

* add tests for kits19 dataloader

* add MLPerf dataset tests to CI

* update unet3d model_eval script

* fix linting

* add nibabel

* fix how mock dataset gets created

* update ref implementation with permalink and no edits

* clean up test and update rand_flip implementation

* cleanups
2024-04-28 22:34:18 -04:00
chenyu 82d0ed3cf3
cap default dataset wikipedia max_workers to 32 (#4345)
64 on tinybox OOM
2024-04-28 21:55:21 -04:00
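
Note on the max_workers cap above (#4345): the message says 64 workers ran out of memory on a tinybox, so the default was capped at 32. The following is only a minimal sketch of that kind of cap, not the code from the commit. `pick_workers`, `MAX_WORKERS_CAP` and `preprocess` are hypothetical names, and a thread pool stands in for whatever pool the dataset script actually uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS_CAP = 32  # hypothetical constant mirroring the cap named in the commit title


def pick_workers(requested=None, cap=MAX_WORKERS_CAP):
    """Return a worker count that never exceeds `cap` and is at least 1."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    wanted = available if requested is None else requested
    return max(1, min(wanted, cap))


def preprocess(i):
    """Stand-in for per-shard work; the real preprocessing is not shown in this log."""
    return i * i


# Asking for 64 workers still yields a pool of at most 32.
with ThreadPoolExecutor(max_workers=pick_workers(64)) as pool:
    results = list(pool.map(preprocess, range(8)))
```

Capping the count rather than hard-coding 32 keeps smaller machines on their own CPU count while bounding peak memory on large ones.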
geohotstan bc36940c28
fix (#4319) 2024-04-28 16:29:04 +08:00
chenyu 5ae252ae83
use at least float32 for optim.lr (#4297)
* use at least float32 for optim.lr

when doing mixed precision training (float32 weight, default_float=half), still use float32 to store lr.
it would have been upcasted later in actual weight update, but would have lost precision.
this improved resnet convergence significantly

* undo type annotation
2024-04-25 14:42:28 -04:00
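
Note on the float32 learning-rate commit above (#4297): the message says that storing lr in half precision loses precision even if the value is upcast later. The sketch below illustrates that effect with the standard library only. It is not tinygrad code, and the example learning rates are made up for illustration.

```python
import struct


def to_half(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (float16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round-trip a Python float through IEEE 754 single precision (float32)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rel_err(stored: float, exact: float) -> float:
    """Relative error introduced by storing `exact` as `stored`."""
    return abs(stored - exact) / exact


lr = 1e-4 * 0.97  # hypothetical decayed learning rate

half_err = rel_err(to_half(lr), lr)      # bounded by 2**-11 for normal half values
single_err = rel_err(to_single(lr), lr)  # bounded by 2**-24

# Upcasting the half value afterwards cannot recover the bits that were rounded away.
# A very small lr is lost entirely: 1e-8 is below the half-precision range.
tiny_lr_half = to_half(1e-8)  # underflows to 0.0

print(f"half rel err:   {half_err:.2e}")
print(f"single rel err: {single_err:.2e}")
print(f"1e-8 as half:   {tiny_lr_half}")
```

Because the rounding happens when the value is stored, keeping lr in float32 avoids the loss regardless of the dtype used for the rest of training.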
George Hotz 38f97aa0fe
rename rawbufs to bufs in ExecItem (#4274) 2024-04-24 11:27:27 +08:00
nimlgen f3b4dff7c9
KFDProgram -> AMDProgram (#4268) 2024-04-24 00:29:50 +03:00
Elias Wahl 69341144ba
Wikipedia preprocessing script (#4229)
* Preprocessing script

* short seq prob

* comments + env vars

* Add preprocessing reference. Add test

* lint fix + add eval test support

* whitespaces

* point to commit

* comment

* rename

* better comments
2024-04-23 10:28:01 -04:00
George Hotz 9a95781d51
renamed (#4260) 2024-04-23 09:00:28 +04:00
George Hotz 2ae4f45272
WIP PM4 Support (#4110)
* pm4 kernel launch works

* disable USE_THREAD_DIMENSIONS

* add kernel code

* work on real pm4

* pm4 signal

* same

* gate pm4

* hcq tests pass

* ops passes

* pm4 is closer

* pm4 debug (#4165)

* start debug tests passing

* prg

* smth

* hdp flush

* cleaner 1

* do not need this

* logs not need

* small things

* linter

* remove AQL

* test hcq

* fix tests

* it's subtracting, it shouldn't be -1

* pm4 changes (#4251)

* not need this anymore

* sdma signal with non atomic

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
Francis Lam bbb0ad4800
wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
nimlgen e6227bdb15
nv driver (#4044)
* start

* fix err 93

* gpu

* ioctl mappings

* alloc like cuda

* semaphores

* wait for semaphores value

* start ops_nv

* very simple kernels work

* init several gpus

* qmd dumper

* dirty, but most of kernels work

* always all test_ops

* progress, more tests, stable

* test_ops passes, gpt2 works

but with big fifo, wrap of fifo doesn't work, I think it's something coherency related

* need better sync

* fix sync

* alloc2

* all tests pass!

* cleanup 1

* cleanup

* multigpu, simple transfer

* fix sync

* correct init

* nv_gpu autogen + sync bug fix

* clean extra/nv_gpu_driver

* p2p

* clean up

* remove old gen

* small fixes

* cleanup

* cleanup 2

* small fixes

* bigger queue size

* cleanups

* wait

* fixed signals for devs

* fix hang + parallel beam

* small fixes

* detect when local memory is big in kernel

* correct assert

* small fixes

* correct tls size est

* one va space

* less lines

* shorter

* save 2 lines

* save some lines

* remove type ignores

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-22 19:50:20 +04:00
Elias Wahl 2ecd61e3e2
monkey patching (#4214) 2024-04-18 19:20:52 -04:00
chenyu cd801a15f3
scipy.signal.gaussian -> scipy.signal.windows.gaussian (#4205)
fixed unet3d model_eval, will add to CI after merging new dice loss
2024-04-17 19:15:37 -04:00
Elias Wahl 6eef8ee22a
Wikipedia download script for MLPerf BERT training (#4202)
* wikipedia download script

* add link

* checksum valueError

* ops
2024-04-17 16:34:57 -04:00
Francis Lam c91b7b1739
test: add fuzz_matmul and better debugging for simple_matmul (#4199)
also show unoptimized shape in verify_kernel
2024-04-16 23:40:31 -04:00
George Hotz 55ae73e951
Replicate llm.c in tinygrad (#4179)
* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* test tolist

* simple fix for onnx test failures (#4186)

* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* bump line count to 7500

* simplest fix

* safenumpy tolist for now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

---------

Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>
2024-04-16 15:40:48 +04:00
George Hotz b7e281cf10
JitItem -> ExecItem (#4146)
* JitItem -> ExecItem

* execitem in realize

* cleaner

* JITRunner -> Runner
2024-04-11 08:24:57 -07:00
George Hotz e79a11b99c hotfix: revert llama change 2024-04-10 20:13:15 -07:00
George Hotz 2e6c39b0b2
Do less realizes (#4141)
* less realize

* corealize jit inputs

* prints

* print before we run
2024-04-10 19:50:50 -07:00
geohotstan fe88591890
update onnx to 1.16.0 (#4127)
* update

* pass tests and skip tests
2024-04-10 11:19:13 -04:00
Francis Lam 46850a0269
search: add a BEAM_COMPARE env to optionally not compare to hc/tc (#4107)
* search: add a BEAM_COMPARE env to optionally not compare to hc/tc

setting BEAM_COMPARE=0 will prevent additional memory allocation
needed to do the timing tests assuming the BEAM result is in
the diskcache.

* change to optionally use Buffer.allocate
2024-04-08 18:54:01 -04:00
chenyu f8dc82a8a7
use single tensor for llama kv cache (#4108)
similar to optimization in gpt2
2024-04-08 00:38:32 -04:00
chenyu 92c0675ccf
setitem initial support (#4093)
* wip setitem

it's an eager assign to output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
geohotstan 183708b3fd
broadcast expand to match torch (#4085)
* initial version

* heh gimme grrrreen

* version 2

* clean ups

* some test confusion

* fix onnx

* rename to _broadcast_tensors

* improved errors and test

* fixed?

* some test fixup

* version 3 lol

* comments

* cleaner

* add failure test for expand to 0 test

* 1 more assertRaises test

* make err msg better

* also rewrite the expand onnx op? :s
2024-04-07 16:23:13 -04:00
George Hotz fffd9b05f5
mock mnist data for imagenet trainer (#4095)
* mock mnist data for imagenet

* move print and test

* needed to reshape
2024-04-06 08:08:40 -07:00
geohotstan dafa42e864
clean up (#4081)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-05 11:57:44 -04:00
nimlgen d6ba44bc1e
kfd free buffers (#4027)
* kfd free buffers

* unmap

* all test passes

* better pm4

* forgot these

* invalidate only range

* better cache

* forgot

* comments

* fixes
2024-04-01 15:50:58 -07:00
Francis Lam dcb58d3bed
extra/gemm/simple_matvec: add simple_matvec.py (#4021)
we can test with this or add it to CI for benchmarks
2024-03-31 16:38:52 -04:00
chenyu d3f27761b0
move const folding of ADD/SUB/MUL from tensor to lazy (#4020)
* move const folding of ADD/SUB/MUL from tensor to lazy

will do div and pow separately.

* fix onnx adding with None
2024-03-31 16:35:36 -04:00
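
Note on the const-folding commit above (#4020): folding ADD/SUB/MUL means simplifying an expression when an operand is a known constant, for example x + 0 to x. The sketch below shows the general technique on a toy expression tree. It makes no claim about how tinygrad's lazy layer implements it, and `Const`, `Var`, `BinOp` and `fold` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # "ADD", "SUB" or "MUL"
    lhs: "Node"
    rhs: "Node"


Node = Union[Const, Var, BinOp]

_EVAL = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "MUL": lambda a, b: a * b,
}


def fold(node: Node) -> Node:
    """Recursively fold constants in ADD/SUB/MUL expressions."""
    if not isinstance(node, BinOp):
        return node
    lhs, rhs = fold(node.lhs), fold(node.rhs)
    # Both operands constant: evaluate now.
    if isinstance(lhs, Const) and isinstance(rhs, Const):
        return Const(_EVAL[node.op](lhs.value, rhs.value))
    # Identity and absorbing elements.
    if node.op == "ADD":
        if lhs == Const(0): return rhs
        if rhs == Const(0): return lhs
    if node.op == "SUB" and rhs == Const(0):
        return lhs
    if node.op == "MUL":
        # Note: x * 0 -> 0 ignores NaN/inf; acceptable for this illustration only.
        if Const(0) in (lhs, rhs): return Const(0)
        if lhs == Const(1): return rhs
        if rhs == Const(1): return lhs
    return BinOp(node.op, lhs, rhs)


x = Var("x")
folded = fold(BinOp("MUL", BinOp("ADD", Const(2), Const(3)), BinOp("ADD", x, Const(0))))
print(folded)
```

Here the inner `2 + 3` collapses to a constant and `x + 0` collapses to `x`, so only a single MUL remains. DIV and POW are left out, matching the commit note that they are handled separately.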
George Hotz 2abb474d43
kfd driver wip (#3912)
* kfd driver wip

* cleanups

* kfd almost ready to ring doorbell

* ding dong?

* issues with signals

* something

* works

* ops kfd

* add amd_signal_t

* works...sometimes

* program runs

* _gpu_alloc cleanup

* cleanups

* work

* header + enable profiling (#3959)

* header + enable profiling

* just cleaner

* measure

* only local time domain

* remove old comments

* fix with master

* elf parsing (#3965)

* elf parsing

* fix kernels with private

* not used

* clean up

* clean up 2

* add flags

* kfd sdma (#3970)

* working sdma

* remove driver, shorter

* all commands we might need

* svm

* kfd remove hardcoded values (#4007)

* remove hardcoded values

* match above line

* 7k lines + revert hsa

* update that from origin

* fix sdma reg gen

* not the updated SDMA

* compiler_opts

* don't require kfd_ioctl

* get ioctls from python

* get ioctls from python

* remove build_sdma_command

* merge into 64-bit fields

* shorter

* fix property spelling and off by one

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-30 15:08:12 -07:00
Francis Lam 04746022b1
extra/gemm/hip_matmul: fix to use new HSA devices and no headers (#3999)
* extra/gemm/hip_matmul: fix to use new HSA devices and no headers

* remove compile_hip import
2024-03-30 15:42:23 -04:00
chenyu c71627fee6
move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
Akshit Talwar 0affbbf81c
update amx gemm (#3991) 2024-03-29 11:45:03 -04:00