* refactor to Program class
* switch to Program
* fix tests
* smaller diff
* self.p
* more tests
* fix metal test
* tests
* fix openpilot
* move that to linearizer
* p.launchdims
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
* use at least float32 for optim.lr
when doing mixed precision training (float32 weight, default_float=half), still use float32 to store lr.
it would have been upcast later in the actual weight update, but would have lost precision.
this improved resnet convergence significantly
* undo type annotation
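
The precision argument behind the float32 `lr` change can be illustrated with a minimal sketch. It uses only Python's `struct` module to round a value to half and single precision. The helper name `round_trip` and the sample rate `3e-4` are illustrative assumptions, not taken from the commit.

```python
import struct

def round_trip(fmt: str, x: float) -> float:
    """Round x to the IEEE format named by fmt ('e' = float16, 'f' = float32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

lr = 3e-4  # hypothetical learning rate, chosen only for illustration

lr_half = round_trip("e", lr)              # lr stored in half precision
lr_half_upcast = round_trip("f", lr_half)  # upcasting later keeps the rounded value
lr_single = round_trip("f", lr)            # lr stored as float32 from the start

err_half = abs(lr_half_upcast - lr) / lr
err_single = abs(lr_single - lr) / lr
print(f"float16 relative error: {err_half:.2e}")
print(f"float32 relative error: {err_single:.2e}")
```

Every float16 value is exactly representable in float32, so the upcast changes nothing: the rounding error introduced when `lr` was stored as half is carried into the update.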
* Preprocessing script
* short seq prob
* comments + env vars
* Add preprocessing reference. Add test
* lint fix + add eval test support
* whitespaces
* point to commit
* comment
* rename
* better comments
* pm4 kernel launch works
* disable USE_THREAD_DIMENSIONS
* add kernel code
* work on real pm4
* pm4 signal
* same
* gate pm4
* hcq tests pass
* ops passes
* pm4 is closer
* pm4 debug (#4165)
* start debug tests passing
* prg
* smth
* hdp flush
* cleaner 1
* do not need this
* logs not needed
* small things
* linter
* remove AQL
* test hcq
* fix tests
* it's subtracting, it shouldn't be -1
* pm4 changes (#4251)
* don't need this anymore
* sdma signal with non atomic
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* wmma: widen TC usage in search by using PADTO on TC axes when possible
* test: start tests for the new padding TC behavior
* search: upgrade padded TC search to TC_OPT >= 2
* test: add behavior and correctness test for padded TC
added optional argument to apply_tensor_core to set TC_OPT level
* linearizer: add tests for the PADTO behavior and docs
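
The arithmetic behind padding an axis up to a tensor core dimension can be sketched as follows. This is not the linearizer's code: `pad_to_multiple` is a hypothetical helper, and the sizes are made up for illustration. It only shows what "pad to" means for an axis whose size is not already a multiple of the TC dimension.

```python
def pad_to_multiple(size: int, tc_dim: int) -> int:
    """Return the smallest multiple of tc_dim that is >= size."""
    return -(-size // tc_dim) * tc_dim  # ceiling division, then scale back up

# Hypothetical axis sizes against a hypothetical TC dimension of 16.
for size in (30, 32, 33):
    padded = pad_to_multiple(size, 16)
    print(f"axis {size} -> {padded} (extra elements: {padded - size})")
```

The extra elements are wasted work, which is presumably why the padded search sits behind a higher TC_OPT level rather than being applied unconditionally.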
* start
* fix err 93
* gpu
* ioctl mappings
* alloc like cuda
* semaphores
* wait for semaphores value
* start ops_nv
* very simple kernels work
* init several gpus
* qmd dumper
* dirty, but most of kernels work
* always all test_ops
* progress, more tests, stable
* test_ops passes, gpt2 works
but with big fifo, wrap of fifo doesn't work, i think it's something coherency related
* need better sync
* fix sync
* alloc2
* all tests pass!
* cleanup 1
* cleanup
* multigpu, simple transfer
* fix sync
* correct init
* nv_gpu autogen + sync bug fix
* clean extra/nv_gpu_driver
* p2p
* clean up
* remove old gen
* small fixes
* cleanup
* cleanup 2
* small fixes
* bigger queue size
* cleanups
* wait
* fixed signals for devs
* fix hang + parallel beam
* small fixes
* detect when local memory is big in kernel
* correct assert
* small fixes
* correct tls size est
* one va space
* less lines
* shorter
* save 2 lines
* save some lines
* remove type ignores
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* write llm.c and add a few new methods to tensor
* training works
* add jit
* tests for new functions
* test tolist
* simple fix for onnx test failures (#4186)
* write llm.c and add a few new methods to tensor
* training works
* add jit
* tests for new functions
* bump line count to 7500
* simplest fix
* safenumpy tolist for now
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
---------
Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>
* search: add a BEAM_COMPARE env to optionally not compare to hc/tc
setting BEAM_COMPARE=0 prevents the additional memory allocation
needed for the timing tests, assuming the BEAM result is already in
the diskcache.
* change to optionally use Buffer.allocate
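
A minimal sketch of how an environment flag like BEAM_COMPARE could gate the comparison step. The helper `beam_compare_enabled` is hypothetical, and the default of 1 (compare unless told otherwise) is an assumption inferred from the wording above, not confirmed by the commit.

```python
import os

def beam_compare_enabled(env=None) -> bool:
    """Return True unless BEAM_COMPARE is set to 0 (default of 1 is an assumption)."""
    env = os.environ if env is None else env
    return int(env.get("BEAM_COMPARE", "1")) != 0

if beam_compare_enabled():
    print("would time the hc/tc baselines against the BEAM result")
else:
    print("skipping baseline timing, no extra buffers allocated")
```

Running with `BEAM_COMPARE=0` in the environment takes the second branch.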
* initial version
* heh gimme grrrreen
* version 2
* clean ups
* some test confusion
* fix onnx
* rename to _broadcast_tensors
* improved errors and test
* fixed?
* some test fixup
* version 3 lol
* comments
* cleaner
* add failure test for expand to 0 test
* 1 more assertRaises test
* make err msg better
* also rewrite the expand onnx op? :s
* kfd driver wip
* cleanups
* kfd almost ready to ring doorbell
* ding dong?
* issues with signals
* something
* works
* ops kfd
* add amd_signal_t
* works...sometimes
* program runs
* _gpu_alloc cleanup
* cleanups
* work
* header + enable profiling (#3959)
* header + enable profiling
* just cleaner
* measure
* only local time domain
* remove old comments
* fix with master
* elf parsing (#3965)
* elf parsing
* fix kernels with private
* not used
* clean up
* clean up 2
* add flags
* kfd sdma (#3970)
* working sdma
* remove driver, shorter
* all commands we might need
* svm
* kfd remove hardcoded values (#4007)
* remove hardcoded values
* match above line
* 7k lines + revert hsa
* update that from origin
* fix sdma reg gen
* not the updated SDMA
* compiler_opts
* don't require kfd_ioctl
* get ioctls from python
* remove build_sdma_command
* merge into 64-bit fields
* shorter
* fix property spelling and off by one
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>