* start
* fix err 93
* gpu
* ioctl mappings
* alloc like cuda
* semaphores
* wait for semaphore value
* start ops_nv
* very simple kernels work
* init several gpus
* qmd dumper
* dirty, but most kernels work
* almost all test_ops
* progress, more tests, stable
* test_ops passes, gpt2 works
but with a big fifo, fifo wrap doesn't work; I think it's something coherency-related
* need better sync
* fix sync
* alloc2
* all tests pass!
* cleanup 1
* cleanup
* multigpu, simple transfer
* fix sync
* correct init
* nv_gpu autogen + sync bug fix
* clean extra/nv_gpu_driver
* p2p
* clean up
* remove old gen
* small fixes
* cleanup
* cleanup 2
* small fixes
* bigger queue size
* cleanups
* wait
* fixed signals for devs
* fix hang + parallel beam
* small fixes
* detect when local memory is big in kernel
* correct assert
* small fixes
* correct tls size est
* one va space
* less lines
* shorter
* save 2 lines
* save some lines
* remove type ignores
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>