* re-enable gpt2 half/beam mac benchmark
from the fuzzer it seems to be flaky due to a numerical issue, not a kernel bug. we used to have half in the split reduce.
ran this on an M1 Max for 20 loops and it's fine
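a minimal numpy sketch (not tinygrad code) of why this is numerical rather than a kernel bug: half-precision addition is not associative, so a split reduce can legitimately diverge from a sequential one.
```python
import numpy as np

# summing the same float16 values in a different order (as a split/tree
# reduce does) gives a slightly different result; neither order is "wrong"
vals = np.random.RandomState(0).randn(4096).astype(np.float16)
seq = np.float16(0)
for v in vals: seq = seq + v                    # one long sequential reduce
split = vals.reshape(64, 64).sum(axis=1).sum()  # reduce in two stages
print(seq, split)  # typically differ in the low bits
```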
* that should be jitted
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
* wmma: widen TC usage in search by using PADTO on TC axes when possible
* test: start tests for the new padding TC behavior
* search: upgrade padded TC search to TC_OPT >= 2
* test: add behavior and correctness test for padded TC
added an optional argument to apply_tensor_core to set the TC_OPT level
* linearizer: add tests for the PADTO behavior and docs
* start
* fix err 93
* gpu
* ioctl mappings
* alloc like cuda
* semaphores
* wait for semaphores value
* start ops_nv
* very simple kernels work
* init several gpus
* qmd dumper
* dirty, but most kernels work
* always run all test_ops
* progress, more tests, stable
* test_ops passes, gpt2 works
but with a big fifo, the wrap-around of the fifo doesn't work; i think it's something coherency related
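for context, "wrap" here is the usual ring-buffer indexing sketched below (plain python for illustration, not the driver code); the real fifo is a GPU buffer, where stale cached entries after a wrap would show up exactly like a coherency problem.
```python
# illustrative fifo with wrap-around indexing
class Fifo:
  def __init__(self, size):
    self.buf, self.size, self.head = [None] * size, size, 0
  def push(self, cmd):
    self.buf[self.head % self.size] = cmd  # wraps back to slot 0 at the end
    self.head += 1
```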
* need better sync
* fix sync
* alloc2
* all tests pass!
* cleanup 1
* cleanup
* multigpu, simple transfer
* fix sync
* correct init
* nv_gpu autogen + sync bug fix
* clean extra/nv_gpu_driver
* p2p
* clean up
* remove old gen
* small fixes
* cleanup
* cleanup 2
* small fixes
* bigger queue size
* cleanups
* wait
* fixed signals for devs
* fix hang + parallel beam
* small fixes
* detect when local memory is big in kernel
* correct assert
* small fixes
* correct tls size estimate
* one va space
* less lines
* shorter
* save 2 lines
* save some lines
* remove type ignores
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add DICE loss and metrics
* update dice to include reference implementation's link
* remove unused imports
* remove unnecessary test file and update pred + label for metrics and losses test
* add tests to CI + add exclusion of mlperf_unet3d
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* init
* add failed case
* fix: temp comment out MULACC cast
* is this right?
* add test case
* oops, forgot to get rid of temp test
* WOOOOOO TOOK OUT 2 TRANSPOSES IN GATHER YAY
* cleaner
* comment cleanup
* update docs
* resolve conflict
* oops
* SUPA FAST
* comment out a test
* del some print statements
* use new broadcast stuff
* more clean up
* move try except
* skip fancy indexing for python backend test_ops
the annoying thing about removing all of FlopCounter is that, for a device that does not support local, the matmul index alu count is huge.
we can remove the dtype first.
sneak in updating the `ruff` command to `ruff check`
* tensor cores
* Merge from master
* faster program start in llvm (#3897)
* Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.
* Delete stray line used for breaking tests
* Fix linter error by renaming twice-used variable
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* touchup einsum (#3900)
don't need rhs_letters
* hotfix check ckpts before writing achieved model (#3901)
this killed the tinybox green run
* replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issues since bf16 does not have `.name`.
also more robust if there are lang-specific type overrides
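a hedged sketch of the shape of this change (names and mappings are illustrative, not the real cstyle code): rendering through a per-language table instead of `dtype.name` covers dtypes without a usable `.name` here and gives each language a place to override type names.
```python
# illustrative only: per-language dtype rendering with an override table
TYPE_MAP = {"half": "half", "bfloat16": "__nv_bfloat16"}  # hypothetical entries
def render_dtype(name: str) -> str:
  return TYPE_MAP.get(name, name)  # fall back to the generic name
```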
* add --minimal flag to nvrtc (#3899)
* wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size-8 upcast
on the store alias. now it splits properly into 8, with the
remaining 2 going into the correct local stride
* training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA
memory usage is between float and half due to numpy calls in dataset preprocessing, which convert into float.
* simpler bf16 functions
* bf16 cifar works for HSA too just very slow
* simpler bf16 functions, we love cuda
* include negative float in test_dtype (#3884)
* include negative float in test_dtype
* that is UB (undefined behavior)
* too annoying
* pack can overflow
* add to benchmark
* change var name to satisfy mypy
* spacing
* Update to new TensorCore format
* Spacing
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* kfd driver wip
* cleanups
* kfd almost ready to ring doorbell
* ding dong?
* issues with signals
* something
* works
* ops kfd
* add amd_signal_t
* works...sometimes
* program runs
* _gpu_alloc cleanup
* cleanups
* work
* header + enable profiling (#3959)
* header + enable profiling
* just cleaner
* measure
* only local time domain
* remove old comments
* fix with master
* elf parsing (#3965)
* elf parsing
* fix kernels with private
* not used
* clean up
* clean up 2
* add flags
* kfd sdma (#3970)
* working sdma
* remove driver, shorter
* all commands we might need
* svm
* kfd remove hardcoded values (#4007)
* remove hardcoded values
* match above line
* 7k lines + revert hsa
* update that from origin
* fix sdma reg gen
* not the updated SDMA
* compiler_opts
* don't require kfd_ioctl
* get ioctls from python
* get ioctls from python
* remove build_sdma_command
* merge into 64-bit fields
* shorter
* fix property spelling and off by one
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage
* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures
* clean up naming and use set
* env var to change default float to fp16 or bf16
looking for standard names for these. we have FLOAT16, which does something to IMAGE, and HALF to convert weights.
working on a bf16 default too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
__bf16 cast0 = (nv_bfloat16)(val0);
```
remove that in cifar
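a minimal sketch of the mechanism (env var values and the exact plumbing are assumptions, not the real tinygrad wiring):
```python
import os

# pick the default float dtype from an env var; unknown values fall back
# to float32
DEFAULTS = {"FLOAT32": "float32", "HALF": "float16", "BFLOAT16": "bfloat16"}
default_float = DEFAULTS.get(os.getenv("DEFAULT_FLOAT", "FLOAT32"), "float32")
```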
* DEFAULT_FLOAT
* default of default
* unit test
* don't check default
* tests work on linux
* remove HIP in core tinygrad
the ci test uses device RHIP and the HSA compiler (LinearizerOpt), so it's fine to remove HIP from tc.
also updated the README and the EMULATE tc test flag
* EMULATE_CUDA
* feat: initial xor
* feat: initial threefry
* feat: remove custom random
* fix: really need to install precommit
* feat: lmao forgot that this is a rotate, not a shift
* clean: put that there
* feat: numpy xor
* feat: quick test for xor
* feat: llvm xor
* feat: slightly working xor in torch
* feat: rand works in jit
* clean: save a line
* feat: match jax
* feat: maybe test against jax
* feat: requires_grad
* fix: fix test_symbolic_ops
* feat: lower alpha
* feat: just pad
* fix: maybe fix training tests?
* fix: fix some llvm stuff
* feat: cursed realize on the way out
* feat: testing jax
* fix: why is the jax install process not simple
* fix: maybe passing test
* fix: symbolic workarounds
* clean: still need that precommit
* fix: aaaa
* fix: more test fixes
* fix: quick fix for wgsl
* feat: need to set requires_grad on the final tensor
* feat: one more tensor
* feat: don't take forever
* feat: seeing why ci is broken
* feat: can't allocate 64GiB lmao
* fix: fix this
* feat: hope this doesn't break smth before i go to bed
* feat: don't destroy ram
* feat: int
* feat: remove jax
* feat: properish workaround?
* feat: skip slow webgpu tests
* feat: no longer fails
* feat: use dtypes
* feat: real number
* fix: torch
* fix: don't test against reference for torch
* feat: to device
* feat: fix advanced indexing
* feat: correct casting
* feat: even rng_counter
* feat: match master
* feat: this was actually bad
* fix: maybe?
* feat: store
* feat: remove realizes
* feat: somehow this is important
* feat: somehow this is also important
* feat: save a line
* fix: don't need that anymore
* feat: restore this
* fix: linter
* feat: remove realizes
* fix: realized is in base now
* fix: add back cast
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: :(
* fix: :(
* fix: not being dumb
* feat: try changing less tests
* feat: shouldn't have to change that
* feat: contiguous bumps it by one
* fix: hmm
* fix: numpy memory moment
* fix: cl_khr_fp16
* fix: torch has different tensor count
* fix: missing contiguous
* hmm: hmm
* fix: some fixes
* fix: typing
* feat: don't do that
* feat: typing fixes
* feat: why is this realize required?
* feat: ngl kinda odd typing
* feat: oh
* feat: remove realizes
* feat: why is this realize required?
* fix: hacky patch for cudacpu
* fix: without this realize pytest crashes?????
* fix: shorter line
* fix: cudacpu fixes
* fix: cudacpu fixes
* feat: real buffer
* feat: don't search when searching lmao
* fix: can't use contiguous things
* fix: no more 100GB arrays
* fix: revert
* fix: skip 7 and 10
* feat: working ish beam
* feat: minimize changes
* feat: seed 0 stable diffusion example changed
* fix: different on ci
* fix: no beam
* feat: make threefry optional
* fix: check value
* fix: unused import
* feat: threefry default
* fix: 5d
* feat: allow non upcast div
* fix: 5d better
* fix: 5d better
* fix: save all dtype
* feat: proper error
* feat: lazyop key
* fix: check float
* feat: try removing this realize now
* feat: disable threefry for uops hip tensor cores
* feat: don't need that
* feat: only check upcast
* fix: disable threefry for some metal tests
* feat: disable for metal tensor uops as well
* feat: disable for most uops
* fix: disable threefry for new uops tests
* feat: multitensor
* fix: typing
* feat: threefry default off
* feat: skip threefry half rand
* feat: restore old
* fix: bad git
* clean: ruff
* feat: bfloat16 fix
* fix: :|
* feat: restore old
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* this is a lot of stuff
TEST_TRAIN env for less data
don't diskcache get_train_files
debug message
no lr_scaler for fp32
comment, typo
type stuff
don't destructure proc
make batchnorm parameters float
make batchnorm parameters float
resnet18, checkpointing
hack up checkpointing to keep the names in there
oops
wandb_resume
lower lr
eval/ckpt use e+1
lars
report top_1_acc
some wandb stuff
split fw and bw steps to save memory
oops
save model when reach target
formatting
make sgd hparams consistent
just always write the cats tag...
pass X and Y into backward_step to trigger input replace
shuffle eval set to fix batchnorm eval
dataset is sorted by class, so the means and variances are all wrong
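a small numpy illustration of the problem (synthetic data, not the real pipeline): with a class-sorted dataset every eval batch is single-class, so per-batch means drift far from the population statistics batchnorm expects.
```python
import numpy as np

# four "classes" with means 0..3; per-batch means, sorted vs shuffled
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(c, 1.0, 100) for c in range(4)])
print(data.reshape(4, 100).mean(axis=1))                   # ~[0 1 2 3]: biased
print(rng.permutation(data).reshape(4, 100).mean(axis=1))  # all ~1.5: fine
```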
small cleanup
hack restore only one copy of each tensor
do bufs from lin after cache check (lru should handle it fine)
record epoch in wandb
more digits for topk in eval
more env vars
small cleanup
cleanup hack tricks
cleanup hack tricks
don't save ckpt for testeval
cleanup
diskcache train file glob
clean up a little
device_str
SCE into tensor
small
small
log_softmax out of resnet.py
oops
hack :(
comments
HeNormal, track gradient norm
oops
log SYNCBN to wandb
real truncnorm
less samples for truncated normal
custom init for Linear
log layer stats
small
Revert "small"
This reverts commit 988f4c1cf35ca4be6c31facafccdd1e177469f2f.
Revert "log layer stats"
This reverts commit 9d9822458524c514939adeee34b88356cd191cb0.
rename BNSYNC to SYNCBN to be consistent with cifar
optional TRACK_NORMS
fix label smoothing :/
lars skip list
only weight decay if not in skip list
comment
default 0 TRACK_NORMS
don't allocate beam scratch buffers if in cache
clean up data pipeline, unsplit train/test, put back a hack
remove print
run test_indexing on remu (#3404)
* emulated ops_hip infra
* add int4
* include test_indexing in remu
* Revert "Merge branch 'remu-dev-mac'"
This reverts commit 6870457e57dc5fa70169189fd33b24dbbee99c40, reversing
changes made to 3c4c8c9e16.
fix bad seeding
UnsyncBatchNorm2d but with synced trainable weights
label downsample batchnorm in Bottleneck
:/
:/
i mean... it runs... it hits the acc... it's fast...
new unsyncbatchnorm for resnet
small fix
don't do assign buffer reuse for axis change
* remove changes
* remove changes
* move LARS out of tinygrad/
* rand_truncn rename
* whitespace
* stray whitespace
* no more gnorms
* delete some dataloading stuff
* remove comment
* clean up train script
* small comments
* move checkpointing stuff to mlperf helpers
* if WANDB
* small comments
* remove whitespace change
* new unsynced bn
* clean up prints / loop vars
* whitespace
* undo nn changes
* clean up loops
* rearrange getenvs
* cpu_count()
* PolynomialLR whitespace
* move he_normal out
* cap warmup in polylr
* rearrange wandb log
* realize both x and y in data_get
* use double quotes
* combine prints in ckpts resume
* take UBN from cifar
* running_var
* whitespace
* whitespace
* typo
* if instead of ternary for resnet downsample
* clean up dataloader cleanup a little?
* separate rng for shuffle
* clean up imports in model_train
* clean up imports
* don't realize copyin in data_get
* remove TESTEVAL (train dataloader didn't get freed every loop)
* adjust wandb_config entries a little
* clean up wandb config dict
* reduce lines
* whitespace
* shorter lines
* put shm unlink back, but it doesn't seem to do anything
* don't pass seed per task
* monkeypatch batchnorm
* the reseed was wrong
* add epoch number to desc
* don't use unsyncedbatchnorm if syncbn=1
* put back downsample name
* eval every epoch
* Revert "the reseed was wrong"
This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.
* cast lr in onecycle
* support fp16
* cut off kernel if expand after reduce
* test polynomial lr
* move polynomiallr to examples/mlperf
* working PolynomialDecayWithWarmup + tests.......
add lars_util.py, oops
* keep lars_util.py as intact as possible, simplify our interface
* no more half
* polylr and lars were merged
* undo search change
* override Linear init
* remove half stuff from model_train
* update scheduler init with new args
* don't divide by input mean
* mistake in resnet.py
* restore whitespace in resnet.py
* add test_data_parallel_resnet_train_step
* move initializers out of resnet.py
* unused imports
* log_softmax to model output in test to fix precision flakiness
* log_softmax to model output in test to fix precision flakiness
* oops, don't realize here
* is None
* realize initializations in order for determinism
* BENCHMARK flag for number of steps
* add resnet to bechmark.yml
* return instead of break
* missing return
* cpu_count, rearrange benchmark.yml
* unused variable
* disable tqdm if BENCHMARK
* getenv WARMUP_EPOCHS
* unlink disktensor shm file if exists
* terminate instead of join
* properly shut down queues
* use hip in benchmark for now
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* lars optimizer + tests
* fix skip list!
* use id to compare in skip list
* go back to using set
* Tensor(bool) * Tensor(bool) is and
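quick usage example (assuming the standard tinygrad `Tensor` API):
```python
from tinygrad import Tensor

a = Tensor([True, True, False])
b = Tensor([True, False, False])
print((a * b).numpy())  # [ True False False] -- multiply acts as logical AND
```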
* don't lint external/mlperf_resnet
* whitespace
* add external_test_optim to opencl tests
* give mlperf task a name
* mlperf under onnx
* remove track_gnorm
* contiguous instead of realize
* assert momentum and weight decay positive
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* run test_linearizer_failures on PYTHON backend
only test 1, some have hanging issues and gated store is not implemented
* --durations=20
* two less slow ones
* Cast correctly in python emulator
* Update test yml and fix lint
* make ruff pass
* mypy passes
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
* remove cpu and torch backends
* don't copy to cpu
* use clang instead of cpu
* multitensor gathers on the first device
* clang is cpu + use default
* fixup
* bugfix
* fix OverflowError in UnaryOps.EXP2
* avoid accessing outputs for void uops
* skip execution for UOps.IF and UOps.ENDIF
* initialize bytearray to the correct size in UOps.DEFINE_LOCAL
* validate len of input that has .sz > 1
* remove comment in code
* reinitialize the loop if already iterated
* validate first value in input to be a list for inputs with .sz > 1
* add python ops tests to CI
* skip long runtime tests for PYTHON backend
* respect dtype.sz arg in UOps.CONST, and remove incorrect validation in UOps.STORE
* use math.inf instead of float('inf')
* handle 0 args to UnaryOps.LOG2
* handle load op with default of .sz > 1
* initialize the loop correctly using UOps.LOOP arg
* remove unnecessary TODO comment
* remove newline
* select a subset of 22 ops tests to skip in CI when PYTHON=1
* handle gated UOps.LOAD referencing values that have .sz > 1
* Revert "select a subset of 22 ops tests to skip in CI when PYTHON=1"
This reverts commit 7674fee81d37f8865cdcc72cc0f06f67cdf59783.
* skip tests in python backend CI command
* push fix lost in conflict resolve
* Revert "skip long runtime tests for PYTHON backend"
This reverts commit 5dd2a0376e653319551c7056742d61a5fd98f60a.
* clear loop state after last iteration
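a sketch of the vectorized-value convention these fixes converge on (illustrative names, not the actual ops_python code): a value with `.sz > 1` travels as a list of lanes, so a scalar CONST gets broadcast and list inputs get length-checked.
```python
# illustrative lane handling for a python UOp emulator
def const(arg, sz: int):
  return [arg] * sz if sz > 1 else arg  # broadcast a scalar to sz lanes

def check_lanes(val, sz: int):
  assert sz == 1 or (isinstance(val, list) and len(val) == sz), "bad lane count"

check_lanes(const(1.0, 4), 4)
```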
* generic rendering of half and bf16
hotfix
* fix uops + regression test
* fix the test for metal's half4
* uop.uop fixup
* mypy with --strict-equality, fix ops_gpu
* ops_python: add HIP tensor core mock and refactor METAL
* Add tests to CI
* add DEBUG=2 to full tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp
* test_gemm passes
* metal gemm test pass
* works on big gemm
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style
run on TORCH since it's the fastest one on CI.
caught a bug in multinomial, and updated the behavior of fancy index and gather to move the indices Tensor to the same device as self.
* shrink MLB on sharded axis
use a onehot structure to store the real partition. the goal is an unsynced batchnorm2d that can run on multiple GPUs for training.
draft version in https://github.com/chenyuxyz/tinygrad/pull/109
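a toy numpy sketch of the onehot-partition idea (not the MultiLazyBuffer internals): each device keeps its shard, and a per-device flag records which shards are "real" for a shrunken view, so a shrink along the sharded axis needs no data movement.
```python
import numpy as np

shards = np.split(np.arange(16), 4)  # axis sharded across 4 "devices"
real = [True, True, False, False]    # shrink keeps only the first two shards
view = np.concatenate([s for s, r in zip(shards, real) if r])
print(view)  # [0 1 2 3 4 5 6 7]
```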
* SYNCBN flag
* test unclean shrinks
* UnsyncedBatchNorm reuses BatchNorm
* more robust pad arg check
* better types
* more tests!
* 6 gpus in benchmark
* disable slow GPUS=6 benchmark
* mockhip->hipcpu
* allocate buffers
* launch a kernel
read_asm api
* run remu in CI
* remu 0.0.2, real test ops
* simple driver
* 0.0.3, all test_ops
* run the latest emulator
* 9 minutes is way too long, drop backprop in CI
* bring back the backward pass
* Revert "bring back the backward pass"
This reverts commit 3781e1bc56fc06b424e7c7bed1224f819247fb8f.
* Print slowest tests
* emulated device directly in ops_hip
* fix ruff, override mypy for specific rules
* test in the same code path
- hip backend env variables
- install packages and verify autogen
- run certain tests
- remove the other hip tests path
- verify Device.DEFAULT
* remove the emulated hip in extra
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* move gpuctypes in tree
* fix mypy
* regex exclude
* autogen sh
* mypy exclude
* does that fix it
* fix mypy
* add hip confirm
* verify all autogens
* build clang2py
* opencl headers
* gpu on 22.04
* WebGL WIP
* 84% of ops passing test
* tests passing 100%
* Cleanup, refactor
* Shave off some lines
* Work on dtypes
* TestOps at 100% again
* Efficient net shaders compile in browser webgl2
* Compile all efficientnet shaders in browser
* Create empty textures for tensor buffers
* Run program. Up next weight loading
* Exported WebGL model working
* Add tests, refactor
* Explicit cast alu for GLSL
* Fix CI tests
* WebGL efficientnet demo
* Compile and run yolov8 in browser
* Fix imports
* Simplify yolo compile
* Fix bool*bool and cast cmplt to float
* More tests
* Do std tests pass on CI?
* Skip std tests on CI
* Remove explicit_cast_alu hack, and solve it in code_for_op
* Move to new dtype-less alloc api
* Remove local size hack: optimize local_size only if device has local
* Remove glsl.py, and move content to cstyle
* dont_use_locals in opts
* Fix dtype tests
* type_map in CStyleLanguage
* Make core changes smaller, cleaner, refactor export_model and demo
* Skip pad_slice
* Simplify: render_const, render_conditional
* solve bool alu for other binops, cleaner ops_webgl
* Fix noopt hack
* Remove some skipIfs
* WebGL image hack
* type_names is a better name
* global_max
* Fix dtype import
* Fix type_names -> type_map
* Fix lint
* Remove webgpu, back to 5k lines (#3040)
* remove webgpu
* max 5000 lines
* revert those to master
* retain that cstyle
---------
Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
* simple multitensor API
* test multitensor
* mt work
* new api
* copies
* all but data parallel
* allreduce there
* works, but axis sharded
* fix all mt tests
* features/multi
* work
* backprop
* fix tests
* tests passing
* mt progress
* cleanups
* less lines
* tensor cleanup
* save more lines
* mypy passes
* fix tests
* skip for cuda too
* bump download cache
* switch CI to tiny8
* no copyin for disk
* Revert "no copyin for disk"
This reverts commit eb46b7e93da4a650d8125020c38f44d1f8f2c86e.
* rocm 6 broke llama
* rename it
* print DEBUG for TC=2 in CI
* enable TC=2
* no need to check src type
* LOAD has side effect
* don't push any local buffer
* update comment
* and BARRIER
* lazy rewrite, try 2
* min fix tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7
* half is broken
* 1e-6 margin of error
* invert (broken)
* decent invert
* shapetracker invert works
* plus is meh, invert is good
* support invert mask
* a few more invert tests
* shapetracker math invert test
* validate stable diffusion for seed 0
the closest false positive i can get is with the same setup and one less step: dist = 0.0036.
the same setup with fp16 has dist = 5e-6,
so setting the validation threshold to 1e-4 should be good
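the check amounts to something like this sketch (file names and the exact distance metric are assumptions):
```python
import numpy as np

out = np.load("out_seed0.npy").astype(np.float64)
ref = np.load("ref_seed0.npy").astype(np.float64)
dist = float(np.abs(out - ref).mean())
assert dist < 1e-4, f"stable diffusion seed-0 validation failed: dist={dist}"
```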
* run with --seed 0
* `global_load` and `global_store` using buffer dtype
* `UOps.PHI` in all dtypes
* `UOps.ALU` in all dtypes
* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes
* -- endof implementation --
+tiny lint changes
* these tests require the fp16 extension
you can run them locally to confirm they're green (the GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)):
`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`
skip the new test_linearizer_failures in CI GPU because of the fp16 extension.
this passes on a real GPU since the extension is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`
see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)
* these tests fail in CI due to segfaults and CPU crashes
To confirm they're green locally, you can run the following commands:
1. For the tests skipped in test_ops.py (note: CLANG is very slow)
`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`
2. For the ONNX tests skipped in CLANG:
```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```
3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific
`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`
* Revert "these tests fail in CI due to segfaults and CPU crashes"
This reverts commit 15db57014381a4449d563526ac6c870e36257658.
* merge with cleanup-vectorized-hip-renders
* barely working HIP P1, ALU ops need a refactor?
* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) while half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) which can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:
```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```
* update the sign definition to avoid division by zero in all dtypes
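the standard dtype-safe form (a generic sketch, not necessarily the exact tinygrad definition) avoids `x / abs(x)` and its division by zero at x == 0:
```python
# branchless sign with no division: returns -1, 0, or 1, safe at x == 0
def sign(x):
  return (x > 0) - (x < 0)

print([sign(v) for v in (-2.5, 0.0, 3.0)])  # [-1, 0, 1]
```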
* diff cleanup p1: why were these in the diff anyways
* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!
add ALU ops overloads for HIP
this will make HIP max work
handle mod
Revert "handle mod"
This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.
update max to use hmax
add HIP GEP render logic
enable CIFAR fp16 benchmark
test ops for HIP
back to store as float because this only works for float4 grouping right now
test_ops for hip!!
always sign
* back to the sign we had before because we can't do a backward pass on a Less node
* remove old hacks
HIP compiling test_ops in CI takes ~9 mins, not doing it for now
new HIP ALUs
* reduce accs done right
* refactor to function
* no device hacks
hacks p2
the other way
* LLVM ALU ops
half, float and double are all float
update max
* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool
* cleanup LLVM wrong code
* dummy change for the CUDA install glitch
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* remove pytest marks
* test more stuff
* fine revert some
* add that mark back
* skip that
* hmm LLVM does not work on ubuntu
* too slow on CUDA CI
* dup test
* ops_gpu is go
* fix size 0
* fix image, and add more tests
* nerf openpilot test, doesn't test thneed
* run the schedule
* better
* oops, new inputs
* delete pyopencl
* Update ops_gpu.py
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* rebalance
* balance
* parallel apt-get for all
* .local/lib/python3.11/site-packages
* what is user doing
* is that path right
* Update test.yml
* okay where are you
* site-packages
* image support weird loads
* umm, that was always wrong
* openpilot compile fails with a weird error
* image test passes
* we have valids now
* clean that up
* no more required opts
* add fastvits test, fix bug
* minor cleanups
* hip amd compilation
* gate the test properly
* cleanup unused import
* remove superfluous numpy conversion
* add SpeedyNet tests (f32 [passes] & f16 [fails])
* make CI verbose (error log from hip compiler)
* test the real ops_hip
* Merge branch 'tinygrad:master' into ci/hip-compilation
* fix CI
* cleanup
* really fix CI
* Fix CI Three: the refixening
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* hip amd compilation
* gate the test properly
* cleanup unused import
* remove superfluous numpy conversion
* add SpeedyNet tests (f32 [passes] & f16 [fails])
* make CI verbose (error log from hip compiler)
* test the real ops_hip
* Merge branch 'tinygrad:master' into ci/hip-compilation
* fix CI
* cleanup
* really fix CI
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* i love the walrus so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
* need env.
* ugh, wrong path
* replace get_child
* force rebuild of ocelot
* SzymonOzog gpuocelot
* delete that
* downgrade that
* non parallel
* force rebuild
* use llvm
* nauto
* less mem maybe
* print test
* helper_test_exception skip CUDACPU
* helper_test_exception
* shippable
* very close
* remove comment
* negative strides working
* almost everything passes
* calculate offset with list comprehension
* some cleanup
* got disk load working
* review suggestions
* fix after merge
* overlap working
* did it
* clean
* fixed disk load
* lint
* mypy
* removed as_strided
* trying without simplify
* added back simplify
* make sure expanding to smaller shape
* cleanup
* removed comment
* removed env file
* trying whisper test again
* onnx test sqlite issue
* working on test
* finished test
* eliminate unnecessary shrink-then-pad
* don't shrink buffer
* added strides check
* added to ci under linters
* switch issue
* allow symbolic stride
* removed .env
* isinstance
* adjust strides for double expand
* cleanup
* needed to add type hint for mypy
* set pythonpath
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test