* keep old cookie around until next step has been queued (-10ms 6gpu)
* also for eval
* drop cookie before data_get?
* Revert "drop cookie before data_get?"
This reverts commit b01e6aa2b27f49aeab04b448f09e0ef9e689ea53.
* Revert "Revert "drop cookie before data_get?""
This reverts commit 23464e73d445007c15537c69818fdee89adf0740.
* mlperf/resnet: update beam params to increase time and quality
* revert upcast 8 in search space and add rocm setup function
* refactor to independent setup.sh script
it did not ask too many details. will put software versions later with tinygrad commit.
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_red.json training 4.0.0
INFO - System description checker passed for tinybox red
```
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_green.json training 4.0.0
INFO - System description checker passed for tinybox green
```
* fix mean underflow for half tensor
divide only the reduce factor. added unit test and non-nan assertion in resnet training. also added a failed test case for symbolic shape var
* skip for python backend
* kernel: change PADTO check to allow up to 4x padding
also optionally remove PADTO from the search action space with
BEAM_PADTO=0.
* fix test_linearizer test_tensor_cores_padded tests
* update resnet runs to use SPLIT_REDUCEOP=1
* fix up search TC axis and amt checking
* fix up the dimensions of the TC tests
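A minimal sketch of the PADTO rule mentioned above, assuming a pad is accepted only while the padded axis is at most 4x the original size. `round_up`, `padto_allowed` and `search_actions` are hypothetical names for illustration, not tinygrad's API, and the default of `BEAM_PADTO` being on is an assumption.

```python
import os

def round_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple`."""
    return (n + multiple - 1) // multiple * multiple

def padto_allowed(axis_size: int, multiple: int, max_ratio: int = 4) -> bool:
    """Accept a pad only if the padded size is at most max_ratio x the original."""
    if axis_size <= 0 or multiple <= 0:
        return False
    return round_up(axis_size, multiple) <= max_ratio * axis_size

def search_actions(env=os.environ):
    """Hypothetical action list: PADTO is dropped when BEAM_PADTO=0."""
    actions = ["UPCAST", "LOCAL", "UNROLL"]
    if int(env.get("BEAM_PADTO", "1")):  # assumed default: PADTO stays in the search space
        actions.append("PADTO")
    return actions
```

With these assumptions, padding an axis of 8 to 32 is allowed (exactly 4x), while padding 7 to 32 is rejected.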
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
we want to have different BEAM values for resnet train and eval. global JITBEAM cannot do this. added the flag to change beam behavior at cnt=0 (so by default it behaves the same with or without TinyJit), and for cnt=1 it uses existing BEAM.value.
Also updated the context var BEAM in resnet to be outside of TinyJit. saves about 3 minutes compile time
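A rough sketch of the idea of scoping BEAM separately for train and eval. `Setting`, `beam` and `compile_step` are stand-ins written for this note, not tinygrad's context-variable machinery, and the widths 4 and 2 are illustrative only.

```python
from contextlib import contextmanager

class Setting:
    """A tiny stand-in for a global tunable such as BEAM (hypothetical)."""
    def __init__(self, value: int):
        self.value = value

BEAM = Setting(0)

@contextmanager
def beam(value: int):
    """Temporarily override BEAM.value, restoring the old value on exit."""
    old = BEAM.value
    BEAM.value = value
    try:
        yield
    finally:
        BEAM.value = old

def compile_step(name: str) -> str:
    # A real implementation would run a kernel search here; this only records the width used.
    return f"{name}: BEAM={BEAM.value}"

with beam(4):   # illustrative train width
    train_log = compile_step("train")
with beam(2):   # illustrative eval width
    eval_log = compile_step("eval")
```

Because the override wraps the call site rather than living inside the jitted function, each step sees its own value and the global default is untouched afterwards.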
* add DICE loss and metrics
* update dice to include reference implementation's link
* remove unused imports
* remove unnecessary test file and update pred + label for metrics and losses test
* add tests to CI + add exclusion of mlperf_unet3d
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
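For reference, a minimal sketch of the plain binary Dice score and the loss derived from it. This is the textbook formula on flat 0/1 masks with an assumed `eps` smoothing term. It is not the unet3d reference implementation, which may differ in detail.

```python
def dice_score(pred, label, eps: float = 1e-6) -> float:
    """Dice = 2 * |intersection| / (|pred| + |label|) for flat binary masks."""
    if len(pred) != len(label):
        raise ValueError("masks must have the same length")
    intersection = sum(p * l for p, l in zip(pred, label))
    total = sum(pred) + sum(label)
    return (2.0 * intersection + eps) / (total + eps)

def dice_loss(pred, label) -> float:
    """Loss is one minus the score, so perfect overlap gives zero loss."""
    return 1.0 - dice_score(pred, label)
```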
* resnet individual layer benchmarks!
* small
* 1 and 2
* mem_used
* no ci
* better conv print
* defaults
* prints
* adjust
* adjust
* adjust
* benchmark only one layer example
* tensor.training, zero_grad, sum instead of mean, last mem, last kernel count
* default jitcnt=1
* scale flops/kernels with jitcnt
* add note about jitcnt memory
* touchup
* write llm.c and add a few new methods to tensor
* training works
* add jit
* tests for new functions
* test tolist
* simple fix for onnx test failures (#4186)
* write llm.c and add a few new methods to tensor
* training works
* add jit
* tests for new functions
* bump line count to 7500
* simplest fix
* safenumpy tolist for now
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
---------
Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>
* rewrite the jit in the context of new schedule
* mypy better
* fix placeholder
* tests
* all functionality should work
* fix tests
* no CacheCollector
make it read nicer and cleanup some movement methods and math simplification.
790m, 1.4b, 2.8b models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths and stuff copied from torch.
* first commit
* state back to orig
* mamba comparisons
* rm file
* rename file
* use Tensor.einsum and make default model 370M
* Cleaned code and made a comparison test
* Simplify pull request. Only has 1 mamba implementation now.
* Update prompt
* rm whitespaces
* last space
* remove Einops dependency
* rm unused code
* add tests
* rm print statement
* rm imports
* skip CLANG
* Update skipIf description
* skip model test in CI and add CLANG fix
* rm Device import
* don't be stupid
* Fix conv assign
When the prompt is too short, the logic for conv_state assign messes up. This can be fixed by padding the tokenized array to a min length of 4. I padded using the empty string token, but idk if proper practice is to use the PAD token
* fix p1
* temp
* fix jit import
---------
Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
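A small sketch of the short-prompt fix described in "Fix conv assign": pad the token list to a minimum length of 4. `pad_tokens` is a hypothetical helper. Padding on the left and using `pad_id=0` are assumptions, since the message only says the empty string token was used.

```python
def pad_tokens(tokens, min_len: int = 4, pad_id: int = 0):
    """Left-pad a token list with pad_id until it has at least min_len entries."""
    missing = max(0, min_len - len(tokens))
    return [pad_id] * missing + list(tokens)
```

Prompts that already have four or more tokens pass through unchanged.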
* fp16 resnet
* cast running mean and var back to default float
* extra cast
* check symbolic no overflow
* add linearizer failure
* loss scaler after grad contig
* oops
* i think this works
* don't loss scale fp32
* remove overflow test case
* remove symbolic bounds check
* loss scaler should be float
* temporarily disable padto cuz bug
shruggie
* make running stats in batchnorm float32?
* calculate lars stuff in fp32?
* oops
* remove most changes
* move loss scaler out of optimizer
* no more FP16 var
* oops
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
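A sketch of why a loss scaler helps fp16 training, using only `struct` to round values to half precision. The scale of 1024 and the gradient value are illustrative and not taken from the training script.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

LOSS_SCALE = 1024.0  # illustrative value

tiny_grad = 1e-8                          # below half's smallest subnormal (~6e-8)
unscaled = to_half(tiny_grad)             # flushes to zero in fp16
scaled = to_half(tiny_grad * LOSS_SCALE)  # survives the fp16 round trip
recovered = scaled / LOSS_SCALE           # unscale in full precision
```

Keeping the scaler itself in float, as one of the commits above notes, means the unscale step does not reintroduce the underflow.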
* env var to change default float to fp16 or bf16
looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.
working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
__bf16 cast0 = (nv_bfloat16)(val0);
```
remove that in cifar
* DEFAULT_FLOAT
* default of default
* unit test
* don't check default
* tests work on linux
* training cifar with BF16 on CUDA
memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float.
* simpler bf16 functions
* bf16 cifar works for HSA too just very slow
* simpler bf16 functions, we love cuda
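A hedged sketch of selecting the default float from a `DEFAULT_FLOAT` environment variable. The accepted spellings and the `FLOAT_TYPES` table are assumptions for illustration, not tinygrad's `dtypes` object.

```python
import os

# Hypothetical name -> (dtype label, bytes per element) table.
FLOAT_TYPES = {
    "FLOAT32": ("float32", 4),
    "HALF": ("float16", 2),
    "BFLOAT16": ("bfloat16", 2),
}

def default_float(env=os.environ):
    """Pick the default float from DEFAULT_FLOAT, falling back to float32."""
    name = env.get("DEFAULT_FLOAT", "FLOAT32").upper()
    if name not in FLOAT_TYPES:
        raise ValueError(f"unsupported DEFAULT_FLOAT={name!r}")
    return FLOAT_TYPES[name]
```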
copy scale on all devices for now. naive sharding does not work because scale needs an expand to really save memory.
70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.
`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`
13B on 6 gpus uses 47 GB v.s. 34 GB quantized
This is how bf16 load is tested in test_bf16_disk_write_read now and it should fix #2775.
I tested that it fixed loading coder using PYTHON backend.
Will separate this special bf16 load from regular bf16 support
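The message does not spell out the load path, so the following is only a sketch of the standard relationship between the formats: a bfloat16 value is the high 16 bits of a float32. The helper names are hypothetical.

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by placing it in the high 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def float_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 (drops the low 16 bits, no rounding)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16
```

A loader can therefore read bf16 data as 16-bit integers and widen it without any device support for the bf16 type itself.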
* simple LoadOps.ASSIGN
* skip that test
* don't assign in onnx ops gemm
* track cache usage
* recreate the lazybuffer to avoid the cache
* fix contigs
* skip that test
* lol
* better letters
* this is a lot of stuff
TEST_TRAIN env for less data
don't diskcache get_train_files
debug message
no lr_scaler for fp32
comment, typo
type stuff
don't destructure proc
make batchnorm parameters float
make batchnorm parameters float
resnet18, checkpointing
hack up checkpointing to keep the names in there
oops
wandb_resume
lower lr
eval/ckpt use e+1
lars
report top_1_acc
some wandb stuff
split fw and bw steps to save memory
oops
save model when reach target
formatting
make sgd hparams consistent
just always write the cats tag...
pass X and Y into backward_step to trigger input replace
shuffle eval set to fix batchnorm eval
dataset is sorted by class, so the means and variances are all wrong
small cleanup
hack restore only one copy of each tensor
do bufs from lin after cache check (lru should handle it fine)
record epoch in wandb
more digits for topk in eval
more env vars
small cleanup
cleanup hack tricks
cleanup hack tricks
don't save ckpt for testeval
cleanup
diskcache train file glob
clean up a little
device_str
SCE into tensor
small
small
log_softmax out of resnet.py
oops
hack :(
comments
HeNormal, track gradient norm
oops
log SYNCBN to wandb
real truncnorm
less samples for truncated normal
custom init for Linear
log layer stats
small
Revert "small"
This reverts commit 988f4c1cf35ca4be6c31facafccdd1e177469f2f.
Revert "log layer stats"
This reverts commit 9d9822458524c514939adeee34b88356cd191cb0.
rename BNSYNC to SYNCBN to be consistent with cifar
optional TRACK_NORMS
fix label smoothing :/
lars skip list
only weight decay if not in skip list
comment
default 0 TRACK_NORMS
don't allocate beam scratch buffers if in cache
clean up data pipeline, unsplit train/test, put back a hack
remove print
run test_indexing on remu (#3404)
* emulated ops_hip infra
* add int4
* include test_indexing in remu
* Revert "Merge branch 'remu-dev-mac'"
This reverts commit 6870457e57dc5fa70169189fd33b24dbbee99c40, reversing
changes made to 3c4c8c9e16.
fix bad seeding
UnsyncBatchNorm2d but with synced trainable weights
label downsample batchnorm in Bottleneck
:/
:/
i mean... it runs... it hits the acc... it's fast...
new unsyncbatchnorm for resnet
small fix
don't do assign buffer reuse for axis change
* remove changes
* remove changes
* move LARS out of tinygrad/
* rand_truncn rename
* whitespace
* stray whitespace
* no more gnorms
* delete some dataloading stuff
* remove comment
* clean up train script
* small comments
* move checkpointing stuff to mlperf helpers
* if WANDB
* small comments
* remove whitespace change
* new unsynced bn
* clean up prints / loop vars
* whitespace
* undo nn changes
* clean up loops
* rearrange getenvs
* cpu_count()
* PolynomialLR whitespace
* move he_normal out
* cap warmup in polylr
* rearrange wandb log
* realize both x and y in data_get
* use double quotes
* combine prints in ckpts resume
* take UBN from cifar
* running_var
* whitespace
* whitespace
* typo
* if instead of ternary for resnet downsample
* clean up dataloader cleanup a little?
* separate rng for shuffle
* clean up imports in model_train
* clean up imports
* don't realize copyin in data_get
* remove TESTEVAL (train dataloader didn't get freed every loop)
* adjust wandb_config entries a little
* clean up wandb config dict
* reduce lines
* whitespace
* shorter lines
* put shm unlink back, but it doesn't seem to do anything
* don't pass seed per task
* monkeypatch batchnorm
* the reseed was wrong
* add epoch number to desc
* don't unsyncedbatchnorm if syncbn=1
* put back downsample name
* eval every epoch
* Revert "the reseed was wrong"
This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.
* cast lr in onecycle
* support fp16
* cut off kernel if expand after reduce
* test polynomial lr
* move polynomiallr to examples/mlperf
* working PolynomialDecayWithWarmup + tests.......
add lars_util.py, oops
* keep lars_util.py as intact as possible, simplify our interface
* no more half
* polylr and lars were merged
* undo search change
* override Linear init
* remove half stuff from model_train
* update scheduler init with new args
* don't divide by input mean
* mistake in resnet.py
* restore whitespace in resnet.py
* add test_data_parallel_resnet_train_step
* move initializers out of resnet.py
* unused imports
* log_softmax to model output in test to fix precision flakiness
* log_softmax to model output in test to fix precision flakiness
* oops, don't realize here
* is None
* realize initializations in order for determinism
* BENCHMARK flag for number of steps
* add resnet to benchmark.yml
* return instead of break
* missing return
* cpu_count, rearrange benchmark.yml
* unused variable
* disable tqdm if BENCHMARK
* getenv WARMUP_EPOCHS
* unlink disktensor shm file if exists
* terminate instead of join
* properly shut down queues
* use hip in benchmark for now
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
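A minimal sketch of a polynomial-decay schedule with linear warmup, in the spirit of the PolynomialDecayWithWarmup commits above. The signature, the `power=2.0` default, and the reading of "cap warmup" as clamping warmup to the run length are all assumptions.

```python
def poly_lr_with_warmup(step, base_lr, end_lr, warmup_steps, total_steps, power=2.0):
    """Linear warmup to base_lr, then polynomial decay to end_lr."""
    warmup_steps = min(warmup_steps, total_steps)  # assumed meaning of capping warmup
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr
```

For example, with `warmup_steps=4` and `total_steps=10` the rate climbs to `base_lr` by step 3 and reaches `end_lr` at step 10.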
prepared bfloat16 change. added float() and cast(default_float) in whitening, explicitly set dtype in various places that convert between numpy and Tensor
* examples/stable_diffusion: support model checkpoints without alphas_cumprod key
(which is most models on civitai)
* fix indent
---------
Co-authored-by: a <a@a.aa>
* working PolynomialDecayWithWarmup + tests.......
add lars_util.py, oops
* keep lars_util.py as intact as possible, simplify our interface
* whitespace
* clean up
* clean up
* asserts
* test polylr for full resnet training run
* add comment
* rename
* fix do_optim
* don't cast lr
* info
* calculate from train_files
* skip it
* lars optimizer + tests
* fix skip list!
* use id to compare in skip list
* go back to using set
* Tensor(bool) * Tensor(bool) is and
* don't lint external/mlperf_resnet
* whitespace
* add external_test_optim to opencl tests
* give mlperf task a name
* mlperf under onnx
* remove track_gnorm
* contiguous instead of realize
* assert momentum and weight decay positive
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
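A simplified sketch of a LARS step with a skip list, written for a single parameter stored as a list of floats. Momentum is omitted and `eta=0.001` is an illustrative default. This follows the usual LARS trust-ratio formula and is not a copy of the optimizer added here.

```python
import math

def lars_step(w, g, lr, weight_decay=0.0, eta=0.001, skip=False):
    """One momentum-free LARS update for a single parameter."""
    if weight_decay < 0:
        raise ValueError("weight decay must be non-negative")
    if skip:
        # Skip-list parameters: plain SGD, no weight decay and no layer-wise scaling.
        return [wi - lr * gi for wi, gi in zip(w, g)]
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    if w_norm > 0 and g_norm > 0:
        trust = eta * w_norm / (g_norm + weight_decay * w_norm)
    else:
        trust = 1.0
    return [wi - lr * trust * (gi + weight_decay * wi) for wi, gi in zip(w, g)]
```

Parameters on the skip list (typically biases and batchnorm parameters) bypass both the trust ratio and weight decay.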
* allow LB <- MLB assign, but don't reuse buffer
* update test
* update test
* assign assert axes are the same
* update tests to manually shard running stats
* unused import
* UnsyncedBatchNorm with synced trainable weights for hlb cifar
* multitensor reshape tests
* test mlb assign change axis
* E501
* argfix axis
* don't import batchnorm from hlb_cifar in test_multitensor
* pass num_devices to UnsyncedBatchNorm in test, allow UnsyncedBatchNorm to be used with LB
* add backprop test for UnsyncedBatchNorm
* break out MLB assign and reshape changes
* manually shard running mean and running var
* don't shard unless syncbn=0
* replace nn.BatchNorm2d with UnsyncedBatchNorm
* don't increment num_batches_tracked if not tracking running stats
* update tests
* oops
* Revert "oops"
This reverts commit 5e8a67a535abea2ff288b1b804a9aa95eba40732.
* Revert "update tests"
This reverts commit 7ebf65d89ace1d3a32c3b28ee323ddee253262d6.
* Revert "don't increment num_batches_tracked if not tracking running stats"
This reverts commit 78de0ea9ee8cbd65dce28bd4abcc131c98451aa2.
* Revert "replace nn.BatchNorm2d with UnsyncedBatchNorm"
This reverts commit d03da53da70f009338e95f2b46315ac02a30149a.
* don't increment num_batches_tracked if not tracking running stats
* oops
* test_batchnorm_axis
* compare against torch
* types
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
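A toy sketch of unsynced batchnorm: each device normalizes its own shard with its own batch statistics, while weight and bias are shared. It works on a flat list of scalars for clarity. The function name and signature are hypothetical and running statistics are left out.

```python
from statistics import fmean, pvariance

def unsynced_batchnorm(x, num_devices, weight=1.0, bias=0.0, eps=1e-5):
    """Normalize each device's shard with its own mean and variance."""
    if len(x) % num_devices != 0:
        raise ValueError("batch must split evenly across devices")
    shard = len(x) // num_devices
    out = []
    for d in range(num_devices):
        chunk = x[d * shard:(d + 1) * shard]
        mean, var = fmean(chunk), pvariance(chunk)
        out.extend(weight * (v - mean) / (var + eps) ** 0.5 + bias for v in chunk)
    return out
```

No cross-device reduction is needed in the forward pass, which is the point of keeping the statistics unsynced.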
* shrink MLB on sharded axis
use onehot structure to store the real partition. goal is unsynced batchnorm2d that can be run on multigpu for training.
draft version in https://github.com/chenyuxyz/tinygrad/pull/109
* SYNCBN flag
* test unclean shrinks
* UnsyncedBatchNorm reuses BatchNorm
* more robust pad arg check
* better types
* more tests!
* 6 gpus in benchmark
* disable slow GPUS=6 benchmark
* shard llama
* sharding works
* simpler
* simpler
* consume option
* disable that test
* save a line
---------
Co-authored-by: George Hotz <george@tinygrad.org>
* initial multitensor jit support and tests
* Added graphs to multitensor jit and updated tests
* update unbind api
* fix set device, add TinyJit to resnet
* update_stats includes device
---------
Co-authored-by: ramenguy99 <ramenguy99@gmail.com>
* WebGL WIP
* 84% of ops passing test
* tests passing 100%
* Cleanup, refactor
* Shave off some lines
* Work on dtypes
* TestOps at 100% again
* Efficient net shaders compile in browser webgl2
* Compile all efficientnet shaders in browser
* Create empty textures for tensor buffers
* Run program. Up next weight loading
* Exported WebGL model working
* Add tests, refactor
* Explicit cast alu for GLSL
* Fix CI tests
* WebGL efficientnet demo
* Compile and run yolov8 in browser
* Fix imports
* Simplify yolo compile
* Fix bool*bool and cast cmplt to float
* More tests
* Do std tests pass on CI?
* Skip std tests on CI
* Remove explicit_cast_alu hack, and solve it in code_for_op
* Move to new dtype-less alloc api
* Remove local size hack: optimize local_size only if device has local
* Remove glsl.py, and move content to cstyle
* dont_use_locals in opts
* Fix dtype tests
* type_map in CStyleLanguage
* Make core changes smaller, cleaner, refactor export_model and demo
* Skip pad_slice
* Simplify: render_const, render_conditional
* solve bool alu for other binops, cleaner ops_webgl
* Fix noopt hack
* Remove some skipIfs
* WebGL image hack
* type_names is a better name
* global_max
* Fix dtype import
* Fix type_names -> type_map
* Fix lint
* Remove webgpu, back to 5k lines (#3040)
* remove webgpu
* max 5000 lines
* revert those to master
* retain that cstyle
---------
Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
the compiler error was due to `error: call to 'max' is ambiguous` when we have max(int, float) in kernel.
it was first fixed in 4380ccb1 the non fp32 math PR, and further solidified with dtype refactor
* lazy rewrite, try 2
* min fix tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7
* half is broken
* 1e-6 margin of error
* validate stable diffusion for seed 0
the closest false positive i can get is with the setup and one less step. dist = 0.0036
same setup with fp16 has dist=5e-6.
so setting validation threshold to 1e-4 should be good
* run with --seed 0
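A small sketch of the threshold reasoning above. The three numbers come from the commit message, while `mean_sq_dist` is an assumed metric, since the message only calls it "dist".

```python
def mean_sq_dist(a, b):
    """Mean squared difference between two equally sized flat images (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("images must have the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

THRESHOLD = 1e-4              # validation threshold from the commit message
FP16_DIST = 5e-6              # same setup in fp16: should pass
FALSE_POSITIVE_DIST = 0.0036  # one fewer step: should fail

def validate(dist, threshold=THRESHOLD):
    """A run validates when its distance to the reference is under the threshold."""
    return dist < threshold
```

The threshold sits more than an order of magnitude away from both measured distances, which leaves margin on each side.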
The yolov3 example is broken with the current implementation of fetch in the helpers. I was tempted to fix the helpers instead, but that could just as well have broken other examples.
* feat: working voice 2 text using whisper
* feat: added llama generation
* feat: vits init
* feat: more accurate voice conversion
* feat: support for tts and working pipeline for the first pass
* fix: linter checks
* refactored vits initialization and inference, added mmts-tts support
* fixed process sync and now we can have an infinite conversation
* reuse output stream to remove overhead of creating a new one each time
* added pre-prompt configuration with yaml files
* adjusted code to merge PR which changed whisper
* optimized whisper, now it's blazing fast and also reduced number of lines
* added better debug printing
* use jitted encode function for whisper, added timings and removed response delim to save speed on generating those tokens
* fixed hf convert and now it's working with tinyllama
* added tinyllama config
* refactored code and made it work with all llama models
* prettier order
* prettier order
* fixed suffix for tinyllama and refactored convert_from_hf
* added missing parameters
* fixed stream release and added missing params
* jitted dp and encoder
* jitted flow forward
* removed re-init of espeak on each call to save up time
* jitted generator forward for blazing fast tts
* added contextmanager for displaying a chat log
* removed whitespace for pylint
* updated code to support latest fetch func
* wait for llama eos token and pass params from cli to llama
* listen for a non-fixed amount of time
* refactored code a bit
* removed thresholding and now the output streams directly to whisper
* tokenize llama output for vits batch size to work and stream each sentence to a speaker
* changed speaker
* whisper is now printing on the same line
* don't trigger llama on whisper output in parens
* added tinyllama chat model
* adjusted code to work with tinyllama chat model
* removed unused cli arg
* autofetch tokenizer and tinyllama model. add 3 chat tokens to the tokenizer
* fixed issue with long sentences by chunking them
* support for multiline llama output
* prettified log output
* adjusted sentence length
* remove quote from response to avoid funny tts
* fixed prompts
* added missing parameter
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* torch and numpy don't share ops anymore
* that should be filtered out elsewhere
* still const
* graph + enet example cleanup
* hmm, we do still need it because of symbolic
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* i love the walrus so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
* need env.
* ugh, wrong path
* replace get_child
* fixed hf convert and now it's working with tinyllama
* added tinyllama config
* refactored code and made it work with all llama models
* prettier order
* prettier order
* fixed suffix for tinyllama and refactored convert_from_hf
* dynamically update help if MODEL_PARAMS changes and default size is the 1st
* beautiful mnist
* beautiful mnist example
* from tinygrad import Tensor
* more beautiful
* the jit is super core tinygrad
* globalcounters reset on jit run
* symlinks and exclude
* beautiful_cartpole
* evaluate is its own function
* no symlinks
* more beautiful
* jit reset for double speed
* type hinting for JIT
* beautiful_mnist gets 98%
* beautiful_mnist < 4s with BEAM=2
* better cartpole
* use actor critic
* zero_grad got lost
* delete double relu
* stable cartpole with PPO
* beautiful_cartpole is more beautiful
* REPLAY_BUFFER
* beautiful stuff typechecks
* None support in shape
* hp tuning
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* WIP: Stable diffusion WebGPU port
* Load whole model: split safetensor to avoid Chrome allocation limit
* Gitignore .DS_Store, remove debug print
* Clip tokenizer in JS
* WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS
* e2e stable diffusion flow
* Create initial random latent tensor in JS
* SD working e2e
* Log if some weights were not loaded properly
* Remove latent_tensor.npy used for debugging
* Cleanup, remove useless logs
* Improve UI
* Add progress bar
* Remove .npy files used for debugging
* Add clip tokenizer as external dependency
* Remove alphas_cumprod.js and load it from safetensors
* Refactor
* Simplify a lot
* Dedup base when limiting elementwise merge (webgpu)
* Add return type to safe_load_metadata
* Do not allow run when webgpu is not supported
* Add progress bar, refactor, fix special names
* Add option to choose from local vs huggingface weights
* lowercase tinygrad :)
* fp16 model dl, decompression client side
* Cache f16 model in browser, better progress
* Cache miss recovery
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* stable diffusion < 324ms
* revert swap action
* fix tests due to more sum splitting
* REDUCEOP_SPLIT_THRESHOLD env var
* added from unaligned np test (#2134)
* align cpu buffer before copy into cl buffer (#2135)
* remove shelve from handcode_resnet50_opt.py (#2139)
* Add dictionary keys to reduce db size (#2131)
* work
* ignore beam cache
* dictionary keys are generic
* minor db cleanups
* fix baseline and extract dataset
* fix training
* log likelihood
* more lin to feats
* sts
* training policynet
* net sort of works
* dedup
* refactor, stupid new actions
* fix uops deduping
* BEAM_ESTIMATE
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
* feat: move to hip
* feat: special path for RawBufferTransfer
* feat: initial rawbuffertransfer
* feat: hip ipc
* feat: working hip ipc
* feat: need to base device without args
* feat: close mem handle
* feat: modified test
* feat: more multihip stuff
* clean: cleanup
* feat: cleaner
* feat: don't crash
* feat: test more
* clean: way cleaner hip wrapper
* feat: barrier
* feat: barrier
* feat: this breaks stuff
* feat: we can use empty here
* feat: maybe fix tests
* feat: maybe fix tests again?
* fix: probably fix tests
* feat: no waiting here
* feat: wait here
* feat: much larger test
* feat: need to sync here
* feat: make this async
* feat: no waiting!
* feat: cut here
* feat: sync copy
* feat: random imports
* feat: much cleaner world
* feat: restore this
* feat: restore this
* clean: cleanup
* feat: set this
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features