* switch symbolic from old to uops, final PR
* two wrong answers
* remove resolves that are not needed
* symbolic ops passes
* symbolic ops passes
* progress
* tests pass (almost)
* fix last test
* fix some tests
* global binding and unbinding
* Revert "global binding and unbinding"
This reverts commit 9456725630316487509980af20c6d2981de00bec.
* that test works now
* vars on UOp doesn't recurse
* fix fuzzer
* update
* fix type
* fix gpt, it's UOp now
* ssimplify symbolics
removed DISABLE_DROPOUT=1.
updated BS to 54, which works on tinyboxes with dropout.
used BERT's sparse_categorical_crossentropy, which takes a Tensor ignore_index, in the accuracy method
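A minimal sketch of what an accuracy that respects an ignore_index looks like (illustrative names and shapes, not the exact BERT helper):

```python
import numpy as np

def masked_accuracy(logits: np.ndarray, labels: np.ndarray, ignore_index: int = -1) -> float:
  # logits: (N, C), labels: (N,); positions equal to ignore_index are excluded
  preds = logits.argmax(axis=-1)
  mask = labels != ignore_index
  return float((preds[mask] == labels[mask]).mean())
```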
* add training set transforms
* add DICE cross entropy loss
* convert pred and label to Tensor when calculating DICE score
* cleanups and allow train dataset batching
* fix DICE CE loss calculation
* jitted training step
* clean up DICE CE loss calculation
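For reference, a minimal numpy sketch of the usual Dice + cross-entropy formulation (a generic version under standard assumptions, not necessarily the exact calculation this PR converged on):

```python
import numpy as np

def softmax(x, axis=1):
  e = np.exp(x - x.max(axis=axis, keepdims=True))
  return e / e.sum(axis=axis, keepdims=True)

def dice_ce_loss(logits, target_onehot, smooth=1e-6):
  # logits, target_onehot: (N, C, D, H, W)
  probs = softmax(logits, axis=1)
  axes = (0, 2, 3, 4)                                   # reduce over batch and spatial dims
  intersection = (probs * target_onehot).sum(axis=axes)
  denom = probs.sum(axis=axes) + target_onehot.sum(axis=axes)
  dice_loss = 1.0 - ((2 * intersection + smooth) / (denom + smooth)).mean()
  ce_loss = -(target_onehot * np.log(probs + 1e-9)).sum(axis=1).mean()
  return dice_loss + ce_loss
```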
* initial support for sharding
* Revert "initial support for sharding"
This reverts commit e3670813b8a67469e7f694e09f2d15a8c40065da.
* minor updates
* cleanup imports
* add support for sharding
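Roughly what the sharding looks like, assuming tinygrad's multi-device shard API (the Linear is a stand-in for UNet3D; device names and counts are illustrative):

```python
from tinygrad import Tensor, Device, nn
from tinygrad.nn.state import get_parameters

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

model = nn.Linear(16, 4)                          # stand-in for the real model
for p in get_parameters(model): p.shard_(GPUS)    # replicate weights on every device
x = Tensor.rand(8, 16).shard(GPUS, axis=0)        # split the batch along axis 0
out = model(x)
```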
* apply temp patch to try to avoid OOM
* revert cstyle changes
* add gradient acc
* hotfix
* add FP16 support
* add ability to train on smaller image sizes
* add support for saving and loading checkpoints + clean up various modes
* fix issue with using smaller patch size + update W&B logging
* disable LR_WARMUP_EPOCHS
* updates
* minor cleanups
* cleanup
* update order of transformations
* more cleanups
* realize loss
* cleanup
* more cleanup
* some cleanups
* add RAM usage
* minor cleanups
* add support for gradient accumulation
* cleanup imports
* minor updates to not use GA_STEPS
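The gradient accumulation pattern, as a generic tinygrad-flavored sketch (stand-in model and loop; ACCUM_STEPS here is illustrative, not the old GA_STEPS env var):

```python
from tinygrad import Tensor, nn
from tinygrad.nn.optim import SGD
from tinygrad.nn.state import get_parameters

model = nn.Linear(16, 2)                          # stand-in model
opt = SGD(get_parameters(model), lr=0.1)
ACCUM_STEPS = 4                                   # micro-batches per optimizer step

with Tensor.train():
  opt.zero_grad()
  for i in range(8):
    x, y = Tensor.rand(4, 16), Tensor.randint(4, high=2)
    # scale each micro-batch loss so the accumulated gradient matches one big batch
    loss = model(x).sparse_categorical_crossentropy(y) / ACCUM_STEPS
    loss.backward()                               # grads add up across micro-batches
    if (i + 1) % ACCUM_STEPS == 0:
      opt.step()
      opt.zero_grad()
```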
* remove FP16 option since it's now available globally
* update multi-GPU setup
* add timing logs for training loop
* go back to using existing dataloader and add ability to preprocess data to save time
* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation
* free train and eval steps memory
* cleanups and scale batch size based on the number of GPUs
* fix GlobalCounters import
* fix seed
* fix W&B setup
* update default batch size
* add back metric divergence check
* put back JIT on UNet3d eval
* move dataset preprocessing inside training code
* add test for dice_loss
* add config logging support to W&B and other cleanups
* change how the default float is retrieved
* remove TinyJit import duplicate
* update config logging to W&B and remove JIT on eval_step
* no need for caching preprocessed data anymore
* fix how evaluation is run and how often
* add support for LR scaling
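LR scaling here presumably means the usual linear scaling rule; a tiny illustrative calculation (all numbers hypothetical):

```python
# scale the base learning rate with the global batch size (linear scaling rule)
BASE_LR, BASE_BS = 1e-3, 32
per_gpu_bs, num_gpus = 16, 6
lr = BASE_LR * (per_gpu_bs * num_gpus) / BASE_BS   # 1e-3 * 96 / 32 = 3e-3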
* fix issue with gaussian being moved to scipy.signal.windows
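One way to handle the scipy move (a sketch, not necessarily the exact fix in the PR): `gaussian` now lives in `scipy.signal.windows`, so import it from there and only fall back for older scipy:

```python
try:
  from scipy.signal.windows import gaussian   # newer scipy
except ImportError:
  from scipy.signal import gaussian           # older scipy

kernel = gaussian(128, std=16.0)              # 1-D gaussian window of length 128
```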
* remove DICE loss unit test
* fix issue where loss isn't compatible with multi-GPU
* add individual BEAM control for train and eval steps
* fix ndimage scipy import
* add BENCHMARK
* cleanups on BENCHMARK + fix rand_flip augmentation during training
* cleanup train and eval BEAM envs
* add checkpointing support after every eval
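A minimal sketch of the checkpointing, assuming tinygrad's safetensors state helpers (stand-in model and path):

```python
from tinygrad import nn
from tinygrad.nn.state import get_state_dict, load_state_dict, safe_save, safe_load

model = nn.Linear(16, 4)                                   # stand-in for UNet3D

# save after each eval / epoch
safe_save(get_state_dict(model), "/tmp/unet3d_ckpt.safetensors")

# resume later
load_state_dict(model, safe_load("/tmp/unet3d_ckpt.safetensors"))
```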
* cleanup model_eval
* disable grad during eval
* use new preprocessing dataset mechanism
* remove unused import
* use training and inference_mode contexts
* start eval after benchmarking
* add data fetching time
* cleanup decorators
* more cleanups on training script
* add message during benchmarking mode
* realize when reassigning LR on scheduler and update default number of epochs
* add JIT on eval step
* remove JIT on eval_step
* add train dataloader for unet3d
* move checkpointing to be done after every epoch
* revert removal of JIT on unet3d inference
* save checkpoint if metric is not successful
* Revert "add train dataloader for unet3d"
This reverts commit c166d129dfbe2e1c46d1937135a60b4ed25caa3d.
* Revert "Revert "add train dataloader for unet3d""
This reverts commit 36366c65d26f59ed1227acb670d5ce7b997606ae.
* hotfix: seed was defaulting to a value of 0
* fix SEED value
* remove the usage of context managers for setting BEAM and going from training to inference
* support new stack API for calculating eval loss and metric
* Revert "remove the usage of context managers for setting BEAM and going from training to inference"
This reverts commit 2c0ba8d322ec912bd8617cbe167c542e9ba229d9.
* check training and test preprocessed folders separately
* clean up imports and log FUSE_CONV_BW
* use train and val preprocessing constants
* add kits19 dataset setup script
* update to use the new test decorator for disabling grad
* update kits19 dataset setup script
* add docs on how to train the model
* set default value for BASEDIR
* add detailed instruction about BASEDIR usage
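For illustration, the BASEDIR convention boils down to reading an env var with a default (the path below is only an example):

```python
import os
from pathlib import Path

# where the kits19 data lives; override with BASEDIR=/path/to/kits19/data
BASEDIR = Path(os.environ.get("BASEDIR", "extra/datasets/kits19/data"))
```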
---------
Co-authored-by: chenyu <chenyu@fastmail.com>