Commit Graph

145 Commits

Author SHA1 Message Date
chenyu b5546912e2
10% more TRAIN_STEPS for bert (#6971)
got two very close runs; adding more steps as a buffer
2024-10-09 19:21:43 -04:00
chenyu 35cf48659b
limit beam param for bert on green (#6966)
seems to mitigate the crash
2024-10-09 11:48:18 -04:00
chenyu 1ff2c98f8a
fix logfile name for bert red (#6952) 2024-10-08 05:37:52 -04:00
chenyu a78c96273a
update bert epoch logging (#6940)
* update bert epoch logging

epoch for bert is simply the number of examples seen (which is what the RCP check uses); see the sketch after this commit

* update total steps too

* more changes
2024-10-08 00:34:06 -04:00
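A rough illustration of that convention, assuming a global batch size; the constant below is a placeholder, not the script's actual value.

```python
# MLPerf BERT reports "epoch" as cumulative training samples seen, which is
# also what the RCP (reference convergence point) check consumes.
GBS = 66 * 6  # assumed global batch size: per-GPU batch x number of GPUs

def epoch_num(step: int) -> int:
  return step * GBS
```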
chenyu 102dfe5510
back to 2**10 for bert loss scaler (#6934)
got two NaNs with this; reverting back to 2**10 (loss scaling is sketched below)
2024-10-07 10:17:21 -04:00
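For context, a minimal sketch of the static loss-scaling pattern these commits tune, assuming a tinygrad-style optimizer with a `params` list; the model and step structure are stand-ins, not the actual training script. Too large a scaler overflows half-precision values into NaN, which is why 2**13 was walked back to 2**10.

```python
from tinygrad import Tensor

LOSS_SCALER = 2**10  # the value this commit reverts to

def train_step(model, opt, x: Tensor, y: Tensor) -> Tensor:
  opt.zero_grad()
  loss = model(x).sparse_categorical_crossentropy(y)
  (loss * LOSS_SCALER).backward()  # scale up so fp16 grads don't underflow
  for p in opt.params:
    p.grad = p.grad / LOSS_SCALER  # unscale before the optimizer step
  opt.step()
  return loss
```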
chenyu 0cf815a93a
bert use BS=66 and update hparams (#6932)
with the dropout memory improvement, we can fit BS=66 now. also reverted back to the hparams in #5891
2024-10-07 05:08:27 -04:00
chenyu 718b959349
log epoch start and stop for bert (#6912) 2024-10-06 06:39:46 -04:00
chenyu 16c1fa4208
use BEAM=3 for red box bert runs (#6904)
BEAM=4 slightly exceeded the 30-minute setup window
2024-10-05 09:21:12 -04:00
chenyu 0e706227a2
add seed to bert result log filename (#6903)
* add seed to bert result log filename

* different name for different benchmark
2024-10-05 09:15:24 -04:00
chenyu 7391376528
update bert hparams (#6876)
4h32m with this https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview.

loss scaler 2**13 -> 2**10; this matched the closest submission, with no NaN for ~10 runs.

increased lr and total steps a bit.

`PARALLEL=0` after setup, same as resnet.
2024-10-04 00:39:06 -04:00
chenyu 5f77217772
bert default CKPT to 0 (#6840)
not required
2024-10-01 21:55:56 -04:00
chenyu f59517754e
add RESET_STEP in bert to control reset (#6818)
same as resnet
2024-09-30 09:39:04 -04:00
chenyu 494b20e886
bert BS back to 54 (#6791)
BS=60 does not run end to end
2024-09-27 22:16:05 -04:00
chenyu 572d77d1d9
bert script delete eval data after eval (#6790)
fits BS=60, which is 2% faster than 54. also fixed wandb logging params
2024-09-27 20:54:00 -04:00
chenyu f9c8e144ff
chmod +x mlperf bert script for red (#6789)
also disabled raising the power cap in setup. wozeparrot mentioned that's unstable and might cause bert training issues on red
2024-09-27 11:27:32 -04:00
Francis Lata d3a387be63
[MLPerf] Prepare openimages dataset script (#6747)
* prepare openimages for MLPerf

* cleanup

* fix issue when clearing jit_cache on retinanet eval

* revert pandas specific changes
2024-09-27 11:13:56 -04:00
chenyu bea7ed5986
add RUNMLPERF=1 to bert dev_run.sh (#6775)
already set in run_and_time.sh; dev_run.sh needs RUNMLPERF=1 for it to load real data
2024-09-26 11:00:49 -04:00
chenyu 12de203a43
add IGNORE_JIT_FIRST_BEAM to bert scripts (#6769)
* update bert BEAM params

copied from resnet to start with

* just IGNORE_JIT_FIRST_BEAM
2024-09-26 05:38:24 -04:00
chenyu 5a5fbfa1eb
smaller bert script change (#6768)
only reorders WANDB and RUNMLPERF. BENCHMARK and BEAM will be handled differently
2024-09-26 04:54:28 -04:00
chenyu 396c96357b
update mlperf bert scripts (#6755)
removed DISABLE_DROPOUT=1.
updated BS to 54, which works on tinyboxes with dropout enabled.
used bert's sparse_categorical_crossentropy, which takes a Tensor ignore_index, in the accuracy method (sketched below)
2024-09-25 23:55:05 -04:00
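A hedged sketch of that ignore_index idea in the accuracy path: masked-LM labels mark non-masked positions with a sentinel value, and both loss and accuracy must skip them. This mirrors the idea only; the exact method signature in the commit may differ.

```python
from tinygrad import Tensor

def masked_accuracy(logits: Tensor, labels: Tensor, ignore_index: int = -1) -> Tensor:
  # logits: (N, S, V); labels: (N, S) with ignore_index at positions to skip
  preds = logits.argmax(axis=-1)
  valid = (labels != ignore_index).float()  # 1.0 where the label counts
  return ((preds == labels).float() * valid).sum() / valid.sum()
```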
Anurag Lamsal 568757e087
fix model_eval.py in the mlperf folder searching for the bert vocab in the wrong directory (#6649) 2024-09-24 11:20:44 +08:00
Francis Lata b7ce9a1530
UNet3D MLPerf (#3470)
* add training set transforms

* add DICE cross entropy loss (see the sketch after this commit)

* convert pred and label to Tensor when calculating DICE score

* cleanups and allow train dataset batching

* fix DICE CE loss calculation

* jitted training step

* clean up DICE CE loss calculation

* initial support for sharding

* Revert "initial support for sharding"

This reverts commit e3670813b8a67469e7f694e09f2d15a8c40065da.

* minor updates

* cleanup imports

* add support for sharding

* apply temp patch to try to avoid OOM

* revert cstyle changes

* add gradient acc

* hotfix

* add FP16 support

* add ability to train on smaller image sizes

* add support for saving and loading checkpoints + clean up various modes

* fix issue with using smaller patch size + update W&B logging

* disable LR_WARMUP_EPOCHS

* updates

* minor cleanups

* cleanup

* update order of transformations

* more cleanups

* realize loss

* cleanup

* more cleanup

* some cleanups

* add RAM usage

* minor cleanups

* add support for gradient accumulation

* cleanup imports

* minor updates to not use GA_STEPS

* remove FP16 option since it's available now globally

* update multi-GPU setup

* add timing logs for training loop

* go back to using existing dataloader and add ability to preprocess data to save time

* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation

* free train and eval steps memory

* cleanups and scale batch size based on the number of GPUs

* fix GlobalCounters import

* fix seed

* fix W&B setup

* update batch size default size

* add back metric divergence check

* put back JIT on UNet3d eval

* move dataset preprocessing inside training code

* add test for dice_loss

* add config logging support to W&B and other cleanups

* change how default float is getting retrieved

* remove TinyJit import duplicate

* update config logging to W&B and remove JIT on eval_step

* no need for caching preprocessed data anymore

* fix how evaluation is run and how often

* add support for LR scaling

* fix issue with gaussian being moved to scipy.signal.windows

* remove DICE loss unit test

* fix issue where loss isn't compatible with multiGPU

* add individual BEAM control for train and eval steps

* fix ndimage scipy import

* add BENCHMARK

* cleanups on BENCHMARK + fix on rand_flip augmentation during training

* cleanup train and eval BEAM envs

* add checkpointing support after every eval

* cleanup model_eval

* disable grad during eval

* use new preprocessing dataset mechanism

* remove unused import

* use training and inference_mode contexts

* start eval after benchmarking

* add data fetching time

* cleanup decorators

* more cleanups on training script

* add message during benchmarking mode

* realize when reassigning LR on scheduler and update default number of epochs

* add JIT on eval step

* remove JIT on eval_step

* add train dataloader for unet3d

* move checkpointing to be done after every epoch

* revert removal of JIT on unet3d inference

* save checkpoint if metric is not successful

* Revert "add train dataloader for unet3d"

This reverts commit c166d129dfbe2e1c46d1937135a60b4ed25caa3d.

* Revert "Revert "add train dataloader for unet3d""

This reverts commit 36366c65d26f59ed1227acb670d5ce7b997606ae.

* hotfix: seed was defaulting to a value of 0

* fix SEED value

* remove the usage of context managers for setting BEAM and going from training to inference

* support new stack API for calculating eval loss and metric

* Revert "remove the usage of context managers for setting BEAM and going from training to inference"

This reverts commit 2c0ba8d322ec912bd8617cbe167c542e9ba229d9.

* check training and test preprocessed folders separately

* clean up imports and log FUSE_CONV_BW

* use train and val preprocessing constants

* add kits19 dataset setup script

* update to use the new test decorator for disabling grad

* update kits19 dataset setup script

* add docs on how to train the model

* set default value for BASEDIR

* add detailed instruction about BASEDIR usage

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-10 04:37:28 -04:00
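The DICE cross-entropy loss added at the start of this PR combines a soft DICE term with cross entropy. Below is a self-contained sketch of the usual MLPerf UNet3D formulation; the equal averaging of the two terms and the epsilon are assumptions, not the PR's exact code.

```python
from tinygrad import Tensor

def dice_ce_loss(logits: Tensor, label: Tensor, n_classes: int = 3, eps: float = 1e-6) -> Tensor:
  # logits: (N, C, D, H, W) raw scores; label: (N, D, H, W) integer class ids
  probs = logits.softmax(axis=1)
  classes = Tensor.arange(n_classes).reshape(1, n_classes, 1, 1, 1)
  onehot = (label.unsqueeze(1) == classes).float()                # (N, C, D, H, W)
  spatial = (2, 3, 4)
  inter = (probs * onehot).sum(axis=spatial)
  union = probs.sum(axis=spatial) + onehot.sum(axis=spatial)
  dice = 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()       # soft DICE loss
  ce = -(onehot * probs.clip(eps, 1.0).log()).sum(axis=1).mean()  # cross entropy
  return (dice + ce) / 2
```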
Elias Wahl c9b4602854
no load in INITMLPERF (#5957) 2024-08-08 11:28:24 -04:00
Elias Wahl c9862e17d4
MLPERF BERT submission scripts (#5931)
* green

* red

* fix benchmark

* log

* count train samples

* oops. 4.0 -> 4.1

* note to todo

* no pillow
2024-08-06 18:09:18 -04:00
chenyu 1dab75ae37
clean up mlperf dataloader import (#5940)
use tinygrad's tqdm for the dataset; PIL Image is only needed for resnet
2024-08-06 17:10:08 -04:00
Elias Wahl 937bf5fe12
better hparam (#5891) 2024-08-03 12:38:53 -04:00
Elias Wahl 4a114756f6
New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
Francis Lata a0baff7a3d
update dataloader script example (#5818) 2024-07-30 15:18:29 -04:00
chenyu d71308ed68
copy mlperf 4.0 to mlperf 4.1 (#5614) 2024-07-20 16:12:00 -04:00
Francis Lata 0345577032
UNet3D dataloader shared memory fix (#5465)
* create separate SharedMemory between inputs and labels

* update path check for shared mem

* clean up unit test for dataset
2024-07-13 20:26:00 -04:00
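The fix here keeps inputs and labels in separate SharedMemory blocks instead of one. A minimal stdlib sketch of that layout (names and shapes are illustrative):

```python
import numpy as np
from multiprocessing import shared_memory

def make_shared(name: str, shape: tuple, dtype):
  # one named block per array so input writers can never stomp on label bytes
  nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
  shm = shared_memory.SharedMemory(name=name, create=True, size=nbytes)
  return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)

x_shm, X = make_shared("unet3d_x", (4, 1, 128, 128, 128), np.float32)
y_shm, Y = make_shared("unet3d_y", (4, 128, 128, 128), np.uint8)
# ... workers attach by name and fill X/Y; on teardown:
for shm in (x_shm, y_shm):
  shm.close(); shm.unlink()
```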
Elias Wahl 73bddc44f6
Fix fake dataloader (#5326) 2024-07-08 09:07:44 -04:00
chenyu 43c3f73fbc
handcode_bert_opt.py (#5295)
similar to handcode_resnet50_opt.py: a single file to check bert kernels without a dataset.
2024-07-05 11:01:20 -04:00
Elias Wahl e267f3161d
Add MLLogger (#5125)
* add MLPerf logger

* eval steps

* start with step 1

* compliance for 3.1.0 and 4.0.0

* more compliance

* assert, comment and contiguous
2024-06-26 12:23:56 -04:00
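For reference, the mlperf_logging package this PR builds on is used roughly like this; the keys shown are standard constants, but the exact set of events the PR emits (and its 3.1.0/4.0.0 compliance details) may differ.

```python
from mlperf_logging import mllog

mllog.config(filename="bert.log")  # where compliance events are written
logger = mllog.get_mllogger()
logger.event(key=mllog.constants.SUBMISSION_BENCHMARK, value="bert")
logger.start(key=mllog.constants.RUN_START)
# ... training loop, eval events, etc. ...
logger.end(key=mllog.constants.RUN_STOP, metadata={"status": "success"})
```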
Elias Wahl f31ef11537
Better default hparams for large BS (#5030)
* better default hparams for large BS

* bf16 too

* use tuple
2024-06-18 11:13:06 -04:00
Elias Wahl 7bfa9101c0
Float in scaled dot product attention (#4985)
* Monkeypatch scaled-dot-product-attention

* Use dot instead of matmul

* new api

* imports

* least_upper_dtype
2024-06-18 08:16:41 -04:00
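A hedged sketch of the idea behind this change: keep q/k/v in half precision but run the numerically sensitive softmax in float32, casting back afterwards. This is the general pattern, not the exact monkeypatch.

```python
import math
from tinygrad import Tensor, dtypes

def attention_float_softmax(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
  # scores computed in the inputs' dtype; softmax upcast to float32 for stability
  scores = q @ k.transpose(-2, -1) * (1.0 / math.sqrt(q.shape[-1]))
  return scores.cast(dtypes.float32).softmax(-1).cast(q.dtype) @ v
```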
Elias Wahl d2e3c391e8
Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
Nik 085c0bbf6b
add mlperf train subset of openimages (#4841) 2024-06-05 10:10:11 -04:00
Elias Wahl e576aca044
Disable dropout (#4837) 2024-06-04 18:57:26 -04:00
Elias Wahl bb248a0dd1
Optional half matmul (#4835)
* half linear

* move weight cast back

* oops

* matmul dtype var

* todo comment
2024-06-04 17:53:41 -04:00
Elias Wahl 04e237328b
Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
Francis Lata 707099487a
Multiprocessing UNet3D dataloader (#4801)
* testing dataloader

* matching dataloader implementation for unet3d

* remove comments

* clean up dataloader

* add cookie and cleanup

* use shm_path when creating SharedMemory

* add support for testing resnet and unet3d dataloaders

* update dataset test to return preprocessed data directory in prep for dataloader testing

* pass preprocessed dataset directory properly

* update loader function for dataloader

* add shuffling on indices

* update shm name

* more cleanup for unet3d dataloader

* remove changes to tests

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-02 11:30:47 -04:00
Elias Wahl c4b0acf095
Global norm + small changes (#4749)
* norm

* no empty

* default loss scaler in float
2024-05-27 18:35:27 -04:00
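The "global norm" here is a single L2 norm taken over every parameter's gradient at once, typically for clipping or NaN detection. A sketch, assuming grads have already been unscaled:

```python
from tinygrad import Tensor

def global_grad_norm(params: list[Tensor]) -> Tensor:
  # sum of squared grads across all parameters, then one square root
  return sum((p.grad.float() * p.grad.float()).sum() for p in params).sqrt()
```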
Elias Wahl acc0039cfc
Resume fix + scheduler for non weight decay params (#4679)
* move ckpt dir

* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
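A sketch of the scheduler-group idea: bias and normalization parameters conventionally get zero weight decay, and each group gets its own optimizer under one umbrella. The name-matching rule and the LAMB/OptimizerGroup usage are assumptions about the setup, not the PR's exact code.

```python
from tinygrad.nn.state import get_state_dict
from tinygrad.nn.optim import LAMB, OptimizerGroup

def build_optimizer(model, lr: float = 1e-4, wd: float = 0.01) -> OptimizerGroup:
  decay, no_decay = [], []
  for name, p in get_state_dict(model).items():
    (no_decay if "bias" in name or "norm" in name.lower() else decay).append(p)
  # one scheduler can then drive both groups' lr in lockstep
  return OptimizerGroup(LAMB(decay, lr=lr, weight_decay=wd),
                        LAMB(no_decay, lr=lr, weight_decay=0.0))
```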
Elias Wahl 993091adfa
loss scaler + nan fixes (#4661) 2024-05-20 17:08:35 -04:00
chenyu bed70b130c
mlperf bert getenv-able EVAL_STEP_FREQ (#4534) 2024-05-11 14:36:56 -04:00
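This follows tinygrad's usual getenv pattern for runtime knobs; the default below is a placeholder, not the script's real value.

```python
from tinygrad.helpers import getenv

EVAL_STEP_FREQ = getenv("EVAL_STEP_FREQ", 1000)  # eval every N steps; override via env
```

Running with `EVAL_STEP_FREQ=500 python3 <train script>` then changes the cadence without editing code.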
chenyu 04a4980a51
touchup bert script (#4531)
small adjustments: remove a duplicated training setting and stop the script once the target is hit
2024-05-11 13:02:02 -04:00
chenyu b00b6b16f0
fix TRAIN_BEAM and Tensor.training for mlperf bert (#4525)
also hard-coded the bert model config instead of looking it up from a file
2024-05-11 00:18:36 -04:00
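The Tensor.training part of this fix matters because dropout keys off that global flag; a minimal sketch (the TRAIN_BEAM plumbing is omitted):

```python
from tinygrad import Tensor

x = Tensor.randn(4, 8)
Tensor.training = True    # dropout is active during training steps
y_train = x.dropout(0.1)
Tensor.training = False   # dropout becomes a no-op during eval
y_eval = x.dropout(0.1)
```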
chenyu b399d98e41
fix resnet eval (#4507) 2024-05-10 00:49:00 -04:00
wozeparrot a602dc67d3
feat: more mlperf fixes (#4505) 2024-05-09 20:50:20 -07:00
chenyu 0e8aa0e288
use fake data in beam searching resnet (#4504) 2024-05-09 23:43:50 -04:00