Commit Graph

589 Commits

Author SHA1 Message Date
Francis Lam a26090d404
search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
Anurag Lamsal 4e0819e40b
fixing the benchmark not printing in handcode resnet50 opt example (#3850) 2024-03-21 00:55:31 -04:00
chenyu 9d1d08fbb0
show llama bandwidth with timing (#3844) 2024-03-20 17:19:15 -04:00
chenyu dccefab23f
remove mixtral weight to clang first (#3792)
seems fine without it now
2024-03-17 23:33:17 -04:00
chenyu 5ac1fa933f
apply the same fix_bf16 in llama and coder (#3789)
* apply the same fix_bf16 in llama and coder

did not realize the same logic was in llama too.
really fix #2775

* flag for native SUPPORT_BF16 cast
2024-03-17 21:25:24 -04:00
chenyu 639bd5dbfc
move bf16 cast hack to Tensor.llvm_bf16_cast (#3788) 2024-03-17 18:51:22 -04:00
chenyu 9255332d9e
use llvm as bridge to fix_bf16 loading (#3774)
This is how bf16 load is tested in test_bf16_disk_write_read now and it should fix #2775.
I tested that it fixed loading coder using PYTHON backend.

Will separate this special bf16 load v.s. regular bf16 support
2024-03-16 15:22:19 -04:00
chenyu e1c5aa9cce
estimated resnet training time for BENCHMARK (#3769) 2024-03-15 22:36:58 -04:00
chenyu 4bd5535d72
update mlperf resnet default hparams (#3758)
we might be able to have higher lr given smaller BS, but this is good.

Trained to 75.9%
https://wandb.ai/chenyuxyz/tinygrad-examples_mlperf/runs/xi2f48se/overview
2024-03-15 12:09:26 -04:00
George Hotz 641f347232
simple LoadOps.ASSIGN (#3745)
* simple LoadOps.ASSIGN

* skip that test

* don't assign in onnx ops gemm

* track cache usage

* recreate the lazybuffer to avoid the cache

* fix contigs

* skip that test

* lol

* better letters
2024-03-14 20:44:34 -07:00
chenyu 557c7a5c54
fix yolov8.py (#3742)
replaced an `assign` with `replace`, and added '.png' to the output name if the input URL does not contain an extension
2024-03-14 17:33:45 -04:00
George Hotz 3527c5a9d2
add Tensor.replace (#3738)
* add Tensor.replace

* fix dtypes in that test

* should be replace

* and mixtral
2024-03-14 13:34:14 -07:00
David Hou 199f7c4342
MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

less samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf35ca4be6c31facafccdd1e177469f2f.

Revert "log layer stats"

This reverts commit 9d9822458524c514939adeee34b88356cd191cb0.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57dc5fa70169189fd33b24dbbee99c40, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... its hits the acc... its fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't unsyncedbatchnorm if syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to benchmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
chenyu 3d9b882d37
hotfix unlink /dev/shm/resnet_X if it already exists (#3726) 2024-03-13 18:53:03 -04:00
chenyu ad1d873f8d
fix llama shard convo mode (#3716) 2024-03-13 12:07:02 -04:00
qazal 337cd53444
multioutput ScheduleItem (#3699)
* refactor realize.py

* update docs

* update test_sched

* update runners and devices

* update openpilot and unit tests

* cleanup runner lowering

* update more tests
2024-03-13 08:59:38 -07:00
David Hou 2befdf86d9
dataloader worker/shm cleanup (#3710) 2024-03-12 21:44:24 -04:00
chenyu b13457e4a7
explicit dtypes in hlb_cifar (#3707)
prepared bfloat16 change. added float() and cast(default_float) in whitening, explicitly set dtype in various places that convert between numpy and Tensor
2024-03-12 18:20:23 -04:00
qazal aec4c4f01b
linearizer ast as a tuple of lazyops (#3689)
* multi store op linearizer

* currently we do only one output per kernel

* named opts
2024-03-11 15:39:04 -07:00
rnxyfvls 490c5a3ec3
examples/stable_diffusion: support model checkpoints without alphas_cumprod key (#3681)
* examples/stable_diffusion: support model checkpoints without alphas_cumprod key

(which is most models on civitai)

* fix indent

---------

Co-authored-by: a <a@a.aa>
2024-03-11 16:05:52 -04:00
chenyu d69170e27e
add llama 2 70B in ci and verify output (#3682)
* add llama 2 70B in ci and verify output

* ln -s llama2 dir
2024-03-11 12:48:22 -04:00
George Hotz 3415b0ee54 hotfix: mixtral copies norms together for 2% speed 2024-03-11 01:28:03 +00:00
chenyu bad6adaf8c
add mixtral and 6 gpus cifar to tinybox ci (#3676)
* add mixtral and 6 gpus cifar to tinybox ci

* print total ram used at the end of loading
2024-03-10 18:25:31 -04:00
David Hou 9f66dcf718
PolynomialDecayWithWarmup + tests (#3649)
* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* whitespace

* clean up

* clean up

* asserts

* test polylr for full resnet training run

* add comment

* rename

* fix do_optim

* don't cast lr

* info

* calculate from train_files

* skip it
2024-03-07 18:53:36 -05:00
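
Note on the entry above: the log names a `PolynomialDecayWithWarmup` scheduler but does not show it. As a rough sketch only, a schedule of this general shape (linear warmup followed by polynomial decay) can be written as a pure function. The function name `poly_lr_with_warmup`, its parameters, the linear warmup, and the default `power` are all assumptions made for illustration and may differ from what the PR implements.

```python
def poly_lr_with_warmup(step: int, base_lr: float, end_lr: float,
                        warmup_steps: int, total_steps: int,
                        power: float = 2.0) -> float:
    """Hypothetical schedule: linear warmup, then polynomial decay to end_lr."""
    assert total_steps > warmup_steps >= 0, "decay phase must have positive length"
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: fraction of the decay phase completed, capped at 1.0 so the
    # learning rate stays at end_lr once total_steps is passed.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr
```

Capping `progress` at 1.0 is one possible way to bound the schedule; the "cap warmup in polylr" bullet elsewhere in this log suggests the real code bounds something similar, but that is a guess.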
chenyu fcf4a5ccf2
fix example that calls Tensor.__bool__ (#3650)
also removed `.cpu()` calls in mask_rcnn so `python3 examples/mlperf/model_spec.py` runs
2024-03-07 16:59:26 -05:00
David Hou 0afaf70d57
lars optimizer + tests (#3631)
* lars optimizer + tests

* fix skip list!

* use id to compare in skip list

* go back to using set

* Tensor(bool) * Tensor(bool) is and

* don't lint external/mlperf_resnet

* whitespace

* add external_test_optim to opencl tests

* give mlperf task a name

* mlperf under onnx

* remove track_gnorm

* contiguous instead of realize

* assert momentum and weight decay positive

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-06 18:11:01 -05:00
David Hou d16aa89561
don't allow MLB assigns with different axes (#3557)
* allow LB <- MLB assign, but don't reuse buffer

* update test

* update test

* assign assert axes are the same

* update tests to manually shard running stats

* unused import
2024-03-01 07:59:06 -05:00
David Hou e5385eecfc
UnsyncedBatchNorm with synced trainable weights for hlb cifar (#3472)
* UnsyncedBatchNorm with synced trainable weights for hlb cifar

* multitensor reshape tests

* test mlb assign change axis

* E501

* argfix axis

* don't import batchnorm from hlb_cifar in test_multitensor

* pass num_devices to UnsyncedBatchNorm in test, allow UnsyncedBatchNorm to be used with LB

* add backprop test for UnsyncedBatchNorm

* break out MLB assign and reshape changes

* manually shard running mean and running var

* don't shard unless syncbn=0

* replace nn.BatchNorm2d with UnsyncedBatchNorm

* don't increment num_batches_tracked if not tracking running stats

* update tests

* oops

* Revert "oops"

This reverts commit 5e8a67a535abea2ff288b1b804a9aa95eba40732.

* Revert "update tests"

This reverts commit 7ebf65d89ace1d3a32c3b28ee323ddee253262d6.

* Revert "don't increment num_batches_tracked if not tracking running stats"

This reverts commit 78de0ea9ee8cbd65dce28bd4abcc131c98451aa2.

* Revert "replace nn.BatchNorm2d with UnsyncedBatchNorm"

This reverts commit d03da53da70f009338e95f2b46315ac02a30149a.

* don't increment num_batches_tracked if not tracking running stats

* oops

* test_batchnorm_axis

* compare against torch

* types

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-02-29 22:52:07 -05:00
George Hotz 2e60012bcf
move create schedule and delete old API (#3377)
* move create schedule and delete old API

* fix test multitensor
2024-02-12 18:10:45 +01:00
George Hotz 41efaa848c
move graph.py and jit.py into features (#3376)
* move graph.py into features

* move jit into features

* fix quickstart
2024-02-12 17:34:34 +01:00
chenyu d8ad9e5660
verify eval acc for hlb_cifar training (#3344)
set to 93% to reduce flakiness for now
2024-02-07 19:19:59 -05:00
chenyu 18e854cdbf
shrink MLB on sharded axis (#3255)
* shrink MLB on sharded axis

use onehot structure to store the real partition. goal is unsynced batchnorm2d that can be run on multigpu for training.

draft version in https://github.com/chenyuxyz/tinygrad/pull/109

* SYNCBN flag

* test unclean shrinks

* UnsyncedBatchNorm reuses BatchNorm

* more robust pad arg check

* better types

* more tests!

* 6 gpus in benchmark

* disable slow GPUS=6 benchmark
2024-01-31 21:48:25 -05:00
chenyu 77251336d5
fix handcode_resnet50_opt.py (#3289)
linearizer_opts has moved. also update the logging to print after total_tm update
2024-01-31 19:01:08 -05:00
chenyu b0a755288f
cifar EVAL_BS set default value to BS (#3274)
less compile time for eval due to cache. 500 was a slow uneven number for 6 GPU too. eval time 5.9s -> 3.4s
2024-01-29 17:37:12 -05:00
Francis Lata 86748f4a8c
fix bbox format to be a list (#3265) 2024-01-27 17:54:19 -08:00
chenyu 9e5409be6c
cifar move GlobalCounters.reset() before shard (#3217)
* cifar move GlobalCounters.reset() before shard

also shard mini batch inplace

* don't eval with DISABLE_BACKWARD
2024-01-23 16:07:43 -05:00
chenyu 3c179cc27c
cifar only shuffle data at epoch start (#3216)
save 1ms CPU time per batch. also only shuffle training set
2024-01-23 14:41:22 -05:00
chenyu 8465938d29
minor hlb_cifar cleanups (#3208)
mostly cosmetic. LATEBEAM=4 single 7900xtx 59.2 seconds
2024-01-22 12:38:39 -05:00
chenyu 827b7a3c64
cleanup pad_reflect and make_square_mask in hlb_cifar (#3206)
removed some complicated looking stuff. no wall time difference
2024-01-22 11:30:46 -05:00
chenyu 99884f4c98
cifar flags for RANDOM_CROP, RANDOM_FLIP, and CUTMIX (#3204)
experimenting with different setups, also would like to jit the data augmentation next
2024-01-22 01:12:51 -05:00
chenyu 53afec2841
add HALF to handcode_resnet50_opt.py (#3202)
use this to study tensor cores on HIP
2024-01-21 23:03:59 -05:00
chenyu 836883fedc
comment out cutmix in hlb_cifar (#3201)
it's no-op with multi gpu and less STEPS. also the patch was selected from the whole dataset, not from the same batch
2024-01-21 22:24:53 -05:00
George Hotz c80884884e
event driven hip (#3160)
* event driven hip

* simpler, src makes copy

* pass mypy
2024-01-18 14:35:18 -08:00
chenyu e52a609240
make WINO a context var, and LATEWINO in hlb_cifar (#3161) 2024-01-17 20:21:26 -05:00
George Hotz 9cc2577a08
use hip events (#3157)
* use hip events

* cleanup
2024-01-17 10:39:57 -08:00
George Hotz a72b1b6d65
sharding for llama (#3151)
* shard llama

* sharding works

* simpler

* simpler

* consume option

* disable that test

* save a line

---------

Co-authored-by: George Hotz <george@tinygrad.org>
2024-01-16 19:28:00 -08:00
chenyu 589c16756f
hlb_cifar multi gpu training (#3150)
* cifar train with multi gpu

* GPUS=1 is noop
2024-01-16 14:38:45 -05:00
George Hotz 228f30b96a
multitensor jit (#3149)
* initial multitensor jit support and tests

* Added graphs to multitensor jit and updated tests

* update unbind api

* fix set device, add TinyJit to resnet

* update_stats includes device

---------

Co-authored-by: ramenguy99 <ramenguy99@gmail.com>
2024-01-16 09:09:15 -08:00
chenyu b9d470577c
gelu -> quick_gelu in hlb_cifar (#3147)
89 -> 86 seconds, same eval acc
2024-01-16 02:03:37 -05:00
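
Note on the entry above: the log reports only the timing change (89 -> 86 seconds) and the unchanged eval accuracy. For context, here is a minimal sketch of the two activations, assuming the commonly used definition of quick GELU as `x * sigmoid(1.702 * x)` and using the erf-based GELU as the reference. Which GELU variant hlb_cifar used before the change is not stated in this log, so the comparison is illustrative only.

```python
import math

def gelu_exact(x: float) -> float:
    # Reference GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def quick_gelu(x: float) -> float:
    # Sigmoid-based approximation: x * sigmoid(1.702 * x).
    return x / (1.0 + math.exp(-1.702 * x))

def max_abs_gap(lo: float = -5.0, hi: float = 5.0, n: int = 1001) -> float:
    # Largest absolute difference between the two on an evenly spaced grid.
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return max(abs(gelu_exact(x) - quick_gelu(x)) for x in xs)
```

The approximation swaps an `erf` for a single sigmoid, which is cheaper to compute. The gap between the two stays small over this range, which is consistent with the "same eval acc" remark.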
chenyu ec5a212b0a
modernize hlb_cifar (#3146)
* modernize hlb_cifar

do more things in Tensor space instead of numpy, clean up dtypes and use more Tensor methods.

* eigens are float64
2024-01-16 01:35:11 -05:00