Commit Graph

92 Commits

George Hotz d9c62a33c3
add cifar to datasets.py (#6210) 2024-08-20 11:42:49 -07:00
George Hotz 8390feb7b9
optim.OptimizerGroup in hlb_cifar (#5401) 2024-07-11 20:14:36 -07:00
George Hotz 5ba611787d
move image into tensor.py. delete features (#4603)
* move image into tensor.py

* change setup.py

* openpilot tests need pythonpath now
2024-05-15 10:50:25 -07:00
David Hou c0a048c044
batchnorm d(var)/d(mean) = 0 (#4430)
* d(var)/d(mean) = 0

* drop the number in test_schedule!
2024-05-05 00:25:45 -04:00
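
Note: the identity behind this commit: since var(x) = E[(x - mean)^2] with mean = E[x], we get d(var)/d(mean) = -2 * E[x - mean] = 0, so the mean can be detached inside the variance computation without changing any gradient. A minimal tinygrad-style sketch, illustrative rather than the PR's diff:

```
from tinygrad import Tensor

x = Tensor.randn(8, 16, requires_grad=True)
mean = x.mean(axis=0)
# d(var)/d(mean) = -2 * E[x - mean] = 0 at mean = E[x], so detaching the mean
# here removes a zero-gradient path from the backward graph for free
var = ((x - mean.detach()) ** 2).mean(axis=0)
```
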
David Hou 593c90d7d6
Resnet fp16 training with fp32 master weight copy (#4144)
* add casts to layers

* FLOAT flag

* detach

* no_grad for eval

* whitespace

* explicit fp32 initialization

* oops

* whitespace

* put back config['DEFAULT_FLOAT']

* bad

* live dangerously (don't hide bugs)

* don't bundle changes

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-14 11:25:08 -04:00
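
Note: the master-weight pattern this PR implements is the standard mixed-precision one: parameters stay in fp32 so small optimizer updates don't get lost, and a half cast is used for the forward pass. A minimal sketch assuming a tinygrad-style layer; HalfLinear is a hypothetical name, not the PR's code:

```
from tinygrad import Tensor, dtypes

class HalfLinear:
  def __init__(self, in_features: int, out_features: int):
    # the fp32 master weight is what the optimizer sees and updates
    self.weight = Tensor.glorot_uniform(out_features, in_features)
  def __call__(self, x: Tensor) -> Tensor:
    # forward runs in fp16; tiny updates still accumulate exactly in the fp32 copy
    return x.cast(dtypes.float16).linear(self.weight.cast(dtypes.float16).transpose())
```
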
chenyu c71627fee6
move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
David Hou 4b95350c41
fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto cuz bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
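
Note: the loss scaler in these bullets is the standard fp16 trick: multiply the loss before backward so small gradients don't flush to zero in half precision, then unscale the gradients before the optimizer step; per the last bullets the scaler stays a Python float and fp32 runs are not scaled. A sketch under those assumptions (train_step is hypothetical):

```
from tinygrad import Tensor, dtypes

LOSS_SCALE = 2.0**10 if dtypes.default_float == dtypes.float16 else 1.0

def train_step(model, opt, x: Tensor, y: Tensor) -> Tensor:
  opt.zero_grad()
  loss = model(x).sparse_categorical_crossentropy(y)
  (loss * LOSS_SCALE).backward()                     # scale up so fp16 grads don't underflow
  for p in opt.params: p.grad = p.grad / LOSS_SCALE  # unscale so the update is unchanged
  opt.step()
  return loss
```
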
chenyu 83f39a8ceb
env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
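
Note: usage is e.g. DEFAULT_FLOAT=HALF python3 examples/hlb_cifar10.py. Conceptually the flag does something like the following at import time (a sketch, not the exact dtype.py code):

```
import os
from tinygrad import dtypes

# map e.g. DEFAULT_FLOAT=HALF onto dtypes.half; leaving it unset keeps float32
if (df := os.getenv("DEFAULT_FLOAT")): dtypes.default_float = getattr(dtypes, df.lower())
```
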
chenyu e22d78b3d2
training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
Francis Lam a26090d404
search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
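
Note: in stdlib terms the change is to use the "spawn" start method, so workers don't inherit live GPU/runtime state through fork, and to recycle each worker after a bounded number of tasks. A generic sketch (compile_candidate is a placeholder):

```
import multiprocessing as mp

def compile_candidate(opts):
  return opts  # placeholder for the real per-candidate compile-and-time work

if __name__ == "__main__":  # required under "spawn": children re-import this module
  with mp.get_context("spawn").Pool(processes=4, maxtasksperchild=16) as pool:
    results = pool.map(compile_candidate, range(64))
```
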
chenyu b13457e4a7
explicit dtypes in hlb_cifar (#3707)
prepares for the bfloat16 change. added float() and cast(default_float) in whitening, and explicitly set dtype in various places that convert between numpy and Tensor
2024-03-12 18:20:23 -04:00
David Hou d16aa89561
don't allow MLB assigns with different axes (#3557)
* allow LB <- MLB assign, but don't reuse buffer

* update test

* update test

* assign assert axes are the same

* update tests to manually shard running stats

* unused import
2024-03-01 07:59:06 -05:00
David Hou e5385eecfc
UnsyncedBatchNorm with synced trainable weights for hlb cifar (#3472)
* UnsyncedBatchNorm with synced trainable weights for hlb cifar

* multitensor reshape tests

* test mlb assign change axis

* E501

* argfix axis

* don't import batchnorm from hlb_cifar in test_multitensor

* pass num_devices to UnsyncedBatchNorm in test, allow UnsyncedBatchNorm to be used with LB

* add backprop test for UnsyncedBatchNorm

* break out MLB assign and reshape changes

* manually shard running mean and running var

* don't shard unless syncbn=0

* replace nn.BatchNorm2d with UnsyncedBatchNorm

* don't increment num_batches_tracked if not tracking running stats

* update tests

* oops

* Revert "oops"

This reverts commit 5e8a67a535abea2ff288b1b804a9aa95eba40732.

* Revert "update tests"

This reverts commit 7ebf65d89ace1d3a32c3b28ee323ddee253262d6.

* Revert "don't increment num_batches_tracked if not tracking running stats"

This reverts commit 78de0ea9ee8cbd65dce28bd4abcc131c98451aa2.

* Revert "replace nn.BatchNorm2d with UnsyncedBatchNorm"

This reverts commit d03da53da70f009338e95f2b46315ac02a30149a.

* don't increment num_batches_tracked if not tracking running stats

* oops

* test_batchnorm_axis

* compare against torch

* types

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-02-29 22:52:07 -05:00
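
Note: reduced to its essentials, the idea is that each device keeps its own running mean/var while the learned scale and shift are shared. A conceptual sketch only (UnsyncedBN is a stand-in; the real class works on sharded tensors directly):

```
from tinygrad import nn

class UnsyncedBN:
  def __init__(self, sz: int, num_devices: int):
    self.bns = [nn.BatchNorm2d(sz) for _ in range(num_devices)]
    for bn in self.bns[1:]:  # share the trainable weight/bias, keep stats per device
      bn.weight, bn.bias = self.bns[0].weight, self.bns[0].bias
  def __call__(self, xs: list) -> list:  # one input shard per device
    return [bn(x) for bn, x in zip(self.bns, xs)]
```
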
chenyu d8ad9e5660
verify eval acc for hlb_cifar training (#3344)
set to 93% to reduce flakiness for now
2024-02-07 19:19:59 -05:00
chenyu 18e854cdbf
shrink MLB on sharded axis (#3255)
* shrink MLB on sharded axis

use a onehot structure to store the real partition. the goal is an unsynced batchnorm2d that can run on multi gpu for training.

draft version in https://github.com/chenyuxyz/tinygrad/pull/109

* SYNCBN flag

* test unclean shrinks

* UnsyncedBatchNorm reuses BatchNorm

* more robust pad arg check

* better types

* more tests!

* 6 gpus in benchmark

* disable slow GPUS=6 benchmark
2024-01-31 21:48:25 -05:00
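
Note: shrinking on the sharded axis is what lets each device address only its own partition of a multi-device tensor. A small sketch of the user-facing behavior this enables (device names hypothetical):

```
from tinygrad import Tensor

GPUS = ("CUDA:0", "CUDA:1")  # hypothetical two-device setup
t = Tensor.arange(8).reshape(8, 1).shard(GPUS, axis=0)
first_half = t.shrink(((0, 4), None))  # selects the partition living on the first device
```
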
chenyu b0a755288f
cifar EVAL_BS set default value to BS (#3274)
less compile time for eval due to cache. 500 was also a slow, uneven number for 6 GPUs. eval time 5.9s -> 3.4s
2024-01-29 17:37:12 -05:00
chenyu 9e5409be6c
cifar move GlobalCounters.reset() before shard (#3217)
* cifar move GlobalCounters.reset() before shard

also shard the mini batch in place

* don't eval with DISABLE_BACKWARD
2024-01-23 16:07:43 -05:00
chenyu 3c179cc27c
cifar only shuffle data at epoch start (#3216)
save 1ms CPU time per batch. also only shuffle training set
2024-01-23 14:41:22 -05:00
chenyu 8465938d29
minor hlb_cifar cleanups (#3208)
mostly cosmetic. LATEBEAM=4 single 7900xtx 59.2 seconds
2024-01-22 12:38:39 -05:00
chenyu 827b7a3c64
cleanup pad_reflect and make_square_mask in hlb_cifar (#3206)
removed some complicated looking stuff. no wall time difference
2024-01-22 11:30:46 -05:00
chenyu 99884f4c98
cifar flags for RANDOM_CROP, RANDOM_FLIP, and CUTMIX (#3204)
experimenting with different setups, also would like to jit the data augmentation next
2024-01-22 01:12:51 -05:00
chenyu 836883fedc
comment out cutmix in hlb_cifar (#3201)
it's a no-op with multi gpu and fewer STEPS. also the patch was selected from the whole dataset, not from the same batch
2024-01-21 22:24:53 -05:00
chenyu e52a609240
make WINO a context var, and LATEWINO in hlb_cifar (#3161) 2024-01-17 20:21:26 -05:00
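
Note: ContextVars are tinygrad's scoped knobs, so making WINO one lets Winograd convolutions be toggled for just part of a run, which is presumably how LATEWINO enables it only late in training. Usage looks like:

```
from tinygrad.helpers import Context, WINO

print(WINO.value)  # the default (0) outside any scope
with Context(WINO=1):
  pass  # convolutions compiled inside this scope take the Winograd path
```
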
chenyu 589c16756f
hlb_cifar multi gpu training (#3150)
* cifar train with multi gpu

* GPUS=1 is noop
2024-01-16 14:38:45 -05:00
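
Note: the data-parallel recipe behind GPUS=n: replicate the trainable parameters on every device and split each batch along the batch axis. Roughly, as a sketch (to_multigpu is hypothetical):

```
from tinygrad import Tensor, Device
from tinygrad.nn.state import get_parameters

def to_multigpu(model, X: Tensor, Y: Tensor, n: int = 2):
  gpus = tuple(f"{Device.DEFAULT}:{i}" for i in range(n))
  for p in get_parameters(model): p.shard_(gpus)       # replicate trainables on every device
  return X.shard(gpus, axis=0), Y.shard(gpus, axis=0)  # split the batch across devices
```
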
chenyu b9d470577c
gelu -> quick_gelu in hlb_cifar (#3147)
89 -> 86 seconds, same eval acc
2024-01-16 02:03:37 -05:00
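
Note: quick_gelu swaps the erf-based gelu for a single sigmoid, x * sigmoid(1.702 * x), which is cheaper per element and, per the message, keeps eval accuracy here:

```
from tinygrad import Tensor

x = Tensor.randn(4)
y = x.quick_gelu()                    # the built-in
y_manual = x * (x * 1.702).sigmoid()  # the same approximation, written out
```
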
chenyu ec5a212b0a
modernize hlb_cifar (#3146)
* modernize hlb_cifar

do more things in Tensor space instead of numpy, clean up dtypes and use more Tensor methods.

* eigens are float64
2024-01-16 01:35:11 -05:00
chenyu 22920a7e55
add LATEBEAM to hlb_cifar (#3142)
still too slow to search on tinybox though
2024-01-15 23:26:03 -05:00
Yixiang Gao 8e1fd6ae9d test works 2024-01-03 07:22:01 -08:00
Yixiang Gao 4f89f8b73a make sure the old hyp breaks the test 2024-01-03 07:13:54 -08:00
Yixiang Gao b753d280f7 move hyp out of the train so it can be imported 2024-01-02 15:56:17 -08:00
Yixiang Gao 2e4d9ad936 adjust div factor to avoid underflow 2024-01-02 13:47:13 -08:00
George Hotz a280cfe169
move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
George Hotz c81ce9643d
move globalcounters to ops (#2960)
* move globalcounters to ops

* missed a few

* sick of that failing
2024-01-01 14:21:02 -08:00
chenyu 6d7e9e0a56
hotfix convert Y_train to int before passing into index (#2850) 2023-12-19 11:40:56 -05:00
chenyu 0723f26c80
dtypes.default_float and dtypes.default_int (#2824) 2023-12-18 12:21:44 -05:00
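
Note: these are module-level knobs consulted whenever a Tensor is created without an explicit dtype; the DEFAULT_FLOAT env var from #3902 later exposed the float one from outside. For example:

```
from tinygrad import Tensor, dtypes

dtypes.default_float = dtypes.half
assert Tensor([1.0]).dtype == dtypes.half  # new float tensors now default to fp16
dtypes.default_float = dtypes.float32      # restore
```
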
George Hotz c6eb618013
tests from new lazy branch (#2774)
* tests from new lazy branch

* fix lin 11

* that was needed

* doesn't fail

* mark

* meant that

* llvm passes
2023-12-14 23:06:39 -08:00
qazal ab2d4d8d29
Fix cl import in the copy_speed test and cifar example (#2586)
* fix CL import

* update test to only run on GPU

* update hlb_cifar too
2023-12-03 09:22:07 -08:00
George Hotz 2c363b5f0b
new style device (#2530)
* cpu tests pass

* torch works

* works

* metal works

* fix ops_disk

* metal jit works

* fix openpilot

* llvm and clang work

* fix webgpu

* docs are rly broken

* LRU works on metal

* delete comment

* revert name to ._buf. LRU only on Compiled

* changes

* allocator

* allocator, getting closer

* lru alloc

* LRUAllocator

* all pass

* metal

* cuda

* test examples

* linearizer

* test fixes

* fix custom + clean realize

* fix hip

* skip tests

* fix tests

* fix size=0

* fix MOCKHIP

* fix thneed

* copy better

* simple

* old style metal copy

* fix thneed

* np reshape

* give cuda a device
2023-11-30 17:07:16 -08:00
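
Note: the LRU allocator introduced here caches freed buffers by size instead of returning them to the device, so re-allocating a same-sized buffer is near-free. A conceptual sketch, not the real LRUAllocator class:

```
from collections import defaultdict

class LRUCacheAlloc:  # hypothetical name
  def __init__(self, backend):
    self.backend, self.free_buffers = backend, defaultdict(list)  # size -> freed buffers
  def alloc(self, size: int):
    # reuse a previously freed buffer of the same size if one exists
    return self.free_buffers[size].pop() if self.free_buffers[size] else self.backend.alloc(size)
  def free(self, buf, size: int):
    self.free_buffers[size].append(buf)  # keep it around for the next alloc
```
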
George Hotz 9e07824542
move device to device.py (#2466)
* move device to device.py

* pylint test --disable R,C,W,E --enable E0611

* fix tests
2023-11-27 11:34:37 -08:00
wozeparrot 4c44d1344b
feat: remove cache_id (#2236) 2023-11-08 08:09:21 -08:00
George Hotz 2f7aab3d13
move optimize_local_size (#2221)
* move optimize_local_size

* interpret_ast
2023-11-05 21:00:52 -08:00
wozeparrot c29653605e
hip multigpu training (#1878)
* feat: move to hip

* feat: special path for RawBufferTransfer

* feat: initial rawbuffertransfer

* feat: hip ipc

* feat: working hip ipc

* feat: need to base device without args

* feat: close mem handle

* feat: modified test

* feat: more multihip stuff

* clean: cleanup

* feat: cleaner

* feat: don't crash

* feat: test more

* clean: way cleaner hip wrapper

* feat: barrier

* feat: barrier

* feat: this breaks stuff

* feat: we can use empty here

* feat: maybe fix tests

* feat: maybe fix tests again?

* fix: probably fix tests

* feat: no waiting here

* feat: wait here

* feat: much larger test

* feat: need to sync here

* feat: make this async

* feat: no waiting!

* feat: cut here

* feat: sync copy

* feat: random imports

* feat: much cleaner world

* feat: restore this

* feat: restore this

* clean: cleanup

* feat: set this
2023-10-24 17:35:53 -04:00
George Hotz 5cfec59abc
hlb cifar touchups (#2113)
* types and cnt and EVAL_STEPS

* eval time + always print eval
2023-10-18 16:26:15 -07:00
wozeparrot 4d1e59abfd
fix: only when distributed (#2102) 2023-10-17 20:09:04 -07:00
Sean D'Souza 999c95ea29
fix: hlb cifar types (#2099) 2023-10-17 19:23:50 -07:00
George Hotz 9b1c3cd9ca hlb_cifar: support EVAL_STEPS=1000, print when dataset is shuffled 2023-10-18 01:11:08 +00:00
Yixiang Gao 3187962476
CIFAR HALF mode (#2041)
* load weights in fp16

* add dtype option in nn

* fix test

* no need for dtype in nn

* add option to load weights in FP16, but NaN

* change loss scaler

* cast to float32 for norm layer

* add a todo for the forward pass padding

* fix transform
2023-10-12 10:19:51 -07:00
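
Note: "cast to float32 for norm layer" is the usual half-precision workaround: batchnorm's mean/variance reductions run in fp32 to avoid NaNs, and the result is cast back. A sketch (bn_fp32 is hypothetical):

```
from tinygrad import Tensor, dtypes

def bn_fp32(bn, x: Tensor) -> Tensor:
  # normalize in fp32 for numerical stability, hand fp16 back to the rest of the net
  return bn(x.cast(dtypes.float32)).cast(dtypes.float16)
```
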
Yixiang Gao 094d3d71be
with Tensor.train() (#1935)
* add with.train

* remove the rest TODOs

* fix pyflake

* fix pyflake error

* fix mypy
2023-09-28 18:02:31 -07:00
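
Note: Tensor.train() is a context manager that sets Tensor.training inside the block, so dropout and batchnorm take their training paths, and restores it on exit:

```
from tinygrad import Tensor

assert not Tensor.training
with Tensor.train():
  assert Tensor.training  # dropout/batchnorm behave as in training here
assert not Tensor.training
```
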
Yixiang Gao cb5d6576cb
cifar step time 65ms while stay above 94% (#1888)
* change reduceop heuristics

* add model ema and jit hack

* add ema eval

* have to create a duplicate eval function for jit

* remove manual seed

* 94% achievable with normal eval

* ema is outputting the same results as normal

* fix ema bug

* ema achieves 94% with fix seed

* multigpu tested

* constant fold decay, fix jit, adjust message for multigpu

* pull SpeedyResNet out of train_cifar()
2023-09-21 11:19:32 +08:00
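
Note: the model EMA used for eval keeps a decayed copy of the weights; "constant fold decay" suggests the factor is baked into the jitted update rather than passed as a tensor. A generic sketch of the update (ema_update is hypothetical):

```
from tinygrad import Tensor

def ema_update(ema_params: list, params: list, decay: float = 0.999):
  # ema <- decay * ema + (1 - decay) * current weights
  for e, p in zip(ema_params, params):
    e.assign(e.detach() * decay + p.detach() * (1 - decay))
```
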
Yixiang Gao 9d93a82354
remove FAKEDATA (#1685) 2023-08-26 20:15:54 -04:00