Commit Graph

448 Commits

Author SHA1 Message Date
chenyu a1133beb80
KFD GEMM (#4221)
added to benchmark CI and fixed duplicated filenames between cuda and ptx
2024-04-19 00:43:18 -04:00
Francis Lata 3644077a42
[MLPerf][UNet3D] Add DICE loss + metrics (#4204)
* add DICE loss and metrics

* update dice to include reference implementation's link

* remove unused imports

* remove unnecessary test file and update pred + label for metrics and losses test

* add tests to CI + add exclusion of mlperf_unet3d

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-17 20:09:33 -04:00
qazal ba8602612b
Fuzz all permutations of schedule (#4136)
* simple toposort

* fuzzer

* init in_degree

* move to tests

* same seed

* configure paths

* internal graph

* compare LazyBuffers

* simpler

* simple graph

* assign works

* simpler

* fix JIT

* upstream ci

* move ci

* fix the path

* DEBUG=1

* limit max paths

* launch a cmp kernel

* Revert "launch a cmp kernel"

This reverts commit 791c6089922fa7d800456f28fc167842f188ac7e.

* exec ground truth

* better perf

* copy ground truth once

* gpu allclose ast try1

* Revert "gpu allclose ast try1"

This reverts commit 1f82103af3a7bfedb9f858b6c58b0b94f1c7e6b0.

* prerealized bufs freezing

* teeny cleanups

* reuse Buffers

* Revert "reuse Buffers"

This reverts commit a71de94b035bd5ceb1ec257f6b2529b166bcd30b.

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-17 05:03:21 +04:00
George Hotz 8f749ae0eb
New docs are in mkdocs (#4178)
* start mkdocs

* simple docs for tensor

* more docs

* move those back

* more docs

* copy markdown extensions

* docs legacy

* docs building workflow

* fix showcase links

* only that?

* install tinygrad

* add docs to setup.py

* Delete examples/llm.c/data
2024-04-16 10:59:51 +04:00
George Hotz e14a9bca0c hotfix: bump line count to 7500 for NV backend 2024-04-15 23:18:46 +04:00
chenyu a7c6864260
remove CAST_BEFORE_VIEW (#4152)
* remove CAST_BEFORE_VIEW

testing perf, also this might have an issue with assign?

* remove all
2024-04-13 01:05:08 -04:00
George Hotz 0f16709c00 hotfix: remove test speed vs torch 2024-04-11 08:37:57 -07:00
George Hotz 10dbf90b2c hotfix: test speed 2024-04-09 13:20:39 -07:00
geohotstan 15f2f39658
conceptually simpler fancy index (#3335)
* init

* add failed case

* fix: temp comment out MULACC cast

* is this right?

* add test case

* oops, forgot to get rid of temp test

* WOOOOOO TOOK OUT 2 TRANSPOSES IN GATHER YAY

* cleaner

* comment cleanup

* update docs

* resolve conflict

* oops

* SUPA FAST

* comment out a test

* del some print statements

* use new broadcast stuff

* more clean up

* move try except

* skip fancy indexing for python backend test_ops
2024-04-09 11:18:04 -04:00
chenyu 9a95d87366
metal CI run llama with 4 shards (#4103)
this can catch multi-tensor issues on mac (see the sketch after this entry).
2024-04-07 11:04:08 -04:00
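
A minimal sketch of what a 4-shard run looks like from the tinygrad API side, assuming the `Tensor.shard` multi-tensor API and `METAL:<n>` device strings (both are assumptions here, not taken from the commit above):

```
# hedged sketch: shard a tensor across 4 Metal devices and reduce it back
from tinygrad import Tensor

GPUS = tuple(f"METAL:{i}" for i in range(4))  # assumed device naming
x = Tensor.rand(4, 256).shard(GPUS, axis=0)   # one slice of the batch per device
y = (x + 1).sum()                             # forces cross-device accumulation
print(y.item())
```
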
chenyu a023a1ed87
update github action to actions/cache@v4 (#4077)
get rid of warning `Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/cache@v3.`
2024-04-04 22:24:26 -04:00
chenyu 9e0ebf8979
remove dtype from FlopCounter (#4075)
the annoying thing about removing FlopCounter entirely is that, for devices that do not support local, the matmul index ALU count is huge.
we can remove the dtype first.

sneak in updating `ruff` command to `ruff check`
2024-04-04 21:23:28 -04:00
Szymon Ożóg 68fe3527f1
Tensor core ptx (#3894)
* tensor cores

* Merge from master

* faster program start in llvm (#3897)

* Fix the result permutation in einsum (#3895)

* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>

* touchup einsum (#3900)

don't need rhs_letters

* hotfix check ckpts before writing achieved model (#3901)

this killed tinybox green run

* replace dtype.name str with render_dtype (#3903)

fixed some bf16 cast issues since bf16 does not have `.name`.
also more robust if there are lang-specific type overrides

* add --minimal flag to nvrtc (#3899)

* wmma: fix the AMD TC threads to split the first 16 threads (#3904)

previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride

* training cifar with BF16 on CUDA (#3905)

* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda

* include negative float in test_dtype (#3884)

* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow

* add to benchmark

* change var name to satisfy mypy

* spacing

* Update to new TensorCore format

* Spacing

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-04 07:32:31 -07:00
George Hotz 2abb474d43
kfd driver wip (#3912)
* kfd driver wip

* cleanups

* kfd almost ready to ring doorbell

* ding dong?

* issues with signals

* something

* works

* ops kfd

* add amd_signal_t

* works...sometimes

* program runs

* _gpu_alloc cleanup

* cleanups

* work

* header + enable profiling (#3959)

* header + enable profiling

* just cleaner

* measure

* only local time domain

* remove old comments

* fix with master

* elf parsing (#3965)

* elf parsing

* fix kernels with private

* not used

* clean up

* clean up 2

* add flags

* kfd sdma (#3970)

* working sdma

* remove driver, shorter

* all commands we might need

* svm

* kfd remove hardcoded values (#4007)

* remove hardcoded values

* match above line

* 7k lines + revert hsa

* update that from origin

* fix sdma reg gen

* not the updated SDMA

* compiler_opts

* don't require kfd_ioctl

* get ioctls from python

* get ioctls from python

* remove build_sdma_command

* merge into 64-bit fields

* shorter

* fix property spelling and off by one

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-30 15:08:12 -07:00
chenyu 3fee689ded
fix ops_python for test_uops (#3982) 2024-03-28 22:48:55 -04:00
Francis Lam 7c5729a3bd
wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
George Hotz da07f31fd4 hotfix: remove bf16 test entirely 2024-03-26 20:50:27 -07:00
George Hotz 0d5845fb5b hotfix: jit is flaky on mac 2024-03-26 20:44:05 -07:00
George Hotz 629cbc5587
only abstractions 2 (#3947) 2024-03-26 20:02:18 -07:00
Francis Lam 5530b0cbed
fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942)
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage

* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures

* clean up naming and use set
2024-03-26 16:25:24 -04:00
chenyu 8df6587c41
hotfix 97.3 for beautiful_mnist (#3941) 2024-03-26 15:02:53 -04:00
chenyu ef537672bf
bf16 support in metal (#3929)
it runs if the device GPU supports bfloat. updated the ci benchmark too
2024-03-25 23:17:36 -04:00
chenyu d651835ef5
verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
chenyu 83f39a8ceb
env var to change default float (#3902)
* env var to change default float to fp16 or bf16 (see the usage sketch after this entry)

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
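
A small usage sketch for the flag added above; the accepted value name (`HALF`) and the `dtypes.default_float` attribute it is expected to set are assumptions based on the dtype names mentioned in the commit:

```
# run as: DEFAULT_FLOAT=HALF python3 check_default_float.py  (value name assumed)
from tinygrad import Tensor, dtypes

x = Tensor.rand(4, 4)                 # created with the default float dtype
print(dtypes.default_float, x.dtype)  # both should reflect DEFAULT_FLOAT
```
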
chenyu 2c69888654
include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
chenyu e22d78b3d2
training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
chenyu ee502c8055
fixup to_movement_ops and add back to CI (#3881) 2024-03-22 18:14:49 -04:00
George Hotz f4055439dc
don't include hip common (#3851)
* don't install hip common

* only that

* Revert "only that"

This reverts commit 85f22015d98d2775641cb9c7851fe595bdc97d29.

* less

* needed

* sep comgr

* header file

* 6.0.2

* update hsa

* hsakmt

* Revert "hsakmt"

This reverts commit d3a118078ed1c032f31abddb9d30cf6c13fc4f5e.
2024-03-22 08:50:50 -07:00
chenyu 82ce60e172
use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark (#3870)
smaller first batch saves about 0.05 ms per token. 1.75ms / tok on local 3090
2024-03-22 00:40:06 -04:00
chenyu bc482729d0
lower hlb_cifar acc to 93.3 (#3865)
ran 30 runs and the lowest I saw was 93.35; lowered to 93.3 for now.

maybe reenable ema later if it reduces variance
2024-03-21 17:58:53 -04:00
chenyu 7ff47e45a1
cifar TARGET_EVAL_ACC_PCT=93.5 (#3843) 2024-03-20 16:56:51 -04:00
chenyu 727de5ba1e
llama 7B on 3090 benchmark (#3837)
* llama 7B on 3090 benchmark

* symlink llama
2024-03-20 12:48:22 -04:00
chenyu 47b9cc2dfe
use float32 for rand buffer in test_beam_search and test in metal (#3831) 2024-03-19 23:22:58 -04:00
chenyu e12bc85014
use BS=128 and BS=768 for resnet benchmark (#3815)
50% more hcopt perf with this one weird trick
2024-03-18 23:49:55 -04:00
chenyu 20681d5c4a
remove old dist multigpu (#3811) 2024-03-18 18:31:05 -04:00
chenyu 5dd048a378
remove HIP in core tinygrad (#3810)
* remove HIP in core tinygrad

ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag

* EMULATE_CUDA
2024-03-18 18:19:27 -04:00
wozeparrot a0ab755317
threefry again (#3785)
* feat: initial xor

* feat: initial threefly

* feat: remove custom random

* fix: really need to install precommit

* feat: lmao forgot that this is rotate not a shift

* clean: put that there

* feat: numpy xor

* feat: quick test for xor

* feat: llvm xor

* feat: slightly working xor in torch

* feat: rand works in jit

* clean: save a line

* feat: match jax

* feat: maybe test against jax

* feat: requires_grad

* fix: fix test_symbolic_ops

* feat: lower alpha

* feat: just pad

* fix: maybe fix training tests?

* fix: fix some llvm stuff

* feat: cursed realize on the way out

* feat: testing jax

* fix: why is the jax install process not simple

* fix: maybe passing test

* fix: symbolic workarounds

* clean: still need that precommit

* fix: aaaa

* fix: more test fixes

* fix: quick fix for wgsl

* feat: need to set requires_grad on the final tensor

* feat: one more tensor

* feat: don't take forever

* feat: seeing y ci is brok

* feat: can't allocate 64GiB lmao

* fix: fix this

* feat: hope this doesn't break smth before i go to bed

* feat: don't destroy ram

* feat: int

* feat: remove jax

* feat: properish workaround?

* feat: skip slow webgpu tests

* feat: no longer fails

* feat: use dtypes

* feat: real number

* fix: torch

* fix: don't test against reference for torch

* feat: to device

* feat: fix advanced indexing

* feat: correct casting

* feat: even rng_counter

* feat: match master

* feat: this was actually bad

* fix: maybe?

* feat: store

* feat: remove realizes

* feat: somehow this is important

* feat: somehow this is also important

* feat: save a line

* fix: don't need that anymore

* feat: restore this

* fix: linter

* feat: remove realizes

* fix: realized is in base now

* fix: add back cast

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: :(

* fix: :(

* fix: not being dumb

* feat: try changing less tests

* feat: shouldn't have to change that

* feat: contiguous bumps it by one

* fix: hmm

* fix: numpy memory moment

* fix: cl_khr_fp16

* fix: torch has different tensor count

* fix: missing contiguous

* hmm: hmm

* fix: some fixes

* fix: typing

* feat: dont do that

* feat: typing fixes

* feat: why is this realize required?

* feat: ngl kinda odd typing

* feat: oh

* feat: remove realizes

* feat: why is this realize required?

* fix: hacky patch for cudacpu

* fix: without this realize pytest crashes?????

* fix: shorter line

* fix: cudacpu fixes

* fix: cudacpu fixes

* feat: real buffer

* feat: don't search when searching lmao

* fix: can't use contiguous things

* fix: no more 100GB arrays

* fix: revert

* fix: skip 7 and 10

* feat: working ish beam

* feat: minimize changes

* feat: seed 0 stable diffusion example changed

* fix: different on ci

* fix: no beam

* feat: make threefry optional

* fix: check value

* fix: unused import

* feat: threefry default

* fix: 5d

* feat: allow non upcast div

* fix: 5d better

* fix: 5d better

* fix: save all dtype

* feat: proper error

* feat: lazyop key

* fix: check float

* feat: try removing this realize now

* feat: disable threefry for uops hip tensor cores

* feat: don't need that

* feat: only check upcast

* fix: disable threefry for some metal tests

* feat: disable for metal tensor uops as well

* feat: disable for most uops

* fix: disable threefry for new uops tests

* feat: multitensor

* fix: typing

* feat: threefry default off

* feat: skip threefry half rand

* feat: restore old

* fix: bad git

* clean: ruff

* feat: bfloat16 fix

* fix: :|

* feat: restore old

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-18 16:47:07 -04:00
chenyu 1711274654
7B llama on 4 gpus on benchmark (#3804) 2024-03-18 14:32:37 -04:00
George Hotz 311cf2b7d3
Revert "threefry_2x32 (#2601)" (#3784)
This reverts commit db3de54bc4.
2024-03-17 10:27:20 -07:00
wozeparrot db3de54bc4
threefry_2x32 (#2601)
* feat: initial xor

* feat: initial threefly

* feat: remove custom random

* fix: really need to install precommit

* feat: lmao forgot that this is rotate not a shift

* clean: put that there

* feat: numpy xor

* feat: quick test for xor

* feat: llvm xor

* feat: slightly working xor in torch

* feat: rand works in jit

* clean: save a line

* feat: match jax

* feat: maybe test against jax

* feat: requires_grad

* fix: fix test_symbolic_ops

* feat: lower alpha

* feat: just pad

* fix: maybe fix training tests?

* fix: fix some llvm stuff

* feat: cursed realize on the way out

* feat: testing jax

* fix: why is the jax install process not simple

* fix: maybe passing test

* fix: symbolic workarounds

* clean: still need that precommit

* fix: aaaa

* fix: more test fixes

* fix: quick fix for wgsl

* feat: need to set requires_grad on the final tensor

* feat: one more tensor

* feat: don't take forever

* feat: seeing y ci is brok

* feat: can't allocate 64GiB lmao

* fix: fix this

* feat: hope this doesn't break smth before i go to bed

* feat: don't destroy ram

* feat: int

* feat: remove jax

* feat: properish workaround?

* feat: skip slow webgpu tests

* feat: no longer fails

* feat: use dtypes

* feat: real number

* fix: torch

* fix: don't test against reference for torch

* feat: to device

* feat: fix advanced indexing

* feat: correct casting

* feat: even rng_counter

* feat: match master

* feat: this was actually bad

* fix: maybe?

* feat: store

* feat: remove realizes

* feat: somehow this is important

* feat: somehow this is also important

* feat: save a line

* fix: don't need that anymore

* feat: restore this

* fix: linter

* feat: remove realizes

* fix: realized is in base now

* fix: add back cast

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: :(

* fix: :(

* fix: not being dumb

* feat: try changing less tests

* feat: shouldn't have to change that

* feat: contiguous bumps it by one

* fix: hmm

* fix: numpy memory moment

* fix: cl_khr_fp16

* fix: torch has different tensor count

* fix: missing contiguous

* hmm: hmm

* fix: some fixes

* fix: typing

* feat: dont do that

* feat: typing fixes

* feat: why is this realize required?

* feat: ngl kinda odd typing

* feat: oh

* feat: remove realizes

* feat: why is this realize required?

* fix: hacky patch for cudacpu

* fix: without this realize pytest crashes?????

* fix: shorter line

* fix: cudacpu fixes

* fix: cudacpu fixes

* feat: real buffer

* feat: don't search when searching lmao

* fix: can't use contiguous things

* fix: no more 100GB arrays

* fix: revert

* fix: skip 7 and 10

* feat: working ish beam

* feat: minimize changes

* feat: seed 0 stable diffusion example changed

* fix: different on ci

* fix: no beam

* feat: make threefry optional

* fix: check value

* fix: unused import

* feat: threefry default

* fix: 5d

* feat: allow non upcast div

* fix: 5d better

* fix: 5d better

* fix: save all dtype

* feat: proper error

* feat: lazyop key

* fix: check float

* feat: try removing this realize now

* feat: disable threefry for uops hip tensor cores

* feat: don't need that

* feat: only check upcast

* fix: disable threefry for some metal tests

* feat: disable for metal tensor uops as well

* feat: disable for most uops

* fix: disable threefry for new uops tests

* feat: multitensor

* fix: typing

* feat: threefry default off

* feat: skip threefry half rand

* feat: restore old

* fix: bad git

* clean: ruff

* feat: bfloat16 fix

* fix: :|

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-17 10:19:33 -07:00
George Hotz 53adcb34f5
remove hip backend (#3783)
* remove hip backend

* remove unused

* rhip

* more RHIP
2024-03-17 10:12:16 -07:00
chenyu 77febb44e6
llama 7B on 6 gpus benchmark (#3773) 2024-03-16 11:38:52 -04:00
George Hotz 0870dd5b3b hotfix: switch resnet training from HIP -> HSA in CI 2024-03-15 13:35:52 -07:00
chenyu 8ea53951c1
bfloat16 Tensor.rand (#3764)
* Tensor.rand for bfloat16

for numpy-based random, generate one as float then cast to bfloat16 (see the sketch after this entry).

close #3653

* remove realize
2024-03-15 15:05:13 -04:00
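
A sketch of the approach described above, assuming the cast happens on the Tensor side because numpy cannot produce bfloat16 directly; `rand_bf16` is a made-up helper, not the actual implementation:

```
# generate float32 randoms with numpy, then cast the Tensor to bfloat16
import numpy as np
from tinygrad import Tensor, dtypes

def rand_bf16(*shape):
  data = np.random.default_rng().random(shape, dtype=np.float32)
  return Tensor(data).cast(dtypes.bfloat16)

print(rand_bf16(2, 3).dtype)  # dtypes.bfloat16
```
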
chenyu a2d3cf64a5
move is_dtype_supported to test.helpers (#3762)
* move is_dtype_supported to test.helpers

updated all places that check if float16 is supported

* fix tests
2024-03-15 14:33:26 -04:00
chenyu 922f8319cb
Run test_real_world in METAL test (#3760)
* clean up test_real_world

* skip that

* JIT=2 for metal

* all device
2024-03-15 13:56:52 -04:00
George Hotz 5b3d8a886e
split tinybox benchmark into two (#3741)
* split tinybox benchmark into two

* symlinks
2024-03-14 14:12:32 -07:00
David Hou 199f7c4342
MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

less samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf35ca4be6c31facafccdd1e177469f2f.

Revert "log layer stats"

This reverts commit 9d9822458524c514939adeee34b88356cd191cb0.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57dc5fa70169189fd33b24dbbee99c40, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... it hits the acc... it's fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't use unsyncedbatchnorm if syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to bechmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
chenyu f30fb192b7
resnet eval on tinybox ci (#3714) 2024-03-13 13:26:30 -04:00
chenyu d69170e27e
add llama 2 70B in ci and verify output (#3682)
* add llama 2 70B in ci and verify output

* ln -s llama2 dir
2024-03-11 12:48:22 -04:00
chenyu e10ee2ed3f
llama beam tinybox ci (#3680) 2024-03-11 01:35:39 -04:00
chenyu bad6adaf8c
add mixtral and 6 gpus cifar to tinybox ci (#3676)
* add mixtral and 6 gpus cifar to tinybox ci

* print total ram used at the end of loading
2024-03-10 18:25:31 -04:00
qazal bdd62c7fd8
make the bf16 include dynamic (#3642)
* dynamic prefix

* add common ones above

these are common dtypes

aesthetics

* regression test

fuzz it

test

* run in CI

* use .append

* faster
2024-03-07 10:31:35 -05:00
David Hou 0afaf70d57
lars optimizer + tests (#3631)
* lars optimizer + tests

* fix skip list!

* use id to compare in skip list

* go back to using set

* Tensor(bool) * Tensor(bool) is and

* don't lint external/mlperf_resnet

* whitespace

* add external_test_optim to opencl tests

* give mlperf task a name

* mlperf under onnx

* remove track_gnorm

* contiguous instead of realize

* assert momentum and weight decay positive

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-06 18:11:01 -05:00
George Hotz 81baf3eed3
bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
George Hotz 568353fa84 hotfix: bump line count to 6500 2024-03-06 07:52:18 -08:00
chenyu 3c3f846c45
tinybox benchmark with HSA (#3603)
* tinybox benchmark with HSA

* torch cuda init can fail

* no TORCHCUDA

* print torch version

* LD_PRELOAD="/opt/rocm/lib/libhsa-runtime64.so"
2024-03-05 11:03:52 -05:00
chenyu 957e9800f1
llama + beam to mac benchmark, full cifar to nvidia benchmark (#3612)
would merge if it's also ~1 minute. btw why is gpt2 beam not slower in the first beam run?
2024-03-04 21:35:57 -05:00
chenyu c3b8d285aa
cleanup uops (#3605)
use `is` to compare with enums, remove long lines, and make things slightly more compact (see the sketch after this entry)
2024-03-04 11:03:14 -05:00
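
The `is` comparison mentioned above relies on Enum members being singletons; a self-contained sketch with a stand-in enum (not tinygrad's actual op enums):

```
# Enum members are singletons, so identity comparison is safe and reads as intent
from enum import Enum, auto

class Ops(Enum):  # stand-in for the UOps/BinaryOps-style enums
  ADD = auto()
  MUL = auto()

op = Ops.ADD
assert op is Ops.ADD   # the style the cleanup prefers
assert op == Ops.ADD   # equivalent here, but `is` makes the intent explicit
```
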
chenyu 8e5d60a322
add more gpt2 variant in mac/nvidia benchmark (#3599) 2024-03-03 17:55:30 -05:00
George Hotz 770707b376 hotfix: gpuocelot no rebuild 2024-03-02 15:57:38 -08:00
Francis Lam 162dfb07d9
fuzz_linearizer: fix uops and add to test.yml (#3588) 2024-03-02 15:03:42 -08:00
Francis Lam e17f1821a7
wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544) 2024-03-01 17:51:02 -08:00
chenyu b7e555f6c0
run test_linearizer_failures on PYTHON backend (#3565)
* run test_linearizer_failures on PYTHON backend

only test 1, some have hanging issues and gated store is not implemented

* --durations=20

* two less slow ones
2024-03-01 17:00:18 -05:00
George Hotz 5a6e151844
no barrier side effect (#3550)
* no barrier side effect

* finish barrier removal
2024-02-29 18:10:04 -08:00
George Hotz 2c19ab6561
define var (#3548)
* define var

* remove vars from there

* fix python symbolic ops

* fix llvm

* pypath
2024-02-29 16:43:27 -08:00
chenyu 978a997d1f
print nvidia-smi in CI benchmark (#3546) 2024-02-29 17:31:37 -05:00
George Hotz e7cda40d52 Revert "hotfix: disable metal graph"
This reverts commit 3541602877.
2024-02-28 16:25:12 -08:00
George Hotz 3541602877 hotfix: disable metal graph 2024-02-28 10:33:34 -08:00
George Hotz c34d382a1e
bump to macos-14 M1 (#3520)
* bump to macos-14 M1

* bump cache key

* no -n auto

* jit=2

* real tensor cores
2024-02-28 10:28:25 -08:00
George Hotz 7698781389
Revert "wmma: add CUDA tensor core (#3464)" (#3474)
This reverts commit e9cef13f0b.
2024-02-22 11:58:16 +01:00
Francis Lam e9cef13f0b
wmma: add CUDA tensor core (#3464) 2024-02-22 11:57:08 +01:00
wozeparrot 57678012e1
Upload correct benchmark artifact (#3471)
* fix: correct filename

* fix: why is this .py?
2024-02-22 01:14:16 -05:00
chenyu 7c0fc40123
enable test IMAGE=2 PYTHON=1 python3 test/test_ops.py TestOps.test_simple_conv2d (#3468) 2024-02-21 18:30:12 -05:00
chenyu 77d2a4c12a
regenerate kernel dataset after reduce arg to axis change (#3467)
```
./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```
2024-02-21 18:16:13 -05:00
George Hotz 871ba73e65
_reduce_op is axis based now (#3462)
* _reduce_op is axis based now

* axis_

* update lin failures

* disable that

* fix shape
2024-02-21 16:36:31 +01:00
chenyu 02683a8659
gate the cast before movements in lazy (#3452)
it made gpt2 slower (2ms -> 2.5ms on 3090, 7ms -> 8ms on M1 Max with BEAM=2).
disabled it in gpt2 benchmark before understanding the full issue
2024-02-20 09:36:22 -05:00
qazal 7864fb69d1
delete MovementOps (#3434)
* delete MovementOps

* keep extra/to_movement_ops.py
2024-02-19 23:21:44 +01:00
Patrick Tsai ac9d94a068
Cast correctly in python emulator (dtype tests pass) (#3446)
* Cast correctly in python emulator

* Update test yml and fix lint

* make ruff pass

* mypy passes

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-02-19 13:34:02 +01:00
George Hotz b1c0d8c99d
remove cpu and torch backends (#3399)
* remove cpu and torch backends

* don't copy to cpu

* use clang instead of cpu

* multitensor gathers on the first device

* clang is cpu + use default

* fixup

* bugfix
2024-02-15 16:55:39 +01:00
Obada Khalili 75f7e21a80
Make tests in `test/test_ops.py` pass for Python emulator (#3384)
* fix OverflowError in UnaryOps.EXP2

* avoid accessing outputs for void uops

* skip execution for UOps.IF and UOps.ENDIF

* initialize bytearray to the correct size in UOps.DEFINE_LOCAL

* validate len of input that has .sz > 1

* remove comment in code

* reinitialize loop of already iterated

* validate first value in input to be a list for inputs with .sz > 1

* add python ops tests to CI

* skip long runtime tests for PYTHON backend

* respect dtype.sz arg in UOps.CONST, and remove incorrect validation in UOps.STORE

* use math.inf instead of float('inf')

* handle 0 args to UnaryOPs.LOG2

* handle load op with default of .sz > 1

* initialize the loop correctly using UOps.LOOP arg

* remove unnecessary TODO comment

* remove newline

* select a subset of 22 ops tests to skip in CI when PYTHON=1

* handle gated UOps.LOAD referencing values that have .sz > 1

* Revert "select a subset of 22 ops tests to skip in CI when PYTHON=1"

This reverts commit 7674fee81d37f8865cdcc72cc0f06f67cdf59783.

* skip tests in python backend CI command

* push fix lost in conflict resolve

* Revert "skip long runtime tests for PYTHON backend"

This reverts commit 5dd2a0376e653319551c7056742d61a5fd98f60a.

* clear loop state after last iteration
2024-02-15 16:40:25 +01:00
qazal 49cb1fee54
run test_indexing on remu (#3404)
* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57dc5fa70169189fd33b24dbbee99c40, reversing
changes made to 3c4c8c9e16.
2024-02-15 11:52:40 +01:00
qazal 27f4de2ce4
delete half_prekernel (#3388)
* generic rendering of half and bf16

hotfix

* fix uops + regression test

* fix the test for metal's half4

* uop.uop fixup

* mypy with --strict-equality, fix ops_gpu
2024-02-14 15:40:48 +01:00
qazal c8fd66a131
Run RDNA3 tensor core tests in CI (#3367)
* add test_linearizer

* skip test_padto_matmul
2024-02-11 19:54:06 -05:00
Francis Lam ce21fdfb67
ops_python: add HIP tensor core mock and refactor METAL (#3354)
* ops_python: add HIP tensor core mock and refactor METAL

* Add tests to CI

* add DEBUG=2 to full tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-02-09 12:46:06 +01:00
George Hotz b385234961
oops, change to 3.12 (#3357) 2024-02-09 12:21:06 +01:00
George Hotz 7726eef464
ops_python: add image support (#3356)
* ops_python: add image support

* uops tests in their own CI

* fix ci
2024-02-09 12:02:06 +01:00
George Hotz c32ea95d7d
Python uop emulator (#3327)
* start uop emu

* tiny_add passes

* more ops

* emulate the whole warp

* test_gemm passes

* metal gemm test pass

* works on big gemm

* works on big gemm

* more tests pass

* touch ups

* fix mypy

* cleanups

* exp2 mypy

* arch is where it belongs

* actually emulate tensor cores

* fix test

* new style
2024-02-08 19:24:55 +01:00
chenyu d8ad9e5660
verify eval acc for hlb_cifar training (#3344)
set to 93% to reduce flakiness for now
2024-02-07 19:19:59 -05:00
chenyu 0d2dacb549
test intermediate tensors created by function have same device as input (#3338)
run on TORCH since it's the fastest one on CI.
caught a bug in multinomial, and updated the behavior of fancy index and gather to move the indices Tensor to the same device as self (see the sketch after this entry).
2024-02-07 09:24:36 -05:00
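
A hedged sketch of the fancy-index behavior described above; the device names are placeholders and the assertion reflects the expected behavior, not a test taken from the commit:

```
# indices on a different device should be moved to self.device internally
from tinygrad import Tensor

x = Tensor.arange(6, device="CLANG")   # placeholder device names
idx = Tensor([2, 0], device="TORCH")   # deliberately on another device
y = x[idx]                             # fancy index moves idx to x.device
assert y.device == x.device
```
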
chenyu 3a7c1eb383
add winograd hlb_cifar10 back to tinybox benchmark (#3300)
* add winograd hlb_cifar10 back to tinybox benchmark

* LATEWINO

* use wino for the full run to save benchmark time
2024-02-02 04:29:56 -05:00
chenyu 18e854cdbf
shrink MLB on sharded axis (#3255)
* shrink MLB on sharded axis

use a onehot structure to store the real partition. the goal is an unsynced batchnorm2d that can run on multiple GPUs for training (see the sketch after this entry).

draft version in https://github.com/chenyuxyz/tinygrad/pull/109

* SYNCBN flag

* test unclean shrinks

* UnsyncedBatchNorm reuses BatchNorm

* more robust pad arg check

* better types

* more tests!

* 6 gpus in benchmark

* disable slow GPUS=6 benchmark
2024-01-31 21:48:25 -05:00
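
An illustrative sketch in plain numpy (not the MultiLazyBuffer internals) of the onehot idea above: each shard keeps a mask marking the slice of the sharded axis it really owns, so unclean shrinks can be expressed per device and summed back to the whole:

```
import numpy as np

full = np.arange(8)                       # logical data along the sharded axis
bounds = [(0, 3), (3, 8)]                 # an unclean 2-way split
masks = [np.array([lo <= i < hi for i in range(8)]) for lo, hi in bounds]
shards = [np.where(m, full, 0) for m in masks]  # zeros outside the owned slice
assert (sum(shards) == full).all()
```
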
qazal 5b46b0ff3d
Simple RDNA3 emulator (#2974)
* mockhip->hipcpu

* allocate buffers

* launch a kernel

read_asm api

* run remu in CI

* remu 0.0.2, real test ops

* simple driver

* 0.0.3, all test_ops

* run the latest emulator

* 9 minutes is way too long, drop backprop in CI

* bring back the backward pass

* Revert "bring back the backward pass"

This reverts commit 3781e1bc56fc06b424e7c7bed1224f819247fb8f.

* Print slowest tests

* emulated device directly in ops_hip

* fix ruff, override mypy for specific rules

* test in the same code path

- hip backend env variables

- install packages and verify autogen

- run certain tests

- remove the other hip tests path

- verify Device.DEFAULT

* remove the emulated hip in extra

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-01-30 10:39:28 -08:00
chenyu 34c7621556
HIP=1 NOCLANG=1 for tinybox external_model_benchmark (#3270)
used HIP instead of GPU and disabled slow CLANG
2024-01-28 22:05:26 -05:00
George Hotz 0aad8d238b
rebuild ocelot (#3259)
* rebuild

* strip trailing whitespace
2024-01-26 18:46:36 -08:00
George Hotz 03a6bc59c1
move autogen to runtime/autogen (#3254) 2024-01-26 12:44:19 -08:00
George Hotz a3869ffd46
move gpuctypes in tree (#3253)
* move gpuctypes in tree

* fix mypy

* regex exclude

* autogen sh

* mypy exclude

* does that fix it

* fix mypy

* add hip confirm

* verify all autogens

* build clang2py

* opencl headers

* gpu on 22.04
2024-01-26 12:25:03 -08:00
chenyu bc92c4cc32
onnx Einsum, CumSum, DepthToSpace, SpaceToDepth (#3252)
* onnx Einsum, CumSum, DepthToSpace, SpaceToDepth

Einsum inner product and `...` are not supported

* --durations=20
2024-01-26 10:47:53 -05:00
George Hotz aa0d1b6330 hotfix: don't use noqa: E702 that's just dumb 2024-01-24 20:01:00 -08:00
chenyu 2088937206
run full hlb_cifar training in tinybox ci (#3145)
* run full hlb_cifar training in tinybox ci

single gpu ~89 seconds

* time that
2024-01-15 23:59:20 -05:00