Commit Graph

488 Commits

Author SHA1 Message Date
George Hotz aea55eb196 found failing upcast 2023-01-30 16:12:56 -08:00
George Hotz b67f997864 tests pass w/o float4 2023-01-30 15:40:49 -08:00
George Hotz c6f570a2e6 improve progress bar 2023-01-30 14:50:28 -08:00
George Hotz 7118602c97 goat progress bar 2023-01-30 14:37:26 -08:00
George Hotz cccfea4b25 factor out KOPT code 2023-01-30 13:13:55 -08:00
George Hotz de2c419fd4 make_pair and first attempt at hlb_cifar10 2023-01-30 11:07:23 -08:00
AllentDan 7b6b1f32b1
[Fix] fix typo: test_mnist -> datasets (#492)
* test_mnist -> datasets

* fix mnist_gan
2023-01-29 21:30:47 -08:00
George Hotz 2db272c7f7
Kernel Optimizer (#489)
* kernel optimizer

* 10x faster, but wrong. not good deal

* move test -> extra

* print x speedup

* clcache

* fix clcache + DEBUG

* GFLOPS estimate

* i==3
2023-01-29 17:15:00 -08:00
George Hotz bb0cdc2442 111.51x speedup for reduce 2023-01-29 03:06:00 -08:00
George Hotz 45c0aa6e2d search with SHIFT, REDUCE 2023-01-29 02:42:20 -08:00
George Hotz 87879cf4b6 improve search more 2023-01-29 02:08:57 -08:00
George Hotz f6bbd43cb8 improve search 2023-01-29 01:33:47 -08:00
George Hotz ebdec2b72f fix optimizer 2023-01-29 00:23:06 -08:00
George Hotz a9cabce791 oops, broke mem estimates 2023-01-28 20:21:31 -08:00
George Hotz a500e79bd1 don't OPTWG on OS X, it's way slower 2023-01-28 20:02:33 -08:00
George Hotz b0df4d99a0 os x profiling: this ratio is exact i believe 2023-01-28 19:02:51 -08:00
George Hotz ae810eb558 minor cleanups 2023-01-28 08:59:15 -08:00
George Hotz 6d5e1a8029 GEMM kernel search 2023-01-27 10:08:57 -08:00
Comma Device f08e740957 factor out hand coded opt 2023-01-26 14:54:06 -06:00
George Hotz 5e8a36a18b real op kernel 2023-01-26 09:51:32 -08:00
George Hotz e0600f537a op kernel in kernel search 2023-01-26 09:47:01 -08:00
George Hotz aafc29484a cleanups 2023-01-25 12:37:10 -08:00
George Hotz 919e943867 decent search 2023-01-25 12:20:53 -08:00
George Hotz 7f3da91f8b kernel_search 2023-01-25 12:05:09 -08:00
George Hotz e37424424f first little attempt at search 2023-01-25 11:49:29 -08:00
Comma Device 9e2af0a972 too far with the OPTWG 2023-01-24 13:14:59 -06:00
Comma Device 3590848b93 a little more local workgroup options 2023-01-24 12:50:27 -06:00
Comma Device 4b74752c42 fix hotspots by improving the workgroup optimizer 2023-01-24 12:46:28 -06:00
George Hotz fd760a390a fix incremental time 2023-01-24 10:19:04 -08:00
George Hotz a949de873b
reduce 2.0 (#469)
* reduce 2.0

* works

* hacks

* DEBUG=3 for shapes

* fix types

* 0s weren't being folded

* cleaner

* last_reduce is no longer needed

* comments and cleanup
2023-01-23 15:11:13 -08:00
George Hotz f1196984e6 harmless to intertwine the math and the stores 2023-01-21 09:31:56 -08:00
George Hotz 708215d06b
Typing (#468)
* we typing

* types look good in theory

* most tests pass

* gpu tests pass

* TEST_AST

* delete comments

* i must have written that bug so many times

* bugfix

* don't merge the small ones

* add f to constants

* commits from reduce

* don't GCD the mod nodes

* broken and a hack IMAGE=3

* group for reduce

* fix linter + mypy

* move out test ast

* insource TENSOR_TYPE_TO_NP_TYPE

* does this fix it?

* move imports out
2023-01-21 09:09:22 -08:00
George Hotz 0881d504c1
move shapetracker (#466)
* move shapetracker

* shapetracker test

* move ast

* move a few things

* fix print kernel

* fix test

* symbolic fixups
2023-01-19 09:56:31 -08:00
George Hotz 9245f4650a indexer changes for master 2023-01-18 18:02:02 -08:00
George Hotz 49c6e6d472
Latest attempt to add image (#462)
* add image

* load + store + boring stuff:

* image tests pass

* thneed print GFLOPS

* op conv test

* more debugging

* hack for multiview image

* shapetracker creates less views

* disable image tests

* working better

* ugh, lkey not key

* print in DEBUG, and allow views

* works

* simple padding conv2d

* use index for image

* that was bad code

* debug print

* fix types

* less lines

* save lines
2023-01-12 17:36:30 -08:00
George Hotz 281b0db773 three from image 2023-01-12 12:26:58 -08:00
George Hotz 9ff6c532eb
Prereqs for IMAGE=1 (#461)
* contig

* move ast, debug prog

* add Token

* cleanup reduce

* exec_ast
2023-01-11 20:18:42 -08:00
George Hotz fff1f046b0
Simple version of the new GPU backend (#458)
* newgpu

* more to delete

* hmm, tests pass with constant folding

* fix lint/type

* fix constant folding

* comment and rerun tests

* lazy touchups

* fix graph_batchnorm test

* smaller transformer to fix OOM

* Revert "smaller transformer to fix OOM"

This reverts commit a44ef8edc275a4b3c78ee711ba188e220b7a879f.

* no func cache

* introspect

* touchups

* CLASTKernel

* ugh, it was lru_cache

* codegen

* spacing

* old gpu still in opencl

* typing fix
2023-01-10 19:16:02 -08:00
George Hotz fad7cba590 move batchnorm to Tensor 2023-01-09 18:00:16 -08:00
George Hotz 4885fce56e
shapetracker from newgpu (#456)
* shapetracker from newgpu

* touchup ops

* test

* testst

* thneed deletes unused inputs

* test

* bugfix
2023-01-09 12:40:01 -08:00
George Hotz b8c94a67c9
Simple chonker (#431)
* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (#428)

* ASTKernel class

* Make tinygrad work with older python version (#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* simple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
2022-11-10 23:17:09 -08:00
George Hotz 2cc1d970c6 updates from the chonker branch 2022-11-07 21:12:08 -08:00
George Hotz d878065ece
Gemm (#416)
* gemm

* off by factor of 5

* 50 GFLOPS

* works

* 91 gflops

* working at 50G

* works

* iy

* 150 GFLOPS

* 150 GFLOPS

* N=2048 is still fast

* threading soon

* multithread

* pinning

* throttling is sad

* Align matrices to cacheline width (#361)

Co-authored-by: cloud <Cloud11665@gmail.com>
2022-11-06 10:07:28 -08:00
George Hotz 6a8fb53304
move ops.py into lazy.py (#402)
* move ops.py into lazy.py

* fix graph and linter

* ugh, didn't add
2022-10-25 13:58:03 -07:00
George Hotz 8e22d5ee67 replace networkx with defaultdict 2022-10-20 19:36:43 -07:00
George Hotz 63f9c55156 really dumb bug 2022-10-20 17:07:47 -07:00
George Hotz 1bec4651b3 fix nonstatic weights 2022-10-20 17:04:14 -07:00
George Hotz bb288e6938 safe_numpy and warning for broken matmul 2022-10-20 15:40:22 -07:00
George Hotz 50c95c7d9a add assert to catch issue in attention 2022-10-20 15:13:00 -07:00
George Hotz 26c78ccf7d remove useless buffer 2022-10-20 14:07:28 -07:00
George Hotz a18c1f3178 zero out the inputs 2022-10-20 13:46:52 -07:00
George Hotz ace8db29f8 ReduceSum 2022-10-20 12:48:14 -07:00
George Hotz c400ee0beb
refactoring thneed (#400)
* refactoring thneed

* continue

* minor update

* looks like it's working

* big refactor

* confirm thneed got the right output

* code is there but it's broken

* works now

* always OPTWG, input -> dat

* fix type issue
2022-10-20 12:35:59 -07:00
YassineYousfi ae0f9b17df
openpilot: new models and onnx ops (#401)
* ngrl stuff

* fngrl

* fix typo in compile script

* workflow dispatch

* new models in tests

* dont need to up this threshold

Co-authored-by: HaraldSchafer <harald.the.engineer@gmail.com>
2022-10-20 11:49:19 -07:00
George Hotz ff11c4316b move get_parameters to optim.py 2022-09-25 13:16:58 -04:00
Jacky Lee 2c01a66265
Reshape dataset from fetch_mnist (#390) 2022-09-24 21:16:29 -04:00
George Hotz 271446e3eb
set requires_grad to None (#387)
* set requires_grad to None

* some things need gradients

* hmm, why was get_parameters filtering
2022-09-21 11:16:02 -04:00
YassineYousfi 2f0f91ba3d
support float16 onnx weights (#384) 2022-09-15 09:12:18 -04:00
YassineYousfi 1a7bdc51f8
support more onnx ops (#376)
* broadcast from right to left

* add another broadcasted add test

* more onnx ops

* use float32 range in clip
2022-09-07 15:15:24 -07:00
George Hotz 0516359af8 fix stupid OPENCL=1 OOM 2022-09-06 14:29:23 -07:00
George Hotz 4dadd95e3c fix tests hopefully, more stable diffusion 2022-09-03 10:38:31 -07:00
George Hotz c01a8c5c2d stable diffusion start 2022-09-03 10:08:42 -07:00
George Hotz a3fc64a585 fix batchnorm folding in openpilot compile 2022-08-31 13:04:49 -07:00
George Hotz dc7af8c3ac thneed run float32 2022-08-28 11:03:35 -07:00
George Hotz b132de677d
tinygrad.nn (#367)
* tinygrad.nn

* flake8

* working on pylint

* more pylint

* more pylint

* pylint passes

* networkx

* mypy can't infer that type

* junk
2022-08-18 07:41:00 -07:00
George Hotz f76d41812b prune graph 2022-07-17 15:38:43 -07:00
George Hotz eda6f071b2 default opt level 2 2022-07-17 14:54:40 -07:00
George Hotz 73b0471b25 join expands 2022-07-17 13:42:05 -07:00
George Hotz d04b274cd2 noop removal can replace with reshape 2022-07-16 08:32:42 -07:00
George Hotz 2720ef49ca extra and test and tuple 2022-07-07 10:01:33 -07:00
George Hotz 81b73f97a3
Optimization (#355)
* constant folding into kernels

* that opt worth it?

* fix mypy

* ast one kernel

* save 2 lines in conv kernel

* debug print kernel count

* cl debugging

* early realize inputs

* refactor Device
2022-07-04 08:58:57 -07:00
George Hotz 7276f8d6bf improve constant folding, detach before moving tensor 2022-07-02 15:29:40 -07:00
George Hotz 8cf1aed0f4 don't track_running_stats, parameters must require_grad 2022-07-02 14:38:45 -07:00
George Hotz 49c954b389 comments 2022-06-26 17:20:25 -07:00
George Hotz 83d50e2687 move to extra.onnx 2022-06-21 19:43:44 -07:00
George Hotz 9b27ba650b load new torch files 2022-06-07 10:06:48 -07:00
George Hotz 233c71a7ba support requires_grad 2022-06-06 07:47:31 -07:00
George Hotz d8d19ed468 wikimedia wasn't returning 200 2022-01-15 19:09:29 -08:00
George Hotz e28cdfb0cf clean up resnet 2021-11-30 16:14:54 -05:00
George Hotz 58ed46963e fix broadcastdot 2021-11-29 18:54:57 -05:00
George Hotz dca076dbf1 remove dumb nn ops 2021-11-29 18:05:31 -05:00
George Hotz 30eb3afbe1 add bias term to transformer 2021-11-29 12:45:27 -05:00
George Hotz e2a8961a18 less lines, fix bug 2021-11-17 12:52:17 -08:00
George Hotz ba28761894 move yolo into examples/yolo 2021-10-30 19:46:00 -07:00
George Hotz 63f50cff45 move back again 2021-10-30 16:13:29 -07:00
Evan Mays 285621aeda
Cherry backprop for conv2d (#281)
* quick math: 0 + x = x.

* gradient w.r.t. x using cherry for conv

* gradient w.r.t. w for conv on cherry but doing vector dot products

* small optimization

* [cherry] optimize conv backpass for large channel count

* get rid of numpy einsum
2021-10-30 16:12:19 -07:00
George Hotz 3d646272d6 move back 2021-10-30 16:12:12 -07:00
George Hotz ac8afd24fa refactor accel 2021-10-30 16:10:59 -07:00
Guglielmo Camporese 2b7589db64
Added ResNet-{18, 34, 50, 101, 152} (#271)
* added resnets

* fix minor

* fix minor

* resnet in models

* added resnet test

* added resnet train test

* added linear, conv2d nn tests

* fix minor in extra/training

* resnet in models

* fix minor

* fix tolerance for linear in nn test

* fix eval, this causes cpu and gpu UT failing

* revert transformer test

* fix minor for CPU test

* improved model get_params for sequential layer

* fix minor for params counting

* commented broken ops tests

* improved train for resnet
2021-06-21 09:37:24 -07:00
George Hotz 89798d2f43 some flags 2021-06-19 11:46:31 -07:00
George Hotz d81eae8288 debug cherry crash 2021-06-19 11:41:20 -07:00
George Hotz d3f169b267 move good models to models, add a training step test 2021-06-19 11:24:15 -07:00
George Hotz b48d4bad2e clean up print spam 2021-06-19 10:31:04 -07:00
George Hotz 027535d0b5 microcoded matmul 2021-06-17 21:03:08 -07:00
George Hotz 026e2ae6a7 three registers and a zero command 2021-06-17 17:09:18 -07:00
George Hotz 2e71ae33f6 max op works 2021-06-17 17:01:21 -07:00
George Hotz 9e12c1bbba cherry binop 2021-06-17 16:50:40 -07:00
George Hotz fcdabea880 training mnist with cherry ops 2021-06-17 16:45:35 -07:00
George Hotz 2affd226b3 speed up sum 2021-06-17 16:38:34 -07:00
George Hotz e8eb7d1b7e max op 2021-06-17 16:20:56 -07:00
George Hotz c1d469d440 sum op 2021-06-17 16:19:35 -07:00
George Hotz b1000d866e readme, plus reduce ops 2021-06-16 11:21:06 -07:00
George Hotz ff3fdc58e5 risk -> cherry 2021-06-16 09:59:48 -07:00
George Hotz 2f91c012eb build note 2021-06-15 22:41:41 -07:00
George Hotz 4850d6eb43 update todo 2021-06-15 10:22:39 -07:00
George Hotz 4e1edb3692 have tinygrad log the loads 2021-06-14 18:35:14 -07:00
George Hotz 93f2e9769d little note 2021-06-14 15:49:41 -07:00
George Hotz a89d12d735 wow, way faster 2021-06-10 17:11:39 -07:00
George Hotz 10b1306525 binops 2021-06-10 16:52:37 -07:00
George Hotz 4535d39baa comments and pow 2021-06-10 09:03:40 -07:00
George Hotz 2075fdeb4f
FPGA Based Accelerator for Tinygrad (#258)
* ops_risk

* risk sim

* guessing is for winners

* minor

* better

* matmul with risk

* conv doesn't work

* closer

* conv2d works

* ops_risk

* opt2 works

* opt1 may not be possible

* opt1 is a mulacc

* arty

* attosoc example building on mac

* minor

* riscv assembler

* gucci gang

* we got C code

* not a scam

* hello

* make risk mergeable into master

* unop support
2021-06-07 17:45:09 -07:00
Josh Smith ad756f6112
minor optimizations & cleaning (#257)
* use isinstance, some optimizations & whitespace removal

* revert whitespace changes

* revert more whitespace

* some more cleanup

* revert fstring (not a fan of the {{}})

* fix typo

* fix typo
2021-06-02 09:57:15 -07:00
George Hotz b80cacb416 fix GPU efficientnet example 2021-05-26 17:29:35 -07:00
20kdc 2653d33292
vgg7 (image upscaling) implementation - not the best, but it works (#255)
* vgg7 implementation - not the best, but it works

* VGG7 implementation: Spread nansbane to deter NaNs, maybe improved training experience

* VGG7 implementation: Fix training, for real this time

Results actually attempt to approximate the input

* VGG7 implementation: Sample probability management
2021-05-12 23:48:51 -07:00
George Hotz ac229ea750 remove print 2021-01-02 12:53:30 -08:00
George Hotz 895d142503 start trying to load yolo v5 2021-01-02 12:51:55 -08:00
Marcel Bischoff 42b4761025
transformer >99.98% test accuracy in ~30s (#230)
* transformer

* BS might divide len(Y_test)

* output when accuracy is high

* more readable

* fixed loss in serious_mnist for new API
2021-01-02 07:45:09 -08:00
Liam ebd72ff437
Test split (#231)
* Split tests

Split tests into "Test CPU" and "Test GPU".

Add test flag "TEST_DEVICES" which is a comma separated list of devices:
CPU,GPU,ANE

* Run tests based on provided TEST_DEVICES flag

By default will run all "CPU,GPU,ANE"

* fix bad quote

* Revert changes and use GPU=1

This is done through setting the default Tensor Device to Device.CPU if
GPU=1 is set.

Run GPU tests: GPU=1 pytest -s -v
2021-01-01 09:19:03 -05:00
George Hotz f9170505b3 if you like your transformers twice as slow, use the GPU 2020-12-29 17:14:23 -05:00
George Hotz 3f8e137b6f extra/transformer 2020-12-29 14:14:00 -05:00
Marcel Bischoff dc8fa7999c
Transpose on GPU (#221)
* 2serious

* load/save

* fixing GPU

* added DEBUG

* needs BatchNorm or doesn't learn anything

* old file not needed

* added conv biases

* added extra/training.py and checkpoint

* assert in test only

* save

* padding

* num_classes

* checkpoint

* checkpoints for padding

* training was broken

* merge

* rotation augmentation

* more aug

* needs testing

* streamline augment, augment is fast thus bicubic

* tidying up

* transformer eval

* axis=-1

* transpose

* test for permutation using torch.movedims

* another test

* line
2020-12-29 10:40:11 -05:00
George Hotz bcb3ceeca3 set training in functions 2020-12-28 22:45:46 -05:00
Marcel Bischoff ffff98db78
Evaluation in Transformers (#218)
* 2serious

* load/save

* fixing GPU

* added DEBUG

* needs BatchNorm or doesn't learn anything

* old file not needed

* added conv biases

* added extra/training.py and checkpoint

* assert in test only

* save

* padding

* num_classes

* checkpoint

* checkpoints for padding

* training was broken

* merge

* rotation augmentation

* more aug

* needs testing

* streamline augment, augment is fast thus bicubic

* tidying up

* transformer eval
2020-12-28 09:24:51 -05:00
George Hotz d864e1c71a transformer is training 2020-12-27 18:46:32 -05:00
George Hotz a361ef6861 fixup training loop 2020-12-27 18:35:56 -05:00
Nicklas Boman 06f359baa3
issue-193 - Move torch loader out of efficientnet code (#213) 2020-12-22 00:19:16 -05:00
iainwo 56d44637f3
fixed pylint, formatted python files iwth cblack on localhost (#204)
* fixed pylint, formatted python files iwth cblack on localhost

* Revert "fixed pylint, formatted python files iwth cblack on localhost"

This reverts commit 07e2b88466fa53399ad78d962ffb2ad55bc45344.

* dedented 4-spaces added linter

Co-authored-by: Iain Wong <iainwong@outlook.com>
2020-12-17 14:37:31 -08:00
Liam bcf1518309
All devices are equal! (#196)
* Update all devices to be tested

ANE, CPU and OCL all now support all tests.

However tests are not currently passing on GPU and I cannot test on CPU.

Failing GPU tests are not an issue caused by this update. Tests have not
been passing due to a missing required "six" installation.

OpenCL Tests have not been run since commit: 1a1c63a08b

devices have 3 types and are handled by a new DeviceTypes enum. (The goal
is to revert to Tensor.<type>, but this current setup allows for keyword
argument defaults: `device=DeviceType.CPU`)

All references to Tensor.GPU/CPU/ANE have been converted to the
corresponding `DeviceTypes` enum.

Refactor of the conversion code to allow for any device to any device
conversion.

* Add six dependency in requirements.txt

* Resolve failure to run tests

Move six into gpu required installs. Remove six from standard
installation.

* Remove repeated data conversion

* Refactor method names

Also reduce code with .to and .to_

* Dynamic device handlers

* Refactor DeviceTypes -> Device

* Add mem copy profiling back

* test_backward_pass_diamond_model passing

* Resolve Sum issue on GPU

* Revert batchnorm2d tests

* Update README with updated API

* ANE testing with

* Last minute line gains
2020-12-15 23:44:08 -08:00
Marcel Bischoff da72a0eed4
Big MNIST model with PIL augmentation and load/save (#160)
* 2serious

* load/save

* fixing GPU

* added DEBUG

* needs BatchNorm or doesn't learn anything

* old file not needed

* added conv biases

* added extra/training.py and checkpoint

* assert in test only

* save

* padding

* num_classes

* checkpoint

* checkpoints for padding

* training was broken

* merge

* rotation augmentation

* more aug

* needs testing

* streamline augment, augment is fast thus bicubic

* tidying up
2020-12-13 20:45:55 -08:00
George Hotz 07ece2105e actually move it 2020-12-12 15:26:58 -08:00
George Hotz 1d10559d1d tinygrad.utils -> extra.utils 2020-12-12 15:26:07 -08:00
George Hotz 00312b8ad1 batchnorm work 2020-12-06 14:40:07 -08:00
George Hotz da514c2918 fix enet init 2020-12-06 13:52:07 -08:00
George Hotz 521098cc2f se optional, track time better 2020-12-06 12:29:42 -08:00
George Hotz 609d11e699 trainer works with CIFAR 2020-12-06 12:20:14 -08:00
George Hotz 03994e0011 load torch files without torch 2020-11-21 13:43:53 -08:00
George Hotz 2ffb8de1ea move efficientnet to extra 2020-11-16 08:08:07 -08:00
George Hotz 13d34373d1 move gradcheck to extra, clean up unbroadcast 2020-11-16 08:03:31 -08:00