Commit Graph

77 Commits

Author · SHA1 · Message · Date
George Hotz 5e6265be6e metal timing, fix speed test 2023-02-17 12:31:54 -08:00
George Hotz 121bd03cbd metal globalcounters 2023-02-17 12:02:54 -08:00
George Hotz 20a03d5017 woah, don't sync torch if it's not torch 2023-02-12 07:48:56 -08:00
George Hotz de71c13934 test speed v torch uses jit 2023-02-12 07:43:17 -08:00
George Hotz b9f02671d3 oops, broke torch speed test 2023-02-10 16:13:53 -06:00
Jacky Lee 5c51ae8dbf
Show where tinygrad is faster in speed test vs torch (#549)
* show where tinygrad is faster

* don't change text color
2023-02-10 14:01:07 -06:00
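
The comparison behind #549 boils down to printing the torch/tinygrad time ratio and coloring it by which side won. A minimal sketch of that idea, assuming termcolor's `colored` and made-up timings; this is an illustration, not the repository's actual speed-test code.

```python
from termcolor import colored

def compare(name, tinygrad_ms, torch_ms):
  ratio = torch_ms / tinygrad_ms                      # >1.0 means tinygrad won
  color = "green" if ratio > 1.0 else "red"
  return f"{name}: tinygrad {tinygrad_ms:.2f}ms vs torch {torch_ms:.2f}ms " + \
         colored(f"{ratio:.2f}x", color)

print(compare("add", 0.8, 1.1))
```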
Jacky Lee 799b3f185a
Refactor getenv into helpers (#508)
* Refactor getenv into helpers

* Remove unused os

* Fix default value

* Fix more defaults for CI

* Fix bracket

* Revert changes to openpilot/compile.py

* Use getenv from helpers when possible
2023-01-31 15:09:09 -08:00
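
#508 centralizes environment-variable reads into a single helper. A sketch of the pattern, assuming the helper casts the value to the type of its default; the real function lives in tinygrad/helpers.py and may differ in detail.

```python
import os

def getenv(key, default=0):
  # Cast to the type of the default: DEBUG=2 comes back as int 2,
  # and a missing variable falls back to the default cleanly.
  return type(default)(os.getenv(key, default))

DEBUG = getenv("DEBUG")          # 0 when unset
CI = getenv("CI", "") != ""      # truthy inside CI
```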
George Hotz 2db272c7f7
Kernel Optimizer (#489)
* kernel optimizer

* 10x faster, but wrong. not good deal

* move test -> extra

* print x speedup

* clcache

* fix clcache + DEBUG

* GFLOPS estimate

* i==3
2023-01-29 17:15:00 -08:00
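
The "GFLOPS estimate" and "print x speedup" items in #489 amount to simple arithmetic over an op count and measured runtimes. A hedged sketch with made-up numbers; the names here are illustrative, not the optimizer's API.

```python
def gflops(op_estimate, runtime_s):
  return op_estimate / runtime_s / 1e9

ops = 2 * 1024**3                       # ~2.1 GFLOP, e.g. a 1024x1024 matmul
baseline_s, optimized_s = 4.2e-3, 1.4e-3
print(f"{gflops(ops, optimized_s):.0f} GFLOPS, {baseline_s/optimized_s:.1f}x speedup")
```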
George Hotz ebdec2b72f fix optimizer 2023-01-29 00:23:06 -08:00
George Hotz 44e96c58b4 touch up pytorch speed tests 2023-01-25 18:11:26 -08:00
calledit a0af1045bf
Some new tests (#440)
* Make test run

* Added new tests: sub pow constant_sub

* Fix indentation

* Added one to many lines

* Fix indentation

* Update test_cl_tiler.py

* Delete test_cl_tiler.py
2023-01-25 15:40:19 -08:00
George Hotz 6d7658db12 delete opencl <celebration> 2023-01-24 14:18:35 -08:00
George Hotz 6fe9edf30f torch cuda is very fast 2023-01-23 16:24:46 -08:00
George Hotz a6de94b444 test partial sum 2023-01-22 21:28:40 -08:00
George Hotz 4885fce56e
shapetracker from newgpu (#456)
* shapetracker from newgpu

* touchup ops

* test

* testst

* thneed deletes unused inputs

* test

* bugfix
2023-01-09 12:40:01 -08:00
cloud11665 4fb97b8de0
don't fail when termcolor is not installed (#436) 2022-11-14 16:45:06 -08:00
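
The idea in #436 is to degrade gracefully rather than crash when the optional termcolor dependency is missing. A minimal sketch of that guard; the fallback's name and signature are assumptions.

```python
try:
  from termcolor import colored
except ImportError:
  def colored(text, color=None, *args, **kwargs):
    return text                        # no-op fallback: plain, uncolored text

print(colored("permute: tinygrad faster", "green"))
```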
George Hotz 5e07d4669d
the speedy chonker is going to replace the old chonker (#432)
* bringing back reshape and permute

* done with E701

* 4x4 works in generic way

* max and sum not vectorizing...

* special case single float

* support comparing to MPS

* improve matmul speed, consider generic principles

* GlobalCounter

* fix op tracking

* faster

* comment that out for now

* err, it needs that

* fix minor issues

* fix global_mem
2022-11-11 18:34:24 -08:00
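
The "GlobalCounter" and "fix global_mem" items in #432 refer to process-wide counters that kernels bump so the speed tests can report totals. A rough sketch assuming class-level counters; the field names are guesses, not the exact tinygrad API.

```python
class GlobalCounters:
  global_ops: int = 0                  # FLOPs issued by all kernels so far
  global_mem: int = 0                  # bytes read/written by all kernels

  @classmethod
  def reset(cls):
    cls.global_ops, cls.global_mem = 0, 0

# a backend would do something like: GlobalCounters.global_ops += op_estimate
```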
George Hotz b8c94a67c9
Simple chonker (#431)
* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (#428)

* ASTKernel class

* Make tinygrad work with older python version (#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* smiple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
2022-11-10 23:17:09 -08:00
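
"Include memory bandwidth" in #431 means reporting GB/s next to GFLOPS: bytes touched divided by runtime. A back-of-the-envelope sketch with illustrative sizes.

```python
N = 512
bytes_moved = 3 * N * N * 4            # read A, read B, write C as float32
runtime_s = 4e-4                       # hypothetical measured kernel time
print(f"{bytes_moved / runtime_s / 1e9:.2f} GB/s")   # ~7.9 GB/s
```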
George Hotz 9781b4c3af rename test functions to helper_ 2022-11-07 21:27:56 -08:00
George Hotz 9884be2ad5 ugh, that too 2022-11-07 21:21:35 -08:00
George Hotz 537a9eb414 fix termcolor import 2022-11-07 21:19:08 -08:00
George Hotz 2cc1d970c6 updates from the chonker branch 2022-11-07 21:12:08 -08:00
George Hotz 544cb0a069 oops, remove while(1) 2022-10-29 14:05:13 -07:00
George Hotz fdb43fe553 gemm is 1.7 TFLOPS on a single M1 core 2022-10-29 13:42:33 -07:00
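
As a sanity check on the 1.7 TFLOPS figure: an N×N matmul does about 2·N³ floating-point operations, so TFLOPS is just that count over the runtime. The numbers below are illustrative only.

```python
N = 1024
flops = 2 * N**3                       # ~2.15e9 ops per matmul
runtime_s = 1.26e-3                    # hypothetical per-matmul time
print(f"{flops / runtime_s / 1e12:.2f} TFLOPS")   # ~1.70
```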
George Hotz f885ceb695 test speed w/o bias 2022-10-28 11:22:15 -07:00
George Hotz 793edf8900 touchup 2022-10-10 16:13:34 -07:00
George Hotz d54a45b50d measure speed vs torch 2022-10-10 16:06:00 -07:00
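
The earliest commit here, "measure speed vs torch", is the seed of the repository's speed test. A self-contained sketch of that kind of measurement, assuming era-appropriate APIs (Tensor.rand, and .numpy() to force evaluation); it is an illustration, not the repository's test.

```python
import time
import torch
from tinygrad.tensor import Tensor

def timeit(fn, reps=10):
  st = time.perf_counter()
  for _ in range(reps): fn()
  return (time.perf_counter() - st) / reps * 1000   # ms per call

ta, tb = torch.rand(1024, 1024), torch.rand(1024, 1024)
a, b = Tensor.rand(1024, 1024), Tensor.rand(1024, 1024)
print(f"torch    add: {timeit(lambda: ta + tb):.3f} ms")
print(f"tinygrad add: {timeit(lambda: (a + b).numpy()):.3f} ms")
```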