* we're typing
* types look good in theory
* most tests pass
* gpu tests pass
* TEST_AST
* delete comments
* I must have written that bug so many times
* bugfix
* don't merge the small ones
* add f to constants
* commits from reduce
* don't GCD the mod nodes
* broken and a hack IMAGE=3
* group for reduce
* fix linter + mypy
* move out test ast
* insource TENSOR_TYPE_TO_NP_TYPE
* does this fix it?
* move imports out
* indexer
* works
* all use indexer
* boolean in the indexer too
* symbolic is a better name than indexer
* better symbolic API (see the sketch after this list)
* min and max
* symbolic tests
* work
* more tests
* fix demodder
* __str__ in the superclass
* NumNode
* awesome that works
* still works
* fix up parens
* fix zeroviews
* dead lines
* expr_node
* works
* still works
* refactor to not use __new__ methods
* ugh, something went wrong a while ago
* this fixes it
* mod and div at the end
* test
* symbolic
* working
* one linter issue fixed
* other division
* more simplifications
* works
* validhacks
* VALIDHACKS passes thneed
* no str replace stuff
* inline indexes
* NATIVE_EXPLOG and factoring
* factor both ways
* cl indexing
* split on mod, not just full
* onnxlimit
* fix output shape
* op_estimate is a function of the program
* no ones in the index
* four_float4
* ALLOW_4FLOAT4
* test passes
* compute then store
* loads first
* bugfix
* better, but doesn't match
* select xb in smart way
* new test and bugfix
* no change to lazy
* Node fixes linter
* fix opencl with op_estimate
* fix mypy
* revert valid
* remove unused
* add image
* load + store + boring stuff
* image tests pass
* thneed print GFLOPS
* op conv test
* more debugging
* hack for multiview image
* shapetracker creates less views
* disable image tests
* working better
* ugh, lkey not key
* print in DEBUG, and allow views
* works
* simple padding conv2d
* use index for image
* that was bad code
* debug print
* fix types
* less lines
* save lines
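
The indexer/symbolic run of commits above replaces string-pasted index math with expression nodes that carry min/max bounds and simplify themselves. A minimal sketch of the idea, with hypothetical names (this is not tinygrad's actual symbolic API):

```python
# A minimal sketch of a symbolic index-expression tree, in the spirit of the
# indexer -> symbolic rename above. Names are hypothetical, not tinygrad's API.
from __future__ import annotations

class Node:
  def __init__(self, expr: str, min_: int, max_: int):
    self.expr, self.min, self.max = expr, min_, max_
  def __str__(self) -> str:   # __str__ in the superclass: nodes render straight into generated code
    return self.expr
  def __add__(self, b: int) -> Node:
    if b == 0: return self    # simplify: x + 0 -> x
    return Node(f"({self.expr}+{b})", self.min + b, self.max + b)
  def __mod__(self, b: int) -> Node:
    if 0 <= self.min and self.max < b: return self  # simplify: x % b -> x when 0 <= x < b
    return Node(f"({self.expr}%{b})", 0, b - 1)

class Variable(Node):         # a named index variable with known bounds
  def __init__(self, name: str, min_: int, max_: int):
    super().__init__(name, min_, max_)

class NumNode(Node):          # a constant node, like the NumNode commit above
  def __init__(self, b: int):
    super().__init__(str(b), b, b)

gid = Variable("gid", 0, 127)
print(gid % 256)              # gid          (mod dropped: the bounds prove it's a no-op)
print((gid + 1) % 4)          # ((gid+1)%4)
```

Because every node knows its min/max, simplifications like dropping a provably no-op mod fall out of the bounds rather than string matching.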
This commit resolves https://github.com/geohot/tinygrad/issues/453.

When the example code in the README.md is run, tinygrad prints the tensors as:

<Tensor <LB (3, 3) op:MovementOps.RESHAPE> with grad None>
<Tensor <LB (1, 3) op:MovementOps.RESHAPE> with grad None>

To be equivalent to the output of the Torch example, we need to call numpy() to get it to show:

[[ 2.  2.  2.]
 [ 0.  0.  0.]
 [-2. -2. -2.]]
[[1. 1. 1.]]
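
For reference, a sketch of the fixed snippet (import path and example assumed from the README of that era):

```python
from tinygrad.tensor import Tensor

x = Tensor.eye(3, requires_grad=True)
y = Tensor([[2.0, 0, -2.0]], requires_grad=True)
z = y.matmul(x).sum()
z.backward()

print(x.grad.numpy())  # .numpy() realizes the lazy buffer instead of printing its repr
print(y.grad.numpy())
```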
* bringing back reshape and permute
* done with E701
* 4x4 works in generic way
* max and sum not vectorizing...
* special case single float
* support comparing to MPS
* improve matmul speed, consider generic principles
* GlobalCounter
* fix op tracking
* faster
* comment that out for now
* err, it needs that
* fix minor issues
* fix global_mem
* chonker will make llvm fast
* work
* better speed tests, we will make them fast (see the GFLOPS sketch at the end)
* with the cache add is the same speed
* relu and neg are fast
* fix sum speed
* maximum maxnum?
* hack for gemm opt
* gemm very slow
* zeros like
* test_permute
* shapetracker returns self
* fix shapetracker factorization
* err, int strides
* permutes are faster now in tinygrad than pytorch
* support -1 in expand
* gemm unrolled
* improve final test case
* WIP GEMM
* why isn't GEMM fast?
* revert cache dim
* ffp-contract works on clang, not llvm?
* ignore llvm ir
* this makes fma work at least, but no faster
* USE_4x4
* 63 GFLOPS
* 87 GFLOPS
* that wasn't matmul, 44 GFLOPS now
* 82 GFLOPS permuted
* this permute too
* a little speed for the convs
* 45 GFLOPS
* speed tests pass again
* clean up prints
* fix FMA WHAT A WASTE OF TIME
* colors
* moar fair
* GPU
* useless on chonker
* cleanups
* improve factorized shapetracker
* better threshold
* label conv
* work
* ops test pass again
* hot load the index
* run the last view, no need to create
* ZeroView needs a repr for the key to work
* fix segfault on out of bounds
* one more test
* start amx, and llvm.initialize_native_asmparser
* amx works
* nice AMX class
* nicer AMX class
* refactor get_idxs
* amx working
* is slower...
* useless flip
* cache
* SZ_X
* AMX_SZ_X/Y work alone
* Contiguous mlop
* test gemm packed
* PREPARE in packed
* use_amx factor
* prefetch isn't faster
* loop
* same 3ms
* 2.24 ms
* allow double on store in TG
* amx reduce is the same speed as non amx reduce
* include memory bandwidth
* clean up shapetracker
* flip returns stride
* prepare for upstream
* Update ops_llvm.py (#426)
* permutes are yellow and green now
* faster conv
* llvm cleanups
* Show optimised IR under debug 4 (#428)
* ASTKernel class
* Make tinygrad work with older python version (#427)
* Make tinygrad work with older python version
* Use partialmethod instead of partial
* simple chonker is chonking
* remove junk from test speed vs torch
* fix linter and types
* AMX is only here now
* add LLVM tests, it's a valid backend now
* oops, run llvm test
* contiguous_op
* fix loadops compare
* dedup reduceops
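
For the GFLOPS numbers quoted above, a minimal measurement sketch, assuming a plain numpy matmul rather than the repo's actual speed test:

```python
# A minimal GFLOPS / memory-bandwidth estimate. An NxN @ NxN matmul does
# 2*N^3 FLOPs (one multiply + one add per inner-loop step).
import time
import numpy as np

N = 1024
a = np.random.randn(N, N).astype(np.float32)
b = np.random.randn(N, N).astype(np.float32)

st = time.monotonic()
c = a @ b
et = time.monotonic() - st

flops = 2 * N**3       # multiply-accumulate count
mem = 3 * N * N * 4    # float32 bytes touched for a, b, and c
print(f"{flops / et * 1e-9:.2f} GFLOPS, {mem / et * 1e-9:.2f} GB/s")
```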
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>