* rebalance
* balance
* parallel apt-get for all
* .local/lib/python3.11/site-packages
* what is user doing
* is that path right
* Update test.yml
* okay where are you
* site-packages
* image support weird loads
* umm, that was always wrong
* openpilot compile fails with a weird error
* image test passes
* we have valids now
* clean that up
* no more required opts
* add fastvits test, fix bug
* minor cleanups
* hip amd compilation
* gate the test properly
* cleanup unused import
* remove superfluous numpy conversion
* add SpeedyNet tests (f32 [passes] & f16 [fails])
* make CI verbose (error log from hip compiler)
* test the real ops_hip
* Merge branch 'tinygrad:master' into ci/hip-compilation
* fix CI
* cleanup
* really fix CI
* Fix CI Three: the refixening
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* i love the walrus so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
* need env.
* ugh, wrong path
* replace get_child
* force rebuild of ocelot
* SzymonOzog gpuocelot
* delete that
* downgrade that
* non parallel
* force rebuild
* use llvm
* nauto
* less mem maybe
* print test
* helper_test_exception skip CUDACPU
* helper_test_exception
* shippable
* very close
* remove comment
* negative strides working
* almost everything passes
* calculate offset with list comprehension
* some cleanup
* got disk load working
* review suggestions
* fix after merge
* overlap working
* did it
* clean
* fixed disk load
* lint
* mypy
* removed as_strided
* trying without simplify
* added back simplify
* make sure expanding to smaller shape
* cleanup
* removed comment
* removed env file
* trying whisper test again
* onnx test sqlite issue
* working on test
* finished test
* eliminate unnecessary shrink-then-pad
* don't shrink buffer
* added strides check
* added to ci under linters
* switch issue
* allow symbolic stride
* removed .env
* isinstance
* adjust strides for double expand
* cleanup
* needed to add type hint for mypy
* set pythonpath
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* wmma: refactor tensor cores using existing local dims
* optimizer: fix bad rebase and break after one late local
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add mops to graph, refactor IMAGE
* no reshape pushing
* add todo
* fix openpilot model alt
* push reshapes reduces kernels in new op
* IMAGE=2 is a first class citizen now
* Revert "disable flaky triton test"
This reverts commit 1e15fdaee7.
* Update test.yml
* check if has shared for matvec
* disable ocelot cache for triton
* disable ocelot cache
* disable ocelot cache
* pass shared to triton uops tests
* temporary debugs for CI crash
* Revert "temporary debugs for CI crash"
This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542.
* Revert "triton isn't tested, and allows this refactor (#2007)"
This reverts commit dea8bb0938.
* add runtime_args to every renderer, move triton local size override to runtime args
* Add binary to args, correct type returned
* update to new loops
* Update test.yml
* some cleanup
* move continue back
* more more more
* added to CI
* try
* try intentionally break some tests
* wtf
* del True for test
* yay tests broke, now pls no break
* try AGAIN
* gahy
* lol
* try
* move over constant
* moved over MORE
* move shrink over
* trailing lines
* try CUDA CI
* try again
* boom
* oops
* improved comments
* try: disable some flags and disable CUDA
* try breaking tests
* traceback has too much info so add --tb=no
* revert forced CI failure
* add comments and del unused imports
* oooooooo using regular debug try enable tb
* intentionally break tests
* added tb back. Maybe not too verbose
* strip whitespace
* missed something
* Shape op int32 -> int64
* oops missed something
* add some types
* get rid of crazy 1 liners in pad op
* actually test Split this time LOL
* strip that whitespace
* limit metal buffers
* look at the base, not the srcs
* Revert "Revert "openpilot kernel fix from 209 to 207 (#2006)" (#2065)"
This reverts commit 924ecc4d6a.
* add a test for that
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* Fix openpilot kernel from 209 to 206
1. Use push_movement_ops conditions in _movement_op. Don't push PAD, or check that the ops are safe to be pushed with PAD.
2. Don't push if all the op.buffers are realized.
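The gating described above might look like the following minimal sketch. Only push_movement_ops, _movement_op, PAD, and op.buffers come from the commit message; the function shape and the attribute checks are assumptions, not tinygrad's actual implementation.
```python
# Hypothetical sketch of the two push conditions above.
def should_push_movement_op(mop_name, buffers):
    # 1. PAD is not safe to push through arbitrary ops, so skip it
    #    (or run an explicit safety check for PAD-compatible ops).
    if mop_name == "PAD":
        return False
    # 2. If every source buffer is already realized, pushing would just
    #    redo work that is materialized, so leave the op where it is.
    if all(getattr(b, "realized", None) is not None for b in buffers):
        return False
    return True
```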
* change ALLOWED_KERNEL_COUNT to 206 for openpilot
* don't push through sourceless buffers
* change the tests to adjust kernel counts for new behaviour
* restore pushing of movement ops through childless buffer
* don't push EXPAND, causes OOM
* allow push of intermediate movement ops
* adding new test behaviour
* modifying external_test_opt for new behaviour
* restore old tests
* Reenable push of EXPAND and introduce new tests
I was wrong initially thinking EXPAND can cause OOM and hence I had disabled it. Since it is 0-stride and doesn't allocate memory, it's cool
* Don't push EXPAND above LoadOps LB. This is causing OOM
* Push should be decided on movement root of bufs
To check if ast.op.buffers is sourceless/realized, go to the movement root and then decide whether pushing should be done or not
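A hedged sketch of that movement-root decision: .base is the commit's own term for the buffer under a chain of movement ops, while the realized/op attributes and the loop are illustrative assumptions rather than the real lazy-buffer API.
```python
# Walk views down to the buffer that actually owns the data.
def movement_root(buf):
    while buf.base is not buf:  # .base skips pure movement ops
        buf = buf.base
    return buf

# Decide the push at the root, not at the view stacked on top of it.
def push_allowed(buffers):
    roots = [movement_root(b) for b in buffers]
    # Sourceless (op is None) or already-realized roots block the push.
    return not all(r.op is None or r.realized is not None for r in roots)
```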
* refactor for readability
* use .base instead
* don't push expand, bad memory/compute consumption
* restrict push of reshape, seeing improvement
* push reshape if unary without further check
* disable PAD solves convnext kernel count increase
* reenable test_cache_binaryop_transpose
* small nit
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* Move ops_triton to runtime and remove errors from deprecated code
* Remove deprecated AST Kernel
* Remove deprecated buffer
* Add TritonProgram
* Triton Buffer
* Use RawCUDABuffer
* triton_compile
* Added new parameter
* pass _buf to program
* remove deprecated include
* Added triton tests
* Deprecated includes removed
* remove double print
* Disable float4 support
* Disable float4 support
* variable load fix
* Track local size
* Add pycuda to triton dependencies
* Merge test.yml
* install cuda packages for testing
* merge double package install
* remove emulated from triton tests
* upscale local index to power of 2 and add masking
* cuda envs
* Add TernaryOps
* ConstOp loading
* proper function name
* remove deprecated variables
* get global program from name
* const ops match local shape
* Enable test_nn
* remove deprecated import
* fix linter error
* Add wait logic
* Add local size override
* accumulate local shapes instead of using max shape
* Merge triton tests into global tests
* fix envs in testing
* Old testing routine
* split file into renderer and program
* remove print and starting whitespace
* pretty ptx print on debug 5
* linter errors
* ignore triton saturation tests
* ignore test example
* remove pytorch cpu extra index
* Add triton to existing testing routine
* use triton tests
* disable cuda backend in triton tests
* use cudacpu in tests
* print used device
* Print device default
* Remove print
* ensure we are running triton backend
* update variable signatures
* update dtypes for load
* infinity render fixed
* limit global size
* negative infinity now properly rendered
* split chain with parentheses for and node
* Add option to disable shared memory, disable for triton
* missing import
* Properly index and mask conditional load
* use mask only if not loading a block pointer
* nan support
* fix symbolic tests to include chain split
* proper masking for stores
* Implemented bool dtype
* Add mod
* fix loads for variables with valid range
* merge triton with cuda runtime
* merge from master
* run triton tests with cuda
* Correct target when running from triton
* conftest with triton compiler config
* use triton nightly
* verbose tests for triton
* capture stdout
* fix function depth when exiting multiple loops
* add render valid function for readability
* fix mask for local loops
* add _arg_int32 datatype
* fix dims for conditional loads
* enable non float stores
* correct variable dtypes
* fix type for arg_int32
* remove junk
* Added get max function for range based var.max
* remove deprecated code
* Fix triton ptxas path
* Fix testing for CI
* clamp local size by max local size instead of always running max
* Disable matmul test in triton cpu
* rerun tests
* Disable broken test in triton cpu
* whitespace removed
* rerun tests again
* Disable TestSymbolicOps for triton
* update to new uops
* linter fix
* ignore test/extra
* linting fix
* Update tinygrad/renderer/triton.py
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* remove deprecated line
* quotes type fix
* linter
* Remove unnecessary lines
* UnaryOps.NEG
* don't define constants
* Linting fix
* Disable tests that are broken in ocelot
* remove trailing whitespace
* reduce line count
* linting fix
* update to new uast
* New looping style
* Update to new uast
* make AST runner work with triton
* linting fix
* set renderer var for testing
* disable local for ocelot
* reenable all tests for ocelot
* Pass shared to cuda
* Don't group if the backend doesn't support shared mem
* use working gpuocelot branch
* enable all tests
* enable local for ocelot
* cleanup
* Update test.yml
* update cache key
* reenable test symbolic and extra
* Update test.yml
* Revert "Update test.yml" (rerun tests)
This reverts commit 98c0630ee5da4379e5c6b2437a5145fe87058c35.
* Revert "fix symbolic tests to include chain split"
This reverts commit 22a9a4c9cd14d23735e6540c8d90ee005ac4ea17.
* Revert "split chain with parentheses for and node"
This reverts commit 7499a7004ef4db785d0cd05cf292fdeff65ca90d.
* use global size from linearizer
* rename newvar to dtype to match other renderers
* join program start lines
* simplify code that adds axis to local dims
* assign r[u] in ssa
* We no longer need to replace target in src
* we no longer need to cast indices to int by hand
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* valid hacks
* valid hacks
* valid hacks
* new method
* new method
* handtune
* is gate load breaking?
* lint
ruff
less junk
new approach?
maybe this?
* Make it more clear
* Make it more clear
* Will deal with the linter later
* hack for linter
* subs the idx but dont touch the valid
* Updated the mod rules
* lint hack
* I believe bug fix lets see
* Mod Node left
* revert
* Maybe this wont break?
* revert
* implemented "handtuned garbage"
* revert and use VALIDHACKS
* Lets see the CI
* still broken?
* currently it's a jungle
* maybe this jungle?
* This works for everything somehow
* Added test for symbolic
* lint
* final touch
* This still works
* lint
* midway clean
* less garbage
* lint
* final form
* Slow but working way
* lint and other stuff
* lint
* mypy
* Make sure CI test Openpilot valid checks
* test if CI break
* Convert back
* refactor
* refactor
* Managed to reduce openpilot time from 30 secs to 5 secs
* Refactor
* Substitute a node with variable
* flake8
* Comment and refactor
* More comprehensive mod
* refactor
* bug fix
* More shave off
* remove not sure part
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* move assembly, assembly_ptx
* successful but broken rendering of ptx asm
* clear ins before render asm
* slightly less broken :')
* we needed thread syncs
* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half
* Fix runtime_args for gpuocelot
* our casts were flipped on both ends
* more casting
* add ternary where op
* dealing with storing/loading bool
* add test for casting to bool from negative
* Fix args.valid on ConstOp
* add to CI, TODO: fix runtime_args for test_uops
* fix placement of runtime_args to work with lazy.Device
* undo ci changes so I can push
* fix lints
* start cleanup and fix things we broke fixing lints
* add checks for PTX-specific asm instructions
* revert added test -- doesn't pass on llvm
* skip tests for underflow,overflow
* another fix for how we're setting runtime args
* Less broken cleanup
* add to CI
* add more env variables for ci test
* fix ci to install pycuda for ptx
* ci: copy cuda test command
* cleanup
* assert to make sure we're actually running ptx in ci
* remove test assert
* move is_ptx arg
* move assembly, assembly_ptx back to extras
* fix imports
* initial merge fixes
* clear registers, fix UOps.LOAD with invalid value
* draft merge fixes
* remove prints
* quick lint and merge fixes
* cleanup
* remove PTXProgram wrapper
* final cleanup
* temp change for ci rerun
* ci rerun
* rollback ISA version
* testing new memops
* better debugging
* testing padded conv
* branching with load
* refactoring a bit
* first try
* fixing bugs
* fixing some
* eq
* eq2
* do not use x's
* working
* fixing imm
* getting things working
* refactor
* pow not working
* working except one
* refactor: one store mem
* refactor: global load
* refactor: imm
* refactor: cleaning
* fixing big offsets
* refactor with ci
* try ci
* typo
* another typo
* ubuntu default
* forgot git
* do i need git?
* missing packages
* adding python-dev
* with cache?
* buildx action
* buildx name issue?
* maybe now?
* python3
* newline warning
* maybe now
* i actually need this
* ci should work now
* improved caching
* fixing cache
* maybe now it will cache
* this
* testing cache
* trying again
* load
* missing platform
* caching gha
* testing cache
* full testing
* typo
* now?
* why
* adding checkout back
* bad formatting
* fixing convention issues
* supporting python
* adding CI flag
* testing all
* better comments
* adding debugging
* takes 12x longer
* does it output progress now?
* ignore models for speed
* fixing merge
* excluding conv_transpose2d
* only 2 tests cuz it's too slow
* another approach
* let's see
* faster duh
* my bad
* T_T
* typo
* sup
* with output?
* comment test
* comment test
* comment test
* :?
* no comment
* with cache
* back to normal
* testing that ci works
* back to passing
* trying again
* does it create another entry
* does it create another entry?
* build local
* hey
* Revert "excluding conv_transpose2d"
This reverts commit cc7348de03033e032f47d69caff174e2f1a7bfea.
* does it cache if done before?
* does it cache?
* done
* adding test ops
* bad formatting
* no need for this
* working static mem
* sum 1d
* add ndim
* better reg import
* fix stack
* back to np
* working except for softmax
* 5 failing
* no progress
* remove keystone
* remove keystone
* testops passing
* cleanups
* more cleanup
* typo
* ci
* ci2
* cond import
* ci3
* ci4
* ci4
* ci5
* ci5
* ci6
* alignment
* test all
* correct test
* err read_unmapped
* passing test
* ignore for speed
* ignore for speed
* ci7
* cleanup
* remove docker
* fixing merge
* fixing bugs
* add skipload for const ops
* comments
* First merge to master: Renderer
* fix emulation
* passing all tests arm64
* cleaning
* fix handcoded binary
* cleaning
* fix errs
* fix runtime arg binary
* clean git diff
* fix and clean
* fixing metal test
* cleaning
* fix metal test
* ci ~8 min
* fix pylint and clang
* cache the files in ops_clang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* feat: allreduce
* feat: test
* feat: need contiguous
* feat: test in ci
* feat: exit with correct code
* feat: don't need that
* feat: opencl wait_for just doesn't work
* feat: synchronize on out
* feat: try?
* feat: try again?
* feat: add extra realizes
* feat: print
* feat: seed
* feat: tol
* feat: test ones and zeros
* feat: remove print
* feat: are you just flaky
* feat: separate scatter and gather?
* feat: just try synchronizing
* feat: remove print again
* feat: bring back difference
* feat: no sync
* feat: revert that
* feat: back to wait_for
* fix: typo
* add stable diffusion and llama
* pretty in CI
* was CI not true
* that
* CI=true, wtf
* pythonpath
* debug=1
* oops, wrong place
* uops test broken for wgpu
* wgpu tests flaky
* flake8: Ignore frequent violations, correct infrequent ones
* Ignore some rules in test
* Reorder test ignores
* Lint test + main
* EOF indent
* Include all E71,E72 errors
* Test the failing case in CI
* Revert "Test the failing case in CI"
This reverts commit 110add0a70f5a619d07631269104e84f908af6b9.
* Push to test!
This reverts commit f317532779a0e1ac8401e2474fd5c6c8695c08e9.
* ok back to passing
This reverts commit ba5052685f93f83e06152cdc696b9e26131d8ab7.
* Prove that CI fails when formatting is incorrect.
* Fix formatting
* Remove duplicate E117 rule
* Use flake8 config for precommit
---------
Co-authored-by: waifairer <waifairer@gmail.com>
* models matrix
* fix typo and install gpu deps
* install llvm deps if needed
* fix
* testops with cuda
* remove pip cache since not work
* cuda env
* install cuda deps
* maybe it will work now
* i can't read
* all tests in matrix
* trim down more
* opencl stuff in matrix
* opencl pip cache
* test split
* change cuda test exclusion
* test
* fix cuda maybe
* add models
* add more n=auto
* third thing
* fix bug
* cache pip more
* change name
* update tests
* try again cause why not
* balance
* try again...
* try apt cache for cuda
* try on gpu:
* try cuda again
* update packages step
* replace libz-dev with zlib1g-dev
* only cache cuda
* why error
* fix gpuocelot bug
* apt cache err
* apt cache too slow?
* opt and image in single runner
* add a couple n=autos
* remove test matrix
* try cuda apt cache again
* libz-dev -> zlib1g-dev
* remove -s since not supported by xdist
* the cache takes too long and doesn't work
* combine webgpu and metal tests
* combine imagenet to c and cpu tests
* torch tests with linters
* torch back by itself
* small windows clang test with torch tests
* fix a goofy windows bug
* im dumb
* bro
* clang with linters
* fix pylint error
* linter not work on windows
* try with clang again
* clang and imagenet?
* install deps
* fix
* fix quote
* clang by itself (windows too slow)
* env vars for imagenet
* cache pip for metal and webgpu tests
* try torch with metal and webgpu
* doesn't work, too long
* remove -v
* try -n=logical
* don't use logical
* revert accidental thing
* remove some prints unless CI
* fix print unless CI
* ignore speed tests for slow tests
* clang windows in matrix (ubuntu being tested in imagenet->c test)
* try manual pip cache
* fix windows pip cache path
* all manual pip cache
* fix pip cache dir for macos
* print_ci function in helpers
* CI as variable, no print_ci
* missed one
* cuda tests with docker image
* remove setup-python action for cuda
* python->python3?
* remove -s -v
* try fix pip cache
* maybe fix
* try to fix pip cache
* is this the path?
* maybe cache pip
* try again
* create wheels dir
* ?
* cuda pip deps in dockerfile
* disable pip cache for clang
* image from ghcr instead of docker hub
* why is clang like this
* fast deps
* try use different caches
* remove the fast thing
* try with lighter image
* remove setup python for cuda
* small docker and cuda fast deps
* ignore a few more tests
* cool docker thing (maybe)
* oops
* quotes
* fix docker command
* fix bug
* ignore train efficientnet test
* remove dockerfile (docker stuff takes too long)
* remove docker stuff and normal cuda
* oops
* ignore the tests for cuda
* does this work
* ignore test_train on slow backends
* add space
* llvm ignore same tests as cuda
* nvm
* ignore lr scheduler tests
* get some stats
* fix ignore bug
* remove extra '
* remove and
* ignore test for llvm
* change ignored tests and duration on all backends
* fix
* and -> or
* ignore some more cuda tests
* finally?
* does this fix it
* remove durations=0
* add some more tests to llvm
* make last pytest more readable
* fix
* don't train efficientnet on cpu
* try w/out pip cache
* pip cache seems to be generally better
* pytest file markers
* try apt fast for cuda
* use quick install for apt-fast
* apt-fast not worth
* apt-get to apt
* fix typo
* suppress warnings
* register markers
* disable debug on fuzz tests
* change marker names
* apt update and apt install in one command
* update marker names in test.yml
* webgpu pytest marker
* Fuzz test symbolic and shapetracker
This reverts commit d5773ddebff54c1ff608838076f0b4ff126b8aa8.
* mess again
* no tail
* test shapetracker too
* Revert mess and enable all tests
* removed leftover
* initial commit
* 81 passing
* 105 passing tests
* 148 passing
* CI tests
* install dep on ci
* try opencl pkgs
* try using vulkan
* down to only 6 failing
* refactor
* cleaning up
* another test skipped due to buffer limit
* linter
* segfault
* indent fix
* another segfault found
* small touchups
* Fix max and maxpool tests
* Add constant folding
* Add javascript export script
* better asserts in codegen
* manual upcasting
* reverted token type change
* skip safetensor test due to unsupported type
* Fix efficientnet and all other model tests
* Remove np copy
* fixed indent and missing import
* manually destroy the buffer
* revert back to length
* linter errors
* removed extra val
* skip broken tests
* skipping more tests
* Make the page pretty
* Save model weights as safetensor
* Fix imagenet to c test
* Fix second imagenet to c bug
* Async and parallel kernel compilation
* workgroup support
* reversed local size
* fixed non local bug
* correct local groups
* ci experiment
* removed typo
* Fix define local by using shared memory
* Refactor
* try running on mac
* match metal tests
* add more workers
* scope down tests
* trying windows runner
* fixed windows env
* see how many it can do
* merged master
* refactor
* missed refactor
* increase test suite coverage
* missing import
* whitespace in test_efficientnet.py
* getting there
* fixed reset
* fixed bufs
* switched to cstyle
* cleanup
* min/max rename
* one more linter issue
* fixed demo
* linter
* testing ci chrome
* add unsafe webgpu arg
* add build step
* remove WEBGPU from cmd line
* use module
* try forcing directx
* trying forced metal backend
* temp disable conv2d for CI
* disable conv_transpose2d
---------
Co-authored-by: 0x4d - Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* new upcast works
* float4 try
* fix unaligned float4
* disallow unaligned access
* upcast dim
* maybe good now
* fix gpu half
* vstore_half4
* fix deep image bugs
* improve symbolic to fix issues
* fix symbolic
* cl test
* this maybe
* gcd of 1 is 1
* real fix for old python
* improve fuzzer
* test speed llama
* oops, put it back
* uses the real device codegen
* just do it on the mac
* pp
* is faster?
* Revert "is faster?"
This reverts commit 42db542010906dd62376c0e419416978d03d3d62.
* disable docker again for less load on CI
* global -> group
* allow None for local_size in custom function
* lil local
* comment on shape
* fix cuda
* smart local cast
* better local heuristic
* fix ptx, and work_dim cleanup
* fix metal
* fix ops test
* fix openpilot jit
* no more optlocal
* might fix metal tests
* try metal now
* see generated metal code
* test free removal. REVERT THIS
* mergable
* matrix strategy
* push env to GITHUB_ENV
* use printf instead of echo
* use temp helper function for cross os paths
* use path join
* switched to using temp helper function
* skip test on windows due to memory limit
* small fix
* removed semi
* touchups
* clean up
* separate tests
* test changes to test_utils on windows
* small refactor
* more cleanups
* undo helpers change
* only skip if in CI and WINDOWS
* e2e testing
* min failure
* no affine on bn, still fails
* why did i think i could detach that?
* allow more kernels for bn
* some test issue i don't understand
* fix binop, other tests failure
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in it's proper place
* add dtype class
* dtypes
* buffers are lazy
* dtype is tracked by lazybuffer and GenericShape
* fix types in llvm
* llvm store
* dtype tests
* fix tests maybe
* fix flop counter
* fix CI
* CI fix and check format
* fix dtype and dtype check
* fix custom test
* fix test graph
* behavior is correct without VALIDHACKS
* simple div and mod
* fix tests
* no negative variables
* alt form is correct
* still correct
* bug in mulnode
* at least validhacks works now
* cleanups
* test validhacks, and to_image_idx
* cache compare key
* tests and __neg__
* conv2d is an hlop
* shorter conv
* KOPT=-1
* alt imp
* MULACC
* smarter mulacc
* pop conv
* 7x7 -> 5x5
* didn't fix, that's not going to work
* this is faster and matches old behavior
* oh, non lazy just won't work with mulacc
* mulacc in torch
* bool types were creeping in
* optimizer is actually better with hlop conv
* fix pushing permutes issue
* refactor einsum_mulacc
* fix up readme
* update readme
* _image_conv2d
* fix bias addition location
* pushing permutes gets back to 200 kernels
* conv cleanup
* disable hlop conv
* don't hide that in helpers
* start clang backend
* mostly working
* no group for reduce w clang
* it compiles
* compiles
* a11y
* minor fixups
* formatting
* add a test
* rename test
* add image
* load + store + boring stuff:
* image tests pass
* thneed print GFLOPS
* op conv test
* more debugging
* hack for multiview image
* shapetracker creates less views
* disable image tests
* working better
* ugh, lkey not key
* print in DEBUG, and allow views
* works
* simple padding conv2d
* use index for image
* that was bad code
* debug print
* fix types
* less lines
* save lines
* bringing back reshape and permute
* done with E701
* 4x4 works in generic way
* max and sum not vectorizing...
* special case single float
* support comparing to MPS
* improve matmul speed, consider generic principles
* GlobalCounter
* fix op tracking
* faster
* comment that out for now
* err, it needs that
* fix minor issues
* fix global_mem
* chonker will make llvm fast
* work
* better speed tests, we will make them fast
* with the cache add is the same speed
* relu and neg are fast
* fix sum speed
* maximum maxnum?
* hack for gemm opt
* gemm very slow
* zeros like
* test_permute
* shapetracker returns self
* fix shapetracker factorization
* err, int strides
* permutes are faster now in tinygrad than pytorch
* support -1 in expand
* gemm unrolled
* improve final test case
* WIP GEMM
* why isn't GEMM fast?
* revert cache dim
* ffp contract works on clang, not llvm?
* ignore llvm ir
* this makes fma work at least, but no faster
* USE_4x4
* 63 GFLOPS
* 87 GFLOPS
* that wasn't matmul, 44 GFLOPS now
* 82 GFLOPS permuted
* this permute too
* a little speed for the convs
* 45 GFLOPS
* speed tests pass again
* clean up prints
* fix FMA WHAT A WASTE OF TIME
* colors
* moar fair
* GPU
* useless on chonker
* cleanups
* improve factorized shapetracker
* better threshold
* label conv
* work
* ops test pass again
* hot load the index
* run the last view, no need to create
* ZeroView needs a repr for the key to work
* fix segfault on out of bounds
* one more test
* start amx, and llvm.initialize_native_asmparser
* amx works
* nice AMX class
* nicer AMX class
* refactor get_idxs
* amx working
* is slower...
* useless flip
* cache
* SZ_X
* AMX_SZ_X/Y work alone
* Contiguous mlop
* test gemm packed
* PREPARE in packed
* use_amx factor
* prefetch isn't faster
* loop
* same 3ms
* 2.24 ms
* allow double on store in TG
* amx reduce is the same speed as non amx reduce
* include memory bandwidth
* clean up shapetracker
* flip returns stride
* prepare for upstream
* Update ops_llvm.py (#426)
* permutes are yellow and green now
* faster conv
* llvm cleanups
* Show optimised IR under debug 4 (#428)
* ASTKernel class
* Make tinygrad work with older python version (#427)
* Make tinygrad work with older python version
* Use partialmethod instead of partial
* simple chonker is chonking
* remove junk from test speed vs torch
* fix linker and types
* AMX is only here now
* add LLVM tests, it's a valid backend now
* oops, run llvm test
* contiguous_op
* fix loadops compare
* dedup reduceops
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
* working exec ast
* exec_ast is staticmethod
* GenericExecAST
* fold that sometimes
* ExplicitExecAST
* exec_ast for GPU
* gpu working
* get_lazyop_shape
* now gpubuffer is ExplicitExecAST
* dedup
* add a type
* RESHAPE in opencl code
* fix linter
* that too for linter
* cleanups
* remove dead code
* GenericShape is less lines
* add ALLOWED_KERNEL_COUNT to tests
* fix mypy
* that's gotta be recursive
* fix opencl shape processing
* remove unneeded lambda
* ngrl stuff
* fngrl
* fix typo in compile script
* workflow dispatch
* new models in tests
* don't need to up this threshold
Co-authored-by: HaraldSchafer <harald.the.engineer@gmail.com>
* Split tests
Split tests into "Test CPU" and "Test GPU".
Add test flag "TEST_DEVICES", a comma-separated list of devices: CPU,GPU,ANE.
* Run tests based on the provided TEST_DEVICES flag
By default, all of "CPU,GPU,ANE" will run.
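A small sketch of how such a flag could be parsed; the TEST_DEVICES name and the CPU/GPU/ANE default come from the commits above, while the parsing itself is illustrative, not the code this PR added.
```python
import os

DEFAULT_DEVICES = ["CPU", "GPU", "ANE"]

def enabled_devices():
    raw = os.getenv("TEST_DEVICES")
    if raw is None:
        return DEFAULT_DEVICES  # no flag set: run everything
    # "CPU, gpu" -> ["CPU", "GPU"]; empty entries are dropped
    return [d.strip().upper() for d in raw.split(",") if d.strip()]
```
Usage would then be e.g. TEST_DEVICES=CPU,GPU pytest -s -v.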
* fix bad quote
* Revert changes and use GPU=1
This is done by setting the default Tensor device to Device.CPU, or to GPU if GPU=1 is set.
Run GPU tests: GPU=1 pytest -s -v
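Illustratively, that default-device selection reduces to a one-line check of the GPU env var (Device.CPU/GPU are named in the commit; the helper itself is assumed):
```python
import os

def default_device():
    # GPU=1 switches the default Tensor device from CPU to GPU.
    return "GPU" if os.getenv("GPU") == "1" else "CPU"
```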
* Actions sloccount check for < 1000 lines (tests and source)
* Sudo lol
* Test sloccount in actions ci
* More tests for the expression
* apt install test to do validation in cli
* fuck test
* try something that probably won't work
* of course not
* test that test fails if lines greater than 1000
* more than 1000 results in "1,000", not "1000". Delete
* remove test lines. Should pass now
* Untouch file
Untouched a file that was changed in testing