* Allow multi-input model export
* Add model export unit test
* Fix efficientnet compilation
* Only run model export test on JIT supported devices
* Skip export model test if not EXPORT_SUPPORTED_DEVICE
* simplify gpt2 example
* kernel_jitted_count and jit tests
* Revert "kernel_jitted_count and jit tests"
This reverts commit 31a3c26dd061dbcf6c43c295a265813ccb35b9e9.
* all_jitted test in test_real_world
* small changes
* expand in terms of substitute, directly expand g_idxs g_valid
* delete expand_ops
* don't compare using hash
* any instead of in
thanks gijskoning
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* support tc
* testing code
* no more create_rednode
* maxsize none in view/node
* oops
* undo
* typing
* oops
* oops
* lmao
* lmao
* add expand multi test
* Node.iter_idxs
* type
* type
* delete checks!
* clean up a little?
* expand_idx in symbolic
* un-golf
* play around with types >.>
* test_substitute and also remove an incorrect test?
* get rid of range
* Update symbolic.py
* split out view cache change
* split out flat components change
* reduce diff
* reduce diff
* add some float4 tests
* fix
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* small lazy cleanups
* a few more
* cleanups
* no more realizing in the scheduler test
* a few more minor things
* that was just wrong
* fix graph. the graph test was completely useless
* make graph usable
* fix op graph
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* loadop buffer on cpu
* works for GPU
* sort of working
* has bugs
* gpu tests pass
* fix some tests
* fix tensor cores
* fix test linearizer
* fix symbolic
* fix has_variable_shape
* non symbolic size
* disable weird test
* simple cache fix
* fix custom function
* fix kopt
* cleanups
* a bit broken on the assign
* contig check
* only buffer
* need that order
* idx
* dedup buffers
* hmm, bugfix
* fix tensor cores
* opts device
* Move ops_triton to runtime and remove errors from deprecated code
* Remove deprecated AST Kernel
* Remove deprecated buffer
* Add TritonProgram
* Triton Buffer
* Use RawCUDABuffer
* triton_compile
* Added new parameter
* pass _buf to program
* remove deprecated include
* Added triton tests
* Deprecated includes removed
* remove double print
* Disable float4 support
* Disable float4 support
* variable load fix
* Track local size
* Add pycuda to triton dependencies
* Merge test.yml
* install cuda packages for testing
* merge double package install
* remove emulated from triton tests
* upscale local index to power of 2 and add masking
* cuda envs
* Add TernaryOps
* ConstOp loading
* proper function name
* remove deprecated variables
* get global program from name
* const ops match local shape
* Enable test_nn
* remove deprecated import
* fix linter error
* Add wait logic
* Add local size override
* accumulate local shapes instead of using max shape
* Merge triton tests into global tests
* fix envs in testing
* Old testing routine
* split file into renderer and program
* remove print and starting whitespace
* pretty ptx print on debug 5
* linter errors
* ignore triton saturation tests
* ignore test example
* remove pytorch cpu extra index
* Add triton to existing testing routine
* use triton tests
* disable cuda backend in triton tests
* use cudacpu in tests
* print used device
* Print device default
* Remove print
* ensure we are running triton backend
* update variable signatures
* update dtypes for load
* infinity render fixed
* limit global size
* negative infinity now properly rendered
* split chain with parentheses for and node
* Add option to disable shared memory, disable for triton
* missing import
* Properly index and mask conditional load
* use mask only if not loading a block pointer
* nan support
* fix symbolic tests to include chain split
* proper masking for stores
* Implemented bool dtype
* Add mod
* fix loads for variables with valid range
* merge triton with cuda runtime
* merge from master
* run triton tests with cuda
* Correct target when running from triton
* conftest with triton compiler config
* use triton nightly
* verbose tests for triton
* capture stdout
* fix function depth when exiting multiple loops
* add render valid function for readability
* fix mask for local loops
* add _arg_int32 datatype
* fix dims for conditional loads
* enable non float stores
* correct variable dtypes
* fix type for arg_int32
* remove junk
* Added get max function for range based var.max
* remove deprecated code
* Fix triton ptxas path
* Fix testing for CI
* clamp local size by max local size instead of always running max
* Disable matmul test in triton cpu
* rerun tests
* Disable broken test in triton cpu
* whitespace removed
* rerun tests again
* Disable TestSymbolicOps for triton
* update to new uops
* linter fix
* ignore test/extra
* linting fix
* Update tinygrad/renderer/triton.py
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* remove deprecated line
* quotes type fix
* linter
* Remove unnecessary lines
* UnaryOps.NEG
* don't define constants
* Linting fix
* Disable tests that are broken in ocelot
* remove trailing whitespace
* reduce line count
* linting fix
* update to new uast
* New looping style
* Update to new uast
* make AST runner work with triton
* linting fix
* set renderer var for testing
* disable local for ocelot
* reenable all tests for ocelot
* Pass shared to cuda
* Don't group if the backend doesn't support shared mem
* use working gpuocelot branch
* enable all tests
* enable local for ocelot
* cleanup
* Update test.yml
* update cache key
* reenable test symbolic and extra
* Update test.yml
* Revert "Update test.yml" (rerun tests)
This reverts commit 98c0630ee5da4379e5c6b2437a5145fe87058c35.
* Revert "fix symbolic tests to include chain split"
This reverts commit 22a9a4c9cd14d23735e6540c8d90ee005ac4ea17.
* Revert "split chain with parentheses for and node"
This reverts commit 7499a7004ef4db785d0cd05cf292fdeff65ca90d.
* use global size from linearizer
* rename newvar to dtype to match other renderers
* join program start lines
* simplify code that adds axis to local dims
* assign r[u] in ssa
* We no longer need to replace target in src
* we no longer need to cast indices to int by hand
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* applying st
* tests pass
* minor cleanups
* torch too
* hack
* contiguous
* move mops
* contig in BN
* tests should pass
* make torch fast
* make zeros and ones contig by default
* no contig there
* fix padding with expanding
* might fix tests
* still doesn't fix bug, but should be there
* Revert "still doesn't fix bug, but should be there"
This reverts commit 8ea92f3e070c8936f7ec3d3f56247225fcaa6320.
* minor cleanups
* valid hacks
* valid hacks
* valid hacks
* new method
* new method
* handtune
* is gate load breaking?
* lint
ruff
less junk
new approach?
maybe this?
* Make it more clear
* Make it more clear
* Will deal with the linter later
* hack for linter
* subs the idx but don't touch the valid
* Updated the mod rules
* lint hack
* I believe this is the bug fix, let's see
* Mod Node left
* revert
* Maybe this won't break?
* revert
* implemented "handtuned garbage"
* revert and use VALIDHACKS
* Let's see the CI
* still broken?
* currently it's a jungle
* maybe this jungle ?
* This works for everything somehow
* Added test for symbolic
* lint
* final touch
* This still works
* lint
* midway clean
* less garbage
* lint
* final form
* Slow but working way
* lint and other stuff
* lint
* mypy
* Make sure CI tests Openpilot valid checks
* test if CI breaks
* Convert back
* refactor
* refactor
* Managed to reduce openpilot time from 30 secs to 5 secs
* Refactor
* Substitute a node with variable
* flake8
* Comment and refactor
* More comprehensive mod
* refactor
* bug fix
* More shave off
* remove not sure part
* add some contiguous
* remove second contig
* Revert "remove second contig"
This reverts commit fc164f7dca1ad75b1e466e4e45a05eca58b7e0e0.
* shm on osx
* can repro bug
* don't contig zeros and ones