* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* i love the walrus so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
* need env.
* ugh, wrong path
* replace get_child
* force rebuild of ocelot
* SzymonOzog gpuocelot
* delete that
* downgrade that
* non parallel
* force rebuild
* use llvm
* nauto
* less mem maybe
* print test
* helper_test_exception skip CUDACPU
* helper_test_exception
* shippable
* very close
* remove comment
* negative strides working
* almost everything passes
* calculate offset with list comprehension
* some cleanup
* got disk load working
* review suggestions
* fix after merge
* overlap working
* did it
* clean
* fixed disk load
* lint
* mypy
* removed as_strided
* trying without simplify
* added back simplify
* make sure expanding to smaller shape
* cleanup
* removed comment
* removed env file
* trying whisper test again
* onnx test sqlite issue
* working on test
* finished test
* eliminate unnecessary shrink-then-pad
* don't shrink buffer
* added strides check
* added to ci under linters
* switch issue
* allow symbolic stride
* removed .env
* isinstance
* adjust strides for double expand
* cleanup
* needed to add type hint for mypy
* set pythonpath
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* wmma: refactor tensor cores using existing local dims
* optimizer: fix bad rebase and break after one late local
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add mops to graph, refactor IMAGE
* no reshape pushing
* add todo
* fix openpilot model alt
* push reshapes reduces kernels in new op
* IMAGE=2 is a first class citizen now
* Revert "disable flaky triton test"
This reverts commit 1e15fdaee7.
* Update test.yml
* check if has shared for matvec
* disable ocelot cache for triton
* disable ocelot cache
* disable ocelot cache
* pass shared to triton uops tests
* temporary debugs for CI crash
* Revert "temporary debugs for CI crash"
This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542.
* Revert "triton isn't tested, and allows this refactor (#2007)"
This reverts commit dea8bb0938.
* add runtime_args to every renderer, move triton local size override to runtime args
* Add binary to args, correct type returned
* update to new loops
* Update test.yml
* some cleanup
* move continue back
* more more more
* added to CI
* try
* try intentionally break some tests
* wtf
* del True for test
* yay tests broke, now pls no break
* try AGAIN
* gahy
* lol
* try
* move over constant
* moved over MORE
* move shrink over
* trailing lines
* try CUDA CI
* try again
* boom
* oops
* improved comments
* try: disable some flags and disable CUDA
* try breaking tests
* traceback has too much info so add --tb=no
* revert forced CI failure
* add comments and del unused imports
* oooooooo using regular debug try enable tb
* intentionally break tests
* added tb back. Maybe not too verbose
* strip whitespcae
* missed something
* Shape op int32 -> int64
* oops missed something
* add some types
* get rid of crazy 1 liners in pad op
* actually test Split this time LOL
* strip that whitespace
* limit metal buffers
* look at the base, not the srcs
* Revert "Revert "openpilot kernel fix from 209 to 207 (#2006)" (#2065)"
This reverts commit 924ecc4d6a.
* add a test for that
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* Fix openpilot kernel from 209 to 206
1. Use push_movement_ops conditions in _movement_op. Don't push
PAD or check if the ops are safe to be pushed with PAD
2. Don't push if all the op.buffers are realized
* change ALLOWED_KERNEL_COUNT to 206 for openpilot
* don't push through sourceless buffers
* change the tests to adjust kernel counts for new behaviour
* restore pushing of movement ops through childless buffer
* don't push EXPAND, causes OOM
* allow push of intermediate movement ops
* adding new test behaviour
* modifying external_test_opt for new behaviour
* restore old tests
* Reenable push of EXPAND and introduce new tests
I was wrong intially thinking EXPAND can cause OOM and hence I had
disabled it. Since it is 0 stride and doesn't allocate memory its cool
* Don't push EXPAND above LoadOps LB. This is causing OOM
* Push should be decided on movement root of bufs
To check if ast.op.buffers is sourceless/ realized go the the movement
root and then decide if pushing should be done or not
* refactor for readability
* use .base instead
* don't push expand, bad memory/compute consumption
* restrict push of reshape, seeing improvement
* push reshape if unary without further check
* disable PAD solves convnext kernel count increase
* reenable test_cache_binaryop_transpose
* small nit