* Revert "disable flaky triton test"
This reverts commit 1e15fdaee7.
* Update test.yml
* check if has shared for matvec
* disable ocelot cache for triton
* disable ocelot cache
* disable ocelot cache
* pass shared to triton uops tests
* temporary debugs for CI crash
* Revert "temporary debugs for CI crash"
This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542.
* Revert "triton isn't tested, and allows this refactor (#2007)"
This reverts commit dea8bb0938.
* add runtime_args to every renderer, move triton local size override to runtime args
* Add binary to args, correct type returned
* update to new loops
* Update test.yml
* some cleanup
* move continue back
* more more more
* added to CI
* try
* try intentionally break some tests
* wtf
* del True for test
* yay tests broke, now pls no break
* try AGAIN
* gahy
* lol
* try
* move over constant
* moved over MORE
* move shrink over
* trailing lines
* try CUDA CI
* try again
* boom
* oops
* improved comments
* try: disable some flags and disable CUDA
* try breaking tests
* traceback has too much info so add --tb=no
* revert forced CI failure
* add comments and del unused imports
* oooooooo using regular debug try enable tb
* intentionally break tests
* added tb back. Maybe not too verbose
* strip whitespcae
* missed something
* Shape op int32 -> int64
* oops missed something
* add some types
* get rid of crazy 1 liners in pad op
* actually test Split this time LOL
* strip that whitespace
* limit metal buffers
* look at the base, not the srcs
* Revert "Revert "openpilot kernel fix from 209 to 207 (#2006)" (#2065)"
This reverts commit 924ecc4d6a.
* add a test for that
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* start compile2
* tweak
* why are there two more kernels?
* minor cleanups
* don't break onnx tests
* add __metadata__ support to safetensors
* no early realize in onnx
* cleanups
* bugfix
* clean up image type, add optimize
* opt to match old
* try that
* opt work
* run compile2
* optimizer
* prt more
* prerealize
* imp
* NOLOCALS works
* no locals means no locals
* support fractional globals
* all locals welcome
* int that
* cleanups
* show gemv regression
* clean up diff
* use idx for the cond
* nolocals
---------
Co-authored-by: Comma Device <device@comma.ai>
* Fix openpilot kernel from 209 to 206
1. Use push_movement_ops conditions in _movement_op. Don't push
PAD or check if the ops are safe to be pushed with PAD
2. Don't push if all the op.buffers are realized
* change ALLOWED_KERNEL_COUNT to 206 for openpilot
* don't push through sourceless buffers
* change the tests to adjust kernel counts for new behaviour
* restore pushing of movement ops through childless buffer
* don't push EXPAND, causes OOM
* allow push of intermediate movement ops
* adding new test behaviour
* modifying external_test_opt for new behaviour
* restore old tests
* Reenable push of EXPAND and introduce new tests
I was wrong intially thinking EXPAND can cause OOM and hence I had
disabled it. Since it is 0 stride and doesn't allocate memory its cool
* Don't push EXPAND above LoadOps LB. This is causing OOM
* Push should be decided on movement root of bufs
To check if ast.op.buffers is sourceless/ realized go the the movement
root and then decide if pushing should be done or not
* refactor for readability
* use .base instead
* don't push expand, bad memory/compute consumption
* restrict push of reshape, seeing improvement
* push reshape if unary without further check
* disable PAD solves convnext kernel count increase
* reenable test_cache_binaryop_transpose
* small nit
* init compiled cache
* clang not compile to stdout
* use kwrags in compile
* remove some useless lines
* slimmer
* fix
* tabs
* retry
* remove decorator
* no race in hip
* smaller hip
* unused import
* unused pathlib
* path to str
* add test
* fix linter
* less lines?
* decorator is back
* update tests
* no hip version
* better comments
* a bit better test
* linter
* work wo decorator
* linter happy
* simpler return type
* more tests
* better comment
* readable
* readable
* readable
* compile returns bytes
* no ununsed imports
* readable
* load weights in fp16
* add dtype option in nn
* fix test
* no need for dtype in nn
* add option to load weights in FP16, but NaN
* change loss scaler
* cast to float32 for norm layer
* add a todo for the forward pass padding
* fix transform
* init
* Revert "init"
This reverts commit 682bf2073a8b4eca111596c67cf6ebd79f59e585.
* kids dont do drugs
* one way to fix
* resolve merge conflict
* no more or
* clean up
* start work on auto opt
* lin failure
* not beating hcopt
* greedy
* timing is fast
* codegen.search
* greedy search in handcode_opt
* track running gflops
* clean up those files
* no failure
* testing with the test_ops pattern
* add assign test
* flake8 complaining about single line fn
* slice 2d and minor cleanup
* make assign_slice a one-liner
* we dont need to repeat the same lambda twice, default tinygrad_fxn to be np_fxn
* back assign fn for np array
* implement __setitem__ in tensor.py
* dont re-slice the ret tesnsor
* one liner assign
* drop the permute test