* test: put conv in one reduce
* put reduce at the end
* more expand
* generic, and that expand was breaking things
* ratio
* don't undo the expand
* arg 1
* strides
* warning, for resnet
* warning removed
* disable cast
* handle cast
* op
* err, that's right
* fixup
* fix that
* a test to play with
* add double_reduces
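A hedged sketch of the double-reduce pattern these track (illustrative NumPy, not tinygrad internals): the second reduce consumes the output of the first, as in a variance computation.

```python
import numpy as np

# illustrative double reduce: reduce #2 depends on reduce #1, the pattern
# that multireduce scheduling tries to fuse into one kernel
x = np.random.rand(16).astype(np.float32)
mu = x.mean()                   # reduce #1
var = ((x - mu) ** 2).mean()    # reduce #2 over the result of reduce #1
assert np.isclose(var, x.var())
```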
* working up to final reshape
* fold the last reshape
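Folding a trailing reshape leans on the fact that consecutive reshapes compose; a minimal illustration with arbitrary shapes:

```python
import numpy as np

# back-to-back reshapes collapse to the final one, so a last reshape can
# fold into the kernel's output shape instead of a separate movement op
x = np.zeros((2, 3, 4))
assert x.reshape(6, 4).reshape(24).shape == x.reshape(24).shape == (24,)
```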
* moved to schedule
* fix axis
* CI: need to bring arange back
* FUSE_CONV_BW maybe
* valid in Python 3.9
* test_expand_reduce_is_folded_on_different_axes
* add FUSE_CONV_BW=1
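FUSE_CONV_BW is an environment flag; a hedged sketch of how such a flag is read with tinygrad's getenv helper (where the scheduler actually consumes this flag is an assumption, not shown here):

```python
# sketch only: tinygrad-style env flag, enabled by running with FUSE_CONV_BW=1
from tinygrad.helpers import getenv

FUSE_CONV_BW = getenv("FUSE_CONV_BW")  # getenv defaults to 0 when unset
```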
* test_fold_batchnorm_backward
* test_sgd_4convs_fuse
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* bring FUSE_AS_ONE_KERNEL back
* operands need reshape?
* fused, but arange didn't fold
* something deeply wrong
* yay, fused
* derive broadcasts
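"derive broadcasts" amounts to computing output shapes under broadcasting rules; a hedged sketch with a hypothetical helper (NumPy-style semantics, not tinygrad's actual code):

```python
# broadcast_shape is a hypothetical helper shown for illustration
def broadcast_shape(a: tuple, b: tuple) -> tuple:
    a = (1,) * (len(b) - len(a)) + a  # left-pad the shorter shape with 1s
    b = (1,) * (len(a) - len(b)) + b
    if any(x != y and 1 not in (x, y) for x, y in zip(a, b)):
        raise ValueError(f"cannot broadcast {a} and {b}")
    return tuple(max(x, y) for x, y in zip(a, b))

assert broadcast_shape((3, 1), (2, 1, 4)) == (2, 3, 4)
```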
* s/input/reduce_input
* _fixup_ones proved a point
* this is what it takes
* down to 3 required reshapes (see the sketch after this list):
1. output_shape
2. the second reduce merge dims
3. remove dims for above reshape
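A hedged sketch of that chain (the shapes are made up; only the structure mirrors the three steps above):

```python
import numpy as np

# hypothetical shapes, chosen only to show the three required reshapes
x = np.zeros((2, 3, 4, 5), dtype=np.float32)
r1 = x.sum(axis=3).reshape(2, 12)    # 2. merge dims for the second reduce
r2 = r1.sum(axis=1, keepdims=True)   # second reduce -> (2, 1)
r2 = r2.reshape(2)                   # 3. remove the dims kept by the reduce
out = r2.reshape(2, 1, 1)            # 1. reshape to the final output_shape
```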
* start real reshapes
* resolve shape in the edges pre-lazyop
* outputs are the same shape
* rewrite1: just the reduce
* more correct
* fuse_as_one_kernel
* closer
* this passes
* don't rerun info
* don't need these
* not needed
* test: use const
* hotfix: base
* asserts
* don't push through reshape
* cleanup
* don't need the cache
* test_reduceop_reshape_dont_push and test_index_fused are next
* improve single kernel indexing
* metadata in graph (#5399)
* indexing is O(1)
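The O(1) claim, sketched in NumPy (illustrative of the idea, not the implementation): indexing can be phrased as a reduce over an arange mask, and folding that reduce leaves a single read per output element.

```python
import numpy as np

x = np.arange(10, dtype=np.float32) * 2.0
idx = np.array([3, 7])

# pre-fold form: gather as a reduce over a one-hot arange mask, O(n) per output
mask = np.arange(10)[None, :] == idx[:, None]
out_reduce = (mask * x[None, :]).sum(axis=1)

# folded form: one read per output element, O(1)
out_folded = x[idx]
assert np.allclose(out_reduce, out_folded)
```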
* add failing test
* ugh, that all needs to be replaced with symbolic
* broken on ptx, it's fine
---------
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
* indexing getting better [run_process_replay] [no_assert]
* fix test
* test_arange_2_reduce is a simpler test
* put that print back, NOOPT
* don't merge reduces (they could be different reduces)
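An illustrative case of why identical inputs aren't enough to merge two reduces:

```python
import numpy as np

# same input, different reduce ops: merging these would be wrong
x = np.array([[1., 2.], [3., 4.]])
assert not np.array_equal(x.sum(axis=1), x.max(axis=1))  # [3., 7.] vs [2., 4.]
```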
* FUSE_AS_ONE_KERNEL
* fix tests
* fix test_var_multireduce
* whatever, put that there
* fails on others too
* fix test, revert UNMUL change
* in case order matters
* one kernel indexing works
* one kernel indexing works (test other)