* option for matmul
* fixups
* fast like a nascar
* running
* thneed runner
* no buffer id makes no backing buffer
* move constant folding to the top
* runs on mac
* folded biases
* was v slow
* maybe just that
* elu touchup
* speed and float32
Co-authored-by: Comma Device <device@comma.ai>
* simple lazy
* simple
* fix graph and make realize simpler
* SHUFFLE_MOVEMENT_OPS already works
* MERGE_MOVEMENT_OPS and REMOVE_MOVEMENT_NOPS
* it works, but it's slow
* constant inlining
* cache misses are the reason for loss
* fix non determinism
* cleanup, a few tests fail
* profile
* cache lazyop
* cleanups
* create namedtuple once
* bunch of caches
* it's not deleting
* nograd
* caching allocator
* reduce_op
* fromCPU if you want fromCPU
* complain
* nvidia fix
* realized on Tensor
* numpy is very slow
* no loads in second run
* caching in View
* 10ms speedups on batman
* remove old profiler
* bunch of refactors
* contiguous on view
* elementwise_op_compile for conv
* support ewop after processing op
* this still works
* conv folding works
* all we do is conv conv conv no matter what
* all args to the conv
* still works
* unify conv and ewop
* ops_gpu cleanup
* move around ops_gpu
* remove caching allocator
* remove unused
* find_conv shorten
* gpu refactors
* simpler gpu
* and that
* cmp is fast
* 18ms on mac
* it's a lot of lines, but it's faster
* minor
* tests pass
* LoadOps.CONTIGUOUS
* remove dups
* torch converter doesn't support slice
* move lazy out for merge
* LoadOps are only for lazy
* simple lazy
* simple
* fix graph and make realize simpler
* SHUFFLE_MOVEMENT_OPS already works
* MERGE_MOVEMENT_OPS and REMOVE_MOVEMENT_NOPS
* it works, but it's slow
* constant inlining
* cache misses are the reason for loss
* fix non determinism
* cleanup, a few tests fail
* profile
* cache lazyop
* cleanups
* create namedtuple once
* bunch of caches
* it's not deleting
* nograd
* caching allocator
* reduce_op
* fromCPU if you want fromCPU
* complain
* nvidia fix
* realized on Tensor
* numpy is very slow
* no loads in second run
* caching in View
* 10ms speedups on batman
* remove old profiler
* bunch of refactors
* contiguous on view
* elementwise_op_compile for conv
* support ewop after processing op
* this still works
* conv folding works
* all we do is conv conv conv no matter what
* all args to the conv
* still works
* unify conv and ewop
* ops_gpu cleanup
* move around ops_gpu
* remove caching allocator
* remove unused
* find_conv shorten
* gpu refactors
* simpler gpu
* mergable without this
* ops torch
* accelerated opencl
* it's running, it's just wrong
* bugfix
* model is correct in opencl
* lazy image convert
* add padding support to convolution
* that stuff was all upstreamed
* remove HEAD
* oops
* test_simple_conv2d_4 passes, add dilation support
* put logic in ops_opencl
* fix crash
* hmm, stride seems okay
* padding for batched inputs
* just an issue now with cout%4
* op model still passes
* fix startPackedInputChannel
* pre and post processing ops for graph
* don't break other llops
* shapetrackering
* reshapes are free
* lazy movement ops
* start shapetracker
* that late reshape is crushing our hopes
* simple failure
* DumbShapeTracker passes tests
* improve st tests
* stacked view tracker works
* flip works
* tests pass
* shapetracker works
* use ShapeTracker in ops_gpu
* a couple lines
* fix 0 shape
* less lines
* use shapetracker for new_shape in ops.py
* simpler still
* padding with a ZeroView
* gamed it a little
* replace broadcasting with expand
* Tensor, not self
* remove broadcasting from mlops
* delete useless A operator
* expand, not repeat
* remove A op
* expand on gpu
* binary_op doesn't broadcast anymore
* expand is still total junk, but the tests should pass
* quick math: 0 + x = x.
* gradient w.r.t. x using cherry for conv
* gradient w.r.t. w for conv on cherry but doing vector dot products
* small optimization
* [cherry] optimize conv backpass for large channel count
* get rid of numpy einsum
Simple test using the Chicken example from https://upload.wikimedia.org/wikipedia/commons/4/41/Chicken.jpg and the image preprocessing from example/efficientnet.py
Note that EfficientNet loads the weights from the internet so running the tests may be slow the first time. We could speed up the tests by caching the /tmp folder.
Fixes#234
* added resnets
* fix minor
* fix minor
* resnet in models
* added resnet test
* added resnet train test
* added linear, conv2d nn tests
* fix minor in extra/training
* resnet in models
* fix minor
* fix tolerance for linear in nn test
* fix eval, this causes cpu and gpu UT failing
* revert transformer test
* fix minor for CPU test
* improved model get_params for sequential layer
* fix minor for params counting
* commented broken ops tests
* improved train for resnet
* Add dropout test
* Remove condition where training is false
* Skip dropout test when on GPU
* Revert changes to tensor.py and fix test case
* Revert change on whitespace
* Convert Tensor to cpu for testing
* Fix whitespace in tensor.py
* ops_risk
* risk sim
* guessing is for winners
* minor
* better
* matmal with risk
* conv doesn't work
* closer
* conv2d works
* ops_risk
* opt2 works
* opt1 may not be possible
* opt1 is a mulacc
* arty
* attosoc example building on mac
* minor
* riscv assembler
* gucci gang
* we got C code
* not a scam
* hello
* make risk mergeable into master
* unop support
* Some progress on yolov3
* Removed some debugging comments… Also, the forward pass eats all RAM for some reason
* forward pass almost runs
* forward pass runs almost
* forward pass runs, now we gotta load the weights
* loading weights works
* fetches config and weights
* everything kind of works, postprocessing of output still needs to be implemented, temp_process_results kind of works, but its kind of terrible, and not how things should be done
* some changes
* fixed some bugs in the forward pass and load_weights function, now outputs more correct values, however some values are still loaded incorrectly
* Something is wrong with the forward pass, Conv2d tests added
* forward pass almost outputs correct values, gotta fix one more thign
* yolo works
* some final changes
* reverting changes
* removed dataloader
* fixed some indentation
* comment out failing test, somehow it fails CI even though it passes on my computer…
* fixed wrong probabilities
* added webcam option to YOLO, now just need to add bounding boxes and speed it up
* some progress towards adding bounding boxes
* trying to speed up yolo layer on GPU, still faster on CPU but with 30GB ram usage
* Faster inference times, bounding boxes added correctly, webcam works, but is slow, and there is a memory leak when running on CPU... Also added tinygrads output on the classic dog image
* removed some debugging print statements
* updated result image
* something weird is going on, mean op on GPU tensor randomly faults, copying a tensor from GPU->CPU takes 10+ seconds…
* Improved __getitem__
* Updated
* Updated __getitem__
* Linebreaks
* Maybe this works?
* Added MNIST locally, tests run now
* Some progress on yolov3
* Removed some debugging comments… Also, the forward pass eats all RAM for some reason
* forward pass almost runs
* forward pass runs almost
* forward pass runs, now we gotta load the weights
* loading weights works
* fetches config and weights
* everything kind of works, postprocessing of output still needs to be implemented, temp_process_results kind of works, but its kind of terrible, and not how things should be done
* some changes
* fixed some bugs in the forward pass and load_weights function, now outputs more correct values, however some values are still loaded incorrectly
* Something is wrong with the forward pass, Conv2d tests added
* forward pass almost outputs correct values, gotta fix one more thign
* yolo works
* some final changes
* reverting changes
* removed dataloader
* fixed some indentation
* comment out failing test, somehow it fails CI even though it passes on my computer…
* fixed wrong probabilities
* added webcam option to YOLO, now just need to add bounding boxes and speed it up
* some progress towards adding bounding boxes
* trying to speed up yolo layer on GPU, still faster on CPU but with 30GB ram usage
* Faster inference times, bounding boxes added correctly, webcam works, but is slow, and there is a memory leak when running on CPU... Also added tinygrads output on the classic dog image
* removed some debugging print statements
* updated result image
* something weird is going on, mean op on GPU tensor randomly faults, copying a tensor from GPU->CPU takes 10+ seconds…
* Split tests
Split tests into "Test CPU" and "Test GPU".
Add test flag "TEST_DEVICES" which is a comma separated list of devices:
CPU,GPU,ANE
* Run tests based on provided TEST_DEVICES flag
By default will run all "CPU,GPU,ANE"
* fix bad quote
* Revert changes and use GPU=1
This is done through setting the default Tensor Device to Device.CPU of
GPU=1 is set.
Run GPU tests: GPU=1 pytest -s -v
* 2serious
* load/save
* fixing GPU
* added DEBUG
* needs BatchNorm or doesn't learn anything
* old file not needed
* added conv biases
* added extra/training.py and checkpoint
* assert in test only
* save
* padding
* num_classes
* checkpoint
* checkpoints for padding
* training was broken
* merge
* rotation augmentation
* more aug
* needs testing
* streamline augment, augment is fast thus bicubic
* tidying up
* transformer eval
* axis=-1
* transpose
* test for permutation using torch.movedims
* another test
* line
* Update all devices to be tested
ANE, CPU and OCL all now support all tests.
However tests are not currently passing on GPU and I cannot test on CPU.
Failing GPU test are not an issue caused by this update. Tests have not
been passing due to a missing "six" required installation.
OpenCL Tests have not been run since commit: 1a1c63a08b
devices have 3 types and are handle by a new DeviceTypes enum. (The goal
is to revert to Tensor.<type>, but this current setup allows for keyword
argument defaults: `device=DeviceType.CPU`)
All references to Tensor.GPU/CPU/ANE as been converted to the
corresponding `DeviceTypes` enum.
Refactor of the conversion code to allow for any device to any device
conversion.
* Add six dependency in requirements.txt
* Resolve failure to run tests
Move six into gpu required installs. Remove six from standard
installation.
* Remove repeated data conversion
* Refactor method names
Also reduce code with .to and .to_
* Dynamic device handlers
* Refactor DeviceTypes -> Device
* Add mem copy profiling back
* test_backward_pass_diamond_model passing
* Resolve Sum issue on GPU
* Revert batchnorm2d tests
* Update README with upadated API
* ANE testing with
* Last minute line gains
* 2serious
* load/save
* fixing GPU
* added DEBUG
* needs BatchNorm or doesn't learn anything
* old file not needed
* added conv biases
* added extra/training.py and checkpoint
* assert in test only
* save
* padding
* num_classes
* checkpoint
* checkpoints for padding
* training was broken
* merge
* rotation augmentation
* more aug
* needs testing
* streamline augment, augment is fast thus bicubic
* tidying up
* Consistent GPU classes
Convert the existing GPU classes into one standard format.
Remove duplicated functions in `test_mnist` and create a TestMNISTGPU
class. This reduces line count and ensures consistency.
Use `@unittest.skipUnless(GPU, "Requires GPU")` instead of `if GPU:` to
skip GPU testing. This will ensure that skipped tests are displayed
accordingly in the pytest output.
* Optim Testing now supports GPU
* Tensor testing now supports GPU
jacobian and gradcheck auto skipped until GPU float64 support added.
* GPU support for custom constructor methods
* Remove GPU flag from Model constructors
It was requested that the `gpu` kwarg be removed from the model
constructor. GPU conversion is now handled in the train function.
This also required the conversion of Optimizer parameters as they are
constructed prior to execution of the `train` function and are dependant
on the model GPU state.
* Fix typo: float32->float64
* Clean `get_parameters` utility
Just a quick refactor w/ the new support for optimizers.
* Remove GPU kwarg from TinyNet
Remove `gpu` kwarg from tiny net to match test_mnist `train` function.
* Detach
* Torch.detach reuses the buffer in the
* Fix test
* wakey wakey GitHub Actions
Co-authored-by: holonomicjl <58403584+holonomicjl@users.noreply.github.com>
* tensor implementation for rmsprop and adam
* test_mnist.py extended to cover sgd, rmsprop and adam on cpu and gpu
* number of steps reduced for adam from 1000 to 200
* streamlined numerical_jacobian
* Got rid of the g loop in Conv2D.forward
* ereased stupid line
* nothing
* no loops in Conv2D forward
* Conv2D backprop improved
* stupid things in examples
* alternative to einsum
* Conv2D backward einsum alternative
* tidying up
* tidied up
* no ravel
* got rid of print
* Update efficientnet.py
* Update efficientnet.py
* Update efficientnet.py
* only tensordot
* 255.0
* whitespace
* aspect ratio error in efficientnet
* noprint
* efficient net wrong strides
* broadcasting for backward ops
* Update ops.py
* Update ops.py
- was wrong
* broadcast test for backward enabled
* function adBC + not summing over already 1 axis
* spacing
Co-authored-by: Marcel Bischoff <marcel@Marcels-iMac.local>
* allow for general broadcasting of binary operations. can handle any situation where corresponding dimensions between the tensors match, or at least one of them is of size 1. if a tensor has fewer dimensions than the other, then its size is padded with 1s until they match have the same number. also refactored buffer_zeros() by creating a function buff() that makes a buffer from a numpy array
* remove extra tabs
Co-authored-by: phillip <phillip_bement@reedbement.com>
* Pad2d backward pass on GPU
* Faster Pad2D GPU backward pass (no zeroing needed)
* Fix out of bounds error
* Don't save prg
* Let compiler optimize division by 1
* More generic broadcasting (1s at the start)
* Bug fix
* Add comment
* Try to fix flaky test with other method
* Add mixed broadcast support
* 1kernel
* Separate broadcast tests
Co-authored-by: holonomicjl <58403584+holonomicjl@users.noreply.github.com>
* Somewhat more generic broadcasting
* Add TODO
* Set Torch to deterministic in test
Co-authored-by: holonomicjl <58403584+holonomicjl@users.noreply.github.com>
* no trailing whitespace
* GPU MaxPool2D.backward(); TinyConvNet train passes!
* Fix GPU avgpool.forward() init_val
Doesn’t change result but is simpler.
* Fix MaxPool GPU init_val
Tests only cover random non-negative inputs. This fixes issues if negative inputs are fed to GPU MaxPool2D. Test update to follow.
* to make it work locally
* definitely not working
* Conv2D GPU passes some of the tests
* Conv2D GPU passes more of the tests
* passes some tests and mnist
* removed unecessary code
* Conv2D Backpass works
* wrong test_ops.py
* white space + test backward
* ereased useless code
* removed default argument
* long lines
Strided CPU Pooling was introduced but assumes small kernel size
(<=(10,10)), but efficientnet.py feeds kernel_size=(112,112).
This causes a huge array buffer allocation in stack_for_pool() that
hangs inference for a long time or until system OOM.
Revert CPU Pooling for now, and re-introduce #74 later with a new
global-average-pooling op that can be used instead of avgpool2d with
large kernel size for efficientnet inference.
Co-authored-by: Ryan Neph <ryanneph@google.com>
* copy tensors to and from gpu
* add on GPU
* adding works
* we stick shapes in
* works on cpu and gpu
* test changes, not passing yet
* something else
* op tests pass
* add, mean, and sum have working forward/backward
* mul ops test
* no gpu support, no problem
* test pass, clean up later
* gpu cleanup
* cleanup test ops, don't let div fail
* revert more
* aimpler dispatcher
* clean up grad
* GPU and
* grad is a Tensor now
* gate test on GPU
* cleanups
* late loading gpu
* GPU as input option
* last cleanups