* safetensors test
* safe_save
* load back with real safetensors
* bugfix in device name. add simple torch_load
* it works for llama, but it's slower...
* mmap
* no intermediate
* load mmaped
* readinto speed
* not ready yet
* revert that
make it work out of the box for new users.
the default configuration of train_efficientnet is to use the smaller cifar
dataset. import datasets.imagenet tries to open imagenet_class_index.json
and will fail, unless user has already downloaded it.
* Add ResNet inference test and cannon
* Test with ResNet50
* test_car works with resnet fix
* Add KiTS19 dataset
* KiTS19: Implement iterate
* No batch load for this dataset
* Save results on iterate
* Implement dice score
* Add data prep and eval functions
* Resolve shape issue
* Conversion works but wrong values
* Segfaults when load_from_pretrained is called
* Fix segfault and assign properly
* Final result generated, though very slow
* Store and load final result to save time
* Fix typo in finalize
* Score computes
* More bug fixes, dice score is very low
* Working broken code
* Assign output values to result
* Getting a much higher score now
* Fix dataset preprocessing
* Mean DICE score of 88.5
* Ugh, typo
* Attempt to reimplement model
* Rename layers
* Tiny model works, kinda
* Accuracy? gone
* Implement InstanceNorm and match torch
* Test instance norm 2d and 3d
* Combined input block with downsample block
* Tiny model works, support strided convtranspose
* Commands to download dataset
* Clean up a bit
* unet3d_v2 -> unet3d
* Remove duplicated code
* Oops, put tests back
* add retinanet with resnet backbone
* adds resnext to support loading retinanet pretrained on openimages
* object detection post processing with numpy
* data is downloaded and converted to coco format with fiftyone
* data loading and mAP evaluation with pycocotools
* remove fiftyone dep
* * eval freq
* fix model timing
* del jit for last batch
* faster accumulate
* feat: add mlperf bert model
* feat: switch to nn.Embedding
* clean+fix: fix formatting
* feat: add simple downloader
* feat: metrics
* feat: don't actually need exact match
* feat: doing a run
* feat: set eps on the layernorms
* clean+fix: cleaner impl + hopefully fixed
* feat: move dataset initialization into iterate
* feat: move tokenizer out of iterate
* clean+fix: cleaner + working
* clean: cleanup
* fix: fix metrics
* feat: need to use original bert gelu + download vocab
* feat: make directory if it doesn't exist yet
* feat: jit go brrr
* feat: promote Embedding to nn
* fix: fix failing test
* feat: add test with jit
* feat: rewrite embedding to no longer need stacked for loops
* clean+fix: don't know how that happened
* feat: initial rnn-t
* feat: working with BS>1
* feat: add lstm test
* feat: test passing hidden
* clean: cleanup
* feat: specify start
* feat: way faster lstm & model
* fix: default batch size
* feat: optimization
* fix: fix metrics
* fix: fix feature splicing
* feat: cleaner stacktime
* clean: remove unused import
* clean: remove extra prints
* fix: fix tests and happy llvm
* feat: have the librispeech dataset in its own dir
* clean: unused variable
* feat: no longer need numpy for the embedding + slightly more memory efficient lstm
* fix: forgot to remove something that broke tests
* feat: use relative paths
* feat: even faster
* feat: remove pointless transposes in StackTime
* fix: correct forward
* feat: switch to soundfile for loading and fix some leaks
* feat: add comment about initial dataset setup
* feat: jit more things
* feat: default batch size back to 1
larger than 1 is broken again :(
and even in the reference implementation it gives worse results
* Make GPU the default device
* Compile EfficientNet with CPU
* don't print device
* use METAL and CUDA if possible
* Revert some changes to workflow
* Fix import error when checking device availability
* device lookup is now optional
* hopefully fix linter and tests
* fix workflow
* Skip device if not available
* don't change default if CPU=1
* simplify device selection
* Default to CPU if no GPU
* don't print device name...
* No need to change default in llama
* Make GPU the default device
* Compile EfficientNet with CPU
* don't print device
* use METAL and CUDA if possible
* Revert some changes to workflow
* Fix import error when checking device availability
* device lookup is now optional
* hopefully fix linter and tests
* fix workflow
* Skip device if not available
* don't change default if CPU=1
* simplify device selection
* Default to CPU if no GPU
* don't print device name...
* No need to change default in llama
* run github workflow
* Fix logic to select default
* pass if an error occurs
* use separate function for try except
* fix binop, other tests failure
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* add int64 as supported dtype from numpy
Without this, examples/transformer.py didn't run. With this change it runs successfully.
* Update helpers.py
* Update transformer.py
* Update training.py
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in it's proper place
* building shapetracker
* default ENABLE_METHOD_CACHE
* symbolic compiles
* improve types
* tensor compiles
* oops, that's a bug
* best of both worlds
* find legit typing bugs
* pad2d can take list or tuple
* sub 200ms when compiled
* third try at torch loading
* numpy fixed
* fix enet compile
* load_single_weight supports empty weights
* oops, CPU wasn't the default
* so many bugs