* WIP: Stable diffusion WebGPU port
* Load whole model: split safetensors file to avoid Chrome allocation limit
* Gitignore .DS_Store, remove debug print
* Clip tokenizer in JS
* WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS
* e2e stable diffusion flow
* Create initial random latent tensor in JS
* SD working e2e
* Log if some weights were not loaded properly
* Remove latent_tensor.npy used for debugging
* Cleanup, remove useless logs
* Improve UI
* Add progress bar
* Remove .npy files used for debugging
* Add clip tokenizer as external dependency
* Remove alphas_cumprod.js and load it from safetensors
* Refactor
* Simplify a lot
* Dedup base when limiting elementwise merge (webgpu)
* Add return type to safe_load_metadata
* Do not allow run when webgpu is not supported
* Add progress bar, refactor, fix special names
* Add option to choose from local vs huggingface weights
* lowercase tinygrad :)
* fp16 model dl, decompression client side
* Cache f16 model in browser, better progress
* Cache miss recovery
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* sort of works
* interpreted
* fix flopcounter
* interpreted
* simpler
* type
* functools compile ast
* lose a line
* delete extra file
* no self.method_cache
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* remove arm64, caching for cuda
* caching in llvm
* switch cache_compiled to new cache
* fix clang
* caching for metal
* fix pylint
* cleanups
* perf_counter and binary
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* refactor unit tests for dtypes
* add missing dtypes in llvmir.py and lib.py
* skip torch tests
* webgpu
* cleaner skips
* fix llvm bool casting issue using compare
* llvm 100% passing
* llvm segfault
* TEMP decrease timeout mins to 11
debug
* add bf16 to setup
* skip half tests in cuda cpu
* check for CUDACPU instead
* add int16 to triton dtypes
* u16 for triton
* remove debug - diff is still hard to read
* derive from base class TestDType
* enhance test_upcast and downcast by running on every possible version
* dummy commit to rerun the flaky test
* skip the correct tests for CUDA
* bf16 should be skipped in the common TestDType cases
* re-enable bf16
* more consistent structure
* tiny changes to is_dtype_supported 1
* tiny changes 2
add reason
* fuzz
* fuzzer p2
* run fp32 twice
* remove duplicate fp32 run
* clang: use stdbool
* skip triton on bool casts
* merge and resolve conflicts
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test