* small helps
* got something working
* faster?
* faster yes
* cleanup
* cleanup
* cleanup
* Fix non jit
* Fix fp16 and some cleanup
* Fix fp16 and some cleanup
* cleanup
* similar to master
* cleanup
* change reduceop heuristics
* add model ema and jit hack
* add ema eval
* have to create a duplicate eval function for jit
* remove manual seed
* 94% achievable with normal eval
* ema is outputting the same results as normal
* fix ema bug
* ema achieves 94% with fixed seed
* multigpu tested
* constant fold decay, fix jit, adjust message for multigpu
* pull SpeedyResNet out of train_cifar()
* patch to remove hack from stable_diffusion.py
* sorry linter
* realize after assign?
* float16 broken in llvmlite use float64 for now
* int32
* idiot forgot to change test array dtype
* no JIT call in TransformerBlock
* idea
* move 2 reshapes to jitted function
shrink inside jitted too, 6.3ms
remove back reshapes, 5.5ms
isinstance -> __class__ 4.99ms
* think
revert ops_gpu.py
revert symbolic.py too
PYOPENCL_COMPILER_OUTPUT=1
* cleanup
* fix cache shape for conversational model
only reshape if start_pos > 0
* small cleanup
* include var_vals.keys() to st.key
* add comments
* llama small update
* everything jitted again, similar structure to gpt2
* fix typing
* add TODO for in place update cache
* update mask function
* kept 94 with the new fetcher
clean up batch fetcher
* 94.04% without cutmix
* 94.04% with cutmix
* move batch fetcher to avoid fetching additional batch last STEP
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* feat: train cifar using multigpu
* feat: split eval batch across 5
* feat: cleaner allreduce
* feat: 93.88%
* feat: cleaner batch chunking from bert
* feat: cleaner grad sync
* feat: tinygrad argmax
* feat: make it work with different gpu counts
* feat: move some stuff into the normal __init__
* feat: autodetect gpu count
* feat: move import inside
* [WIP]: implementation of SoftVC VITS SVC model
* fix typo
* fix whitespace
* Fully implement Generator & Synthesizer
- implement SineGen & SourceHnNSF to reconstruct source signal from F0
- source signal is added during Generator
- fix various typos
- start loading state dict for synthesizer
* Load Synthesizer weights
- Fix typos in Synthesizer
- Slightly modify vits::load_checkpoint to skip a specified layer
- Test with Saul Goodman model because Drake weights are on mega
* start work on ContentVec
- implement ConvFeatureExtractionModel for ContentVec
- start work on TransformerEncoder for ContentVec:
- this transformer probably needs its own MultiheadAttention implementation
- fix various typos in synthesizer
- add helpers to mimic the behavior of torch's ~ and % operators
* use normal and kaiming_normal
* Implement ContentVec
- load ContentVec weights and config from fairseq hyperparams
- use MultiHeadAttention from whisper.py
- TransformerSentenceEncoderLayer might still need some tweaking, will see during inference testing
- redid tilde()
- some cleanup
* rename the file so it can be imported
* forgot to lint
* use float() instead of cast()
* add contentvec256l9 and cleanup
* Implement SoVITS fully and run it
- Fully run sovits with .wav file
- Drake weights need to be manually downloaded for now
- Fix bugs
- Add examples/sovits_helpers
- Big TODO: INVALID Kernel for recordings > 4.5 secs
* temp fix for longer audio recordings
* Upsample no more torch
* cleanup & detailed inference time measuring
* Completely remove torch(audio)
- Implement sinc resample in tinygrad
- Load audio via Soundfile
- Some cleanups
* move stuff to helper files
* Cleanup
* fix invalid kernel
* Cleanup & add more models
* Metal sounds good after master merge
- But Synthesizer pass became much slower
* drake weights now marked safe
* do load/store in numpy
* no commas needed here
* remove extra newline
* call Tensor::where on object
* use Tensor::cat instead of numpy
* pull out first iteration
* remove Sequential, Dropout, GELU, TransposeLast
* cast during loading
* clean up attention
* remove SamePad
* Major cleanup / line reduction
- Finish implementation of GroupNormMasked
- Simplify parts of TransformerEncoder
- Simplify parts of Generator
- Move all helpers to common section
- Only use repeat_expand_left for interp after SpeechEncoder
- Moved SVC-specific ContentVec impls up (canonically)
- Proper annotations for get_encoder
- Finished all TODOs
- Squashed some whitespace
* clean up preprocess as well
* more straightforward bool expr
* add demo mode
Fixes issue:
```
loss_cpu = loss.detach().numpy()[0]
~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
```
Signed-off-by: David Heidelberg <david@ixit.cz>
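The traceback above comes from indexing a 0-dimensional array with `[0]`. The commit does not show the fix itself, so the following is only a sketch of the failure and of two shape-agnostic ways to read the scalar. It uses plain NumPy, and `loss_np` is a hypothetical stand-in for `loss.detach().numpy()`:

```python
import numpy as np

# Hypothetical stand-in for loss.detach().numpy() when the loss is a scalar.
loss_np = np.array(2.5, dtype=np.float32)  # shape (), i.e. 0-dimensional

# The failing pattern from the traceback: a 0-d array has no axis to index.
try:
    loss_np[0]
    raised = False
except IndexError:
    raised = True

# Option 1: .item() returns the single element as a Python scalar.
loss_cpu = loss_np.item()

# Option 2: flatten first, which also works if the array has shape (1,).
loss_cpu_alt = float(loss_np.reshape(-1)[0])
```

Either option keeps working if the loss later comes back with shape `(1,)` instead of `()`.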
* Refactor AttnBlock, CrossAttention, CLIPAttention to share code
* Reshape and transpose in loop
* Bugfix on attention mask
Co-authored-by: Jacky Lee <39754370+jla524@users.noreply.github.com>
* Implement scaled_dot_product_attention and test
* Support attn_mask
* Support is_causal too
* Use in llama
* Don't forget to reshape
* Set requires_grad=False for causal
* Remove staticmethod
* Remove extra spaces
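For reference, the attention commits above (`scaled_dot_product_attention`, `attn_mask`, `is_causal`) can be illustrated with a minimal NumPy sketch. This is not the tinygrad implementation. The signature and mask conventions are assumptions modeled on the usual definition: boolean masks select allowed positions, float masks are added to the scores. It also assumes every query row has at least one unmasked key, otherwise the softmax yields NaN.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False):
    """Reference attention: softmax(q @ k^T / sqrt(d) + mask) @ v."""
    L, S, d = q.shape[-2], k.shape[-2], q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    if is_causal:
        # Query position i may attend only to key positions j <= i.
        causal = np.tril(np.ones((L, S), dtype=bool))
        scores = np.where(causal, scores, -np.inf)
    if attn_mask is not None:
        if attn_mask.dtype == bool:
            # Boolean mask: True means "may attend".
            scores = np.where(attn_mask, scores, -np.inf)
        else:
            # Float mask: added to the scores before the softmax.
            scores = scores + attn_mask
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `is_causal=True` the first query position can see only the first key, so its output equals the first value row. Passing the equivalent lower-triangular boolean `attn_mask` gives the same result.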