* [WIP]: implementation of SoftVC VITS SVC model
* fix typo
* fix whitespace
* Fully implement Generator & Synthesizer
- implement SineGen & SourceHnNSF to reconstruct source signal from F0
- source signal is added during Generator
- fix various typos
- start loading state dict for synthesizer
* Load Synthesizer weights
- Fix typos in Synthesizer
- Slightly modify vits::load_checkpoint to skip a specified layer
- Test with Saul Goodman model because Drake weights are on mega
* start work on ContentVec
- implement ConvFeatureExtractionModel for ContentVec
- start work on TransformerEncoder for ContentVec:
- this transformer probably needs its own MultiheadAttention implementation
- fix various typos in synthesizer
- add helpers to mask behavior of ~ and % operator of torch
* use normal and kaiming_normal
* Implement ContentVec
- load ContentVec weights and config from fairseq hyperparams
- use MultiHeadAttention from whisper.py
- TransformerSentenceEncoderLayer might still need some tweaking, will see during inference testing
- redid tilde()
- some cleanup
* rename the file so it can be imported
* forgot to lint
* use float() instead of cast()
* add contentvec256l9 and cleanup
* Implement SoVITS fully and run it
- Fully run sovits with .wav file
- Drake weights need to be manually downloaded for now
- Fix bugs
- Add examples/sovits_helpers
- Big TODO: INVALID Kernel for recordings > 4.5 secs
* temp fix for longer audio recordings
* Upsample no more torch
* cleanup & detailed inference time measuring
* Completely remove torch(audio)
- Implement sinc resample in tinygrad
- Load audio via Soundfile
- Some cleanups
* move stuff to helper files
* Cleanup
* fix invalid kernel
* Cleanup & add more models
* Metal sounds good after master merge
- But Synthesizer pass became much slower
* drake weights now marked save
* do load/store in numpy
* no commas needed here
* remove extra newline
* call Tensor::where on object
* use Tensor::cat instead of numpy
* pull out first iteration
* remove Sequential, Dropout, GELU, TransposeLast
* cast during loading
* clean up attention
* remove SamePad
* Major cleanup / line reduction
- Finish implementation of GroupNormMasked
- Simplify parts of TransformerEncoder
- Simplify parts of Generator
- Move all helpers to common section
- Only use repeat_expand_left for interp after SpeechEncoder
- Moved SVC-specfic ContentVec impls up (canonically)
- Proper annotations for get_encoder
- Finished all TODOs
- Squashed some whitespaces
* clean up preprocess as well
* more straightforward bool expr
* add demo mode