Commit Graph

396 Commits

Author SHA1 Message Date
Dat D. Nguyen ae9529e678
chore: remove redundant noise in stable diffusion example (#1910) 2023-09-24 21:33:45 +08:00
Gijs Koning b8ff20ffe4
Gpt2 (#1896)
* small helps

* got something working

* faster?

* faster yes

* cleanup

* cleanup

* cleanup

* Fix non jit

* Fix fp16 and some cleanup

* Fix fp16 and some cleanup

* cleanup

* similar to master

* cleanup
2023-09-22 20:14:47 +08:00
Yixiang Gao cb5d6576cb
cifar step time 65ms while staying above 94% (#1888)
* change reduceop heuristics

* add model ema and jit hack

* add ema eval

* have to create a duplicate eval function for jit

* remove manual seed

* 94% achievable with normal eval

* ema is outputting the same results as normal

* fix ema bug

* ema achieves 94% with fixed seed

* multigpu tested

* constant fold decay, fix jit, adjust message for multigpu

* pull SpeedyResNet out of train_cifar()
2023-09-21 11:19:32 +08:00
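
A minimal sketch of the model-EMA idea from the bullets above ("add model ema", "constant fold decay"), written against plain numpy rather than the repo's code; the decay value is illustrative.

```
import numpy as np

class EMA:
  def __init__(self, params, decay=0.999):
    self.decay = decay
    self.shadow = [p.copy() for p in params]   # EMA copies of the live weights

  def update(self, params):
    for s, p in zip(self.shadow, params):
      # s = decay * s + (1 - decay) * p, done in place
      s *= self.decay
      s += (1.0 - self.decay) * p

weights = [np.zeros(3, dtype=np.float32)]
ema = EMA(weights)
weights[0] += 1.0
ema.update(weights)
print(ema.shadow[0])   # a small step toward the live weights
```
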
nimlgen 4c31dfafb3
add seed to gpt-2 (#1869) 2023-09-15 17:34:14 -04:00
segf00lt 9e8c1dbf34
patch to remove hack from stable_diffusion.py (#1814)
* patch to remove hack from stable_diffusion.py

* sorry linter

* realize after assign?

* float16 broken in llvmlite use float64 for now

* int32

* idiot forgot to change test array dtype
2023-09-08 09:26:50 -07:00
chenyu ebcda8a714
Move var_vals from ShapeTracker to LazyBuffer (#1819) 2023-09-08 09:25:10 -07:00
George Hotz 722823dee1 stable diffusion: force fp16 free 2023-09-06 15:11:05 -07:00
Yixiang Gao 22cf15e9d0
convert function into tinygrad (#1803) 2023-09-06 14:41:26 -07:00
Pavol Rusnak 52a92bf95d
use class Foo: instead of class Foo(): (#1797)
* use class Foo: instead of class Foo():

* add ruff linter, copy settings from .flake8 to ruff.toml
2023-09-06 12:20:25 -07:00
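
The style change itself, shown with a placeholder class name:

```
class Foo():   # before: empty parentheses are redundant
  pass

class Foo:     # after: equivalent, and the style the commit standardizes on
  pass
```
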
badcc fd25792c8b
Ensure freqs as type float32 in freqs_cis (#1798) 2023-09-06 10:24:15 -07:00
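
A hedged numpy sketch of the LLaMA-style freqs_cis precomputation this fix touches; the relevant point is keeping the frequencies in float32 (shapes and the theta default follow the usual LLaMA reference, not necessarily this repo's exact code).

```
import numpy as np

def freqs_cis(dim, end, theta=10000.0):
  # rotary-embedding frequencies; cast to float32 before building the rotations
  freqs = 1.0 / (theta ** (np.arange(0, dim, 2)[: dim // 2] / dim))
  angles = np.outer(np.arange(end), freqs).astype(np.float32)
  return np.cos(angles) + 1j * np.sin(angles)   # complex rotations

print(freqs_cis(8, 4).shape)   # (4, 4)
```
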
George Hotz f67638b27a delete broken DDPG example 2023-09-06 08:01:12 -07:00
Francis Lam 0379b64ac4
add seed option to stable_diffusion (#1784)
useful for testing correctness of model runs
2023-09-05 19:45:15 -07:00
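
Roughly what a --seed flag does in these examples: fix the RNG before sampling so two runs produce the same output. A sketch assuming tinygrad's Tensor.manual_seed; the flag wiring here is illustrative.

```
import argparse
from tinygrad.tensor import Tensor

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None, help="fix RNG for reproducible runs")
args = parser.parse_args()
if args.seed is not None:
  Tensor.manual_seed(args.seed)   # the initial latent noise is now deterministic
```
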
George Hotz fb1cc6bf4b
llama jit is default, print tok/sec (#1774)
* llama jit is default, print tok/sec

* jit not default in CI
2023-09-05 10:12:16 -07:00
Yixiang Gao 66a6bbd029
codellama (#1702)
* add codellama with pre-downloaded weights

* add rope_theta, fix param

* fix test

* add 7B-Python

* add 7B-Instruct

* replace single quotes with double

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-09-02 08:45:12 -07:00
chenyu a2745819f6
faster gpt2 jit path and gpt2 in test_real_world (#1738) 2023-09-02 08:39:12 -07:00
geohotstan 94b1257f5e
Changed DEVICE to Device.DEFAULT in deep_deterministic_policy_gradient (#1715)
* added device in optim and deep

* oops forgot to del print code

* use Device.DEFAULT instead

* removed device
2023-08-31 07:08:51 -07:00
nimlgen b5cf274da3
remove memory peak for quantized llama (#1720) 2023-08-30 16:32:30 -04:00
chenyu e4eb5d55c7
critical realize for unjitted llama (#1718) 2023-08-30 14:52:32 -04:00
George Hotz cd7ceed914 gpt2: print total instead of sync time 2023-08-30 10:59:42 -07:00
Karan Handa a8aa13dc91
[ready] Replacing os with pathlib (#1708)
* replace os.path with pathlib

* safe convert dirnames to pathlib

* replace all os.path.join

* fix cuda error

* change main chunk

* Reviewer fixes

* fix vgg

* Fixed everything

* Final fixes

* ensure consistency

* Change all parent.parent... to parents
2023-08-30 10:41:08 -07:00
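
A small illustration of the os-to-pathlib migration described above; the paths are made up for the example.

```
import os
from pathlib import Path

base = "/tmp/weights"
old = os.path.join(base, "model", "llama.bin")   # os.path style
new = Path(base) / "model" / "llama.bin"         # pathlib style
assert old == str(new)

# "Change all parent.parent... to parents": climbing multiple directory levels
p = Path("/a/b/c/d.txt")
assert p.parent.parent == p.parents[1]
```
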
chenyu ac183568be
llama JIT python runtime speedup (#1633)
* no JIT call in TransformerBlock

* idea

* move 2 reshapes to jitted function

shrink inside jitted too, 6.3ms

remove back reshapes, 5.5ms

isinstance -> __class__ 4.99ms

* think

revert ops_gpu.py

revert symbolic.py too

PYOPENCL_COMPILER_OUTPUT=1

* cleanup

* fix cache shape for conversational model

only reshape if start_pos > 0

* small cleanup

* include var_vals.keys() to st.key

* add comments

* llama small update

* everything jitted again, similar structure to gpt2

* fix typing

* add TODO for in place update cache
2023-08-30 07:51:05 -07:00
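
The "isinstance -> __class__" line above refers to a small Python-level micro-optimization: on a hot path, an exact type check via __class__ avoids isinstance's subclass handling. Illustrated generically:

```
x = 3

if isinstance(x, int):     # also matches subclasses of int (e.g. bool)
  pass
if x.__class__ is int:     # exact type check only; marginally cheaper per call
  pass
```
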
Umut Zengin 1682e9a38a
Fix: Stable Diffusion index (#1713) 2023-08-30 00:21:10 -04:00
George Hotz aa7c98722b
sd timing (#1706) 2023-08-28 20:22:57 -07:00
nimlgen 1c0449e190
add cache collector (#1595)
* init cache collector

* add test_cache_collector.py

* switch GlobalCounters.cache to CacheCollector

* init jit models test

* jitted SD

* add debug msg to print loaded bufs count

* moved cache collector to jit

* clearer SD

* no double device import
2023-08-28 19:59:55 -07:00
Olivier Chafik ee6d8de2dc
Llama: load models in HuggingFace format (incl. indexed, safetensors) (#1583) 2023-08-28 15:11:40 -04:00
Yixiang Gao 9d93a82354
remove FAKEDATA (#1685) 2023-08-26 20:15:54 -04:00
Yixiang Gao 173850f599
fix CIFAR jit (#1657)
* update mask function

* kept 94% with the new fetcher

clean up batch fetcher

* 94.04% without cutmix

* 94.04% with cutmix

* move batch fetcher to avoid fetching additional batch last STEP
2023-08-24 16:14:40 -07:00
George Hotz a6d842af7a
move device to ops (#1646)
* move device to ops

* mlops types

* 2 lines
2023-08-23 08:30:17 -07:00
George Hotz 643cbdfd50
make embedding and GPT-2 fast (#1631)
* make embedding fast

* jit more, variable shape support

* print mem bw
2023-08-22 15:14:38 -07:00
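
One common way to make an embedding lookup fast on GPU-style backends is to turn the index lookup into a one-hot matmul; a numpy sketch of that idea (not necessarily exactly what this commit does):

```
import numpy as np

def embedding(weight, idx):
  vocab_size = weight.shape[0]
  one_hot = (np.arange(vocab_size) == idx[..., None]).astype(weight.dtype)
  return one_hot @ weight   # (..., vocab) @ (vocab, dim) -> (..., dim)

weight = np.random.randn(10, 4).astype(np.float32)
idx = np.array([1, 7, 7])
assert np.allclose(embedding(weight, idx), weight[idx])
```
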
George Hotz d3c401ba3c llama quantize: scale uses mul, not div 2023-08-22 11:48:56 -07:00
chenyu 89e13f2f04
support symbols in shrink (#1611) 2023-08-22 09:08:21 -07:00
George Hotz 718ced296c
move state to nn/state (#1619) 2023-08-22 07:36:24 -07:00
George Hotz 4f459841bc
Symbolic JIT for GPT2 (#1613)
* not fast yet

* simpler

* symbolic jit

* fp16 GOPS and GB
2023-08-21 19:44:57 -07:00
Umut Zengin f720682beb
np.argmax to Tensor.argmax (#1608)
* to tensor argmax

* removed keepdim

* training update
2023-08-21 15:22:29 -07:00
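
The switch above in a nutshell: compute argmax on the Tensor itself instead of round-tripping through numpy (assuming a numpy-like Tensor.argmax signature).

```
import numpy as np
from tinygrad.tensor import Tensor

logits = Tensor([[0.1, 2.0, -1.0]])
old = np.argmax(logits.numpy(), axis=-1)   # before: convert, then use numpy
new = logits.argmax(axis=-1).numpy()       # after: stay in tinygrad until the end
assert old[0] == new[0] == 1
```
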
George Hotz 4ea00bad38 track down llama bug 2023-08-21 15:14:21 -07:00
Yixiang Gao 4d54afb6df
sparse cat cross entropy (#1597)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs

* fix training loss

* add device
2023-08-21 14:14:54 -07:00
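
A minimal numpy sketch of sparse categorical cross entropy with log_softmax folded into the loss, as described in the bullets above (an illustration, not the tinygrad implementation).

```
import numpy as np

def sparse_cat_crossentropy(logits, labels):
  # log_softmax for numerical stability
  shifted = logits - logits.max(axis=1, keepdims=True)
  log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
  # "sparse": labels are class indices, not one-hot vectors
  return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
labels = np.array([0, 1])
print(sparse_cat_crossentropy(logits, labels))
```
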
George Hotz 2e60920317
Revert "sparse cat cross entropy (#1591)" (#1596)
This reverts commit f0ee850e98.
2023-08-21 10:04:26 -07:00
Yixiang Gao f0ee850e98
sparse cat cross entropy (#1591)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs
2023-08-21 09:56:41 -07:00
Yixiang Gao 8d6662a741
.cpu().numpy() -> .numpy() (#1594)
* .cpu().numpy() -> .numpy()

* restore ops_torch

* restore test_speed_v_torch
2023-08-21 09:53:29 -07:00
George Hotz b9feb1b743 fp16 support in stable diffusion 2023-08-20 05:37:21 +00:00
chenyu ae39cf84ab
Symbolic Shape JIT main PR (#1353)
* Symbolic Shape JIT

update tests

2 variables symbolic ops, adding more tests

test passing

cleanup

* more test cases

* single flag

* review update

* jit attention one piece

* realize

* symbolic_jit test for cuda

* old artifact

* works with cuda gpu but failed ci

* CUDACPU
2023-08-18 14:39:55 -07:00
wozeparrot 50decf0d45
train cifar using multigpu (#1529)
* feat: train cifar using multigpu

* feat: split eval batch across 5

* feat: cleaner allreduce

* feat: 93.88%

* feat: cleaner batch chunking from bert

* feat: cleaner grad sync

* feat: tinygrad argmax

* feat: make it work with different gpu counts

* feat: move some stuff into the normal __init__

* feat: autodetect gpu count

* feat: move import inside
2023-08-18 09:35:44 -07:00
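
The grad-sync idea behind the multigpu training above, reduced to its core: average each parameter's gradient across the per-GPU model copies. A plain-Python sketch, not the example's actual code.

```
import numpy as np

def allreduce_mean(grads_per_gpu):
  # grads_per_gpu: one list of gradient arrays per GPU, in matching order
  n = len(grads_per_gpu)
  return [sum(gs) / n for gs in zip(*grads_per_gpu)]

gpu0 = [np.array([1.0, 2.0])]
gpu1 = [np.array([3.0, 4.0])]
print(allreduce_mean([gpu0, gpu1]))   # [array([2., 3.])]
```
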
wozeparrot 55d95d1658
llama 70b (#1558)
* feat: llama 70b

* feat: llama 70b but simpler
2023-08-16 11:36:12 -07:00
JaSpa99 2fd7004980
Implementation of SoftVC VITS SVC model (#1371)
* [WIP]: implementation of SoftVC VITS SVC model

* fix typo

* fix whitespace

* Fully implement Generator & Synthesizer

- implement SineGen & SourceHnNSF to reconstruct source signal from F0
- source signal is added during Generator
- fix various typos
- start loading state dict for synthesizer

* Load Synthesizer weights

- Fix typos in Synthesizer
- Slightly modify vits::load_checkpoint to skip a specified layer
- Test with Saul Goodman model because Drake weights are on mega

* start work on ContentVec

- implement ConvFeatureExtractionModel for ContentVec
- start work on TransformerEncoder for ContentVec:
- this transformer probably needs its own MultiheadAttention implementation
- fix various typos in synthesizer
- add helpers to mimic the behavior of torch's ~ and % operators

* use normal and kaiming_normal

* Implement ContentVec

- load ContentVec weights and config from fairseq hyperparams
- use MultiHeadAttention from whisper.py
- TransformerSentenceEncoderLayer might still need some tweaking, will see during inference testing
- redid tilde()
- some cleanup

* rename the file so it can be imported

* forgot to lint

* use float() instead of cast()

* add contentvec256l9 and cleanup

* Implement SoVITS fully and run it

- Fully run sovits with .wav file
- Drake weights need to be manually downloaded for now
- Fix bugs
- Add examples/sovits_helpers
- Big TODO: INVALID Kernel for recordings > 4.5 secs

* temp fix for longer audio recordings

* Upsample no more torch

* cleanup & detailed inference time measuring

* Completely remove torch(audio)

- Implement sinc resample in tinygrad
- Load audio via Soundfile
- Some cleanups

* move stuff to helper files

* Cleanup

* fix invalid kernel

* Cleanup & add more models

* Metal sounds good after master merge

- But Synthesizer pass became much slower

* drake weights now marked safe

* do load/store in numpy

* no commas needed here

* remove extra newline

* call Tensor::where on object

* use Tensor::cat instead of numpy

* pull out first iteration

* remove Sequential, Dropout, GELU, TransposeLast

* cast during loading

* clean up attention

* remove SamePad

* Major cleanup / line reduction

- Finish implementation of GroupNormMasked
- Simplify parts of TransformerEncoder
- Simplify parts of Generator
- Move all helpers to common section
- Only use repeat_expand_left for interp after SpeechEncoder
- Moved SVC-specific ContentVec impls up (canonically)
- Proper annotations for get_encoder
- Finished all TODOs
- Squashed some whitespaces

* clean up preprocess as well

* more straightforward bool expr

* add demo mode
2023-08-13 19:43:23 -07:00
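
One way a tilde() helper like the one mentioned above can work: torch's ~ on a boolean mask is a logical NOT, which can be emulated on a 0/1 mask as (1 - x). Shown with numpy; the real helper operates on tinygrad Tensors and may differ.

```
import numpy as np

def tilde(mask):
  # logical NOT of a 0/1 mask, standing in for torch's ~ on bool tensors
  return 1 - mask

mask = np.array([1, 0, 1, 1])
print(tilde(mask))   # [0 1 0 0]
```
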
David Heidelberg 13659ac6fa
examples: numpy() array returns only one value, not an array (#1534)
Fixes issue:
```
    loss_cpu = loss.detach().numpy()[0]
               ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
```

Signed-off-by: David Heidelberg <david@ixit.cz>
2023-08-13 14:33:05 -07:00
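
The underlying shape issue, reproduced with plain numpy: a scalar loss comes back as a 0-dimensional array, so the old [0] index has nothing to index. The fix is to stop indexing; .item() below is one equivalent way to get the Python float.

```
import numpy as np

loss = np.array(0.25, dtype=np.float32)   # 0-dimensional, like loss.detach().numpy()
# loss[0]  -> IndexError: too many indices for array
loss_cpu = loss.item()                     # use the scalar directly instead
print(loss_cpu)
```
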
George Hotz 47f18f4d60
[New] SD: Refactor AttnBlock, CrossAttention, CLIPAttention to share code (#1516) (#1518)
* Refactor AttnBlock, CrossAttention, CLIPAttention to share code

* Reshape and transpose in loop

* Bugfix on attention mask

Co-authored-by: Jacky Lee <39754370+jla524@users.noreply.github.com>
2023-08-10 15:04:18 -07:00
George Hotz e3c6c0c6db
add GPT2 example (#1511) (#1514)
* add gpt2 to examples

* some cleanup

* fixes

* argparse + scaled_dot_product_attention

* add timing

* add to benchmark

Co-authored-by: YassineYousfi <yassine.y10@gmail.com>
2023-08-10 09:09:47 -07:00
George Hotz c82bd59b85
Revert "SD: Refactor AttnBlock, CrossAttention, CLIPAttention to share code (#1513)" (#1515)
This reverts commit 85e02311a2.
2023-08-10 09:08:51 -07:00
Jacky Lee 85e02311a2
SD: Refactor AttnBlock, CrossAttention, CLIPAttention to share code (#1513)
* Refactor AttnBlock, CrossAttention, CLIPAttention to share code

* Reshape and transpose in loop
2023-08-10 08:52:33 -07:00
Jacky Lee ef5f648e2f
Tensor.scaled_dot_product_attention to match torch, used in LLaMA, and tested (#1502)
* Implement scaled_dot_product_attention and test

* Support attn_mask

* Support is_causal too

* Use in llama

* Don't forget to reshape

* Set requires_grad=False for causal

* Remove staticmethod

* Remove extra spaces
2023-08-08 23:27:13 -07:00
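
For reference, torch-style scaled dot-product attention is softmax(QK^T / sqrt(d) + mask) V; a hedged numpy sketch covering the attn_mask and is_causal options named in the bullets (not tinygrad's implementation).

```
import numpy as np

def scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False):
  d = q.shape[-1]
  scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)       # (batch, seq_q, seq_k)
  if is_causal:                                        # mask out future positions
    seq_q, seq_k = scores.shape[-2:]
    causal = np.triu(np.full((seq_q, seq_k), -np.inf), k=1)
    scores = scores + causal
  if attn_mask is not None:
    scores = scores + attn_mask
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
  return weights @ v

q = k = v = np.random.randn(1, 4, 8)
print(scaled_dot_product_attention(q, k, v, is_causal=True).shape)   # (1, 4, 8)
```
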