Commit Graph

81 Commits

Tobias Fischer f9e32f2bb2
clip device fix (#6924) 2024-10-07 00:47:32 +08:00
chenyu 01a2d7316d
dtype=float in bert log_softmax for loss and accuracy (#6916) 2024-10-06 11:15:56 -04:00
George Hotz 8ca506ee37
remove the magic methods for moving between devices [pr] (#6881)
* remove the magic methods for moving between devices [pr]

* remove unneeded clang
2024-10-04 20:27:52 +08:00
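
A hedged sketch of the explicit style this change points toward: one `.to(device)` call instead of per-device magic methods (the device choice below is an assumption; any available backend works).

```python
from tinygrad import Tensor, Device

t = Tensor([1.0, 2.0, 3.0])
# name the target device explicitly instead of calling a per-device helper
t2 = t.to(Device.DEFAULT)  # .to returns a copy of the tensor on that device
print(t2.device)
```
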
George Hotz f4ec39fe58
switch symbolic from old to uops, final PR (#6872)
* switch symbolic from old to uops, final PR

* two wrong answers

* not needed resolves

* symbolic ops passes

* symbolic ops passes

* progress

* tests pass (almost)

* fix last test

* fix some tests

* global binding and unbinding

* Revert "global binding and unbinding"

This reverts commit 9456725630316487509980af20c6d2981de00bec.

* that test works now

* vars on uop doesn't recurse

* fix fuzzer

* update

* fix type

* fix gpt, it's UOp now

* ssimplify symbolics
2024-10-04 16:42:27 +08:00
chenyu c3c93f332a
symbolic bool raise ValueError when not sure [pr] (#6853) 2024-10-02 09:10:58 -04:00
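
An illustrative stand-in (a hypothetical class, not the repo's uops) for why an unresolvable symbolic comparison should raise rather than guess:

```python
class SymInt:
  # hypothetical symbolic int carrying value bounds
  def __init__(self, name:str, vmin:int, vmax:int): self.name, self.vmin, self.vmax = name, vmin, vmax
  def __lt__(self, n:int) -> bool:
    if self.vmax < n: return True     # provably true for every binding
    if self.vmin >= n: return False   # provably false for every binding
    raise ValueError(f"{self.name} < {n} is unresolvable for bounds [{self.vmin}, {self.vmax}]")

i = SymInt("i", 0, 10)
assert (i < 100) is True
try: i < 5                        # not statically decidable
except ValueError as e: print(e)  # raise instead of silently picking a branch
```
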
Tobias Fischer 33f7599158
Compute FID Score (#6802)
* compute fid score code

* cleaner s1 and m1 loading
2024-10-01 19:47:58 -04:00
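
For reference, the FID being computed is the standard Fréchet distance between Gaussians fit to Inception features, where m1/s1 are a mean and covariance; a minimal numpy/scipy sketch, not necessarily the PR's exact code:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_score(m1: np.ndarray, s1: np.ndarray, m2: np.ndarray, s2: np.ndarray) -> float:
  # FID = ||m1 - m2||^2 + Tr(s1 + s2 - 2*sqrt(s1 @ s2))
  covmean = sqrtm(s1 @ s2)
  if np.iscomplexobj(covmean): covmean = covmean.real  # drop tiny imaginary parts from numerical noise
  return float(np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))
```
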
chenyu 396c96357b
update mlperf bert scripts (#6755)
removed DISABLE_DROPOUT=1.
updated BS to 54, which works on tinyboxes with dropout enabled.
used bert's sparse_categorical_crossentropy, which takes a Tensor ignore_index, in the accuracy method
2024-09-25 23:55:05 -04:00
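
A hedged sketch of what an ignore_index does in the accuracy computation (plain numpy with illustrative names, not the script's code):

```python
import numpy as np

def masked_accuracy(logits: np.ndarray, targets: np.ndarray, ignore_index: int = -1) -> float:
  # positions labeled ignore_index (e.g. padding / non-masked tokens) don't count
  preds = logits.argmax(axis=-1)
  mask = targets != ignore_index
  return float((preds[mask] == targets[mask]).mean())
```
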
samm393 19c11792fd
Flux.1 (#6334)
* initial commit

* whitespace

* get rid of torch import

* indentation

* less hardcoding

* add flux.1-dev

* jit

* no double

* t5 tidy up

* validation image

* reuse sdxl autoencoder

* typing changes

* empty lines

* remove unneeded comments

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-24 10:08:04 +08:00
Tobias Fischer c1bbd15bd9
Sharded SDXL Inference (#6328)
* initial sharding fixes

* sigma device fix

* emptyline space fix

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-21 01:26:43 -04:00
George Hotz 8f6d0485e7 hotfix: resnet to obj.device 2024-09-06 13:06:02 +08:00
George Hotz 9d72119a0c
minor resnet cleanups (#6382)
* minor resnet cleanups

* that should have been long

* jit

* meh
2024-09-06 12:50:21 +08:00
Tobias Fischer 3517aa89d9
sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
Tobias Fischer 211bfb6d8a
fixed batched clip computation (#6292) 2024-08-26 20:48:15 -04:00
Tobias Fischer 331b0f5477
new clip gather (#6277) 2024-08-25 19:27:24 -04:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
wozeparrot d269bc95fa
faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
George Hotz bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
David Hou 9a485f36e4
shard kvcache (#5830) 2024-07-30 20:29:54 -07:00
George Hotz 4e89d45513 hotfix: put contiguous back in llama 2024-07-30 18:43:48 -07:00
George Hotz 21c5e8e1b7
extreme llama speed, 57.34 tok/s (#5827)
* extreme llama speed

* mergable
2024-07-30 18:32:09 -07:00
Tobias Fischer 72da3fe7e6
added clip vision model (#5595)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-19 18:35:51 -04:00
Tobias Fischer 85d4ca7caa
FID Inception Model (#5516)
* added model impl

* minor cleanups

* extracted weights loading into from_pretrained

* reorganized model for better weight loading

* removed lru cache for state dict loading
2024-07-16 23:12:03 -04:00
wozeparrot fa873df9c1
bring tinychat more inline with tinyos' version (#5358) 2024-07-10 13:13:52 -07:00
Tobias Fischer 0c3a35e5c2
Stable Diffusion v2 Inference (#5283)
* model implementation

* clip fix, more qol options
2024-07-03 22:47:10 -04:00
chenyu b2c3a28a5e
nn.RMSNorm (#5272)
the norm itself does not add significant value as a Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
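
For reference, a minimal sketch of the usual RMSNorm formulation (assumed to match the nn module's behavior, not copied from it):

```python
from tinygrad import Tensor

class RMSNorm:
  def __init__(self, dim:int, eps:float=1e-6):
    self.eps, self.weight = eps, Tensor.ones(dim)
  def __call__(self, x:Tensor) -> Tensor:
    # x / sqrt(mean(x^2) + eps), then a learned per-channel scale
    return x * (x.square().mean(axis=-1, keepdim=True) + self.eps).rsqrt() * self.weight
```
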
Tobias Fischer 8c9c1cf62f
Pulled CLIP and UNet into Separate Files (#5253)
* pulled clip and unet into separate files

* reference cleanup, lru cache fix

* better pool indexing
2024-07-01 22:33:01 -04:00
George Hotz 14980f79dd hotfix: unbreak llama 2024-06-30 15:27:54 -07:00
George Hotz 3df47bc21e
OpenELM + repeat_interleave (#5234)
* start writing openelm

* progress...hit bug

* repeat_interleave support

* gqa

* add rotary embedding

* spp

* i think it runs correctly

* broken

* output is good now

* cleanups

* no io_uring on android
2024-06-30 15:18:39 -07:00
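
The repeat_interleave added here repeats each element consecutively (unlike repeat, which tiles the whole tensor); it is what expands shared KV heads for GQA. A hedged sketch of the semantics via reshape/expand (the helper name is hypothetical):

```python
from tinygrad import Tensor

def repeat_interleave_last(x: Tensor, repeats: int) -> Tensor:
  # [..., n] -> [..., n*repeats], each element repeated `repeats` times in place
  return x.unsqueeze(-1).expand(*x.shape, repeats).reshape(*x.shape[:-1], x.shape[-1]*repeats)

print(repeat_interleave_last(Tensor([1, 2, 3]), 2).numpy())  # [1 1 2 2 3 3]
```
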
reddyn12 f1c7944c44
Fix batchnorm shapes for resnet.load_pretrained (#5167)
* Fix batchnorm shapes

* make it general reshape
2024-06-26 18:44:10 -04:00
chenyu e468601226
update llama attention casting (#5096)
* update llama attention casting

updated scaled_dot_product_attention middle cast and removed hard-coded half in llama attention.

* fix that
2024-06-22 10:57:17 -04:00
chenyu 8bd6cb9511
update llama model RMSNorm casting (#5095)
following the original implementation, cast back to the input dtype before multiplying by weight. slightly faster
https://github.com/meta-llama/llama/blob/main/llama/model.py
2024-06-21 23:02:04 -04:00
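
The cast placement matters: normalize in float32 for stability, then cast back to the input dtype so the weight multiply runs in half precision. A hedged sketch, not the model file's exact code:

```python
from tinygrad import Tensor

def rmsnorm_cast(x: Tensor, weight: Tensor, eps: float = 1e-5) -> Tensor:
  xf = x.float()  # do the reduction in float32
  normed = xf * (xf.square().mean(axis=-1, keepdim=True) + eps).rsqrt()
  return normed.cast(x.dtype) * weight  # cast back before multiplying by weight
```
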
chenyu e2c5054bdd
update resnet.load_from_pretrained (#5040) 2024-06-18 16:29:22 -04:00
chenyu 67e8df4969
remove numpy from dtype (#4969)
replaced all dtype.np with _to_np_dtype defined in tensor.py.

after this, the only numpy usages are (1) Tensor(np.ndarray), (2) construct .numpy() output, (3) numpy random buffer
2024-06-14 15:38:45 -04:00
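
A hedged sketch of the mapping-function idea (an illustrative subset, not the actual tensor.py contents):

```python
import numpy as np
from tinygrad import dtypes

# map tinygrad dtypes to numpy dtypes at the boundary, instead of storing a .np field on each dtype
_NP_MAP = {dtypes.float32: np.float32, dtypes.float16: np.float16,
           dtypes.int32: np.int32, dtypes.uint8: np.uint8}

def _to_np_dtype(dt):
  return _NP_MAP.get(dt)
```
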
Elias Wahl d2e3c391e8
Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
Elias Wahl 04e237328b
Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
chenyu 31358cbea5
change Tensor.stack to method (#4719) 2024-05-24 17:04:19 -04:00
chenyu ae861325ce
update llama sample for mac 32 input buffer limit (#4662)
set default sampling params in the function call to 0, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
wozeparrot b144d4b460
new llama3 example (#4576) 2024-05-19 22:42:23 -07:00
chenyu a65c8de735
move .half() llama freq_cis to the end of sin and cos (#4587)
otherwise the arange contains inf if either dim or the context length exceeds half.max
2024-05-14 15:00:18 -04:00
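
The reason for the ordering: float16 tops out at 65504, so building positions or frequencies in half overflows for long contexts; compute sin/cos in float32 and cast only the result. A hedged sketch in a simplified sin/cos form, not the model file's exact code:

```python
import math
from tinygrad import Tensor, dtypes

def freqs_sin_cos(dim: int, end: int, theta: float = 10000.0):
  # exp(-i*ln(theta)/dim) == 1/theta^(i/dim); everything stays float32 until the end
  freqs = (Tensor.arange(0, dim, 2).float() * (-math.log(theta) / dim)).exp()
  angles = Tensor.arange(end).float().unsqueeze(1) * freqs.unsqueeze(0)
  return angles.sin().cast(dtypes.half), angles.cos().cast(dtypes.half)  # .half() only at the end
```
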
wozeparrot d2c347fc74
faster gather for bert (#4526) 2024-05-10 22:28:48 -07:00
Elias Wahl 27613dd881
MLPerf BERT: Main training loop (#4288)
* BERT language modeling head + trunc normal initializers

* add train loop + helpers

* shuffle in dataloaders + slight changes in main loop

* beam change

* Minor changes

* random.shuffle

* HParam update

* Use deque for dataloader

* wandb bert project name

* half fixes

* BENCHMARK + remove epoch

* cast + print()

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-29 14:35:27 -04:00
Elias Wahl 2ecd61e3e2
monkey patching (#4214) 2024-04-18 19:20:52 -04:00
George Hotz e79a11b99c hotfix: revert llama change 2024-04-10 20:13:15 -07:00
George Hotz 2e6c39b0b2
Do less realizes (#4141)
* less realize

* corealize jit inputs

* prints

* print before we run
2024-04-10 19:50:50 -07:00
chenyu f8dc82a8a7
use single tensor for llama kv cache (#4108)
similar to the optimization in gpt2
2024-04-08 00:38:32 -04:00
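
A hedged sketch of the single-buffer idea (shapes and names are assumptions): keep k and v stacked in one preallocated tensor and assign each step's slice into it, instead of managing two growing caches.

```python
from tinygrad import Tensor

class KVCache:
  def __init__(self, bsz: int, max_context: int, n_kv_heads: int, head_dim: int):
    # one buffer holding k (index 0) and v (index 1), allocated once
    self.cache = Tensor.zeros(2, bsz, max_context, n_kv_heads, head_dim).contiguous().realize()
  def update(self, k: Tensor, v: Tensor, start_pos: int) -> Tensor:
    # write this step's k/v into the [start_pos, start_pos+seqlen) window of the shared buffer
    seqlen = k.shape[1]
    self.cache.shrink((None, None, (start_pos, start_pos + seqlen), None, None)).assign(k.stack(v)).realize()
    return self.cache
```
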
chenyu 92c0675ccf
setitem initial support (#4093)
* wip setitem

it's an eager assign to the output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
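
A hedged usage sketch of what initial setitem support enables (simple slice assignment; fancier indexing may be out of scope for an initial version):

```python
from tinygrad import Tensor

t = Tensor.zeros(4, 4).contiguous()
t[2:3] = Tensor.ones(1, 4)  # eager assign into a view of t
print(t.numpy())            # row 2 is now all ones
```
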
David Hou 4b95350c41
fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto cuz bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
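
The loss scaler referenced above, sketched minimally (a hypothetical helper, not the training script's code): multiply the loss before backward so small fp16 gradients don't flush to zero, then divide the scale back out of the grads, in float, before the optimizer step; fp32 runs skip the scale entirely.

```python
from tinygrad import Tensor

def backward_with_loss_scale(loss: Tensor, params: list, scale: float = 128.0):
  (loss * scale).backward()     # the scale itself stays a python float
  for p in params:
    if p.grad is not None:
      p.grad = p.grad / scale   # unscale so the optimizer sees true gradients
```
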
George Hotz 150ea2eb76
create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
George Hotz 4c4d3cb3e3
restrict assignment to base (#3809)
* restrict assignment to base

* add some restrictions there

* more restrictions
2024-03-18 15:33:06 -07:00
chenyu 5ac1fa933f
apply the same fix_bf16 in llama and coder (#3789)
* apply the same fix_bf16 in llama and coder

did not realize the same logic was in llama too.
really fix #2775

* flag for native SUPPORT_BF16 cast
2024-03-17 21:25:24 -04:00
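
For context, the usual fix_bf16 trick when a backend lacks a native bf16 cast (hence the SUPPORT_BF16 flag): since bfloat16 is the top half of a float32, widen the raw bits and shift them into the high half. A hedged sketch assuming t holds bfloat16 data:

```python
from tinygrad import Tensor, dtypes

def bf16_to_float32(t: Tensor) -> Tensor:
  # reinterpret the 16 raw bits, widen to 32, move them into the upper half of a float32
  return (t.bitcast(dtypes.uint16).cast(dtypes.uint32) * (1 << 16)).bitcast(dtypes.float32)
```
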