Commit Graph

51 Commits

Author SHA1 Message Date
qazal 2094b3b327
graph ScheduleItems (#4224)
* graph schedules

* add logging

* inplace

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-19 16:17:11 +04:00
Elias Wahl 6eef8ee22a
Wikipedia download script for MLPerf BERT training (#4202)
* wikipedia download script

* add link

* checksum ValueError

* ops
2024-04-17 16:34:57 -04:00
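For context on the "checksum ValueError" fix: a download script like this typically hashes the fetched file and raises ValueError on mismatch. A minimal sketch of that pattern (the function name and chunking are illustrative, not the script's actual code):

```python
import hashlib
from pathlib import Path

def verify_checksum(path: Path, expected_sha256: str) -> None:
    """Hash the downloaded file in chunks and fail loudly on mismatch."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}: got {h.hexdigest()}, expected {expected_sha256}")
```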
George Hotz 8f749ae0eb
New docs are in mkdocs (#4178)
* start mkdocs

* simple docs for tensor

* more docs

* move those back

* more docs

* copy markdown extensions

* docs legacy

* docs building workflow

* fix showcase links

* only that?

* install tinygrad

* add docs to setup.py

* Delete examples/llm.c/data
2024-04-16 10:59:51 +04:00
Francis Lam 9d2273235c
search: BEAM_UOPS_MAX to prune candidates with too many uops (#4088)
* search: add better default settings for fast search

not the highest possible performance, but adequate for most usage

* search: revert BEAM_MIN_PROGRESS and BEAM_UPCAST_MAX default changes

also sneak a line into .gitignore for the unet3d dataset

* revert BEAM_MAX_TASKS_PER_CHILD change and fix uops max condition
2024-04-15 18:56:22 -04:00
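The idea behind BEAM_UOPS_MAX is to drop beam-search candidates whose linearized kernels expand into too many uops before they are ever timed. A rough sketch of that pruning step (the attribute and helper names are assumptions, not the exact tinygrad internals):

```python
import os

BEAM_UOPS_MAX = int(os.getenv("BEAM_UOPS_MAX", "0"))  # 0 means no cap

def prune_by_uop_count(candidates):
    """Skip beam candidates whose kernels expand into too many uops before timing them."""
    if BEAM_UOPS_MAX <= 0:
        return candidates
    # `lin.linearize().uops` stands in for however the linearizer exposes its uop list
    return [lin for lin in candidates if len(lin.linearize().uops) <= BEAM_UOPS_MAX]
```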
George Hotz f916aadaea external that test 2024-03-29 19:35:50 -07:00
reddyn12 9b5e15db6e
Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisons

* rm file

* rename file

* use Tensor.einsum and make default model 370M

* Cleaned code and made a comparison test

* Simplify pull request. Only has 1 mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the logic for the conv_state assign messes up. This can be fixed by padding the tokenized array to a minimum length of 4. I padded using the empty-string token, but I don't know if proper practice is to use the PAD token (a short sketch follows this entry).

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
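On the "Fix conv assign" note above: the fix pads short prompts so the conv_state assignment never runs out of tokens. A hedged illustration of left-padding a tokenized prompt to the minimum length of 4 (the pad value here is a placeholder; the PR pads with the empty-string token):

```python
MIN_PROMPT_LEN = 4  # conv_state assignment assumes at least this many tokens

def pad_prompt(tokens: list[int], pad_token: int = 0) -> list[int]:
    """Left-pad a tokenized prompt so the conv_state slice assignment lines up."""
    if len(tokens) >= MIN_PROMPT_LEN:
        return tokens
    return [pad_token] * (MIN_PROMPT_LEN - len(tokens)) + tokens

# pad_prompt([42, 7]) -> [0, 0, 42, 7]
```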
George Hotz 2c6f2e899d
No extra vars call (#3054)
* remove unused reciprocal

* comment

* remove unneeded call to vars

* free speedup
2024-01-09 09:52:58 -08:00
George Hotz 77c98a1543 hotfix: remove weights directory 2024-01-03 13:40:39 -08:00
qazal c704a77ca0
green dtypes ALU tests (#2617)
* dtypes alu test

* those types don't exist in torch

* floats

* more tests

* disable those

* a couple unary tests

* skip float16 tests in CI for GPU

* fix LLVM bool add True+True=1+1=2 which truncates to False in native LLVM

* remove hardcoded float for LLVM ALU fns

* less sensitive atol for fp32: 1e-10 is flaky and sometimes failed even after reverting the merge commit for non-fp32 math; nothing has changed in our kernels for fp32.

* return on overflows

* fix CUDA exp2

* compute results of op regardless of bounds in a python backend

* skip fp16 in GPU and CUDACPU

* fuzz a smaller range in the float_midcast_int32 test

I sampled this and we overflow ~70% of the time.
Because numpy behaves differently on different devices for overflows, and Metal seems to do the same, I'm opting to eliminate the non-determinism here.

* remove CUDA exp2 overload; it's already there now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-12-06 08:15:46 -08:00
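On the LLVM bool-add fix: an i1 add computes 1+1=2 and keeps only the low bit, so True+True comes back as False, while the expected semantics saturate to True. A small numpy-only illustration of the mismatch and the usual workaround of widening before the add (not the actual codegen change):

```python
import numpy as np

a, b = np.bool_(True), np.bool_(True)

print(a + b)                             # True: numpy treats bool add as logical OR
print((1 + 1) % 2 == 1)                  # False: an i1 add wraps modulo 2, dropping the carry
print(bool(np.int32(a) + np.int32(b)))   # True: widen first, then cast back to bool
```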
George Hotz 2c363b5f0b
new style device (#2530)
* cpu tests pass

* torch works

* works

* metal works

* fix ops_disk

* metal jit works

* fix openpilot

* llvm and clang work

* fix webgpu

* docs are rly broken

* LRU works on metal

* delete comment

* revert name to ._buf. LRU only on Compiled

* changes

* allocator

* allocator, getting closer

* lru alloc

* LRUAllocator

* all pass

* metal

* cuda

* test examples

* linearizer

* test fixes

* fix custom + clean realize

* fix hip

* skip tests

* fix tests

* fix size=0

* fix MOCKHIP

* fix thneed

* copy better

* simple

* old style metal copy

* fix thneed

* np reshape

* give cuda a device
2023-11-30 17:07:16 -08:00
George Hotz d87a246439
move to new cached fetch (#2493)
* move to new cached fetch

* extra.utils is over

* loads

* bump download cache

* bump timeout
2023-11-28 17:36:55 -08:00
George Hotz cbb8486779
ResNet training changes (update benchmark) (#2390)
* default arg for chunk

* bring back to_

* good changes

* new set

* unused hash

* fix optim

* new torch loader

* fix test lr scheduler
2023-11-22 17:41:12 -08:00
Ahmed Harmouche 265304e7fd
Stable diffusion WebGPU port (#1370)
* WIP: Stable diffusion WebGPU port

* Load whole model: split safetensor to avoid Chrome allocation limit

* Gitignore .DS_Store, remove debug print

* Clip tokenizer in JS

* WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS

* e2e stable diffusion flow

* Create initial random latent tensor in JS

* SD working e2e

* Log if some weights were not loaded properly

* Remove latent_tensor.npy used for debugging

* Cleanup, remove useless logs

* Improve UI

* Add progress bar

* Remove .npy files used for debugging

* Add clip tokenizer as external dependency

* Remove alphas_cumprod.js and load it from safetensors

* Refactor

* Simplify a lot

* Dedup base when limiting elementwise merge (webgpu)

* Add return type to safe_load_metadata

* Do not allow run when webgpu is not supported

* Add progress bar, refactor, fix special names

* Add option to choose from local vs huggingface weights

* lowercase tinygrad :)

* fp16 model dl, decompression client side

* Cache f16 model in browser, better progress

* Cache miss recovery

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-03 18:29:16 -07:00
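On "split safetensor to avoid Chrome allocation limit": the weights are shipped in several parts so no single browser-side allocation exceeds Chrome's limit. A rough offline sketch of byte-level chunking (the real port splits the safetensor itself; chunk size and part naming here are assumptions):

```python
from pathlib import Path

CHUNK_BYTES = 1 << 30  # ~1 GiB per part, assumed to stay under the browser's allocation limit

def split_weights(path: Path) -> list[Path]:
    """Write fixed-size chunks next to the original file so the page can fetch them one by one."""
    parts = []
    with open(path, "rb") as f:
        for i, chunk in enumerate(iter(lambda: f.read(CHUNK_BYTES), b"")):
            part = path.with_name(f"{path.name}.part{i}")
            part.write_bytes(chunk)
            parts.append(part)
    return parts
```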
George Hotz 16ca8410f8
op logger + replay (#2021)
* logops

* fix dtype printing

* needs inf

* ops dataset

* minor improvements

* 12k kernels

* opt can compile

* graph flops
2023-10-08 15:10:18 -07:00
George Hotz fdd7f282cb
Reenable tensor cores for self-hosted Mac CI (#1717)
* debug 5 matmul

* allow tensor cores in CI

* tensor cores on arm64

* put debug back
2023-08-30 07:53:04 -07:00
George Hotz 739f327d2d
Shorter (#1582)
* deleting lines

* remove insert dims

* if statement is never hit

* bug fixes
2023-08-20 08:12:16 -07:00
wozeparrot 7e7c9001e9
distributed world (#1481)
* feat: world

* feat: tests

* feat: no more backwards

* feat: recv into

* feat: whoops

* feat: test in ci

* feat: some debug logging

* feat: workflow naming

* feat: need to set pythonpath

* feat: just send to same device
2023-08-10 10:00:51 -07:00
George Hotz bf21aec81f
do benchmarking (#1451)
* do benchmarking

* system

* artifact

* go

* name artifact
2023-08-05 23:35:01 -07:00
George Hotz fc2303e520 gitignore in weights 2023-08-02 16:26:41 +00:00
Diogo 4dc8595069
simple exporting models (#1344)
* unified exporting

* json exporting

* ignore more

* simplified buffer export

* added dtypes

* added assert

* swift example

* fix tests

* linter

* remove whitespace

* fixed tests

* remove swift example

* remove unintended changes

* allow callable models to be used

* whitespace

* more readable json export

* name change

* whitespace

* whitespace
2023-08-01 09:35:48 -07:00
Yixiang Gao 6e62dcfbf3
add check global dim limit in linearizer (#1299)
* need a better place for reshape and permute

* add permutation

* cuda fixed

* clean up

* enable nvidia GPU with global max

* fix order

* fix CI

* add check for global dim limit but need refactor

* refactor

* fix ignore
2023-07-31 11:14:54 -07:00
George Hotz d6637623e3 torch test touchup 2023-07-19 09:37:23 -07:00
Diogo a9a1df785f
Webgpu support (#1077)
* initial commit

* 81 passing

* 105 passing tests

* 148 passing

* CI tests

* install dep on ci

* try opencl pkgs

* try using vulkan

* down to only 6 failing

* refactor

* cleaning up

* another test skipped due to buffer limit

* linter

* segfault

* indent fix

* another segfault found

* small touchups

* Fix max and maxpool tests

* Add constant folding

* Add javascript export script

* better asserts in codegen

* manual upcasting

* reverted token type change

* skip safetensor test due to unsupported type

* Fix efficientnet and all other model tests

* Remove np copy

* fixed indent and missing import

* manually destroy the buffer

* revert back to length

* linter errors

* removed extra val

* skip broken tests

* skipping more tests

* Make the page pretty

* Save model weights as safetensor

* Fix imagenet to c test

* Fix second imagenet to c bug

* Async and parallel kernel compilation

* workgroup support

* reversed local size

* fixed non local bug

* correct local groups

* ci experiment

* removed typo

* Fix define local by using shared memory

* Refactor

* try running on mac

* match metal tests

* add more workers

* scope down tests

* trying windows runner

* fixed windows env

* see how many it can do

* merged master

* refactor

* missed refactor

* increase test suite coverage

* missing import

* whitespace in test_efficientnet.py

* getting there

* fixed reset

* fixed bufs

* switched to cstyle

* cleanup

* min/max rename

* one more linter issue

* fixed demo

* linter

* testing ci chrome

* add unsafe webgpu arg

* add build step

* remove WEBGPU from cmd line

* use module

* try forcing directx

* trying forced metal backend

* temp disable conv2d for CI

* disable conv_transpose2d

---------

Co-authored-by: 0x4d - Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-07-12 12:52:06 -07:00
terafo aa60feda48
Fix naming conflict with huggingface datasets (#1161)
* Rename in files

* Move files

* Moved to extra/datasets as suggested

* Changes to files

* Fixed stupid mistake

---------

Co-authored-by: terafo <terafo@protonmail.com>
2023-07-07 10:43:44 -07:00
George Hotz 1f5d45ca8c imagenet loader minor cleanups 2023-06-28 05:08:09 +00:00
George Hotz b78addf2f8
Whisper (#919)
* no whispering yet

* whispering

* live whisper

* small support
2023-06-03 18:55:14 -07:00
Diogo 1a5d72f812
Onnx ops And, Or, Xor, Not (#847)
* onnx and, or, xor, not

* added bool type to llvm and clang

* removed float conversion

* switched where op to use tensor func
2023-05-29 11:09:20 -07:00
Jacky Lee 5d212864b5
Add MLPerf UNet3D model (#775)
* Add ResNet inference test and cannon

* Test with ResNet50

* test_car works with resnet fix

* Add KiTS19 dataset

* KiTS19: Implement iterate

* No batch load for this dataset

* Save results on iterate

* Implement dice score

* Add data prep and eval functions

* Resolve shape issue

* Conversion works but wrong values

* Segfaults when load_from_pretrained is called

* Fix segfault and assign properly

* Final result generated, though very slow

* Store and load final result to save time

* Fix typo in finalize

* Score computes

* More bug fixes, dice score is very low

* Working broken code

* Assign output values to result

* Getting a much higher score now

* Fix dataset preprocessing

* Mean DICE score of 88.5

* Ugh, typo

* Attempt to reimplement model

* Rename layers

* Tiny model works, kinda

* Accuracy? gone

* Implement InstanceNorm and match torch

* Test instance norm 2d and 3d

* Combined input block with downsample block

* Tiny model works, support strided convtranspose

* Commands to download dataset

* Clean up a bit

* unet3d_v2 -> unet3d

* Remove duplicated code

* Oops, put tests back
2023-05-28 20:38:19 -07:00
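The Dice score implemented in this PR measures overlap between a predicted segmentation and the ground truth: Dice = 2|A∩B| / (|A| + |B|). A minimal numpy sketch for binary masks (the PR evaluates soft per-class predictions, so treat this as illustration only):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Dice = 2*|pred & target| / (|pred| + |target|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# identical masks score 1.0, disjoint masks score ~0.0
```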
George Hotz 46327f7420 bugfix for stable diffusion 2023-05-29 00:03:09 +00:00
George Hotz 59f9bcd4a4
Disktensors! (#819)
* make empty a real thing

* start ops_disk

* disk tensor works

* interpreted cleanup

* slice write to disk

* preprocess imagenet

* fix custom function
2023-05-28 15:40:37 -07:00
wozeparrot 67de3aa1de
Add mlperf bert model (#803)
* feat: add mlperf bert model

* feat: switch to nn.Embedding

* clean+fix: fix formatting

* feat: add simple downloader

* feat: metrics

* feat: don't actually need exact match

* feat: doing a run

* feat: set eps on the layernorms

* clean+fix: cleaner impl + hopefully fixed

* feat: move dataset initialization into iterate

* feat: move tokenizer out of iterate

* clean+fix: cleaner + working

* clean: cleanup

* fix: fix metrics

* feat: need to use original bert gelu + download vocab

* feat: make directory if it doesn't exist yet

* feat: jit go brrr
2023-05-27 14:53:32 -07:00
George Hotz a968c4c3a4
Cleanup mlperf (#797)
* improve factorization

* cleanups
2023-05-25 11:36:43 -07:00
wozeparrot 01ae45a43c
Add mlperf RNN-T model (#782)
* feat: initial rnn-t

* feat: working with BS>1

* feat: add lstm test

* feat: test passing hidden

* clean: cleanup

* feat: specify start

* feat: way faster lstm & model

* fix: default batch size

* feat: optimization

* fix: fix metrics

* fix: fix feature splicing

* feat: cleaner stacktime

* clean: remove unused import

* clean: remove extra prints

* fix: fix tests and happy llvm

* feat: have the librispeech dataset in its own dir

* clean: unused variable

* feat: no longer need numpy for the embedding + slightly more memory efficient lstm

* fix: forgot to remove something that broke tests

* feat: use relative paths

* feat: even faster

* feat: remove pointless transposes in StackTime

* fix: correct forward

* feat: switch to soundfile for loading and fix some leaks

* feat: add comment about initial dataset setup

* feat: jit more things

* feat: default batch size back to 1

larger than 1 is broken again :(
and even in the reference implementation it gives worse results
2023-05-25 00:41:21 -07:00
George Hotz f28df9900f
multidevice works (#763)
* basic multigpu working

* better multigpu test

* upper

* touchups

* cl sync
2023-05-04 01:04:58 -07:00
Joqsan 0b9d4126d0
Add Tensor.stack() and Tensor.repeat() (...trying to make einops work with tinygrad) (#758)
* add stack() and repeat() methods

* make stack a static method
2023-05-01 09:37:46 -07:00
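Tensor.stack joins same-shaped tensors along a new axis; the usual way to build it from existing ops is to unsqueeze each tensor at the target dim and concatenate. A hedged sketch of that construction, assuming current tinygrad's Tensor.unsqueeze and Tensor.cat (not necessarily how the PR implements it):

```python
from tinygrad.tensor import Tensor

def stack(tensors: list[Tensor], dim: int = 0) -> Tensor:
    """Insert a new axis at `dim` on every tensor, then cat along it."""
    return Tensor.cat(*[t.unsqueeze(dim) for t in tensors], dim=dim)

# stack([Tensor([1, 2]), Tensor([3, 4])]).shape -> (2, 2)
```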
George Hotz 1240c12ac5 download cifar to datasets dir 2023-03-29 12:25:42 +04:00
George Hotz 1a039306d2
good changes from llama branch (#671)
* good changes from llama

* transpose behavior changed
2023-03-09 20:51:22 -08:00
George Hotz b1ba78ac38 move applegpu disassembler 2023-03-05 11:21:12 -08:00
George Hotz 262f81d795 applegpu everywhere 2023-02-27 22:54:59 -08:00
Marcello Fuschi 6d97d62ab3
Add PyCharm's .idea to .gitignore (#597) 2023-02-24 20:14:38 -08:00
George Hotz 714bf4b108
clang backend (#572)
* start clang backend

* mostly working

* no group for reduce w clang

* it compiles

* compiles

* a11y

* minor fixups

* formatting

* add a test

* rename test
2023-02-20 18:18:18 -08:00
George Hotz 5e6265be6e metal timing, fix speed test 2023-02-17 12:31:54 -08:00
George Hotz 2844482a60
Mypy fun (#541)
* mypy fun

* things are just faster

* running fast

* mypy is fast

* compile.sh

* no gpu hack

* refactor ops_cpu and ops_torch to not subclass

* make weak buffer work

* tensor works

* fix test failing

* cpu/torch cleanups

* no or operator on dict in python 3.8

* that was junk

* fix warnings

* comment and touchup
2023-02-08 09:56:51 -06:00
George Hotz 682dc64430 works at work 2022-09-06 08:06:11 -07:00
George Hotz 121d5a17ee use tinynn for Conv2d 2021-10-30 19:40:44 -07:00
Skosh 78aa147b39
[WIP] YOLO working on tinygrad! (#245)
* Some progress on yolov3

* Removed some debugging comments… Also, the forward pass eats all RAM for some reason

* forward pass almost runs

* forward pass runs almost

* forward pass runs, now we gotta load the weights

* loading weights works

* fetches config and weights

* everything kind of works; postprocessing of the output still needs to be implemented. temp_process_results kind of works, but it's kind of terrible and not how things should be done

* some changes

* fixed some bugs in the forward pass and load_weights function; it now outputs more correct values, though some values are still loaded incorrectly

* Something is wrong with the forward pass, Conv2d tests added

* forward pass almost outputs correct values, gotta fix one more thing

* yolo works

* some final changes

* reverting changes

* removed dataloader

* fixed some indentation

* comment out failing test, somehow it fails CI even though it passes on my computer…

* fixed wrong probabilities

* added webcam option to YOLO, now just need to add bounding boxes and speed it up

* some progress towards adding bounding boxes

* trying to speed up yolo layer on GPU, still faster on CPU but with 30GB ram usage

* Faster inference times, bounding boxes added correctly, webcam works but is slow, and there is a memory leak when running on CPU... Also added tinygrad's output on the classic dog image

* removed some debugging print statements

* updated result image

* something weird is going on, mean op on GPU tensor randomly faults, copying a tensor from GPU->CPU takes 10+ seconds…
2021-04-25 18:06:52 -07:00
George Hotz 1dcaecacc4
Support for Apple Neural Engine (#130)
* ane query is success

* cite and build instructions

* low level access, need to disable AMFI

* coreml_ane works

* coreml fun

* more work

* compiled example

* progress

* compiler works

* model flow

* TODOs in the readme

* put some real weights in

* we are learning objc

* much progress i think

* signed model still doesn't work

* working example

* there are float16

* clean up: part 1

* h11ane header, more cleanup

* cleanup DeviceController creation

* remove the stupid sleep

* notes

* start a hwx parser

* no tabs

* compare stuff

* hmm, why don't inputs work

* cache doesn't seem to fix it

* hmm, the issue was the compiler

* fix the compiler, guess i didn't put in weights

* logging for compiler

* uselessness in plist

* remove hwx before compile, weights are converted to float16

* better compare

* better compare

* last line in compare

* opcodes from compiler

* notes
2020-12-03 10:32:26 -08:00
George Hotz 94d44c97bf add pad2d on GPU 2020-11-07 10:46:36 -08:00
Rene Delgado cd54697fd8
fix gpu sum forward (#61)
* ignore venv

* add sum test

* fix sum forward
2020-11-05 21:59:16 -08:00
Göktuğ Karakaşlı cc9bd45b44 add setup.py and change imports to relative 2020-10-26 18:19:50 +03:00