* update cstyle renderers to take a dtype in code_for_op
* implement NEG for bools in LLVM
* update triton
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* cuda with gpuctypes
* hip gpuctypes
* graphs
* rename + linter happy
* use cpu_time_execution
* no ji in build_kernel_node_params
* remove hip_wrapper
* hip fix
* no arc
* small changes
* no clean module in cudacpu
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
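
A minimal conceptual sketch of what an LRU-style allocator like the one above does (illustrative only, not the actual tinygrad LRUAllocator; the `backend._alloc`/`_free` hooks are assumptions): freed buffers are cached by size and handed back on the next allocation of that size instead of going through the device allocator again.

```python
from collections import defaultdict

class LRUAllocatorSketch:
  def __init__(self, backend):
    self.backend = backend            # assumed: object exposing raw _alloc(size) / _free(buf)
    self.cache = defaultdict(list)    # size -> freed buffers kept around for reuse

  def alloc(self, size:int):
    # reuse a cached buffer of the same size if we have one
    if self.cache[size]: return self.cache[size].pop()
    return self.backend._alloc(size)

  def free(self, buf, size:int):
    # don't release to the device; keep it for the next alloc of this size
    self.cache[size].append(buf)

  def free_cache(self):
    # drop everything, e.g. when the device runs out of memory
    for bufs in self.cache.values():
      for buf in bufs: self.backend._free(buf)
    self.cache.clear()
```
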
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* bring hip graph back
* share with metal
* fix linter
* remove hasattrs
* Update ops_hip.py
* hip wrapper does not use _buf
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* I love the walrus operator so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
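
A sketch of the download-cache idea in the commits above (illustrative; names and layout are assumptions, not the real `fetch`): files are cached under a name derived from the URL, the cache directory embeds `DOWNLOAD_CACHE_VERSION` so it can be invalidated, and the walrus operator drives the read loop.

```python
import hashlib, pathlib, tempfile, urllib.request
from typing import Optional

DOWNLOAD_CACHE_VERSION = 1   # bump to invalidate everything cached so far

def fetch(url:str, name:Optional[str]=None) -> pathlib.Path:
  fn = name if name is not None else hashlib.md5(url.encode("utf-8")).hexdigest()
  fp = pathlib.Path(tempfile.gettempdir()) / f"downloads_v{DOWNLOAD_CACHE_VERSION}" / fn
  if not fp.is_file():
    fp.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as r, open(fp, "wb") as f:
      while chunk := r.read(16384):   # walrus: read and test in one expression
        f.write(chunk)
  return fp
```
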
* need env.
* ugh, wrong path
* replace get_child
* remove force_wait
* refactor
* get rid of stupid ASTRunner
* fix del in diskbuffer
* BufferOps.FROM_UNDERLYING
* put offset in the rawbuffer
* fix bugs
* use exec
* autopad shapetracker for BEAM
* OptOps.PADTO
* skip that test for now
* correct padding reduce axis
* just 32
* avoid more than double the FLOPs
* cleanups
* test case
* no support for triton and llvm yet
* typos
* symbolic shape would not work
* cannot PADTO with MAX kernel
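
Why PADTO is limited the way the commits above describe, shown with plain numpy (this illustrates the reasoning, not the kernel code): padding a reduce axis with zeros leaves a SUM unchanged, so the only cost is extra FLOPs (hence the "avoid more than double the FLOPs" guard), but it can silently change a MAX.

```python
import numpy as np

x = np.array([-3., -1., -2.])     # a reduce axis of length 3, all negative
padded = np.pad(x, (0, 1))        # PADTO-style zero pad up to length 4

assert padded.sum() == x.sum()    # zeros never change a SUM reduce
assert padded.max() != x.max()    # but they can change a MAX reduce (0 > -1)
```
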
* advance db version
* no breaking change - don't advance db version
* is triton just python?
* Revert "is triton just python?"
This reverts commit 17e776c25587615e33a3634c2fb0bb8591ce65d4.
* Revert "Revert "is triton just python?""
This reverts commit 6c434c01e1c4b0ea0431ec18632cd859fb3cf260.
* support llvm
* is it really passing in CI only?
* update tests
* oh triton test passed
* simpler
* revert that, with a test
* check if st are the same
* Revert "check if st are the same"
This reverts commit d2a5eac110a5da1af82a2728c883779ef69c3cad.
* update the db version
* rebase artifact
* replace all _dtypen with dtype.vec(n)
fix: print works
* conceptual refactor of cstyle render_load logic
* linearizer GEP is explicit that its dtype is the scalar version of localtype
* vectorized global_store and load don't need a conditional
* beautiful mnist
* beautiful mnist example
* from tinygrad import Tensor
* more beautiful
* the jit is super core tinygrad
* globalcounters reset on jit run
* symlinks and exclude
* beautiful_cartpole
* evaluate is its own function
* no symlinks
* more beautiful
* jit reset for double speed
* type hinting for JIT
* beautiful_mnist gets 98%
* beautiful_mnist < 4s with BEAM=2
* better cartpole
* use actor critic
* zero_grad got lost
* delete double relu
* stable cartpole with PPO
* beautiful_cartpole is more beautiful
* REPLAY_BUFFER
* beautiful stuff typechecks
* None support in shape
* hp tuning
* add back as_strided, move rebuilt mops to extra
* negative stride for ops_cpu
* Revert "negative stride for ops_cpu"
This reverts commit a13b6815ac31478d31ae71c26f4d4e4d274bf155.
* skip that
* style
* very close
* remove comment
* negative strides working
* almost everything passes
* calculate offset with list comprehension
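
The offset calculation mentioned above, in isolation (assumption: this is the standard base-offset formula for negative-stride views, not the exact committed code):

```python
def base_offset(shape, strides):
  # element (0, ..., 0) of a view with negative strides sits at the *end* of
  # each reversed axis, so shift the flat offset there
  return sum((s - 1) * -st for s, st in zip(shape, strides) if st < 0)

# a length-5 axis viewed with stride -1 starts at flat index 4
assert base_offset((5,), (-1,)) == 4
```
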
* some cleanup
* got disk load working
* review suggestions
* fix after merge
* overlap working
* did it
* clean
* fixed disk load
* lint
* mypy
* removed as_strided
* trying without simplify
* added back simplify
* make sure expanding to smaller shape
* cleanup
* removed comment
* removed env file
* trying whisper test again
* onnx test sqlite issue
* working on test
* finished test
* eliminate unnecessary shrink-then-pad
* don't shrink buffer
* added strides check
* added to ci under linters
* switch issue
* allow symbolic stride
* removed .env
* isinstance
* adjust strides for double expand
* cleanup
* needed to add type hint for mypy
* set pythonpath
* metal indirect command buffers
* sub 1ms gpt
* metal batch exec is good
* remove whitespace
* input_replace
* fix ci
* useResources
* very simple cacheallocator
* update_stats
* fix CI
* minor
* remove that from jit
* refactor/ci: delete many `# type: ignore`
* replace `axis.__class__ is int` with `isinstance(axis, int)` to make mypy happy
* add `--warn-unused-ignores` to mypy flag
refs #2240
* ci: move `--warn-unused-ignores` flag to mypy config
refs #2240
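
Why the `isinstance` swap makes mypy happy: `isinstance` participates in type narrowing, while `axis.__class__ is int` does not, so the checker can't prove the branch is an `int`. A tiny illustration:

```python
from typing import Tuple, Union

def normalize(axis: Union[int, Tuple[int, ...]]) -> Tuple[int, ...]:
  # `if axis.__class__ is int:` would leave axis typed as the Union here
  if isinstance(axis, int):   # mypy narrows axis to int in this branch
    return (axis,)
  return axis
```
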
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* remove arm64, caching for cuda
* caching in llvm
* switch cache_compiled to new cache
* fix clang
* caching for metal
* fix pylint
* cleanups
* perf_counter and binary
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* stable diffusion < 324ms
* revert swap action
* fix tests due to more sum splitting
* REDUCEOP_SPLIT_THRESHOLD env var
* added from unaligned np test (#2134)
* align cpu buffer before copy into cl buffer (#2135)
* remove shelve from handcode_resnet50_opt.py (#2139)
* Add dictionary keys to reduce db size (#2131)
* work
* ignore beam cache
* dictionary keys are generic
* minor db cleanups
* fix baseline and extract dataset
* fix training
* log likelihood
* more lin to feats
* sts
* training policynet
* net sort of works
* dedup
* refactor, stupid new actions
* fix uops deduping
* BEAM_ESTIMATE
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
* feat: move to hip
* feat: special path for RawBufferTransfer
* feat: initial rawbuffertransfer
* feat: hip ipc
* feat: working hip ipc
* feat: need to base device without args
* feat: close mem handle
* feat: modified test
* feat: more multihip stuff
* clean: cleanup
* feat: cleaner
* feat: don't crash
* feat: test more
* clean: way cleaner hip wrapper
* feat: barrier
* feat: barrier
* feat: this breaks stuff
* feat: we can use empty here
* feat: maybe fix tests
* feat: maybe fix tests again?
* fix: probably fix tests
* feat: no waiting here
* feat: wait here
* feat: much larger test
* feat: need to sync here
* feat: make this async
* feat: no waiting!
* feat: cut here
* feat: sync copy
* feat: random imports
* feat: much cleaner world
* feat: restore this
* feat: restore this
* clean: cleanup
* feat: set this
* Revert "disable flaky triton test"
This reverts commit 1e15fdaee7.
* Update test.yml
* check if has shared for matvec
* disable ocelot cache for triton
* disable ocelot cache
* disable ocelot cache
* pass shared to triton uops tests
* temporary debugs for CI crash
* Revert "temporary debugs for CI crash"
This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542.
* Revert "triton isn't tested, and allows this refactor (#2007)"
This reverts commit dea8bb0938.
* add runtime_args to every renderer, move triton local size override to runtime args
* Add binary to args, correct type returned
* update to new loops
* Update test.yml
* some cleanup
* move continue back
* more more more
* added to CI
* try
* try intentionally break some tests
* wtf
* del True for test
* yay tests broke, now pls no break
* try AGAIN
* gahy
* lol
* try
* move over constant
* moved over MORE
* move shrink over
* trailing lines
* try CUDA CI
* try again
* boom
* oops
* improved comments
* try: disable some flags and disable CUDA
* try breaking tests
* traceback has too much info so add --tb=no
* revert forced CI failure
* add comments and del unused imports
* oooooooo using regular debug try enable tb
* intentionally break tests
* added tb back. Maybe not too verbose
* strip whitespace
* missed something
* Shape op int32 -> int64
* oops missed something
* add some types
* get rid of crazy 1 liners in pad op
* actually test Split this time LOL
* strip that whitespace
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* start compile2
* tweak
* why are there two more kernels?
* minor cleanups
* don't break onnx tests
* add __metadata__ support to safetensors
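
For context on the `__metadata__` commit: a safetensors file starts with an 8-byte little-endian header length followed by a JSON header, and `__metadata__` is an optional string-to-string map inside that header. A small reader sketch (not the committed writer):

```python
import json, struct

def safetensors_metadata(fn:str) -> dict:
  with open(fn, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian size
    header = json.loads(f.read(header_len))          # JSON header with tensor entries
  return header.get("__metadata__", {})              # optional str -> str map
```
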
* no early realize in onnx
* cleanups
* bugfix
* clean up image type, add optimize
* opt to match old
* try that
* opt work
* run compile2
* optimizer
* print more
* prerealize
* imp
* NOLOCALS works
* no locals means no locals
* support fractional globals
* all locals welcome
* int that
* cleanups
* show gemv regression
* clean up diff
* use idx for the cond
* nolocals
---------
Co-authored-by: Comma Device <device@comma.ai>
* start work on auto opt
* lin failure
* not beating hcopt
* greedy
* timing is fast
* codegen.search
* greedy search in handcode_opt
* track running gflops
* clean up those files
* no failure
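
A conceptual sketch of the greedy search in the commits above (hedged: `get_actions` and `time_kernel` are stand-in names, and `copy`/`apply_opt` stand in for whatever the linearizer exposes; the real search has more guards):

```python
def greedy_search(lin, get_actions, time_kernel):
  best, best_time = lin, time_kernel(lin)
  improved = True
  while improved:
    improved = False
    for act in get_actions(best):
      try:
        cand = best.copy()
        cand.apply_opt(act)        # may raise if the action is invalid for this kernel
      except Exception:
        continue
      if (t := time_kernel(cand)) < best_time:
        best, best_time, improved = cand, t, True
  return best
```
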
* Allow multi-input model export
* Add model export unit test
* Fix efficientnet compilation
* Only run model export test on JIT supported devices
* Skip export model test if not EXPORT_SUPPORTED_DEVICE
* small changes
* expand in terms of substitute, directly expand g_idxs g_valid
* delete expand_ops
* don't compare using hash
* any instead of in
thanks gijskoning
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* support tc
* testing code
* no more create_rednode
* maxsize none in view/node
* oops
* undo
* typing
* oops
* oops
* lmao
* lmao
* add expand multi test
* Node.iter_idxs
* type
* type
* delete checks!
* clean up a little?
* expand_idx in symbolic
* un-golf
* play around with types >.>
* test_substitute and also remove an incorrect test?
* get rid of range
* Update symbolic.py
* split out view cache change
* split out flat components change
* reduce diff
* reduce diff
* add some float4 tests
* fix
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* init hip graph
* optimize args update
* cache symbolic in jit
* remove NOSTAT
* init BasicBatchExecutor
* symbolic infer cache per jit instance
* basicbatchexec is default for compiled
* batch_exec is taken from ASTRunner
* no infer cache
* batched execution of hip graph
* add comment about hip graph batches
* readable hip graph
* Move ops_triton to runtime and remove errors from deprecated code
* Remove deprecated AST Kernel
* Remove deprecated buffer
* Add TritonProgram
* Triton Buffer
* Use RawCUDABuffer
* triton_compile
* Added new parameter
* pass _buf to program
* remove deprecated include
* Added triton tests
* Deprecated includes removed
* remove double print
* Disable float4 support
* Disable float4 support
* variable load fix
* Track local size
* Add pycuda to triton dependencies
* Merge test.yml
* install cuda packages for testing
* merge double package install
* remove emulated from triton tests
* upscale local index to power of 2 and add masking
* cuda envs
* Add TernaryOps
* ConstOp loading
* proper function name
* remove deprecated variables
* get global program from name
* const ops match local shape
* Enable test_nn
* remove deprecated import
* fix linter error
* Add wait logic
* Add local size override
* accumulate local shapes instead of using max shape
* Merge triton tests into global tests
* fix envs in testing
* Old testing routine
* split file into renderer and program
* remove print and starting whitespace
* pretty ptx print on debug 5
* linter errors
* ignore triton saturation tests
* ignore test example
* remove pytorch cpu extra index
* Add triton to existing testing routine
* use triton tests
* disable cuda backend in triton tests
* use cudacpu in tests
* print used device
* Print device default
* Remove print
* ensure we are running triton backend
* update variable signatures
* update dtypes for load
* infinity render fixed
* limit global size
* negative infinity now properly rendered
* split chain with parentheses for and node
* Add option to disable shared memory, disable for triton
* missing import
* Properly index and mask conditional load
* use mask only if not loading a block pointer
* nan support
* fix symbolic tests to include chain split
* proper masking for stores
* Implemented bool dtype
* Add mod
* fix loads for variables with valid range
* merge triton with cuda runtime
* merge from master
* run triton tests with cuda
* Correct target when running from triton
* conftest with triton compiler config
* use triton nightly
* verbose tests for triton
* capture stdout
* fix function depth when exiting multiple loops
* add render valid function for readability
* fix mask for local loops
* add _arg_int32 datatype
* fix dims for conditional loads
* enable non float stores
* correct variable dtypes
* fix type for arg_int32
* remove junk
* Added get max function for range based var.max
* remove deprecated code
* Fix triton ptxas path
* Fix testing for CI
* clamp local size by max local size instead of always running max
* Disable matmul test in triton cpu
* rerun tests
* Disable broken test in triton cpu
* whitespace removed
* rerun tests again
* Disable TestSymbolicOps for triton
* update to new uops
* linter fix
* ignore test/extra
* linting fix
* Update tinygrad/renderer/triton.py
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* remove deprecated line
* quotes type fix
* linter
* Remove unnecessary lines
* UnaryOps.NEG
* don't define constants
* Linting fix
* Disable tests that are broken in ocelot
* remove trailing whitespace
* reduce line count
* linting fix
* update to new uast
* New looping style
* Update to new uast
* make AST runner work with triton
* linting fix
* set renderer var for testing
* disable local for ocelot
* reenable all tests for ocelot
* Pass shared to cuda
* Don't group if the backend doesn't support shared mem
* use working gpuocelot branch
* enable all tests
* enable local for ocelot
* cleanup
* Update test.yml
* update cache key
* reenable test symbolic and extra
* Update test.yml
* Revert "Update test.yml" (rerun tests)
This reverts commit 98c0630ee5da4379e5c6b2437a5145fe87058c35.
* Revert "fix symbolic tests to include chain split"
This reverts commit 22a9a4c9cd14d23735e6540c8d90ee005ac4ea17.
* Revert "split chain with parentheses for and node"
This reverts commit 7499a7004ef4db785d0cd05cf292fdeff65ca90d.
* use global size from linearizer
* rename newvar to dtype to match other renderers
* join program start lines
* simplify code that adds axis to local dims
* assign r[u] in ssa
* We no longer need to replace target in src
* we no longer need to cast indices to int by hand
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* 1
* 83 failed
* learning how git works
* lol idk
* zero shape aaaa
* space lol
* aaa
* test check
* haha
* fixed gather
* 73 failing
* 71 failing
* 68 failing
* added some debug
* fking resize
* lol
* 62 failing
* 58 failing, finally did nearest resize
* clean up
* 56 failing
* janitor duty
* lol
* 53 failing
* hi mom
* 50 failing
* added linear interp, but coord_trans is wrong
* did lin interpolation woohoo
* 43 failing
* 40 failing
* temporary Gather fix
* 39 failing
* fixed slice onnxver<10
* 37 failing
* 35 failing
* excluded tests that use float64
* 32 failing with hacks
* added _batchnorm() for 3D 5D batchnorm, 29 failing
* changed ALLOWED_KERNEL_COUNT from 199 to 207
* added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit
* support Round op
* added storage_order/indices maxpool, 27 failing
* support maxunpool, 25 failures
* support Gradient, 23 failures
* merged new where
* added Adam
* cleanups
* added Momentum and Nesterov Momentum
* added Adagrad
* support sequence_type, 20 failing
* ugh git
* I give up on cubic interp :D, 9 failing
* sexy 1 liner gather, much improved, wow
* polished gather to make it shine bright like a diamond
* clean 1 liner for gather
* improved readability of gather
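
One way to express gather as a "one-liner" out of plain broadcast/compare/sum ops, in the spirit of the commits above (numpy sketch; not the exact line that landed):

```python
import numpy as np

def gather_last_axis(x, idx):
  # one-hot mask: mask[..., j, k] == (idx[..., j] == k), then weight and reduce
  mask = idx[..., None] == np.arange(x.shape[-1])
  return (mask * x[..., None, :]).sum(axis=-1)

x = np.array([[10., 20., 30.]])
assert np.allclose(gather_last_axis(x, np.array([[2, 0]])), [[30., 10.]])
```
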
* uhh
* clean up
* more clean up
* WHITEspace
* implemented SoftmaxCrossEntropyLoss op
* added comments and cleaned up if statements
* update
* thank based wozeparrot for pow and new GatherElements
* CPU and TORCH all pass | cast float64 -> float32 for all fromCPU()
* _nearest_gather() failing on yolo
* reverted ops_cpu change and added assert in Resize
* added comments for resize for multiple channels
* oops
* merge
* test
* switched np.pad to Tensor.pad for constant padding
* gah
* gah2
* sexy reflect pad with movementops -> add
* delete commented out lines
* edge mode pad sexy as well
* trying out model_benchmark
* revert gitignore change lol
* init
* Revert "init"
This reverts commit 682bf2073a8b4eca111596c67cf6ebd79f59e585.
* wrote cast workaround for CPU, CPU and TORCH all pass
* wrote cast workaround for CPU, CPU and TORCH all pass
* skipped tests w/ 0 shape for METAL and GPU
* excluded tests for CLANG, CPU, TORCH, CLANG pass
* fixed hacky ConvTranspose
* gotta figure out autopad
* UOps.STORE support cast bool -> float
* small fix for fast gather
* reverted 0 shape skipped tests
* oops missed a file
* added comment
* fixed slice op hack
* First commit to pr
* More trig ops
* More trig ops
* format
* isinf support
* More ops
* changed onnx_ops to use our new gather :D
* Det op bug fix
* rebase
* fixed some tests
* det broken and slow
* fixed compress to use new gather
* implemented argmax argmin
* support variable types in type_proto
* support Upsample and Identity sequence
* we support float64 now and tinygrad supports automatic broadcasting
* added EyeLike op
* resize does support multiple channels now actually
* yolov8 onnx runs successfully
* added batch size 1
* oops
* finally fixed type_proto I think
* fixed some llvm bugs
* del whitespaces
* added ZenginU Format PR
* test
* oops
* added float64 exclude tests back
* more skipped tests
* try
* ok openpilot pass
* flake8 pass
* woooooohooo
* revert external_model_benchmark changes
* perf tested gather
* removed promote types from ops_cpu
* numerical errors from 1681 are fixed
---------
Co-authored-by: ZenginU <umutzengin00@gmail.com>
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* feat: train cifar using multigpu
* feat: split eval batch across 5
* feat: cleaner allreduce
* feat: 93.88%
* feat: cleaner batch chunking from bert
* feat: cleaner grad sync
* feat: tinygrad argmax
* feat: make it work with different gpu counts
* feat: move some stuff into the normal __init__
* feat: autodetect gpu count
* feat: move import inside
* move assembly, assembly_ptx
* successful but broken rendering of ptx asm
* clear ins before render asm
* slightly less broken :')
* we needed thread syncs
* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half
* Fix runtime_args for gpuocelot
* our casts were flipped on both ends
* more casting
* add ternary where op
* dealing with storing/loading bool
* add test for casting to bool from negative
* Fix args.valid on ConstOp
* add to CI, TODO: fix runtime_args for test_uops
* fix placement of runtime_args to work with lazy.Device
* undo ci changes so I can push
* fix lints
* start cleanup and fix things we broke fixing lints
* add checks for PTX specific asm instructions
* revert added test -- doesn't pass on llvm
* skip tests for underflow,overflow
* another fix for how we're setting runtime args
* Less broken cleanup
* add to CI
* add more env variables for ci test
* fix ci to install pycuda for ptx
* ci: copy cuda test command
* cleanup
* assert to make sure we're actually running ptx in ci
* remove test assert
* move is_ptx arg
* move assembly, assembly_ptx back to extras
* fix imports
* initial merge fixes
* clear registers, fix UOps.LOAD with invalid value
* draft merge fixes
* remove prints
* quick lint and merge fixes
* cleanup
* remove PTXProgram wrapper
* final cleanup
* temp change for ci rerun
* ci rerun
* rollback ISA version
* try to run commavq
* fix 0 dim, start implementing new ops
- Implement EmbedLayerNormalization
- Implement Attention
* SkipLayerNormalization and FastGelu
* use original torch model, cast inputs
* fix some ops:
- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu
* cleanup onnx_ops.py
* add validation option to benchmark
* cleanup imports
* add checks incase onnx2torch implements ops in future
* run onnx instead of original torch
* just skip gpu on m1
* reactivate the other models
* check for strange params & squash whitespace
* cleanup
* fix causal mask Attention
* Range doesn't need int cast
* embedding vocab_counter same dtype as input
* no need to cast
* always validate, fix PosixPath ort
---------
Co-authored-by: George Hotz <george@comma.ai>
* testing new memops
* better debugging
* testing padded conv
* branching with load
* refactoring a bit
* first try
* fixing bugs
* fixing some
* eq
* eq2
* do not use x's
* working
* fixing imm
* getting things working
* refactor
* pow not working
* working except one
* refactor: one store mem
* refactor: global load
* refactor: imm
* refactor: cleaning
* fixing big offsets
* refactor with ci
* try ci
* typo
* another typo
* ubuntu default
* forgot git
* do i need git?
* missing packages
* adding python-dev
* with cache?
* buildx action
* buildx name issue?
* maybe now?
* python3
* newline warning
* maybe now
* i actually need this
* ci should work now
* improved caching
* fixing cache
* maybe now it will cache
* this
* testing cache
* trying again
* load
* missing platform
* caching gha
* testing cache
* full testing
* typo
* now?
* why
* adding checkout back
* bad formatting
* fixing convention issues
* supporting python
* adding CI flag
* testing all
* better comments
* adding debugging
* takes 12x longer
* does it output progress now?
* ignore models for speed
* fixing merge
* excluding conv_transpose2d
* only 2 tests because it is too slow
* another approach
* let's see
* faster duh
* my bad
* T_T
* typo
* sup
* with output?
* comment test
* comment test
* comment test
* :?
* no comment
* with cache
* back to normal
* testing that ci works
* back to passing
* trying again
* does it create another entry
* does it create another entry?
* build local
* hey
* Revert "excluding conv_transpose2d"
This reverts commit cc7348de03033e032f47d69caff174e2f1a7bfea.
* does it cache if done before?
* does it cache?
* done
* adding test ops
* bad formatting
* no need for this
* working static mem
* sum 1d
* add ndim
* better reg import
* fix stack
* back to np
* working except for softmax
* 5 failing
* no pogress
* remove keystone
* remove keystone
* testops passing
* cleanups
* more cleanup
* typo
* ci
* ci2
* cond import
* ci3
* ci4
* ci4
* ci5
* ci5
* ci6
* alignment
* test all
* correct test
* err read_unmapped
* passing test
* ignore for speed
* ignore for speed
* ci7
* cleanup
* remove docker
* fixing merge
* fixing bugs
* add skipload for const ops
* comments
* First merge to master: Renderer
* fix emulation
* passing all tests arm64
* cleaning
* fix handcoded binary
* cleaning
* fix errs
* fix runtime arg binary
* clean git diff
* fix and clean
* fixing metal test
* cleaning
* fix metal test
* ci ~8 min
* fix pylint and clang
* cache the files in ops_clang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* feat: allreduce
* feat: test
* feat: need contiguous
* feat: test in ci
* feat: exit with correct code
* feat: don't need that
* feat: opencl wait_for just doesn't work
* feat: synchronize on out
* feat: try?
* feat: try again?
* feat: add extra realizes
* feat: print
* feat: seed
* feat: tol
* feat: test ones and zeros
* feat: remove print
* feat: are you just flaky
* feat: separate scatter and gather?
* feat: just try synchronizing
* feat: remove print again
* feat: bring back difference
* feat: no sync
* feat: revert that
* feat: back to wait_for
* fix: typo
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* add disk_tensor
* fix jit
* new baseline before whitening
* whitening through torch
* whitening done, currently at 91.65%
* 91.99%
* clean up mixup and 92.3%
* clean up 92.30%
* 92.49% before searching for new hyper-parameters
* fix CI
* fix white space
* add whitening init in test
* refactor, update hyperpara, 92.72%
* converting whitening to tinygrad operation
* update CI kernels count for CIFAR
* add pad reflect
* add random crop 92.53%
* update hyperpara 93%
* 93.15% on docker container, need to refactor the assignment for hyper param
* print out weights and bias to be separated
* bias/non-bias params separated
* fix whitespace
* clean up
* refactor hyper-param with dict
* refactor lr scheduler params
* fix whitespace
* fix cross entropy loss
* fix whitespace
* move opt hyp to hyp dict
* minor fixup
* adjust model, loss scaling
* 92.74% while using half of compute as before
* update hyp for cutmix
* random shuffle during batches
* clean up
* updating the model
* update ConvGroup
* disable gradients for batchnorm layer weights
* whitespace
* 93.92%
* clean up
* finally 94%!
* rewrite whitening to remove dependency on torch
* whitespace
* remove dependency on torch, 93.91%
* back to 94.03%
* clean up
* update test_real_world
* Rename in files
* Move files
* Moved to extra/datasets as suggested
* Changes to files
* Fixed stupid mistake
---------
Co-authored-by: terafo <terafo@protonmail.com>
* Fixes + improved test coverage for helpers.py
- added exception handling in `proc`; without it, if an exception was thrown, the thread would hang
- made `_early_exec_process` catch any Exception; before, if an exception was thrown before the process was started, it would hang the thread
* Made `_early_exec_process` catch any Exception
Otherwise, if an exception was thrown before the process was started, it would hang the thread. For example, a type error for an argument passed to `subprocess.check_output`.
* Fixed `from tinygrad.helpers import Timing` import
oops, for some reason my IDE cleaned that import from extra/helpers.
* Fixed import in llama.py
Another one that I skipped by accident, my bad
* Extracted a class for tests of early exec
* Normalize line endings; Windows uses \r\n
* Made `cross_process` not a daemon
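
The exception-forwarding pattern this PR describes, sketched (illustrative; the real `_early_exec_process` may differ in details): the worker always puts *something* on the output queue, so a failure before or during the subprocess call can no longer leave the caller blocked on `qout.get()`.

```python
import subprocess
from multiprocessing import Queue

def _early_exec_worker(qin: Queue, qout: Queue):
  while True:
    path, inp = qin.get()
    try:
      qout.put(subprocess.check_output(path, input=inp))
    except Exception as e:   # forward the failure; the caller checks for Exception and re-raises
      qout.put(e)
```
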
* fixed division by zero for fast operations
* made et closer to 0
* replace POW llop with SQRT
* updated mlops to swap SQRT and POW llops
* updated hlops to swap POW and SQRT
* added sqrt llop to cpu runtime
* added sqrt llop to cstyle codegen
* added POW llop to llvm ir codegen
* added SQRT llop to torch runtime
* moved pow from mlops to hlops
* found a better way to do reverse pow
* fixed indentation
* added SQRT llop to triton
* update docs to match new llops
* removed POW operator from assembly codegen
* added sqrt and rsqrt to pow hlop
* rewrote pow function in tensor.py
* Adjust tolerance
* Adjust for adamw
* Reduce for Adam too
* removed accidental leftover code
* removed all of accidental code
* added rsqrt test
* removed pow from mlops again
it was added back when resolving merge conflicts
---------
Co-authored-by: Jacky Lee <jla524@sfu.ca>
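
The idea behind the POW→SQRT swap in this PR, in scalar form (assuming x > 0 for the generic case; this is the math, not the tensor-level code): the cheap exponents get their own op, everything else goes through exp/log.

```python
import math

def hlop_pow(x: float, y: float) -> float:
  if y == 0.5:  return math.sqrt(x)          # the new SQRT llop
  if y == -0.5: return 1.0 / math.sqrt(x)    # rsqrt as reciprocal of SQRT
  return math.exp(y * math.log(x))           # generic pow for positive bases

assert math.isclose(hlop_pow(9.0, 0.5), 3.0)
assert math.isclose(hlop_pow(2.0, 10.0), 1024.0)
```
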
* fix syntax issues in imagenet_download.py
* use cloudpickle in cross_process to make it work in Python 3.9+
* add cross_process test
* prevent unpickling on every function call
* add cloudpickle to setup.py
* add support for args/kwargs
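
A sketch of a cloudpickle-based `cross_process` (assumes `fn` is a generator, as in the original helper; details are illustrative): cloudpickle, unlike the stdlib pickle used by `multiprocessing` spawn, can serialize lambdas and closures, and the payload is unpickled once in the child rather than on every call.

```python
import cloudpickle
from multiprocessing import Process, Queue

def _child(payload: bytes, q: Queue):
  fn, args, kwargs = cloudpickle.loads(payload)   # unpickle once in the child
  for item in fn(*args, **kwargs): q.put(item)
  q.put(None)                                     # sentinel: generator exhausted

def cross_process(fn, *args, **kwargs):
  q: Queue = Queue()
  p = Process(target=_child, args=(cloudpickle.dumps((fn, args, kwargs)), q))
  p.start()
  while (item := q.get()) is not None: yield item
  p.join()
```
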
* Use generators in any(..) instead of lists for better best-case
* Use generators in all(...) instead of lists
* enable R1729 in .pylintrc
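
What the any()/all() change buys: with a generator, `any`/`all` can short-circuit without ever materializing the list, which is what the R1729 check enabled above enforces.

```python
nums = range(10**6)

any([n > 5 for n in nums])   # builds the full million-element list first
any(n > 5 for n in nums)     # generator: any() stops after the 7th element
```
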
* revert import sorting
---------
Co-authored-by: Anselm Coogan <anselm@scandit.com>
* matrix strategy
* push env to GITHUB_ENV
* use printf instead of echo
* use temp helper function for cross os paths
* use path join
* switched to using temp helper function
* skip test on windows due to memory limit
* small fix
* removed semi
* touchups
* clean up
* separate tests
* test changes to test_utils on windows
* small refactor
* more cleanups
* undo helpers change
* only skip if in CI and WINDOWS
* Revert "Revert "ops rdna""
This reverts commit 0400315078.
* Revert "Revert "writing 2""
This reverts commit 325a3bf2cf.
* no dump
* 2x 2
* simple asm
* local size
* sub
* lil work
* support args != 3
* assembler work
* generate that
* ptx assembler
* begin index renderer
* max
* ptx loops
* gemms work
* valid works
* asm working a bit more
* close
* passing all ops tests
* ptx is a codegen only, not a backend
* ptx
* float16 support
* rdna goes here
* install types
* make amd disassemble
* ansilen for pretty print
* fix ptx log2/exp2
* assemblyinstruction
* new asm
* working gemm
* fix cmp
* more passing
* mod
* ptx works again
* rdna3 add works
* log exp
* sin is sin 2pi
* fix types
* progress
* loops work
* rdna xyz
* better addressing
* cleanups
* handle exception in early process
* div support
* rdna float4
* locals work
* fix neg index
* cast
* smaller diff
* yaml
* import only if selected
* fromimport
* types
* this all needs rewriting
* a few more
* resolved some slice test errors and added some more debugging logs
* use same device in cumsum
* increased float priority
* onnx debug output matches input
* ConstantOfShape ONNX test fixed.
* removed redundant if statement
* value is optional and should default to a float32 tensor with value of 0
* fixed: default parameters are created at function definition time, which is bad for mutable objects.
* Fix ONNX dropout and unify the implementation
* Use tensor rand method for dropout
* Change approach for RNG in ONNX Dropout
* Fix style
* Test legacy RNG seeding
* Remove the necessity for legacy RNG in Tensor class