tinygrad

Commit Graph

Author	SHA1	Message	Date
chenyu	fcf4a5ccf2	fix example that calls Tensor.__bool__ (#3650 ) also removed `.cpu()` calls in mask_rcnn so `python3 examples/mlperf/model_spec.py` runs	2024-03-07 16:59:26 -05:00
chenyu	8f10bfa2ff	ban __bool__ on Tensor (#3632 ) * ban __bool__ on Tensor avoid misuse * test case * fix tests * fix more tests	2024-03-06 17:12:35 -05:00
George Hotz	81baf3eed3	bring ptx back (#3623 ) * bring ptx back * ptx back * fix define var * fix a few bugs * bugfixes * fixes * fix llvm bug * fix test bug	2024-03-06 13:34:21 -08:00
Elias Wahl	7db6dd725d	multilazybuffer fix (#3609 )	2024-03-04 17:36:23 -05:00
chenyu	35d998efa8	disable flaky test_conv_beam in CI (#3553 ) might fail due to CL_OUT_OF_RESOURCES	2024-02-29 22:59:41 -05:00
Caleb Bunch	0b1fc5888a	fix 'Import Error: cannot import name compile_cuda from tinygrad.runtime.ops_cuda' error in extra/gemm/cuda_matmul.py (#3531 )	2024-02-28 17:15:32 -08:00
wozeparrot	da32c37346	use hash as key for beam (#3516 ) * feat: use hash as key for beam * feat: bump db version	2024-02-28 10:19:01 -08:00
chenyu	77d2a4c12a	regenerate kernel dataset after reduce arg to axis change (#3467 ) ``` ./extra/optimization/generate_dataset.sh gzip /tmp/sops mv /tmp/sops.gz extra/datasets/ ```	2024-02-21 18:16:13 -05:00
chenyu	30f26279c5	add back "CPU" in test_onnx_backend supports_device (#3426 ) the onnx tests were all skipped.	2024-02-16 00:49:30 -05:00
geohotstan	5eb4c902f6	correct division dtype casting (#3405 ) * Happy New Year * fix: exclude floordiv onnx tests * fix: less weird if statements in div * Good luck in the Year of the Dragon * fix: tempfix onnx div * fix: use reference impl for div	2024-02-15 19:34:40 -05:00
George Hotz	b1c0d8c99d	remove cpu and torch backends (#3399 ) * remove cpu and torch backends * don't copy to cpu * use clang instead of cpu * multitensor gathers on the first device * clang is cpu + use default * fixup * bugfix	2024-02-15 16:55:39 +01:00
George Hotz	2e60012bcf	move create schedule and delete old API (#3377 ) * move create schedule and delete old API * fix test multitensor	2024-02-12 18:10:45 +01:00
George Hotz	41efaa848c	move graph.py and jit.py into features (#3376 ) * move graph.py into features * move jit into features * fix quickstart	2024-02-12 17:34:34 +01:00
Yoshinori Sano	98c732cf9d	fix metal compile error in extra/gemm (#3365 )	2024-02-10 12:54:41 +01:00
terafo	3752e97c8f	Fix: Always cast ONNX Slice op arguments into ints (#3317 ) * fix: ensure that axes and steps are always ints * Cast everything in tinygrad --------- Co-authored-by: terafo <terafo@protonmail.com>	2024-02-04 18:40:48 -05:00
chenyu	9b8c1a0408	Tensor.batchnorm works more than 2d and reuse in onnx (#3284 )	2024-01-30 19:02:45 -05:00
chenyu	7816c3b692	onnx update for trilu and argmax (#3283 ) * support 0 in shape for tril and triu * select_last_index for ArgMax and ArgMin * pass **kwargs	2024-01-30 18:39:16 -05:00
Francis Lam	4273aabe31	extra/gemm: add a simple_conv.py along with correctness check (#3236 ) * extra/gemm: add a simple_conv.py along with correctness check The goal is to easily test tensor core triggering situations * test: add tests for acc_dtype handling and fixed typing	2024-01-26 19:06:57 -08:00
George Hotz	473935125a	use comgr to compile (#3248 ) * use comgr to compile * fast * bfloat16 * move comgr to its own file * cleaner style * comgr in new place * comgr free + dtype cleanup	2024-01-26 18:27:49 -08:00
George Hotz	03a6bc59c1	move autogen to runtime/autogen (#3254 )	2024-01-26 12:44:19 -08:00
George Hotz	a3869ffd46	move gpuctypes in tree (#3253 ) * move gpuctypes in tree * fix mypy * regex exclude * autogen sh * mypy exclude * does that fix it * fix mypy * add hip confirm * verify all autogens * build clang2py * opencl headers * gpu on 22.04	2024-01-26 12:25:03 -08:00
chenyu	bc92c4cc32	onnx Einsum, CumSum, DepthToSpace, SpaceToDepth (#3252 ) * onnx Einsum, CumSum, DepthToSpace, SpaceToDepth Einsum inner product and `...` are not supported * --durations=20	2024-01-26 10:47:53 -05:00
chenyu	e45ffdb6cf	cleanup onnx (#3249 ) * add onnx test_reduce_log_sum_exp * more reuse * more * stuff * good CenterCropPad * imports * good ArrayFeatureExtractor * pretty good Pad * stuff * stuff * onnx.py * Atan * pass int8 test * dtype related * fastmath stuff * Resize linear * fix CI * move back	2024-01-25 20:39:59 -05:00
Ahmed Harmouche	168b1f879c	Fix hip_matmul gemm in extra (#3241 )	2024-01-25 16:03:04 -08:00
geohotstan	3628bea910	fix: big round even rounder round (#3242 ) * fix: big round even rounder round * fix: variable name lol * feat: 1 less potential cast * consistent naming (im just spamming commits now) * LOL MISSED ONNX ANOTHER COMMIT * test: fix test_ops and remove _round * test: tensor methods oops	2024-01-25 12:24:15 -05:00
geohotstan	b0b5eba535	fix _round in onnx_ops to look more like new Tensor.round (#3239 ) * fix: _round in onnxops * fix: minor things * fix: no more n * fix: smol * fix: smoller	2024-01-25 01:18:58 -05:00
chenyu	afeadbedc9	touch up Tensor.round and Tensor.neg (#3228 )	2024-01-24 12:29:37 -05:00
geohotstan	842053873d	fix neg logical_not inconsistencies (#3222 ) * try * test: add logical_not tests * gah im retarded, but this doesn't match types for const() * fix: can't we jsut do this? * big change: I don't actually know what I'm doing * WOOO IM JUST CHANGING EVERYTHING WOW probably gon revert later * BYE BYE noqa: E501 * fix: less lines and add test * fix: rm 2 redundant tests * fix: eq with False so we don't unintentionally implicit upcast, but it's bool anyways so w/e	2024-01-24 11:48:40 -05:00
chenyu	485332935e	ring copy example (#3185 ) * ring copy example * use ones for init	2024-01-19 23:34:30 -05:00
George Hotz	c80884884e	event driven hip (#3160 ) * event driven hip * simpler, src makes copy * pass mypy	2024-01-18 14:35:18 -08:00
Max-We	0338903429	Update kits19.py (#3166 )	2024-01-18 08:33:50 -08:00
George Hotz	743b36f0ce	hotfix: copy size is in bytes	2024-01-17 16:44:15 +00:00
George Hotz	a72b1b6d65	sharding for llama (#3151 ) * shard llama * sharding works * simpler * simpler * consume option * disable that test * save a line --------- Co-authored-by: George Hotz <george@tinygrad.org>	2024-01-16 19:28:00 -08:00
George Hotz	ca0beeef38	Christopherm99 ptx (#3139 ) * get basic ptx impl working * test ops passing * mypy * dont hardcode target * more walrus * ptx in ci * bool cast and f16 load/store * weird numpy bug and f16 cast tolerance * cast half to bool * fix 1 byte load/store * disable half for ptx * fix args and enable xid * fix non-ptr args * allow bitcast * mypy * cleanups * midcast use allclose * add xor * Revert "disable half for ptx" This reverts commit 73391c05fde5f7811293f60d994417d97ab20613. * enable float16 * mypy * no more crashing in ci * fix ci * minor cleanups * use new fn for ptx compiler * no diskcache in ptx compile * use rn instead of rz * save some lines * new DEFINE_GLOBAL syntax * line length * new llvm * cmpeq * minor fix * cast in mulacc * update test_recursive_add to check line count * mypy * remove llvmir.py * fix bool const * wip * cleanups * working * llvm in separate pr * cleanups * more cleanups * fix ci * use in_features directly in nn.Linear.__init__ bound check (#3050) * use in_features directly in nn.Linear.__init__ bound check get rid of the unnecessary check of isinstance int * that is always int * long lines * Device._buffers -> Device._devices (#3052) backend devices used to be called buffers * make Embedding device aware for multigpu (#3051) * make Embedding device aware for multigpu * split line instead of igore because that's cheating * add test incomplete * add test complete * remove comment * fix white space * remove nn.Embedding * remove unused reciprocal (#3053) * remove unused reciprocal * comment * unit tests for Device.canonicalize (#3055) * add multigpu test for RMSNorm (#3056) * need all gather * add two multigpu test scenarios for RMSNorm * No extra vars call (#3054) * remove unused reciprocal * comment * remove unneeded call to vars * free speedup * explicit lazybuffer caching (#3058) * hotfix: remove useless slow assert from ShapeTracker * Speed tweaks (#3059) * base doesn't have to be a function * no double fetch * 
pop, don't check * make the gc happy * avoid hasattr * cache canonicalize * remove assert, faster base * don't redefine that every time * fix gpt2 attention with start_pos = 0 (#3061) * fix gpt2 attention with start_pos size 1 test cases taken from ll_transformer branch * fix interpreted * Tensor.cat with 0 shape tensors (#3062) * Tensor.cat with 0 shape tensors supported both 0 in cat axis (for a subset of input), or 0 in non-cat axis (all needs to be 0) * no shp * test scaled dot product attention (#3063) * add test * add initial test for scaled dot product attention * test pass for scaled dot product attention * cached size (#3060) * cached size * simplify simplify * 0 doesn't have base * fix test * cleaner cache * hmm, metal is flaky on this...might be real(ish) but useless as test * short circuit reshape/expand properly * better reshape bypass * hotfix: use is for enum compare * hotfix: use is for enum compare, a few more * speedtweaks3: apply shouldn't use the tensor constructor (#3065) * speedtweaks3: apply shouldn't use the tensor constructor * replace 0 size with CONST, not 0 in shape * update gh actions (#3033) * update checkout actions * update upload artifact * update setup python --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> * unbind view or shapetracker also returns var_val (#3067) * unbind view or shapetracker also returns var_val 4% faster for llama compile time * one line less * unbound_views * hotfix: examples/transformer.py * jit autorealizes output (#3069) * early gate the graph (#3070) * simpler idxs_to_idx (#3071) * filter_strides -> canonicalize_strides (#3072) * fix onehot and jit in examples/transformer (#3073) trained to 0.999 in < 6 seconds on M1 Max consistently * better test demonstration (#3077) * a better test demonstration * fix white space * Tensor.expand resolves the new_shape before shortcut return (#3078) similar to how reshape is done. 
also updated shrink shortcut criteria to read similar to pad * minor cleanups of lazy.py (#3080) * wmma: clean up device specific tensor core code (#3081) * mem_estimate is always int, not symbolic (#3083) * mem_estimate is always int, not symbolic op_estimate can be symbolic, but mem_estimate is always int, thus we don't need to sym_infer it. fixed some long lines too. update_stats is a very big function * operator does not need underscores * cat works (#3086) * hotfix disable flaky mac runner wino cifar (#3087) * remove the third merging state in view._merge_dims (#3085) no logic depends on state == 0 or state == 2 * minor cleanup of View.reshape (#3088) * minor cleanup of View.reshape removed some redundant logic * new_strides * revert that * use BEAM=2 instead of BEAM=4 in cuda ci gpt2 (#3089) BEAM=2 is faster and less search time. investigating why BEAM2+BEAM4 is slower than BEAM2 alone * use device from LinearizerOptions in kernel search (#3090) * use device from LinearizerOptions in kernel search removed all Device.DEFAULT in search.py * pass device string for parallel pickle * device for interpreted backends in LinearizerOptions * update jit type annotation post lazy rewrite (#3091) * add mutigpu support for llama attention (#3064) * add llama attention test for multigpu * test fails * kv cache trying to shrink on sharded axis * mask None works for scale dot product * kv cache seems to be working but scale dot product breaks * scaled dot product works, but the last linear layer failed * running into the reshape case where it could be wrong for multigpu * making sure it was the reshape * adding contiguous doesn't solve * need to shard more properly * remove reshape test * minor adjustment to scale dot product attention test * weights are sharded wrong * continue fix new weight sharding * clean up * fix attention when start_pos is 0 * remove print * add TODOs for the best mutigpu interface * bugfix do not reset shapetracker of 0 size lazybuffer (#3096) it 
might be coming from an expand, and resetting results incorrect stride. caught by interpreted backend * One hot in tensor.py (#3093) * onehot in Tensor.py * one_hot tests * works for all shapes, not just 1 * pylint * not a static method * moved around, num_classes mandatory * pylint * pylint * space & moving * formatting * moved tests * fix broadcasted logic if there's 0 in shapes (#3097) * fix broadcasted logic if there's 0 in shapes should always expand into 0, not the other way around. fixed matmul with 0 in input shapes. for forwards for now though, backward is more involved and would need to change 0 size shortcuts * fix tests * replace with tensor op (#3099) * fix gpt2 with empty prompt (#3100) logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes * Revert "fix gpt2 with empty prompt" (#3101) * fix gpt2 with empty prompt take 2 (#3102) logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes * wmma: enable METAL half tensor cores and clean up cstyle (#3095) * wmma: enable METAL half tensor cores and clean up cstyle * revert simple_matmul rand changes and break line in tensor * added metal fp16->fp32 tensor core * add half @ half to mac benchmark (#3103) * flag to profile mixtral - 1.7 tok/s now (#3104) * update NumNode.__hash__ to be hash(self.b) (#3105) with this, `a:=NumNode(x) == b` implies `hash(a) == hash(b)` * catch runtime error in search._time_program (#3106) return inf if search encountered runtime errors. * no exceptions in __del__ when module creation is failed in hip/cuda (#3107) * failed test case due to cast resets shapetracker (#3109) cast implicitly resets shapetracker and makes it contiguous (for disk tensor), which fails for Interpreted backend if inputs contain non-contiguous st. 
* cleanup ops_disk type annotation and redundant str cast (#3110) * minor cleanup of test_disk_tensor (#3112) * add Tensor.var (#3114) also updated MeanVarianceNormalization and made test_ops test tensors of var and std smaller * move sample inside jit for beautiful_mnist (#3115) also removed .realize() for jit functions since jit does it automatically now. a little more beautiful * minor cleanups of onnx_ops (#3116) * fix conversation: llama generates token not prob now (#3120) * add device options for tests in multigpu (#3121) * make DType a dataclass (#3111) * remove np from DType * convert to dataclass * remove dunder hash, eq, ne overrides from ImageDType * is dataclass required for PtrDType? * fix GPU tests * reduce lines * revert changes to np * minor cleanup * hotfix: ptrdtype compare was broken * move fromcpu out of lazy.py (#3122) * move fromcpu out of lazy.py * fix abstractions2 * remove numpy from device (#3123) * remove numpy from device * fix tests * np item * cleanups * simplify with as_buffer * no toCPU * tinygradic * cast to scalar * remove numpy from ops_torch (#3124) updated mnist test to cast label to int8 and avoid hacking cast issue of torch uint8 * Fix backward fn for `<` and `==` (#3037) * fix no grad fn for < and == * remove 2 line breaks * Remove deprecated autograd variable --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> * separate try except blocks in onnx2torch in model benchmark (#3126) exceptions can be raised from either model conversion or individual backend failed. openpilot on torch mps works, but does not work with torch cpu. seperate the expcetion block so that the benchmark can inlcude torch mps for openpilot. * update env_vars.md (#3127) mostly removed deprecated ones. 
not clear how to maintain this especially for extra/examples * update test_ptr_ne (#3130) * remove np from metal graph (#3129) * dtype fmt (#3132) * dtype fmt * three ways to access * fix off-by-one error in st_equal (#3131) * fix off by one error * whitespace * no numpy (#3134) * fast resnet eval (#3135) * fast resnet eval * fix HIP multidevice graph * neater expression for devices * lines * add decorator test * remove LLVMOPT * move ptx * Update ops_cuda.py --------- Co-authored-by: Christopher Milan <chrismilan@ucla.edu> Co-authored-by: chenyu <chenyu@fastmail.com> Co-authored-by: Yixiang Gao <yixiangg310573@gmail.com> Co-authored-by: jxdv <virgoj@protonmail.com> Co-authored-by: Francis Lam <flam@alum.mit.edu> Co-authored-by: SnakeOnex <sheeproman@gmail.com> Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> Co-authored-by: Jyotirmaya Mahanta <jyotirmaya.mahanta@gmail.com> Co-authored-by: Guy Leroy <g.m.leroy@outlook.com> Co-authored-by: Paul Gustafson <paul.gustafson@theambrusgroup.com>	2024-01-15 16:44:20 -08:00
George Hotz	a464909d79	fast resnet eval (#3135 ) * fast resnet eval * fix HIP multidevice graph * neater expression for devices * lines * add decorator test	2024-01-15 14:15:18 -08:00
chenyu	db965a0c74	remove numpy from ops_torch (#3124 ) updated mnist test to cast label to int8 and avoid hacking cast issue of torch uint8	2024-01-14 22:46:57 -05:00
chenyu	152ef7fc79	minor cleanups of onnx_ops (#3116 )	2024-01-14 02:15:24 -05:00
chenyu	a313e63a9b	add Tensor.var (#3114 ) also updated MeanVarianceNormalization and made test_ops test tensors of var and std smaller	2024-01-14 01:11:08 -05:00
Francis Lam	ddbdb52f77	wmma: enable METAL half tensor cores and clean up cstyle (#3095 ) * wmma: enable METAL half tensor cores and clean up cstyle * revert simple_matmul rand changes and break line in tensor * added metal fp16->fp32 tensor core	2024-01-12 16:25:28 -05:00
SnakeOnex	0c49d38ba7	replace with tensor op (#3099 )	2024-01-12 14:13:40 -05:00
Yixiang Gao	13e872b53f	add mutigpu support for llama attention (#3064 ) * add llama attention test for multigpu * test fails * kv cache trying to shrink on sharded axis * mask None works for scale dot product * kv cache seems to be working but scale dot product breaks * scaled dot product works, but the last linear layer failed * running into the reshape case where it could be wrong for multigpu * making sure it was the reshape * adding contiguous doesn't solve * need to shard more properly * remove reshape test * minor adjustment to scale dot product attention test * weights are sharded wrong * continue fix new weight sharding * clean up * fix attention when start_pos is 0 * remove print * add TODOs for the best mutigpu interface	2024-01-11 16:31:02 -08:00
chenyu	507e0afba0	fix onehot and jit in examples/transformer (#3073 ) trained to 0.999 in < 6 seconds on M1 Max consistently	2024-01-10 02:22:41 -05:00
George Hotz	ae83733431	hotfix: examples/transformer.py	2024-01-09 19:28:09 -08:00
chenyu	1d730b8853	remove ACCUM_FP32 in simple_matmul.py (#3045 ) * remove ACCUM_FP32 in simple_matmul.py accumulate for half inputs is always in float * move test llama compile speed to metal	2024-01-08 17:37:57 -05:00
George Hotz	c003be7309	Revert "track size in shapetracker" (#3043 ) * Revert "track size in shapetracker (#3026)" This reverts commit `a8ba1ac08f`. * st.size	2024-01-08 13:13:39 -08:00
George Hotz	c5a941d466	webgl backend in extra (#3041 ) * WebGL WIP * 84% of ops passing test * tests passing 100% * Cleanup, refactor * Shave off some lines * Work on dtypes * TestOps at 100% again * Efficient net shaders compile in browser webgl2 * Compile all efficientnet shaders in browser * Create empty textures for tensor buffers * Run program. Up next weight loading * Exported WebGL model working * Add tests, refactor * Explicit cast alu for GLSL * Fix CI tests * WebGL efficientnet demo * Compile and run yolov8 in browser * Fix imports * Simplify yolo compile * Fix boolbool and cast cmplt to float More tests * Do std tests pass on CI? * Skip std tests on CI * Remove explicit_cast_alu hack, and solve it in code_for_op * Move to new dtype-less alloc api * Remove local size hack: optimize local_size only if device has local * Remove glsl.py, and move content to cstyle * dont_use_locals in opts * Fix dtype tests * type_map in CStyleLanguage * Make core changes smaller, cleaner, refactor export_model and demo * Skip pad_slice * Simplify: render_const, render_conditional * solve bool alu for other binops, cleaner ops_webgl * Fix noopt hack * Remove some skipIfs * WebGL image hack * type_names is a better name * global_max * Fix dtype import * Fix type_names -> type_map * Fix lint * Remove webgpu, back to 5k lines (#3040) * remove webgpu * max 5000 lines * revert those to master * retain that cstyle --------- Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>	2024-01-08 09:29:13 -08:00
George Hotz	8cbcd1b342	Remove webgpu, back to 5k lines (#3040 ) * remove webgpu * max 5000 lines	2024-01-08 09:10:07 -08:00
chenyu	c9371f0d31	hotfix llama conversation mode (#3031 ) without contiguous on keys and values, it runs but the update is incorrect	2024-01-06 16:57:07 -05:00
George Hotz	a8ba1ac08f	track size in shapetracker (#3026 ) * track size in shapetracker * shapetracker adapter * size is an int * create Buffer with st.size * only compare the views for the jit * fix webgpu	2024-01-05 20:15:53 -08:00
George Hotz	60abc62a3f	fast hip read (#3014 ) * fast hip read * hip read faster * fix tests * to_mv * simplify * bump to 6k lines	2024-01-05 10:33:13 -08:00
chenyu	f88506e630	move gpt2/llama sampling inside the model call (#3013 ) * move gpt2/llama sampling inside the model call * argmax uses one more kernel	2024-01-04 17:01:50 -05:00
George Hotz	c2a044ed83	disk_read_speed example	2024-01-04 13:59:43 -08:00
Yixiang Gao	8a63f26a0f	make LR scheduler work with multigpu (#3011 ) * add a failing test for LR scheduler when using multigpu * fix calculation order and unnecessary tensor created for float * min_lr is no longer tensor	2024-01-04 12:10:56 -08:00
chenyu	6fa285b943	touchup onnx xor and not (#3008 )	2024-01-04 02:02:42 -05:00
geohotstan	57817028bb	removed redundant dtype hacks in onnx_ops (#2939 ) * updated most dtype hacks in onnx_ops * temporarily revert dequantizelinear change * I think this is right... * MORE FIXES WOOOO NEW DTYPE IS AWESOME * ok * oops missed a print * half -> float32 for CI * is npdtype * some more * fix if ordering * more clean ups * final cleanups * casting to half not allowed * k nvm * revert ArgMax change * only GPU * llvm begone * teeny tiny change * fix: attempt to add cast tests * try this * fix dequantizelinear * revert some stuff * tests pass pls * less lines in onnx_tests * oops missed string tensor tests * clean up * try: revert default behavior changes * fix: disabled Cast and Castlike tests * docs: small changes * fix: fixed isNaN op and enabled associated tests * fix: forgot about float16 * done * update disabled test * gah missed another float16 * disable rest of failing tests * rm extra line * try... --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-01-04 01:45:24 -05:00
George Hotz	7e191fbb86	hotfix: don't jitcache with 1 kernel. improvements to hip sniffer	2024-01-03 19:17:08 -08:00
George Hotz	753a7ecc05	Hip driver (#2992 ) * start hip driver * fix hip llama * make HIP default if we can * don't change those	2024-01-03 12:53:47 -08:00
chenyu	b1d9e54ea3	regenerate kernel ast dataset (#2968 ) added back the log ast function and removed hacks that work around the old dataset	2024-01-01 20:26:17 -05:00
George Hotz	a280cfe169	move dtypes to dtype.py (#2964 ) * move dtypes to dtype.py * fix urllib	2024-01-01 14:58:48 -08:00
George Hotz	c81ce9643d	move globalcounters to ops (#2960 ) * move globalcounters to ops * missed a few * sick of that failing	2024-01-01 14:21:02 -08:00
chenyu	ad4472e6e8	cleanup llama apply_rotary_emb and other helpers (#2950 ) * cleanup llama apply_rotary_emb and other helpers used ellipsis and other higher level tensor function. disabled the half @ half -> half tensor core as it fails uop dtype checks * keep hip 8x8->8 wmma	2023-12-29 11:39:15 -05:00
chenyu	61e255d197	use max for gpt2 and llama (#2949 ) not using argmax yet because there's a multinomial outside of function.	2023-12-28 23:26:00 -05:00
chenyu	820f2e054e	fix PADTO optimization (#2935 ) the correct condition is that PADTO cannot be applied to reduce axis, not Reduce.MAX in ops. even for Reduce.SUM it's possible that the reduce axis had a div before, and the padded 0 became inf then sum over it is incorrect.	2023-12-25 22:52:49 -05:00
qazal	12996d3a7d	green linearizer asserts for ops (#2800 ) * these asserts should pass * fix that assert * ALU dtypes * acc dtype for group_for_reduce * cast image ALUs to the base dtype * remove all casts from linearizer * fix argmax * fix multinomial * fix __getitem__ * Revert "fix __getitem__" This reverts commit 62ad719bfa5a2e1fcbfa931360f54897f8977602. * fix MemBuffer outputs being wrong when there is an arange + ALU with a different dtype eg. fancy slicing (int, float), bert embeddings (int, long) this should be fixed in lazy instead of having to break the kernel * cleanup argmax fix * fix matmul in ints cast in the end * fix llama * skip wrong hardcoded asts in the worlds dataset * fix llama p2 * cleanup missing parts of the diff --------- Co-authored-by: George Hotz <geohot@gmail.com>	2023-12-25 10:41:54 -05:00
chenyu	1fb815e77e	hotfix fix coder. RMSNorm cannot have float16 input (#2932 ) * hotfix fix coder. RMSNorm cannot have float16 input * update real world test due to new kernels * more type casts	2023-12-25 02:28:11 -05:00
chenyu	b469fe3723	add CMPEQ (#2931 ) * CMPEQ * work * fix onnx * fix round * fix webgpu * prettier * no PADTO in actions	2023-12-25 00:15:55 -05:00
chenyu	b55b55d56e	use at least int32 and uint32 for sum output (#2926 ) * use at least int32 and uint32 for sum output * use the correct type for acc * fix opencl * llvm mulacc	2023-12-24 01:14:54 -05:00
chenyu	50927defad	s/lazydata.realized/lazydata.base.realized/g (#2914 ) * s/lazydata.realized/lazydata.base.realized/g * not that	2023-12-22 14:45:13 -05:00
chenyu	fd0ba33b38	onnx_ops formatting cleanup (#2904 ) also removed a case in safe_numpy that always convert 0-dim array to 1-dim	2023-12-21 20:06:06 -05:00
chenyu	8a04107d30	move the op casting logic from mlops to tensor try 2 (#2887 ) * unary works * where works * add sub mul * xor div * CMPLT * sparse_categorical_crossentropy * image const * sparse_categorical_crossentropy	2023-12-20 23:50:37 -05:00
George Hotz	7da2325dc7	get_lazyops() -> lazyops (#2884 ) * get_lazyops() -> lazyops * don't compare empty mem	2023-12-20 18:04:49 -08:00
George Hotz	64dded27f0	pad ops broke coder (#2881 ) * pad ops broke coder * that contiguous fixes it * Update lazy.py	2023-12-20 17:03:41 -08:00
George Hotz	1765849937	new lazy, benchmark (#2878 ) * lazy rewrite, try 2 * min fix tests * pass contig test * put broken pads back * move that to realize * no contig child fixes array packing * so wrong * now that's correct * base children * fix bind issues * disable to_image_idx * fix tests * that failure shouldn't break other tests * more fixes * fix torch * skip failing tests in CI * 1e-7 * half is broken * 1e-6 margin of error	2023-12-20 14:33:21 -08:00
geohotstan	fec8e9060c	Add simple fancy indexing exceptions (#2706 ) * fancy indexing raise error * updated error message * improved error check * oops * fixed onnx * oops typo * merge * add full_flatten * try * merged and updated some tests * more cleaning * done * temp fix onnx * try * add todo in onnx_test * reword * gah	2023-12-19 11:23:51 -05:00
chenyu	73cadfbb3c	Remove pytest markers (#2831 ) * remove pytest marker * fix some, skip some * tweak * fix * skip slow * skip more	2023-12-18 18:53:28 -05:00
chenyu	0723f26c80	dtypes.default_float and dtypes.default_int (#2824 )	2023-12-18 12:21:44 -05:00
Rory Clear	f409b57854	update metal matmul and matvec for new device style (#2732 ) * update for new device style * create device before compile --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2023-12-17 16:15:07 -05:00
George Hotz	bad0ff60b7	start Qualcomm GPU driver (#2804 ) * hooking works * working * qcom work * parsing command buffers * proper parse	2023-12-16 23:10:50 -08:00
chenyu	157c0be509	cleanup onnx, pass one more reshape test and remove some casts (#2806 )	2023-12-16 20:40:43 -05:00
chenyu	765f8b05e5	TernaryOps.WHERE has vin[0] as bool and BinaryOps.CMPLT always outputs bool (#2782 ) * vin[0] to where is always bool * due to better hack * update test * fix test_uops	2023-12-15 14:51:51 -05:00
chenyu	c0f76ed4ea	transformer kvcache and mask have same dtype as input (#2771 ) * transformer kvcache and mask have same dtype as input * don't use `=0` in cstyle ternary where * (bool) * where float16 test	2023-12-14 22:41:51 -05:00
chenyu	66d9eb10b6	arange default dtype to int and zeros/ones default to float (#2769 )	2023-12-14 17:53:00 -05:00
chenyu	57017c87e9	remove duplicated dtype in DEFINE_GLOBAL args (#2768 ) now DEFINE_GLOBAL uop.arg[1] is always the same as uop.dtype, we can remove the one in arg and just use uop.dtype	2023-12-14 15:42:36 -05:00
chenyu	5235cdee3d	remove _arg_int32 internal type (#2767 ) in DEFINE_GLOBAL, PtrDtype(int32) is buffer and int32 is int	2023-12-14 14:17:14 -05:00
chenyu	8a2a2257b4	minor onnx_op cleanups to prep dtype changes (#2764 ) * minor onnx_op cleanups to prep dtype changes read through it and clean some minor stuff * revert embedding - is it really being tested	2023-12-14 13:01:27 -05:00
chenyu	64fea9ff4a	Revert "minor onnx_op cleanups to prep dtype changes (#2758 )" (#2759 ) This reverts commit `38da001b64`.	2023-12-14 03:12:14 -05:00
chenyu	38da001b64	minor onnx_op cleanups to prep dtype changes (#2758 ) read through it and clean some minor stuff	2023-12-14 03:05:59 -05:00
Nguyen Nguyen Phuong	07cf45e133	fix cuda matmul (#2725 )	2023-12-12 07:59:31 -08:00
George Hotz	b5fd160b39	hotfix: increase rtol on simple_matmul	2023-12-11 10:10:29 -08:00
George Hotz	b3982187d1	Mixtral Example (#2691 ) * mixtral * simpler * global counters * simpler * weights arg	2023-12-10 17:18:31 -08:00
chenyu	181b0970b5	slightly better extra/to_movement_ops dedups (#2695 )	2023-12-10 11:05:44 -05:00
chenyu	ef18d79faa	remove noop from to_movement_ops (#2693 )	2023-12-10 00:50:24 -05:00
George Hotz	4164d0ebbd	multitensor start (#2676 ) * multitensor work * early gen fixes the tests * atol for flaky test	2023-12-07 17:07:05 -08:00
chenyu	539b00a645	move llama getenv("JIT") from models to examples (#2671 ) Transformer class has a jit param so we should use that in the caller	2023-12-07 12:43:22 -05:00
George Hotz	a73579919f	mlx benchmark, a lil slower than tg	2023-12-05 19:00:43 -08:00
qazal	be09cc87c1	Bitcast support / fast bf16 load (#2011 ) * bitcast renderers * fast llama load * make it one kernel * regression testing p1: re-enable test_dtype for all backends fix GPU * regression testing p2: fuzz all possible cases against numpy remove hardcoded tests since the fuzzer covers them * define ushort * fix indent, probably need flake8 back for CI to catch --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-12-05 16:19:28 -08:00
George Hotz	232ed2af3f	more test cleanups (#2631 ) * more test cleanups * move test example back	2023-12-05 16:17:57 -08:00
George Hotz	0be5d16950	only 62 gflops (#2629 )	2023-12-05 13:28:24 -08:00
chenyu	6ba6349c97	JIT=0 llama.py should not jit (#2609 )	2023-12-04 20:21:07 -05:00
Yixiang Gao	fde44aed76	update hip_matmul with new abstraction (#2605 )	2023-12-04 13:37:10 -08:00
qazal	4380ccb169	Non fp32 math (#2264 ) * `global_load` and `global_store` using buffer dtype * `UOps.PHI` in all dtypes * `UOps.ALU` in all dtypes * `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes * -- endof implementation -- +tiny lint changes * these tests require the fp16 extention you can run them locally to confirm they're green: (GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261) `GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul` skip the new test_linearizer_failures in CI GPU because of the fp16 extention This passes on a real GPU since the extention is available: `GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8` see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644) * these tests fail in CI due to segfaults and CPU crashes To confirm they're green locally, you can run the following commands: 1. For the tests skipped in test_ops.py (note: CLANG is very slow) `for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done` 2. 
For the ONNX tests skipped in CLANG: ``` CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \ 
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \ test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu ``` 3. 
The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific `LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu` * Revert "these tests fail in CI due to segfaults and CPU crashes" This reverts commit 15db57014381a4449d563526ac6c870e36257658. * merge with cleanup-vectorized-hip-renders * barely working HIP P1, ALU ops need a refactor? * manage the fact that in HIP [half2 is actually an unsigned int vec](`f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)`) and half is a totally different __half that [has an unsigned int element in it](`f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)`) but can't be accessed [because it's private](`f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)`). If you just do this: ``` half2 val0 = // ... half val1 = // ... ``` then you can't do: ``` val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half')) ``` * update the sign definition to avoid division by zero in all dtypes * diff cleanup p1: why were these in the diff anyways * less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI! add ALU ops overloads for HIP this will make HIP max work handle mod Revert "handle mod" This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933. update max to use hmax add HIP GEP render logic enable CIFAR fp16 benchmark test ops for HIP back to store as float because this only works for float4 grouping right now test_ops for hip!! 
always sign * back to the sign we had before because we cant do a backward pass on a Less node * remove old hacks HIP compiling test_ops in CI takes ~9 mins, not doing it for now new HIP ALUs * reduce accs done right * refactor to function * no device hacks hacks p2 the other way * LLVM ALU ops half, float and double are all float update max * update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool * cleanup LLVM wrong code * dummy change for the CUDA install glitch --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-12-03 13:45:49 -08:00
qazal	99ee2ec37a	Refactor code_for_op to accept a dtype (#2555 ) * update cstyle renderers to take a dtype in code_for_op * implement NEG for bools in LLVM * update triton --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-12-01 22:05:28 -08:00
George Hotz	4c984bba7e	bump version to 0.8.0, clean CI, remove requests (#2545 ) * bump version to 0.8.0, clean CI, remove requests * why was that even there	2023-12-01 10:42:50 -08:00
nimlgen	badc97f824	hip & cuda to gpuctypes (#2539 ) * cuda with gpuctypes * hip gpuctypes * graphs * rename + linter happy * use cpu_time_execution * no ji in build_kernel_node_params * remove hip_wrapper * hip fix * no arc * small changes * no clean module in cudacpu	2023-12-01 09:25:27 -08:00
chenyu	7fec966b5e	bye bye NOOP (#2534 ) * bye bye NOOP * SIN * NEG	2023-11-30 23:10:35 -08:00
Matthias Kronberg	5394a05b9d	Fix: Get item from ndarray before casting to int (#2525 ) Directly casting is deprecated and will error in the future.	2023-11-30 18:34:31 -08:00
George Hotz	2c363b5f0b	new style device (#2530 ) * cpu tests pass * torch works * works * metal works * fix ops_disk * metal jit works * fix openpilot * llvm and clang work * fix webgpu * docs are rly broken * LRU works on metal * delete comment * revert name to ._buf. LRU only on Compiled * changes * allocator * allocator, getting closer * lru alloc * LRUAllocator * all pass * metal * cuda * test examples * linearizer * test fixes * fix custom + clean realize * fix hip * skip tests * fix tests * fix size=0 * fix MOCKHIP * fix thneed * copy better * simple * old style metal copy * fix thneed * np reshape * give cuda a device	2023-11-30 17:07:16 -08:00
Davi Silva	ddeec24fa8	Cleanup & fix llama.py (#2524 ) * docs, cleanup crap * comma AI * fix 70B * this is why lexical scope exists	2023-11-30 16:00:17 -05:00
George Hotz	6707f2588e	use copyin (#2500 ) * it's always copyin * all RawBuffer are RawBufferCopyIn * cleanups * this fixes it * requirements='C' * more correct	2023-11-29 09:34:00 -08:00
George Hotz	5629fc368c	Use Buffer.STORE at the end of ASTs (#2494 ) * work * store broken * interpreteds work * this passes * symbolic cpu * fix tests * fix opt tests * images fail * fix InterpretedFlopCounter * stupid hack for images	2023-11-28 20:11:37 -08:00
Jake	5588922884	Update cuda_matmul.py (#2495 )	2023-11-28 19:46:01 -08:00
George Hotz	d87a246439	move to new cached fetch (#2493 ) * move to new cached fetch * extra.utils is over * loads * bump download cache * bump timeout	2023-11-28 17:36:55 -08:00
George Hotz	ab5d14d4ba	MEM -> LOAD (#2492 ) * MEM -> LOAD * keep legacy working	2023-11-28 16:46:37 -08:00
George Hotz	3f137b134a	jax parallel matmul example	2023-11-28 13:48:11 -08:00
Davi Silva	186ac77ec3	Update hip_matmul.py (#2480 )	2023-11-27 18:36:19 -08:00
George Hotz	9e07824542	move device to device.py (#2466 ) * move device to device.py * pylint test --disable R,C,W,E --enable E0611 * fix tests	2023-11-27 11:34:37 -08:00
George Hotz	7170a9a057	coder.py can write and run code (#2439 ) * wip mistral * coder * touchups * cleanups * mistral cleanups * clean up cache create * download the weights, fix tests * fix llama loading * global fixup * clean up all * move llama model * cleanups * Revert "cleanups" This reverts commit a71c5d59eb86290634a258704d8bab2378b8d63d. * fine, leave it	2023-11-25 12:27:54 -08:00
George Hotz	8ff2e13550	From teeny (#2426 ) * changes from teenygrad work * support not supporting ImageDType/PtrDType * fixups from teeny	2023-11-24 12:50:56 -08:00
nimlgen	e68aebfff9	bring hip graph back (#2385 ) * bring hip graph back * share with metal * fix linter * remove hasattrs * Update ops_hip.py * hip wrapper does not use _buf --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-11-24 07:53:44 -08:00
George Hotz	12023b6824	onnx ops cleanup (#2413 ) * onnx ops cleanup * revert those	2023-11-23 18:39:49 -08:00
George Hotz	095e2ced61	add name support to fetch (#2407 ) * add name support * use fetch in gpt2 * remove requests from main lib, networkx also optional * umm, keep that assert * updates to fetch * i love the walrus so much * stop bundling mnist with tinygrad * err, https * download cache names * add DOWNLOAD_CACHE_VERSION * need env. * ugh, wrong path * replace get_child	2023-11-23 14:16:17 -08:00
George Hotz	0505c5ea50	remove force_wait, refactor to graph (#2405 ) * remove force_wait * refactor * get rid of stupid ASTRunner * fix del in diskbuffer * BufferOps.FROM_UNDERLYING * put offset in the rawbuffer * fix bugs * use exec	2023-11-23 12:46:07 -08:00
George Hotz	4f8f0ac139	minor cleanups, remove dead files (#2398 ) * minor cleanups, remove dead files * s.name * use disk * pytest passes on mac	2023-11-23 09:01:50 -08:00
George Hotz	66c75f30c6	remove triton (#2396 )	2023-11-23 07:40:59 -08:00
chenyu	8798d120bb	autopad shapetracker for BEAM (#2375 ) * autopad shapetracker for BEAM * OptOps.PADTO * skip that test for now * correct padding reduce axis * just 32 * avoid more than double the FLOPs * cleanups * test case * no support for triton and llvm yet * typos * symbolic shape would not work * cannot PADTO with MAX kernel * advance db version * no breaking change - don't advance db version * is triton just python? * Revert "is triton just python?" This reverts commit 17e776c25587615e33a3634c2fb0bb8591ce65d4. * Revert "Revert "is triton just python?"" This reverts commit 6c434c01e1c4b0ea0431ec18632cd859fb3cf260. * support llvm * is it really passing in CI only? * update tests * oh triton test passed * simpler * revert that, with a test * check if st are the same * Revert "check if st are the same" This reverts commit d2a5eac110a5da1af82a2728c883779ef69c3cad. * update the db version * rebase artifact	2023-11-22 21:05:25 -05:00
qazal	0eda545946	dtypes.float.vec(sz) (#2386 ) * replace all _dtypen with dtype.vec(n) fix: print works * conceptual refactor of cstyle render_load logic * linearizer GEP is explicit that its dtype is the scalar version of localtype * vectorized global_store and load don't need a conditional	2023-11-22 17:43:14 -08:00
George Hotz	cbb8486779	ResNet training changes (update benchmark) (#2390 ) * default arg for chunk * bring back to_ * good changes * new set * unused hash * fix optim * new torch loader * fix test lr scheduler	2023-11-22 17:41:12 -08:00
wozeparrot	abbcc7aefa	missed cleanup from cache_id removal (#2376 )	2023-11-21 01:03:43 -05:00
George Hotz	a0890f4e6c	move fetch to helpers (#2363 ) * switch datasets to new fetch * add test_helpers * fix convnext and delete old torch load	2023-11-19 12:29:51 -08:00
chenyu	d7d078c7f9	Node.vars() returns a set and properly dedup (#2356 ) * dedup RedNode.vars() * vars returns a set * fix more vars * unused import * update to_movement_ops * comment	2023-11-18 17:44:52 -05:00
George Hotz	40246d35bc	ops_shm removed (#2351 ) * ops_shm removed * buf.cast * err, forgot those	2023-11-18 11:41:58 -08:00
George Hotz	c7b38b324b	A beautiful MNIST training example (#2272 ) * beautiful mnist * beautiful mnist example * from tinygrad import Tensor * more beautiful * the jit is super core tinygrad * globalcounters reset on jit run * symlinks and exclude * beautiful_cartpole * evaluate is its own function * no symlinks * more beautiful * jit reset for double speed * type hinting for JIT * beautiful_mnist gets 98% * beautiful_mnist < 4s with BEAM=2 * better cartpole * use actor critic * zero_grad got lost * delete double relu * stable cartpole with PPO * beautiful_cartpole is more beautiful * REPLAY_BUFFER * beautiful stuff typechecks * None support in shape * hp tuning	2023-11-17 19:42:43 -08:00
chenyu	d2c0035c73	add back as_strided, move rebuilt mops to extra (#2344 ) * add back as_strided, move rebuilt mops to extra * negative stride for ops_cpu * Revert "negative stride for ops_cpu" This reverts commit a13b6815ac31478d31ae71c26f4d4e4d274bf155. * skip that * style	2023-11-17 14:34:30 -05:00
George Hotz	652d2de256	wow how did i think that was okay (#2339 )	2023-11-16 21:21:11 -08:00
chenyu	822d6e6f18	Simpler mops verify (#2325 ) * rewrite the to_movement_ops check using symbolic * tweak	2023-11-15 21:47:18 -05:00
forcefieldsovereign	b64738e1d6	Remove AS_STRIDED from shapetracker (#2216 ) * very close * remove comment * negative strides working * almost everything passes * calculate offset with list comprehension * some cleanup * got disk load working * review suggestions * fix after merge * overlap working * did it * clean * fixed disk load * lint * mypy * removed as_strided * trying without simplify * added back simplify * make sure expanding to smaller shape * cleanup * removed comment * removed env file * trying whisper test again * onnx test sqlite issue * working on test * finished test * eliminate unnecessary shrink-then-pad * don't shrink buffer * added strides check * added to ci under linters * switch issue * allow symbolic stride * removed .env * isinstance * adjust strides for double expand * cleanup * needed to add type hint for mypy * set pythonpath	2023-11-15 15:50:17 -05:00
geohotstan	3c5a51fb3a	aaaaaaa finally (#2310 )	2023-11-15 07:12:38 -08:00
George Hotz	4f7b1ac0d2	cleanups before interpreted jit (#2306 ) * jit mnist * InterpretedFlopCounter doesn't rely on Interpreted * allocator for cpu and torch * types for exec_ast * fix type issues * fix onnx, remove print * always self.from_underlying	2023-11-14 21:44:25 -08:00
nimlgen	4e0d47533e	beam works with var vals (#2296 ) * beam works with var vals * test passes now * better comment * linter happy	2023-11-14 13:03:19 -05:00
George Hotz	0cbf6c1811	move things, clean up extra (#2292 ) * move things * idk why pylint needs that now * delete unused	2023-11-13 20:18:40 -08:00
George Hotz	b1f7f29525	metal indirect command buffers (#2285 ) * metal indirect command buffers * sub 1ms gpt * metal batch exec is good * remove whitespace * input_replace * fix ci * useResources * very simple cacheallocator * update_stats * fix CI * minor * remove that from jit	2023-11-13 17:58:26 -08:00
rodfer	53c5baa8b6	add dilation to avg_pool2d (#2270 ) * add dilation to avg_pool2d * avg_pool_fix * avg_pool_fix * woo * oops * force it correct --------- Co-authored-by: rodfer0x80 <rodfer0x80@proton.me> Co-authored-by: zibokapi <zibokapi@gmail.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-11-13 08:47:56 -08:00
valar	123ea051e6	refactor/ci: delete many `# type: ignore` (#2281 ) * refactor/ci: delete many `# type: ignore` * replace `axis.__class__ is int` with `isinstance(axis, int)` to make mypy happy * add `--warn-unused-ignores` to mypy flag refs #2240 * ci: move `--warn-unused-ignores` flag to mypy config refs #2240	2023-11-12 11:04:20 -08:00
geohotstan	b853e9bb8c	Onnx 1.15.0 gogogo (#2217 ) * lol * lol * add GELULULULUL * onnx 1.50 * fuk torch bool neg * exclude regex tests * exclude dequantizelinear for now * is sunny in philly * damn it affinegrid * fixed auto_pad VALID * skip 0 shape tests * add temporary cast in Reduces * tests should pass now * added comments and cleanup * try moving dequantizelinear to onnx.py * fixed dequantizedlinear? * cleanup * try? * float16 segfaults LLVM CI..??? * cleanup comments * pin to 1.50.0 * remove use of -np.inf cuz numpy is kill * 1.50? lol I'm actually retarded * thx for review, muhbad * moved Gelu higher up	2023-11-10 15:36:48 -08:00
chenyu	a753c8e071	examples of new GPT2 and JIT change (#2261 ) * var_vals are global * working with global ish * better * fix export model * fix tests * better kv cache * does it run? * use where for kvmask * fix excessive var_vals * fix import * how does multigpu use this? * llama kinda work * faster and simpler * cleanup * fix conversation mode * test cleanups * fix one more test * test cleanup --------- Co-authored-by: George Hotz <geohot@gmail.com>	2023-11-10 15:07:02 -05:00
George Hotz	80bf0b8586	proper wmma (#2245 ) * proper wmma * hip cast * bugfixes * bugfix * that bug is fixed --------- Co-authored-by: George Hotz <george@tinygrad.org>	2023-11-09 15:15:18 -08:00
wozeparrot	4c44d1344b	feat: remove cache_id (#2236 )	2023-11-08 08:09:21 -08:00
Rory Clear	553688f12a	update metal matmul and matvec for compile api (#2238 )	2023-11-08 08:08:35 -08:00
George Hotz	2f7aab3d13	move optimize_local_size (#2221 ) * move optimize_local_size * interpret_ast	2023-11-05 21:00:52 -08:00
chenyu	f582ec56d5	Replace (getenv("CI", "") != "") with helpers.CI (#2213 )	2023-11-03 15:20:44 -07:00
George Hotz	f17bc16f46	simple runtime args (#2211 ) * simple runtime args * fix some tests * fix abstractions and triton * fix search	2023-11-03 12:31:29 -07:00
George Hotz	ddbc6eecaf	some refactors in the realization (#2206 ) * some refactors * delete old kernel search	2023-11-02 19:51:28 -07:00
George Hotz	03cf0afa4f	move all to compile api (#2203 ) * move metal+clang to compile api * all to the new style * remove binary arg * fix triton * fixup tests * fix clang * diskcache is generic * __wrapped__ * compile_gpu * fix thneed * keep the src in the ASTRunner * lib * move compile_gpu * compile_gpu in device * put compiler in astrunner * test reverts * triton compiler * ugh, that too	2023-11-01 23:01:32 -07:00
George Hotz	8932816816	remove arm64, caching for cuda (#2201 ) * remove arm64, caching for cuda * caching in llvm * switch cache_compiled to new cache * fix clang * caching for metal * fix pylint * cleanups * perf_counter and binary	2023-11-01 18:44:00 -07:00
George Hotz	7103b716c4	merge kernel and optimizer (#2200 ) * merge kernel and optimizer * linearize is reentrant * move global/local size * clean up linearizer copy * remove unneeded lin copies * stop linearizing twice * oops, that should be None	2023-11-01 15:20:01 -07:00
George Hotz	33bb650e94	use mad in opencl (#2198 ) Co-authored-by: Comma Device <device@comma.ai>	2023-11-01 10:40:08 -07:00
Comma Device	2e9982fe2d	fastvits example that's 10% faster	2023-10-31 21:48:23 -07:00
George Hotz	8ba7ced7f9	extract const if it's const (#2193 ) * extract const if it's const * fix if statement * fast math issue * fix graphing and casting * disable flaky copyout test	2023-10-31 18:52:35 -07:00
George Hotz	5aaa8a0cc1	fix shape	2023-10-31 11:36:19 -07:00
George Hotz	a27c9f9de5	openpilot compile2 (#2189 ) * try compile2 * pass to thneed * fix tanh onnx	2023-10-31 11:08:58 -07:00
forcefieldsovereign	f294bdd681	fixed imports (#2185 )	2023-10-30 22:07:17 -07:00
Akshay Kashyap	018bd29e37	Enable Multi-Output Export (#2179 ) * Enable Multi-Output Export * Add test * Update examples and lint * fix padding * test ops * dummy commit to rerun test * revert cuda lint * Enforce tuple/list of tensors * subscripted generics * put back webgpu test * Re-enable WebGPU Efficientnet test	2023-10-30 18:42:26 -07:00
chenyu	6c58bf3e9c	in time_linearizer, allocate a scratch buffer if output buffer is also input (#2152 ) * in time_linearizer, allocate a scratch buffer if output buffer is also input * move scratch buffer creation outside search	2023-10-28 07:17:41 -10:00
George Hotz	e0201922e3	Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142 ) * stable diffusion < 324ms * revert swap action * fix tests due to more sum splitting * REDUCEOP_SPLIT_THRESHOLD env var * added from unaligned np test (#2134) * align cpu buffer before copy into cl buffer (#2135) * remove shelve from handcode_resnet50_opt.py (#2139) * Add dictionary keys to reduce db size (#2131) * work * ignore beam cache * dictionary keys are generic * minor db cleanups * fix baseline and extract dataset * fix training * log likelihood * more lin to feats * sts * training policynet * net sort of works * dedup * refactor, stupid new actions * fix uops deduping * BEAM_ESTIMATE --------- Co-authored-by: chenyu <chenyu@fastmail.com> Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>	2023-10-27 10:53:06 -10:00
chenyu	0ca0e9ee5e	exclude ast with variables from beam search (#2140 ) * exclude ast with variables from beam search * test that * add to CI	2023-10-25 16:35:29 -04:00
wozeparrot	c29653605e	hip multigpu training (#1878 ) * feat: move to hip * feat: special path for RawBufferTransfer * feat: initial rawbuffertransfer * feat: hip ipc * feat: working hip ipc * feat: need to base device without args * feat: close mem handle * feat: modified test * feat: more multihip stuff * clean: cleanup * feat: cleaner * feat: don't crash * feat: test more * clean: way cleaner hip wrapper * feat: barrier * feat: barrier * feat: this breaks stuff * feat: we can use empty here * feat: maybe fix tests * feat: maybe fix tests again? * fix: probably fix tests * feat: no waiting here * feat: wait here * feat: much larger test * feat: need to sync here * feat: make this async * feat: no waiting! * feat: cut here * feat: sync copy * feat: random imports * feat: much cleaner world * feat: restore this * feat: restore this * clean: cleanup * feat: set this	2023-10-24 17:35:53 -04:00
nimlgen	2e89fd264f	Refactor hipgraph (#2141 ) * refactor hip graph * linter happy * happy linter	2023-10-24 15:45:56 -04:00
George Hotz	cea2bc7964	Add dictionary keys to reduce db size (#2131 ) * work * ignore beam cache * dictionary keys are generic * minor db cleanups * fix baseline and extract dataset * fix training * log likelihood	2023-10-24 10:49:22 -04:00
George Hotz	6dc8eb5bfd	universal disk cache (#2130 ) * caching infra for tinygrad * nons tr key * fix linter * no shelve in beam search * beam search caching * check tensor cores with beam too * pretty print * LATEBEAM in stable diffusion	2023-10-22 10:56:57 -07:00
George Hotz	abeba8f1fc	optimization: get actions in CI (#2125 ) * get actions in CI * actually run the test * pythonpath	2023-10-20 12:22:01 -07:00
Sean D'Souza	999c95ea29	fix: hlb cifar types (#2099 )	2023-10-17 19:23:50 -07:00
Ahmed Harmouche	2b5ea7d9cb	Fix output Float32Array size in webgpu export (#2096 )	2023-10-17 15:28:19 -07:00
Szymon Ożóg	4bef1591f0	Disable ocelot cache + fix matvec in triton (#2010 ) * Revert "disable flaky triton test" This reverts commit `1e15fdaee7`. * Update test.yml * check if has shared for matvec * disable ocelot cache for triton * disable ocelot cache * disable ocelot cache * pass shared to triton uops tests * temporary debugs for CI crash * Revert "temporary debugs for CI crash" This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542. * Revert "triton isn't tested, and allows this refactor (#2007)" This reverts commit `dea8bb0938`. * add runtime_args to every renderer, move triton local size override to runtime args * Add binary to args, correct type returned * update to new loops * Update test.yml	2023-10-17 10:33:32 -07:00
geohotstan	5ed630204b	Add ONNX to CI for other backends (#2069 ) * some cleanup * move continue back * more more more * added to CI * try * try intentionally break some tests * wtf * del True for test * yay tests broke, now pls no break * try AGAIN * gahy * lol * try * move over constant * moved over MORE * move shrink over * trailing lines * try CUDA CI * try again * boom * oops * improved comments * try: disable some flags and disable CUDA * try breaking tests * traceback has too much info so add --tb=no * revert forced CI failure * add comments and del unused imports * oooooooo using regular debug try enable tb * intentionally break tests * added tb back. Maybe not too verbose * strip whitespace * missed something * Shape op int32 -> int64 * oops missed something * add some types * get rid of crazy 1 liners in pad op * actually test Split this time LOL * strip that whitespace	2023-10-17 09:33:54 -07:00
George Hotz	1bf4aef0f5	fix image dtype cmp (#2089 ) * fix image dtype cmp * print that with debug 3	2023-10-16 17:52:38 -07:00
George Hotz	a7b18ac325	try beam search on device (#2085 ) * try beam search on device * fix beam with nolocals * ops too --------- Co-authored-by: Comma Device <device@comma.ai>	2023-10-16 12:52:42 -07:00
George Hotz	c36d306606	KOPT is over, BEAM is upstream (#2071 ) * create cache for q learning * make linter happy * global beam * where it belongs * bugfix * ditch the kopt, use the beam * faster lin and DEBUG=2 okay * remove kopt, move search to features	2023-10-16 09:46:03 -07:00
George Hotz	5472a14544	openpilot compile2 (#1977 ) * start compile2 * tweak * why are there two more kernels? * minor cleanups * don't break onnx tests * add __metadata__ support to safetensors * no early realize in onnx * cleanups * bugfix * clean up image type, add optimize * opt to match old * try that * opt work * run compile2 * optimizer * prt more * prerealize * imp * NOLOCALS works * no locals means no locals * support fractional globals * all locals welcome * int that * cleanups * show gemv regression * clean up diff * use idx for the cond * nolocals --------- Co-authored-by: Comma Device <device@comma.ai>	2023-10-15 20:39:46 -07:00
George Hotz	49bcfec383	0s in the action space (#2070 ) * 0s in the action space * simpler * skip duplicate actions	2023-10-14 11:22:48 -07:00
George Hotz	4124cf1df5	cleanup tensor cores, expose exclude local upcast (#2064 ) * expose exclude_local_upcast * convert apply tensor cores to ops * update comment * put LOCAL back to what it was, BEAM is better than way	2023-10-14 09:21:03 -07:00
George Hotz	90c777d815	remove apply_auto_opt (#2063 )	2023-10-13 07:44:14 -07:00
George Hotz	6f1810af2d	with unroll, the action space goes from 161 -> 127 (#2060 ) * with unroll, the action space goes from 161 -> 127 * more reliable instrumentation * beam search is so op * beam bugfix	2023-10-12 20:52:23 -07:00
George Hotz	c5edb3c374	train value net, improve API, add BCE (#2047 ) * api cleanups, BCE losses * valuenet * fixup examples * learning okay * add valuenet runner * net improvements * net improvements * 40% win rate	2023-10-12 07:56:38 -07:00
George Hotz	0ba629c7b9	add world dataset (#2045 )	2023-10-11 15:54:30 -07:00
George Hotz	0c3b6f13a8	Latest opt (#2044 ) * split out actions * rl algorithm	2023-10-11 15:46:14 -07:00
George Hotz	41bfeb2c1e	start work on auto opt (#2034 ) * start work on auto opt * lin failure * not beating hcopt * greedy * timing is fast * codegen.search * greedy search in handcode_opt * track running gflops * clean up those files * no failure	2023-10-11 12:54:53 -07:00
chenyu	1c980517c5	s/var_vals_from_ast/vars_from_ast (#2038 )	2023-10-10 20:21:55 -07:00
George Hotz	f139060103	Rewrite hand coded opt with action space (#2030 ) * tests passing * hand coded opt with new abstractions * simpler opts * split out tensor cores	2023-10-10 07:38:38 -07:00
George Hotz	16ca8410f8	op logger + replay (#2021 ) * logops * fix dtype printing * needs inf * ops dataset * minor improvements * 12k kernels * opt can compile * graph flops	2023-10-08 15:10:18 -07:00
George Hotz	8db92bd060	fix tvm gemm example	2023-10-08 05:57:41 -07:00
Francis Lam	dece9958f8	wmma: clean up to make WMMA arg order consistent (#2014 ) also add cache defeat to extra/gemm/simple_matmul.py	2023-10-07 17:45:40 -07:00
George Hotz	6ee9cae44f	don't extract CIFAR every time / use the cache	2023-10-07 12:33:50 -07:00
George Hotz	dea8bb0938	triton isn't tested, and allows this refactor (#2007 ) * triton isn't tested * cuda buffer	2023-10-07 07:29:59 -07:00
Roelof van Dijk	26fcc8dff6	fix: remove runtime imports (#1982 ) fix: import what is used probably monkeypatched fix: import revert selective import	2023-10-07 05:23:08 -07:00
George Hotz	f54959e5cd	move print tree into graph (#2003 ) * move print tree into graph * add winograd profiling test * change pre-commit to run ruff first	2023-10-07 04:39:21 -07:00
Ahmed Harmouche	2114dc13d1	Allow multi-input model export (#1995 ) * Allow multi-input model export * Add model export unit test * Fix efficientnet compilation * Only run model export test on JIT supported devices * Skip export model test if not EXPORT_SUPPORTED_DEVICE	2023-10-07 04:13:34 -07:00
George Hotz	ffa33d743a	good changes from openpilot_compile2 (#2000 ) * good changes from openpilot_compile2 * float32 image type was wrong * cleaner way to write that + a test	2023-10-06 13:33:24 -07:00
Francis Lam	0ba75c4370	optimizer: add matvec optimizations (#1972 ) * optimizer: add matvec optimizations * renderer: fix alignment of shared memory in opencl	2023-10-04 14:16:27 -07:00
George Hotz	de5d603ec1	corealize + remove realize from lazybuffer (#1968 ) * corealize + remove realize from lazybuffer * fix multigpu * fix graph	2023-10-04 10:59:31 -07:00
nimlgen	2ea1dd3e87	no process() in Linearizer (#1966 ) * no process() in Linearizer * more process() clean up	2023-10-04 07:18:42 -07:00
George Hotz	717451a244	Revert "optimizer: add matvec optimizations (#1753 )" (#1959 ) This reverts commit `f520323054`.	2023-10-03 00:28:42 -07:00
Francis Lam	f520323054	optimizer: add matvec optimizations (#1753 ) * optimizer: add matvec optimizations * Update optimizer.py --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-10-03 00:01:59 -07:00
David Hou	8e9db88474	expand after expr_idxs in Linearizer.global_load (#1818 ) * small changes * expand in terms of substitute, directly expand g_idxs g_valid * delete expand_ops * don't compare using hash * any instead of in thanks gijskoning Co-authored-by: Gijs Koning <gijs-koning@live.nl> * support tc * testing code * no more create_rednode * maxsize none in view/node * oops * undo * typing * oops * oops * lmao * lmao * add expand multi test * Node.iter_idxs * type * type * delete checks! * clean up a little? * expand_idx in symbolic * un-golf * play around with types >.> * test_substitute and also remove an incorrect test? * get rid of range * Update symbolic.py * split out view cache change * split out flat components change * reduce diff * reduce diff * add some float4 tests * fix --------- Co-authored-by: Gijs Koning <gijs-koning@live.nl>	2023-09-29 10:33:34 -07:00
Francis Lam	f445e056ed	wmma: add test and tensor core shape (#1925 )	2023-09-28 18:04:28 -07:00
Yixiang Gao	094d3d71be	with Tensor.train() (#1935 ) * add with.train * remove the rest TODOs * fix pyflake * fix pyflake error * fix mypy	2023-09-28 18:02:31 -07:00
George Hotz	c36d0e3bd8	tvm import hook	2023-09-28 09:24:32 -07:00
George Hotz	adab724caa	schedule2, keep the tests working with small changes (#1932 ) * lazy cleanups * ast functions take in LazyOps * op instead of self.op * _base for mops * fix contiguous * start schedule * test_schedule * fix openpilot * more tests * bugfix and test skip * work * make sure things get freed * fix zerosized tensors * fix failing test * fix ceil and friends * fix openpilot * disable training * disable test collectives	2023-09-28 09:14:43 -07:00
nimlgen	45f02393f0	HipGraph support (#1880 ) * init hip graph * optimize args update * cache symbolic in jit * remove NOSTAT * init BasicBatchExecutor * symbolic infer cache per jit instance * basicbatchexec is defualt for compiled * batch_exec is taken from ASTRunner * no infer cache * batched execution of hip graph * add comment about hip graph batches * readable hip graph	2023-09-24 20:14:36 +08:00
Szymon Ożóg	58296c079d	Make Triton work again (#1547 ) * Move ops_triton to runtime and remove errors from deprecated code * Remove deprecated AST Kernel * Remove deprecated buffer * Add TritonProgram * Triton Buffer * Use RawCUDABuffer * triton_compile * Added new parameter * pass _buf to program * remove deprecated include * Added triton tests * Deprecated includes removed * remove double print * Disable float4 support * Disable float4 support * variable load fix * Track local size * Add pycuda to triton dependencies * Merge test.yml * install cuda packages for testing * merge double package install * remove emulated from triton tests * upscale local index to power of 2 and add masking * cuda envs * Add TernaryOps * ConstOp loading * proper function name * remove deprecated variables * get global program from name * const ops match local shape * Enable test_nn * remove deprecated import * fix linter error * Add wait logic * Add local size override * accumulate local shapes instead of using max shape * Merge triton tests into global tests * fix envs in testing * Old testing routine * split file into renderer and program * remove print and starting whitespace * pretty ptx print on debug 5 * linter errors * ignore triton saturation tests * ignore test example * remove pytorch cpu extra index * Add triton to existing testing routine * use triton tests * disable cuda backend in triton tests * use cudacpu in tests * print used device * Print device default * Remove print * ensure we are running triton backend * update variable signatures * update dtypes for load * infinity render fixed * limit global size * negative infinity now properly rendered * split chain with parentheses for and node * Add option to disable shared memory, disable for triton * missing import * Properly index and mask conditional load * use mask only if not loading a block pointer * nan support * fix symbolic tests to include chain split * proper masking for stores * Implemented bool dtype * Add 
mod * fix loads for variables with valid range * merge triton with cuda runtime * merge from master * run triton tests with cuda * Correct target when running from triton * conftest with triton compiler config * use triton nightly * verbose tests for triton * capture stdout * fix function depth when exiting multiple loops * add render valid function for readabilty * fix mask for local loops * add _arg_int32 datatype * fix dims for conditional loads * enable non float stores * correct variable dtypes * fix type for arg_int32 * remove junk * Added get max function for range based var.max * remove deprecated code * Fix triton ptxas path * Fix testing for CI * clamp local size by max local size instead of always running max * Disable matmul test in triton cpu * rerun tests * Disable broken test in triton cpu * whitespace removed * rerun tests again * Disable TestSymbolicOps for triton * update to new uops * linter fix * ignore test/extra * linting fix * Update tinygrad/renderer/triton.py Co-authored-by: Gijs Koning <gijs-koning@live.nl> * remove deprecated line * quotes type fix * linter * Remove unnecesary lines * UnaryOps.NEG * dont define constants * Linting fix * Disable tests that are broken in ocelot * remove trailing whitespace * reduce line count * linting fix * update to new uast * New looping style * Update to new uast * make AST runner work with triton * linting fix * set renderer var for testing * disable local for ocelot * reenable all tests for ocelot * Pass shared to cuda * Don't group if the backend doesn't support shared mem * use working gpuocelot branch * enable all tests * enable local for ocelot * cleanup * Update test.yml * update cache key * reenable test symbolic and extra * Update test.yml * Revert "Update test.yml" (rerun tests) This reverts commit 98c0630ee5da4379e5c6b2437a5145fe87058c35. * Revert "fix symbolic tests to include chain split" This reverts commit 22a9a4c9cd14d23735e6540c8d90ee005ac4ea17. 
* Revert "split chain with parentheses for and node" This reverts commit 7499a7004ef4db785d0cd05cf292fdeff65ca90d. * use global size from linearizer * rename newvar to dtype to match other renderers * join program start lines * simplify code that adds axis to local dims * assign r[u] in ssa * We no longer need to replace target in src * we no longer need to cast indices to int by hand * Update triton.py(rerun tests) * Update triton.py(rerun tests) * Update triton.py(rerun tests) --------- Co-authored-by: Gijs Koning <gijs-koning@live.nl> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-09-23 14:17:12 +08:00
qazal	d0e752003d	fixes (#1893 )	2023-09-22 07:20:27 +08:00
wozeparrot	009a99a0b1	feat: way cleaner hip wrapper (#1895 )	2023-09-22 07:20:03 +08:00
kormann	864746d6aa	polish print_tree (#1868 ) * fix * isinstance	2023-09-21 11:13:10 +08:00
chenyu	3ec301c2d7	apply view.py patch (#1844 )	2023-09-10 17:32:15 -07:00
kormann	7ac65a93b4	utils.printtree (#1816 ) * utils.printtree * linter compliance * rename to print_tree	2023-09-07 23:08:57 -07:00
George Hotz	4613c9e77c	add tvm example, formatting (#1813 ) * add tvm example * no realize	2023-09-07 11:50:41 -07:00
Pavol Rusnak	52a92bf95d	use class Foo: instead of class Foo(): (#1797 ) * use class Foo: instead of class Foo(): * add ruff linter, copy settings from .flake8 to ruff.toml	2023-09-06 12:20:25 -07:00
geohotstan	9af5645ba3	onnx full passing (#1076 ) * 1 * 83 failed * learning how git works * lol idk * zero shape aaaa * space lol * aaa * test check * haha * fixed gather * 73 failing * 71 failing * 68 failing * added some debug * fking resize * lol * 62 failing * 58 failling fucking did nearest resize hell yeah * clean up * 56 failing * janitor duty * lol * 53 failing * hi mom * 50 failing * added linear interp, but coord_trans is wrong * did lin interpolation woohoo * 43 failing * 40 failing * temporary Gather fix * 39 failing * fixed slice onnxver<10 * 37 failing * 35 failing * excluded tests that use float64 * 32 failing with hacks * added _batchnorm() for 3D 5D batchnorm, 29 failing * changed ALLOWED_KERNEL_COUNT from 199 to 207 * added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit * support Round op * added storage_order/indices maxpool, 27 failing * support maxunpool, 25 failures * support Gradient, 23 failures * merged new where * added Adam * cleanups * added Momentum and Nesterov Momentum * added Adagrad * support sequence_type, 20 failing * ugh git * I give up on cubic interp :D, 9 failing * sexy 1 liner gather, much improved, wow * polished gather to make it shine bright like a diamond * clean 1 liner for gather * improved readability of gather * uhh * clean up * more clean up * WHITEspace * implemented SoftmaxCrossEntropyLoss op * added comments and cleaned up if statements * update * thank based wozeparrot for pow and new GatherElements * CPU and TORCH all pass \| cast float64 -> float32 for all fromCPU() * _nearest_gather() failing on yolo * reverted ops_cpu change and added assert in Resize * added comments for resize for multiple channels * oops * merge * test * switched np.pad to Tensor.pad for constant padding * gah * gah2 * sexy reflect pad with movementops -> add * delete commented out lines * edge mode pad sexy as well * trying out model_benchmark * revert gitignore change lol * init * Revert "init" This reverts commit 
682bf2073a8b4eca111596c67cf6ebd79f59e585. * wrote cast workaround for CPU, CPU and TORCH all pass * wrote cast workaround for CPU, CPU and TORCH all pass * skipped tests w/ 0 shape for METAL and GPU * excluded tests for CLANG, CPU, TORCH, CLANG pass * fixed hacky ConvTranspose * gotta figure out autopad * UOps.STORE support cast bool -> float * small fix for fast gather * reverted 0 shape skipped tests * oops missed a file * added comment * fixed slice op hack * First commit to pr * More trig ops * More trig ops * format * isinf support * More ops * changed onnx_ops to use our new gather :D * Det op bug fix * rebase * fixed some tests * det broken and slow * fixed compress to use new gather * implemented argmax argmin * support variable types in type_proto * support Upsample and Identity sequence * we support float64 now and tinygrad support automatic broadcasting * added EyeLike op * resize does support multiple channels now actually * yolov8 onnx runs successfully * added batch size 1 * oops * finally fixed type_proto I think * fixed some llvm bugs * del whitespaces * added ZenginU Format PR * test * oops * added float64 exclude tests back * more skipped tests * try * ok openpilot pass * flake8 pass * woooooohooo * revert external_model_benchmark changes * perf tested gather * removed promote types from ops_cpu * numerical errors from 1681 is fixed --------- Co-authored-by: ZenginU <umutzengin00@gmail.com>	2023-09-05 13:23:32 -07:00
George Hotz	56abe04e4b	disable assembly (#1755 )	2023-09-04 09:41:20 -07:00
wozeparrot	bf05534c6e	hip multidevice (#1728 ) * feat: hip multidevice support + p2p * feat: default device	2023-09-01 06:46:13 -07:00
Karan Handa	a8aa13dc91	[ready] Replacing os with pathlib (#1708 ) * replace os.path with pathlib * safe convert dirnames to pathlib * replace all os.path.join * fix cuda error * change main chunk * Reviewer fixes * fix vgg * Fixed everything * Final fixes * ensure consistency * Change all parent.parent... to parents	2023-08-30 10:41:08 -07:00
nimlgen	1c0449e190	add cache collector (#1595 ) * init cache collector * add test_cache_collector.py * switch GlobalCounters.cache to CacheCollector * init jit models test * jitted SD * add debug msg to print loaded bufs count * moved cache collctor to jit * clearer SD * no double device import	2023-08-28 19:59:55 -07:00
George Hotz	a6d842af7a	move device to ops (#1646 ) * move device to ops * mlops types * 2 lines	2023-08-23 08:30:17 -07:00
George Hotz	718ced296c	move state to nn/state (#1619 )	2023-08-22 07:36:24 -07:00
Umut Zengin	f720682beb	np.argmax to Tensor.argmax (#1608 ) * to tensor argmax * removed keepdim * training update	2023-08-21 15:22:29 -07:00
Yixiang Gao	4d54afb6df	sparse cat cross entropy (#1597 ) * add sparse cat cross entropy * minor fix * add log_softmax into loss function * add test * update docs * fix training loss * add device	2023-08-21 14:14:54 -07:00
George Hotz	2e60920317	Revert "sparse cat cross entropy (#1591 )" (#1596 ) This reverts commit `f0ee850e98`.	2023-08-21 10:04:26 -07:00
Yixiang Gao	f0ee850e98	sparse cat cross entropy (#1591 ) * add sparse cat cross entropy * minor fix * add log_softmax into loss function * add test * update docs	2023-08-21 09:56:41 -07:00
Yixiang Gao	8d6662a741	.cpu().numpy() -> .numpy() (#1594 ) * .cpu().numpy() -> .numpy() * restore ops_torch * restore test_speed_v_torch	2023-08-21 09:53:29 -07:00
George Hotz	e464442adf	WMMA for 7900XTX (#1563 ) * go * hip no LRU * work * works * 16 TFLOPS * 29 TFLOPS * 30 TFLOPS * never mind, it's 60 TFLOPS * fix metal WMMA * put hip alloc back	2023-08-19 09:07:23 -07:00
chenyu	ae39cf84ab	Symbolic Shape JIT main PR (#1353 ) * Symbolic Shape JIT update tests 2 variables symbolic ops, adding more tests test passing cleanup * more test cases * single flag * review update * jit attention one piece * realize * symbolic_jit test for cuda * old artifact * works with cuda gpu but failed ci * CUDACPU	2023-08-18 14:39:55 -07:00
wozeparrot	50decf0d45	train cifar using multigpu (#1529 ) * feat: train cifar using multigpu * feat: split eval batch across 5 * feat: cleaner allreduce * feat: 93.88% * feat: cleaner batch chunking from bert * feat: cleaner grad sync * feat: tinygrad argmax * feat: make it work with different gpu counts * feat: move some stuff into the normal __init__ * feat: autodetect gpu count * feat: move import inside	2023-08-18 09:35:44 -07:00
wozeparrot	15150d60c4	fix: small fix for lru on hip (#1567 )	2023-08-18 09:18:38 -07:00
Ethan Sorrell	cb62911f6b	PTX Reintegration and Passing Tests (#1512 ) * move assembly, assembly_ptx * successful but broken rendering of ptx asm * clear ins before render asm * slightly less broken :') * we needed thread syncs * fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half * Fix runtime_args for gpuocelot * our casts were flipped on both ends * more casting * add ternary where op * dealing with storing/loading bool * add test for casting to bool from negative * Fix args.valid on ConstOp * add to CI, TODO: fix runtime_args for test_uops * fix placement of runtime_args to work with lazy.Device * undo ci changes so I can push * fix lints * start cleanup and fix things we broke fixing lints * add checks for PTX specifc asm instructions * revert added test -- doesn't pass on llvm * skip tests for underflow,overflow * another fix for how we're setting runtime args * Less broken cleanup * add to CI * add more env variables for ci test * fix ci to install pycuda for ptx * ci: copy cuda test command * cleanup * assert to make sure we're actually running ptx in ci * remove test assert * move is_ptx arg * move assembly, assembly_ptx back to extras * fix imports * initial merge fixes * clear registers, fix UOps.LOAD with invalid value * draft merge fixes * remove prints * quick lint and merge fixes * cleanup * remove PTXProgram wrapper * final cleanup * temp change for ci rerun * ci rerun * rollback ISA version	2023-08-16 16:20:20 -07:00
JaSpa99	491e85597a	Run onnx commavq model (#1537 ) * try to run commavq * fix 0 dim, start implementing new ops - Implement EmbedLayerNormalization - Implement Attention * SkipLayerNormalization and FastGelu * use original torch model, cast inputs * fix some ops: - properly do Cast - Attention: bi- and unidirectional - FastGelu: add bias before gelu * cleanup onnx_ops.py * add validation option to benchmark * cleanup imports * add checks incase onnx2torch implements ops in future * run onnx instead of original torch * just skip gpu on m1 * reactivate the other models * check for strange params & squash whitespace * cleanup * fix causal mask Attention * Range doesn't need int cast * embedding vocab_counter same dtype as input * no need to cast * always validate, fix PosixPath ort --------- Co-authored-by: George Hotz <george@comma.ai>	2023-08-16 12:24:40 -07:00
George Hotz	f8109b830c	promote assembly to the main codebase (#1544 ) * promote assembly to the main codebase * not namedtuple	2023-08-14 22:47:45 -07:00
Steven Anderson	93a36c3659	Arm (#1421 ) * testing new memops * better debugging * testing padded conv * branching with load * refactoring a bit * first try * fixing bugs * fixing some * eq * eq2 * do not use x's * working * fixing imm * getting things working * refactor * pow not working * working except one * refactor: one store mem * refactor: global load * refactor: imm * refactor: cleaning * fixing big offsets * refactor with ci * try ci * typo * another typo * ubuntu default * forgot git * do i need git? * missing packages * adding python-dev * with cache? * buildx action * buildx name issue? * maybe now? * python3 * newline warning * maybe now * i actually need this * ci should work now * improved caching * fixing cache * maybe now it will cache * this * testing cache * trying again * load * missing platform * caching gha * testing cache * full testing * typo * now? * why * adding checkout back * bad formatting * fixing convention issues * supporting python * adding CI flag * testing all * better comments * adding debugging * takes 12x longer * does it output progress now? * ignore models for speed * fixing merge * excluding conv_transpose2d * only 2 test cuz is to slow * another approach * let's see * faster duh * my bad * T_T * typo * sup * with output? * comment test * comment test * comment test * :? * no comment * with cache * back to normal * testing that ci works * back to passing * trying again * does it create another entry * does it create another entry? * build local * hey * Revert "excluding conv_transpose2d" This reverts commit cc7348de03033e032f47d69caff174e2f1a7bfea. * does it cache if done before? * does it cache? 
* done * adding test ops * bad formatting * no need for this * working static mem * sum 1d * add ndim * better reg import * fix stack * back to np * working except for softmax * 5 failing * no pogress * remove keystone * remove keystone * testops passing * cleanups * more cleanup * typo * ci * ci2 * cond import * ci3 * ci4 * ci4 * ci5 * ci5 * ci6 * aligment * test all * correct test * err read_unmapped * passing test * ignore for speed * ignore for speed * ci7 * cleanup * remove docker * fixing merge * fixing bugs * add skipload for const ops * comments * First merge to master: Renderer * fix emulation * passing all tests arm64 * cleaning * fix handcoded binary * cleaning * fix errs * fix runtime arg binary * clean git diff * fix and clean * fixing metal test * cleaning * fix metal test * ci ~8 min * fix pylint and clang * cache the files in ops_clang --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-08-14 19:29:30 -07:00
Szymon Ożóg	330fb7b1a3	Print more meaningfull hip error messages (#1530 )	2023-08-12 07:16:20 -07:00
wozeparrot	29d5801387	distributed collectives (#1519 ) * feat: world * feat: tests * feat: no more backwards * feat: recv into * feat: whoops * feat: test in ci * feat: some debug logging * feat: workflow naming * feat: need to set pythonpath * feat: just send to same device * feat: allreduce * feat: test * feat: need contiguous * feat: test in ci * feat: exit with correct code * feat: don't need that * feat: opencl wait_for just doesn't work * feat: synchronize on out * feat: try? * feat: try again? * feat: add extra realizes * feat: print * feat: seed * feat: tol * feat: test ones and zeros * feat: remove print * feat: are you just flaky * feat: seperate scatter and gather? * feat: just try synchronizing * feat: remove print again * feat: bring back difference * feat: no sync * feat: revert that * feat: back to wait_for * fix: typo	2023-08-11 10:22:07 -07:00
wozeparrot	7e7c9001e9	distributed world (#1481 ) * feat: world * feat: tests * feat: no more backwards * feat: recv into * feat: whoops * feat: test in ci * feat: some debug logging * feat: workflow naming * feat: need to set pythonpath * feat: just send to same device	2023-08-10 10:00:51 -07:00
George Hotz	c417cd3c97	fast HIP gemm -> 100 TFLOPS (#1476 ) * fast HIP gemm * wmma * correct b * fix spilling * 60 TFLOPS * 64 TFLOPS * 65 TFLOPS	2023-08-09 06:54:15 -07:00
Yixiang Gao	6480a1a180	CIFAR 94.03% (#1340 ) * add disk_tensor * fix jit * new baseline before whitening * whitening through torch * whiting done currently at 91.65% * 91.99% * clean up mixup and 92.3% * clean up 92.30% * 92.49% before searching for new hyper-parameters * fix CI * fix white space * add whitening init in test * refactor, update hyperpara, 92.72% * converting whiting to tinygrad operation * update CI kernels count for CIFAR * add pad reflect * add random crop 92.53% * update hyperpara 93% * 93.15% on docker container, need to refactor the assignment for hyper param * print out weights and bias to be separated * bias/non-bias params separated * fix whitespace * clean up * refactor hyper-param with dict * refactor lr schedular params * fix whitespace * fix cross entropy loss * fix whitespace * move opt hyp to hyp dict * minor fixup * adjust model, loss scaling * 92.74% while using half of compute as before * update hyp for cutmix * random shuffle during batches * clean up * updating the model * update ConvGroup * disable gradients for batchnorm layer weights * whitespace * 93.92% * clean up * finally 94%git add .! * rewrite whitening to remove dependency on torch * whitespace * remove dependency on torch, 93.91% * back to 94.03% * clean up * update test_real_world	2023-08-08 15:13:24 -07:00
George Hotz	d24f936501	just cmplt (#1493 ) * just cmplt * fix maximum * don't save, there's no backward * ugh, no slot either * eq is a scam	2023-08-08 13:58:10 -07:00
Roelof van Dijk	0ce7511110	fix: is not use with a literal (#1487 ) Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>	2023-08-08 07:35:30 -07:00
Diogo	4dc8595069	simple exporting models (#1344 ) * unified exporting * json exporting * ignore more * simplified buffer export * added dtypes * added assert * swift example * fix tests * linter * remove whitespace * fixed tests * remove swift example * remove unintended changes * allow callable models to be used * whitespace * more readable json export * name change * whitespace * whitespace	2023-08-01 09:35:48 -07:00
David Hou	3300d0aeaf	syncthreads before wmma (#1389 ) (venv) chaos@tiny3:~/tinygrad$ KX=2 KY=2 N=2048 python extra/gemm/hip_matmul.py 4194304 289.60 us, would be 59322.55 GFLOPS matmul, 173.80 GB/s	2023-07-31 17:05:49 -07:00
George Hotz	37fa7e96fb	Revert "update editorconfig, enforce via CI (#1343 )" (#1380 ) This reverts commit `da2efecbe2`.	2023-07-31 10:35:50 -07:00
Pavol Rusnak	da2efecbe2	update editorconfig, enforce via CI (#1343 ) * update editorconfig to set unix-style newlines and trim whitespace * add editorconfig github action to the CI * fix whitespace	2023-07-30 18:44:30 -07:00
Cole Sutyak	2d4e182294	change fetch to allow for local file selection (#1309 )	2023-07-23 15:00:16 -04:00
Jacob Pradels	b112edd2c3	Add pylint trailing whitespace rule (#1314 )	2023-07-21 13:37:55 -04:00
madt2709	d2c1e8409a	Update arange to be (start, stop, step) (#1308 )	2023-07-21 00:27:23 -04:00

775 Commits