Commit Graph

2686 Commits

Author SHA1 Message Date
qazal e7f6b654ad
cleanup uop eq asserts for swizzle [run_process_replay] (#6362)
* cleanup uop eq asserts for swizzle [run_process_replay]

* more stuff
2024-09-05 13:36:36 +08:00
Oleg Rybalko 64f1384f5b
Einsum ellipsis support (#6333)
* working ellipsis expansion

* refactor

* fix commas in output

* add capital letters

* refactor
2024-09-05 10:08:55 +08:00
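The ellipsis expansion added in this commit follows numpy's einsum semantics, where `...` stands for any number of leading batch dimensions. A minimal numpy illustration of that behavior (not tinygrad code):

```python
import numpy as np

# '...' matches the leading batch dims (here 2 and 3), so the same
# subscript string works for any number of batch dimensions.
a = np.ones((2, 3, 4, 5))
b = np.ones((5, 6))
out = np.einsum('...ij,jk->...ik', a, b)
print(out.shape)  # (2, 3, 4, 6)
```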
nimlgen 326a77336e
qcom remove some tests skips (#6353) 2024-09-04 15:38:18 +03:00
qazal 99018a4aa1
minor schedule differ utils [run_process_replay] (#6348)
* minor schedule differ utils [run_process_replay]

* rm
2024-09-04 03:41:38 +08:00
nimlgen 3adb76894d
validate image=2 float16=1 openpilot benchmark (#6346)
* validate image=2 float16=1 openpilot

* linter

* linter2
2024-09-03 20:13:40 +03:00
qazal 2f00bf0c78
conv bw in one kernel with graph_rewrite (#6330)
* double reduce merger

* add test_fold_conv_relu_backward_ast_rewrite

* a correctness test to iterate on

* merge axes the other way around

* better
2024-09-03 03:53:53 +08:00
Vyacheslav Pachkov 4c33192a8b
add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl, also fixup ioctls to show numbers instead of _IOW macros

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants, input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts shader from open cl binary
sets input/output buffers
allocates stack
sets cs mode
runs shader

* use data64_le from helpers

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway

* extract offsets to kernel arguments from opencl binary

* extract constants values and offsets from opencl binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from open cl binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps to not fall down when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TODO: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id.
Will be useful for selecting commands e.g. CP_EVENT_WRITE vs
CP_EVENT_WRITE7

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fails with GPU=1 on qualcomm

* lint

* unmap mem in _gpu_free

* ctxt priority and preemption policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dont forget actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
George Hotz 406ec8240e hotfix: lin_fail_41 passes on my M3 Max 2024-08-31 11:46:46 -07:00
Roelof van Dijk ad4b3b457f
bump limit for test_llama_embedding_opt (#6332) 2024-08-31 10:03:43 -04:00
George Hotz 72939901fc hotfix: ebs print kernel names 2024-08-29 21:20:36 -07:00
George Hotz 365babe391
precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00
George Hotz 385904526f
remove more rules [run_process_replay] (#6326)
* remove more rules [run_process_replay]

* disable invalid test

* ptx needs that str
2024-08-29 16:27:10 -07:00
qazal 539654fbe1
graph_rewrite complexity tests [run_process_replay] (#6317) 2024-08-29 22:39:08 +03:00
qazal 07942ef361
Proposal: Better UOps.SWIZZLE (#6309)
* better UOps.SWIZZLE

* test_swizzle_rewrite

* add it to docs

* show a diff

* a lil more verbose

* two teeny notes

* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal dd4e5f1c8d
process replay rewrite (#6284)
* process replay rewrite

p2

* start some unittests + exceptions and exits

* shebang

* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro 7de4eac8f7
add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308)
* add `nearest` mode to interpolate

matching pytorch `nearest`, which is known to be buggy

+ relevant TestsOps

* add `nearest-exact` mode to interpolate

matching pytorch `nearest-exact`

+ relevant TestOps

* fix uint8 bilinear interpolation

by matching custom torch implementation

* implement uint8 lerp with torch interpolation trick

without converting it to float
2024-08-28 21:59:51 -07:00
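The "uint8 lerp without converting to float" mentioned above is typically done with fixed-point arithmetic. A sketch of the general idea under that assumption (hypothetical helper, not the PR's actual code), using 8 fractional bits:

```python
def lerp_u8(a: int, b: int, w: int) -> int:
    # w is a fixed-point weight in [0, 256], where 256 means "all b".
    # The +128 term rounds to nearest instead of truncating.
    return a + (((b - a) * w + 128) >> 8)

print(lerp_u8(0, 255, 128))  # 128, the midpoint
```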
qazal ec34d9ee36
start benchmarking ast graph rewrite (#6297)
* ast_rewrite to ctx var

* add external_benchmark_ast

* refactor to asts

* track lazybuffers

* more work

* record checkpoint

* cleanup
2024-08-27 18:18:44 +03:00
Max-We ab2714423b
Add einsum tests (#6286)
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00
chenyu b76f0c875e
lazy const fold idiv 1 (#6285) 2024-08-26 10:29:59 -04:00
chenyu af7c04ff57
Tensor.__floordiv__ (#6283)
support Tensor.__floordiv__ and friends
2024-08-26 09:43:40 -04:00
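`__floordiv__` follows Python's floor-division semantics (round toward negative infinity), which differs from C-style truncation when operands are negative:

```python
# Floor division rounds toward negative infinity; truncation rounds toward zero.
print(7 // 2)       # 3
print(-7 // 2)      # -4
print(int(-7 / 2))  # -3 (C-style truncation)
```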
qazal d2f8eeed2e
make [compare_schedule] the default [run_process_replay] (#6273)
* make [compare_schedule] the default

* capture ctx

* logging

* set capture to false
2024-08-26 21:40:03 +08:00
CaltropHungerton 002f60b4c3
fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192)
* fix wmma flop counting on intel, add count tests

* half

* add half gemm

* Update test.yml

* one test

* Update test_uops_stats.py

* Update test_uops_stats.py

* Update test_uops_stats.py

* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
qazal f0cc8ca5f2
generic st_fixup in scheduler graph rewrite [compare_schedule] (#6278) 2024-08-25 11:02:17 +03:00
gswangg 3cf507ae7f
remove extra.ops and LazyOp support from Kernel (#6267)
* remove extra.ops and BufferOps

* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal ccb05d8baa
fixup neg tests [run_process_replay] (#6268) 2024-08-24 16:35:43 +03:00
gswangg ea76b93814
migrate test_linearizer_dumb.py to UOp AST (#6241)
* add imports and update test_unmerged_ifs to UOp AST

* test_max_simplify_and_cancel

* test_expander_new_srcs

* test_llama_embedding

* test_unaligns_idxs

* test_unrolled_float4_align

* test_upcasted_stores_out_of_order

* remove LazyOp

* remove extra/ops and replace ReduceOps.SUM with BinaryOps.ADD
2024-08-24 16:27:29 +03:00
gswangg e44653e25a
migrate test_linearizer_failures.py to UOp AST (#6240)
* add imports and update test_failure_1 to UOp AST

* update test_failure_2 with UOp AST

* update test_failure_3

* test_failure_5

* test_failure_6

* test_failure_7

* test_failure_8

* test_failure_9

* test_failure_10

* test_failure_11

* test_failure_12

* test_failure_12_multireduce

* uncomment skip and migrate test_failure_13

* test_failure_14

* test_failure_15

* test_failure_16

* test_failure_17

* test_failure_18

* test_failure_19

* test_failure_20

* test_failure_21

* test_failure_22

* test_failure_23

* test_failure_24

* test_failure_25

* test_failure_26

* test_failure_27

* test_failure_28

* test_failure_29

* test_failure_30

* test_failure_31

* test_failure_32

* test_failure_33

* test_failure_34

* test_failure_36

* test_failure_37

* test_failure_38

* test_update_39

* test_failure_40

* test_failure_41

* test_failure_42

* test_failure_43

* test_failure_44

* test_failure_45

* test_failure_46

* test_failure_47

* test_failure_48

* test_failure_49

* test_failure_50

* remove LazyOp

* reskip test_failure_22

* remove extra/ops

* replace ReduceOps with BinaryOps

* fixup that import

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-24 16:26:58 +03:00
gswangg 1dc6040877
migrate test_search.py to UOp AST (#6245)
* add imports and update test_kernel_count with UOp AST

* test_filter_global_buffer

* remove LazyOp

* remove extra.ops and ReduceOps

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-24 16:13:53 +03:00
qazal ae23540d6e
refresh process replay schedule ref in reset.py (#6265) 2024-08-24 16:12:51 +03:00
gswangg 7be5eede71
migrate test_linearizer_overflows.py to UOp AST (#6244)
* add imports, remove ConstBuffer, and update test_overflow_1 with UOp AST

* test_overflow_2

* test_overflow_3

* test_overflow_4

* test_overflow_5

* test_overflow_6

* test_overflow_7

* TestLinearizerOverflowAlt::test_overflow_1

* TestLinearizerOverflowAlt::test_overflow_2

* remove LazyOp

* remove extra.ops

* remove ReduceOps
2024-08-24 16:10:29 +03:00
chenyu 943ab97d24
fix Tensor.prod for multitensor (#6264) 2024-08-24 08:52:24 -04:00
qazal bcb2f1caa3
init REDUCE_AXIS with BinaryOps (#6256)
* REDUCE_AXIS arg with BinaryOps

* more work in kernel.py
fixup sops.gz

* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
chenyu da5cf11859
fix acc init value for MUL (#6263) 2024-08-23 23:19:44 -04:00
George Hotz 26498b322e add BEAM to external_benchmark_schedule.py 2024-08-23 18:10:46 -07:00
George Hotz 53a73038e3 hotfix: TestGraphRewriteEfficiency.test_create_many_uops 2024-08-23 15:51:57 -07:00
chenyu 590c0922b6
Tensor.prod (#6250)
* Tensor.prod

a new reduce op!

* onnx ReduceProd
2024-08-23 10:06:32 -04:00
qazal 78d6bd8b41
start graph rewrite in the scheduler (#6248)
* start graph rewrite in the scheduler

* test: enable it

* test timings

* only fails in multi reduce

* more isolated tests
2024-08-23 13:15:55 +03:00
George Hotz 238896ca02
loooking into graph rewrite speed (#6239)
* loooking into graph rewrite speed

* track, replace is slow

* if all same, no permutations [run_process_replay]

* types so compile works

* no implied comprehension

* TRACK_MATCH_STATS=2
2024-08-22 13:17:55 -07:00
chenyu e745e16441
remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
nimlgen 6c4ddd6260
hcq skip tests when no multidev (#6235)
* hcq skip tests when no multidev

* linter

* a bit higher timeout
2024-08-22 18:27:16 +03:00
chenyu 08539f08b0
fix UOp repr with Variable in arg (#6236) 2024-08-22 11:06:33 -04:00
chenyu 3fc8203475
remove NEG from handwritten ast in tests (#6234)
* remove NEG from handwritten ast in tests

* test_linearizer_failures
2024-08-22 09:06:59 -04:00
chenyu 1c5ef5b793
format test_linearizer_failure (#6231)
made it easier to remove NEG
2024-08-21 21:10:56 -04:00
nimlgen 78c94abe9c
raise time limit for ci in test_profile_multidev_transfer (#6227) 2024-08-21 22:42:03 +03:00
gswangg c74b318458
migrate test_linearizer.py to UOp AST, pt. 2 (#6228) 2024-08-21 22:16:11 +03:00
George Hotz c3168952f0
wip: tracking pattern matcher [run_process_replay] (#6225)
* wip: tracking pattern matcher

* better

* proper dedup

* timing

* early reject

* mergable match stats

* TrackedPattenMatcher

* fix TrackedPattenMatcher

* cleanups

* clean that too

* remove early_reject

* Revert "remove early_reject"

This reverts commit dc2aef14b8f5da58f5ec9566daf252513cac394c.

* total

* sort by time

* match_stats cleanup
2024-08-21 11:57:26 -07:00
chenyu a666450e4d
UOp pattern x + x -> x * 2 (#6224)
* UOp pattern x + x -> x * 2

now there's no NEG, with this it covers all kinds of a*x+b*x

* can remove x-x
2024-08-21 12:06:19 -04:00
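A pattern like `x + x -> x * 2` is a local rewrite on expression trees. A toy, self-contained sketch of the idea (hypothetical `Node`/`rewrite` names, not tinygrad's PatternMatcher):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op: str            # 'add', 'mul', 'var', or 'const'
    args: tuple = ()
    val: object = None

def rewrite(n: Node) -> Node:
    # Rewrite bottom-up so children are simplified before their parent.
    if n.op in ('add', 'mul'):
        a, b = (rewrite(x) for x in n.args)
        # x + x -> x * 2; with NEG gone this covers the a*x + b*x family
        if n.op == 'add' and a == b:
            return Node('mul', (a, Node('const', val=2)))
        return Node(n.op, (a, b))
    return n

x = Node('var', val='x')
print(rewrite(Node('add', (x, x))))  # x * 2
```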
chenyu c9a9631818
no UnaryOps.NEG in generated UOp patterns (#6209)
* no UnaryOps.NEG in generated UOp patterns

removed pattern `x * (-1) -> -x`  and `x != True`

* those are fine because NEG became CMPNE and True

* fix sd validation L2 norm
2024-08-21 11:08:22 -04:00
qazal 3b8cc5a3e0
more multireduce tests prep for neg removal [run_process_replay] (#6220) 2024-08-21 12:45:24 +03:00
qazal f03e5a4b3b
test_multireduce const has a shape (#6218) 2024-08-21 11:02:45 +03:00
George Hotz 2c42e9c2c6
faster rewrite, no folder in expand/reduce [run_process_replay] (#6216)
* faster rewrite, no folder in expand/reduce [run_process_replay]

* is removing the expander there okay

* parens

* don't reconstruct exact match uop

* fast do_reduce

* expand pyint

* most of the parents gains with less lines
2024-08-20 23:36:58 -07:00
George Hotz 16f420f7a7
split full_graph_rewrite and linearize_uop [run_process_replay] (#6215)
* split full_graph_rewrite and linearize_uop

* fix tests

* graph rewrite in test uops

* add types
2024-08-20 20:12:33 -07:00
George Hotz 9faf205601
CIFAR trainer + various bugfixes / improvements (#6146)
* move cifar into datasets

* support for pathlib Tensors, tar_extract, and fetch gunzip

* too early for Device.DEFAULT

* simpler hlb_cifar + .to(None) is default

* new compiler failure, start beautiful_cifar

* beautiful cifar runs but is broken

* jit train step

* cleaner

* std_mean, not mean_std

* more correct

* fast indexing

* don't print that

* torch load broken

* add eval

* nicer bar

* decoraters are the way to do this

* bounds check the load

* a few ops

* batchnorm bugfix, if track_running_stats is False, use online estimate

* full timing

* fix fusion

* unneeded realize

* master tensor
2024-08-20 16:58:46 -07:00
madt2709 4bb98d8882
Fix track_running_stats in batchnorm (#6200)
* Fix track_running_stats in batchnorm

* Fix linter

* Update test_fold_conv_batchnorm_notrain to keep allowed at 1

* Add test_fold_conv_batchnorm_notrain_no_running_stats

* Save 1 line
2024-08-20 14:01:22 -07:00
George Hotz a5d79688db
fix indexing out of bounds (#6208)
* fix indexing out of bounds

* 5 ops per access is fine
2024-08-20 11:34:56 -07:00
chenyu 4451bcaf95
update test_arange test_llama_embedding_opt (#6207)
non CI uses larger embedding, still same orders of magnitude
2024-08-20 13:58:43 -04:00
qazal 074cf780dd
add option to only benchmark schedule [run_process_replay] (#6204) 2024-08-20 16:51:27 +03:00
gswangg 0e6f057eae
migrate test_linearizer.py to UOP AST (pt. 1) (#6150)
* migrate test_multioutput to UOP AST

* inline buf declarations

* migrate test_multireduce to UOp AST

* update test_mid_dim_multireduce to UOp AST

* update test_triple_multireduce with UOp AST

* make global definitions more concise

* update test_double_reduce_multireduce with UOp AST

* update test_multireduce_with_parallel with UOp AST

* update test_multiout_multireduce to UOp AST

* make gidx style consistent across updated tests

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-20 10:02:20 +03:00
chenyu 10330a41c7
add CMPNE tests in test_uops (#6196)
fixed the output_dtype for CMPNE and match the tests for CMPLT
2024-08-19 19:41:21 -04:00
chenyu 21d6739237
remove UnaryOps.NEG from lazy.py (#6193)
* remove UnaryOps.NEG from lazy.py

* neg is no longer unary
2024-08-19 18:41:28 -04:00
Gabe Caldwell bdd6325f31
default num_classes value for one_hot (#6182)
* num_classes=-1

If num_classes is set to -1, the number of classes is inferred as one greater than the largest class value in the input tensor.

* num_classes desc

comment to explain num_classes default and what that means.

* replacing ' with `
2024-08-19 12:07:14 -07:00
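The inference rule described above (number of classes = largest class value + 1) can be sketched in plain Python; the helper below is illustrative, not tinygrad's implementation:

```python
def one_hot(indices, num_classes=-1):
    # With num_classes=-1, infer it as one greater than the largest index.
    if num_classes == -1:
        num_classes = max(indices) + 1
    return [[1 if i == c else 0 for c in range(num_classes)] for i in indices]

print(one_hot([0, 2, 1]))  # 3 classes inferred -> [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```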
Alessandro Benetti 9328248610
support for std_mean and cross_entropy (#6181)
* support for std_mean and cross_entropy (#3)

* Cross entropy and std mean support

* remove extra examples
2024-08-19 12:06:44 -07:00
Max-We 53b20afa3f
Write tar_extract (#6180)
* Add tar_extract

* Add tar_extract tests

* Fix dtype for initialization from path

* Tests for path initialization

* rm print

---------

Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-19 12:06:17 -07:00
Eitan Turok 8556d0c642
Support `gunzip` in `fetch` (#6176)
* init

* update

* clean

* add type

* clean

* fix import order

* shorten variable names
2024-08-19 12:04:40 -07:00
samm393 5d742f7fe3
Missing features from rearrange (#6184)
* fixes and tests

* typo in test
2024-08-19 11:19:07 -07:00
qazal 2242ff84be
type verify intermediate UOps [run_process_replay] (#6140)
* type verify intermediate UOps [run_process_replay]

* merge asserts

* variable const
2024-08-19 20:59:01 +03:00
qazal 478145cb8e
lowering error in diff_schedule is fine [run_process_replay] (#6185) 2024-08-19 20:51:12 +03:00
chenyu 00578a021b
re:6125 switch real_size to use uops [run_process_replay] (#6138)
* switch real_size to use uops [run_process_replay]

* enough to pass

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2024-08-19 13:20:24 -04:00
qazal e28d29641f
more scheduler process replay tooling [run_process_replay] (#6178) 2024-08-19 15:35:51 +03:00
chenyu b36a7273c6
RUF018 assignment-in-assert [run_process_replay] (#6172)
assertions should not have side effects, or running with `-O` breaks them.

initially just wanted to fix the one in rearrange, but it also made some long lines less long
2024-08-19 00:34:52 -04:00
chenyu 9c60a27ece
lower float64 sin fuzzer threshold (#6173)
139216373.71875 failed
https://github.com/tinygrad/tinygrad/actions/runs/10446960642/job/28925156240
2024-08-19 00:25:42 -04:00
samm393 fd7c84c1c8
Rearrange (#6106)
* rearrange and tests

* tidy

* whitespace

* remove line

* -5 lines

* test fix

* static -> instance

* fix () & add more tests

* remove flags

* -1 line

* match einops

* whitespace

* repeated names
2024-08-18 20:22:28 -07:00
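einops-style `rearrange` strings decompose into reshape/transpose primitives. A numpy illustration of two common patterns (assumed equivalences, not tinygrad's implementation):

```python
import numpy as np

x = np.zeros((2, 3, 4, 5))       # axes: b h w c
# 'b h w c -> b (h w) c': merge the two spatial axes
merged = x.reshape(2, 3 * 4, 5)
# 'b h w c -> b c h w': pure axis permutation
moved = x.transpose(0, 3, 1, 2)
print(merged.shape, moved.shape)  # (2, 12, 5) (2, 5, 3, 4)
```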
chenyu 2de174677a
threefry touchup [run_process_replay] (#6169)
also why is test_gc testing _rng_counter is allocated??
2024-08-18 23:01:24 -04:00
David González Martínez 724e408736
add support for retain_graph in backward (#6145)
* add support for retain_graph in backward

* fix: dont accumulate grad on non-leaf tensors

* fix order

* fix: do not delete grad on leafs

* fix linter

* fix: can't exactly match torch behaviour internally

* allow numerical room for test

* refactor
2024-08-18 16:08:31 -07:00
wozeparrot 0c5189de25
threefry half (#6154) 2024-08-18 15:23:12 -07:00
Timmy e3d14d1ccc
Lowerer Multireduce Grouping (#6097)
* grouping changes to codegen

* linters + tests

* fix identical store issue on PTX

* comment in grouping multireduce tests

* cleaning up diff

* cleaning up diff

* comments

* linters

* hotfix: dont change kernels

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-18 19:57:51 +03:00
qazal 1ba83cc7fa
split test_sgd_4convs_fuse [run_process_replay] (#6158) 2024-08-18 18:35:42 +03:00
qazal be6dda4093
hotfix: more lazyop rename to uop [run_process_replay] (#6157) 2024-08-18 17:28:44 +03:00
George Hotz 17a043edad
tensor inference (#6156)
* tensor inference

* test is even better name
2024-08-18 00:19:28 -07:00
chenyu f7950fc2b6
add E275 missing-whitespace-after-keyword linting rule (#6149)
requires space after keywords like `assert`, `not`, `return`, `else`
2024-08-17 16:44:34 -04:00
George Hotz 88edc2902d
axis_is_masked with graph_rewrite [run_process_replay] (#6144) 2024-08-17 10:28:49 -07:00
qazal 5a266d5d0c
type verify ImageDType and PtrDType [run_process_replay] (#6137)
* type verify ImageDType and PtrDType [run_process_replay]

* fix tests
2024-08-17 16:37:07 +03:00
qazal d1d41130cd
use membufs in ImageDType checks [run_process_replay] (#6136)
* use membufs in ImageDType checks

* set by key [run_process_replay]
2024-08-17 16:17:46 +03:00
qazal d9ce664350
add test_verify_ast [run_process_replay] (#6134) 2024-08-17 14:14:30 +03:00
George Hotz 3a2d724cb2
extra matcher from renderer [run_process_replay] (#6130)
* extra matcher from renderer

* cache_pm [run_process_replay]
2024-08-16 23:53:11 -07:00
George Hotz 5048066e79
st_arg, never -1 [run_process_replay] (#6128) 2024-08-16 22:46:56 -07:00
George Hotz d9cb45af09
only axis is masked [run_process_replay] (#6123) 2024-08-16 21:01:17 -07:00
George Hotz 94aa5f11b5
Revert "use vmax for real_size [run_process_replay] (#6120)" (#6122)
This reverts commit a6e3211444.
2024-08-16 20:33:19 -07:00
George Hotz a6e3211444
use vmax for real_size [run_process_replay] (#6120)
* use vmax for real_size [run_process_replay]

* axis is masked
2024-08-16 20:17:23 -07:00
George Hotz 912f01ed4b
UOpGraph -> linearize_uop [run_process_replay] (#6119) 2024-08-16 19:48:39 -07:00
George Hotz 89c7989659
no shapetracker in ops [run_process_replay] (#6117) 2024-08-16 17:23:27 -07:00
George Hotz 74ee9febec
remove iter from uopgraph (#6110)
* remove iter from uopgraph

* linearize returns uops

* fix tests

* linearize in linearize

* tests fix

* touchup

* test failures
2024-08-16 15:58:29 -07:00
qazal 28c75bf2a6
merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal d5e3217076
hotfix: scheduler differ (#6115)
* hotfix: scheduler differ

* add the test back

* track keys
2024-08-16 23:34:49 +03:00
qazal c23d44c779
AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
CaltropHungerton 38fb1e14a2
Intel XMX Tensor Core Support (#5622)
* fixed xmx demo

* i think i'm invoking the DPAS but it's slow

* compiler build arg to stop register spilling, indicated where to fix flop counter

* don't mind this

* do NOT mind me

* do not mind me

* do not view

* i will add bf16 later

* in process of figuring out tc fields

* we figured out the fields!!!

* added check for cl device vendor, added separate IntelRenderer

* remove tc thread_local_aliases

* cleaning debris before draft pr

* edits for linter

* deduping and checking device extensions

* i will find more line reductions in other places

* before merge upstream

* double grf size in compiler to fix register spilling (bandaid), device checking changes

* tc python emulation

* fixed emulation

* tests for emulated intel tensor core

* TC=0, 1 working on upstream, fixed perf

* test

* debris

* check for specialized cl device when we canonicalize device

* bf16 support, tc=3 test added

* address tests

* revert half2 loads on intel tc, cleanup

* linter

* fold_expanded revert

* lint, whitespace fix

* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too

* make line shorter, no need for noqa E501

* removed device intel

* fix python emulation

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
George Hotz 553ae9ebc0
bilinear interp uint8 fails (#6103)
* new test for e2e compile failures

* fix bug

* bilinear interp uint8 fails

* better tests
2024-08-15 19:34:39 -07:00
George Hotz c850e03758
new test for e2e compile failures (#6101)
* new test for e2e compile failures

* fix bug
2024-08-15 18:56:22 -07:00
chenyu 9ef82e1f2b
UOp pattern DEFINE_VAR with min==max is also CONST (#6095)
* UOp pattern DEFINE_VAR with min==max is also CONST

* fix tests
2024-08-15 12:09:44 -04:00
qazal 4d38fec8c1
rename lazyops to parents [run_process_replay] (#6091) 2024-08-15 17:27:32 +03:00
chenyu 5accfe26a0
rewrite bool ADD to OR and MUL to AND (#6084)
* rewrite bool ADD to OR and MUL to AND

fixed running `tinyphysics.onnx`, which contains a getitem from a boolean tensor.

only can repro through BEAM_COMPARE, which i think is a different bug in test_linearizer_failure

* fold those, and fix tests

* only for bool

* move dtypes.bool
2024-08-15 10:11:57 -04:00
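For booleans the rewrite is an identity: with values restricted to {0, 1}, (saturating) addition equals OR and multiplication equals AND. A quick exhaustive check:

```python
for a in (False, True):
    for b in (False, True):
        assert bool(a + b) == (a or b)   # ADD -> OR
        assert bool(a * b) == (a and b)  # MUL -> AND
print("bool ADD/MUL match OR/AND")
```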
chenyu df03dca6e3
move % inside UOp mod_folding and remove deprecated tests (#6085)
[run_process_replay]
2024-08-14 23:25:10 -04:00
qazal 2bf7b56485
minor test fixups from the AST is UOp diff (#6081)
* add assert_equiv_uops cache

* dont expect lowering and schedule errors
2024-08-14 23:58:04 +03:00
George Hotz 64563abc90
add LSTMCell to nn (#6080)
* add LSTMCell to nn

* lstmcell works with no input on first

* fix no bias 0

* simpler
2024-08-14 12:08:42 -07:00
chenyu 6b3112d525
fix qcom process_replay for kernel diff (#6079)
* debug why qcom process_replay does not run

skipping the wrong exception?

* um-hum

* get_step_times was parsed incorrectly

* cleanup
2024-08-14 15:05:49 -04:00
chenyu 2fe9d62451
increase test_recursive_add time from 1s to 2s (#6078)
flaky https://github.com/chenyuxyz/tinygrad/actions/runs/10392144818/job/28776666700
2024-08-14 13:52:02 -04:00
samm393 2dc586ffe5
Shape change bitcast for more dtypes (#6047)
* bitcast & tests

* use to_dtype

* put disk tensor tests back

* tests

* bitmask

* no bitmask

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-14 10:03:34 -07:00
qazal 83a2543c74
spec for in order LOAD/STORE indexing (#6073)
* test_unaligns_idxs

* spec for in order LOAD/STORE indexing

* test UOps.SPECIAL

* check for supports_float4
2024-08-14 19:18:00 +03:00
chenyu 5048f9a4d5
test linearizer failure 49 (#6074)
with UOP_IS_SYMBOLIC=1, on METAL it breaks store fusion and have A+B and B+A being two different UOp
2024-08-14 11:29:10 -04:00
qazal 30035df5a4
add metal process replay back (#6068)
test this new one
2024-08-14 12:29:56 +03:00
chenyu 1782e4f64d
use div folding to do lt folding (#6065) 2024-08-13 16:59:05 -04:00
chenyu e3af273fa1
touchup cl_errors (#6058)
* touchup cl_errors

* update test
2024-08-13 13:06:59 -04:00
qazal 9145ad52ff
revert UOps eq, this needs to be isolated in realize.py (#6063)
This reverts commit dccca7f227.
2024-08-13 18:02:34 +03:00
Tobias Fischer 6e3eb50fd1
added fix and reg tests (#6060) 2024-08-12 21:00:48 -04:00
qazal dccca7f227
test: uop and lazyop have the same compare (#6053)
* test: uop and lazyop have the same compare

* typings

* self.assert_equiv_uops -> assertEqual

* hash dtype

* test nop too

* TestPatternMatcher never used this compare anyway

* nop eq and ne tests
2024-08-13 00:33:19 +03:00
chenyu 3f2d24a6ec
test_failure_48 for wrong truncation in idx on NV (#6055)
also added `RAWAST` to print pre-modified AST in DEBUG=3
2024-08-12 16:17:42 -04:00
chenyu 6ed9711898
UOps pattern (x%c)+(x//c)*c = x (#6051)
pretty cool that this is very easy to write now
2024-08-12 14:58:48 -04:00
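The pattern above is the defining divmod identity, which holds for Python's floor-division/modulo pair for any nonzero c. A brute-force check:

```python
# (x % c) + (x // c) * c == x for all integers x and nonzero c
for x in range(-20, 21):
    for c in (1, 2, 3, 7):
        assert (x % c) + (x // c) * c == x
print("divmod identity holds")
```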
ignaciosica 777d6b3349
Fix compile error for max with inline const (#5840) 2024-08-12 23:40:39 +08:00
ignaciosica 164ca5632e
split tensor core tests (#6041) 2024-08-12 09:42:02 -04:00
chenyu 7ce716b3a0
bigint -> pyint [run_process_replay] (#6040)
it's a python int. priority should be higher than bool, but we are not using it in type promo now.
2024-08-12 09:12:23 -04:00
Timmy a00994b423
Lowerer Multireduce Uopgraph (#6007)
* uopgraph changes

* fixing for non-reducing ranges

* multireduce tests

* linters

* linters

* removing comments

* removing arg[1]

* linters

* prettier

* linters

* more linters

* use any instead of intersection
2024-08-12 15:16:07 +03:00
qazal 7d1f118731
use assertIs in test_schedule (#6035)
* use self.assertIs in test_schedule

* test_lazybuffer
2024-08-11 19:19:18 +03:00
qazal b918e3c255
cache assert_equiv_uops (#6033) 2024-08-11 12:17:05 +03:00
George Hotz 1b3443902c
don't use tgmath with clang (#6029)
* don't use tgmath with clang

* fix tests

* nostdlib for clang

* needs ffreestanding on OSX
2024-08-10 13:58:19 -07:00
chenyu 5820940d98
more relax rtol for test_arange_fuse_grouped_children (#6027)
one more https://github.com/chenyuxyz/tinygrad/actions/runs/10334072657/job/28607120462
2024-08-10 16:10:03 -04:00
chenyu 10374a2741
relax rtol for test_arange_fuse_grouped_children (#6026)
flaky https://github.com/tinygrad/tinygrad/actions/runs/10333939631/job/28606831006?pr=6023
2024-08-10 15:49:11 -04:00
George Hotz cf7d3c1eb8
fix tests locally on metal (#6025)
* remove contiguous child, it was breaking tests locally

* hmm, it's still needed

* include NOOPT in method cache key
2024-08-10 12:36:22 -07:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
George Hotz cfb04c67d1
run unit tests separate from others (and only once) (#6020)
* run unit tests separate from others

* ignore unit tests elsewhere
2024-08-10 11:17:56 -07:00
uuuvn ee3b015407
ELF loader strtab fix and tests (#6011)
* ELF loader strtab fix and tests

* ruff

* typos

* only one test
2024-08-10 10:13:16 -07:00
Jun Zhang 54e176fb4f
Ignore non-computational backends when overwriting the default (#5770) 2024-08-10 09:23:29 -07:00
qazal 3ef2788c4f
hotfix: run the entire test_conv_bw schedule (#6014) 2024-08-10 17:55:41 +03:00
qazal 0e62076cf5
more process replay cleanups (#6013)
* more process replay cleanups

* comma benchmark missing
2024-08-10 17:29:10 +03:00
chenyu 63a8bc29d4
addition divisor in UOp div_folding (#6002)
in addition to try gcd of all terms, also try least common divisor of all MULs
2024-08-09 20:09:05 -04:00
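The gcd step in div folding simplifies sums like (4*x + 8*y)//4 by dividing every term coefficient and the divisor by their common factor. A standalone sketch of just that step (hypothetical `fold_div` helper, coefficients only):

```python
from math import gcd
from functools import reduce

def fold_div(coeffs, divisor):
    # If every term coefficient and the divisor share a factor g > 1,
    # divide them all by g: (4*x + 8*y)//4 -> (1*x + 2*y)//1 = x + 2*y.
    g = reduce(gcd, coeffs + [divisor])
    if g == 1:
        return None  # nothing to fold
    return [c // g for c in coeffs], divisor // g

print(fold_div([4, 8], 4))  # ([1, 2], 1)
```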
chenyu 5961faa4be
minor change to UOp div_fold (#6004)
remove an unnecessary gcd and swap the quo rem order, minimize diff for divisor pr
2024-08-09 17:09:59 -04:00
qazal 7373b05ee8
assert conv bw reduceops merge [compare_schedule] (#6001)
* assert conv bw reduceops merge [compare_schedule]

* diff with ref_commit_hash
2024-08-09 19:29:56 +03:00
qazal b67d521a07
assert test_conv_bw correctness (#6000)
* assert test_conv_bw correctness

* reorder half

* metal and clang still red
2024-08-09 18:30:36 +03:00
qazal a833f1a735
scheduler process replay with [compare_schedule] (#5997) 2024-08-09 16:58:22 +03:00
qazal 24c7c41ce0
diff LazyBuffer schedules in process replay (#5996)
* start diff printing

* this should be 2

* add to process_replay.py

* enable schedule capture

* arange diff is process replay
2024-08-09 14:16:43 +03:00
chenyu 1f1eb46af6
more failed simplified UOp div test case (#5992)
this speculative div was handled by "divisor" in symbolic.
2024-08-08 18:39:25 -04:00
chenyu c3e1ae2535
add failed simplified UOp div test case (#5990)
more cases!
2024-08-08 17:37:48 -04:00
nimlgen 38d5eecc68
hcq profiler support args (#5989)
* hcq profiler support args

* bytes -> _bytes

* fix

* add test

* mypy

* not f strings

* precision
2024-08-09 00:18:36 +03:00
qazal 45b1761175
smaller test_llama_embedding + assert correctness (#5986)
* smaller test_llama_embedding in CI

* test correctness
2024-08-08 22:11:29 +03:00
Timmy 8c99bdab08
More Multireduce Tests (#5968)
* multireduce tests

* linters

* more linters

* more linters

* seeing how it works with parallel
2024-08-08 22:04:08 +03:00
gswangg df44a4e861
Make vectorization of CONST explicit (#5322)
* remove test_const_vectorize_fold

* remove const folding UPat for VECTORIZE

* refactor cstyle render_const

* remove calls to dtype.scalar() in render_const

* add assert

* add vectorized const to UOp.const

* add UPat GEP-VECTORIZE-CONST -> CONST

* render_vectorize for DEFINE_ACC in cstyle

* add back missing render_cast in render_const

* generate vectorized consts as UOps for DEFINE_ACC

* update asserts for DEFINE_ACC with VECTORIZE src

* add UPats for PHI with VECTORIZE src

* use prev rendered vectorize in DEFINE_ACC render

* update DEFINE_ACC in python runtime

* update vectorized DEFINE_ACC in PTXRenderer

* rebase DEFINE_ACC changes on lowerer

* verbose rewrite of bad UPats

* simplify UOps.CONST implementation in ops_python

* update sum_collapse UPats for DEFINE_ACC-VECTORIZE

* revert linearizer to TOT

* fix DEFINE_ACC implementation in ops_python

* simplify DEFINE_ACC in cstyle

* Fix linter error

* support VECTORIZE in fold gated load/store UPat

* support VECTORIZE in other fold gated load UPats

* rewrite VECTORIZE in UPat for no input DEFINE_ACC

* simplify DEFINE_ACC render in cstyle

* make VECTORIZE rules more concise

* add more vectorize fold tests

* inline VECTORIZE-CONSTs in cstyle render

* revert VECTORIZE/GEP rule refactor

* revert cstyle render_const refactor

* inline VECTORIZE-CONSTs in cstyle render

* implicitly vectorized const rendering -> explicit

* WMMA VECTORIZE CONST process replay hacks

* VECTORIZE CONST NAN process_replay hacks

* more VECTORIZE CONST NAN hacks

* cleanup process_replay hacks

* isnan() -> not isfinite() cstyle VECTORIZE CONST

* tweak isnan and isfinite checks VECTORIZE CONST

* tweak for positive vs negative infinity VECTORIZE CONST

* add assert to PTX CONST render

* process_replay VECTORIZE CONST render parity for PTX STORE

* vmin/vmax for VECTORIZE'd CONST

* update WMMA folding rules

* add tests for WMMA VECTORIZE fold

* hack for cstyle half4 CONST zero process_replay parity

* revert PTX backend changes

* add back minimal DEFINE_ACC PTX change

* remove cstyle process_replay hacks

* remove dead code in PTX CONST render

* cleanup vmin/vmax logic for VECTORIZE'd CONSTs

* update vectorize fold tests to use DEFINE_VAR

* fix long line formatting in test

* remove unwanted merge artifact

* more vmin/vmax cleanup

* remove unnecessary asserts

* yet more vmin/vmax cleanup

* get rid of explicit VECTORIZE CONST logic in _min_max

* reuse CONST instead of creating a new one

* remove unneeded cast

* handle DType correctly in sconst

* improve readability of tests

* save a line

* save another line

* tuplize pats in src

* remove GEP-VECTORIZE pats

* add vec +0 fold

* HACK: fold only vec8 +0

* remove vectorized ALU fold hack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-08 20:59:05 +03:00
chenyu 62c77a2831
trim const in UOp div_folding (#5982)
simplify `(4*x+4*y+7)//16` to `(x+y+1)//4`.
fixed `GPU=1 UOP_IS_SYMBOLIC=1 IMAGE=2 python -m pytest test/test_ops.py -k conv`
2024-08-08 12:49:05 -04:00
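The const-trimming rewrite above can be checked numerically; a minimal sketch in plain Python (not tinygrad's actual rewrite code) verifying the claimed fold:

```python
# (4*x + 4*y + 7)//16 == (x + y + 1)//4: rewrite 4x+4y+7 as 4*(x+y+1)+3;
# the +3 remainder can never push the sum past the next multiple of 16
def folded(x, y):
    return (x + y + 1) // 4

for x in range(64):
    for y in range(64):
        assert (4*x + 4*y + 7) // 16 == folded(x, y)
```

Under Python's floor division the identity even holds for negative values, since `4k+3` with `k = x+y+1` stays strictly inside the same bucket of 16.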
qazal e6d41b0ce7
hotfix: adjust test_backward_pass_diamond_model thresholds (#5981) 2024-08-09 00:20:53 +08:00
nimlgen 183c4c91a3
fix non-jitted transfers in profile (#5980)
* fix transfers in profile

* fix linter

* sync to be sure everything is recorded
2024-08-08 17:58:08 +03:00
George Hotz c5baa3d66b hotfix: don't run OOM test in CI 2024-08-07 22:19:29 -07:00
chenyu 859d0e4709
UOp simplify `(x+c0)*c1 -> x*c1+c0*c1` (#5973) 2024-08-07 21:25:22 -04:00
wozeparrot 97d708252a
remove realize from threefry (#5969) 2024-08-07 15:08:49 -07:00
George Hotz bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
nimlgen 8d8704af2d
fix amd exec_update for locals (#5966) 2024-08-07 21:02:56 +03:00
tyoc213 0c4e9dbe71
retrieve defined opencl error codes (#5792) 2024-08-07 10:46:24 -07:00
qazal d6f4a61c42
graph LBScheduleItem [run_process_replay] (#5960)
* add toposort key to LBScheduleItem

* use dedup

* graph LBScheduleItem

* make that comment beautiful again

* diff_schedule utils

* update fuzz_schedule
2024-08-07 19:59:11 +03:00
qazal 7677361d90
test pushing through different expands in 1 kernel (#5963)
* test pushing through different expands in 1 kernel

* realize eye

* back to test_example_matmul
2024-08-07 19:33:18 +03:00
qazal 39dda3d042
rename prescheduled items to lsi [run_process_replay] (#5959)
* rename to lsi

* fuzz_schedule more typings

* rename fuzz_schedule
2024-08-07 14:31:50 +03:00
qazal 728b7e189e
diff_schedule tests [run_process_replay] (#5958)
* diff_schedule tests [run_process_replay]

* ok to run serial
2024-08-07 13:50:27 +03:00
chenyu a7163b80d8
lower test_transcendental fuzz test threshold for sin float64 (#5956) 2024-08-07 02:04:37 -04:00
chenyu fa3a36e576
fancier UOp div gcd folding (#5953)
combine and cancel the remaining const based on the gcd of the other terms, like SumNode does.
2024-08-07 02:04:25 -04:00
chenyu aa7fd7ef74
Use `(-self).lt(-x+1)` for `UOp.ge` (#5955)
matched symbolic and fixed UOP_IS_SYMBOLIC=1 arange folding
2024-08-07 01:31:27 -04:00
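The identity behind this rewrite is easy to check: over the integers, `x >= c` is equivalent to `-x <= -c`, i.e. `-x < -c + 1`, which lets `ge` be expressed with only a `lt` primitive. A small sketch (plain Python, not tinygrad code):

```python
# ge expressed through lt, valid over the integers
def ge_via_lt(x, c):
    return -x < -c + 1

for x in range(-8, 9):
    for c in range(-8, 9):
        assert ge_via_lt(x, c) == (x >= c)
```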
George Hotz 658d58784b
embedding doesn't cast (#5952)
* embedding doesn't cast

* test the right thing

* too much annoying with that test
2024-08-06 17:49:14 -07:00
wozeparrot 30d0cb2a82
fix: fix transcendental flakyness on exp float with 9.96875 (#5951) 2024-08-06 17:32:13 -07:00
George Hotz 3a0515ea22 hotfix: process_replay/diff_schedule.py to LBScheduleItem 2024-08-06 17:01:05 -07:00
chenyu aee737bd9e
divide by gcd in UOp div folding (#5949)
* divide by gcd in UOp div folding

`(6x+6y)//16 -> (3x+3y)//8` etc
simpler version

* only factor out const

* don't apply for unsigned

* don't need that if

* space
2024-08-06 20:00:57 -04:00
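The gcd-divide fold is sound because dividing every coefficient and the divisor by their common gcd leaves the rational value of the quotient unchanged, so the floor is unchanged too. A hedged sketch of the idea on plain coefficient lists (hypothetical `fold` helper, not tinygrad's UOp code):

```python
from functools import reduce
from math import gcd

def fold(coeffs, d, vals):
    # divide every term coefficient and the divisor by their common gcd;
    # the rational value is unchanged, so floor division gives the same result
    g = reduce(gcd, coeffs, d)
    return sum((c // g) * v for c, v in zip(coeffs, vals)) // (d // g)

# (6*x + 6*y)//16 -> (3*x + 3*y)//8
for x in range(-10, 11):
    for y in range(-10, 11):
        assert fold([6, 6], 16, [x, y]) == (6*x + 6*y) // 16
```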
George Hotz 6d1fdcfce2
don't reduce the same thing in a vector (#5950)
* don't reduce the same thing over and over

* cleaner way to write it that doesn't loop
2024-08-06 16:59:15 -07:00
qazal d5d7f4e7b8
more TestIndexing correctness asserts [run_process_replay] (#5948)
* use torch in test_mnist_val

* more asserts
2024-08-07 01:50:42 +03:00
chenyu 794796256c
UOp.const_factor [run_process_replay] (#5945)
* UOp.const_factor [run_process_replay]

simplify mod and div folding

* test does not work now
2024-08-06 18:18:29 -04:00
George Hotz 73d4d51845
add LBScheduleItem type [run_process_replay] (#5944)
* add LBScheduleItem type [run_process_replay]

* minor cleanups

* fix

* fix fuzz tests

* add group cache type
2024-08-06 14:49:40 -07:00
qazal 7b6496f2e6
fix the reduceops cache breaking beautiful_mnist (#5938)
* fix the reduceops cache breaking beautiful_mnist

* test_sparse_categorical_crossentropy_simple

* starting tests

* atol from test_nn

* test_sparse_categorical_crossentropy_alt

* dont use torch
2024-08-07 00:02:54 +03:00
George Hotz 1417cc8df1
can reenable that test now (#5914) 2024-08-06 13:38:21 -07:00
chenyu 489575c3be
more UOp sum div with gcd tests (#5936)
* more UOp sum div with gcd tests

* one more
2024-08-06 12:50:10 -04:00
ignaciosica 81ae9fadc8
Float4 support for CLANG (#5915)
* float4 support on clang

* skip linearizer tests that require locals

* add aligned attribute
2024-08-06 07:50:12 -07:00
qazal a7db4c3ee9
show timings for DIFF_ARANGE=1 (#5935)
* show timings for DIFF_ARANGE=1

* always with DEBUG=2
2024-08-06 17:20:38 +03:00
qazal 102a8c184b
diff fused arange schedules with ARANGE_DIFF=1 (#5934)
* diff fused arange schedules with ARANGE_DIFF=1

* better llama diff
2024-08-06 16:52:26 +03:00
qazal 3d4742dd2e
override output shape in fused assign (#5930)
* override output shape in fused assign

This makes

```
FUSE_ARANGE=1 JIT=0 python3 examples/llama.py --gen 1 --prompt "Hello." --count 10 --temperature 0 --timing
```
work. In general we should assert ASSIGN doesn't change shape.

* merge asserts
2024-08-06 13:28:50 +03:00
chenyu 09b7722637
UOp generic div folding (#5896) 2024-08-05 21:38:43 -04:00
George Hotz 3e1336957d
test arange with all opts (#5923)
* test arange with all opts

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py
2024-08-05 18:38:25 -07:00
George Hotz 5d17f54e3c
fast mnist indexing (#5921)
* fast mnist indexing

* more tests

* remove those tests, new indexing rule
2024-08-05 13:55:15 -07:00
George Hotz e81c18f494
make the arange test check correctness [run_process_replay] (#5920) 2024-08-05 13:41:06 -07:00
George Hotz 8d1c884e78
capture the const pattern in both directions (#5919)
* capture the const pattern in both directions

* add regression test
2024-08-05 12:15:38 -07:00
George Hotz 42f599870c
unroll arange is broken (#5918)
* unroll arange is broken

* fix unrolled arange

* one more test
2024-08-05 12:15:07 -07:00
qazal 70949ea7e6
test cstyle compile error for max with inline const (#5838)
* test_failure_46

* GPU=1 fails too

* add test_renderer

* add failing platforms

* nv too

* assert return value
2024-08-05 19:02:16 +03:00
qazal e0c6520138
check arange fusing with VIEW and COPY (#5912)
* check arange fusing with VIEW and COPY

* gpu and clang
2024-08-05 17:09:21 +03:00
nimlgen 590b9ebb34
hcq copy queue is optional (#5909)
* hcq copy queue is optional

* one more

* this
2024-08-05 14:03:25 +03:00
George Hotz 159ac06b5b
remove unused reduce rules + improve unparented (#5908)
* remove unused reduce rules [run_process_replay]

* this work

* those tests are meaningless now
2024-08-04 18:18:27 -07:00
George Hotz d7387d31bf
remove useless reduce cases [run_process_replay] (#5907)
* remove useless reduce cases [run_process_replay]

* do_reduce cleanup

* more cleanups + no longer supported tests

* Revert "more cleanups + no longer supported tests"

This reverts commit e9f2f6ba7061f8697a308aacdc3442fa922a77f5.

* no longer supported tests

* switch ReduceOps.SUM -> BinaryOps.ADD
2024-08-04 17:11:08 -07:00
George Hotz be8958e26b
use CONTRACT before REDUCE (#5903)
* use CONTRACT before REDUCE [run_process_replay]

* support half expand

* EXPAND GEP
2024-08-04 16:17:33 -07:00
chenyu 4a65010de8
remove CUDACPU flag in tests [run_process_replay] (#5902)
no longer used
2024-08-04 16:06:38 -04:00
qazal aad9234e52
test fused precompute_freqs_cis (#5900)
* test_precompute_freqs_cis

* tiny for ci
2024-08-04 21:01:05 +03:00
chenyu c67e9887f7
support using str to specify dtype (#5897)
* support using str to specify dtype

in Tensor creation and as args to `cast` and `bitcast`, and for acc_dtype

* more tests
2024-08-04 12:56:28 -04:00
qazal 4c5ef2cc4f
setitem with arange fusion 1 (#5898) 2024-08-04 16:09:21 +03:00
chenyu da61dea1b2
simple failed UOp sub symbolic test case (#5894) 2024-08-03 14:27:23 -04:00
qazal 56ef9e453e
pad reduceops to the max of each dimension (#5889)
* early verify

* pad reduceops to the max of each dim

* remove the function
2024-08-03 14:03:30 +03:00
qazal 65fa86901a
indexing fusion 2 (#5888)
* arange fusion

* kernels that fuse

* tests
2024-08-03 13:13:39 +03:00
qazal af59b2eea9
tests from the indexing fusion branch (#5886) 2024-08-03 11:56:48 +03:00
chenyu d5de44340e
UOp add mod folding (#5862)
* UOp add mod folding

* that passes now
2024-08-02 18:31:46 -04:00
chenyu 41bbd3f4c1
update UOp mod reduction patterns (#5883)
prepare for generic mod folding; also some test changes from the mod folding pr
2024-08-02 17:43:40 -04:00
wozeparrot acadccf344
comma benchmark (#5518) 2024-08-02 14:36:54 -07:00
Elias Wahl 4a114756f6
New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
nimlgen 2777784b91
add dependency viewer to hcq profiler (#5874)
* hcq profiler support deps

* clean up

* cleaner

* cleanup

* revert this

* linter

* mypy

* add test

* sync is strange, need to take the end

* linter + test
2024-08-02 22:07:01 +03:00
George Hotz 23e8c39288
get program fields in __post_init__ [run_process_replay] (#5878)
* get program fields in __post_init__ [run_process_replay]

* remove print
2024-08-02 09:57:12 -07:00
qazal 8611fa6c99
apply opts.extra_matcher in process replay [run_process_replay] (#5877) 2024-08-02 18:07:58 +03:00
qazal 2a791f7924
fuzz uops is simpler with List[UOp] [run_process_replay] (#5875)
* remove from fuzz_uops

* update fuzz_uops.py

* add to realize.py
2024-08-02 17:28:15 +03:00
George Hotz 877e0b4ba0
define global only has the index [run_process_replay] (#5869)
* define global only has the index [run_process_replay]

* fix that linearizer test

* fix ptx

* stupid ptx fix
2024-08-01 19:01:15 -07:00
chenyu f27f949a5d
Revert "revert some UOp IDIV bound (#5863)" (#5871)
This reverts commit 0c8d202348.
2024-08-01 21:38:31 -04:00
chenyu df138bc558
Revert "revert a mod pattern (#5864)" (#5870)
This reverts commit 5c8de2d044.
2024-08-01 20:44:26 -04:00
chenyu 1b0314d9ef
Revert "remove one more UOp mod pattern (#5865)" (#5868)
This reverts commit b03b8e18c2.
2024-08-01 20:28:35 -04:00
George Hotz d73bc85ba9
UOpGraph not in renderer or Program [run_process_replay] (#5867)
* UOpGraph not in renderer or Program [run_process_replay]

* fix some tests

* fix ptx
2024-08-01 16:20:30 -07:00
chenyu b392b8edc3
increase atol and rtol test_gemm_fp16 (#5866)
* increase atol and rtol test_gemm_fp16

made it pass with NOOPT which has larger accumulated error

* revert that
2024-08-01 19:09:58 -04:00
chenyu b03b8e18c2
remove one more UOp mod pattern (#5865)
fixed UOP_IS_SYMBOLIC=1 test_failure_40
2024-08-01 18:29:04 -04:00
chenyu 5c8de2d044
revert a mod pattern (#5864)
fixed UOP_IS_SYMBOLIC=1 linearizer failure 47
2024-08-01 17:24:26 -04:00
George Hotz 2d3c7e4d4e
some TestPickleJIT tests (#5860)
* some TestPickleJIT tests

* hotfix: print which opencl device we are using
2024-08-01 12:39:59 -07:00
chenyu 0c8d202348
revert some UOp IDIV bound (#5863)
* revert some UOp IDIV bound

breaks conv with UOP_IS_SYMBOLIC, added some conv tests in CI

* those are correct

* skip slow ones
2024-08-01 15:09:06 -04:00
George Hotz 53fcac9e80 hotfix: increase time on flaky NV test 2024-08-01 10:20:07 -07:00
qazal 26d0265d66
test schedule of LazyBuffers [run_process_replay] (#5859) 2024-08-01 19:06:29 +03:00
David Hou eb91423cb4
MLB support reshape for uneven shards (#5804)
* cleaner uneven reshape

* update test
2024-08-01 02:36:03 -07:00
David González Martínez 0f09b94c43
add failing test for second order derivatives (#5772)
* add failing test

* fix lint

* fix bad merge

* fix again

* fix test

* more minimal
2024-08-01 02:34:47 -07:00
George Hotz 9d05dfb6f4
move JIT graphing into CapturedJit (#5852)
* move JIT graphing into CapturedJit

* better

* _jit_cache

* clear inputs cleanup

* test_pickle_jit with graph + cleanup

* 0 is fine to start

* support None in bufs

* alloc real buffers

* cleaner
2024-07-31 20:48:17 -07:00
chenyu 0ec732b494
test lin fail 47 for UOP_IS_SYMBOLIC (#5853)
failing arange example with UOP_IS_SYMBOLIC
2024-07-31 23:09:22 -04:00
George Hotz c6a8395f1b
CapturedJit is fun to pickle [run_process_replay] (#5851)
* CapturedJit is fun to pickle

* export input replace
2024-07-31 17:23:01 -07:00
George Hotz 72621d9e7c
count the specials in uops [run_process_replay] (#5848)
* count the specials in uops [run_process_replay]

* cleanups
2024-07-31 14:53:18 -07:00
chenyu c2ffcf6887
remove the wrong mod UOp pattern (#5847)
don't think we are hitting it because of the stride construction, and it's wrong and not needed
2024-07-31 16:24:25 -04:00
qazal 8174c438a3
pad test_failure_45 (#5846) 2024-07-31 23:08:48 +03:00
George Hotz 8672a9db3f
add test to validate lazyops dims (#5845) 2024-07-31 12:59:38 -07:00
chenyu 4fe5b95568
fix UOp ALU bound (#5844)
* fix UOp ALU bound

root cause of the resnet bug: the ALU bound is only correct for scalars, not vectorized values

* it can be nan...
2024-07-31 15:19:31 -04:00
nimlgen f768935be8
add RING_ALLREDUCE_THRESHOLD (#5835)
* add RING_ALLREDUCE_THRESHOLD

* benchmark

* fixes

* fix n_gpus

* unused import

* remove debug=2
2024-07-31 16:13:09 +03:00
chenyu 2e087ca8e4
UOp bound for div negative number (#5808) 2024-07-31 02:10:23 -04:00
qazal bcbd925001
hcopts failing test for fused arange kernel (#5815)
* add failure_43

* n 45
2024-07-31 09:02:44 +03:00
qazal ed556c260e
UOps.IF rules more tests (#5831)
* init tests

* split tests

* assert multiple gates simplicity
2024-07-31 00:11:02 -04:00
David Hou 492a696d14
allow specify splits in shard, handle multiple different splits in MLB.e (#5599)
* allow specify splits in shard, handle multiple different splits in MLB.e

* line width

* linter

* don't use Device in docstring

* specify size of shards instead of boundaries

* adjust docstring for specify size of shards instead of boundaries

* don't allow splits on symbolic axis?

* just allow sint in splits_to_bounds

* add message for assert

* bounds instead of splits to save lines

* fix types

* reduce diff

* fix

* tuple

* golf :(

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-30 19:33:04 -07:00
chenyu c3da458bc3
UOp if min==max folds to CONST (#5828)
* UOp if min==max folds to CONST

* fix test
2024-07-30 22:14:22 -04:00
George Hotz e6879035a0
work to make GEMV fast (#5824)
* work to make GEMV fast

* half8 cast

* align struct

* fix amd

* float8 is a later problem
2024-07-30 17:41:40 -07:00
chenyu 02f0be03f2
tests on UOp div negative number and arange opts (#5825) 2024-07-30 20:06:57 -04:00
George Hotz 693990a346
swap src[2] and src[3] in load [run_process_replay] (#5821)
* swap src[2] and src[3] in load [run_process_replay]

* cleanups + bugfix

* fix ptx
2024-07-30 14:04:13 -07:00
George Hotz 17a2f74412
new style load/store folder (#5784)
* remove old index reorder

* new style folder

* works better

* dedup

* one failure

* this is fine now...

* expander_rewrite

* images broken, but all else should work

* cleanups

* make tests work with old

* fix images

* cleanups + bugfix

* minor fixes

* fix gated store folding

* flip gate_creator and expander

* fix gated store

* remove unneeded rules

* lines getting close

* line count good
2024-07-30 13:17:20 -07:00
qazal 03d866b84f
UOps.IF with rewrite rules (#5812)
* expand merge

* merge barriers

* gate_folder

* test_linearizer_failures

* this can be here

* bring the new repr back

* gate_folder2

* gate_creator is better

* gate_folder

* dedup conditions

* early gate folding

* dedup barrier

* fold noop conditions

* all consts can go away

* free lines
2024-07-30 20:50:56 +03:00
chenyu defd89e8e0
unify negative shape creation to raise ValueError (#5817)
[run_process_replay]
2024-07-30 13:42:59 -04:00
P4ssenger 6742a4789a
Add check for negative dimension in view (#5790)
* add check for negative dimension in view

* add negative dim tests

* move check to tensor level

* fix error message

* move check to view create

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-30 13:26:27 -04:00
Francis Lata ce61be16f1
clean up how preprocessed folder is defined (#5813) 2024-07-30 12:35:26 -04:00
qazal 5e827e51d2
add llama3 BEAM=2 failures to test_linearizer_failures (#5553)
* skips

* opts.device

* benchmarks

* add to test_linearizer_failures

* remove hardcoded ones

* linter

* skip cpu
2024-07-30 00:37:32 +03:00
samm393 573e0f9a48
remove float division from idiv in python_alu (#5777)
* removes float division from idiv in python_alu

* add test

* cleaner logic

* pass clang unsigned literals correctly

* suffix ULL instead of U

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-29 12:14:12 -04:00
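The motivation for removing float division from idiv: routing integer division through a float loses precision once values exceed 2**53. A truncating (C-style) idiv can be written with only integer ops; a sketch of the idea, not tinygrad's exact python_alu code:

```python
def idiv(a, b):
    # C-style truncation toward zero, using only integer floor division
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

big = 2**53 + 1
assert int(big / 1) != big                       # float round-trip drops the low bit
assert idiv(big, 1) == big                       # integer-only path is exact
assert idiv(7, -2) == -3 and idiv(-7, 2) == -3   # truncates toward zero, unlike Python's //
```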
samm393 2c94316bd2
ull literal support and test (#5789)
* ull literal support and test

* missing .numpy()
2024-07-29 11:50:49 -04:00
nimlgen ab3839a80a
cleanup nv/cuda compilers (#5767)
* cleanup nv/cuda compilers

* destroy prog

* small test

* fix test

* nv ptx rewrite key

* jitlink free

* ptx is part of cuda
2024-07-29 13:50:03 +03:00
chenyu e7a14f398e
more uop_symbolic tests for divmod pairs (#5785) 2024-07-28 21:27:06 -04:00
George Hotz 76d191ab94
move consts to end of add (#5783)
* move consts to end of add

* better

* fix infinite loop
2024-07-28 17:38:57 -07:00
chenyu 71a64d8252
UOps.MUL bound when one is negative (#5781)
* UOps.MUL bound when one is negative

also one more distribute_mul rule

* don't always expand
2024-07-28 19:02:47 -04:00
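When one operand's range spans negative values, the product's bounds are no longer `(amin*bmin, amax*bmax)`; standard interval arithmetic takes the min and max over all four endpoint products. A minimal sketch of that bound (illustrative, not tinygrad's vmin/vmax code):

```python
from itertools import product

def mul_bounds(amin, amax, bmin, bmax):
    # a bilinear function on a box attains its extremes at the corners,
    # so the four endpoint products bound the whole product range
    prods = [x * y for x, y in product((amin, amax), (bmin, bmax))]
    return min(prods), max(prods)

for case in [(-2, 3, -4, 5), (0, 4, -3, -1), (-5, -2, -7, 6)]:
    lo, hi = mul_bounds(*case)
    vals = [x * y for x in range(case[0], case[1] + 1) for y in range(case[2], case[3] + 1)]
    assert (lo, hi) == (min(vals), max(vals))
```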
qazal b775db6b60
high-level benchmark timing diff (#5776)
* high level timings

benchmark times

fix defs

* use the name map

* skip last task
2024-07-28 23:42:57 +03:00
chenyu 600a39771d
fix Tensor.arange if (stop-start) and step have different signs (#5775) 2024-07-28 14:34:10 -04:00
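The fix makes arange return an empty tensor when `(stop-start)` and `step` have opposite signs, matching Python's `range`. The element count follows `ceil((stop-start)/step)` clamped at zero; a small sketch with a hypothetical `arange_len` helper (not tinygrad's actual implementation):

```python
import math

def arange_len(start, stop, step):
    # number of elements: ceil((stop-start)/step), clamped at 0 when
    # (stop-start) and step point in opposite directions
    return max(0, math.ceil((stop - start) / step))

assert arange_len(0, 10, 2) == 5
assert arange_len(0, 10, -1) == 0                 # opposite signs -> empty
assert arange_len(10, 0, -3) == len([10, 7, 4, 1])
assert list(range(10, 0, -3)) == [10, 7, 4, 1]    # the semantics being matched
```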
David González Martínez d0fd84e617
feat: allow passing gradient to .backward() to compute vjp (#5771)
* feat: allow passing gradient to .backward() to compute vjp

* fix

* refactor

* fix trailing whitespace
2024-07-28 11:13:18 -07:00
qazal e0e7293b0a
make process replay unique in retries [run_process_replay] (#5773) 2024-07-28 20:44:15 +03:00
qazal 95dda8dadf
more unmatching vectorize/gep asserts [run_process_replay] (#5760)
* merge vectorize/gep rules [run_process_replay]

* assert dtypes

* src=

* float2=(float4.x,float4.y)
2024-07-28 15:08:54 +08:00
chenyu bfbd7c5461
more generic UOp mul mod folding (#5765) 2024-07-27 20:20:35 -04:00
chenyu 80c6475757
update test_uop_symbolic to test UOp min and max (#5764)
covers #5750, #5748, #5741
2024-07-27 19:53:21 -04:00
nimlgen ed1d784077
test profiler timer sync across devs (#5751)
* test profiler timer sync across devs

* more correct

* typo
2024-07-27 16:47:37 +03:00
qazal 3e49d86c01
process replay diffs 3 things now (#5731)
* github api infra

* process replay is 3 parts now

* parse benchmarks

* add gh_token

* complete diff

* move process replay tests

* last successful run

* add tempdir

* skip master
2024-07-27 12:52:20 +03:00
qazal 57b4a8e98d
assert process replay asserts (#5737)
* assert process replay asserts

* one ci job is fine

* test: Revert "separate process replay main loop (#5734)"

This reverts commit 94d578396f.

* mac sed needs that

* Revert "test: Revert "separate process replay main loop (#5734)""

This reverts commit e4ad7684d5472a64841a66b43bc1db7c9bbbf9e8.

* disable process replay capture

* save time

* amd is tiny

* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz f8972ace38
test flops (and allow wide ALU in UOps) [run_process_replay] (#5749)
* flops test in external_test_speed_theoretical.py

* test speed theo

* min SZMAX

* allow wide ALU for things that support it

* needed for mypy
2024-07-26 21:07:28 -07:00
George Hotz 2fde2d2914 hotfix: external_test_speed_theoretical works on 24GB 2024-07-26 18:41:52 -07:00
George Hotz 829262a5ee add external_test_speed_theoretical 2024-07-26 17:45:22 -07:00
kormann a5ede535ef
NOp field name [run_process_replay] (#5742)
* rm def name

* add field name
2024-07-26 18:45:59 -04:00
George Hotz c50e374bb6
multiple locals + get_kernel_modifier + fix valid (#5739)
* multiple locals + get_kernel_modifier + fix valid

* fix test pattern matcher
2024-07-26 15:10:10 -07:00
chenyu dc7483ee6f
UOp simple div folding (#5740)
made UOp.divides return the Optional[quotient] and used it for simple div folding
2024-07-26 17:14:32 -04:00
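The pattern here — turning a boolean "divides" check into one that returns the quotient or None — can be sketched on plain sum-of-terms coefficient lists (hypothetical helper, not tinygrad's UOp.divides):

```python
from typing import Optional

def divides(coeffs: list[int], c: int) -> Optional[list[int]]:
    # return the term coefficients divided by c, or None if any term isn't divisible
    if any(k % c != 0 for k in coeffs):
        return None
    return [k // c for k in coeffs]

# (4x + 8y) // 4 folds to x + 2y; (4x + 6y) // 4 does not fold
assert divides([4, 8], 4) == [1, 2]
assert divides([4, 6], 4) is None
```

Returning the quotient directly saves the caller a second pass over the terms when the fold applies.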
chenyu 671259417f
reuse UOp `__repr__` for NOp (#5738) 2024-07-26 16:59:55 -04:00
kormann b0c1dba299
named UOp class "NOP" [run_process_replay] (#5728)
* NOP

* fix const + simplify compile

* rm VAR for NOOP

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00
George Hotz 4df46eac67
clean up tensor cores [run_process_replay] (#5736)
* clean up tensor cores [run_process_replay]

* remove tuple(wmma_sz), self.opts.device

* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
qazal 94d578396f
separate process replay main loop (#5734)
* separate process replay main loop

* [run_process_replay]

* add kernel_changed

* test with [run_process_replay]

* revert temp [run_process_replay]
2024-07-26 21:43:08 +03:00
chenyu a4e9ebc68a
update test_uop_symbolic (#5733)
enabled more passing tests
2024-07-26 13:46:09 -04:00
chenyu 2cc55a3095
UOp simple mul add div fold (#5726) 2024-07-25 22:00:30 -04:00
chenyu 5521b6d437
UOp simple mul-add-lt fold (#5721) 2024-07-25 20:49:38 -04:00
qazal 1b53207b4f
revert isolated dags scheduling (#5724) 2024-07-25 19:45:12 -04:00
chenyu 845b0d1c9d
UOp more generic div folding (#5722)
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
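The two fold conditions from the commit message can be compared directly; a sketch in plain Python (the fold replaces `x // c` with the shared constant quotient):

```python
def can_fold_old(vmin, vmax, c):
    # old rule: whole range below c, so x // c is always 0
    return 0 <= vmin <= vmax < c

def can_fold_new(vmin, vmax, c):
    # new rule: both endpoints land in the same floor-division bucket
    return c > 0 and vmin // c == vmax // c

assert can_fold_new(0, 3, 4)                          # subsumes the old rule
assert can_fold_new(5, 6, 4) and not can_fold_old(5, 6, 4)  # x in [5,6]: x//4 == 1
assert not can_fold_new(3, 4, 4)                      # 3//4 != 4//4, can't fold
```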
chenyu a82815262c
more test_pattern_matcher fixups (#5714) 2024-07-25 14:12:21 -04:00
chenyu 05e02ddfb3
fixup test_pattern_matcher (#5712) 2024-07-25 13:48:52 -04:00
qazal 9ceb3a3d1f
beautiful_mnist -4.3% kernels (#5709)
* add is_complete

* partially delete forced_realized

* p2

* start

* refactor to can_group

* remove steps

* _get_inputs is nicer

* fix the cache

* cache is dict now

* rename to group
2024-07-25 20:30:49 +03:00
kormann 1e2eac755d
Fix repr upat (#5705)
* test

* fix

* x fix

* simpler

* rm extra space
2024-07-25 12:05:48 -04:00
qazal 1c992de257
hotfix: compare_schedule defaults to false (#5707) 2024-07-25 17:08:28 +03:00
qazal 489cda827a
more scheduler process replay tooling (#5706)
* more scheduler process replay tooling

* refactor to compare_schedule
2024-07-25 15:47:18 +03:00
qazal 4e070a2c89
start work on indexing fusion (#5590)
* start base

* the views add up

base reduceop st:
ShapeTracker(views=(View(shape=(60000, 1), strides=(1, 0), offset=0, mask=None, contiguous=True),))

top st:

ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

merged buf.st+st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

* p1

* some cleanups

* more cleanups

* one kernel

* more

* late fuse arange

* less lines

* more work

* fix st strides 1

* update test_schedule, start argmax

* test_tiny_argmax

* add FUSE_ARANGE

* more cleanup

* add utils

* reduce merging

* fix axis and fold if needed

* more fusion

* need to figure this out

* now fixing all of these

* todos+save a line

* ready for p1
2024-07-25 13:23:38 +03:00
nimlgen 08f47d7dc3
more info on failure 41 (#5704) 2024-07-25 12:14:28 +03:00
nimlgen 69d4f474d8
amd resnet pf (#5703) 2024-07-25 11:21:22 +03:00
chenyu 46e1151c02
UOp more generic mul -> mod folding (#5698) 2024-07-24 21:41:25 -04:00
chenyu 66a9c372af
UOp mod reduction (#5697) 2024-07-24 20:36:00 -04:00
chenyu 8648fb2636
UOp vmin/vmax on ADD (#5689) 2024-07-24 19:09:42 -04:00
chenyu 85710e86cb
UOps div folding (#5690)
#5689, with just div folding and new test cases
2024-07-24 14:21:44 -04:00
chenyu a7a77dfd83
UOp mul lt fold (#5677) 2024-07-24 02:49:25 -04:00
chenyu 4e85761d40
UOp mod folding (#5668) 2024-07-24 00:10:47 -04:00
George Hotz 053550c3f3
remove MERGE opt, cleanup wmma upcast (#5669)
* remove MERGE opt, cleanup wmma upcast

* upcast first

* fix broken vectorize folding rule
2024-07-23 20:43:42 -07:00
chenyu 3060e0be4f
add vmin vmax of SPECIAL (#5670)
* add vmin vmax of SPECIAL

folded stuff like (-1 < gidx0)

* flaky
2024-07-23 22:55:54 -04:00
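With a SPECIAL like `gidx0` carrying vmin/vmax (it ranges over `[0, n-1]`), comparisons against constants outside that range fold to booleans. A sketch of the folding logic with inclusive bounds (illustrative, not tinygrad's pattern matcher):

```python
def lt_fold(c, vmin, vmax):
    # fold (c < x) to a constant when c lies outside x's inclusive range
    if c < vmin:
        return True      # e.g. (-1 < gidx0) with vmin == 0
    if c >= vmax:
        return False
    return None          # can't fold

assert lt_fold(-1, 0, 14) is True    # (-1 < gidx0) for gidx0 in [0, 14]
assert lt_fold(14, 0, 14) is False
assert lt_fold(7, 0, 14) is None
```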
George Hotz fa14f7b4fd
switch contract arg to match expand arg [run_process_replay] (#5667)
* switch contract arg to match expand arg [run_process_replay]

* support multiaxis contract too, it's easy

* cancel contract/expand
2024-07-23 18:08:33 -07:00
George Hotz a85493bdbe multiaxis contract test 2024-07-23 15:09:15 -07:00
George Hotz e3f00ac77d
Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand

* fix test emulated CUDA tensor cores

* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu 16c27ae400
update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
chenyu 01fe00e055
skip test_failure_39 in CI (#5660)
took more than 2 minutes in CI metal; it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu 199b3bf02b
simple UOp lt/ge folding (#5657)
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
qazal b0fc5a4c6f
start scheduler process replay (#5656) 2024-07-23 20:02:51 +03:00
chenyu e210c87b4a
uop mod-mod simplification (#5650) 2024-07-23 12:33:55 -04:00
nimlgen 1384f08cd4
hcq profile tests (#5654)
* profile tests

* fixes

* remove linter
2024-07-23 18:40:33 +03:00
qazal 5f394fc9c6
more work toward non-blocking process replay (#5653)
* non-blocking process replay

* more actionable

* test it

* revert the test

* %s/logging.warn/logging.warning
2024-07-23 14:26:31 +03:00
qazal 7cb67e6fb2
merge gated stores spec (#5652)
* test_unmerged_ifs should merge ifs

* test_tiny_gate_store

* test_merge_ifs_alt

* assert assert asserts
2024-07-23 18:53:27 +08:00