* real strides with uops [run_process_replay]
* compare with old
* Revert "compare with old"
This reverts commit f53a8d42768e0b95d37b1bae8e80e288a69c6e3f.
* make those @unittest.expectedFailure
* Revert "late gate creation for STORE [run_process_replay] (#6373)"
This reverts commit c26744de9f.
* Revert "gated store rewrite to UOps.IF (#5976)"
This reverts commit 48061e8400.
* almost working with relu, even hackable... but acc size is wrong, fix needed
* upcast based on threads, change thread size to 4x4
* revert wrongfully commented assert
* fix tc load indexing
* modify for size 8
* fix bug for size 8
* Revert "fix bug for size 8"
This reverts commit cdb3f5df85b6116e8bef10214647a9201c400655.
* Revert "modify for size 8"
This reverts commit 3ef0904bd96291c7a3a351c702fba2905c196bcc.
* good kernel with changes in lowerer
* revert "good kernel with changes in lowerer"
This reverts commit 975e2b5a4ecfe475370e88ce9db78b2d42e4c4d4.
* good kernel for relu!
* refactor lowerer changes
* add amx context var to helper
* clean up amx flag
* improve lowerer changes readability
* improve check for amx
* revert lowerer if
* add float4 type rendering for clang
* add amx definitions
* enable indexing for clang if amx
* working amx example, wrong because of dims
* almost works for float16, need to stop using double load in amx
* cleaner render_kernel
* revert changes in simple_matmul and delete env
* add new var upcast_offset to get_optimized_ast
* change axis for axes
* invert if in rendering phi
* fix some bugs
* fix linearizer tests
* fix vec/get pat for amx
* remove clang tc if amx is disabled
* add ops_python support
* refactor into one complementary function in ops_python
* add job for EMULATE_AMX
* improve checking for AMX in UPCAST and TC extra ops
* fix lint issue
* commit before refactor into self-contained AMX
* start refactor by removing special rendering for AMX
* all ready for amx handcoded kernel
* working poc, most straightforward amx support
* avoid local opts for tc if amx
* fix merge bugs
* skip test for clang
* skip tc hand-coded opts if amx
* remove hardcoded ops_python values
* remove hardcoded sizes for amx kernel
* fix ops_python bug where dim was hard-coded
* change contract for vectorize
* working without changes in lowerer
* revert changes in gep rendering
* fix ops_python
* modify comment
* skip test if clang for different type accumulation
* move rename and bugfix to a separate PR
* fix wrong path for test
* addmm not implemented in torch for cpu
* change struct for vector; equally slow but cleaner
* revert modified test
* simplify wmma rendering
* minor change
* noqa: E501
* add length 16 for AMX
* fix vectorized half issue
* fix error
* remove comment
* change set for dedup
* split test of tensor_core_extra_ops so that cases that don't require locals run for AMX
* add amx reference
* load acc into amx registers
* fix dtype rendering and remove noqa
* moved tests change into another pr
* add real AMX job for CI and fix bug
* fix ops_python bug
* fix test class
* remove real AMX tests and fix uops_stats test
* remove wrong test
* acc folding
* hotfix: bug
* fix float4 tests for amx
* hack for fixing flops counting
* hotfix: mypy
* add flop counts test for amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_unaligned_load_amx
* nits tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Core change to gate stores in IFs
* Updates to cstyle renderer to handle IFs around STOREs
* Make uops asserts happy
* Add tests and fix newly broken tests
* make ruff happy
* make mypy happy
* Simplify renderer to have all gated stores use IF
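To illustrate what the commits above are converging on: a store predicated on a gate condition becomes a plain store nested under an IF/ENDIF pair. A minimal Python sketch of the semantics (illustrative only, not tinygrad's UOp graph):
def gated_store(buf, idx, val, gate):
  # the gate no longer lives on the STORE itself; it becomes an
  # enclosing IF (and matching ENDIF) in the rendered kernel
  if gate:
    buf[idx] = val
buf = [0, 0, 0, 0]
gated_store(buf, 2, 7, gate=True)
gated_store(buf, 3, 9, gate=False)
assert buf == [0, 0, 7, 0]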
* Revert some changes
* Make test_where_fold happy
* Revert unnecessary handling of ifs rendering; it was included before the changes were fully built out
* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE
* Re-change broken test
* Make ifs be grouped together
* get non-merged IFs working. All tests pass except grouping related ifs together
* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp
* Changes to uopgraph
* Simplify graph rewrite logic
* Changes to get test_padto_where_multireduce working
* Simplify uops.store renderer
* Make test_padto_where_multireduce pass but now other tests fail
* Clean up uopgraph from scratch work
* Ignore pseudo IF srcs when rendering
* Attempt to fix llvm tests
* rm comment
* reduce lines
* Add line to make mypy happy :(
* llvmir fix pt 1
* Mods after rebasing to master
* Fix llvmir
* Fix ptx tests
* Fix other ptx tests
* Move changes from uops.py to ops.py
* rm uops.py
* Fix TestGateStoreRewrite tests
* Get multireduce tests working
* reset to remote branch
* Fix linearizer tests
* uop_graph test patch
* Add comment to create_gate
* hotfix: uncomment those tests
* Attempt to fix ptx tests by including whitespace inside if block
* Patch from remote tinybox. Tests passing here
* Min changes to get some ptx tests passing
* Changes after rebase
* Exclude ifs and endifs from ptx
* IF conditional branching within ptx
* Save lines on delete_redundant_gates
* Simplify merge_gates
* rm noqa
* Remove unnecessary checks when merging gates
* Fix ops error msg
* Smarter check for if/endif in llvmir
* simplify delete redundant gates to only have 2 returns
* spacing
* Smarter check at beginning of merge_gates
* patches from comments
* Remove need for merge_gates
* include proper srcs in IF from the get-go
* test_expand_ifs_dumb will now result in 4 ifs, not 1
* Make tests happy
* Fix uops stats
* rm merge_gates method. Will add back in separate PR
* Spacing
* cleaner error msg
* Fix uops rendering when expanding. test_failure_43
* patch tests
* undo changes in delete_redundant_gates
* process replay attempt
* re-intro deletion of redundant gates
* fix addition of gates when they get nested in stores and loads
* patch tests
* smarter init of IF srcs when adding gate to STORE
* make ruff happy
* Resp to comment
* include all src[2]'s srcs in IF for gated store
* add reference of the storing value to the gate's src
* minor patch after rebasing
* change ptx renderer
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* math trait [run_process_replay]
* const -> const_like
* Revert "const -> const_like"
This reverts commit 85727c83d38f59e153333a3dbfa68f87b3a5a6ce.
* add MathTrait to LazyBuffer
* clean up function
* fixup the rest of function
* fix custom function
* mlb math trait
* fix that test
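A rough sketch of the math-trait idea in this block, assuming the point is to define arithmetic dunders once against a single alu() hook shared by UOp, LazyBuffer, and friends (names and ops illustrative):
class MathTrait:
  # subclasses provide alu(); the operator overloads come for free
  def alu(self, op, *src): raise NotImplementedError
  def __add__(self, x): return self.alu("ADD", x)
  def __mul__(self, x): return self.alu("MUL", x)
class Value(MathTrait):
  def __init__(self, v): self.v = v
  def alu(self, op, *src):
    if op == "ADD": return Value(self.v + src[0].v)
    if op == "MUL": return Value(self.v * src[0].v)
assert (Value(3) + Value(4)).v == 7
assert (Value(3) * Value(4)).v == 12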
* qcom: driver init
* autogen stubs for msm_kgsl; also fixup ioctls to show numbers instead of _IOW macros
* autogen: add adreno commands and registers
* ops_qcom: QcomAllocator + signals
* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom
* qcom: we do not really need all these constants; input/output is enough
* qcom: perfctr for CS (do not really need all the rest)
* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max
* qcom: explicitly set instruction len based on the shader size
* ops_qcom: Program init
- extracts shader from OpenCL binary
- sets input/output buffers
- allocates stack
- sets cs mode
- runs shader
* use data64_le from helpers
* ops_qcom: use fill_kernargs for filling i/o buffers
* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset
* new signals & fix exec
* add QCOM to the list of supported devices
* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM
* fix exec, synchronize before copyout
* correct setting num_units for ST_SHADER
* fix gpu hangs on sigs with CP_MEM_WRITE; it is uncached mem anyway
* extract offsets to kernel arguments from opencl binary
* extract constants values and offsets from opencl binary
* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly
* align kernel name to 4 bytes when skipping kernel opencl struct
* skip to consts directly using an offset from opencl binary header
* fix alloc
* get halfreg and fullreg from opencl bin
* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE
* parse prg offset from OpenCL binary
* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG
* support for vals in _fill_kernargs
* support 16-bit constants
* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts
this helps avoid faulting out when executing big kernels
/* Don't time out if the context has disabled it */
if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
return;
* minor changes to _exec
* QCOMRenderer
* disable HCQGraph for demo. TODO: support HCQ update api
* support HCQ
- remove copy queue
- add updates
- add strides for buffs and vars for QCOM
* bufs_stride
* clean ups
* linter
* call super().__init__(value) in QcomSignal
* disable=unused-import
* mypy
* type ignore when queue is on the device
* fix
* query gpu_id.
Will be useful for selecting commands, e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7.
* working timestamps
* free context after device is done
* move gpu stack to the device
* reserve some space with lib_gpu for gpu to write to
this fixes test_interpolate_bilinear
* exclude tests that fail with GPU=1 on Qualcomm
* lint
* unmap mem in _gpu_free
* context priority and preemption policy
* remove old qcom
* pass size to self.device.allocator.free
* skip tests only on qcom
* use kgsl and adreno defines instead of numeric vals
* use allocator for allocating lib_gpu
* update to QcomArgsState from master
* intermediate commit while conquering images
* enable image tests on qcom
* fix shader disasm size, dump textures stuff
* working images
* allow signals to be 0
* set branchstack from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* set shared memory size from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* update images in QcomArgsState & less loc for images
* set stack sizes from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* stack allocation based on OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* better autogen for kgsl and adreno. no more bitshifts
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* cleanup commit for parse cl lib
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* don't forget the actual generated files
* refactor + less loc
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* device.py back
* lint
* ruff
* timestamp divisor
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* fix tex fmt & round global size
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* dtypes
* 19.2MHz
* -1 loc in _update_exec
* remove noqa
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* looking into graph rewrite speed
* track, replace is slow
* if all same, no permutations [run_process_replay]
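The "if all same, no permutations" speedup above presumably cuts matcher work for commutative patterns: when every source is identical, each ordering is the same, so one attempt suffices. A hedged sketch of that idea:
from itertools import permutations
def src_orderings(srcs):
  # all sources equal -> every permutation is identical, try just one
  if all(s == srcs[0] for s in srcs): return [tuple(srcs)]
  return list(permutations(srcs))
assert len(src_orderings((1, 1, 1))) == 1
assert len(src_orderings((1, 2, 3))) == 6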
* types so compile works
* no implied comprehension
* TRACK_MATCH_STATS=2
* no UnaryOps.NEG in generated UOp patterns
removed pattern `x * (-1) -> -x` and `x != True`
* those are fine because NEG became CMPNE and True
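A quick check of the identity behind the removed patterns: for booleans, negation is exactly `x != True`, which is why CMPNE-with-True can replace NEG (illustrative only):
for x in (False, True):
  assert (not x) == (x != True)  # NEG on a bool == CMPNE with True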
* fix sd validation L2 norm
* faster rewrite, no folder in expand/reduce [run_process_replay]
* is removing the expander there okay
* parens
* don't reconstruct exact match uop
* fast do_reduce
* expand pyint
* most of the parents gains with fewer lines
* move cifar into datasets
* support for pathlib Tensors, tar_extract, and fetch gunzip
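Hypothetical usage of the features named above; the exact import paths and signatures are assumptions, not confirmed by this log:
from pathlib import Path
from tinygrad import Tensor  # assumes tinygrad is installed
p = Path("weights.bin")      # hypothetical file name
if p.exists():
  t = Tensor(p)              # per the commit: a Tensor constructed from a path
# tar_extract(...) and fetch(..., gunzip=True) are the other two helpers
# named above; their exact signatures are not shown in this log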
* too early for Device.DEFAULT
* simpler hlb_cifar + .to(None) is default
* new compiler failure, start beautiful_cifar
* beautiful cifar runs but is broken
* jit train step
* cleaner
* std_mean, not mean_std
* more correct
* fast indexing
* don't print that
* torch load broken
* add eval
* nicer bar
* decorators are the way to do this
* bounds check the load
* a few ops
* batchnorm bugfix, if track_running_stats is False, use online estimate
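A sketch of the batchnorm fix above, assuming PyTorch-style semantics: with track_running_stats=False there are no running statistics, so evaluation must use the current batch's ("online") mean and variance. Illustrative helper, not tinygrad's code:
def bn_stats(batch_mean, batch_var, running_mean, running_var,
             training, track_running_stats):
  # no running stats tracked -> always use the online batch estimate
  if training or not track_running_stats:
    return batch_mean, batch_var
  return running_mean, running_var
assert bn_stats(0.1, 1.2, 0.0, 1.0, training=False, track_running_stats=False) == (0.1, 1.2)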
* full timing
* fix fusion
* unneeded realize
* master tensor
* Fix track_running_stats in batchnorm
* Fix linter
* Update test_fold_conv_batchnorm_notrain to keep allowed at 1
* Add test_fold_conv_batchnorm_notrain_no_running_stats
* Save 1 line
* num_classes=-1
If num_classes is set to -1, the number of classes will be inferred as one greater than the largest class value in the input tensor.
* num_classes desc
comment to explain the num_classes default and what it means.
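A minimal sketch of that inference rule; `one_hot` here is an illustrative stand-in, not the library's implementation:
def one_hot(labels, num_classes=-1):
  # num_classes == -1: infer as one greater than the largest class value
  if num_classes == -1: num_classes = max(labels) + 1
  return [[1 if i == lbl else 0 for i in range(num_classes)] for lbl in labels]
assert one_hot([0, 2, 1]) == [[1, 0, 0], [0, 0, 1], [0, 1, 0]]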
* replacing ' with `
* add support for retain_graph in backward
* fix: dont accumulate grad on non-leaf tensors
* fix order
* fix: do not delete grad on leafs
* fix linter
* fix: can't exactly match torch behaviour internally
* allow numerical room for test
* refactor
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.
* fix benchmark
* remove extra dedup
* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (the only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* rewrite bool ADD to OR and MUL to AND
fixed running `tinyphysics.onnx`, which contains a getitem from a boolean tensor.
can only repro through BEAM_COMPARE, which i think is a different bug in test_linearizer_failure
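A truth-table check of the identities behind the rewrite: for booleans, ADD behaves like OR and MUL like AND (a sketch of the identity, not the rewrite rule itself):
for a in (False, True):
  for b in (False, True):
    assert bool(a + b) == (a or b)   # bool ADD -> OR
    assert bool(a * b) == (a and b)  # bool MUL -> AND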
* fold those, and fix tests
* only for bool
* move dtypes.bool
* bitcast & tests
* use to_dtype
* put disk tensor tests back
* tests
* bitmask
* no bitmask
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* test: uop and lazyop have the same compare
* typings
* self.assert_equiv_uops -> assertEqual
* hash dtype
* test nop too
* TestPatternMatcher never used this compare anyway
* nop eq and ne tests
* remove test_const_vectorize_fold
* remove const folding UPat for VECTORIZE
* refactor cstyle render_const
* remove calls to dtype.scalar() in render_const
* add assert
* add vectorized const to UOp.const
* add UPat GEP-VECTORIZE-CONST -> CONST
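A toy model of the GEP-VECTORIZE-CONST fold named above: taking element i of a vector built from constants is just the i-th constant, so the GEP/VECTORIZE pair can fold away. Tuples stand in for UOps here:
def vectorize(*xs): return tuple(xs)  # stand-in for UOps.VECTORIZE
def gep(vec, i): return vec[i]        # stand-in for UOps.GEP
assert gep(vectorize(1.0, 2.0, 3.0, 4.0), 2) == 3.0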
* render_vectorize for DEFINE_ACC in cstyle
* add back missing render_cast in render_const
* generate vectorized consts as UOps for DEFINE_ACC
* update asserts for DEFINE_ACC with VECTORIZE src
* add UPats for PHI with VECTORIZE src
* use prev rendered vectorize in DEFINE_ACC render
* update DEFINE_ACC in python runtime
* update vectorized DEFINE_ACC in PTXRenderer
* rebase DEFINE_ACC changes on lowerer
* verbose rewrite of bad UPats
* simplify UOps.CONST implementation in ops_python
* update sum_collapse UPats for DEFINE_ACC-VECTORIZE
* revert linearizer to TOT
* fix DEFINE_ACC implementation in ops_python
* simplify DEFINE_ACC in cstyle
* Fix linter error
* support VECTORIZE in fold gated load/store UPat
* support VECTORIZE in other fold gated load UPats
* rewrite VECTORIZE in UPat for no input DEFINE_ACC
* simplify DEFINE_ACC render in cstyle
* make VECTORIZE rules more concise
* add more vectorize fold tests
* inline VECTORIZE-CONSTs in cstyle render
* revert VECTORIZE/GEP rule refactor
* revert cstyle render_const refactor
* inline VECTORIZE-CONSTs in cstyle render
* implicitly vectorized const rendering -> explicit
* WMMA VECTORIZE CONST process replay hacks
* VECTORIZE CONST NAN process_replay hacks
* more VECTORIZE CONST NAN hacks
* cleanup process_replay hacks
* isnan() -> not isfinite() cstyle VECTORIZE CONST
* tweak isnan and isfinite checks VECTORIZE CONST
* tweak for positive vs negative infinity VECTORIZE CONST
* add assert to PTX CONST render
* process_replay VECTORIZE CONST render parity for PTX STORE
* vmin/vmax for VECTORIZE'd CONST
* update WMMA folding rules
* add tests for WMMA VECTORIZE fold
* hack for cstyle half4 CONST zero process_replay parity
* revert PTX backend changes
* add back minimal DEFINE_ACC PTX change
* remove cstyle process_replay hacks
* remove dead code in PTX CONST render
* cleanup vmin/vmax logic for VECTORIZE'd CONSTs
* update vectorize fold tests to use DEFINE_VAR
* fix long line formatting in test
* remove unwanted merge artifact
* more vmin/vmax cleanup
* remove unnecessary asserts
* yet more vmin/vmax cleanup
* get rid of explicit VECTORIZE CONST logic in _min_max
* reuse CONST instead of creating a new one
* remove unneeded cast
* handle DType correctly in sconst
* improve readability of tests
* save a line
* save another line
* tuplize pats in src
* remove GEP-VECTORIZE pats
* add vec +0 fold
* HACK: fold only vec8 +0
* remove vectorized ALU fold hack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* add toposort key to LBScheduleItem
* use dedup
* graph LBScheduleItem
* make that comment beautiful again
* diff_schedule utils
* update fuzz_schedule
* divide by gcd in UOp div folding
`(6x+6y)//16 -> (3x+3y)//8` etc
simpler version
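A quick numeric check of that folding: gcd(6, 16) == 2, so the coefficients and the divisor can all be halved without changing the floor division (checked on non-negative values; the follow-up commits restrict when the rule applies):
from math import gcd
assert gcd(6, 16) == 2
for x in range(64):
  for y in range(64):
    assert (6*x + 6*y) // 16 == (3*x + 3*y) // 8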
* only factor out const
* don't apply for unsigned
* don't need that if
* space