tinygrad

Author	SHA1	Message	Date
qazal	28c75bf2a6	merge uops with ops (#6111) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-08-16 18:17:57 -04:00
qazal	c23d44c779	AST is UOp (#6030) * most of the work from the uops2 branch * schedule * realize * kernel * lowerer * search * green * merge uops with ops * Revert "merge uops with ops" This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc. * fix benchmark * remove extra dedup	2024-08-16 22:09:00 +03:00
George Hotz	fa7e734b49	MetaOps.KERNEL (#5543)	2024-07-17 19:41:23 -07:00
George Hotz	6707c778d0	scheduleitem is not Tuple [run_process_replay] (#5425) * scheduleitem is not Tuple [run_process_replay] * fix tests * fix op + fuzzers * fix mop test	2024-07-12 15:13:19 -07:00
chenyu	2396ab9b33	more transcend cleanup [run_process_replay] (#5369) fix test name, less # noqa: E501 and removed the cast	2024-07-10 23:05:03 -04:00
George Hotz	0215c952c5	Move transcendental to UOp level (#5367) * move uopgraph to file [run_process_replay] * transcendental uops * tests pass * no skip --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-07-10 19:06:25 -07:00
hikettei	320e7ed935	Approximations for SIN/LOG2/EXP2 passing all tests. (#5187) * [WIP] Added an approximated implementation of Sin(FP32, FP64) passing all tests on Clang runtime * Map nan/-inf/inf as 1.0 in order to avoid doing as_const(math.inf) * [WIP] Added a support for LLVM IR * cleaned up the code for the mypy and linter * [WIP] Updated fp64 supports (bitwise shift causes the compilation error), fixed linter issue. * [Add] added fast=true mode which disables the payne-hanek reduction which is slow * [Fix] fails to compute elements when shape includes zero * [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly * [wip] update the assembly for ptx * Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required). * [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64) * [Fix] Cyclic dependencies existing in xlog2 * [Fix] Cycle dependency in the graph of exp2, and log2. (passing test_symbolic_ops.py) * [Fix] keep using higher precision for exp2, but cycle graph issue remained to be fixed... * [Refactor] removed is_metal option. xsin does not rely on fp64 when fp32 mode. * [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored) * [WIP] Added fp16 exp2 implementation * [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation. * stashed the changes for FP16 sin * [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower) * [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al. * [Refactor] Added the function polyN to clean-up N-terms polynomial approximation. * [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2 * [Patch] added bitcast_forward option * [Patch] resolved cycle graph * patch fix cycle graph * set bitcast_forward=True in ilogb2k * bitcast_forward for multi.py * E501 * Break into multiple small PRs * [Patch] FP16 -> FP64 upcast is not anymore required since xlog2 use quad precision polyN * [Patch] NV still required FP64 for xlog2 * updated schedule test * updated the count of kernels * [Update] Removed all bitwise ops (SHL/SHR), tweaked the nan manipulation of log2, passing all tests except for AMD. * Bitcast: make them api-compatible * [update] force to use bitcast * updated the count of constant folding * [Patch] Creating a mask for exp2 using x <= Inf satisfies True as long as x is a real value * [Update] isNaN(x) Free log2 algorithm, passing PTX tests, METAL with fastmath enabled is able to handle nan well, amd backend will not crash. * xsin is reluctant to call payne_hanek_reduction which is slow to compile, passing stable diffusion compilation in a realistic time * some minor simplification to payne hanek reduction * [refactor] refactored some rebundant parts existing in payne hanek * [refactor] more readable payne hanek impl * [refactor] improved the code consistency of payne hanek * [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.) * Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)" This reverts commit 0eee08b87c9e46da8aec0a8edec5316634031a49. * use allow_buffer_view * lets support multilazytensor * updated the count of kernels * [test] added the jit tests for approx ops * keep failed constant folding tests tested, added expectedFailure * explict the timeout deadline when testing approx jit timeout * [WIP] Simplified the implementation of xsin, never timeouts * [Refactor] Improved the consistency of approx sin implementation, passing time out tests * integrated xexp2_base into xexp2 * Set switch_over=39800.0 * delete: is_buffer_fastmath_supported * sin: compute against abs(x) * some cleanups * fix typo * removed the space between param and dtype * allow 514 kernels on CI for sd * [refactor] no need to upcast ad ldexp3k * [refactor] added some comments, references to help understanding the code. * [Fix] 1.0 ULP Sine Approximation for FP16 * [update] assume e != 0 * use pow2if instead of ldexp3k to fuse payne_hanek reduction into one * check if approximated sin/log2/exp are fused into one * clean up changes * test amd exp * some code cleanup and test sigmoid * fix: enabled payne_hanek for fp16 to achieve higher acc * fix: payne_hanek always accumlates the value with uint64, and fp16 sin is fused to a single kernel * [Refactor] Rename: fastmath -> transcendental * [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py * updated const folding tests * TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al. * Add: unittest.main() * Import TRANSCENDENTAL instead of getenv * Refactor: Added dtype check when TRANSCENDENTAL=2, more context var * Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com>	2024-07-10 16:44:58 -07:00
qazal	981afb114f	safely fold NEG in lazy.py (#5135) * safe * add test	2024-06-24 19:40:37 -04:00
chenyu	36a1f38049	lazy folding: mul -1 is neg, and neg neg is noop (#4472)	2024-05-08 01:52:22 -04:00
chenyu	c508eb7425	revert the removal of CAST_BEFORE_VIEW (#4471) this brings most of the memory gain for resnet back.	2024-05-08 00:14:29 -04:00
chenyu	f363f39e83	fix dtype of const folded sum (#4349) const folding sum should return in the same dtype the same as regular sum, which can be different from input dtype	2024-04-29 11:40:45 -04:00
George Hotz	ba7314c26b	cleanup lbs (#4163)	2024-04-12 22:32:16 -07:00
chenyu	a7c6864260	remove CAST_BEFORE_VIEW (#4152) * remove CAST_BEFORE_VIEW testing perf, also this might have issue with assign? * remove all	2024-04-13 01:05:08 -04:00
geohotstan	1a1dd1c1a7	add and enable tests for indexing const folding (#4068) * enable test in test_indexing * added tests * rename stuff * del a test case cuz it's loadops.copy	2024-04-04 10:46:28 -04:00
chenyu	406cb5fd90	const fold ReduceOps (#4059)	2024-04-03 14:39:28 -04:00
chenyu	fe03725b21	const fold cast unrealized_unpadded_const (#4047) * const fold unrealized_unpadded_const changed the underlying arg directly * CAST_BEFORE_VIEW folds some * fix const index in getitem	2024-04-03 12:31:24 -04:00
chenyu	f61ed869f5	Use exec_alu for lazy const folding (#4039)	2024-04-02 20:52:05 -04:00
chenyu	85edc493b0	uops const fold rules to prevent tautological compare warnings (#4041) * uops const fold rules to prevent tautological compare warnings `bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false * not true for nan * and nan does not work with llvm * full truth table test * revert a==a * comments and indents	2024-04-02 16:45:58 -04:00
chenyu	82440d3416	don't call contiguous for unpadded const into multi tensor (#4032) * don't call contiguous for unpadded const into multi tensor fixed multi const folding for sharded const. still wip, need to be careful that this does not break multi device cache somewhere * ehh need a memory test for that * simple sharded memory test	2024-04-01 19:22:14 -04:00
chenyu	77a68fc52f	test examples for multi tensor const folding (#4031) works with literal const operand now because it's copied to each shard and handled by lazy. does not work for sharded const	2024-04-01 16:53:43 -04:00
chenyu	379d52548d	const fold left const operand for ADD and MUL (#4029) * const fold left const operand for ADD and MUL * neg have dtype issue	2024-04-01 15:09:04 -04:00
chenyu	0e02d074bd	fix Tensor.pow folding for exponent 0 and 1 (#4025)	2024-03-31 19:57:23 -04:00
chenyu	d3f27761b0	move const folding of ADD/SUB/MUL from tensor to lazy (#4020) * move const folding of ADD/SUB/MUL from tensor to lazy will do div and pow separately. * fix onnx adding with None	2024-03-31 16:35:36 -04:00
chenyu	7f859593b8	fix _to_const_val and const folding around it (#4017) * fix _to_const_val and const folding around it is_unrealized_contiguous_const is too strict and almost never hit if const is expanded. suffice to check if there's no pad * that test is folded * test_const_folding	2024-03-31 13:09:23 -04:00

24 Commits