Commit Graph

3961 Commits

chenyu 75d4344cda
UOps.BITCAST (#3747)
* UOps.BITCAST

implicitly fixed the missing const folding for bitcast

* python backend

* ptx

* consistent llvm
2024-03-14 21:00:35 -04:00
chenyu 9a00a453c7
add test case for uop cast constant fold (#3746)
and an expected-failure bitcast fold test case. Will fix with the UOps.BITCAST refactor
2024-03-14 20:00:27 -04:00
chenyu 11c61ae044
Revert "fix const bitcast should not be constant folded (#3743)" (#3744)
This reverts commit 38ba277ac8.
2024-03-14 19:24:05 -04:00
George Hotz d52d0b0efb test_assign_kv_cache 2024-03-14 16:17:20 -07:00
chenyu 38ba277ac8
fix const bitcast should not be constant folded (#3743)
* fix const bitcast should not be constant folded

* fixed const bf16 creation

* LLVM still broken
2024-03-14 19:13:52 -04:00
chenyu 557c7a5c54
fix yolov8.py (#3742)
replaced an `assign` with `replace`, and added '.png' to the output filename if the input URL does not contain an extension (see the sketch after this entry)
2024-03-14 17:33:45 -04:00
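
A minimal sketch of the output-naming behavior described in the entry above; the helper name and layout here are assumptions for illustration, not the actual yolov8.py code.

```python
import os

def output_path(input_url: str) -> str:
  # hypothetical helper: keep the URL's extension if it has one,
  # otherwise fall back to '.png' for the annotated output image
  base = os.path.basename(input_url.split("?")[0])
  root, ext = os.path.splitext(base)
  return os.path.join("./outputs_yolov8", root + (ext if ext else ".png"))

print(output_path("https://example.com/bus"))      # ./outputs_yolov8/bus.png
print(output_path("https://example.com/bus.jpg"))  # ./outputs_yolov8/bus.jpg
```
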
George Hotz 5b3d8a886e
split tinybox benchmark into two (#3741)
* split tinybox benchmark into two

* symlinks
2024-03-14 14:12:32 -07:00
George Hotz 3527c5a9d2
add Tensor.replace (#3738)
* add Tensor.replace

* fix dtypes in that test

* should be replace

* and mixtral
2024-03-14 13:34:14 -07:00
chenyu 0ead0bdb65
script to benchmark beam v hcopt (#3737)
the goal is that a big enough beam should be faster than hcopt/tc

also this failed on tc opt
NUM=2 FILTER_REDUCE=1 TEST_N=20 BEAM=4 DEBUG=2 python test/external/speed_beam_v_hcopt.py
2024-03-14 15:04:03 -04:00
chenyu 90e55a9fd1
fix buf_index not found case in _apply_tc_opt (#3739)
Previously raised a ValueError if src.src[0] is not a LOAD. Replaced with returning None in _apply_tc_opt, and added a test to make sure the net result is a KernelOptError (see the sketch after this entry).
2024-03-14 14:27:05 -04:00
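
A hedged sketch of the error-handling change described in the entry above (names simplified, not the actual kernel code): the tensor-core helper returns None when the expected LOAD source is missing, and the caller turns that into a KernelOptError.

```python
from typing import List, Optional

class KernelOptError(Exception): pass

def _apply_tc_opt_sketch(srcs: List[str]) -> Optional[int]:
  # return None instead of raising ValueError when no LOAD source is found
  return next((i for i, s in enumerate(srcs) if s == "LOAD"), None)

def apply_opt_sketch(srcs: List[str]) -> int:
  buf_index = _apply_tc_opt_sketch(srcs)
  if buf_index is None: raise KernelOptError("tensor cores not applicable to this ast")
  return buf_index

print(apply_opt_sketch(["CONST", "LOAD"]))  # 1
```
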
nimlgen 6bf11a2ce3
fix incorrect direct store with gep (#3735)
* fix incorrect direct store with gep

* better comment

* phi as well

* dtype check there

* mypy happy?

* not used

* renames

* phi in phi
2024-03-14 20:58:50 +03:00
P4ssenger bbad3b1dd9
call self.nbytes (#3736) 2024-03-14 08:10:12 -07:00
qazal 00c56db1a4
Fix JITItem count assert for HSAGraph (#3734)
* exclude HSA graph

* cant import HSAGraph directly
2024-03-14 14:12:35 +03:00
nimlgen 4b01c44579
hotfix: sdma/aql are visible again (#3733) 2024-03-14 10:33:22 +03:00
qazal 43953c0ba9
skip grouped store for unmatching upcasts (#3723)
* skip if upcasts don't match

* outputs match now

* this ast is hardcoded

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-14 01:18:31 -04:00
David Hou 199f7c4342
MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

less samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf35ca4be6c31facafccdd1e177469f2f.

Revert "log layer stats"

This reverts commit 9d9822458524c514939adeee34b88356cd191cb0.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57dc5fa70169189fd33b24dbbee99c40, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... it hits the acc... it's fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't use unsyncedbatchnorm if syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to benchmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
nimlgen 0f050b1028
hsa profiler (#3711)
* hsa profiler

* simpler

* profile

* copy -> is_copy

* print when saved

* faster

* do not create structs

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-13 21:19:22 -07:00
George Hotz 56b914fc8c hotfix: test_assign_contiguous 2024-03-13 17:49:54 -07:00
chenyu 4d6ec41adb
failed test cases for bf16 Tensor.full (#3729)
fixable with a float const then a cast to bf16 (see the sketch after this entry). cast folding with bitcast is incorrectly skipped
2024-03-13 20:46:45 -04:00
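
A hedged sketch of the workaround mentioned in the entry above, assuming the usual tinygrad Tensor/dtypes API: build the constant in float32 first, then cast to bfloat16, rather than creating the bf16 constant directly.

```python
from tinygrad import Tensor, dtypes

# workaround: float32 const first, then cast to bf16
x = Tensor.full((4, 4), 3.0, dtype=dtypes.float32).cast(dtypes.bfloat16)
# cast back to float32 before .numpy(), since numpy has no native bfloat16
print(x.cast(dtypes.float32).numpy())
```
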
George Hotz 838afbc351
assign tests (#3728) 2024-03-13 17:04:55 -07:00
chenyu 3d9b882d37
hotfix unlink /dev/shm/resnet_X if it already exists (#3726) 2024-03-13 18:53:03 -04:00
chenyu 6793db169b
bfloat16 tensor creation from list and numpy (#3724) 2024-03-13 18:44:05 -04:00
chenyu f30fb192b7
resnet eval on tinybox ci (#3714) 2024-03-13 13:26:30 -04:00
George Hotz f1dd8928c9
where fold prereqs (#3718) 2024-03-13 10:01:43 -07:00
George Hotz 27a48dd40c
change default from HIP to HSA (#3717) 2024-03-13 09:42:42 -07:00
chenyu ad1d873f8d
fix llama shard convo mode (#3716) 2024-03-13 12:07:02 -04:00
qazal 337cd53444
multioutput ScheduleItem (#3699)
* refactor realize.py

* update docs

* update test_sched

* update runners and devices

* update openpilot and unit tests

* cleanup runner lowering

* update more tests
2024-03-13 08:59:38 -07:00
nimlgen 08064a0e29
add SEED env to fuzz_linearizer (#3713)
* add SEED env to test/external/fuzz_linearizer.py

* found some

* more platforms
2024-03-13 18:08:42 +03:00
David Hou 2befdf86d9
dataloader worker/shm cleanup (#3710) 2024-03-12 21:44:24 -04:00
chenyu e1b2a82d89
fix st.real_size can be negative if valid is always false (#3708)
two followups after this: (1) if a buffer is never accessed in a kernel, it can be removed from the inputs; (2) real_size can be smaller conditional on valid being true (the old validhack stuff)
2024-03-12 20:34:07 -04:00
chenyu b13457e4a7
explicit dtypes in hlb_cifar (#3707)
prepared the bfloat16 change. added float() and cast(default_float) in whitening, and explicitly set dtype in various places that convert between numpy and Tensor (see the sketch after this entry)
2024-03-12 18:20:23 -04:00
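
An illustrative (not actual hlb_cifar) example of the kind of explicit-dtype handling described in the entry above: state dtypes at every numpy-to-Tensor boundary instead of relying on defaults.

```python
import numpy as np
from tinygrad import Tensor, dtypes

data = np.random.rand(8, 3, 32, 32)                             # numpy defaults to float64
x = Tensor(data.astype(np.float32)).cast(dtypes.default_float)  # explicit on the way in
mean = x.float().mean(axis=(0, 2, 3)).numpy()                   # explicit float32 on the way out
print(mean.dtype, mean.shape)
```
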
Francis Lam b6e2495fdd
kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
George Hotz 2024b24f35
add some graph tests (#3702)
* add some graph tests

* PatternMatcher class

* speedup

* const cast test

* fix tests

* itertools chain
2024-03-12 09:49:47 -07:00
chenyu f599c6e7f4
test output dtypes match in test_ops (#3703)
need to cast some torch outputs to int32 because torch by default returns int64 for index-related functions (see the sketch after this entry)

close #2797
2024-03-12 12:44:40 -04:00
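
A hedged illustration of the dtype mismatch described in the entry above: index-related torch functions such as argmax return int64, so a comparison against tinygrad's int32 output needs an explicit cast (the actual test_ops helper is not shown).

```python
import numpy as np
import torch
from tinygrad import Tensor

a = np.random.rand(4, 5).astype(np.float32)
tiny_out = Tensor(a).argmax(axis=1).numpy()     # int32 in tinygrad
torch_out = torch.from_numpy(a).argmax(dim=1)   # int64 in torch
np.testing.assert_equal(tiny_out, torch_out.to(torch.int32).numpy())
print(tiny_out.dtype, torch_out.dtype)
```
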
nimlgen 798970cfad
fix gpu hangs when exiting while aql queues are executing (#3700) 2024-03-12 19:23:23 +03:00
chenyu 02ca067bdf
use default_float.np to construct test data in test_ops (#3701)
first step of #2797
2024-03-12 11:58:20 -04:00
George Hotz 6755a9254f
constant fold pattern match (#3696)
* constant fold pattern match

* match

* better match

* fix bug in pattern

* more folding
2024-03-12 08:48:07 -07:00
nimlgen dd1a1c12df
rocm path in autogen (#3697) 2024-03-12 14:06:43 +03:00
Patrick Tsai 971d7f5d7c
O(n) arange attempt (#3530)
* It works?

* Clamp correctly

* Refactor

* Make code better

* Undo some stuff

* First step to trying to make floats work

* Floats work in Python op but not metal because int div is different

Python integer division was implemented as // which rounds towards
negative infinity, but C integer division rounds towards 0, so there
is an off-by-one division error (see the sketch at the end of this entry)

* arange does cumsum with ints and then multiplies by step

This is so loop optimization can remain int only

* Undo a lot of symbolic changes

* Final check

* Cleanup

* There can be multiple phis

* Fix multiple phi op removal

* const sets dtype correctly

* Fix bugs

* Fix a couple bugs and add loop vars to resolve

* missed one

* Don't trim too many ops

* Fix symbolic test

* Use ones instead of full

* Delete test

* Lint passes

* max node error

* Small updates to loop logic

* Remove unnecessary changes

* We are getting somewhere

* Simple case

* Fix

* rm, prn

* Better

* If NumNode doesn't work then continue

* clamp is needed for arange(256)

* Move everything into the optim fn

* Replace correctly

* Order optimizations better

* Delete

* mypy

* Test for simplification

* Rename

* Fix test

* update test description

* Undo more

* Cleanup

* No replaced_ops map

* Fix lint

* AssertionError

* back again

* Reinstate assertion

* Return true and make diff not as big

* Bigger range for test

* Change cumsum impl

* fix bug

* make big cumsum work

* lint

* Undo cumsum 2-stage removal

* No while helper

* optional min/max clamping

* floats work

* rm giant arange test

* fix python cast None

* Check phi parents

* one phi allowed per where

* Fix one phi per where

* Rework iteration

* Delete assertions

* convert to int

* Try mul -1 instead of neg for hip..?

* Remove one phi per where requirements

* one accum only

* Lint

* should simplify a loop at a time

* Don't get rid of loop explicitly

* Need to iterate backwards

* lint

* unary neg

* Make optim work for onnx and sum_pad_collapse

* Better message

* filter alu ops correctly

* Fix the limiter

* lint and simplify

* Add it back

* off by one error

* test wheres and phis

* test max ops and non-if stuff

* <=

* cast_scalar

* Oops

* Change test

* Pass loop uops instead of a modified map

* Cut param transfer between linearizer and uops

* Fix issues

* Fix lint

* fix efficientnet python 3.8 invalid syntax

* distinct vars in seen_vars

* accurate var names

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-11 16:09:20 -07:00
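
A standalone illustration of two points from the entry above, not the linearizer code itself: Python's // rounds towards negative infinity while C-style division truncates towards zero, so negative operands differ by one; and arange can be built as an integer cumsum that is multiplied by step only at the end, keeping the loop int-only.

```python
import math

# Python floor division vs C-style truncating division
a, b = -7, 2
print(a // b)       # -4 (rounds towards negative infinity)
print(int(a / b))   # -3 (truncates towards 0, like C)

# arange(start, stop, step) as an int-only cumsum scaled by step at the end
def arange_cumsum(start, stop, step):
  n = math.ceil((stop - start) / step)
  acc, out = 0, []
  for _ in range(n):
    out.append(start + acc * step)  # float math only here, outside the accumulation
    acc += 1                        # the loop itself stays integer-only
  return out

print(arange_cumsum(0.0, 1.0, 0.25))  # [0.0, 0.25, 0.5, 0.75]
```
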
George Hotz a5d023dff8
reciprocal mlop (#3694) 2024-03-11 16:08:46 -07:00
George Hotz 3af1c1051a
Revert "bring reciprocal back (#3687)" (#3692)
This reverts commit bcf6fbd3b2.
2024-03-11 15:55:14 -07:00
George Hotz ef44c8959b
Revert "rewrite recip to div (#3690)" (#3691)
This reverts commit 2b089bfd18.
2024-03-11 15:54:58 -07:00
George Hotz 2b089bfd18
rewrite recip to div (#3690)
* rewrite recip to div

* fix bug in uops add
2024-03-11 15:52:24 -07:00
qazal aec4c4f01b
linearizer ast as a tuple of lazyops (#3689)
* multi store op linearizer

* currently we do only one output per kernel

* named opts
2024-03-11 15:39:04 -07:00
chenyu d0bcc9a66b
replace all `if dim < 0: dim += self.ndim` with _resolve_dim (#3688) 2024-03-11 17:33:36 -04:00
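
A hedged sketch of what a _resolve_dim-style helper looks like (the actual tinygrad signature may differ): normalize a possibly-negative dim once, with bounds checking, instead of repeating `if dim < 0: dim += self.ndim` at every call site.

```python
def _resolve_dim(dim: int, ndim: int) -> int:
  # map negative dims into [0, ndim) and reject out-of-range values
  if not -max(1, ndim) <= dim < max(1, ndim):
    raise IndexError(f"dim {dim} out of range for a {ndim}-dimensional tensor")
  return dim + ndim if dim < 0 else dim

assert _resolve_dim(-1, 4) == 3
assert _resolve_dim(2, 4) == 2
```
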
George Hotz bcf6fbd3b2
bring reciprocal back (#3687)
* bring reciprocal back

* better

* explicit dtype for recip

* llvm tighter

* sigmoid can use RECIP
2024-03-11 14:19:54 -07:00
Francis Lam 9f13960f72
search: catch RuntimeError when timing acted_lins (#3664)
when compilation succeeds but runtime fails due to thread limits
on METAL, this allows the beam search to proceed, treating it
the same way as a compile failure (see the sketch after this entry).
2024-03-11 16:14:03 -04:00
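
A hedged sketch of the behavior described in the entry above (not the actual search.py code): if timing a candidate kernel raises a RuntimeError, score it as infinitely slow, the same way a compile failure is handled, so the beam search keeps going.

```python
import math

def time_candidate(run_fn, candidate) -> float:
  # a runtime failure (e.g. METAL thread limits) is scored like a compile failure
  try:
    return run_fn(candidate)
  except RuntimeError:
    return math.inf

def fake_run(c):  # stand-in for compiling and timing an optimized kernel
  if c == "too_many_threads": raise RuntimeError("thread limit exceeded")
  return {"baseline": 2.0, "opt_a": 1.5}[c]

candidates = ["baseline", "opt_a", "too_many_threads"]
print(min(candidates, key=lambda c: time_candidate(fake_run, c)))  # opt_a
```
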
rnxyfvls 490c5a3ec3
examples/stable_diffusion: support model checkpoints without alphas_cumprod key (#3681)
* examples/stable_diffusion: support model checkpoints without alphas_cumprod key

(which is most models on civitai)

* fix indent

---------

Co-authored-by: a <a@a.aa>
2024-03-11 16:05:52 -04:00
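
A hedged sketch of how a missing alphas_cumprod key can be recomputed from the beta schedule; the values below are the commonly used Stable Diffusion v1 defaults, assumed here for illustration rather than taken from the commit.

```python
import numpy as np

# "scaled linear" beta schedule commonly used by Stable Diffusion v1
n_steps, beta_start, beta_end = 1000, 0.00085, 0.0120
betas = np.linspace(beta_start**0.5, beta_end**0.5, n_steps) ** 2
alphas_cumprod = np.cumprod(1.0 - betas)  # what the missing checkpoint key would contain
print(alphas_cumprod.shape, alphas_cumprod[0], alphas_cumprod[-1])
```
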
Francis Lam 3219a527d6
search: add a tool that beam searches one or more kernels (#3685) 2024-03-11 16:02:17 -04:00
chenyu b68fbd7d81
View.__add__ to merge_view (#3686)
verified that the cases which used real_strides are redundant
2024-03-11 15:52:34 -04:00