Commit Graph

800 Commits

wozeparrot dc2617bffd
feat: use more correct reg for local dims (#6048) 2024-08-12 11:15:37 -07:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
wozeparrot d269bc95fa
faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
George Hotz bc55c8a30e
pmatmul example + GB/s bugfix [run_process_replay] (#5974)
* pmatmul example + bugfix

* improve pmatmul

* Update real_pmatmul.py
2024-08-07 22:32:11 -07:00
George Hotz bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
wozeparrot 5808e8a30f
mockgpu remu changes (#5925) 2024-08-05 19:26:58 -07:00
wozeparrot 6740a0a6a0
hip_ioctl changes (#5917) 2024-08-05 11:58:38 -07:00
chenyu 996ff0c135
pow(2) -> square in RMSNorm [run_process_replay] (#5901)
reads nicer in metadata
2024-08-04 14:21:31 -04:00
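A quick sketch of the equivalence behind this change, assuming the public tinygrad Tensor API:

```python
from tinygrad import Tensor

x = Tensor.randn(4, 8)
# square() computes the same values as pow(2), but it lowers to a plain
# x*x multiply, so the generated kernel metadata reads nicer.
assert (x.square() - x.pow(2)).abs().max().item() < 1e-6
```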
Elias Wahl 4a114756f6
New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
nimlgen 34168a64e3
optimize nv profiler (#5856)
* nv profiler fix

* cleanup hcq a bit

* fixes

* fix

* typo

* all signals put timestamp

* a bit cleaner

* merge fields

* type

* import

* tiny fix
2024-08-01 23:57:45 +03:00
Vyacheslav Pachkov 610e454132
fix opencl_ioctl on comma (#5814)
- remove unused code
- add CP_REG_TO_MEM opcode
- fix parse_cmd_buf for more than one command object by correcting an offset
- fix memory mappings for memory allocated with KGSL_MEMFLAGS_USE_CPU_MAP.
KGSL_MEMFLAGS_USE_CPU_MAP: if set on call and return, the returned GPU
address will be 0; calling mmap() then sets the GPU address.
So there are no IOCTL_KGSL_GPUOBJ_INFO ioctls for that type of memory,
which resulted in a crash right after get_mem.
2024-07-30 20:44:06 -07:00
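A hypothetical sketch of the mapping fix described above (hook names and the flag value are illustrative, not the real opencl_ioctl code):

```python
# Hypothetical sketch: with KGSL_MEMFLAGS_USE_CPU_MAP the alloc ioctl returns
# gpuaddr == 0, so the GPU address must be recorded at mmap() time instead.
KGSL_MEMFLAGS_USE_CPU_MAP = 1 << 28  # illustrative value

pending_cpu_map_ids: set[int] = set()
mappings: dict[int, int] = {}  # gpuaddr -> size

def on_gpuobj_alloc(obj_id: int, flags: int, gpuaddr: int, size: int):
  if flags & KGSL_MEMFLAGS_USE_CPU_MAP:
    pending_cpu_map_ids.add(obj_id)  # no usable gpuaddr yet, and no GPUOBJ_INFO will come
  else:
    mappings[gpuaddr] = size

def on_mmap(obj_id: int, addr: int, size: int):
  if obj_id in pending_cpu_map_ids:
    mappings[addr] = size  # mmap() fixes the GPU address for CPU_MAP allocations

def get_mem(addr: int) -> int:
  # previously crashed for CPU_MAP memory because it was never registered
  return next(size for base, size in mappings.items() if base <= addr < base + size)
```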
David Hou 9a485f36e4
shard kvcache (#5830) 2024-07-30 20:29:54 -07:00
George Hotz 4e89d45513 hotfix: put contiguous back in llama 2024-07-30 18:43:48 -07:00
George Hotz 21c5e8e1b7
extreme llama speed, 57.34 tok/s (#5827)
* extreme llama speed

* mergable
2024-07-30 18:32:09 -07:00
George Hotz e6879035a0
work to make GEMV fast (#5824)
* work to make GEMV fast

* half8 cast

* align struct

* fix amd

* float8 is a later problem
2024-07-30 17:41:40 -07:00
Francis Lata ce61be16f1
clean up how preprocessed folder is defined (#5813) 2024-07-30 12:35:26 -04:00
chenyu 471b188d79
fix mypy errors in latest mypy (#5794)
* fix mypy errors in latest mypy

mypy has stricter partial and api arg checks now

* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
nimlgen ea27ec4cd0
nv switch classlist_v2 to classlist (#5763)
* nv switch classlist_v2 to classlist

* support in mockgpu

* fix mockgpu
2024-07-28 20:24:42 +03:00
chenyu 3686b6726a
move GraphException to jit.py (#5744)
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
George Hotz 489a5b99a5 hotfix: triton_nv_matmul touchups 2024-07-24 23:24:29 +00:00
George Hotz bf24be4c8c triton gets 163 TFLOPS on 4090 2024-07-24 18:32:29 +00:00
George Hotz 4d47968580
fix acc folding for NV tensor cores (#5658)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
nimlgen 08a9c0ae5e
hcq cache invalidation for beam (#5630)
* nv full cache invalidation

* the same command on amd

* linter

* fix amd

* nv no hardcoded consts

* beam default
2024-07-22 18:13:17 +03:00
George Hotz 6c6d74d922
parallel mcts (#5626)
* start work on parallel mcts

* compile was linearizing twice

* typing + more early stopping

* fix compiler error
2024-07-21 14:53:23 -07:00
George Hotz ef179087a4
mcts exit condition wasn't right, also use it with BEAM>=100 (#5619)
* mcts exit condition wasn't right, also use it with BEAM>=100

* mcts touchups

* clean up sample
2024-07-21 10:16:47 -07:00
George Hotz 0f67ef4674
mcts graph and dedup support (#5618)
* mcts graph and dedup support

* usable graph

* mcts colors

* C=4 seems better

* C=3 even better

* sample_tree

* backprop is external function

* late expand to match algo
2024-07-20 23:29:14 -07:00
chenyu eddc5bcfd7
MCTS tweaks (#5616)
MCTS 500 is competitive with BEAM=8 on resnet on M1 Max.
- increment trial times even with compiled error and runtime error.
- use best time of children as the node value.
2024-07-20 19:45:59 -07:00
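A minimal sketch of the node-value tweak (illustrative, not the actual tinygrad search code):

```python
import math

class Node:
  def __init__(self, parent=None):
    self.parent, self.best_t, self.n = parent, math.inf, 0

def backprop(node, tm: float):
  # tm is math.inf for compile or runtime errors: the trial still counts,
  # but a failure can never become the node's value.
  while node is not None:
    node.n += 1
    node.best_t = min(node.best_t, tm)  # node value = best time seen in its subtree
    node = node.parent
```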
George Hotz 1113e47f96 print best in MCTS + light up the winner in hcopt 2024-07-20 09:39:36 -07:00
George Hotz ac99ecd94e
use statistics.median for timing (#5606) 2024-07-20 08:37:32 -07:00
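For context, a minimal version of median-based timing (illustrative, not the repo's exact helper):

```python
import statistics, time

def timing(fn, runs=10) -> float:
  ts = []
  for _ in range(runs):
    st = time.perf_counter()
    fn()
    ts.append(time.perf_counter() - st)
  return statistics.median(ts)  # robust to one-off scheduler hiccups, unlike the mean
```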
George Hotz 06e336bccb
mcts search (#5598)
* mcts search

* mcts cleanups

* mcts cleanup

* random shuffle children order

* mcts in handcode_opt

* src and remove_node

* debug 3 to print ast

* print the type

* mcts in extra
2024-07-19 21:38:39 -07:00
Tobias Fischer 72da3fe7e6
added clip vision model (#5595)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-19 18:35:51 -04:00
George Hotz fa7e734b49
MetaOps.KERNEL (#5543) 2024-07-17 19:41:23 -07:00
Francis Lam 2d53abb04a
test/external/fuzz_linearizer: fix for new AST changes (#5519)
* test/external/fuzz_linearizer: fix for new AST changes

also add beautiful_mnist failures

* add CLANG and LLVM to test_failure_35 failed_platforms

* fix test_linearizer_failure names
2024-07-17 00:08:07 -04:00
Tobias Fischer 85d4ca7caa
FID Inception Model (#5516)
* added model impl

* minor cleanups

* extracted weights loading into from_pretrained

* reorganized model for better weight loading

* removed lru cache for state dict loading
2024-07-16 23:12:03 -04:00
chenyu 28972418c4
s/get_linearizer/get_kernel [run_process_replay] (#5467) 2024-07-13 20:32:22 -04:00
George Hotz 03c2dc8bd7
lowerer is kernel [run_process_replay] (#5437) 2024-07-12 18:50:55 -07:00
chenyu 00813a92a0
update Tensor.eye api to match torch (#5433)
* update Tensor.eye api to match torch

input is n for nrows and optional m for ncols

* space

* fix onnx
2024-07-12 20:25:12 -04:00
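Usage sketch of the updated API:

```python
from tinygrad import Tensor

print(Tensor.eye(3).numpy())     # 3x3 identity
print(Tensor.eye(2, 3).numpy())  # 2 rows, 3 cols, matching torch.eye(n, m)
```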
George Hotz 870dc8c350
s/Linearizer/Lowerer [run_process_replay] (#5428) 2024-07-12 15:54:07 -07:00
George Hotz 6707c778d0
scheduleitem is not Tuple [run_process_replay] (#5425)
* scheduleitem is not Tuple [run_process_replay]

* fix tests

* fix op + fuzzers

* fix mop test
2024-07-12 15:13:19 -07:00
George Hotz 94599c0637
fixup ast in kernel to be MetaOps.SINK [run_process_replay] (#5424)
* fixup ast in kernel to be MetaOps.SINK [run_process_replay]

* fix tests

* fix more tests
2024-07-12 14:01:03 -07:00
uuuvn 3cb94a0a15
Rename tinygrad/runtime/driver to support (#5413) 2024-07-12 11:06:42 -07:00
wozeparrot a02b38c0ac
download openimages by running it (#5396) 2024-07-11 16:06:13 -07:00
wozeparrot fa873df9c1
bring tinychat more inline with tinyos' version (#5358) 2024-07-10 13:13:52 -07:00
George Hotz c13da83f12
tests from lowerer branch (#5339)
* tests from lowerer branch

* Update test_image_dtype.py

* Update test_image_dtype.py

* Update test_image_dtype.py
2024-07-08 21:23:19 -07:00
nimlgen 51d6f372e4
nv get classes based on device (#5325)
* nv get classes

* support in mockgpu

* choose sm based on gpu

* fix

* fix

* fix arch
2024-07-08 18:25:05 +03:00
Tobias Fischer 0c3a35e5c2
Stable Diffusion v2 Inference (#5283)
* model implementation

* clip fix, more qol options
2024-07-03 22:47:10 -04:00
chenyu b2c3a28a5e
nn.RMSNorm (#5272)
the norm itself isn't valuable enough to add as a Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
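Usage sketch, assuming the module follows the usual RMSNorm definition with a default eps of 1e-6:

```python
from tinygrad import Tensor, nn

x = Tensor.randn(2, 8)
out = nn.RMSNorm(8)(x)
# equivalent by hand: x / sqrt(mean(x^2) + eps), scaled by a learned weight
manual = x * (x.square().mean(axis=-1, keepdim=True) + 1e-6).rsqrt()
```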
Tobias Fischer 8c9c1cf62f
Pulled CLIP and UNet into Separate Files (#5253)
* pulled clip and unet into separate files

* reference cleanup, lru cache fix

* better pool indexing
2024-07-01 22:33:01 -04:00
nimlgen 57e89645cd
hcq spec test (#5226)
* start hcq spec test

* more test

* fixes

* run on amd as well

* test amdgpu exec

* fix amd

* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
George Hotz 14980f79dd hotfix: unbreak llama 2024-06-30 15:27:54 -07:00
George Hotz 3df47bc21e
OpenELM + repeat_interleave (#5234)
* start writing openelm

* progress...hit bug

* repeat_interleave support

* gqa

* add rotary embedding

* spp

* i think it runs correctly

* broken

* output is good now

* cleanups

* no io_uring on android
2024-06-30 15:18:39 -07:00
nimlgen dd7eef7d71
libc defs to autogen (#5217)
* libc defs to autogen

* amd import libc

* linter

* better a bit

* remove comment, check this

* not hardcoded path
2024-06-29 14:37:33 +03:00
qazal 3e56c8422c
remu err handling (#5208)
* add error handling

* use pre release

* minor

* works
2024-06-28 13:15:18 +03:00
reddyn12 f1c7944c44
Fix batchnorm shapes for resnet.load_pretrained (#5167)
* Fix batchnorm shapes

* make it general reshape
2024-06-26 18:44:10 -04:00
nimlgen 69f116a7e1
nv/amd profiler (#4718)
* nv/amd profiler

* fix

* fix

* profile copies

* profile logger

* fixes

* more fixes

* less lines and fixes

* fixes

* some linter

* back sync, no related change

* fix gpu2cpu time def

* simpler

* linter

* linter

* docs

* add add_event api
2024-06-23 17:10:12 +03:00
chenyu e356807696
tinytqdm.set_description and tinytrange (#5101) 2024-06-22 14:45:06 -04:00
chenyu 8080298739
s/tinytqdm/tqdm (#5103)
except in unit test where tqdm is imported
2024-06-22 14:18:26 -04:00
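A usage sketch for these two tqdm commits, assuming the renamed helpers are exported from tinygrad.helpers:

```python
from tinygrad.helpers import tqdm, trange

for i in (t := trange(100)):
  t.set_description(f"step {i}")  # added by #5101

for x in tqdm([1, 2, 3]):         # "tinytqdm" is just "tqdm" after #5103
  pass
```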
chenyu e468601226
update llama attention casting (#5096)
* update llama attention casting

updated scaled_dot_product_attention middle cast and removed hard-coded half in llama attention.

* fix that
2024-06-22 10:57:17 -04:00
chenyu 8bd6cb9511
update llama model RMSNorm casting (#5095)
following the original implementation, cast back to the input dtype before multiplying by the weight. slightly faster
https://github.com/meta-llama/llama/blob/main/llama/model.py
2024-06-21 23:02:04 -04:00
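An illustrative sketch of the casting order (an assumed helper, not the repo's exact code):

```python
def rms_norm_cast(x, weight, eps=1e-6):
  xf = x.float()  # compute the norm in float32 for stability
  normed = xf * (xf.square().mean(axis=-1, keepdim=True) + eps).rsqrt()
  return normed.cast(x.dtype) * weight  # cast back to input dtype before the weight mul
```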
chenyu 0c857ae2d6
some onnx_ops cleanups (#5094) 2024-06-21 22:01:32 -04:00
nimlgen fb1bf48cfe
io_uring for copies from disk (#5035)
* exp uring

* fixes and old version

* nv

* cleaner

* cmp vs aio

* fix

* no lib

* fix nv

* linter

* disk_speed_test now runs default

* fixes

* uring -> io_uring

* linter happy

* get_temp_buf comment added

* tiny nits

* put wait back

* test runs everywhere

* remove consts

* remove mmap consts

* do not require iouring to run test, they are generic
2024-06-21 11:36:51 +03:00
chenyu f6d6760f71
don't cast tuple to list before creating Tensor (#5071)
Tensor constructor supports creating from tuple now
2024-06-20 13:32:56 -04:00
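Usage sketch:

```python
from tinygrad import Tensor

t = Tensor((1, 2, 3))  # works directly now; no list(...) conversion needed first
```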
chenyu e2c5054bdd
update resnet.load_from_pretrained (#5040) 2024-06-18 16:29:22 -04:00
chenyu a3ed4176c8
use tinytqdm in active tests and examples (#5038)
* use tinytqdm in active tests and examples

stress test this before 0.9.1

* no set_description
2024-06-18 16:01:19 -04:00
Junjun Dong c8cd6e725c
Remove BinaryOps.SUB. Replace SUB by ADD and NEG in all tests. Regenerate dataset (#4977)
* feat: remove BinaryOps.SUB

* remove SUB in test_early_end_local

* regenerate dataset. remove SUB in test_linearizer_*

* reenable overflow tests

* simplify tensor.sub function by returning a+(-b)

* remove whitespaces

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-18 09:06:13 -04:00
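The tensor-level simplification from this PR, in sketch form:

```python
# subtraction no longer needs a dedicated SUB op: a - b lowers to ADD plus NEG
def sub(a, b):
  return a + (-b)
```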
chenyu 67e8df4969
remove numpy from dtype (#4969)
replaced all dtype.np with _to_np_dtype defined in tensor.py.

after this, the only numpy usages are (1) Tensor(np.ndarray), (2) construct .numpy() output, (3) numpy random buffer
2024-06-14 15:38:45 -04:00
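Roughly, the replacement helper looks like this (a sketch from the description; the exact code lives in tensor.py):

```python
import numpy as np

def _to_np_dtype(dtype):
  # dtypes no longer carry a .np attribute; map via the struct format char instead
  return np.dtype(dtype.fmt).type if dtype.fmt is not None else None
```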
George Hotz 9823752397
make uops.add private (#4950)
* make uops.add private

* modernize all tests
2024-06-14 03:23:25 -07:00
Jhenner Tigreros dc9e9e4363
Convert BinaryOps.DIV to UnaryOps.RECIP and BinaryOps.IDIV (#4887)
* Create UnaryOps.RECIP and BinaryOps.IDIV and changing uses of BinaryOps.DIV

* Delete unused import

* Add cstyle renderer

* Fix formatting text

* Fix test error due to bad implementation of renderer

* Add PTX support

* Add RECIP to LLVMIR

* Remove BinaryOps.DIV from symbolic test

* Change some test and fix C floor division

* Change references to DIV for the RECIP or IDIV

* Add mimic idiv for symbolic test

* Restore floor

* Mimic idiv

* cast to int

* Fix some test and renderer

* Remove DIV for render nodes

* Resolve issue with div

* Add TestRenderer

* Fix test

* fix error

* Fix PAD test

* Fix div implementation

* Remove DIV

* Add upcast to rshift, due to use of MUL and RECIP on DIV

* Fix linter

* Remove complete BinaryOps.DIV

* Fix lint

* Fix some test

* Revert mul modification

* Fix tests

* Fix CLANG for uops

* Revert IDIV function

* Minor fix

* modify pattern matching rule to support nan

* Fix UNSAFE_PADS_OPS to add UnaryOps.RECIP

* Remove const folding for IDIV and fix PTX

* Complete remove IDIV from extra

* Remove test_div from TestFloatUOps due to test on recip

* Fix linearizer

* fix

* Fix test_22

* Fix llvm

* Apply trunc function for llvmlit

* use floor instead of trunc

* Use correct type

* Generate new fuzz db

* Fix rshift, do not cast to float to support idiv

* Return upcast=false to rshift

* Add to unsafepad BinaryOps.IDIV

* Remove RECIP override for CUDA

* add atol / rtol for the test

* Remove cast to int on IDIV

* Regenerate sops

* delete sops.gz

* regenerate

* regenerate

* regenerate

* Reduce margins

* pass atol and rtol as parameters for _test_metrics

* regenerated dataset

* Regenerate

* Remove duplicated

* Revert changes on extra

* Remove changes extra and NOQA for test

* Remove E501

* Remove and change line

* Remove E501

* Fix atan2

* Revert import and E501

* Remove E501

* Add hrcp to half ops

* Remove 1 of hrcp

* Remove last DIV and add type check on uops for IDIV

* Fix new tests

* Fix tests and custom function

* Regenerate dataset

* Regenerate dataset

* Revert dataset

* Change generate dataset script

* Remove line

* Change IDIV, type checker validate if x,y and z are int

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-06-14 02:43:46 -07:00
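In spirit, the lowering after this change looks like the sketch below (illustrative; the real rules live in the uop rewrite code):

```python
def div(a: float, b: float) -> float:
  return a * (1.0 / b)  # float DIV becomes MUL by UnaryOps.RECIP

def idiv(a: int, b: int) -> int:
  return a // b  # integer DIV becomes its own BinaryOps.IDIV, no float round-trip
  # (rounding semantics differ between Python's // and C-style division;
  # this only sketches the float/int split)
```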
George Hotz e63701fbd4
RDNA3 assembly support (#3637)
* amazing that i can use comgr for this

* compile empty kernel

* cleanups

* tiny_add compiles

* ugh

* more work

* put that in extra
2024-06-13 09:09:24 +02:00
nimlgen fd071ba27e
amd mockgpu correct timer resolution (#4942)
* amd mockgpu correct timer resolution

* test it
2024-06-13 10:07:34 +03:00
Elias Wahl d2e3c391e8
Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
nimlgen 58cf6eaba9
add missing dir level for amd mockgpu (#4911) 2024-06-11 18:35:04 +02:00
nimlgen 654a8b9ef7
retire hsa (#4885)
* retire hsa

* EMULATE_AMD
2024-06-09 11:33:03 +03:00
Nik 085c0bbf6b
add mlperf train subset of openimages (#4841) 2024-06-05 10:10:11 -04:00
Elias Wahl 04e237328b
Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
chenyu 3afc914617
CMPEQ -> CMPNE and make it safe to pad (#4818)
* CMPNE

* new dataset
2024-06-03 18:02:15 -04:00
nimlgen 7384ee08a0
amd cleanup sdma (#4796)
* amd cleanup sdma

* faster enqueue for sdma

* typo

* remove commented lines

* fix overrun check

* flushhdp better command
2024-06-01 17:06:44 +03:00
nimlgen bd2e7c8b31
amd registers from file (#4778)
* amd registers from file

* remove comments

* linter

* no off
2024-05-31 18:48:57 +03:00
chenyu e614b7c696
docs: showcase remove mnist_gan and add conversation.py (#4757)
fixed both examples, and i think it's better to show conversation
2024-05-28 11:09:26 -04:00
nimlgen 50e95b8212
nv qmd sync (#4740)
* qmd sync

* better hcq

* mockgpu support chain qmd

* fix mockgpu & linter
2024-05-27 18:51:30 +03:00
nimlgen c87b066b66
optimize nv sync (#4729)
* optimize nv sync

* sdma signal without wfi

* nv mockgpu support

* sep change
2024-05-25 23:10:41 +03:00
chenyu 31358cbea5
change Tensor.stack to method (#4719) 2024-05-24 17:04:19 -04:00
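Usage sketch of the new method form:

```python
from tinygrad import Tensor

a, b = Tensor([1, 2]), Tensor([3, 4])
s = a.stack(b, dim=0)  # method form after this change; the old API was Tensor.stack([a, b])
```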
qazal c170ddceaf
fix commavq benchmark (#4712)
* fix _slice and assert explicit device

* with _slice
2024-05-24 19:40:57 +03:00
chenyu 47aba47f64
update Torch.gather api (#4692)
* update Torch.gather api

gather(self, dim, index) to match torch

* fix that
2024-05-22 21:54:06 -04:00
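Usage sketch of the updated signature:

```python
from tinygrad import Tensor

t = Tensor([[1, 2], [3, 4]])
idx = Tensor([[0, 0], [1, 0]])
print(t.gather(1, idx).numpy())  # dim comes first, matching torch.gather -> [[1 1] [4 3]]
```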
chenyu 792a494eb8
fix various examples (#4691)
* fix examples that used ax1 and ax2 for transpose

* fix that

* update those
2024-05-22 20:43:21 -04:00
chenyu 225dcab3be
prepend `_` to broadcast_shape and deepwalk (#4683)
* prepend `_` to broadcast_shape and deepwalk

internal only

* that too
2024-05-22 16:39:05 -04:00
chenyu ae861325ce
update llama sample for mac 32 input buffer limit (#4662)
set default sampling params to 0 for the function call, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
wozeparrot b144d4b460
new llama3 example (#4576) 2024-05-19 22:42:23 -07:00
nimlgen daf57af3eb
move tc to renderers (#4631)
* move tc to renderers

* missed import

* fix typo

* fix

* fix imports

* remove from tests

* fix 4607

* nv emulate timestamp

* time is int

* correct time
2024-05-18 00:36:29 +03:00
nimlgen 10cf8e459b
hcq update queue in place (#4626)
* do not self wait in hcq

* faster enqueue

* comments

* tests

* linter

* fix typo
2024-05-17 22:18:20 +03:00
nimlgen eb9689336e
nv mockgpu (#4600)
* mockgpu nv

* works

* comment that out

* fix merge

* setup gpuocelot

* install packages

* not run all of them

* passes

* fix ci

* almost

* should pass

* linter

* linter 2

* try this?

* ugh, not supported

* ci

* remove ticket from description

* better descs
2024-05-15 23:46:08 +03:00
Ahmed Harmouche 662bca8134
Split UnaryOps.CAST into CAST and BITCAST (#4487)
* Separate cast and bitcast

* Fix lint

* No more arg[0]

* Revert "No more arg[0]"

This reverts commit dee6911335513f092fe2cbb9684e8a9d26aad964.

* CAST/BITCAST arg is the dtype only, no more tuple

* No image bitcast, regenerate dataset

* Small fixes
2024-05-15 11:43:31 -04:00
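The distinction in a small sketch, assuming the public Tensor API:

```python
from tinygrad import Tensor, dtypes

x = Tensor([1.0])
print(x.cast(dtypes.int32).numpy())     # CAST converts the value: [1]
print(x.bitcast(dtypes.int32).numpy())  # BITCAST reinterprets the bits: [1065353216] (0x3f800000)
```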
George Hotz ff64bcab69
move graph/search to engine (#4596) 2024-05-14 23:12:59 -07:00
George Hotz fd02ab1e8b
move disassemblers and openpilot (#4592)
* move disassemblers and openpilot

* delete junk

* put that in pre-commit

* fixup readme
2024-05-14 19:30:02 -07:00
chenyu a65c8de735
move .half() llama freq_cis to the end of sin and cos (#4587)
otherwise the arange contains inf if either dim or context length exceeds half.max
2024-05-14 15:00:18 -04:00
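A minimal repro sketch of the overflow, assuming current tinygrad casting semantics:

```python
from tinygrad import Tensor

# float16 tops out at 65504, so casting a long-context arange to half too early
# overflows; casting after sin/cos avoids this.
print(Tensor.arange(70000).half().max().numpy())  # inf
```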
nimlgen 9b02aef45a
remove rhip (#4579)
* remove rhip

* remove hip runner
2024-05-14 17:58:19 +03:00
nimlgen 2131556c2c
amd mockgpu (#4535)
* start mock amd gpu

* virt files

* cleaner

* init ci

* small fixes

* linter

* better?

* ugh

* linter

* fix

* disable some

* run shorter

* fixes

* add hcq test

* fix

* fix cmd revert
2024-05-14 14:28:04 +03:00
chenyu da10cf0be1
extra/threefry.py for mem usage (#4533)
for now it needs 8N mem to generate size N rand
2024-05-11 13:46:44 -04:00
chenyu 8a0fb3d765
delete old extra/autopad.py (#4532) 2024-05-11 13:06:10 -04:00
George Hotz 2f970a4fc2
all realize 2 (#4527)
* all realize 2

* tests fixup

* fix more tests

* fix openpilot

* fix tests

* unneeded
2024-05-10 22:43:09 -07:00