tinygrad

Commit Graph

Author	SHA1	Message	Date
chenyu	22376e53b7	resnet mlperf logging (#4361 ) * resnet mlperf logging * cropping too much?	2024-05-02 00:00:04 -04:00
George Hotz	8bcf533a84	gitignore open-images-v6TEST	2024-05-01 13:55:38 +00:00
Elias Wahl	27613dd881	MLPerf BERT: Main training loop (#4288 ) * BERT language modeling head + trunc normal initializers * add train loop + helpers * shuffle in dataloaders + slight changes in main loop * beam change * Minor changes * random.shuffle * HParam update * Use deque for dataloader * wandb bert project name * half fixes * BENCHMARK + remove epoch * cast + print() --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-04-29 14:35:27 -04:00
geohotstan	bf412aeb80	use tolist instead of numpy for extracting parameters in onnx (#4333 ) * still some numpy left * all pass * oops indent * fix up safe_python * to_python_const	2024-04-29 10:48:20 -04:00
Francis Lata	bb849a57d1	[MLPerf] UNet3D dataloader (#4343 ) * add support for train/val datasets for kits19 * split dataset into train and val sets * add tests for kits19 dataloader * add MLPerf dataset tests to CI * update unet3d model_eval script * fix linting * add nibabel * fix how mock dataset gets created * update ref implementation with permalink and no edits * clean up test and update rand_flip implementation * cleanups	2024-04-28 22:34:18 -04:00
chenyu	82d0ed3cf3	cap default dataset wikipedia max_workers to 32 (#4345 ) 64 on tinybox OOM	2024-04-28 21:55:21 -04:00
geohotstan	bc36940c28	fix (#4319 )	2024-04-28 16:29:04 +08:00
chenyu	5ae252ae83	use at least float32 for optim.lr (#4297 ) * use at least float32 for optim.lr when doing mixed precision training (float32 weight, default_float=half), still use float32 to store lr. it would have been upcasted later in actual weight update, but would have lost precision. this improved resnet convergence significantly * undo type annotation	2024-04-25 14:42:28 -04:00
George Hotz	38f97aa0fe	rename rawbufs to bufs in ExecItem (#4274 )	2024-04-24 11:27:27 +08:00
nimlgen	f3b4dff7c9	KFDProgram -> AMDProgram (#4268 )	2024-04-24 00:29:50 +03:00
Elias Wahl	69341144ba	Wikipedia preprocessing script (#4229 ) * Preprocessing script * short seq prob * comments + env vars * Add preprocessing reference. Add test * lint fix + add eval test support * whitespaces * point to commit * comment * rename * better comments	2024-04-23 10:28:01 -04:00
George Hotz	9a95781d51	renamed (#4260 )	2024-04-23 09:00:28 +04:00
George Hotz	2ae4f45272	WIP PM4 Support (#4110 ) * pm4 kernel launch works * disable USE_THREAD_DIMENSIONS * add kernel code * work on real pm4 * pm4 signal * same * gate pm4 * hcq tests pass * ops passes * pm4 is closer * pm4 debug (#4165) * start debug tests passing * prg * smth * hdp flush * cleaner 1 * do not need this * logs not need * small things * linter * remove AQL * test hcq * fix tests * it's subtracting, it shouldn't be -1 * pm4 changes (#4251) * not need this anymore * sdma signal with non atomic --------- Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-04-23 08:31:27 +04:00
Francis Lam	bbb0ad4800	wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216 ) * wmma: widen TC usage in search by using PADTO on TC axes when possible * test: start tests for the new padding TC behavior * search: upgrade padded TC search to TC_OPT >= 2 * test: add behavior and correctness test for padded TC added optional argument to apply_tensor_core to set TC_OPT level * linearizer: add tests for the PADTO behvaior and docs	2024-04-22 16:50:31 -04:00
nimlgen	e6227bdb15	nv driver (#4044 ) * start * fix err 93 * gpu * ioctl mappings * alloc like cuda * semaphores * wait for semaphores value * start ops_nv * very simple kernels work * init several gpus * qmd dumper * dirty, but most of kernels work * always all test_ops * progress, more tests, stable * test_ops passes, gpt2 works but wth big fifo, wrap of fifo doesn't work, i think it's something coherency releated * need better sync * fix sync * alloc2 * all tests pass! * cleanup 1 * cleanup * multigpu, simple transfer * fix sync * correct init * nv_gpu autogen + sync bug fix * clean extra/nv_gpu_driver * p2p * clean up * remove old gen * small fixes * cleanup * cleanup 2 * small fixes * bigger queue size * cleanups * wait * fixed signals for devs * fix hang + parallel beam * small fixes * detect when local memory is big in kernel * correct assert * small fixes * correct tls size est * one va space * less lines * shorter * save 2 lines * save some lines * remove type ignores --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-04-22 19:50:20 +04:00
Elias Wahl	2ecd61e3e2	monkey patching (#4214 )	2024-04-18 19:20:52 -04:00
chenyu	cd801a15f3	scipy.signal.gaussian -> scipy.signal.windows.gaussian (#4205 ) fixed unet3d model_eval, will add to CI after merging new dice loss	2024-04-17 19:15:37 -04:00
Elias Wahl	6eef8ee22a	Wikipedia download script for MLPerf BERT training (#4202 ) * wikipedia download script * add link * checksum valueError * ops	2024-04-17 16:34:57 -04:00
Francis Lam	c91b7b1739	test: add fuzz_matmul and better debugging for simple_matmul (#4199 ) also show unoptimized shape in verify_kernel	2024-04-16 23:40:31 -04:00
George Hotz	55ae73e951	Replicate llm.c in tinygrad (#4179 ) * write llm.c and add a few new methods to tensor * training works * add jit * tests for new functions * test tolist * simple fix for onnx test failures (#4186) * write llm.c and add a few new methods to tensor * training works * add jit * tests for new functions * bump line count to 7500 * simplest fix * safenumpy tolist for now --------- Co-authored-by: George Hotz <geohot@gmail.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> --------- Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>	2024-04-16 15:40:48 +04:00
George Hotz	b7e281cf10	JitItem -> ExecItem (#4146 ) * JitItem -> ExecItem * execitem in realize * cleaner * JITRunner -> Runner	2024-04-11 08:24:57 -07:00
George Hotz	e79a11b99c	hotfix: revert llama change	2024-04-10 20:13:15 -07:00
George Hotz	2e6c39b0b2	Do less realizes (#4141 ) * less realize * corealize jit inputs * prints * print before we run	2024-04-10 19:50:50 -07:00
geohotstan	fe88591890	update onnx to 1.16.0 (#4127 ) * update * pass tests and skip tests	2024-04-10 11:19:13 -04:00
Francis Lam	46850a0269	search: add a BEAM_COMPARE env to optionally not compare to hc/tc (#4107 ) * search: add a BEAM_COMPARE env to optionally not compare to hc/tc setting BEAM_COMPARE=0 will prevent additional memory allocation needed to do the timing tests assuming the BEAM result is in the diskcache. * change to optionally use Buffer.allocate	2024-04-08 18:54:01 -04:00
chenyu	f8dc82a8a7	use single tensor for llama kv chache (#4108 ) similar to optimization in gpt2	2024-04-08 00:38:32 -04:00
chenyu	92c0675ccf	setitem initial support (#4093 ) * wip setitem it's an eager assign to output shapetracker view * cleanups and tests * more cleanups	2024-04-07 20:35:22 -04:00
geohotstan	183708b3fd	broadcast expand to match torch (#4085 ) * initial version * heh gimme grrrreen * version 2 * clean ups * some test confusion * fix onnx * rename to _broadcast_tensors * improved errors and test * fixed? * some test fixup * version 3 lol * comments * cleaner * add failure test for expand to 0 test * 1 more assertRaises test * make err msg better * also rewrite the expand onnx op? :s	2024-04-07 16:23:13 -04:00
George Hotz	fffd9b05f5	mock mnist data for imagenet trainer (#4095 ) * mock mnist data for imagenet * move print and test * needed to reshape	2024-04-06 08:08:40 -07:00
geohotstan	dafa42e864	clean up (#4081 ) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-04-05 11:57:44 -04:00
nimlgen	d6ba44bc1e	kfd free buffers (#4027 ) * kfd free buffers * unmap * all test passes * better pm4 * forgot these * invalidate only range * better cache * forgot * comments * fixes	2024-04-01 15:50:58 -07:00
Francis Lam	dcb58d3bed	extra/gemm/simple_matvec: add simple_matvec.py (#4021 ) we can test with this or add it to CI for benchmarks	2024-03-31 16:38:52 -04:00
chenyu	d3f27761b0	move const folding of ADD/SUB/MUL from tensor to lazy (#4020 ) * move const folding of ADD/SUB/MUL from tensor to lazy will do div and pow separately. * fix onnx adding with None	2024-03-31 16:35:36 -04:00
George Hotz	2abb474d43	kfd driver wip (#3912 ) * kfd driver wip * cleanups * kfd almost ready to ring doorbell * ding dong? * issues with signals * something * works * ops kfd * add amd_signal_t * works...sometimes * program runs * _gpu_alloc cleanup * cleanups * work * header + enable profiling (#3959) * header + enable profiling * just cleaner * measure * only local time domain * remove old comments * fix with master * elf parsing (#3965) * elf parsing * fix kernels with private * not used * clean up * clean up 2 * add flags * kfd sdma (#3970) * working sdma * remove driver, shorter * all commands we might need * svm * kfd remove hardcoded values (#4007) * remove hardcoded values * match above line * 7k lines + revert hsa * update that from origin * fix sdma reg gen * not the updated SDMA * compiler_opts * don't require kfd_ioctl * get ioctls from python * get ioctls from python * remove build_sdma_command * merge into 64-bit fields * shorter * fix property spelling and off by one --------- Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-03-30 15:08:12 -07:00
Francis Lam	04746022b1	extra/gemm/hip_matmul: fix to use new HSA devices and no headers (#3999 ) * extra/gemm/hip_matmul: fix to use new HSA devices and no headers * remove compile_hip import	2024-03-30 15:42:23 -04:00
chenyu	c71627fee6	move GlobalCounter to helpers (#4002 ) break circular import between ops and buffer	2024-03-30 00:30:30 -04:00
Akshit Talwar	0affbbf81c	update amx gemm (#3991 )	2024-03-29 11:45:03 -04:00
George Hotz	9a6ac2a50a	create the buffer with the LazyBuffer (#3977 ) * create the buffer with the LazyBuffer * fixes * hack underlying buffer when we change dtype * we only care about allocated buffers * asserts	2024-03-28 19:31:28 -07:00
chenyu	b47f6cebb2	LinearizerOptions -> CompilerOptions (#3978 )	2024-03-28 17:50:23 -04:00
David Hou	4b95350c41	fp16 resnet (without expand backwards sum in float, doesn't work) (#3816 ) * fp16 resnet * cast running mean and var back to default float * extra cast * check symbolic no overflow * add linearizer failure * loss scaler after grad contig * oops * i think this works * don't loss scale fp32 * remove overflow test case * remove symbolic bounds check * loss scaler should be float * temporarily disable padto cuz bug shruggie * make running stats in batchnorm float32? * calculate lars stuff in fp32? * oops * remove most changes * move loss scaler out of optimizer * no more FP16 var * oops --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-28 01:25:37 -04:00
Francis Lam	7c5729a3bd	wmma: refactor to remove wmma_func and create TC funcs as needed (#3945 ) * wmma: refactor to remove wmma_func and create TC funcs as needed * test_linearizer: disable bf16 CUDA during emulation testing * cstyle: clean up creation of CUDA vec dtypes * extra/gemm: add option to accumulate to bfloat16 * cleanups * benchmark: add CUDA bfloat16 matmul * more cleanups	2024-03-27 16:43:09 -04:00
George Hotz	68ca4d4276	split to schedule.py (#3949 ) * split to schedule.py * split	2024-03-26 21:02:46 -07:00
George Hotz	150ea2eb76	create engine folder and move code (#3948 ) * retry * older tf * that	2024-03-26 20:38:03 -07:00
George Hotz	778d17fbd3	intel matmul (#3830 ) * almost right * intel xmx	2024-03-25 22:37:20 -07:00
wozeparrot	9a9cac58f9	add lars to nn (#3750 ) * feat: add lars * feat: don't remove this comment * clean: smaller diff * clean: shorter line * feat: remove mlperf lars, switch resnet * fix: fully remove mlperf lars * clean: comment * feat: contiguous * feat: no weight decay on skip params * feat: optimizergroup * feat: classic momentum * fix: pylint * clean: move comment * fix: correct algo * feat: lrschedulergroup * feat: skip list tests * feat: :\| forgot that params are a thing * feat: remove skip_list params from main params * feat: set moment --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-24 11:43:12 -04:00
George Hotz	46a3501cec	nv ioctl sniffer (#3892 ) * nv ioctl sniffer * unused import * Update __init__.py * that work * that fix it	2024-03-23 00:29:30 -07:00
chenyu	ee502c8055	fixup to_movement_ops and add back to CI (#3881 )	2024-03-22 18:14:49 -04:00
Francis Lam	5587594a00	fuzz_linearizer: add --ast and --file params to read kernels (#3877 ) also fix up ast_str_to_str to support the new tuple of LazyOps	2024-03-22 14:27:40 -04:00
Francis Lam	a26090d404	search: change to use "spawn" and limit the number of tasks per child (#3862 ) also clean up some examples to use __main__ and not initialize resources outside of main	2024-03-21 21:23:36 -07:00
Francis Lam	6d5dec2fef	log optimized kernels and a script to compare with non-optimized ones (#3829 ) * search: add BEAM_VERIFY option to validate search results refactor fuzz_linearizer comparison to allow it to be used in for BEAM_VERIFY in device.py * search: fix to verify the beam_search result and not the fastest * search: fix typing and clean up * device: remove imports from test and add LOGKERN options LOGKERN output can be used with test/external/verify_kernel.py to validate correctness * fix example in verify_kernel.py * cleanup fixes * fix to use f-strings	2024-03-20 19:22:08 -04:00

642 Commits