You like pytorch? You like micrograd? You love tinygrad! ❤️

[tiny corp logo]

tinygrad: For something between PyTorch and karpathy/micrograd. Maintained by tiny corp.

Homepage | Documentation | Discord



This may not be the best deep learning framework, but it is a deep learning framework.

Due to its extreme simplicity, it aims to be the easiest framework to add new accelerators to, with support for both inference and training. If XLA is CISC, tinygrad is RISC.

tinygrad is still alpha software, but we raised some money to make it good. Someday, we will tape out chips.

Features

LLaMA and Stable Diffusion

tinygrad can run LLaMA and Stable Diffusion!

Laziness

Try a matmul. See how, despite the style, it is fused into one kernel with the power of laziness.

DEBUG=3 python3 -c "from tinygrad import Tensor;
N = 1024; a, b = Tensor.rand(N, N), Tensor.rand(N, N);
c = (a.reshape(N, 1, N) * b.T.reshape(1, N, N)).sum(axis=2);
print((c.numpy() - (a.numpy() @ b.numpy())).mean())"

And we can change DEBUG to 4 to see the generated code.
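
DEBUG can also be raised for just one block of code instead of the whole process. Below is a minimal sketch; it assumes tinygrad.helpers.Context works as a context manager that temporarily overrides flags like DEBUG.

from tinygrad import Tensor
from tinygrad.helpers import Context   # assumed helper: temporarily overrides env-style flags such as DEBUG

a, b = Tensor.rand(1024, 1024), Tensor.rand(1024, 1024)
with Context(DEBUG=4):                 # only this block prints the schedule and generated kernel code
  (a @ b).realize()                    # force the lazy matmul to actually compile and run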

Neural networks

As it turns out, 90% of what you need for neural networks is a decent autograd/tensor library. Throw in an optimizer, a data loader, and some compute, and you have all you need.

from tinygrad import Tensor, nn

class LinearNet:
  def __init__(self):
    self.l1 = Tensor.kaiming_uniform(784, 128)
    self.l2 = Tensor.kaiming_uniform(128, 10)
  def __call__(self, x:Tensor) -> Tensor:
    return x.flatten(1).dot(self.l1).relu().dot(self.l2)

model = LinearNet()
optim = nn.optim.Adam([model.l1, model.l2], lr=0.001)

x, y = Tensor.rand(4, 1, 28, 28), Tensor([2,4,3,7])  # replace with real mnist dataloader

with Tensor.train():
  for i in range(10):
    optim.zero_grad()
    loss = model(x).sparse_categorical_crossentropy(y).backward()
    optim.step()
    print(i, loss.item())
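
After training, a quick sanity check is to run the model outside the training context and compare argmax predictions against the labels. This is a minimal sketch reusing the toy model and fake batch above, not part of the original example.

preds = model(x).argmax(axis=1)          # index of the highest logit per sample
accuracy = (preds == y).float().mean()   # fraction of correct predictions on the fake batch
print(accuracy.item())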

See examples/beautiful_mnist.py for the full version that gets 98% in ~5 seconds.

Accelerators

tinygrad already supports numerous accelerators, and it is easy to add more! Your accelerator of choice only needs to support a total of ~25 low level ops.
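
Picking a backend is just a device string on the tensor. A minimal sketch below prints the auto-selected device and pins a tensor to a named backend; the name "CLANG" for the C backend is an assumption and may differ on your install.

from tinygrad import Tensor, Device

print(Device.DEFAULT)                         # the accelerator tinygrad auto-selected on this machine
t = Tensor([1.0, 2.0, 3.0], device="CLANG")   # assumed backend name for the C backend; swap for CUDA, METAL, etc.
print(t.device, (t + 1).numpy())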

Installation

The current recommended way to install tinygrad is from source.

From source

git clone https://github.com/tinygrad/tinygrad.git
cd tinygrad
python3 -m pip install -e .

Direct (master)

python3 -m pip install git+https://github.com/tinygrad/tinygrad.git

Documentation

Documentation along with a quick start guide can be found on the docs website built from the docs/ directory.

Quick example comparing to PyTorch

from tinygrad import Tensor

x = Tensor.eye(3, requires_grad=True)
y = Tensor([[2.0,0,-2.0]], requires_grad=True)
z = y.matmul(x).sum()
z.backward()

print(x.grad.numpy())  # dz/dx
print(y.grad.numpy())  # dz/dy

The same thing but in PyTorch:

import torch

x = torch.eye(3, requires_grad=True)
y = torch.tensor([[2.0,0,-2.0]], requires_grad=True)
z = y.matmul(x).sum()
z.backward()

print(x.grad.numpy())  # dz/dx
print(y.grad.numpy())  # dz/dy

Contributing

There has been a lot of interest in tinygrad lately. Following these guidelines will help your PR get accepted.

We'll start with what will get your PR closed with a pointer to this section:

  • No code golf! While low line count is a guiding light of this project, anything that remotely looks like code golf will be closed. The true goal is reducing complexity and increasing readability, and deleting \ns does nothing to help with that.
  • All docs and whitespace changes will be closed unless you are a well-known contributor. The people writing the docs should be those who know the codebase the absolute best. People who have not demonstrated that shouldn't be messing with docs. Whitespace changes are both useless and carry a risk of introducing bugs.
  • Anything you claim is a "speedup" must be benchmarked. In general, the goal is simplicity, so even if your PR makes things marginally faster, you have to consider the tradeoff with maintainability and readability.
  • In general, the code outside the core tinygrad/ folder is not well tested, so unless the current code there is broken, you shouldn't be changing it.
  • If your PR looks "complex", is a big diff, or adds lots of lines, it won't be reviewed or merged. Consider breaking it up into smaller PRs that are individually clear wins. A common pattern I see is prerequisite refactors before adding new functionality. If you can (cleanly) refactor to the point that the feature is a 3 line change, this is great, and something easy for us to review.

Now, what we want:

  • Bug fixes (with a regression test) are great! This library isn't 1.0 yet, so if you stumble upon a bug, fix it, write a test, and submit a PR, this is valuable work.
  • Solving bounties! tinygrad offers cash bounties for certain improvements to the library. All new code should be high quality and well tested.
  • Features. However, if you are adding a feature, consider the line tradeoff. If it's 3 lines, there's less of a bar of usefulness it has to meet over something that's 30 or 300 lines. All features must have regression tests. In general with no other constraints, your feature's API should match torch or numpy.
  • Refactors that are clear wins. In general, if your refactor isn't a clear win it will be closed. But some refactors are amazing! Think about readability in a deep core sense. A whitespace change or moving a few functions around is useless, but if you realize that two 100 line functions can actually use the same 110 line function with arguments while also improving readability, this is a big win. Refactors should pass process replay.
  • Tests/fuzzers. If you can add tests that are non-brittle, they are welcome. We have some fuzzers in here too, and there's a plethora of bugs that can be found with them and by improving them. Finding bugs, even writing broken tests (that should pass) with @unittest.expectedFailure, is great (see the sketch after this list). This is how we make progress.
  • Dead code removal from core tinygrad/ folder. We don't care about the code in extra, but removing dead code from the core library is great. Less for new people to read and be confused by.
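
As a concrete illustration of the bug-fix and expectedFailure points above, here is a minimal sketch of what a regression test could look like. The test names and the deliberately wrong assertion are made up for illustration; real tests live under test/.

import unittest
import numpy as np
from tinygrad import Tensor

class TestHypotheticalRegression(unittest.TestCase):
  def test_add_matches_numpy(self):
    # regression-style check: the tinygrad result must match the numpy reference
    a, b = Tensor.rand(4, 4), Tensor.rand(4, 4)
    np.testing.assert_allclose((a + b).numpy(), a.numpy() + b.numpy(), atol=1e-6)

  @unittest.expectedFailure  # a "broken test that should pass", documenting a bug until it is fixed
  def test_known_broken_case(self):
    # stand-in assertion: a real one would assert the correct behavior and fail because of the bug
    self.assertEqual(Tensor([1, 2, 3]).sum().item(), 999)

if __name__ == "__main__":
  unittest.main()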

Running tests

You should install the pre-commit hooks with pre-commit install. This will run the linter, mypy, and a subset of the tests on every commit.

For more examples of how to run the full test suite, please refer to the CI workflow.

Some examples of running tests locally:

python3 -m pip install -e '.[testing]'  # install extra deps for testing
python3 test/test_ops.py                # just the ops tests
python3 -m pytest test/                 # whole test suite

Process replay tests

Process replay compares your PR's generated kernels against master. If your PR is a refactor or speedup without any expected behavior change, it should include [run_process_replay] in the PR title; see an existing such PR for an example. Note that you should keep your branch up-to-date with master.