add slides from code europe to docs

This commit is contained in:
George Hotz 2024-06-12 14:35:08 +02:00
parent 9a3c1e4a17
commit 828c98d5c4
1 changed file with 0 additions and 0 deletions

docs/tinygrad_intro.pdf Normal file

@@ -0,0 +1,319 @@
tinygrad: from MNIST to ALUs
What is tinygrad?
● A neural network framework
● Pure Python (seriously)
● Very small (<8000 lines)
● Yet fully functional
The tinygrad stack
[Stack diagram: tinygrad → assembler → kernel]
Almost no dependencies => it's easy to port new accelerators
Why a new framework?
● To commoditize the petaflop
● The graveyard of AI chip
companies is big.
● To be successful with your
chip, you must be able to
create your own stack
A torch-like frontend
● No `nn.Module` class
● No `forward`
● No classes for stateless operations
● Many Tensor methods (see docs.tinygrad.org)
tinygrad is lazy
● Eager: operations happen when they run (PyTorch)
● Graph: operations happen after the graph is compiled (TensorFlow, torch.compile)
● Lazy: an implicit graph, the simplicity of eager with the power of graph (see the sketch below)
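A minimal sketch of the lazy behavior (a toy script; `Tensor`, `lazydata`, and `numpy()` are real tinygrad API, though the printed repr varies by version):

```python
from tinygrad import Tensor

a = Tensor([1.0, 2.0, 3.0])
b = (a * 2 + 1).sum()  # nothing has executed yet: this only builds the lazy graph
print(b.lazydata)      # the LazyBuffer graph behind the Tensor
print(b.numpy())       # realization: the graph is compiled and run as fused kernels
```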
The LazyBuffer graph
[Graph diagram: the lazy graph for a conv2d — two LoadOps.CUSTOM inputs with shapes (16, 3, 3, 3) and (16, 3, 64, 64), two (16, 1, 16, 62, 62, 3, 3, 3) views of them, a BinaryOps.MUL, a ReduceOps.SUM down to (16, 1, 16, 62, 62, 1, 1, 1), and a LoadOps.COPY to CLANG]
● LoadOps.CUSTOM is Tensor.rand
● Green is a “view”
● A conv is two views, a MUL, and a SUM
● We copy back to the CPU (aka CLANG)
The code (conv2d)
An OpenCL kernel implementing a 3x3 conv
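A sketch of how to reproduce this (the shapes come from the graph above; DEBUG is a real tinygrad environment variable, though the exact output format varies by version):

```python
# run as e.g. `DEBUG=4 python conv.py` to print the generated kernel source
from tinygrad import Tensor

x = Tensor.rand(16, 3, 64, 64)  # input: batch 16, 3 channels, 64x64
w = Tensor.rand(16, 3, 3, 3)    # 16 output channels, 3x3 kernel
out = x.conv2d(w)               # lazily builds the two views, the MUL, and the SUM
out.realize()                   # schedules and runs the conv kernel
```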
The UOps (conv2d)
[Graph diagram: the UOps for the conv kernel — DEFINE_GLOBAL buffer pointers, SPECIAL grid indices (gidx0, gidx1, gidx2), RANGE loops for the 3x3x3 reduce, integer index math built from CONST and ALU ops (MUL, ADD, DIV, MOD), LOADs, a DEFINE_ACC/PHI float accumulator, and a final STORE]
Slow?
● Problem: tons of ops are spent on indexing
● Solution: compute multiple outputs (a chunk) in the kernel
● Question: what size chunk is optimal?
● Answer: search the possible kernels!
BEAM search
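A sketch of turning the search on (BEAM is a real tinygrad environment variable giving the beam width; searched kernels are cached, so later runs reuse the winners):

```python
# run as e.g. `BEAM=2 python conv.py`, or set the variable before importing tinygrad
import os
os.environ["BEAM"] = "2"

from tinygrad import Tensor

out = Tensor.rand(16, 3, 64, 64).conv2d(Tensor.rand(16, 3, 3, 3))
out.realize()  # each kernel's optimization space is beam-searched before compiling
```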
The Optimized UOps (conv2d)
[Graph diagram: the optimized UOps for the same conv — the kernel now computes a chunk of outputs at once, so the graph gains float2 vector dtypes with CAST and GEP element extraction, several DEFINE_ACC/PHI accumulators, and multiple STOREs, spending far fewer indexing ops per output]
Philosophy of tinygrad
● Surface all complexity
Don't rely on libraries, many of which are vendor specific with quirks.
● No Turing complete abstractions
Rules out use of LLVM; LLVM IR has thrown away too much information.
● Embrace “The Bitter Lesson”
There are many choices to be made; don't spend time designing heuristics, use search.
Model training
Follow along with the MNIST tutorial on docs.tinygrad.org
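A sketch of the kind of model the tutorial builds (the layer sizes here are illustrative, sized for 28x28 MNIST images). Note the frontend slide above: no `nn.Module`, no `forward`, just a plain class with `__call__`.

```python
from tinygrad import Tensor, nn

class Model:
  def __init__(self):
    self.l1 = nn.Conv2d(1, 32, kernel_size=(3, 3))
    self.l2 = nn.Conv2d(32, 64, kernel_size=(3, 3))
    self.l3 = nn.Linear(1600, 10)

  def __call__(self, x: Tensor) -> Tensor:
    x = self.l1(x).relu().max_pool2d((2, 2))
    x = self.l2(x).relu().max_pool2d((2, 2))
    return self.l3(x.flatten(1))
```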
What is @TinyJit (DEBUG=2)
It captures the kernels that are run and replays them with new data
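A minimal TinyJit sketch (TinyJit is importable from tinygrad; jitted functions should return realized Tensors):

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
  return (x @ x).relu().realize()

for _ in range(5):
  # the first calls run normally and capture the kernels;
  # later calls replay the captured kernels with the new input data
  out = step(Tensor.rand(4, 4))
```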
What are CUDA Graphs?
● GPUs use command queues to execute kernels. They are what they sound like.
● Model training runs can be ~10,000 kernels.
● The CPU time spent enqueuing the kernels can exceed the GPU runtime.
● So... reuse the same command queue!
NV/AMD backends
● These backends replace the CUDA/HIP runtimes and speak directly with the kernel using ioctl.
● Aside from the assembler, no CUDA is used.
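A sketch of selecting these backends (Device is importable from tinygrad; the NV and AMD devices need the matching hardware, so treat this as illustrative):

```python
from tinygrad import Tensor, Device

print(Device.DEFAULT)  # e.g. NV, AMD, CUDA, CLANG, ... depending on the machine

# place a Tensor on the ioctl-level NVIDIA backend (requires NVIDIA hardware)
a = Tensor.ones(4, 4, device="NV")
print((a + a).numpy())
```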
code walkthrough
Tensor Flow
● Tensor → LazyBuffer (function.py)
Forward/backward pass handled here
● LazyBuffer → LazyOp (scheduler.py)
Breaking into Kernels here
● LazyOp → UOp (linearizer.py)
Generate kernel code in an LLVM-like IR
● UOp → Code (renderer)
This code is CUDA code or C code
● Code → /accelerator/ (runtime)
Code: tensor.py:Tensor
The main class. Its methods are the user-facing functions; forward and backward are handled here. The lazydata property contains a LazyBuffer.
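A sketch (`lazydata` is a real Tensor property; its printed repr varies by version):

```python
from tinygrad import Tensor

t = Tensor([[1.0, 2.0], [3.0, 4.0]])
print(t.lazydata)              # the LazyBuffer this Tensor wraps
print(t.relu().sum().numpy())  # methods like relu/sum are the user-facing functions
```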
Code: function.py
Thanks to the chain rule, 28 derivatives are all you need to handcode
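A simplified illustration of the pattern (not the actual function.py code, which works on LazyBuffers rather than Tensors): each op defines a forward and, via the chain rule, a backward that maps the output gradient to input gradients.

```python
from tinygrad import Tensor

class Relu:  # hypothetical, simplified stand-in for the real Function class
  def forward(self, x: Tensor) -> Tensor:
    self.ret = x.maximum(0)
    return self.ret

  def backward(self, grad_output: Tensor) -> Tensor:
    # chain rule: dL/dx = dL/dy * dy/dx, and relu's dy/dx is 1 where x > 0
    return (self.ret > 0).where(grad_output, 0)
```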
Code: lazy.py:LazyBuffer
The container of computation; it specifies how to construct the buffer. Below the forward/backward layer, it can be constructed from simple ops.
Code: ops.py
The 32 simple ops.
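A sketch of peeking at them (the enums live in tinygrad.ops; exact membership varies by version, and the graphs above used several of these):

```python
# the LazyBuffer graph above used LoadOps.CUSTOM/COPY, BinaryOps.MUL, ReduceOps.SUM
from tinygrad.ops import UnaryOps, BinaryOps, TernaryOps, ReduceOps, LoadOps

for ops in (UnaryOps, BinaryOps, TernaryOps, ReduceOps, LoadOps):
  print(ops.__name__, [op.name for op in ops])
```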
Code: shape/shapetracker.py
● One of the pieces of tinygrad magic, all
“movement” operations are tracked here.
● Reshape can create “multiview” ShapeTracker,
aka the length of the views tuple is > 1
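A sketch of a multiview ShapeTracker (ShapeTracker.from_shape, permute, and reshape are real; whether a given reshape merges into one view depends on the strides):

```python
from tinygrad.shape.shapetracker import ShapeTracker

st = ShapeTracker.from_shape((4, 4))
st = st.permute((1, 0))  # a movement op: just changes strides, no data moves
st = st.reshape((16,))   # can't be expressed with a single set of strides...
print(len(st.views))     # ...so the views tuple grows to 2: a "multiview"
```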
Code: shape/view.py
A view has a shape, strides, an offset, and a mask. This handles all pad, shrink, expand, permute, and stride + some reshapes.
Throwback: conv2d
LOAD, MUL, SUM, STORE are Ops defining a Kernel
There are two single-view ShapeTrackers for the inputs
the tiny corp
A company in 2024
● We are a GitHub and a Discord.
● We raised $5M, and will be profitable this year by selling computers.
● “remote” jobs are fine, but it begins to deconstruct what a job is.
● We are now 5 people, and hire exclusively from the pool of tinygrad contributors.
● “collective”
Bounties
tinybox
Selling hardware that matches the main development platform...
...is ethical value capture
MLPerf
● As promised, we got AMD on MLPerf.
● tinybox green (6x 4090), ResNet-50, 122 minutes
● tinybox red (6x 7900XTX), ResNet-50, 167 minutes
● Done using tinygrad, none of the ML libraries from either company.
● Our next submission will use none of the userspace.
Where we are going
1) Build the best training framework for NVIDIA/AMD/Intel/Qualcomm/etc.
2) Capture all existing chips in a generic framework. Search for the best possible chip given a set of tasks.
3) Build that chip. Sell chips and build clouds at the task abstraction, not the computer abstraction.
How to join tiny
● Permissionless company! (who has read ?s doc)
● Skills are all that matters
● We don't discriminate against silicon-based life
live coding...