tinygrad: from MNIST to ALUs

---

What is tinygrad?

● A neural network framework
● Pure Python (seriously)
● Very small (<8000 lines)
● Yet fully functional
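Seeing it in a few lines (`pip install tinygrad` is the real package name; a minimal sketch):

```python
# pip install tinygrad
from tinygrad import Tensor

t = Tensor([1.0, 2.0, 3.0])
print((t * 2 + 1).numpy())  # [3. 5. 7.]
```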

---

The tinygrad stack

[stack diagram: tinygrad → assembler → kernel]

Almost no dependencies => it’s easy to port new accelerators

---

Why a new framework?

● To commoditize the petaflop
● The graveyard of AI chip companies is big.
● To be successful with your chip, you must be able to create your own stack.

---

A torch-like frontend

● No `nn.Module` class
● No `forward`
● No classes for stateless operations
● Many Tensor methods
● docs.tinygrad.org
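What that looks like in practice (a minimal sketch against the real Tensor/nn API; the `TinyNet` model itself is made up for illustration):

```python
from tinygrad import Tensor, nn

class TinyNet:
  # a plain Python class: no nn.Module to subclass, no forward() to override
  def __init__(self):
    self.l1 = nn.Linear(784, 128)
    self.l2 = nn.Linear(128, 10)
  def __call__(self, x: Tensor) -> Tensor:
    # stateless ops like relu are Tensor methods, not layer classes
    return self.l2(self.l1(x).relu())

out = TinyNet()(Tensor.rand(1, 784))
```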

---

tinygrad is lazy

● Eager – operations happen when they run (PyTorch)
● Graph – operations happen after the graph is compiled (TensorFlow, torch.compile)
● Lazy – implicit graph: the simplicity of eager with the power of graph
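A minimal sketch of what lazy means in practice:

```python
from tinygrad import Tensor

a = Tensor.rand(64, 64)
b = Tensor.rand(64, 64)
c = (a @ b).relu()   # no computation yet: c is an unrealized graph of ops
print(c.numpy())     # realization: the graph is scheduled, compiled, and run
```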

---

The LazyBuffer graph

[graph: two LoadOps.CUSTOM buffers, (16, 3, 3, 3) and (16, 3, 64, 64), each viewed as (16, 1, 16, 62, 62, 3, 3, 3) with strides (0, 0, 27, 0, 0, 9, 3, 1) and (12288, 0, 0, 64, 1, 4096, 64, 1); BinaryOps.MUL, then ReduceOps.SUM down to (16, 1, 16, 62, 62, 1, 1, 1), then LoadOps.COPY to CLANG; K:1–K:4 mark the kernels]

● LoadOps.CUSTOM is Tensor.rand
● Green is a “view”
● A conv is two views, a MUL, and a SUM
● We copy back to the CPU (aka CLANG)
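The graph above corresponds to something like this (shapes taken from the slide; `conv2d` is the real Tensor method):

```python
from tinygrad import Tensor

w = Tensor.rand(16, 3, 3, 3)    # LoadOps.CUSTOM (Tensor.rand)
x = Tensor.rand(16, 3, 64, 64)  # LoadOps.CUSTOM (Tensor.rand)
out = x.conv2d(w)               # two views + BinaryOps.MUL + ReduceOps.SUM
print(out.shape)                # (16, 16, 62, 62): a 3x3 conv, no padding
out.numpy()                     # LoadOps.COPY back to the CPU
```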

---

The code (conv2d)

[code screenshot: an OpenCL kernel implementing a 3x3 conv]
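You can dump the generated kernel source yourself (`DEBUG` is a real tinygrad env var; 4 and above print the code the renderer emits):

```python
# DEBUG=4 python conv.py   <- prints the generated kernel source
import os; os.environ["DEBUG"] = "4"  # must be set before tinygrad is imported
from tinygrad import Tensor

Tensor.rand(16, 3, 64, 64).conv2d(Tensor.rand(16, 3, 3, 3)).realize()
```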

---

The UOps (conv2d)

[graph: the linearized UOps for the conv2d kernel: SPECIAL thread indices (gidx0 of 256, gidx1 and gidx2 of 62), RANGE loops over the reduce dimensions, DEFINE_GLOBAL input/output buffers, a DEFINE_ACC accumulator updated through LOAD/MUL/ADD/PHI, a final STORE, and a large amount of CONST/ALU integer index arithmetic (MUL, ADD, DIV, MOD)]

---

Slow?

● Problem: Tons of ops are spent on indexing
● Solution: compute multiple outputs (a chunk) in the kernel
● Question: what size chunk is optimal?
● Answer: search the possible kernels! (BEAM search)
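`BEAM` is the real env var that turns this search on; a minimal sketch of using it:

```python
# BEAM=2 python conv.py   <- search candidate kernels instead of using heuristics
import os; os.environ["BEAM"] = "2"   # must be set before tinygrad is imported
from tinygrad import Tensor

out = Tensor.rand(16, 3, 64, 64).conv2d(Tensor.rand(16, 3, 3, 3))
out.realize()  # each kernel is BEAM-searched at compile time, then cached
```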

---

The Optimized UOps (conv2d)

[graph: the same kernel after BEAM optimization: the global dimensions shrink (gidx1 and gidx2 are 31, gidx0 is 4), a local dimension lidx3 of 16 appears, the accumulators become vectorized dtypes._float2 values manipulated with DEFINE_ACC/CAST/GEP/PHI, index math is shared, and each thread performs multiple STOREs]

---

Philosophy of tinygrad

● Surface all complexity
  – Don’t rely on libraries, many of which are vendor-specific with quirks.
● No Turing-complete abstractions
  – This rules out LLVM; LLVM IR has thrown away too much information.
● Embrace “The Bitter Lesson”
  – There are many choices to be made; don’t spend time designing heuristics, use search.

---

Model training

Follow along with the MNIST tutorial on docs.tinygrad.org
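A condensed sketch in the shape of that tutorial (`mnist`, the optimizer, and `Tensor.train` are real tinygrad APIs; the model itself is illustrative):

```python
from tinygrad import Tensor, nn
from tinygrad.nn.datasets import mnist

X_train, Y_train, X_test, Y_test = mnist()

class Model:
  def __init__(self):
    self.c1 = nn.Conv2d(1, 32, kernel_size=5)
    self.l1 = nn.Linear(32 * 24 * 24, 10)
  def __call__(self, x: Tensor) -> Tensor:
    return self.l1(self.c1(x).relu().flatten(1))

model = Model()
opt = nn.optim.Adam(nn.state.get_parameters(model))

with Tensor.train():
  for step in range(100):
    samples = Tensor.randint(64, high=X_train.shape[0])  # random minibatch
    opt.zero_grad()
    loss = model(X_train[samples]).sparse_categorical_crossentropy(Y_train[samples]).backward()
    opt.step()
```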

---

What is @TinyJit? (DEBUG=2)

It captures the kernels that run and replays them with new data
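A minimal sketch (`TinyJit` is the real decorator; jitted functions are captured on the second call and replayed afterwards):

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
  return (x @ x).relu().realize()  # outputs should be realized inside the jit

step(Tensor.rand(16, 16))  # call 1: runs normally
step(Tensor.rand(16, 16))  # call 2: the kernels are captured
step(Tensor.rand(16, 16))  # call 3+: captured kernels replayed with new inputs
```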

---

What are CUDA Graphs?

● GPUs use command queues to execute kernels. They are what they sound like.
● Model training runs can be ~10,000 kernels.
● The CPU time spent enqueuing the kernels can exceed the GPU runtime.
● So... reuse the same command queue!

---

NV/AMD backends

● These backends replace the CUDA/HIP runtimes and speak directly with the kernel using ioctl.
● Aside from the assembler, no CUDA is used.

---

code walkthrough

---

Tensor Flow

● Tensor → LazyBuffer (function.py)
  – Forward/backward pass handled here
● LazyBuffer → LazyOp (scheduler.py)
  – Broken into Kernels here
● LazyOp → UOp (linearizer.py)
  – Generates kernel code in an LLVM-like IR
● UOp → Code (renderer)
  – This code is CUDA code or C code
● Code → /accelerator/ (runtime)

---

Code: tensor.py:Tensor

The main class. Its methods are the useful functions, and forward and backward are handled here. The lazydata property contains a LazyBuffer.
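A quick way to see the boundary (`lazydata` is the real attribute in this era's tinygrad; a sketch):

```python
from tinygrad import Tensor

t = Tensor.rand(2, 2) + 1  # t is a Tensor; nothing has been computed yet
print(type(t.lazydata))    # the LazyBuffer describing how to build it
```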

---

Code: function.py

Thanks to the chain rule, 28 hand-coded derivatives are all you need.
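function.py pairs each forward with the derivative the chain rule needs. A toy, self-contained illustration of that pattern (not tinygrad's actual code; tinygrad's versions operate on LazyBuffers):

```python
class Relu:
  """Toy forward/backward pair in the style of function.py."""
  def forward(self, x: list[float]) -> list[float]:
    self.x = x  # save the input for the backward pass
    return [max(v, 0.0) for v in x]
  def backward(self, grad_out: list[float]) -> list[float]:
    # chain rule: dL/dx = dL/dy * d(relu)/dx, and d(relu)/dx is 1 where x > 0
    return [g if v > 0 else 0.0 for v, g in zip(self.x, grad_out)]

f = Relu()
print(f.forward([-1.0, 2.0]))  # [0.0, 2.0]
print(f.backward([1.0, 1.0]))  # [0.0, 1.0]
```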

---

Code: lazy.py:LazyBuffer

The container of computation; it specifies how to construct the buffer. It sits below the forward/backward layer and can be constructed from the simple ops.

---

Code: ops.py

The 32 simple ops.
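The op families seen in the graphs above come from here (enum names as of this era's tinygrad.ops; a sketch, since the exact split has shifted between versions):

```python
from tinygrad.ops import UnaryOps, BinaryOps, TernaryOps, ReduceOps, LoadOps

# e.g. BinaryOps.MUL and ReduceOps.SUM from the conv2d graph
print(list(BinaryOps))
print(list(ReduceOps))
```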

---

Code: shape/shapetracker.py

● One of the pieces of tinygrad magic: all “movement” operations are tracked here.
● Reshape can create a “multiview” ShapeTracker, aka the length of the views tuple is > 1.

---

Code: shape/view.py

A view has a shape, strides, an offset, and a mask. This handles all pad, shrink, expand, permute, and stride + some reshapes.
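A sketch of what this tracking buys you (standard Tensor movement ops; no data moves until realization):

```python
from tinygrad import Tensor

a = Tensor.rand(4, 6)
b = a.permute(1, 0).reshape(3, 8).pad(((1, 1), (0, 0)))  # three movement ops
print(b.shape)  # (5, 8): only the ShapeTracker changed, zero kernels ran
b.realize()     # the view is applied inside whatever kernel finally reads it
```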

---

Throwback: conv2d

LOAD, MUL, SUM, and STORE are the Ops defining a Kernel.
There are two single-view ShapeTrackers for the inputs.

---

the tiny corp

---

A company in 2024

● We are a GitHub and a Discord.
● We raised $5M, and will be profitable this year by selling computers.
● “Remote” jobs are fine, but remote work begins to deconstruct what a job is.
● We are now 5 people, and hire exclusively from the pool of tinygrad contributors.
● A “collective”

---

Bounties

---

tinybox

Selling hardware that matches the main development platform...
...is ethical value capture

---

MLPerf

● As promised, we got AMD on MLPerf.
● tinybox green (6x 4090), ResNet-50, 122 minutes
● tinybox red (6x 7900XTX), ResNet-50, 167 minutes
● Done using tinygrad, with none of the ML libraries from either company.
● Our next submission will use none of their userspace.

---

Where we are going

1) Build the best training framework for NVIDIA/AMD/Intel/Qualcomm/etc.
2) Capture all existing chips in a generic framework. Search for the best possible chip given a set of tasks.
3) Build that chip. Sell chips and build clouds at the task abstraction, not the computer abstraction.

---

How to join tiny

● Permissionless company! (who has read ?s doc)
● Skills are all that matters
● We don’t discriminate against silicon-based life

---

live coding...