add slides from code europe to docs

This commit is contained in:
George Hotz 2024-06-12 14:35:08 +02:00
parent 9a3c1e4a17
commit 828c98d5c4
1 changed file with 0 additions and 0 deletions

docs/tinygrad_intro.pdf Normal file

@@ -0,0 +1,319 @@
tinygrad: from MNIST to ALUs
What is tinygrad?
● A neural network framework
● Pure Python (seriously)
● Very small (<8000 lines)
● Yet fully functional
The tinygrad stack
[Stack diagram: tinygrad → assembler → kernel]
Almost no dependencies => it's easy to port new accelerators
Why a new framework?
● To commoditize the petaflop
● The graveyard of AI chip
companies is big.
● To be successful with your
chip, you must be able to
create your own stack
A torch-like frontend
● No `nn.Module` class
● No `forward`
● No classes for stateless operations
● Many Tensor methods (see docs.tinygrad.org)
tinygrad is lazy
● Eager: operations happen when they run (PyTorch)
● Graph: operations happen after the graph is compiled (TensorFlow, torch.compile)
● Lazy: an implicit graph, the simplicity of eager with the power of graph (see the sketch below)
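A minimal sketch of the lazy behavior (a toy script; `Tensor`, `lazydata`, and `numpy()` are real tinygrad API, though the printed repr varies by version):

```python
from tinygrad import Tensor

a = Tensor([1.0, 2.0, 3.0])
b = (a * 2 + 1).sum()  # nothing has executed yet: this only builds the lazy graph
print(b.lazydata)      # the LazyBuffer graph behind the Tensor
print(b.numpy())       # realization: the graph is compiled and run as fused kernels
```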
The LazyBuffer graph
[Graph diagram: the lazy graph for a conv2d — two LoadOps.CUSTOM inputs with shapes (16, 3, 3, 3) and (16, 3, 64, 64), two (16, 1, 16, 62, 62, 3, 3, 3) views of them, a BinaryOps.MUL, a ReduceOps.SUM down to (16, 1, 16, 62, 62, 1, 1, 1), and a LoadOps.COPY to CLANG]
● LoadOps.CUSTOM is Tensor.rand
● Green is a “view”
● A conv is two views, a MUL, and a SUM
● We copy back to the CPU (aka CLANG)
The code (conv2d)
An OpenCL kernel implementing a 3x3 conv
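A sketch of how to reproduce this (the shapes come from the graph above; DEBUG is a real tinygrad environment variable, though the exact output format varies by version):

```python
# run as e.g. `DEBUG=4 python conv.py` to print the generated kernel source
from tinygrad import Tensor

x = Tensor.rand(16, 3, 64, 64)  # input: batch 16, 3 channels, 64x64
w = Tensor.rand(16, 3, 3, 3)    # 16 output channels, 3x3 kernel
out = x.conv2d(w)               # lazily builds the two views, the MUL, and the SUM
out.realize()                   # schedules and runs the conv kernel
```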
The UOps (conv2d)
[Graph diagram: the UOps for the conv kernel — DEFINE_GLOBAL buffer pointers, SPECIAL grid indices (gidx0, gidx1, gidx2), RANGE loops for the 3x3x3 reduce, integer index math built from CONST and ALU ops (MUL, ADD, DIV, MOD), LOADs, a DEFINE_ACC/PHI float accumulator, and a final STORE]
Slow?
● Problem: tons of ops are spent on indexing
● Solution: compute multiple outputs (a chunk) in the kernel
● Question: what size chunk is optimal?
● Answer: search the possible kernels!
BEAM search
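A sketch of turning the search on (BEAM is a real tinygrad environment variable giving the beam width; searched kernels are cached, so later runs reuse the winners):

```python
# run as e.g. `BEAM=2 python conv.py`, or set the variable before importing tinygrad
import os
os.environ["BEAM"] = "2"

from tinygrad import Tensor

out = Tensor.rand(16, 3, 64, 64).conv2d(Tensor.rand(16, 3, 3, 3))
out.realize()  # each kernel's optimization space is beam-searched before compiling
```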
The Optimized UOps (conv2d)
[Graph diagram: the optimized UOps for the same conv — the kernel now computes a chunk of outputs at once, so the graph gains float2 vector dtypes with CAST and GEP element extraction, several DEFINE_ACC/PHI accumulators, and multiple STOREs, spending far fewer indexing ops per output]
Philosophy of tinygrad
● Surface all complexity
Don't rely on libraries, many of which are vendor specific with quirks.
● No Turing complete abstractions
Rules out use of LLVM; LLVM IR has thrown away too much information.
● Embrace “The Bitter Lesson”
There are many choices to be made; don't spend time designing heuristics, use search.
Model training
Follow along with the MNIST tutorial on docs.tinygrad.org
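A sketch of the kind of model the tutorial builds (the layer sizes here are illustrative, sized for 28x28 MNIST images). Note the frontend slide above: no `nn.Module`, no `forward`, just a plain class with `__call__`.

```python
from tinygrad import Tensor, nn

class Model:
  def __init__(self):
    self.l1 = nn.Conv2d(1, 32, kernel_size=(3, 3))
    self.l2 = nn.Conv2d(32, 64, kernel_size=(3, 3))
    self.l3 = nn.Linear(1600, 10)

  def __call__(self, x: Tensor) -> Tensor:
    x = self.l1(x).relu().max_pool2d((2, 2))
    x = self.l2(x).relu().max_pool2d((2, 2))
    return self.l3(x.flatten(1))
```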
What is @TinyJit (DEBUG=2)
It captures the kernels that are run and replays them with new data
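A minimal TinyJit sketch (TinyJit is importable from tinygrad; jitted functions should return realized Tensors):

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
  return (x @ x).relu().realize()

for _ in range(5):
  # the first calls run normally and capture the kernels;
  # later calls replay the captured kernels with the new input data
  out = step(Tensor.rand(4, 4))
```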
What are CUDA Graphs?
● GPUs use command queues to execute kernels. They are what they sound like.
● Model training runs can be ~10,000 kernels.
● The CPU time spent enqueuing the kernels can exceed the GPU runtime.
● So... reuse the same command queue!
NV/AMD backends
● These backends replace the CUDA/HIP runtimes and speak directly with the kernel using ioctl.
● Aside from the assembler, no CUDA is used.
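A sketch of selecting these backends (Device is importable from tinygrad; the NV and AMD devices need the matching hardware, so treat this as illustrative):

```python
from tinygrad import Tensor, Device

print(Device.DEFAULT)  # e.g. NV, AMD, CUDA, CLANG, ... depending on the machine

# place a Tensor on the ioctl-level NVIDIA backend (requires NVIDIA hardware)
a = Tensor.ones(4, 4, device="NV")
print((a + a).numpy())
```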
code walkthrough
Tensor Flow
● Tensor → LazyBuffer (function.py)
Forward/backward pass handled here
● LazyBuffer → LazyOp (scheduler.py)
Breaking into Kernels here
● LazyOp → UOp (linearizer.py)
Generate kernel code in an LLVM-like IR
● UOp → Code (renderer)
This code is CUDA code or C code
● Code → /accelerator/ (runtime)
Code: tensor.py:Tensor
The main class. Its methods are the user-facing functions; forward and backward are handled here. The lazydata property contains a LazyBuffer.
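A sketch (`lazydata` is a real Tensor property; its printed repr varies by version):

```python
from tinygrad import Tensor

t = Tensor([[1.0, 2.0], [3.0, 4.0]])
print(t.lazydata)              # the LazyBuffer this Tensor wraps
print(t.relu().sum().numpy())  # methods like relu/sum are the user-facing functions
```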
Code: function.py
Thanks to the chain rule, 28 derivatives are all you need to handcode
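A simplified illustration of the pattern (not the actual function.py code, which works on LazyBuffers rather than Tensors): each op defines a forward and, via the chain rule, a backward that maps the output gradient to input gradients.

```python
from tinygrad import Tensor

class Relu:  # hypothetical, simplified stand-in for the real Function class
  def forward(self, x: Tensor) -> Tensor:
    self.ret = x.maximum(0)
    return self.ret

  def backward(self, grad_output: Tensor) -> Tensor:
    # chain rule: dL/dx = dL/dy * dy/dx, and relu's dy/dx is 1 where x > 0
    return (self.ret > 0).where(grad_output, 0)
```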
Code: lazy.py:LazyBuffer
The container of computation; it specifies how to construct the buffer. Below the forward/backward layer, it can be constructed from simple ops.
Code: ops.py
The 32 simple ops.
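A sketch of peeking at them (the enums live in tinygrad.ops; exact membership varies by version, and the graphs above used several of these):

```python
# the LazyBuffer graph above used LoadOps.CUSTOM/COPY, BinaryOps.MUL, ReduceOps.SUM
from tinygrad.ops import UnaryOps, BinaryOps, TernaryOps, ReduceOps, LoadOps

for ops in (UnaryOps, BinaryOps, TernaryOps, ReduceOps, LoadOps):
  print(ops.__name__, [op.name for op in ops])
```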
Code: shape/shapetracker.py
● One of the pieces of tinygrad magic, all
“movement” operations are tracked here.
● Reshape can create “multiview” ShapeTracker,
aka the length of the views tuple is > 1
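A sketch of a multiview ShapeTracker (ShapeTracker.from_shape, permute, and reshape are real; whether a given reshape merges into one view depends on the strides):

```python
from tinygrad.shape.shapetracker import ShapeTracker

st = ShapeTracker.from_shape((4, 4))
st = st.permute((1, 0))  # a movement op: just changes strides, no data moves
st = st.reshape((16,))   # can't be expressed with a single set of strides...
print(len(st.views))     # ...so the views tuple grows to 2: a "multiview"
```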
Code: shape/view.py
A view has a shape, strides, an offset, and a mask. This handles all pad, shrink, expand, permute, and stride + some reshapes.
Throwback: conv2d
LOAD, MUL, SUM, STORE are Ops defining a Kernel
There are two single-view ShapeTrackers for the inputs
the tiny corp
A company in 2024
● We are a GitHub and a Discord.
● We raised $5M, and will be profitable this year by selling computers.
● “remote” jobs are fine, but it begins to deconstruct what a job is.
● We are now 5 people, and hire exclusively from the pool of tinygrad contributors.
● “collective”
Bounties
tinybox
Selling hardware that matches the main development platform...
...is ethical value capture
MLPerf
● As promised, we got AMD on MLPerf.
● tinybox green (6x 4090), ResNet-50, 122 minutes
● tinybox red (6x 7900XTX), ResNet-50, 167 minutes
● Done using tinygrad, none of the ML libraries from either company.
● Our next submission will use none of the userspace.
Where we are going
1) Build the best training framework for NVIDIA/AMD/Intel/Qualcomm/etc.
2) Capture all existing chips in a generic framework. Search for the best possible chip given a set of tasks.
3) Build that chip. Sell chips and build clouds at the task abstraction, not the computer abstraction.
How to join tiny
● Permissionless company! (who has read ?s doc)
● Skills are all that matters
● We don't discriminate against silicon-based life
live coding...