mirror of https://github.com/commaai/tinygrad.git
Add a quick start guide (#900)
* feat: initial quick start guide
* fix: fix link
* feat: add note about jit
* feat: add note about load/store ops
* feat: add link to discord
* feat: add note about saving and loading models
* fix: correct code for saving and loading
* feat: overhaul docs
* fix: fix link
* feat: wording
* feat: add link to discord
* feat: contributing guidelines
* feat: make contributing section more doc focused
* feat: add link to env_vars from readme
* fix: wording
* feat: move community to bottom
* feat: showcase
* feat: linebreak
* feat: redesigned header
* feat: tweaks
* feat: tweaks
* feat: badge for lines of code
* feat: move installation instructions to repo readme
* feat: readme overhaul number 2
* feat: move visualization to quick start guide
* feat: readme 2 electric boogaloo
* fix: grammar
* fix: formatting
* feat: no ugly line
* feat: add line back
* feat: new load method
* feat: split adding accelerator docs out
* feat: showcase whisper
* feat: smaller tweaks
* feat: bring back oneliner
This commit is contained in:
parent d429553730
commit e9c1ae3825

@ -0,0 +1,4 @@
*
!*/

!tinygrad/**

244
README.md
@ -1,91 +1,57 @@
|
||||||
<p align="center">
|
<div align="center">
|
||||||
<img src="https://raw.githubusercontent.com/geohot/tinygrad/master/docs/logo.png">
|
|
||||||
</p>
|
|
||||||
|
|
||||||
--------------------------------------------------------------------
|
[![logo](https://raw.githubusercontent.com/geohot/tinygrad/master/docs/logo.png)](https://tinygrad.org)
|
||||||
|
|
||||||
![Unit Tests](https://github.com/geohot/tinygrad/workflows/Unit%20Tests/badge.svg)
|
tinygrad: For something between [PyTorch](https://github.com/pytorch/pytorch) and [karpathy/micrograd](https://github.com/karpathy/micrograd). Maintained by [tiny corp](https://tinygrad.org).
|
||||||
|
|
||||||
[![tinygrad discord](https://discordapp.com/api/guilds/1068976834382925865/widget.png?style=banner2)](https://discord.gg/ZjZadyC7PK)
|
<h3>
|
||||||
|
|
||||||
For something in between a [pytorch](https://github.com/pytorch/pytorch) and a [karpathy/micrograd](https://github.com/karpathy/micrograd)
|
[Homepage](https://github.com/geohot/tinygrad) | [Documentation](/docs) | [Examples](/examples) | [Showcase](/docs/showcase.md) | [Discord](https://discord.gg/ZjZadyC7PK)
|
||||||
|
|
||||||
|
</h3>
|
||||||
|
|
||||||
|
[![GitHub Repo stars](https://img.shields.io/github/stars/geohot/tinygrad)](https://github.com/geohot/tinygrad/stargazers)
|
||||||
|
[![Unit Tests](https://github.com/geohot/tinygrad/actions/workflows/test.yml/badge.svg)](https://github.com/geohot/tinygrad/actions/workflows/test.yml)
|
||||||
|
[![Discord](https://img.shields.io/discord/1068976834382925865)](https://discord.gg/ZjZadyC7PK)
|
||||||
|
[![Lines of code](https://img.shields.io/tokei/lines/github/geohot/tinygrad)](https://github.com/geohot/tinygrad)
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
This may not be the best deep learning framework, but it is a deep learning framework.
|
This may not be the best deep learning framework, but it is a deep learning framework.
|
||||||
|
|
||||||
The sub 1000 line core of it is in `tinygrad/`
|
Due to its extreme simplicity, it aims to be the easiest framework to add new accelerators to, with support for both inference and training.
|
||||||
|
|
||||||
Due to its extreme simplicity, it aims to be the easiest framework to add new accelerators to, with support for both inference and training. Support the simple basic ops, and you get SOTA [vision](https://arxiv.org/abs/1905.11946) `models/efficientnet.py` and [language](https://arxiv.org/abs/1706.03762) `models/transformer.py` models.
|
Eventually, we will have a [tinygrad accelerator](https://geohot.github.io/blog/jekyll/update/2021/06/13/a-breakdown-of-ai-chip-companies.html), then tinygrad will be ***fast***. But, for now, it is slow.
|
||||||
|
|
||||||
We are working on support for the Apple Neural Engine and the Google TPU in the `accel/` folder. Eventually, [we will build custom hardware](https://geohot.github.io/blog/jekyll/update/2021/06/13/a-breakdown-of-ai-chip-companies.html) for tinygrad, and it will be blindingly fast. Now, it is slow.
|
## Features
|
||||||
|
|
||||||
This project is maintained by [tiny corp](https://tinygrad.org/).
|
### LLaMA and Stable Diffusion
|
||||||
|
|
||||||
### Installation
|
tinygrad can run [LLaMA](/docs/showcase.md#llama) and [Stable Diffusion](/docs/showcase.md#stable-diffusion)!
|
||||||
|
|
||||||
```bash
|
### Laziness
|
||||||
git clone https://github.com/geohot/tinygrad.git
|
|
||||||
cd tinygrad
|
|
||||||
python3 -m pip install -e .
|
|
||||||
```
|
|
||||||
|
|
||||||
### Contributing
|
|
||||||
|
|
||||||
There's a lot of interest in tinygrad lately. Here's some guidelines for contributing:
|
|
||||||
|
|
||||||
* Bugfixes are the best and always welcome! Like [this one](https://github.com/geohot/tinygrad/pull/421/files).
|
|
||||||
* If you don't understand the code you are changing, don't change it!
|
|
||||||
* All code golf PRs will be closed, but [conceptual cleanups](https://github.com/geohot/tinygrad/pull/372/files) are great.
|
|
||||||
* Features are welcome. Though if you are adding a feature, you need to include tests.
|
|
||||||
* Improving test coverage is great, with reliable non brittle tests.
|
|
||||||
|
|
||||||
### Example
|
|
||||||
|
|
||||||
```python
|
|
||||||
from tinygrad.tensor import Tensor
|
|
||||||
|
|
||||||
x = Tensor.eye(3, requires_grad=True)
|
|
||||||
y = Tensor([[2.0,0,-2.0]], requires_grad=True)
|
|
||||||
z = y.matmul(x).sum()
|
|
||||||
z.backward()
|
|
||||||
|
|
||||||
print(x.grad.numpy()) # dz/dx
|
|
||||||
print(y.grad.numpy()) # dz/dy
|
|
||||||
```
|
|
||||||
|
|
||||||
### Same example in torch
|
|
||||||
|
|
||||||
```python
|
|
||||||
import torch
|
|
||||||
|
|
||||||
x = torch.eye(3, requires_grad=True)
|
|
||||||
y = torch.tensor([[2.0,0,-2.0]], requires_grad=True)
|
|
||||||
z = y.matmul(x).sum()
|
|
||||||
z.backward()
|
|
||||||
|
|
||||||
print(x.grad) # dz/dx
|
|
||||||
print(y.grad) # dz/dy
|
|
||||||
```
|
|
||||||
|
|
||||||
## Is tinygrad fast?
|
|
||||||
|
|
||||||
Try a matmul. See how, despite the style, it is fused into one kernel with the power of laziness.
|
Try a matmul. See how, despite the style, it is fused into one kernel with the power of laziness.
|
||||||
|
|
||||||
```python
|
```sh
|
||||||
DEBUG=3 OPTLOCAL=1 python3 -c "from tinygrad.tensor import Tensor;
|
DEBUG=3 OPTLOCAL=1 python3 -c "from tinygrad.tensor import Tensor;
|
||||||
N = 1024; a, b = Tensor.randn(N, N), Tensor.randn(N, N);
|
N = 1024; a, b = Tensor.randn(N, N), Tensor.randn(N, N);
|
||||||
c = (a.reshape(N, 1, N) * b.permute(1,0).reshape(1, N, N)).sum(axis=2);
|
c = (a.reshape(N, 1, N) * b.permute(1,0).reshape(1, N, N)).sum(axis=2);
|
||||||
print((c.numpy() - (a.numpy() @ b.numpy())).mean())"
|
print((c.numpy() - (a.numpy() @ b.numpy())).mean())"
|
||||||
```
|
```
|
||||||
|
|
||||||
Change to `DEBUG=4` to see the generated code.
|
And we can change `DEBUG` to `4` to see the generated code.
|
||||||
|
|
||||||
## Neural networks?
|
### Neural networks
|
||||||
|
|
||||||
It turns out, a decent autograd tensor library is 90% of what you need for neural networks. Add an optimizer (SGD, Adam, AdamW implemented) from tinygrad.nn.optim, write some boilerplate minibatching code, and you have all you need.
|
As it turns out, 90% of what you need for neural networks are a decent autograd/tensor library.
|
||||||
|
Throw in an optimizer, a data loader, and some compute, and you have all you need.
|
||||||
|
|
||||||
### Neural network example (from test/models/test_mnist.py)
|
#### Neural network example (from test/models/test_mnist.py)
|
||||||
|
|
||||||
```python
|
```py
|
||||||
from tinygrad.tensor import Tensor
|
from tinygrad.tensor import Tensor
|
||||||
import tinygrad.nn.optim as optim
|
import tinygrad.nn.optim as optim
|
||||||
|
|
||||||
|
@ -100,7 +66,7 @@ class TinyBobNet:
|
||||||
model = TinyBobNet()
|
model = TinyBobNet()
|
||||||
optim = optim.SGD([model.l1, model.l2], lr=0.001)
|
optim = optim.SGD([model.l1, model.l2], lr=0.001)
|
||||||
|
|
||||||
# ... and complete like pytorch, with (x,y) data
|
# ... complete data loader here
|
||||||
|
|
||||||
out = model.forward(x)
|
out = model.forward(x)
|
||||||
loss = out.mul(y).mean()
|
loss = out.mul(y).mean()
|
||||||
|
@ -109,114 +75,86 @@ loss.backward()
|
||||||
optim.step()
|
optim.step()
|
||||||
```
|
```
|
||||||
|
|
||||||
## GPU and Accelerator Support
|
## Accelerators
|
||||||
|
|
||||||
tinygrad supports GPUs through PyOpenCL.
|
tinygrad already supports numerous accelerators, including:
|
||||||
|
|
||||||
```python
|
- [x] CPU
|
||||||
|
- [x] GPU (OpenCL)
|
||||||
|
- [x] C Code (Clang)
|
||||||
|
- [x] LLVM
|
||||||
|
- [x] METAL
|
||||||
|
- [x] CUDA
|
||||||
|
- [x] Triton
|
||||||
|
- [x] PyTorch
|
||||||
|
|
||||||
|
And it is easy to add more! Your accelerator of choice only needs to support a total of 20 (optionally 21) low level ops.
|
||||||
|
More information can be found in the [documentation for adding new accelerators](/docs/adding_new_accelerators.md).
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
The current recommended way to install tinygrad is from source.
|
||||||
|
|
||||||
|
### From source
|
||||||
|
|
||||||
|
```sh
|
||||||
|
git clone https://github.com/geohot/tinygrad.git
|
||||||
|
cd tinygrad
|
||||||
|
python3 -m pip install -e . # or `py3 -m pip install -e .` if you are on windows
|
||||||
|
```
|
||||||
|
Don't forget the `.` at the end!
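
As a quick sanity check that the editable install worked, any small tensor expression will do, for example:

```sh
python3 -c "from tinygrad.tensor import Tensor; print((Tensor.ones(2,2) + Tensor.ones(2,2)).numpy())"
```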
|
||||||
|
|
||||||
|
## Documentation
|
||||||
|
|
||||||
|
Documentation along with a quick start guide can be found in the [docs/](/docs) directory.
|
||||||
|
|
||||||
|
### Quick example comparing to PyTorch
|
||||||
|
|
||||||
|
```py
|
||||||
from tinygrad.tensor import Tensor
|
from tinygrad.tensor import Tensor
|
||||||
(Tensor.ones(4,4).gpu() + Tensor.ones(4,4).gpu()).cpu()
|
|
||||||
|
x = Tensor.eye(3, requires_grad=True)
|
||||||
|
y = Tensor([[2.0,0,-2.0]], requires_grad=True)
|
||||||
|
z = y.matmul(x).sum()
|
||||||
|
z.backward()
|
||||||
|
|
||||||
|
print(x.grad.numpy()) # dz/dx
|
||||||
|
print(y.grad.numpy()) # dz/dy
|
||||||
```
|
```
|
||||||
|
|
||||||
### hlops (in tensor.py)
|
The same thing but in PyTorch:
|
||||||
|
```py
|
||||||
|
import torch
|
||||||
|
|
||||||
hlops are syntactic sugar around mlops. They support most things torch does.
|
x = torch.eye(3, requires_grad=True)
|
||||||
|
y = torch.tensor([[2.0,0,-2.0]], requires_grad=True)
|
||||||
|
z = y.matmul(x).sum()
|
||||||
|
z.backward()
|
||||||
|
|
||||||
### mlops
|
print(x.grad.numpy()) # dz/dx
|
||||||
|
print(y.grad.numpy()) # dz/dy
|
||||||
mlops are mid level ops. They understand derivatives. They are very simple.
|
|
||||||
|
|
||||||
```
|
|
||||||
Relu, Log, Exp, Sin # unary ops
|
|
||||||
Sum, Max # reduce ops (with axis argument)
|
|
||||||
Maximum, Add, Sub, Mul, Pow, Div, Equal # binary ops (no broadcasting, use expand)
|
|
||||||
Expand, Reshape, Permute, Pad, Shrink, Flip # movement ops
|
|
||||||
```
|
```
|
||||||
|
|
||||||
You no longer need to write mlops for a new accelerator
|
## Contributing
|
||||||
|
|
||||||
### Adding an accelerator (llops)
|
There has been a lot of interest in tinygrad lately. Here are some basic guidelines for contributing:
|
||||||
|
|
||||||
The autodiff stuff is all in mlops now so you can focus on the raw operations
|
- Bug fixes are the best and always welcome! Like [this one](https://github.com/geohot/tinygrad/pull/421/files).
|
||||||
|
- If you don't understand the code you are changing, don't change it!
|
||||||
|
- All code golf PRs will be closed, but [conceptual cleanups](https://github.com/geohot/tinygrad/pull/372/files) are great.
|
||||||
|
- Features are welcome. Though if you are adding a feature, you need to include tests.
|
||||||
|
- Improving test coverage is great, with reliable non-brittle tests.
|
||||||
|
|
||||||
```
|
Additional guidelines can be found in [CONTRIBUTING.md](/CONTRIBUTING.md).
|
||||||
Buffer # class of memory on this device
|
|
||||||
unary_op (NOOP, EXP, LOG, CAST, SIN) # A -> A
|
|
||||||
reduce_op (SUM, MAX) # A -> B (smaller size, B has 1 in shape)
|
|
||||||
binary_op (ADD, SUB, MUL, DIV, POW, CMPEQ, MAX) # A + A -> A (all the same size)
|
|
||||||
movement_op (EXPAND, RESHAPE, PERMUTE, PAD, SHRINK, STRIDE) # A -> B (different size)
|
|
||||||
fused_op [[optional]] (MULACC) # A * A -> B
|
|
||||||
```
|
|
||||||
|
|
||||||
## ImageNet inference
|
|
||||||
|
|
||||||
Despite being tiny, tinygrad supports the full EfficientNet. Pass in a picture to discover what it is.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python3 examples/efficientnet.py https://media.istockphoto.com/photos/hen-picture-id831791190
|
|
||||||
```
|
|
||||||
|
|
||||||
Or, if you have a webcam and cv2 installed
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python3 examples/efficientnet.py webcam
|
|
||||||
```
|
|
||||||
|
|
||||||
PROTIP: Set "DEBUG=2" environment variable if you want to see why it's slow.
|
|
||||||
|
|
||||||
### tinygrad supports Stable Diffusion!
|
|
||||||
|
|
||||||
You might need to download the [weight](https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt) of Stable Diffusion and put it into weights/
|
|
||||||
|
|
||||||
Run `python3 examples/stable_diffusion.py`
|
|
||||||
|
|
||||||
<p align="center">
|
|
||||||
<img src="https://raw.githubusercontent.com/geohot/tinygrad/master/docs/stable_diffusion_by_tinygrad.jpg">
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<p align="center">
|
|
||||||
"a horse sized cat eating a bagel"
|
|
||||||
</p>
|
|
||||||
|
|
||||||
### tinygrad supports LLaMA
|
|
||||||
|
|
||||||
After putting the weights in weights/LLaMA, you can have a chat with Stacy. She lives inside tinygrad.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python3 examples/llama.py
|
|
||||||
```
|
|
||||||
|
|
||||||
### tinygrad supports GANs
|
|
||||||
|
|
||||||
See `examples/mnist_gan.py`
|
|
||||||
|
|
||||||
<p align="center">
|
|
||||||
<img src="https://raw.githubusercontent.com/geohot/tinygrad/master/docs/mnist_by_tinygrad.jpg">
|
|
||||||
</p>
|
|
||||||
|
|
||||||
### tinygrad supports yolo
|
|
||||||
|
|
||||||
See `examples/yolov3.py`
|
|
||||||
|
|
||||||
<p align="center">
|
|
||||||
<img src="https://raw.githubusercontent.com/geohot/tinygrad/master/docs/yolo_by_tinygrad.jpg">
|
|
||||||
</p>
|
|
||||||
|
|
||||||
### Drawing Execution Graph
|
|
||||||
|
|
||||||
```bash
|
|
||||||
GRAPH=1 python3 test/models/test_mnist.py TestMNIST.test_sgd_onestep
|
|
||||||
# requires dot, outputs /tmp/net.svg
|
|
||||||
```
|
|
||||||
|
|
||||||
### Running tests
|
### Running tests
|
||||||
|
|
||||||
For more examples on how to run the full test suite please refer to the [CI workflow](.github/workflows/test.yml).
|
For more examples on how to run the full test suite please refer to the [CI workflow](.github/workflows/test.yml).
|
||||||
|
|
||||||
```bash
|
Some examples:
|
||||||
|
```sh
|
||||||
python3 -m pip install -e '.[testing]'
|
python3 -m pip install -e '.[testing]'
|
||||||
python3 -m pytest
|
python3 -m pytest
|
||||||
python3 -m pytest -v -k TestTrain
|
python3 -m pytest -v -k TestTrain
|
||||||
python3 ./test/models/test_train.py TestTrain.test_efficientnet
|
python3 ./test/models/test_train.py TestTrain.test_efficientnet
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
126
docs/README.md
|
@ -1,125 +1,37 @@
|
||||||
### Welcome to the tinygrad documentation
|
# Welcome to the tinygrad documentation!
|
||||||
|
|
||||||
General instructions you will find in [README.md](https://github.com/geohot/tinygrad/blob/master/README.md)
|
Here you will find documentation for tinygrad, as well as some examples and tutorials.
|
||||||
|
|
||||||
[abstraction.py](https://github.com/geohot/tinygrad/blob/master/docs/abstractions.py) is a well documented showcase of the abstraction stack.
|
## Getting Started
|
||||||
|
|
||||||
There are plenty of [tests](https://github.com/geohot/tinygrad/tree/master/test) you can read through
|
Read the quick start guide [here](/docs/quickstart.md).
|
||||||
[Examples](https://github.com/geohot/tinygrad/tree/master/examples) contains tinygrad implementations of popular models (vision and language) and neural networks. LLama, Stable diffusion, GANs and Yolo to name a few
|
|
||||||
|
|
||||||
### Environment variables
|
Or if you want to jump right in to how tinygrad works, you can read the [abstraction stack](/docs/abstractions.py) documentation.
|
||||||
Here is a list of environment variables you can use with tinygrad.
|
|
||||||
Most of these are self-explanatory, and used to enable an option at runtime.
|
|
||||||
Example : `GPU=1 DEBUG=4 python3 -m pytest`
|
|
||||||
|
|
||||||
The columns are: Variable, Value and Description
|
Or if you want to see some examples, you can look at the examples in the [examples](/examples) directory.
|
||||||
They are also grouped into either general tinygrad or specific files
|
|
||||||
|
|
||||||
##### General tinygrad
|
Or if you just want to see some of the things tinygrad can do, check out the [showcase](/docs/showcase.md).
|
||||||
DEBUG: [1-4], enable debugging output, with 4 you get operations, timings, speed, generated code and more
|
|
||||||
GPU: [1], enable the GPU backend
|
|
||||||
CPU: [1], enable CPU backend
|
|
||||||
MPS: [1], enable MPS device (for Mac M1 and after)
|
|
||||||
METAL: [1], enable Metal backend (for Mac M1 and after)
|
|
||||||
METAL_XCODE: [1], enable Metal using MacOS Xcode sdk
|
|
||||||
TORCH: [1], enable Torch backend
|
|
||||||
CLANG: [1], enable Clang backend
|
|
||||||
LLVM: [1], enable LLVM backend
|
|
||||||
LLVMOPT: [1], enable LLVM optimization
|
|
||||||
LAZY: [1], enable lazy operations
|
|
||||||
OPT: [1-4], enable optimization
|
|
||||||
OPTLOCAL: [1], enable local optimization
|
|
||||||
JIT: [1], enable Jit
|
|
||||||
GRAPH: [1], Create a graph of all operations
|
|
||||||
GRAPHPATH: [/path/to], what path to generate the graph image
|
|
||||||
PRUNEGRAPH, [1], prune movementops and loadops from the graph
|
|
||||||
PRINT_PRG: [1], print program
|
|
||||||
FLOAT16: [1], use float16 instead of float32
|
|
||||||
ENABLE_METHOD_CACHE: [1], enable method cache
|
|
||||||
EARLY_STOPPING: [1], stop early
|
|
||||||
DISALLOW_ASSIGN: [1], enable not assigning the realized lazydata to the lazy output buffer
|
|
||||||
|
|
||||||
##### tinygrad/codegen/cstyle.py
|
## API
|
||||||
NATIVE_EXPLOG: [1], enable using native explog
|
|
||||||
|
|
||||||
##### accel/ane/2_compile/hwx_parse.py
|
This is currently a big work in progress.
|
||||||
PRINTALL: [1], print all ane registers
|
|
||||||
|
|
||||||
##### extra/onnx.py
|
## Resources
|
||||||
ONNXLIMIT: [ ], set a limit for Onnx
|
|
||||||
DEBUGONNX: [1], enable Onnx debugging
|
|
||||||
|
|
||||||
##### extra/thneed.py
|
### Environment Variables
|
||||||
DEBUGCL: [1-4], enable Debugging for OpenCL
|
|
||||||
PRINT_KERNEL: [1], Print OpenCL Kernels
|
|
||||||
|
|
||||||
##### extra/kernel_search.py
|
[env_vars.md](/docs/env_vars.md)
|
||||||
OP: [1-3], different operations
|
|
||||||
NOTEST: [1], enable not testing ast
|
|
||||||
DUMP: [1], enable dumping of intervention cache
|
|
||||||
REDUCE: [1], enable reduce operations
|
|
||||||
SIMPLE_REDUCE: [1], enable simpler reduce operations
|
|
||||||
BC: [1], enable big conv operations
|
|
||||||
CONVW: [1], enable convw operations
|
|
||||||
FASTCONV: [1], enable faster conv operations
|
|
||||||
GEMM: [1], enable general matrix multiply operations
|
|
||||||
BROKEN: [1], enable a kind of operation
|
|
||||||
BROKEN3: [1], enable a kind of operation
|
|
||||||
|
|
||||||
##### examples/vit.py
|
### Adding New Accelerators
|
||||||
LARGE: [1], enable larger dimension model
|
|
||||||
|
|
||||||
##### examples/llama.py
|
[adding_new_accelerators.md](/docs/adding_new_accelerators.md)
|
||||||
WEIGHTS: [1], enable using weights
|
|
||||||
|
|
||||||
##### examples/mlperf
|
### Community
|
||||||
MODEL: [resnet,retinanet,unet3d,rnnt,bert,maskrcnn], what models to use
|
|
||||||
|
|
||||||
##### examples/benchmark_train_efficientnet.py
|
[![tinygrad discord](https://discordapp.com/api/guilds/1068976834382925865/widget.png?style=banner2)](https://discord.gg/ZjZadyC7PK)
|
||||||
CNT: [10], the amount of times to loop the benchmark
|
|
||||||
BACKWARD: [1], enable backward call
|
|
||||||
TRAINING: [1], set Tensor.training
|
|
||||||
CLCACHE: [1], enable Cache for OpenCL
|
|
||||||
|
|
||||||
##### examples/hlb_cifar10.py
|
## Contributing
|
||||||
TORCHWEIGHTS: [1], use torch to initialize weights
|
|
||||||
DISABLE_BACKWARD: [1], dont use backward operations
|
|
||||||
|
|
||||||
##### examples/benchmark_train_efficientnet.py & examples/hlb_cifar10.py
|
The documentation mainly follows the core contributing guidelines in the [README.md](/README.md#contributing).
|
||||||
ADAM: [1], enable Adam optimization
|
|
||||||
|
|
||||||
##### examples/hlb_cifar10.py & examples/hlb_cifar10_torch.py
|
|
||||||
STEPS: [0-10], number of steps
|
|
||||||
FAKEDATA: [1], enable to use random data
|
|
||||||
|
|
||||||
##### examples/train_efficientnet.py
|
|
||||||
STEPS: [1024 dividable], number of steps
|
|
||||||
TINY: [1], use a tiny convolution network
|
|
||||||
IMAGENET: [1], use imagenet for training
|
|
||||||
|
|
||||||
##### examples/train_efficientnet.py & examples/train_resnet.py
|
|
||||||
TRANSFER: [1], enable to use pretrained data
|
|
||||||
|
|
||||||
##### examples & test/external/external_test_opt.py
|
|
||||||
NUM: [18, 2], what ResNet[18] / EfficientNet[2] to train
|
|
||||||
|
|
||||||
##### test/test_ops.py
|
|
||||||
PRINT_TENSORS: [1], print tensors
|
|
||||||
FORWARD_ONLY: [1], use forward operations only
|
|
||||||
|
|
||||||
##### test/test_speed_v_torch.py
|
|
||||||
TORCHCUDA: [1], enable the torch cuda backend
|
|
||||||
|
|
||||||
##### test/external/external_test_gpu_ast.py
|
|
||||||
KOPT: [1], enable kernel optimization
|
|
||||||
KCACHE: [1], enable kernel cache
|
|
||||||
|
|
||||||
##### test/external/external_test_opt.py
|
|
||||||
ENET_NUM: [-2,-1], what EfficientNet to use
|
|
||||||
|
|
||||||
##### test/test_dtype.py & test/extra/test_utils.py & extra/training.py
|
|
||||||
CI: [1], enable to avoid some tests to run in CI
|
|
||||||
|
|
||||||
##### examples & extra & test
|
|
||||||
BS: [8, 16, 32, 64, 128], batch size to use
|
|
||||||
|
|
||||||
|
Additionally, we always welcome documentation contributions, especially for features that are currently under-documented.

@ -0,0 +1,32 @@

# Adding a new accelerator to tinygrad

It's pretty easy to add a new accelerator to tinygrad. All you need to do is implement a total of 20 (optionally 21) low level ops. Then tinygrad takes care of the rest, handling derivatives and syntactic sugar.

## llops

These are the ops that you must implement for your accelerator of choice.

```
Buffer                                                      # class of memory on this device
unary_op (NOOP, EXP, LOG, CAST, SIN)                        # A -> A
reduce_op (SUM, MAX)                                        # A -> B (smaller size, B has 1 in shape)
binary_op (ADD, SUB, MUL, DIV, POW, CMPEQ, MAX)             # A + A -> A (all the same size)
movement_op (EXPAND, RESHAPE, PERMUTE, PAD, SHRINK, STRIDE) # A -> B (different size)
fused_op [[optional]] (MULACC)                              # A * A -> B
```

## mlops

These are the mid level ops that handle the derivatives.

```
Relu, Log, Exp, Sin                         # unary ops
Sum, Max                                    # reduce ops (with axis argument)
Maximum, Add, Sub, Mul, Pow, Div, Equal     # binary ops (no broadcasting, use expand)
Expand, Reshape, Permute, Pad, Shrink, Flip # movement ops
```

These are implemented in [mlops.py](/tinygrad/mlops.py).

## hlops

These are the syntax sugar. They are built on top of the mlops and support most of the things that you could expect from a tensor library.

These are implemented in [tensor.py](/tinygrad/tensor.py).
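
For a feel of the hlops API, here is the matmul one-liner from the repository README, written entirely with tensor methods:

```py
from tinygrad.tensor import Tensor

N = 1024
a, b = Tensor.randn(N, N), Tensor.randn(N, N)
# a matmul spelled out of smaller hlops: reshape, permute, a broadcasted multiply, and a sum reduce
c = (a.reshape(N, 1, N) * b.permute(1, 0).reshape(1, N, N)).sum(axis=2)
print(c.numpy())
```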
|
|
@ -0,0 +1,186 @@
|
||||||
|
# List of environment variables that control tinygrad behavior.
|
||||||
|
|
||||||
|
This is a list of environment variables that control the runtime behavior of tinygrad and its examples.
|
||||||
|
Most of these are self-explanatory, and are usually used to set an option at runtime.
|
||||||
|
|
||||||
|
Example: `GPU=1 DEBUG=4 python3 -m pytest`
|
||||||
|
|
||||||
|
The columns are: Variable, Possible Value(s) and Description.
|
||||||
|
|
||||||
|
- A `#` means that the variable can take any integer value.
|
||||||
|
|
||||||
|
## Global Variables
|
||||||
|
|
||||||
|
These control the behavior of core tinygrad even when used as a library.
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
DEBUG | [1-4] | enable debugging output, with 4 you get operations, timings, speed, generated code and more
|
||||||
|
GPU | [1] | enable the GPU backend
|
||||||
|
CPU | [1] | enable CPU backend
|
||||||
|
MPS | [1] | enable MPS device (for Mac M1 and after)
|
||||||
|
METAL | [1] | enable Metal backend (for Mac M1 and after)
|
||||||
|
METAL_XCODE | [1] | enable Metal using macOS Xcode SDK
|
||||||
|
TORCH | [1] | enable PyTorch backend
|
||||||
|
CLANG | [1] | enable Clang backend
|
||||||
|
LLVM | [1] | enable LLVM backend
|
||||||
|
LLVMOPT | [1] | enable slightly more expensive LLVM optimizations
|
||||||
|
LAZY | [1] | enable lazy operations (this is the default)
|
||||||
|
OPT | [1-4] | optimization level
|
||||||
|
OPTLOCAL | [1-2] | enable local optimization
|
||||||
|
GRAPH | [1] | create a graph of all operations (requires graphviz)
|
||||||
|
GRAPHPATH | [/path/to] | where to put the generated graph
|
||||||
|
PRUNEGRAPH | [1] | prune MovementOps and LoadOps from the graph
|
||||||
|
PRINT_PRG | [1] | print program code
|
||||||
|
IMAGE | [1] | enable 2d specific optimizations
|
||||||
|
FLOAT16 | [1] | use float16 for images instead of float32
|
||||||
|
ENABLE_METHOD_CACHE | [1] | enable method cache (this is the default)
|
||||||
|
EARLY_STOPPING | [# > 0] | stop after this many kernels
|
||||||
|
DISALLOW_ASSIGN | [1] | disallow assignment of tensors
|
||||||
|
NATIVE_EXPLOG | [1] | enable using native exp and log
|
||||||
|
|
||||||
|
## File Specific Variables
|
||||||
|
|
||||||
|
These are variables that control the behavior of a specific file, these usually don't affect the library itself.
|
||||||
|
Most of the time these will never be used, but they are here for completeness.
|
||||||
|
|
||||||
|
### accel/ane/2_compile/hwx_parse.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
PRINTALL | [1] | print all ANE registers
|
||||||
|
|
||||||
|
### extra/onnx.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
ONNXLIMIT | [#] | set a limit for ONNX
|
||||||
|
DEBUGONNX | [1] | enable ONNX debugging
|
||||||
|
|
||||||
|
### extra/thneed.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
DEBUGCL | [1-4] | enable Debugging for OpenCL
|
||||||
|
PRINT_KERNEL | [1] | Print OpenCL Kernels
|
||||||
|
|
||||||
|
### extra/kernel_search.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
OP | [1-3] | different operations
|
||||||
|
NOTEST | [1] | enable not testing AST
|
||||||
|
DUMP | [1] | enable dumping of intervention cache
|
||||||
|
REDUCE | [1] | enable reduce operations
|
||||||
|
SIMPLE_REDUCE | [1] | enable simpler reduce operations
|
||||||
|
BC | [1] | enable big conv operations
|
||||||
|
CONVW | [1] | enable convw operations
|
||||||
|
FASTCONV | [1] | enable faster conv operations
|
||||||
|
GEMM | [1] | enable general matrix multiply operations
|
||||||
|
BROKEN | [1] | enable a kind of operation
|
||||||
|
BROKEN3 | [1] | enable a kind of operation
|
||||||
|
|
||||||
|
### examples/vit.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
LARGE | [1] | enable larger dimension model
|
||||||
|
|
||||||
|
### examples/llama.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
WEIGHTS | [1] | enable loading weights
|
||||||
|
|
||||||
|
### examples/mlperf
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
MODEL | [resnet,retinanet,unet3d,rnnt,bert,maskrcnn] | what models to use
|
||||||
|
|
||||||
|
### examples/benchmark_train_efficientnet.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
CNT | [10] | the amount of times to loop the benchmark
|
||||||
|
BACKWARD | [1] | enable backward pass
|
||||||
|
TRAINING | [1] | set Tensor.training
|
||||||
|
CLCACHE | [1] | enable cache for OpenCL
|
||||||
|
|
||||||
|
### examples/hlb_cifar10.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
TORCHWEIGHTS | [1] | use torch to initialize weights
|
||||||
|
DISABLE_BACKWARD | [1] | don't do backward pass
|
||||||
|
|
||||||
|
### examples/benchmark_train_efficientnet.py & examples/hlb_cifar10.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
ADAM | [1] | use the Adam optimizer
|
||||||
|
|
||||||
|
### examples/hlb_cifar10.py & examples/hlb_cifar10_torch.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
STEPS | [0-10] | number of steps
|
||||||
|
FAKEDATA | [1] | enable to use random data
|
||||||
|
|
||||||
|
### examples/train_efficientnet.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
STEPS | [# % 1024] | number of steps
|
||||||
|
TINY | [1] | use a tiny convolution network
|
||||||
|
IMAGENET | [1] | use imagenet for training
|
||||||
|
|
||||||
|
### examples/train_efficientnet.py & examples/train_resnet.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
TRANSFER | [1] | enable to use pretrained data
|
||||||
|
|
||||||
|
### examples & test/external/external_test_opt.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
NUM | [18, 2] | what ResNet[18] / EfficientNet[2] to train
|
||||||
|
|
||||||
|
### test/test_ops.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
PRINT_TENSORS | [1] | print tensors
|
||||||
|
FORWARD_ONLY | [1] | use forward operations only
|
||||||
|
|
||||||
|
### test/test_speed_v_torch.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
TORCHCUDA | [1] | enable the torch cuda backend
|
||||||
|
|
||||||
|
### test/external/external_test_gpu_ast.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
KOPT | [1] | enable kernel optimization
|
||||||
|
KCACHE | [1] | enable kernel cache
|
||||||
|
|
||||||
|
### test/external/external_test_opt.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
ENET_NUM | [-2,-1] | what EfficientNet to use
|
||||||
|
|
||||||
|
### test/test_dtype.py & test/extra/test_utils.py & extra/training.py
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
CI | [1] | disables some tests for CI
|
||||||
|
|
||||||
|
### examples & extra & test
|
||||||
|
|
||||||
|
Variable | Possible Value(s) | Description
|
||||||
|
---|---|---
|
||||||
|
BS | [8, 16, 32, 64, 128] | batch size to use
|
|
@ -0,0 +1,300 @@
|
||||||
|
# tinygrad Quick Start Guide
|
||||||
|
|
||||||
|
This guide assumes no prior knowledge of pytorch or any other deep learning framework, but does assume some basic knowledge of neural networks.
|
||||||
|
It is intended to be a very quick overview of the high level API that tinygrad provides.
|
||||||
|
|
||||||
|
This guide is also structured as a tutorial; by the end of it you will have a working model that can classify handwritten digits.
|
||||||
|
|
||||||
|
We need some imports to get started:
|
||||||
|
```py
|
||||||
|
import numpy as np
|
||||||
|
import time
|
||||||
|
```
|
||||||
|
|
||||||
|
## Tensors
|
||||||
|
|
||||||
|
Tensors are the base data structure in tinygrad. They can be thought of as a multidimensional array of a specific data type.
|
||||||
|
All high level operations in tinygrad operate on these tensors.
|
||||||
|
|
||||||
|
The tensor class can be imported like so:
|
||||||
|
```py
|
||||||
|
from tinygrad.tensor import Tensor
|
||||||
|
```
|
||||||
|
|
||||||
|
Tensors can be created from an existing data structure like a python list or numpy ndarray:
|
||||||
|
```py
|
||||||
|
t1 = Tensor([1, 2, 3, 4, 5])
|
||||||
|
na = np.array([1, 2, 3, 4, 5])
|
||||||
|
t2 = Tensor(na)
|
||||||
|
```
|
||||||
|
|
||||||
|
Tensors can also be created using one of the many factory methods:
|
||||||
|
```py
|
||||||
|
full = Tensor.full(shape=(2, 3), fill_value=5) # create a tensor of shape (2, 3) filled with 5
|
||||||
|
zeros = Tensor.zeros(2, 3) # create a tensor of shape (2, 3) filled with 0
|
||||||
|
ones = Tensor.ones(2, 3) # create a tensor of shape (2, 3) filled with 1
|
||||||
|
|
||||||
|
full_like = Tensor.full_like(full, fill_value=2) # create a tensor of the same shape as `full` filled with 2
|
||||||
|
zeros_like = Tensor.zeros_like(full) # create a tensor of the same shape as `full` filled with 0
|
||||||
|
ones_like = Tensor.ones_like(full) # create a tensor of the same shape as `full` filled with 1
|
||||||
|
|
||||||
|
eye = Tensor.eye(3) # create a 3x3 identity matrix
|
||||||
|
arange = Tensor.arange(start=0, stop=10, step=1) # create a tensor of shape (10,) filled with values from 0 to 9
|
||||||
|
|
||||||
|
rand = Tensor.rand(2, 3) # create a tensor of shape (2, 3) filled with random values from a uniform distribution
|
||||||
|
randn = Tensor.randn(2, 3) # create a tensor of shape (2, 3) filled with random values from a normal distribution
|
||||||
|
uniform = Tensor.uniform(2, 3, low=0, high=10) # create a tensor of shape (2, 3) filled with random values from a uniform distribution between 0 and 10
|
||||||
|
```
|
||||||
|
There are even more of these factory methods; you can find them in the [tensor.py](/tinygrad/tensor.py) file.
|
||||||
|
|
||||||
|
All of the tensor creation methods can take a `dtype` argument to specify the data type of the tensor.
|
||||||
|
```py
|
||||||
|
from tinygrad.helpers import dtypes
|
||||||
|
|
||||||
|
t3 = Tensor([1, 2, 3, 4, 5], dtype=dtypes.int32)
|
||||||
|
```
|
||||||
|
|
||||||
|
Tensors allow you to perform operations on them like so:
|
||||||
|
```py
|
||||||
|
t4 = Tensor([1, 2, 3, 4, 5])
|
||||||
|
t5 = (t4 + 1) * 2
|
||||||
|
t6 = (t5 * t4).relu().log_softmax()
|
||||||
|
```
|
||||||
|
|
||||||
|
All of these operations are lazy and are only executed when you realize the tensor using `.realize()` or `.numpy()`.
|
||||||
|
```py
|
||||||
|
print(t6.numpy())
|
||||||
|
# [-56. -48. -36. -20. 0.]
|
||||||
|
```
|
||||||
|
|
||||||
|
There are a lot more operations that can be performed on tensors; you can find them in the [tensor.py](/tinygrad/tensor.py) file.
Additionally, reading through [abstractions.py](/docs/abstractions.py) will help you understand how operations on these tensors make their way down to your hardware.
|
||||||
|
|
||||||
|
## Models
|
||||||
|
|
||||||
|
Neural networks in tinygrad are really just represented by the operations performed on tensors.
|
||||||
|
These operations are commonly grouped into the `__call__` method of a class which allows modularization and reuse of these groups of operations.
|
||||||
|
These classes do not need to inherit from any base class; in fact, if they don't need any trainable parameters, they don't even need to be a class!
|
||||||
|
|
||||||
|
An example of this would be the `nn.Linear` class which represents a linear layer in a neural network.
|
||||||
|
```py
|
||||||
|
# from tinygrad.nn import Linear
|
||||||
|
class Linear:
|
||||||
|
def __init__(self, in_features, out_features, bias=True, initialization: str='kaiming_uniform'):
|
||||||
|
self.weight = getattr(Tensor, initialization)(out_features, in_features)
|
||||||
|
self.bias = Tensor.zeros(out_features) if bias else None
|
||||||
|
|
||||||
|
def __call__(self, x):
|
||||||
|
return x.linear(self.weight.transpose(), self.bias)
|
||||||
|
```
|
||||||
|
There are more neural network modules already implemented in [nn](/tinygrad/nn/__init__.py), and you can also implement your own.
|
||||||
|
|
||||||
|
We will be implementing a simple neural network that can classify handwritten digits from the MNIST dataset.
|
||||||
|
Our classifier will be a simple 2 layer neural network with a Leaky ReLU activation function.
|
||||||
|
It will use a hidden layer size of 128 and an output layer size of 10 (one for each digit) with no bias on either Linear layer.
|
||||||
|
```py
|
||||||
|
from tinygrad.nn import Linear
|
||||||
|
|
||||||
|
class TinyNet:
|
||||||
|
def __init__(self):
|
||||||
|
self.l1 = Linear(784, 128, bias=False)
|
||||||
|
self.l2 = Linear(128, 10, bias=False)
|
||||||
|
|
||||||
|
def __call__(self, x):
|
||||||
|
x = self.l1(x)
|
||||||
|
x = x.leakyrelu()
|
||||||
|
x = self.l2(x)
|
||||||
|
return x.log_softmax()
|
||||||
|
|
||||||
|
net = TinyNet()
|
||||||
|
```
|
||||||
|
We can see that the forward pass of our neural network is just the sequence of operations performed on the input tensor `x`.
|
||||||
|
We can also see that functional operations like `leakyrelu` and `log_softmax` are not defined as classes and are instead just methods we can call.
|
||||||
|
Finally, we just initialize an instance of our neural network, and we are ready to start training it.
|
||||||
|
|
||||||
|
## Training
|
||||||
|
|
||||||
|
Now that we have our neural network defined we can start training it.
|
||||||
|
Training neural networks in tinygrad is super simple.
|
||||||
|
All we need to do is define our neural network, define our loss function, and then call `.backward()` on the loss function to compute the gradients.
|
||||||
|
They can then be used to update the parameters of our neural network using one of the many optimizers in [optim.py](/tinygrad/nn/optim.py).
|
||||||
|
|
||||||
|
First we need to set the training flag in `Tensor`:
|
||||||
|
```py
|
||||||
|
Tensor.training = True
|
||||||
|
```
|
||||||
|
|
||||||
|
For our loss function we will be using cross entropy loss.
|
||||||
|
```py
|
||||||
|
# from extra.training import sparse_categorical_crossentropy
|
||||||
|
def cross_entropy(out, Y):
|
||||||
|
num_classes = out.shape[-1]
|
||||||
|
YY = Y.flatten().astype(np.int32)
|
||||||
|
y = np.zeros((YY.shape[0], num_classes), np.float32)
|
||||||
|
y[range(y.shape[0]),YY] = -1.0*num_classes
|
||||||
|
y = y.reshape(list(Y.shape)+[num_classes])
|
||||||
|
y = Tensor(y)
|
||||||
|
return out.mul(y).mean()
|
||||||
|
```
|
||||||
|
As we can see in this implementation of cross entropy loss, there are certain operations that tinygrad does not support.
|
||||||
|
Namely, operations that are load/store like indexing a tensor with another tensor or assigning a value to a tensor at a certain index.
|
||||||
|
Load/store ops are not supported in tinygrad because they add complexity when trying to port to different backends and 90% of the models out there don't use/need them.
|
||||||
|
|
||||||
|
For our optimizer we will be using the traditional stochastic gradient descent optimizer with a learning rate of 3e-4.
|
||||||
|
```py
|
||||||
|
from tinygrad.nn.optim import SGD
|
||||||
|
|
||||||
|
opt = SGD([net.l1.weight, net.l2.weight], lr=3e-4)
|
||||||
|
```
|
||||||
|
We can see that we are passing the parameters of our neural network to the optimizer, because the optimizer needs to know which parameters it should update.
A simpler way to do this is to use `get_parameters(net)` from `tinygrad.nn.optim`, which returns a list of all the parameters in the neural network.
The parameters are just listed out explicitly here for clarity.
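
For example, a minimal equivalent setup using `get_parameters` (same `net` and learning rate as above):

```py
from tinygrad.nn.optim import SGD, get_parameters

# gather every trainable parameter of the model instead of listing them by hand
opt = SGD(get_parameters(net), lr=3e-4)
```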
|
||||||
|
|
||||||
|
Now that we have our network, loss function, and optimizer defined all we are missing is the data to train on!
|
||||||
|
There are a couple of dataset loaders in tinygrad located in [/datasets](/datasets).
|
||||||
|
We will be using the MNIST dataset loader.
|
||||||
|
```py
|
||||||
|
from datasets import fetch_mnist
|
||||||
|
```
|
||||||
|
|
||||||
|
Now we have everything we need to start training our neural network.
|
||||||
|
We will be training for 1000 steps with a batch size of 64.
|
||||||
|
```py
|
||||||
|
X_train, Y_train, X_test, Y_test = fetch_mnist()
|
||||||
|
|
||||||
|
for step in range(1000):
|
||||||
|
# random sample a batch
|
||||||
|
samp = np.random.randint(0, X_train.shape[0], size=(64))
|
||||||
|
batch = Tensor(X_train[samp], requires_grad=False)
|
||||||
|
# get the corresponding labels
|
||||||
|
labels = Y_train[samp]
|
||||||
|
|
||||||
|
# forward pass
|
||||||
|
out = net(batch)
|
||||||
|
|
||||||
|
# compute loss
|
||||||
|
loss = cross_entropy(out, labels)
|
||||||
|
|
||||||
|
# zero gradients
|
||||||
|
opt.zero_grad()
|
||||||
|
|
||||||
|
# backward pass
|
||||||
|
loss.backward()
|
||||||
|
|
||||||
|
# update parameters
|
||||||
|
opt.step()
|
||||||
|
|
||||||
|
# calculate accuracy
|
||||||
|
pred = np.argmax(out.numpy(), axis=-1)
|
||||||
|
acc = (pred == labels).mean()
|
||||||
|
|
||||||
|
if step % 100 == 0:
|
||||||
|
print(f"Step {step+1} | Loss: {loss.numpy()} | Accuracy: {acc}")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Evaluation
|
||||||
|
|
||||||
|
Now that we have trained our neural network we can evaluate it on the test set.
|
||||||
|
We will be using the same batch size of 64 and will be evaluating for 1000 of those batches.
|
||||||
|
```py
|
||||||
|
# set training flag to false
|
||||||
|
Tensor.training = False
|
||||||
|
|
||||||
|
st = time.perf_counter()
|
||||||
|
avg_acc = 0
|
||||||
|
for step in range(1000):
|
||||||
|
# random sample a batch
|
||||||
|
samp = np.random.randint(0, X_test.shape[0], size=(64))
|
||||||
|
batch = Tensor(X_test[samp], requires_grad=False)
|
||||||
|
# get the corresponding labels
|
||||||
|
labels = Y_test[samp]
|
||||||
|
|
||||||
|
# forward pass
|
||||||
|
out = net(batch)
|
||||||
|
|
||||||
|
# calculate accuracy
|
||||||
|
pred = np.argmax(out.numpy(), axis=-1)
|
||||||
|
avg_acc += (pred == labels).mean()
|
||||||
|
print(f"Test Accuracy: {avg_acc / 1000}")
|
||||||
|
print(f"Time: {time.perf_counter() - st}")
|
||||||
|
```
|
||||||
|
|
||||||
|
## And that's it!
|
||||||
|
|
||||||
|
We highly recommend you check out the [examples/](/examples) folder for more examples of using tinygrad.
|
||||||
|
Reading the source code of tinygrad is also a great way to learn how it works.
|
||||||
|
Specifically, the tests in [tests/](/tests) are a great place to see the usage and semantics of the different operations.
|
||||||
|
There are also a bunch of models implemented in [models/](/models) that you can use as a reference.
|
||||||
|
|
||||||
|
Additionally, feel free to ask questions in the `#learn-tinygrad` channel on the [discord](https://discord.gg/beYbxwxVdx). Don't ask to ask, just ask!
|
||||||
|
|
||||||
|
## Extras
|
||||||
|
|
||||||
|
### JIT
|
||||||
|
|
||||||
|
Additionally, it is possible to speed up the computation of certain neural networks by using the JIT.
|
||||||
|
Currently, this does not support models with varying input sizes or non-tinygrad operations.
|
||||||
|
|
||||||
|
To use the JIT we just need to add a function decorator to the forward pass of our neural network and ensure that the input and output are realized tensors.
|
||||||
|
Or in this case we will create a wrapper function and decorate the wrapper function to speed up the evaluation of our neural network.
|
||||||
|
```py
|
||||||
|
from tinygrad.jit import TinyJit
|
||||||
|
|
||||||
|
@TinyJit
|
||||||
|
def jit(x):
|
||||||
|
return net(x).realize()
|
||||||
|
|
||||||
|
st = time.perf_counter()
|
||||||
|
avg_acc = 0
|
||||||
|
for step in range(1000):
|
||||||
|
# random sample a batch
|
||||||
|
samp = np.random.randint(0, X_test.shape[0], size=(64))
|
||||||
|
batch = Tensor(X_test[samp], requires_grad=False)
|
||||||
|
# get the corresponding labels
|
||||||
|
labels = Y_test[samp]
|
||||||
|
|
||||||
|
# forward pass with jit
|
||||||
|
out = jit(batch)
|
||||||
|
|
||||||
|
# calculate accuracy
|
||||||
|
pred = np.argmax(out.numpy(), axis=-1)
|
||||||
|
avg_acc += (pred == labels).mean()
|
||||||
|
print(f"Test Accuracy: {avg_acc / 1000}")
|
||||||
|
print(f"Time: {time.perf_counter() - st}")
|
||||||
|
```
|
||||||
|
You will find that evaluation is much faster than before and that your accelerator utilization is much higher.
|
||||||
|
|
||||||
|
### Saving and Loading Models
|
||||||
|
|
||||||
|
The standard weight format for tinygrad is [safetensors](https://github.com/huggingface/safetensors). This means that you can load the weights of any model that also uses safetensors into tinygrad.
|
||||||
|
There are functions in [state.py](/tinygrad/state.py) to save and load models to and from this format.
|
||||||
|
```py
|
||||||
|
from tinygrad.state import safe_save, safe_load, get_state_dict, load_state_dict
|
||||||
|
|
||||||
|
# first we need the state dict of our model
|
||||||
|
state_dict = get_state_dict(net)
|
||||||
|
|
||||||
|
# then we can just save it to a file
|
||||||
|
safe_save(state_dict, "model.safetensors")
|
||||||
|
|
||||||
|
# and load it back in
|
||||||
|
state_dict = safe_load("model.safetensors")
|
||||||
|
load_state_dict(net, state_dict)
|
||||||
|
```
|
||||||
|
|
||||||
|
Many of the models in the [models/](/models) folder have a `load_from_pretrained` method that will download and load the weights for you. These are usually PyTorch weights, meaning you need PyTorch installed to load them.
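
As a rough sketch of that pattern (run from the repository root; the constructor arguments here are illustrative, check [models/](/models) for the real signatures):

```py
# illustrative sketch: argument names/defaults may differ from the actual models/efficientnet.py
from models.efficientnet import EfficientNet

model = EfficientNet()
model.load_from_pretrained()  # downloads the PyTorch weights and loads them (requires torch)
```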
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
|
||||||
|
There exist a bunch of environment variables that control the runtime behavior of tinygrad.
|
||||||
|
Some of the common ones are `DEBUG` and the various backend enablement variables.
|
||||||
|
|
||||||
|
You can find a full list and their descriptions in [env_vars.md](/docs/env_vars.md).
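
For example (this exact invocation appears in [env_vars.md](/docs/env_vars.md)), running the test suite on the GPU backend with verbose debug output:

```sh
GPU=1 DEBUG=4 python3 -m pytest
```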
|
||||||
|
|
||||||
|
### Visualizing the Computation Graph
|
||||||
|
|
||||||
|
It is possible to visualize the computation graph of a neural network using [graphviz](https://graphviz.org/).
|
||||||
|
|
||||||
|
This is easily done by running a single pass (forward or backward!) of the neural network with the environment variable `GRAPH` set to `1`.
|
||||||
|
The graph will be saved to `/tmp/net.svg` by default.
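
For example, the command previously shown in the README graphs a single SGD step of the MNIST model (requires graphviz/dot and writes `/tmp/net.svg`):

```sh
GRAPH=1 python3 test/models/test_mnist.py TestMNIST.test_sgd_onestep
```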

@ -0,0 +1,61 @@

# tinygrad Showcase

Despite being a tiny library, tinygrad is capable of doing a lot of things, from state-of-the-art [vision](https://arxiv.org/abs/1905.11946) to state-of-the-art [language](https://arxiv.org/abs/1706.03762) models.

## Vision

### EfficientNet

You can either pass in the URL of a picture to discover what it is:

```sh
python3 examples/efficientnet.py https://media.istockphoto.com/photos/hen-picture-id831791190
```

Or, if you have a camera and OpenCV installed, you can detect what is in front of you:

```sh
python3 examples/efficientnet.py webcam
```

### YOLOv3

Take a look at [yolov3.py](/examples/yolov3.py).

![yolo by tinygrad](/docs/showcase/yolo_by_tinygrad.jpg)

## Audio

### Whisper

Take a look at [whisper.py](/examples/whisper.py). You need pyaudio and torchaudio installed.

```sh
SMALL=1 python3 examples/whisper.py
```

## Generative

### Generative Adversarial Networks

Take a look at [mnist_gan.py](/examples/mnist_gan.py).

![mnist gan by tinygrad](/docs/showcase/mnist_by_tinygrad.jpg)

### Stable Diffusion

You will need to download the [weights](https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt) of Stable Diffusion and put them into the [weights/](/weights) directory.

```sh
python3 examples/stable_diffusion.py
```

![a horse sized cat eating a bagel](/docs/showcase/stable_diffusion_by_tinygrad.jpg)

*"a horse sized cat eating a bagel"*

### LLaMA

You will need to download the weights and put them into the [weights/LLaMA](/weights/LLaMA) directory, which may need to be created.

Then you can have a chat with Stacy:

```sh
python3 examples/llama.py
```