tinygrad/examples/benchmark_train_efficientne...

#!/usr/bin/env python3
import gc
import time
from tqdm import trange
from models.efficientnet import EfficientNet
from tinygrad.nn.state import get_parameters
from tinygrad.nn import optim
from tinygrad.tensor import Tensor
from tinygrad.ops import GlobalCounters
from tinygrad.helpers import getenv
from tinygrad.jit import CacheCollector

def tensors_allocated():
  # Count the live Tensor objects among those the garbage collector tracks.
  return sum(isinstance(x, Tensor) for x in gc.get_objects())

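NUM = getenv("NUM", 2)  # first argument passed to EfficientNet (model number)
BS = getenv("BS", 8)  # batch size
CNT = getenv("CNT", 10)  # number of benchmark iterations
BACKWARD = getenv("BACKWARD", 0)  # nonzero: also run the backward pass and optimizer step
TRAINING = getenv("TRAINING", 1)  # value assigned to Tensor.training
ADAM = getenv("ADAM", 0)  # nonzero: use Adam instead of SGD
CLCACHE = getenv("CLCACHE", 0)  # nonzero: collect kernels on iteration 2 and replay them afterwards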

if __name__ == "__main__":
  print(f"NUM:{NUM} BS:{BS} CNT:{CNT}")
  model = EfficientNet(NUM, classes=1000, has_se=False, track_running_stats=False)
  parameters = get_parameters(model)
  for p in parameters: p.realize()
  if ADAM: optimizer = optim.Adam(parameters, lr=0.001)
  else: optimizer = optim.SGD(parameters, lr=0.001)

  Tensor.training = TRAINING
  Tensor.no_grad = not BACKWARD
  for i in trange(CNT):
    GlobalCounters.reset()
    cpy = time.monotonic()
    x_train = Tensor.randn(BS, 3, 224, 224, requires_grad=False).realize()
    y_train = Tensor.randn(BS, 1000, requires_grad=False).realize()

    # TODO: replace with TinyJit
    # Iterations 0-2 (or every iteration when CLCACHE is off) build and run the graph normally.
    if i < 3 or not CLCACHE:
      st = time.monotonic()
      out = model.forward(x_train)
      loss = out.log_softmax().mul(y_train).mean()
      if i == 2 and CLCACHE: CacheCollector.start()
      if BACKWARD:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
      mt = time.monotonic()  # end of the "build" interval; the realize() calls below are reported as "realize"
      loss.realize()
      for p in parameters:
        p.realize()
      et = time.monotonic()
    else:
      # CLCACHE path: replay the (program, args) pairs collected during iteration 2.
      st = mt = time.monotonic()
      for prg, args in cl_cache: prg(*args)
      et = time.monotonic()

    if i == 2 and CLCACHE:
      cl_cache = CacheCollector.finish()

    mem_used = GlobalCounters.mem_used
    loss_cpu = loss.detach().numpy()
    cl = time.monotonic()

    print(f"{(st-cpy)*1000.0:7.2f} ms cpy,  {(cl-st)*1000.0:7.2f} ms run, {(mt-st)*1000.0:7.2f} ms build, {(et-mt)*1000.0:7.2f} ms realize, {(cl-et)*1000.0:7.2f} ms CL, {loss_cpu:7.2f} loss, {tensors_allocated():4d} tensors, {mem_used/1e9:.2f} GB used, {GlobalCounters.global_ops*1e-9/(cl-st):9.2f} GFLOPS")