tinygrad/test/test_winograd.py

import unittest
from tinygrad import Tensor, GlobalCounters
from tinygrad.ops import UOps
from tinygrad.helpers import Timing, CI, Profiling, WINO, DEBUG, getenv
from tinygrad.codegen.kernel import Kernel
from tinygrad.engine.schedule import create_schedule

class TestWinograd(unittest.TestCase):
  def setUp(self):
    # force the winograd path on for every test; tearDown restores the previous setting
    self.old = WINO.value
    WINO.value = 1
  def tearDown(self):
    WINO.value = self.old

  def test_speed(self):
    x = Tensor.empty(1,4,9,9)
    w = Tensor.empty(4,4,3,3)

    with Timing("running conv: "):
      out = Tensor.conv2d(x, w)

    with Timing("scheduling: "):
      sched = create_schedule([out.lazydata])

    for i,s in enumerate(sched):
      if s.ast.op is not UOps.SINK: continue
      ops = s.ast.parents
      with Timing(f"linearize {i} with {len(ops):4d} ops: "):
        l = Kernel(s.ast)
        l.hand_coded_optimizations()
        l.linearize()
      assert len(l.sts) <= 256  # just the current value to prevent regression
      if DEBUG >= 2: print(f"{len(l.sts):4d} shapetrackers with max {max(len(x.views) for x in l.sts)} views")
      for st in l.sts:
        assert len(st.views) <= 2, "too many views in winograd"
        if DEBUG >= 3:
          print(f"{len(st.views):3d} views")
          for v in st.views: print(v)

  def test_profile(self):
    x,w = Tensor.rand(1,4,9,9).realize(), Tensor.rand(4,4,3,3).realize()
    with Profiling(enabled=not CI, sort='time'):
      out = Tensor.conv2d(x,w).realize()
    out.numpy()

  def test_four_kernels(self):
    x,w = Tensor.rand(1,4,9,9).realize(), Tensor.rand(4,4,3,3).realize()
    GlobalCounters.reset()
    out = Tensor.conv2d(x,w).realize()
    assert GlobalCounters.kernel_count == 4
    out.numpy()

  @unittest.skipIf(getenv("PTX"), "winograd uses too much in PTX")
  def test_counters(self):
    IC, OC, X, Y = 4,4,9,9
    #OC, IC, X, Y = 512, 256, 8, 8
    x,w = Tensor.rand(1,IC,Y,X).realize(), Tensor.rand(OC,IC,3,3).realize()
    # first run: winograd enabled (set in setUp)
    GlobalCounters.reset()
    Tensor.conv2d(x,w).realize()
    ops_wino, mem_wino = GlobalCounters.global_ops, GlobalCounters.global_mem
    # second run: plain convolution as the baseline (tearDown restores WINO afterwards)
    WINO.value = 0
    GlobalCounters.reset()
    Tensor.conv2d(x,w).realize()
    ops_normal, mem_normal = GlobalCounters.global_ops, GlobalCounters.global_mem

    ops_ratio, mem_ratio = ops_wino/ops_normal, mem_wino/mem_normal
    print(f"ops: normal {ops_normal:9d} wino {ops_wino:9d} ratio {ops_ratio:.2f}")
    print(f"mem: normal {mem_normal:9d} wino {mem_wino:9d} ratio {mem_ratio:.2f}")
    self.assertLess(ops_ratio, 2.6)  # TODO: there are issues with factorization now
    self.assertLess(mem_ratio, 10)

if __name__ == '__main__':
  unittest.main(verbosity=2)