chenyu
da4fa77e92
move import cProfile and pstats inside Profiling class ( #6148 )
2024-08-17 16:08:53 -04:00
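Note on the entry above: moving an import inside the class that uses it defers the import cost until the class is actually used. A minimal sketch of that pattern follows. It is an illustration only, not the code from this commit; the class name LazyProfiling and its attributes are hypothetical.

```python
import contextlib

class LazyProfiling(contextlib.ContextDecorator):
    """Hypothetical profiler wrapper: cProfile and pstats are imported on use."""

    def __enter__(self):
        # Deferred import: modules that never profile never pay for cProfile.
        import cProfile
        self.pr = cProfile.Profile()
        self.pr.enable()
        return self

    def __exit__(self, *exc):
        # pstats is likewise only needed once there is something to report.
        import pstats
        self.pr.disable()
        self.stats = pstats.Stats(self.pr)
        return False  # do not swallow exceptions raised inside the block


with LazyProfiling() as prof:
    sum(range(1000))
```

After the block exits, prof.stats is a regular pstats.Stats object and can be sorted and printed in the usual way.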
George Hotz
88edc2902d
axis_is_masked with graph_rewrite [run_process_replay] ( #6144 )
2024-08-17 10:28:49 -07:00
chenyu
039163e664
more tqdm touchup ( #6143 )
...
* more tqdm touchup
don't default iterable to None, and more text cleanups
* oh iterable can be None
2024-08-17 13:06:05 -04:00
qazal
5a266d5d0c
type verify ImageDType and PtrDType [run_process_replay] ( #6137 )
...
* type verify ImageDType and PtrDType [run_process_replay]
* fix tests
2024-08-17 16:37:07 +03:00
qazal
d1d41130cd
use membufs in ImageDType checks [run_process_replay] ( #6136 )
...
* use membufs in ImageDType checks
* set by key [run_process_replay]
2024-08-17 16:17:46 +03:00
qazal
41ac8bdd63
verify_ast prep refactor for intermediate uops type spec ( #6135 )
...
* refactor to ops
* refactor to two functions
* the uop's shape become local_reduce
2024-08-17 15:34:18 +03:00
qazal
d9ce664350
add test_verify_ast [run_process_replay] ( #6134 )
2024-08-17 14:14:30 +03:00
qazal
151a62ad32
hotfix: store dtype for ImageDType ( #6133 )
2024-08-17 13:44:53 +03:00
George Hotz
d0513087e1
hotfix: revert axis_is_masked for stable diffusion speed
2024-08-17 00:22:08 -07:00
George Hotz
4df4845b47
cache is_int [run_process_replay] ( #6131 )
...
* cache is_int [run_process_replay]
* functools.cached_property is pretty slow
2024-08-17 00:19:03 -07:00
George Hotz
3a2d724cb2
extra matcher from renderer [run_process_replay] ( #6130 )
...
* extra matcher from renderer
* cache_pm [run_process_replay]
2024-08-16 23:53:11 -07:00
George Hotz
9bc81c6db4
UOps.SHAPETRACKER ( #6129 )
...
* UOps.SHAPETRACKER [run_process_replay]
* no process replay
2024-08-16 23:26:34 -07:00
George Hotz
5048066e79
st_arg, never -1 [run_process_replay] ( #6128 )
2024-08-16 22:46:56 -07:00
George Hotz
9e6ad4b40f
hotfix: free minor speedup
2024-08-16 21:08:03 -07:00
George Hotz
d9cb45af09
only axis is masked [run_process_replay] ( #6123 )
2024-08-16 21:01:17 -07:00
George Hotz
94aa5f11b5
Revert "use vmax for real_size [run_process_replay] ( #6120 )" ( #6122 )
...
This reverts commit a6e3211444.
2024-08-16 20:33:19 -07:00
George Hotz
a6e3211444
use vmax for real_size [run_process_replay] ( #6120 )
...
* use vmax for real_size [run_process_replay]
* axis is masked
2024-08-16 20:17:23 -07:00
George Hotz
912f01ed4b
UOpGraph -> linearize_uop [run_process_replay] ( #6119 )
2024-08-16 19:48:39 -07:00
George Hotz
7cae152aa2
move uop logic into shapetracker [run_process_replay] ( #6118 )
2024-08-16 17:47:15 -07:00
George Hotz
89c7989659
no shapetracker in ops [run_process_replay] ( #6117 )
2024-08-16 17:23:27 -07:00
George Hotz
74ee9febec
remove iter from uopgraph ( #6110 )
...
* remove iter from uopgraph
* linearize returns uops
* fix tests
* linearize in linearize
* tests fix
* touchup
* test failures
2024-08-16 15:58:29 -07:00
qazal
28c75bf2a6
merge uops with ops ( #6111 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
chenyu
379d080e74
tqdm touchup ( #6113 )
...
more precise names and don't repeat set_description
2024-08-16 17:34:21 -04:00
nimlgen
5f1554b574
amd fix uaf in program ( #6114 )
...
* amd fix uaf in program
* keep it align
* sync before free
2024-08-17 00:22:46 +03:00
qazal
d5e3217076
hotfix: scheduler differ ( #6115 )
...
* hotfix: scheduler differ
* add the test back
* track keys
2024-08-16 23:34:49 +03:00
qazal
c23d44c779
AST is UOp ( #6030 )
...
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.
* fix benchmark
* remove extra dedup
2024-08-16 22:09:00 +03:00
George Hotz
d6f64c0c1f
do better image indexing [run_process_replay] ( #6109 )
...
* do better image indexing [run_process_replay]
* fix tests
2024-08-16 09:55:22 -07:00
CaltropHungerton
38fb1e14a2
Intel XMX Tensor Core Support ( #5622 )
...
* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
George Hotz
f82ecd8802
remove uop symbolic rendering [run_process_replay] ( #6108 )
2024-08-16 09:02:15 -07:00
George Hotz
e8ae9af962
bump line count to 9000. we should be here a while
2024-08-16 08:46:36 -07:00
chenyu
7d46fb0c83
load balance NV benchmark ci ( #6107 )
2024-08-16 10:08:08 -04:00
qazal
1ff6c7c519
add more types to search [run_process_replay] ( #6096 )
...
* add more types to search [run_process_replay]
* bufs_from_lin
2024-08-16 13:19:25 +03:00
chenyu
e5da88873b
enable UOP_IS_SYMBOLIC ( #5954 )
2024-08-16 00:15:46 -04:00
George Hotz
553ae9ebc0
bilinear interp uint8 fails ( #6103 )
...
* new test for e2e compile failures
* fix bug
* bilinear interp uint8 fails
* better tests
2024-08-15 19:34:39 -07:00
George Hotz
c850e03758
new test for e2e compile failures ( #6101 )
...
* new test for e2e compile failures
* fix bug
2024-08-15 18:56:22 -07:00
chenyu
e4a7869893
move cancel mod pattern into mod_folding ( #6100 )
...
changed some kernel in a good way because x does not go through add chain
2024-08-15 19:04:18 -04:00
qazal
11d62668a3
refactor ast ops dtype access [run_process_replay] ( #6093 )
...
* refactor ast ops dtype access [run_process_replay]
* fix assert message
2024-08-15 19:13:33 +03:00
chenyu
9ef82e1f2b
UOp pattern DEFINE_VAR with min==max is also CONST ( #6095 )
...
* UOp pattern DEFINE_VAR with min==max is also CONST
* fix tests
2024-08-15 12:09:44 -04:00
chenyu
a41c9dd12c
test py.typed as a package ( #6094 )
...
* test py.typed as a package
* try this?
* and this
* try that?
* add this back
* cleanup
2024-08-15 11:19:08 -04:00
qazal
25dffb2079
kernel.py more typing [run_process_replay] ( #6092 )
2024-08-15 17:59:24 +03:00
qazal
4d38fec8c1
rename lazyops to parents [run_process_replay] ( #6091 )
2024-08-15 17:27:32 +03:00
chenyu
5accfe26a0
rewrite bool ADD to OR and MUL to AND ( #6084 )
...
* rewrite bool ADD to OR and MUL to AND
fixed running `tinyphysics.onnx`, which contains a getitem from a boolean tensor.
only can repro through BEAM_COMPARE, which i think is a different bug in test_linearizer_failure
* fold those, and fix tests
* only for bool
* move dtypes.bool
2024-08-15 10:11:57 -04:00
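Note on the entry above: the rewrite is valid because, for values restricted to True/False, addition that saturates back to bool matches logical OR, and multiplication matches logical AND. The sketch below checks this exhaustively in plain Python. It assumes bool addition is defined as "nonzero sum is True"; the helper names bool_add and bool_mul are hypothetical and are not tinygrad functions.

```python
from itertools import product

def bool_add(a: bool, b: bool) -> bool:
    # Assumed semantics: add as integers, then cast back to bool.
    return bool(int(a) + int(b))

def bool_mul(a: bool, b: bool) -> bool:
    # Assumed semantics: multiply as integers, then cast back to bool.
    return bool(int(a) * int(b))

# Exhaustive check over all four input pairs.
for a, b in product([False, True], repeat=2):
    assert bool_add(a, b) == (a or b)
    assert bool_mul(a, b) == (a and b)
```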
nimlgen
b765996d54
hcq remove offset from progs ( #6090 )
2024-08-15 17:02:54 +03:00
chenyu
df03dca6e3
move % inside UOp mod_folding and remove deprecated tests ( #6085 )
...
[run_process_replay]
2024-08-14 23:25:10 -04:00
George Hotz
c6e117c899
add a single py.typed ( #6083 )
2024-08-14 17:31:46 -07:00
qazal
2bf7b56485
minor test fixups from the AST is UOp diff ( #6081 )
...
* add assert_equiv_uops cache
* dont expect lowering and schedule errors
2024-08-14 23:58:04 +03:00
chenyu
95aa6d8ccd
remove redundant x/c pattern [run_process_replay] ( #6082 )
...
there's no div and 1/c is const folded
2024-08-14 16:57:39 -04:00
chenyu
a61cb1ff7c
move mod mod pattern into generic mod folding ( #6077 )
2024-08-14 16:24:21 -04:00
George Hotz
64563abc90
add LSTMCell to nn ( #6080 )
...
* add LSTMCell to nn
* lstmcell works with no input on first
* fix no bias 0
* simpler
2024-08-14 12:08:42 -07:00
chenyu
6b3112d525
fix qcom process_replay for kernel diff ( #6079 )
...
* debug why qcom process_replay does not run
skipping the wrong exception?
* um-hum
* get_step_times was parsed incorrectly
* cleanup
2024-08-14 15:05:49 -04:00