5335 Commits

Author SHA1 Message Date
nimlgen 73fda023d3
amd better comments for ENABLE_SGPR_DISPATCH_PTR (#5768)
* amd better comments for ENABLE_SGPR_DISPATCH_PTR

* fix linter
2024-07-28 16:23:38 +03:00
qazal 95dda8dadf
more unmatching vectorize/gep asserts [run_process_replay] (#5760)
* merge vectorize/gep rules [run_process_replay]

* assert dtypes

* src=

* float2=(float4.x,float4.y)
2024-07-28 15:08:54 +08:00
chenyu bfbd7c5461
more generic UOp mul mod folding (#5765) 2024-07-27 20:20:35 -04:00
chenyu 80c6475757
update test_uop_symbolic to test UOp min and max (#5764)
covers #5750, #5748, #5741
2024-07-27 19:53:21 -04:00
nimlgen 1903542c2d
nv/cuda compilers touchup (#5759)
* nv/cuda compilers touchup

* fix cuda check + move nv disasm

* remove includes

* fix nvrtc_check
2024-07-28 00:15:28 +03:00
chenyu 3c79faaf77
remove redundant UOps max folding [run_process_replay] (#5762)
all covered by generic max folding
2024-07-27 16:46:51 -04:00
chenyu 05748e5a84
fix vmax of Uop.RANGE off by 1 (#5750)
with this, can remove several redundant max folding rules, do it separately to check kernel diff
2024-07-27 16:30:46 -04:00
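As an illustration of the off-by-one in #5750: a loop variable that iterates over a half-open range never reaches its end value, so its largest value is `end - 1`. The sketch below assumes RANGE has an exclusive upper bound (the commit message only says the bound was off by 1), and `range_bounds` is a hypothetical helper, not a tinygrad function.

```python
from typing import Tuple

def range_bounds(start: int, end: int) -> Tuple[int, int]:
    """Inclusive (vmin, vmax) of a loop variable iterating over [start, end).

    Hypothetical helper for illustration only; assumes end > start and
    that the end of the range is exclusive.
    """
    return start, end - 1

# Cross-check against the values an actual half-open loop produces.
values = list(range(0, 4))
assert range_bounds(0, 4) == (min(values), max(values))
```

Using `end` instead of `end - 1` as the maximum is still a valid bound but a looser one, which is why tightening it can make separate max folding rules redundant.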
nimlgen fff19b961b
docs: user runtime docs (#5756) 2024-07-27 23:21:54 +03:00
nimlgen 5d53fa491b
amd autogened kfd ioctls (#5757)
* amd autogened kio

* unused import

* linter
2024-07-27 22:49:48 +03:00
nimlgen ed1d784077
test profiler timer sync across devs (#5751)
* test profiler timer sync across devs

* more correct

* typo
2024-07-27 16:47:37 +03:00
qazal e5fb08acbc
simpler expand UOps acc [run_process_replay] (#5754) 2024-07-27 15:20:56 +03:00
gswangg de66d93859
PTX render vec CONST (#5729)
* dedupe PTX vec CONST render

* fix linter errors

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-27 13:40:19 +03:00
qazal 890e11ce11
fix UOps.STORE folding returning NOp [run_process_replay] (#5753) 2024-07-27 13:32:54 +03:00
qazal 3e49d86c01
process replay diffs 3 things now (#5731)
* github api infra

* process replay is 3 parts now

* parse benchmarks

* add gh_token

* complete diff

* move process replay tests

* last successful run

* add tempdir

* skip master
2024-07-27 12:52:20 +03:00
qazal 57b4a8e98d
assert process replay asserts (#5737)
* assert process replay asserts

* one ci job is fine

* test: Revert "separate process replay main loop (#5734)"

This reverts commit 94d578396f.

* mac sed needs that

* Revert "test: Revert "separate process replay main loop (#5734)""

This reverts commit e4ad7684d5472a64841a66b43bc1db7c9bbbf9e8.

* disable process replay capture

* save time

* amd is tiny

* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz f8972ace38
test flops (and allow wide ALU in UOps) [run_process_replay] (#5749)
* flops test in external_test_speed_theoretical.py

* test speed theo

* min SZMAX

* allow wide ALU for things that support it

* needed for mypy
2024-07-26 21:07:28 -07:00
George Hotz 2fde2d2914 hotfix: external_test_speed_theoretical works on 24GB 2024-07-26 18:41:52 -07:00
chenyu b75d1e8793
UOp._min_max for IDIV (#5748) 2024-07-26 21:40:16 -04:00
George Hotz 829262a5ee add external_test_speed_theoretical 2024-07-26 17:45:22 -07:00
chenyu 5f168e7499
remove the optimization in AndNode.substitute (#5747)
was used in the old linearizer but is no longer needed. still need substitute because some fuzz tests call sym_infer on AndNode
2024-07-26 20:08:07 -04:00
kormann c50e354936
NOp clean up any_len passing [run_process_replay] (#5743)
* clean allow_any_len

* min
2024-07-26 17:00:31 -07:00
George Hotz db1d093b29
reenable LLaMA-3 8B BEAM on NV (#5746) 2024-07-26 16:56:41 -07:00
chenyu c6b2d96474
minor uop uopgraph cleanups (#5745) 2024-07-26 19:23:48 -04:00
chenyu 3686b6726a
move GraphException to jit.py (#5744)
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
kormann a5ede535ef
NOp field name [run_process_replay] (#5742)
* rm def name

* add field name
2024-07-26 18:45:59 -04:00
chenyu 0d7d4dd731
UOp._min_max for MUL and MOD (#5741) 2024-07-26 18:38:10 -04:00
George Hotz c50e374bb6
multiple locals + get_kernel_modifier + fix valid (#5739)
* multiple locals + get_kernel_modifier + fix valid

* fix test pattern matcher
2024-07-26 15:10:10 -07:00
nimlgen f6c0e17a2c
optimize symbolic-related updates in graphs (#5727)
* try

* faster

* cleaner

* better?

* better?

* cleaner

* fixes

* unused

* mypy

* fix clang

* remove comment

* better var names

* rename

* fix cuda

* rename
2024-07-27 00:57:59 +03:00
chenyu dc7483ee6f
UOp simple div folding (#5740)
made UOp.divides return the Optional[quotient] and used it for simple div folding
2024-07-26 17:14:32 -04:00
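The idea in #5740 of returning an optional quotient from a divisibility check can be sketched as follows. This is a simplified model, not tinygrad's code: a term is represented as a hypothetical `(coefficient, variable)` pair, and `divides` here is a free function rather than the real `UOp.divides` method.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for a symbolic term: (6, "x") means 6*x.
Term = Tuple[int, str]

def divides(term: Term, c: int) -> Optional[Term]:
    """Return term / c if c divides the term exactly, else None."""
    coeff, var = term
    if c != 0 and coeff % c == 0:
        return (coeff // c, var)
    return None

# (6*x) // 3 folds to 2*x because 3 divides the coefficient exactly.
print(divides((6, "x"), 3))
# (6*x) // 4 cannot be folded this way, so the caller keeps the division.
print(divides((6, "x"), 4))
```

Returning the quotient instead of a boolean lets the caller reuse the result directly as the folded expression, rather than checking divisibility and then recomputing the division.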
chenyu 671259417f
reuse UOp `__repr__` for NOp (#5738) 2024-07-26 16:59:55 -04:00
kormann b0c1dba299
named UOp class "NOP" [run_process_replay] (#5728)
* NOP

* fix const + simplify compile

* rm VAR for NOOP

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00
George Hotz 4df46eac67
clean up tensor cores [run_process_replay] (#5736)
* clean up tensor cores [run_process_replay]

* remove tuple(wmma_sz), self.opts.device

* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
qazal 94d578396f
separate process replay main loop (#5734)
* separate process replay main loop

* [run_process_replay]

* add kernel_changed

* test with [run_process_replay]

* revert temp [run_process_replay]
2024-07-26 21:43:08 +03:00
chenyu 9838c1a6ff
update import style in runtime (#5735) 2024-07-26 14:00:23 -04:00
chenyu a4e9ebc68a
update test_uop_symbolic (#5733)
enabled more passing tests
2024-07-26 13:46:09 -04:00
George Hotz 5c688560bc
move CUDA/HIP compilers to their own files [run_process_replay] (#5732) 2024-07-26 10:00:15 -07:00
chenyu 2cc55a3095
UOp simple mul add div fold (#5726) 2024-07-25 22:00:30 -04:00
chenyu 78f75aa80d
remove redundant symbolic mod rule [run_process_replay] (#5725) 2024-07-25 21:21:02 -04:00
chenyu 5521b6d437
UOp simple mul-add-lt fold (#5721) 2024-07-25 20:49:38 -04:00
qazal 1b53207b4f
revert isolated dags scheduling (#5724) 2024-07-25 19:45:12 -04:00
chenyu 845b0d1c9d
UOp more generic div folding (#5722)
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
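The old and new folding conditions from #5722 can be compared on plain integer bounds. This is a minimal sketch using Python floor division on non-negative values; `fold_div_old` and `fold_div_new` are hypothetical names, and the real rule operates on UOps rather than integers.

```python
from typing import Optional

def fold_div_old(vmin: int, vmax: int, c: int) -> Optional[int]:
    """Old rule: x // c folds to 0 when 0 <= vmin <= vmax < c."""
    if 0 <= vmin <= vmax < c:
        return 0
    return None

def fold_div_new(vmin: int, vmax: int, c: int) -> Optional[int]:
    """New rule: x // c folds to a constant when c > 0 and both bounds
    give the same quotient. Floor division is monotonic in x, so every
    value between the bounds gives that quotient too."""
    if c > 0 and vmin // c == vmax // c:
        return vmin // c
    return None

# x in [8, 11], c = 4: every x // 4 is 2. Only the new rule sees it.
print(fold_div_old(8, 11, 4), fold_div_new(8, 11, 4))
# x in [0, 3], c = 4: both rules fold to 0.
print(fold_div_old(0, 3, 4), fold_div_new(0, 3, 4))
# x in [3, 4], c = 4: quotients differ (0 and 1), so neither rule folds.
print(fold_div_old(3, 4, 4), fold_div_new(3, 4, 4))
```

The new condition covers every case the old one did, since `0 <= vmin <= vmax < c` implies both quotients are 0.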
nimlgen fb8148077e
hcq do not update the same signal (#5719)
* hcq do not update the same signal

* import them
2024-07-26 00:24:45 +03:00
nimlgen 6ec9ea9ddd
hcq update_exec with optional params (#5708) 2024-07-26 00:04:57 +03:00
George Hotz 8b34ee2f52
remove global_size and local_size from Kernel class [run_process_replay] (#5720)
* remove global_size and local_size from Kernel class [run_process_replay]

* sizes from the prg
2024-07-25 13:55:08 -07:00
George Hotz 142b7fb22f
faster beam [run_process_replay] (#5718) 2024-07-25 11:58:41 -07:00
chenyu eff7c5fd2c
halve kernel counts in metal Fuzz Test linearizer (#5716)
the test time has increased to 3 minutes
2024-07-25 14:35:11 -04:00
George Hotz e877ed9688
cleaner uop expand [run_process_replay] (#5715)
* cleaner uop expand [run_process_replay]

* comments
2024-07-25 11:29:53 -07:00
chenyu a82815262c
more test_pattern_matcher fixups (#5714) 2024-07-25 14:12:21 -04:00
George Hotz b8b5411845 move Function to Developer section of docs 2024-07-25 11:05:23 -07:00
qazal f02124ffa0
rename to realize_reduceop (#5713)
* rename to realize_reduceop

* shorter comment
2024-07-25 20:57:33 +03:00