Commit Graph

575 Commits

qazal 9d2ea94fe9
temp: disable process replay on metal (#6062) 2024-08-13 16:31:55 +03:00
nimlgen 8f787785d9
fix openpilot benchmark (#6049) 2024-08-12 21:12:32 +03:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
George Hotz cfb04c67d1
run unit tests separate from others (and only once) (#6020)
* run unit tests separate from others

* ignore unit tests elsewhere
2024-08-10 11:17:56 -07:00
qazal 266afad8ed
hotfix: skip schedule capture in benchmarks (#6012) 2024-08-10 17:13:53 +03:00
qazal 24c7c41ce0
diff LazyBuffer schedules in process replay (#5996)
* start diff printing

* this should be 2

* add to process_replay.py

* enable schedule capture

* arange diff in process replay
2024-08-09 14:16:43 +03:00
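The schedule diff described in #5996 can be approximated with Python's difflib; a minimal sketch, assuming schedule items render usefully via str() (the function below is illustrative, not the actual process replay code):

```python
import difflib

# Minimal sketch: render each schedule item as text and print a unified diff
# when the two captured schedules disagree.
def print_schedule_diff(expected: list, actual: list) -> bool:
  a, b = [str(si) for si in expected], [str(si) for si in actual]
  if a == b: return False
  print("\n".join(difflib.unified_diff(a, b, fromfile="master", tofile="pr", lineterm="")))
  return True
```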
George Hotz 3d445039c2 hotfix: 8800 lines for AMX+intel tc 2024-08-06 17:50:26 -07:00
chenyu adba5efc64
enable llama 2 70B in tinybox green CI (#5905)
runnable with MAX_CONTEXT=256
2024-08-04 18:48:46 -04:00
George Hotz 7348c40d9d
sampling time sync (8700 lines) (#5843)
* sampling time sync

* jitter matrix

* comment

* pass mypy

* line count
2024-08-02 14:44:35 -07:00
wozeparrot acadccf344
comma benchmark (#5518) 2024-08-02 14:36:54 -07:00
chenyu f27f949a5d
Revert "revert some UOp IDIV bound (#5863)" (#5871)
This reverts commit 0c8d202348.
2024-08-01 21:38:31 -04:00
chenyu df138bc558
Revert "revert a mod pattern (#5864)" (#5870)
This reverts commit 5c8de2d044.
2024-08-01 20:44:26 -04:00
chenyu 1b0314d9ef
Revert "remove one more UOp mod pattern (#5865)" (#5868)
This reverts commit b03b8e18c2.
2024-08-01 20:28:35 -04:00
chenyu b03b8e18c2
remove one more UOp mod pattern (#5865)
fixed UOP_IS_SYMBOLIC=1 test_failure_40
2024-08-01 18:29:04 -04:00
chenyu 5c8de2d044
revert a mod pattern (#5864)
fixed UOP_IS_SYMBOLIC=1 linearizer failure 47
2024-08-01 17:24:26 -04:00
chenyu 0c8d202348
revert some UOp IDIV bound (#5863)
* revert some UOp IDIV bound

breaks conv with UOP_IS_SYMBOLIC, added some conv tests in CI

* those are correct

* skip slow ones
2024-08-01 15:09:06 -04:00
George Hotz 5eedd9e3ad raise the line ceiling to 8600. USE LINES CAREFULLY 2024-07-31 09:56:39 -07:00
wozeparrot eebb1b9922
feat: temperature 0 llama3 benchmark (#5806) 2024-07-30 12:05:36 -07:00
chenyu cb6718347f
`python -m mkdocs build --strict` in CI (#5800) 2024-07-29 16:46:30 -04:00
chenyu be3899d211
hotfix increase ci timeout to 20 minutes (#5799)
when the cache is cleared it takes time to repopulate it
2024-07-29 16:25:27 -04:00
chenyu 471b188d79
fix mypy errors in latest mypy (#5794)
* fix mypy errors in latest mypy

mypy has stricter partial and api arg checks now

* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
George Hotz 0392123e6e
TC=2 still sets tensor cores (and TC=3 support for locals) (#5780)
* TC=2 still sets tensor cores

* add TC=3 support for using locals

* bugfix

* lines + TC=3 tests

* CUDA can use threads, fix fuzz linearizer
2024-07-28 16:16:53 -07:00
qazal 3e49d86c01
process replay diffs 3 things now (#5731)
* github api infra

* process replay is 3 parts now

* parse benchmarks

* add gh_token

* complete diff

* move process replay tests

* last successful run

* add tempdir

* skip master
2024-07-27 12:52:20 +03:00
qazal 57b4a8e98d
assert process replay asserts (#5737)
* assert process replay asserts

* one ci job is fine

* test: Revert "separate process replay main loop (#5734)"

This reverts commit 94d578396f.

* mac sed needs that

* Revert "test: Revert "separate process replay main loop (#5734)""

This reverts commit e4ad7684d5472a64841a66b43bc1db7c9bbbf9e8.

* disable process replay capture

* save time

* amd is tiny

* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz db1d093b29
reenable LLaMA-3 8B BEAM on NV (#5746) 2024-07-26 16:56:41 -07:00
chenyu eff7c5fd2c
halve kernel counts in metal Fuzz Test linearizer (#5716)
the test time has increased to 3 minutes
2024-07-25 14:35:11 -04:00
chenyu 7c8fe0fe47
skip interpolate tests for PYTHON=1 (#5664) 2024-07-23 18:47:15 -04:00
George Hotz e3f00ac77d
Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand

* fix test emulated CUDA tensor cores

* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
qazal fdfc0015a7
[run_process_replay] for opencl/openpilot (#5009)
* lil reset script

* find the prg

* use lower_schedule_item

* add process replay back

* cleanups
2024-07-18 19:42:33 +03:00
wozeparrot 6ccb2390c3
feat: update_benchmark_staging (#5529) 2024-07-17 20:40:57 -07:00
George Hotz d3b098299d
add failing regression test for image (#5540)
* add failing regression test for image

* tg type

* simpler test

* don't realize image to image casts caused issue

* simple pad
2024-07-17 17:27:18 -07:00
wozeparrot 218e157f00
benchmark on update_benchmark_staging (#5541) 2024-07-17 17:11:52 -07:00
Alessandro Benetti 13e200b437
add strict mkdocs check (#5497) 2024-07-15 14:21:37 -07:00
qazal 40ec9410f9
simpler process replay (#5452)
* remove check_process_replay

* that can go to the top

* add assert back

* [run_process_replay]

* checkout code [run_process_replay]

* temp [run_process_replay]

* revert temp [run_process_replay]

* ahh this is why [run_process_replay]

* revert temp [run_process_replay]
2024-07-13 19:55:06 +03:00
George Hotz 955e1179fb
move compile tests and merge (#5451)
* move compile tests and merge

* revert enet move, bump download cache

* oh, try setting clang
2024-07-13 08:04:46 -07:00
chenyu 9a187e6102
fix handcode_opt script (#5435)
* fix handcode_opt script

* run in ci

* real run in ci

* HALF=0
2024-07-12 20:52:28 -04:00
George Hotz b055ece550 hotfix: bump to cache gpuocelot 2024-07-12 13:54:14 -07:00
chenyu b17e4adb3a
add `-c advice.detachedHead=false` to process replay git checkout (#5419)
remove the noisy `Note: switching to 'origin/master'.

You are in 'detached HEAD' state. You can look around, make experimental
changes...` in log
2024-07-12 15:13:26 -04:00
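For reference, the config flag goes before the git subcommand; a minimal sketch of such a quiet checkout driven from Python (the target ref is illustrative):

```python
import subprocess

# `-c advice.detachedHead=false` silences the long "detached HEAD" notice that
# otherwise appears when checking out a commit or remote ref directly.
subprocess.run(
    ["git", "-c", "advice.detachedHead=false", "checkout", "origin/master"],
    check=True,
)
```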
qazal 31fcc516dc
more process replay tooling (#5407)
* replays

* what's in there

* can it be up there

* sha is enough

* insert sha as the key

* fix str

* update reset utils

* that nested try/except was terrible

* github_context can go
2024-07-12 13:11:34 +03:00
Roelof van Dijk 6ec7dbc287
ci: parallelize uops tests (#5405) 2024-07-12 11:22:41 +03:00
qazal b91a0ccdc3
make [run_process_replay] [no_assert] the default (#5390) 2024-07-11 22:36:59 +03:00
qazal 004366b193
context aware process replay [run_process_replay] (#5378)
* test tc as ctx var

* remove from opts

* process replay

* pop variable

* B -> Variable

* fix re-assign

* pop temp vars

* move TRANSCENDENTAL=2
2024-07-11 13:07:28 +03:00
chenyu 2396ab9b33
more transcend cleanup [run_process_replay] (#5369)
fix test name, fewer # noqa: E501, and removed the cast
2024-07-10 23:05:03 -04:00
chenyu 64986f949c
more transcend math tests in ci (#5368)
* more transcend math tests in ci

test large inputs to trig functions that hit a different reduction algorithm, and test TRANSCENDENTAL=2 for all backends

* no CUDACPU

* try that
2024-07-10 21:19:09 -04:00
chenyu 322c37e621
use helpers.JIT in llama and gpt2 examples (#5350)
* use helpers.JIT in llama and gpt2 examples

replaced getenv("JIT"), effectively made gpt2 default jit

* fix test_gpt2
2024-07-09 15:04:43 -04:00
Ian Paul d5a68ae6b3
Simple abstractions3.py fix (#5343)
* abstractions3.py fix

* Add abstractions3.py to CI tests
2024-07-09 13:48:42 +03:00
chenyu 631bc974a0
raise line count limit to 8500 (#5331) 2024-07-08 14:00:28 -04:00
SnakeOnex 8c03816ae9
fix README example (#5284)
* fixed README example

* README test

* changed py -> python markdown code flags in README
2024-07-04 11:15:07 -04:00
chenyu 191463a919
add timing to SDXL (#5273) 2024-07-02 23:29:54 -04:00
chenyu 5808c37302
hotfix disable flaky llama3 beam benchmark on green (#5249) 2024-07-01 15:00:47 -04:00
chenyu b9122ecdaf
revert stable diffusion validation with threefry (#5248)
* Revert "use threefry in stable diffusion benchmark (#4988)"

This reverts commit 44dfa37c70.

* sdxl and validation fix

* relax threshold
2024-07-01 14:43:47 -04:00
nimlgen 57e89645cd
hcq spec test (#5226)
* start hcq spec test

* more test

* fixes

* run on amd as well

* test amdgpu exec

* fix amd

* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
chenyu 88763eb9ff
fix stable_diffusion with fp16 (#5239) 2024-06-30 12:59:31 -04:00
nimlgen dd7eef7d71
libc defs to autogen (#5217)
* libc defs to autogen

* amd import libc

* linter

* better a bit

* remove comment, check this

* not hardcoded path
2024-06-29 14:37:33 +03:00
nimlgen 6b08cb5e38
ptx runs on nv in benchmarks (#5224) 2024-06-29 11:06:44 +03:00
nimlgen b4c49ae3fa
remove cudacpu in favour of mockgpu (#5225)
* remove cudacpu in favour of mockgpu

* remove unused import

* not used as well
2024-06-29 11:05:16 +03:00
chenyu 7090eac8cb
validate sdxl output and put it in benchmark (#5211)
* validate sdxl output and put it in benchmark

* don't print fetch progress_bar in CI
2024-06-28 11:40:52 -04:00
chenyu d8dc43ad06
remove JIT_BATCH_SIZE=4 from gpt2 NV benchmark (#5198)
this no longer helps
2024-06-27 15:20:34 -04:00
chenyu 83da8b3558
use NV instead of CUDA in benchmark (#5192)
also reenabled mixtral on green
2024-06-27 13:52:58 -04:00
chenyu 0c6c7c5f7b
CACHELEVEL=0 -> IGNORE_BEAM_CACHE=1 in benchmark (#5191)
ignoring the beam cache while keeping the compile cache should be fine, and it saves some benchmark time.

also updated `beam_search` to check the flag value before accessing the diskcache
2024-06-27 13:15:18 -04:00
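A minimal sketch of the flag check described above, with a plain dict standing in for the disk cache (names and helpers are illustrative, not tinygrad's):

```python
import os

_disk_cache: dict = {}  # stand-in for the on-disk beam cache

def beam_search_cached(key: str, search_fn):
  # Consult the cache only when IGNORE_BEAM_CACHE is unset/0, so beam results
  # are recomputed while the compile cache elsewhere stays usable.
  use_cache = not int(os.environ.get("IGNORE_BEAM_CACHE", "0"))
  if use_cache and key in _disk_cache: return _disk_cache[key]
  result = search_fn()
  if use_cache: _disk_cache[key] = result
  return result
```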
chenyu c12de4f47d
benchmark use JITBEAM for llama and gpt2 (#5189) 2024-06-27 12:56:02 -04:00
qazal 3af17849bf
safely parse quoted titles [run_process_replay] (#5183) 2024-06-27 16:39:48 +03:00
qazal 6ca7b13ed1
limit pickled objects [run_process_replay] (#5154)
* limit pickled objects

* delete uop from the list

* debug metal

* need self.opts for TC

* dont need device

* [run_process_replay]

* minor
2024-06-26 13:51:32 +03:00
qazal 8aa786232d
docs for running process replay locally (#5083) 2024-06-21 09:55:08 -04:00
nimlgen fb1bf48cfe
io_uring for copies from disk (#5035)
* exp uring

* fixes and old version

* nv

* cleaner

* cmp vs aio

* fix

* no lib

* fix nv

* linter

* disk_speed_test now runs default

* fixes

* uring -> io_uring

* linter happy

* get_temp_buf comment added

* tiny nits

* put wait back

* test runs everywhere

* remove consts

* remove mmap consts

* do not require iouring to run test, they are generic
2024-06-21 11:36:51 +03:00
qazal 97f1347dd9
fix check_process_replay for special characters (#5072)
* 'test' [run_process_replay] [no_assert]

* test with ( ) { } '' " "

* remove the log [run_process_replay] '' () { } '{

* helpful echos [run_process_replay] [no_assert] () ''

* test [run_process_replay] [no_assert]

* test2 [run_process_replay] [no_assert]

* test3 [run_process_replay] [no_assert]

* it's also correct this way [run_process_replay] [no_assert]

* remove extras [run_process_replay]
2024-06-20 20:23:29 +03:00
qazal a6a5dba637
Revert "UPat for has_valid in load/store (#5052)" (#5056)
* manually insert in the Linearizer

* fix process replay
2024-06-19 20:53:36 +03:00
qazal ee01e464e3
use process replay as a diff creator (#4903)
* add no_assert option [run_process_replay] [no_assert]

* test [run_process_replay] [no_assert]

* [run_process_replay]

* back to normal [run_process_replay]

* remove the log
2024-06-19 18:17:31 +03:00
chenyu dc942bf1f6
jit sampling function in test_randomness.test_multinomial (#5034)
* jit sampling function in test_randomness.test_multinomial

`THREEFRY=1 python3 -m pytest test/test_randomness.py::TestRandomness::test_multinomial --durations 1` 7 sec -> 1.2 sec

* skip that
2024-06-18 14:21:05 -04:00
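A hedged sketch of the technique named in the title, assuming tinygrad's TinyJit API; it swaps in Gumbel-max sampling so fresh noise enters the jitted graph as an input each call (this is not the actual test change):

```python
from tinygrad import Tensor, TinyJit

# JIT the per-step sampling computation so repeated calls reuse compiled
# kernels instead of re-lowering the graph each time.
@TinyJit
def sample(logits: Tensor, noise: Tensor) -> Tensor:
  # Gumbel-max: argmax(logits - log(-log(U))) samples from softmax(logits)
  return (logits - (-noise.log()).log()).argmax(axis=-1).realize()

logits = Tensor([[1.0, 2.0, 3.0]])
for _ in range(5):
  print(sample(logits, Tensor.rand(1, 3)).item())
```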
chenyu e9c6a36894
remove CACHELEVEL=0 in llama3 benchmark (#5025) 2024-06-17 22:43:16 -04:00
chenyu acaf9a490d
RECIP(-0.0) should be -inf (#5024)
* RECIP(-0.0) should be -inf

added test_dtype_alu for PYTHON backend

* catch that

* fix those two
2024-06-17 22:26:58 -04:00
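The expected IEEE-754 behavior can be sanity-checked against numpy; a small illustration (not the PYTHON-backend test itself):

```python
import numpy as np

# Reciprocal of negative zero is negative infinity: the sign of zero is
# preserved through division.
x = np.float32(-0.0)
with np.errstate(divide="ignore"):
  print(np.float32(1.0) / x)  # -inf
print(np.signbit(x))          # True: -0.0 carries the sign bit
```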
George Hotz bee8fc29ee
add GPT2 half/half+beam to AMD (#5000)
* add GPT2 half/half+beam to AMD

* winograd in training. half and half/beam file upload
2024-06-16 14:07:14 -07:00
chenyu 44dfa37c70
use threefry in stable diffusion benchmark (#4988)
also updated default steps to 10. easier to tell the image is following the prompt.
2024-06-15 20:25:29 -04:00
wozeparrot ce1ed374c9
more tinychat fixes (#4971) 2024-06-15 16:29:39 -07:00
qazal ff8e9eefc3
hotfix: don't use ASSERT_COMPILE for benchmarks process replay (#4981)
* use replay_codegen [run_process_replay]

* disable for now [run_process_replay]
2024-06-15 16:57:47 +03:00
uuuvn 92f49efd06
Trigger process replay from pull request title [run_process_replay] (#4980)
* Trigger process replay from pull request title

* idk how this thing works btw

* test if it will work

* try 2

* Revert "idk how this thing works btw"

This reverts commit 580da51b07a243020f79b1c333c8a2349ea00beb.

* Revert "try 2"

This reverts commit 7ff1e86d5d15d1a1745a139db1e1c13c5903b366.

* test if it works

* meh

* Reapply "idk how this thing works btw"

This reverts commit dd33ad7c143d1649d3f071970aceeb266291d24f.

* revert
2024-06-15 16:21:00 +03:00
wozeparrot 62dc36d371
autogen _try_dlopen (#4949) 2024-06-14 12:12:18 -07:00
chenyu f902af4f0b
increase metal ci test timeout to 20 minutes (#4920)
make it less annoying for now
2024-06-11 18:45:51 -04:00
qazal 7f3d9e6d94
revert hsa autogen removal (#4914)
* Revert "only install comgr in AMD CI (#4909)"

This reverts commit 7f03420d05.

* rocm-llvm only removal
2024-06-11 12:55:45 -04:00
qazal 7f03420d05
only install comgr in AMD CI (#4909)
* test

* delete hsa autogen
2024-06-11 06:19:33 -04:00
qazal 8b5bcf309a
process replay in all of CI (#4884) 2024-06-10 14:49:29 -04:00
George Hotz f42183ba28 hotfix: relax cifar to 93.2 2024-06-09 13:09:21 +02:00
nimlgen 654a8b9ef7
retire hsa (#4885)
* retire hsa

* EMULATE_AMD
2024-06-09 11:33:03 +03:00
nimlgen 6327b50e51
amd in benchmarks (#4861)
* amd in benchmarks

* remove all hsa
2024-06-08 23:24:46 +03:00
qazal 66dfd5e7bf
faster codegen process replay (#4858)
* faster codegen process replay

* use self.copy

* regenerate

* delete copy

* test a real error [run_process_replay]

* revert the error change
2024-06-07 16:20:57 +03:00
qazal 0db9674dea
skip process replay on master (#4808) 2024-06-03 12:29:28 +03:00
qazal f64fa51a64
process replay for test/* (#4799)
* add input to unit tests [run_process_replay]

* add setup [run_process_replay]

* run tests [run_process_replay]

* add cuda and amd [run_process_replay]

* run everything but BEAM=2 [run_process_replay]

* skip export_model [run_process_replay]

* fix amd CI

* add concurrency back
2024-06-03 12:01:58 +03:00
qazal 240d6b5bc0
process replay benchmarks (#4668) 2024-06-01 14:36:21 +03:00
nimlgen bd2e7c8b31
amd registers from file (#4778)
* amd registers from file

* remove comments

* linter

* no off
2024-05-31 18:48:57 +03:00
Szymon Ożóg a4de81e9a6
Update ocelot version (#4715) 2024-05-24 14:32:53 -04:00
chenyu 38bc38cdff
fix llama example quantize (#4699)
* fix llama example quantize

import quantize layers from new example llama3

add to mac benchmark

* fix that

* save the files
2024-05-23 15:35:26 -04:00
chenyu 72560e30fe
add CACHELEVEL=0 to tinybox green GEMM BEAM (#4693)
* add CACHELEVEL=0 to tinybox green GEMM BEAM

* BEAM=4 is more stable
2024-05-22 23:59:50 -04:00
Yury Zhuravlev af56f0e68a
fix HSA/KFD load for system-wide installation (#4218)
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2024-05-22 20:33:21 -07:00
nimlgen 12339f6564
disable cuda test in ci (#4630)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-05-22 23:23:32 -04:00
qazal 498cf3e7e0
fuzzer path search for DEFINE_ACC (#4656)
* insert acc

* add test_ops

* find toposorts

* todo - not yet ready

* remove the import

* atol and childless children
2024-05-23 00:50:01 +03:00
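The "find toposorts" bullet hints at enumerating orderings; a hedged, self-contained sketch of enumerating every topological order of a small dependency graph, the kind of search a fuzzer can use to try different valid placements (names are illustrative):

```python
from collections import defaultdict

# Enumerate all topological orderings of a DAG by backtracking over the
# currently-ready nodes (indegree zero and not yet placed).
def all_toposorts(nodes, edges):
  indeg = {n: 0 for n in nodes}
  children = defaultdict(list)
  for a, b in edges:
    children[a].append(b)
    indeg[b] += 1
  order = []
  def rec():
    if len(order) == len(nodes):
      yield list(order)
      return
    for n in nodes:
      if indeg[n] == 0 and n not in order:
        for c in children[n]: indeg[c] -= 1
        order.append(n)
        yield from rec()
        order.pop()
        for c in children[n]: indeg[c] += 1
  yield from rec()

print(list(all_toposorts(["a", "b", "c"], [("a", "b"), ("a", "c")])))
# [['a', 'b', 'c'], ['a', 'c', 'b']]
```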
qazal 458a3961eb
catch compile errors in uops tests (#4672)
* use helper and compile

* llama beam=2

* ast length

* skip float4, fix hsa

* use empty tensors
2024-05-21 12:20:35 +03:00
wozeparrot 00432496d7
feat: tinyboxgreen (#4366)
* feat: tinyboxgreen

* feat: tinyboxgreenv2

* fix symlink weights

* fix: remove llama 2 70b for now

* feat: naming

* fix: remove extra cifar steps

* feat: disable mixtral on nvidia
2024-05-20 22:39:34 -04:00
chenyu 8a0d1ca7bb
CI test timeout 20 min -> 10 min (#4645)
if it takes more than 10 minutes, setup usually failed anyway. also updated matmul_kfd -> matmul_amd in benchmark
2024-05-18 13:58:28 -04:00
George Hotz b74cc1d01a
uops cleanup (#4634)
* def add cleanup

* minor speedup

* add back ptx speed

* a little faster

* merge that

* only linearize once for ptx

* two graph rewrites for ptx, bug?
2024-05-17 20:02:38 -07:00
George Hotz 07b350a8f4
new uops is an actual graph (#4560)
* new uops is an actual graph

* it's way slower

* simpler

* fix define acc

* render_loop unique

* ops test pass

* add pattern matcher back, there's bugs

* rewrite

* use priority queue

* recursive children

* fix tests

* fix tests with SINK

* fix abstractions

* fix assembly

* simpler

* link define_acc

* fix DEFINE_ACC placement

* type verify

* full cmp

* fix cmp

* ACCESS_ACC

* insert DEFINE_ACC

* fix PHI

* recursive rewrite

* fix many tests

* sum collapse

* more patterns

* correct change

* fold arange

* fix that lin test

* space

* big folding rule works

* close

* has more maxes, meh

* cached node replace

* set changed

* simplest folding yet

* works

* works

* DIV

* all tests pass

* del

* fuzz linearizer fails

* sum_collapse

* test depth 2 cf

* fix lin test 14

* fix clang depth

* disable that

* failure 14 is fixed

* fix ptx

* failure 27 is fixed

* fix llama

* run_cnt

* Revert "Optimize PTX gated loads index calculation (#4304)"

This reverts commit d97d5a7689.

* fix uops loop

* fix ptx bugs

* add barrier

* print

* mem_type in ptx direct

* bypass tests that fail in CI but pass locally

* ptx remove ptr_ar

* more ptx passing

* fix ptx tests

* assert compile support

* remove model inference benchmark from red
2024-05-17 18:00:18 -07:00
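As a rough illustration of the "ops graph plus local pattern rewrites" direction in #4560, a tiny self-contained sketch of bottom-up rewriting (the node type and the two rules are invented for illustration; this is not tinygrad's UOp or PatternMatcher):

```python
from dataclasses import dataclass

# Toy bottom-up rewriter: immutable nodes, and local simplification rules
# applied after a node's sources have already been rewritten.
@dataclass(frozen=True)
class Node:
  op: str
  src: tuple = ()
  arg: object = None

def rewrite(n: Node) -> Node:
  src = tuple(rewrite(s) for s in n.src)
  n = Node(n.op, src, n.arg)
  if n.op == "MUL" and src[1] == Node("CONST", (), 1):
    return src[0]                                      # x * 1 -> x
  if n.op == "ADD" and all(s.op == "CONST" for s in src):
    return Node("CONST", (), src[0].arg + src[1].arg)  # constant folding
  return n

expr = Node("MUL", (Node("ADD", (Node("CONST", (), 2), Node("CONST", (), 3))), Node("CONST", (), 1)))
print(rewrite(expr))  # Node(op='CONST', src=(), arg=5)
```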