Commit Graph

775 Commits

George Hotz 9d72119a0c
minor resnet cleanups (#6382)
* minor resnet cleanups

* that should have been long

* jit

* meh
2024-09-06 12:50:21 +08:00
George Hotz 86d34daac9
UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383) 2024-09-06 12:38:35 +08:00
George Hotz 72be31cb56
remove mla [run_process_replay] (#6357)
* remove mla

* other bad uses of const
2024-09-05 10:37:46 +08:00
Vyacheslav Pachkov 4c33192a8b
add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl; also fix up ioctls to show numbers instead of _IOW macros

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants; input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts shader from OpenCL binary
sets input/output buffers
allocates stack
sets cs mode
runs shader

* use data64_le from helpers (see the address-packing sketch after this entry)

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix GPU hangs on sigs with CP_MEM_WRITE; it is uncached mem anyway

* extract offsets to kernel arguments from opencl binary

* extract constants values and offsets from opencl binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from OpenCL binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps avoid faulting when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TODO: support HCQ update API

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id. Will be useful for selecting commands, e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fail with GPU=1 on Qualcomm

* lint

* unmap mem in _gpu_free

* ctxt priority and preemption policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* don't forget actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
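
The "use data64_le from helpers" bullet above refers to a small address-packing idiom: Adreno command packets encode 64-bit GPU addresses as two 32-bit little-endian dwords. A minimal self-contained sketch (the helper mirrors tinygrad.helpers.data64_le; the `OPCODE` in the comment is a placeholder, not a real Adreno packet):

```python
# sketch of the data64_le idiom: 64-bit GPU addresses go into command
# packets as two 32-bit little-endian words. tinygrad ships this helper
# in tinygrad.helpers; it is redefined here so the example stands alone.
def data64_le(data: int) -> tuple[int, int]:
  return (data & 0xFFFFFFFF, data >> 32)  # (low dword, high dword)

gpu_addr = 0x7FABCDEF1000
lo, hi = data64_le(gpu_addr)
# a command queue would emit these as consecutive dwords,
# e.g. cmd += [OPCODE, lo, hi]
assert (hi << 32) | lo == gpu_addr
```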
wozeparrot cb61cfce24
feat: example and extra tweaks (#6310) 2024-08-28 19:26:11 -07:00
gswangg 94a72d44d2
update CI tests in extra with UOp AST (#6290) 2024-08-28 22:26:50 +03:00
Tobias Fischer 3517aa89d9
sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
Tobias Fischer 211bfb6d8a
fixed batched clip computation (#6292) 2024-08-26 20:48:15 -04:00
Tobias Fischer 331b0f5477
new clip gather (#6277) 2024-08-25 19:27:24 -04:00
qazal bcb2f1caa3
init REDUCE_AXIS with BinaryOps (#6256)
* REDUCE_AXIS arg with BinaryOps

* more work in kernel.py
fixup sops.gz

* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
qazal 0d4887e9df
use UOps.WMMA everywhere (#6255)
* add UOps.WMMA_AXIS

* delete ReduceOps.WMMA from ops
2024-08-23 15:03:26 -04:00
chenyu 590c0922b6
Tensor.prod (#6250)
* Tensor.prod

a new reduce op!

* onnx ReduceProd
2024-08-23 10:06:32 -04:00
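
The new reduce op follows the same API shape as Tensor.sum; a quick usage sketch (multiply-reduce over the whole tensor or along an axis):

```python
from tinygrad import Tensor

t = Tensor([[1., 2.], [3., 4.]])
print(t.prod().item())          # 24.0: product over all elements
print(t.prod(axis=1).tolist())  # [2.0, 12.0]: product along each row
```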
chenyu e745e16441
remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
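
With the NEG primitive gone, negation must be expressed through existing ops; the natural lowering is a multiply by -1. A minimal sketch of the identity (illustrating the idea, not the literal rewrite rule in the PR):

```python
from tinygrad import Tensor

x = Tensor([1., -2., 3.])
# without UnaryOps.NEG, -x reduces to a constant multiply
assert (-x).tolist() == (x * -1).tolist()  # [-1.0, 2.0, -3.0]
```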
Francis Lam 7376b67e36
extra/gemm/triton_nv_matmul: fix Program arguments (#6212)
remove op_estimate
2024-08-20 14:05:38 -07:00
Francis Lata 8fd8b970b0
update URL to eval cases from recent MLPerf file movements (#6201) 2024-08-20 08:43:13 -04:00
chenyu 9db2d0d5c6
fix some type error in onnx [run_process_replay] (#6153) 2024-08-17 19:54:20 -04:00
chenyu 7c9c8ce22f
use TensorProto enum in onnx dtype mapping [run_process_replay] (#6151) 2024-08-17 17:58:40 -04:00
George Hotz 9bc81c6db4
UOps.SHAPETRACKER (#6129)
* UOps.SHAPETRACKER [run_process_replay]

* no process replay
2024-08-16 23:26:34 -07:00
George Hotz 89c7989659
no shapetracker in ops [run_process_replay] (#6117) 2024-08-16 17:23:27 -07:00
George Hotz 74ee9febec
remove iter from uopgraph (#6110)
* remove iter from uopgraph

* linearize returns uops

* fix tests

* linearize in linearize

* tests fix

* touchup

* test failures
2024-08-16 15:58:29 -07:00
qazal 28c75bf2a6
merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal c23d44c779
AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
CaltropHungerton 38fb1e14a2
Intel XMX Tensor Core Support (#5622)
* fixed xmx demo

* I think I'm invoking the DPAS but it's slow

* compiler build arg to stop register spilling, indicated where to fix flop counter

* don't mind this

* do NOT mind me

* do not mind me

* do not view

* I will add bf16 later

* in process of figuring out tc fields

* we figured out the fields!!!

* added check for cl device vendor, added separate IntelRenderer

* remove tc thread_local_aliases

* cleaning debris before draft pr

* edits for linter

* deduping and checking device extensions

* I will find more line reductions in other places

* before merge upstream

* double GRF size in compiler to fix register spilling (band-aid), device checking changes

* tc python emulation

* fixed emulation

* tests for emulated intel tensor core

* TC=0, 1 working on upstream, fixed perf

* test

* debris

* check for specialized cl device when we canonicalize device

* bf16 support, tc=3 test added

* address tests

* revert half2 loads on intel tc, cleanup

* linter

* fold_expanded revert

* lint, whitespace fix

* CUDA bf16 (the only one with bf16) is skipped in test tensor cores, so I will skip for Intel bf16 too

* make line shorter, no need for noqa E501

* removed device intel

* fix python emulation

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
nimlgen 7ab531aede
autogen cleanup (#6064)
* start autogen cleanup

* nvgpu

* better?

* better

* amd part

* gpu regen

* fix mockgpu amd

* nv

* amd fix linter

* remove import

* ugh

* nv on master

* amd on master
2024-08-14 20:20:35 +03:00
wozeparrot 059cf2a90d
feat: autogen from kernel register offset headers (#6056) 2024-08-12 14:08:35 -07:00
wozeparrot dc2617bffd
feat: use more correct reg for local dims (#6048) 2024-08-12 11:15:37 -07:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
wozeparrot d269bc95fa
faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
George Hotz bc55c8a30e
pmatmul example + GB/s bugfix [run_process_replay] (#5974)
* pmatmul example + bugfix

* improve pmatmul

* Update real_pmatmul.py
2024-08-07 22:32:11 -07:00
George Hotz bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
wozeparrot 5808e8a30f
mockgpu remu changes (#5925) 2024-08-05 19:26:58 -07:00
wozeparrot 6740a0a6a0
hip_ioctl changes (#5917) 2024-08-05 11:58:38 -07:00
chenyu 996ff0c135
pow(2) -> square in RMSNorm [run_process_replay] (#5901)
reads nicer in metadata
2024-08-04 14:21:31 -04:00
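
For context, RMSNorm scales activations by the reciprocal root-mean-square, and this change swaps x.pow(2) for the equivalent x.square(). A minimal sketch in the style of the llama RMSNorm in tinygrad's examples (eps value and weight handling simplified):

```python
from tinygrad import Tensor

class RMSNorm:
  def __init__(self, dim: int, eps: float = 1e-5):
    self.eps, self.weight = eps, Tensor.ones(dim)

  def __call__(self, x: Tensor) -> Tensor:
    # x.square() instead of x.pow(2): same math, reads nicer in metadata
    return x * (x.square().mean(axis=-1, keepdim=True) + self.eps).rsqrt() * self.weight

print(RMSNorm(4)(Tensor.randn(2, 4)).shape)  # (2, 4)
```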
Elias Wahl 4a114756f6
New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
nimlgen 34168a64e3
optimize nv profiler (#5856)
* nv profiler fix

* cleanup hcq a bit

* fixes

* fix

* typo

* all signals put timestamp

* a bit cleaner

* merge fields

* type

* import

* tiny fix
2024-08-01 23:57:45 +03:00
Vyacheslav Pachkov 610e454132
fix opencl_ioctl on comma (#5814)
- remove unused code
- add CP_REG_TO_MEM opcode
- fixed parse_cmd_buf for more than one command object by correcting an offset
- fixed memory mappings for cases when memory was allocated with KGSL_MEMFLAGS_USE_CPU_MAP. If that flag is set on call and return, the returned GPU address will be 0; calling mmap() then sets the GPU address. So there are no IOCTL_KGSL_GPUOBJ_INFO ioctls for that type of memory, and it resulted in a crash right after get_mem (see the sketch after this entry).
2024-07-30 20:44:06 -07:00
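
The KGSL_MEMFLAGS_USE_CPU_MAP fix above amounts to deferring the GPU-address bookkeeping until mmap(). A hedged sketch of that tracking logic (the on_alloc/on_mmap helpers and the mappings dict are illustrative, not the actual opencl_ioctl code; verify the flag value against your msm_kgsl.h):

```python
# illustrative bookkeeping sketch, not the actual opencl_ioctl code
KGSL_MEMFLAGS_USE_CPU_MAP = 0x10000000  # assumed value; check msm_kgsl.h

mappings: dict[int, int] = {}  # gpuaddr -> size

def on_alloc(flags: int, gpuaddr: int, size: int):
  # with USE_CPU_MAP, the alloc ioctl returns gpuaddr == 0 and no
  # IOCTL_KGSL_GPUOBJ_INFO follows, so there is nothing to record yet
  if flags & KGSL_MEMFLAGS_USE_CPU_MAP and gpuaddr == 0: return
  mappings[gpuaddr] = size

def on_mmap(addr: int, size: int):
  # mmap() establishes the GPU address for USE_CPU_MAP objects; record it
  # here so a later lookup (e.g. in get_mem) does not crash on a miss
  mappings[addr] = size
```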
David Hou 9a485f36e4
shard kvcache (#5830) 2024-07-30 20:29:54 -07:00
George Hotz 4e89d45513 hotfix: put contiguous back in llama 2024-07-30 18:43:48 -07:00
George Hotz 21c5e8e1b7
extreme llama speed, 57.34 tok/s (#5827)
* extreme llama speed

* mergable
2024-07-30 18:32:09 -07:00
George Hotz e6879035a0
work to make GEMV fast (#5824)
* work to make GEMV fast

* half8 cast

* align struct

* fix amd

* float8 is a later problem
2024-07-30 17:41:40 -07:00
Francis Lata ce61be16f1
clean up how preprocessed folder is defined (#5813) 2024-07-30 12:35:26 -04:00
chenyu 471b188d79
fix mypy errors in latest mypy (#5794)
* fix mypy errors in latest mypy

mypy has stricter partial and api arg checks now

* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
nimlgen ea27ec4cd0
nv switch classlist_v2 to classlist (#5763)
* nv switch classlist_v2 to classlist

* support in mockgpu

* fix mockgpu
2024-07-28 20:24:42 +03:00
chenyu 3686b6726a
move GraphException to jit.py (#5744)
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
George Hotz 489a5b99a5 hotfix: triton_nv_matmul touchups 2024-07-24 23:24:29 +00:00
George Hotz bf24be4c8c triton gets 163 TFLOPS on 4090 2024-07-24 18:32:29 +00:00
George Hotz 4d47968580
fix acc folding for NV tensor cores (#5658)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
nimlgen 08a9c0ae5e
hcq cache invalidation for beam (#5630)
* nv full cache invalidation

* the same command on amd

* linter

* fix amd

* nv no hardcoded consts

* beam default
2024-07-22 18:13:17 +03:00
George Hotz 6c6d74d922
parallel mcts (#5626)
* start work on parallel mcts

* compile was linearizing twice

* typing + more early stopping

* fix compiler error
2024-07-21 14:53:23 -07:00
George Hotz ef179087a4
mcts exit condition wasn't right, also use it with BEAM>=100 (#5619)
* mcts exit condition wasn't right, also use it with BEAM>=100

* mcts touchups

* clean up sample
2024-07-21 10:16:47 -07:00