* real strides with uops [run_process_replay]
* compare with old
* Revert "compare with old"
This reverts commit f53a8d42768e0b95d37b1bae8e80e288a69c6e3f.
* make those @unittest.expectedFailure
* Revert "late gate creation for STORE [run_process_replay] (#6373)"
This reverts commit c26744de9f.
* Revert "gated store rewrite to UOps.IF (#5976)"
This reverts commit 48061e8400.
* almost working with relu, even hackable... but acc size is wrong, fix needed
* upcast based on threads, change thread size to 4x4
* revert wrongfully commented assert
* fix tc load indexing
* modify for size 8
* fix bug for size 8
* Revert "fix bug for size 8"
This reverts commit cdb3f5df85b6116e8bef10214647a9201c400655.
* Revert "modify for size 8"
This reverts commit 3ef0904bd96291c7a3a351c702fba2905c196bcc.
* good kernel with changes in lowerer
* revert "good kernel with changes in lowerer"
This reverts commit 975e2b5a4ecfe475370e88ce9db78b2d42e4c4d4.
* good kernel for relu!
* refactor lowerer changes
* add amx context var to helper
* clean up amx flag
* improve lowerer changes readability
* improve check for amx
* revert lowerer if
* add float4 type rendering for clang
* add amx definitions
* enable indexing for clang if amx
* working amx example, wrong because of dims
* almost works for float16, need to stop using double load in amx
* cleaner render_kernel
* revert changes in simple_matmul and delete env
* add new var upcast_offset to get_optimized_ast
* change axis for axes
* invert if in rendering phi
* fix some bugs
* fix linearizer tests
* fix vec/get pat for amx
* remove clang tc if amx is disabled
* add ops_python support
* refactor into one complementary function in ops_python
* add job for EMULATE_AMX
* improve checking for AMX in UPCAST and TC extra ops
* fix lint issue
* commit before refactor into self-contained AMX
* start refactor by removing special rendering for AMX
* all ready for amx handcoded kernel
* working poc, most straightforward amx support
* avoid local opts for tc if amx
* fix merge bugs
* skip test for clang
* skip tc hand-coded opts if amx
* remove hardcoded ops_python values
* remove hardcoded sizes for amx kernel
* fix ops_python bug where dim was hard-coded
* change contract for vectorize
* working without changes in lowerer
* revert changes in gep rendering
* fix ops_python
* modify comment
* skip test if clang for different type accumulation
* move rename and bugfix to a separate PR
* fix wrong path for test
* addmm not implemented in torch for cpu
* change struct for vector; equally slow but cleaner
* revert modified test
* simplify wmma rendering
* minor change
* noqa: E501
* add length 16 for AMX
* fix vectorized half issue
* fix error
* remove comment
* change set for dedup
* split test of tensor_core_extra_ops so that cases that don't require locals run for AMX
* add amx reference
* load acc into amx registers
* fix dtype rendering and remove noqa
* moved tests change into another pr
* add real AMX job for CI and fix bug
* fix ops_python bug
* fix test class
* remove real AMX tests and fix uops_stats test
* remove wrong test
* acc folding
* hotfix: bug
* fix float4 tests for amx
* hack for fixing flops counting
* hotfix: mypy
* add flop counts test for amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_unaligned_load_amx
* nits tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Core change to gate stores in IFs
* Updates to cstyle renderer to handle IFs around STOREs
* Make uops asserts happy
* Add tests and fix newly broken tests
* make ruff happy
* make mypy happy
* Simplify renderer to have all gated stores use IF
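To illustrate what the commits above are converging on: a store predicated on a gate condition becomes a plain store nested under an IF/ENDIF pair. A minimal Python sketch of the semantics (illustrative only, not tinygrad's UOp graph):
def gated_store(buf, idx, val, gate):
  # the gate no longer lives on the STORE itself; it becomes an
  # enclosing IF (and matching ENDIF) in the rendered kernel
  if gate:
    buf[idx] = val
buf = [0, 0, 0, 0]
gated_store(buf, 2, 7, gate=True)
gated_store(buf, 3, 9, gate=False)
assert buf == [0, 0, 7, 0]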
* Revert some changes
* Make test_where_fold happy
* Revert unnecessary handling of ifs rendering; it was included before the changes were fully built out
* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE
* Re-change broken test
* Make ifs be grouped together
* get non-merged IFs working. All tests pass except grouping related ifs together
* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp
* Changes to uopgraph
* Simplify graph rewrite logic
* Changes to get test_padto_where_multireduce working
* Simplify uops.store renderer
* Make test_padto_where_multireduce pass but now other tests fail
* Clean up uopgraph from scratch work
* Ignore pseudo IF srcs when rendering
* Attempt to fix llvm tests
* rm comment
* reduce lines
* Add line to make mypy happy :(
* llvmir fix pt 1
* Mods after rebasing to master
* Fix llvmir
* Fix ptx tests
* Fix other ptx tests
* Move changes from uops.py to ops.py
* rm uops.py
* Fix TestGateStoreRewrite tests
* Get multireduce tests working
* reset to remote branch
* Fix linearizer tests
* uop_graph test patch
* Add comment to create_gate
* hotfix: uncomment those tests
* Attempt to fix ptx tests by including whitespace inside if block
* Patch from remote tinybox. Tests passing here
* Min changes to get some ptx tests passing
* Changes after rebase
* Exclude ifs and endifs from ptx
* IF conditional branching within ptx
* Save lines on delete_redundant_gates
* Simplify merge_gates
* rm noqa
* Remove unnecessary checks when merging gates
* Fix ops error msg
* Smarter check for if/endif in llvmir
* simplify delete redundant gates to only have 2 returns
* spacing
* Smarter check at beginning of merge_gates
* patches from comments
* Remove need for merge_gates
* include proper srcs in IF from the get-go
* test_expand_ifs_dumb will now result in 4 ifs, not 1
* Make tests happy
* Fix uops stats
* rm merge_gates method. Will add back in separate PR
* Spacing
* cleaner error msg
* Fix uops rendering when expanding. test_failure_43
* patch tests
* undo changes in delete_redundant_gates
* process replay attempt
* re-intro deletion of redundant gates
* fix addition of gates when they get nested in stores and loads
* patch tests
* smarter init of IF srcs when adding gate to STORE
* make ruff happy
* Resp to comment
* include all src[2]'s srcs in IF for gated store
* add reference of the storing value to the gate's src
* minor patch after rebasing
* change ptx renderer
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* math trait [run_process_replay]
* const -> const_like
* Revert "const -> const_like"
This reverts commit 85727c83d38f59e153333a3dbfa68f87b3a5a6ce.
* add MathTrait to LazyBuffer
* clean up function
* fixup the rest of function
* fix custom function
* mlb math trait
* fix that test
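A rough sketch of the math-trait idea in this block, assuming the point is to define arithmetic dunders once against a single alu() hook shared by UOp, LazyBuffer, and friends (names and ops illustrative):
class MathTrait:
  # subclasses provide alu(); the operator overloads come for free
  def alu(self, op, *src): raise NotImplementedError
  def __add__(self, x): return self.alu("ADD", x)
  def __mul__(self, x): return self.alu("MUL", x)
class Value(MathTrait):
  def __init__(self, v): self.v = v
  def alu(self, op, *src):
    if op == "ADD": return Value(self.v + src[0].v)
    if op == "MUL": return Value(self.v * src[0].v)
assert (Value(3) + Value(4)).v == 7
assert (Value(3) * Value(4)).v == 12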
* qcom: driver init
* autogen stubs for msm_kgsl; also fixup ioctls to show numbers instead of _IOW macros
* autogen: add adreno commands and registers
* ops_qcom: QcomAllocator + signals
* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom
* qcom: we do not really need all these constants; input/output is enough
* qcom: perfctr for CS (do not really need all the rest)
* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max
* qcom: explicitly set instruction len based on the shader size
* ops_qcom: Program init
- extracts shader from OpenCL binary
- sets input/output buffers
- allocates stack
- sets cs mode
- runs shader
* use data64_le from helpers
* ops_qcom: use fill_kernargs for filling i/o buffers
* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset
* new signals & fix exec
* add QCOM to the list of supported devices
* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM
* fix exec, synchronize before copyout
* correct setting num_units for ST_SHADER
* fix gpu hangs on sigs with CP_MEM_WRITE; it is uncached mem anyway
* extract offsets to kernel arguments from opencl binary
* extract constants values and offsets from opencl binary
* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly
* align kernel name to 4 bytes when skipping kernel opencl struct
* skip to consts directly using an offset from opencl binary header
* fix alloc
* get halfreg and fullreg from opencl bin
* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE
* parse prg offset from OpenCL binary
* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG
* support for vals in _fill_kernargs
* support 16-bit constants
* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts
this helps avoid faulting out when executing big kernels
/* Don't time out if the context has disabled it */
if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
return;
* minor changes to _exec
* QCOMRenderer
* disable HCQGraph for demo. TODO: support HCQ update api
* support HCQ
- remove copy queue
- add updates
- add strides for buffs and vars for QCOM
* bufs_stride
* clean ups
* linter
* call super().__init__(value) in QcomSignal
* disable=unused-import
* mypy
* type ignore when queue is on the device
* fix
* query gpu_id.
Will be useful for selecting commands, e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7.
* working timestamps
* free context after device is done
* move gpu stack to the device
* reserve some space with lib_gpu for gpu to write to
this fixes test_interpolate_bilinear
* exclude tests that fail with GPU=1 on Qualcomm
* lint
* unmap mem in _gpu_free
* context priority and preemption policy
* remove old qcom
* pass size to self.device.allocator.free
* skip tests only on qcom
* use kgsl and adreno defines instead of numeric vals
* use allocator for allocating lib_gpu
* update to QcomArgsState from master
* intermediate commit while conquering images
* enable image tests on qcom
* fix shader disasm size, dump textures stuff
* working images
* allow signals to be 0
* set branchstack from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* set shared memory size from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* update images in QcomArgsState & less loc for images
* set stack sizes from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* stack allocation based on OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* better autogen for kgsl and adreno. no more bitshifts
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* cleanup commit for parse cl lib
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* don't forget the actual generated files
* refactor + less loc
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* device.py back
* lint
* ruff
* timestamp divisor
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* fix tex fmt & round global size
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* dtypes
* 19.2MHz
* -1 loc in _update_exec
* remove noqa
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* looking into graph rewrite speed
* track, replace is slow
* if all same, no permutations [run_process_replay]
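The "if all same, no permutations" speedup above presumably cuts matcher work for commutative patterns: when every source is identical, each ordering is the same, so one attempt suffices. A hedged sketch of that idea:
from itertools import permutations
def src_orderings(srcs):
  # all sources equal -> every permutation is identical, try just one
  if all(s == srcs[0] for s in srcs): return [tuple(srcs)]
  return list(permutations(srcs))
assert len(src_orderings((1, 1, 1))) == 1
assert len(src_orderings((1, 2, 3))) == 6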
* types so compile works
* no implied comprehension
* TRACK_MATCH_STATS=2
* no UnaryOps.NEG in generated UOp patterns
removed pattern `x * (-1) -> -x` and `x != True`
* those are fine because NEG became CMPNE and True
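A quick check of the identity behind the removed patterns: for booleans, negation is exactly `x != True`, which is why CMPNE-with-True can replace NEG (illustrative only):
for x in (False, True):
  assert (not x) == (x != True)  # NEG on a bool == CMPNE with True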
* fix sd validation L2 norm
* faster rewrite, no folder in expand/reduce [run_process_replay]
* is removing the expander there okay
* parens
* don't reconstruct exact match uop
* fast do_reduce
* expand pyint
* most of the parents gains with fewer lines
* move cifar into datasets
* support for pathlib Tensors, tar_extract, and fetch gunzip
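Hypothetical usage of the features named above; the exact import paths and signatures are assumptions, not confirmed by this log:
from pathlib import Path
from tinygrad import Tensor  # assumes tinygrad is installed
p = Path("weights.bin")      # hypothetical file name
if p.exists():
  t = Tensor(p)              # per the commit: a Tensor constructed from a path
# tar_extract(...) and fetch(..., gunzip=True) are the other two helpers
# named above; their exact signatures are not shown in this log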
* too early for Device.DEFAULT
* simpler hlb_cifar + .to(None) is default
* new compiler failure, start beautiful_cifar
* beautiful cifar runs but is broken
* jit train step
* cleaner
* std_mean, not mean_std
* more correct
* fast indexing
* don't print that
* torch load broken
* add eval
* nicer bar
* decorators are the way to do this
* bounds check the load
* a few ops
* batchnorm bugfix, if track_running_stats is False, use online estimate
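A sketch of the batchnorm fix above, assuming PyTorch-style semantics: with track_running_stats=False there are no running statistics, so evaluation must use the current batch's ("online") mean and variance. Illustrative helper, not tinygrad's code:
def bn_stats(batch_mean, batch_var, running_mean, running_var,
             training, track_running_stats):
  # no running stats tracked -> always use the online batch estimate
  if training or not track_running_stats:
    return batch_mean, batch_var
  return running_mean, running_var
assert bn_stats(0.1, 1.2, 0.0, 1.0, training=False, track_running_stats=False) == (0.1, 1.2)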
* full timing
* fix fusion
* unneeded realize
* master tensor
* Fix track_running_stats in batchnorm
* Fix linter
* Update test_fold_conv_batchnorm_notrain to keep allowed at 1
* Add test_fold_conv_batchnorm_notrain_no_running_stats
* Save 1 line
* num_classes=-1
If num_classes is set to -1, the number of classes will be inferred as one greater than the largest class value in the input tensor.
* num_classes desc
comment to explain the num_classes default and what it means.
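A minimal sketch of that inference rule; `one_hot` here is an illustrative stand-in, not the library's implementation:
def one_hot(labels, num_classes=-1):
  # num_classes == -1: infer as one greater than the largest class value
  if num_classes == -1: num_classes = max(labels) + 1
  return [[1 if i == lbl else 0 for i in range(num_classes)] for lbl in labels]
assert one_hot([0, 2, 1]) == [[1, 0, 0], [0, 0, 1], [0, 1, 0]]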
* replacing ' with `
* add support for retain_graph in backward
* fix: dont accumulate grad on non-leaf tensors
* fix order
* fix: do not delete grad on leafs
* fix linter
* fix: can't exactly match torch behaviour internally
* allow numerical room for test
* refactor
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.
* fix benchmark
* remove extra dedup
* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (the only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* rewrite bool ADD to OR and MUL to AND
fixed running `tinyphysics.onnx`, which contains a getitem from a boolean tensor.
can only repro through BEAM_COMPARE, which i think is a different bug in test_linearizer_failure
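A truth-table check of the identities behind the rewrite: for booleans, ADD behaves like OR and MUL like AND (a sketch of the identity, not the rewrite rule itself):
for a in (False, True):
  for b in (False, True):
    assert bool(a + b) == (a or b)   # bool ADD -> OR
    assert bool(a * b) == (a and b)  # bool MUL -> AND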
* fold those, and fix tests
* only for bool
* move dtypes.bool
* bitcast & tests
* use to_dtype
* put disk tensor tests back
* tests
* bitmask
* no bitmask
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* test: uop and lazyop have the same compare
* typings
* self.assert_equiv_uops -> assertEqual
* hash dtype
* test nop too
* TestPatternMatcher never used this compare anyway
* nop eq and ne tests
* remove test_const_vectorize_fold
* remove const folding UPat for VECTORIZE
* refactor cstyle render_const
* remove calls to dtype.scalar() in render_const
* add assert
* add vectorized const to UOp.const
* add UPat GEP-VECTORIZE-CONST -> CONST
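A toy model of the GEP-VECTORIZE-CONST fold named above: taking element i of a vector built from constants is just the i-th constant, so the GEP/VECTORIZE pair can fold away. Tuples stand in for UOps here:
def vectorize(*xs): return tuple(xs)  # stand-in for UOps.VECTORIZE
def gep(vec, i): return vec[i]        # stand-in for UOps.GEP
assert gep(vectorize(1.0, 2.0, 3.0, 4.0), 2) == 3.0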
* render_vectorize for DEFINE_ACC in cstyle
* add back missing render_cast in render_const
* generate vectorized consts as UOps for DEFINE_ACC
* update asserts for DEFINE_ACC with VECTORIZE src
* add UPats for PHI with VECTORIZE src
* use prev rendered vectorize in DEFINE_ACC render
* update DEFINE_ACC in python runtime
* update vectorized DEFINE_ACC in PTXRenderer
* rebase DEFINE_ACC changes on lowerer
* verbose rewrite of bad UPats
* simplify UOps.CONST implementation in ops_python
* update sum_collapse UPats for DEFINE_ACC-VECTORIZE
* revert linearizer to TOT
* fix DEFINE_ACC implementation in ops_python
* simplify DEFINE_ACC in cstyle
* Fix linter error
* support VECTORIZE in fold gated load/store UPat
* support VECTORIZE in other fold gated load UPats
* rewrite VECTORIZE in UPat for no input DEFINE_ACC
* simplify DEFINE_ACC render in cstyle
* make VECTORIZE rules more concise
* add more vectorize fold tests
* inline VECTORIZE-CONSTs in cstyle render
* revert VECTORIZE/GEP rule refactor
* revert cstyle render_const refactor
* inline VECTORIZE-CONSTs in cstyle render
* implicitly vectorized const rendering -> explicit
* WMMA VECTORIZE CONST process replay hacks
* VECTORIZE CONST NAN process_replay hacks
* more VECTORIZE CONST NAN hacks
* cleanup process_replay hacks
* isnan() -> not isfinite() cstyle VECTORIZE CONST
* tweak isnan and isfinite checks VECTORIZE CONST
* tweak for positive vs negative infinity VECTORIZE CONST
* add assert to PTX CONST render
* process_replay VECTORIZE CONST render parity for PTX STORE
* vmin/vmax for VECTORIZE'd CONST
* update WMMA folding rules
* add tests for WMMA VECTORIZE fold
* hack for cstyle half4 CONST zero process_replay parity
* revert PTX backend changes
* add back minimal DEFINE_ACC PTX change
* remove cstyle process_replay hacks
* remove dead code in PTX CONST render
* cleanup vmin/vmax logic for VECTORIZE'd CONSTs
* update vectorize fold tests to use DEFINE_VAR
* fix long line formatting in test
* remove unwanted merge artifact
* more vmin/vmax cleanup
* remove unnecessary asserts
* yet more vmin/vmax cleanup
* get rid of explicit VECTORIZE CONST logic in _min_max
* reuse CONST instead of creating a new one
* remove unneeded cast
* handle DType correctly in sconst
* improve readability of tests
* save a line
* save another line
* tuplize pats in src
* remove GEP-VECTORIZE pats
* add vec +0 fold
* HACK: fold only vec8 +0
* remove vectorized ALU fold hack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* add toposort key to LBScheduleItem
* use dedup
* graph LBScheduleItem
* make that comment beautiful again
* diff_schedule utils
* update fuzz_schedule
* divide by gcd in UOp div folding
`(6x+6y)//16 -> (3x+3y)//8` etc
simpler version
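A quick numeric check of that folding: gcd(6, 16) == 2, so the coefficients and the divisor can all be halved without changing the floor division (checked on non-negative values; the follow-up commits restrict when the rule applies):
from math import gcd
assert gcd(6, 16) == 2
for x in range(64):
  for y in range(64):
    assert (6*x + 6*y) // 16 == (3*x + 3*y) // 8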
* only factor out const
* don't apply for unsigned
* don't need that if
* space