* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* rewrite bool ADD to OR and MUL to AND
fixed running `tinyphysics.onnx`, which contains a getitem from a boolean tensor.
can only repro through BEAM_COMPARE, which i think is a different bug in test_linearizer_failure
* fold those, and fix tests
* only for bool
* move dtypes.bool
* bitcast & tests
* use to_dtype
* put disk tensor tests back
* tests
* bitmask
* no bitmask
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
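
The bool ADD -> OR and MUL -> AND rewrite in the change above can be checked exhaustively over the four boolean input pairs. This is a minimal sketch in plain Python, not the actual rewrite rule from the commit; `bool_add` and `bool_mul` are hypothetical helpers, and it assumes that a bool-typed ADD/MUL means "do the integer op, then cast back to bool".

```python
from itertools import product

def bool_add(a: bool, b: bool) -> bool:
    # hypothetical helper: integer add, then cast back to bool (nonzero -> True)
    return bool(int(a) + int(b))

def bool_mul(a: bool, b: bool) -> bool:
    # hypothetical helper: integer multiply, then cast back to bool
    return bool(int(a) * int(b))

# exhaustively compare against OR / AND on all four input pairs
for a, b in product([False, True], repeat=2):
    assert bool_add(a, b) == (a or b)
    assert bool_mul(a, b) == (a and b)
```

Under that assumption the two forms agree on every input, so swapping in the logical ops should not change results for bool dtypes.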
* test: uop and lazyop have the same compare
* typings
* self.assert_equiv_uops -> assertEqual
* hash dtype
* test nop too
* TestPatternMatcher never used this compare anyway
* nop eq and ne tests
* remove test_const_vectorize_fold
* remove const folding UPat for VECTORIZE
* refactor cstyle render_const
* remove calls to dtype.scalar() in render_const
* add assert
* add vectorized const to UOp.const
* add UPat GEP-VECTORIZE-CONST -> CONST
* render_vectorize for DEFINE_ACC in cstyle
* add back missing render_cast in render_const
* generate vectorized consts as UOps for DEFINE_ACC
* update asserts for DEFINE_ACC with VECTORIZE src
* add UPats for PHI with VECTORIZE src
* use prev rendered vectorize in DEFINE_ACC render
* update DEFINE_ACC in python runtime
* update vectorized DEFINE_ACC in PTXRenderer
* rebase DEFINE_ACC changes on lowerer
* verbose rewrite of bad UPats
* simplify UOps.CONST implementation in ops_python
* update sum_collapse UPats for DEFINE_ACC-VECTORIZE
* revert linearizer to TOT
* fix DEFINE_ACC implementation in ops_python
* simplify DEFINE_ACC in cstyle
* Fix linter error
* support VECTORIZE in fold gated load/store UPat
* support VECTORIZE in other fold gated load UPats
* rewrite VECTORIZE in UPat for no input DEFINE_ACC
* simplify DEFINE_ACC render in cstyle
* make VECTORIZE rules more concise
* add more vectorize fold tests
* inline VECTORIZE-CONSTs in cstyle render
* revert VECTORIZE/GEP rule refactor
* revert cstyle render_const refactor
* inline VECTORIZE-CONSTs in cstyle render
* implicitly vectorized const rendering -> explicit
* WMMA VECTORIZE CONST process replay hacks
* VECTORIZE CONST NAN process_replay hacks
* more VECTORIZE CONST NAN hacks
* cleanup process_replay hacks
* isnan() -> not isfinite() cstyle VECTORIZE CONST
* tweak isnan and isfinite checks VECTORIZE CONST
* tweak for positive vs negative infinity VECTORIZE CONST
* add assert to PTX CONST render
* process_replay VECTORIZE CONST render parity for PTX STORE
* vmin/vmax for VECTORIZE'd CONST
* update WMMA folding rules
* add tests for WMMA VECTORIZE fold
* hack for cstyle half4 CONST zero process_replay parity
* revert PTX backend changes
* add back minimal DEFINE_ACC PTX change
* remove cstyle process_replay hacks
* remove dead code in PTX CONST render
* cleanup vmin/vmax logic for VECTORIZE'd CONSTs
* update vectorize fold tests to use DEFINE_VAR
* fix long line formatting in test
* remove unwanted merge artifact
* more vmin/vmax cleanup
* remove unnecessary asserts
* yet more vmin/vmax cleanup
* get rid of explicit VECTORIZE CONST logic in _min_max
* reuse CONST instead of creating a new one
* remove unneeded cast
* handle DType correctly in sconst
* improve readability of tests
* save a line
* save another line
* tuplize pats in src
* remove GEP-VECTORIZE pats
* add vec +0 fold
* HACK: fold only vec8 +0
* remove vectorized ALU fold hack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>