Commit Graph

2686 Commits

qazal 4259311006
merge views in conv swizzle (#6464) 2024-09-11 10:11:01 +08:00
qazal 803b8b9313
conv bw schedule and correctness tests to iterate on (#6461)
first to fix AST_REWRITE=1, then to implement the same fusion for dtypes.half.
2024-09-11 08:47:07 +08:00
chenyu b574caadc9
fix UOp const_factor for ADD [run_process_replay] (#6459)
currently not used, fixed for completeness
2024-09-10 20:04:26 -04:00
chenyu 2105832b87
_min_max of MUL of 2 non-positive inputs (#6454) 2024-09-10 07:13:01 -04:00
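The bounds rule this commit adds can be sketched in plain Python (illustrative only, not tinygrad's actual `_min_max` implementation): if both factors of a MUL are known non-positive, the product's range is `[x.max * y.max, x.min * y.min]`.

```python
from itertools import product

# Bounds rule sketch: with xmax <= 0 and ymax <= 0, the smallest product comes
# from the two values closest to zero, the largest from the two most negative.
def mul_min_max(xmin, xmax, ymin, ymax):
    assert xmax <= 0 and ymax <= 0, "rule applies to two non-positive inputs"
    return xmax * ymax, xmin * ymin

# brute-force check against every concrete pair in the ranges
lo, hi = mul_min_max(-4, -1, -3, 0)
vals = [a * b for a, b in product(range(-4, 0), range(-3, 1))]
assert lo == min(vals) and hi == max(vals)
```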
qazal f4f705a07c
can push SWIZZLE through reduce both ways (#6453) 2024-09-10 16:00:50 +08:00
qazal 1347e49e82
second iteration on UOps.SWIZZLE (#6451)
* new swizzle

* fix the failing tests

* test a double swizzle

* ci
2024-09-10 14:43:21 +08:00
chenyu e0d35e3657
update test_padto_sum_not_ok (#6450)
updated the setup as `exp() < -1` could be folded to False
2024-09-09 22:46:42 -04:00
qazal 95c9fe841e
UOp.st infra for the new SWIZZLE (#6449) 2024-09-10 09:39:45 +08:00
qazal abfbd9fd2f
fix Variable init from the DEFINE_VAR refactor (#6448)
prereq for UOps.VALID.
2024-09-10 09:14:29 +08:00
chenyu fcc69adfc5
simplify c0*x<c1 for negative int c0,c1 (#6431)
* simplify c0*x<c1 for negative int c0,c1

* fine if rhs is zero
2024-09-09 21:05:53 -04:00
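The fold this commit describes can be sketched as follows (a hedged pure-Python check, not tinygrad's symbolic code): dividing `c0*x < c1` by a negative `c0` flips the comparison, and Python's floor division (which rounds toward negative infinity) makes the integer bound exact.

```python
# For negative integer c0: c0*x < c1  <=>  x > c1 // c0.
def fold_lt(c0, c1):
    assert c0 < 0
    return c1 // c0  # floor division toward -inf keeps the sign flip exact

# brute-force verification over a grid of constants and x values
for c0 in range(-5, 0):
    for c1 in range(-10, 11):
        bound = fold_lt(c0, c1)
        for x in range(-20, 21):
            assert (c0 * x < c1) == (x > bound)
```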
qazal 29e63097a0
st is a cached_property on UOp [run_process_replay] (#6433) 2024-09-10 08:30:35 +08:00
George Hotz 904f6a63fa
Revert "Revert "cleanup process_replay/* namings [run_process_replay] (#6429)…" (#6442)
This reverts commit eda177da84.
2024-09-10 07:00:16 +08:00
George Hotz dbd4536167
Revert "add UOps.VALID (#6387)" (#6441)
This reverts commit 8186e4e7d6.
2024-09-09 21:33:00 +08:00
George Hotz eda177da84
Revert "cleanup process_replay/* namings [run_process_replay] (#6429)" (#6437)
This reverts commit f4e83b30b4.
2024-09-09 18:52:36 +08:00
George Hotz 42e5c8335e
remove args from min/max [run_process_replay] (#6430)
* remove args from min/max [run_process_replay]

* it's a ConstType

* sconst_like unused

* any const is fine
2024-09-09 18:18:20 +08:00
qazal f4e83b30b4
cleanup process_replay/* namings [run_process_replay] (#6429) 2024-09-09 16:59:04 +08:00
George Hotz 8186e4e7d6
add UOps.VALID (#6387)
* uops valid

* broke full_shape

* fixup that st (hardcoded asts still red)

* fixup DEFINE_VAR

debug

more debug

* start moving stuff to ast_const

* move test_linearizer

* move test_linearizer_failures to ast_const

* fixup test_schedule

* small diff change

* regenerate dataset

* fixup test_multitensor

* regen dataset try 2

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-09 16:58:43 +08:00
qazal 935b6b658f
delete seen from the scheduler api [run_process_replay] (#6427)
docs
2024-09-09 16:26:34 +08:00
chenyu 1941e66cc9
real strides with uops (#6365)
* real strides with uops [run_process_replay]

* compare with old

* Revert "compare with old"

This reverts commit f53a8d42768e0b95d37b1bae8e80e288a69c6e3f.

* make those @unittest.expectedFailure
2024-09-09 03:06:27 -04:00
chenyu ac98f5056e
move lt-folding to a function [run_process_replay] (#6422)
and added more tests (some failed to match symbolic)
2024-09-09 02:04:52 -04:00
qazal ff8a9ac3c1
test new style gated store rendering (#6413)
* test new style gated store rendering

* switch to lidx

* make lidx optional

* fixup [run_process_replay]
2024-09-09 13:59:22 +08:00
George Hotz 90fb17304f
put rewrite back in ops [run_process_replay] (#6421) 2024-09-09 13:53:51 +08:00
qazal 442150a8df
more ast_const for hardcoding consts [run_process_replay] (#6418) 2024-09-09 11:35:08 +08:00
chenyu 25af78c593
failed uop_symbolic divmod test by variable (#6414) 2024-09-08 23:08:58 -04:00
chenyu ad05302232
tests of real_stride of symbolic shape (#6409)
these would have failed in #6365
2024-09-08 21:37:19 -04:00
qazal 935b4ddff6
use ast_const in test_linearizer asts [run_process_replay] (#6407) 2024-09-09 08:46:58 +08:00
qazal 9a67ec6174
refactor to list of kernels [run_process_replay] (#6403) 2024-09-08 17:19:45 +08:00
chenyu 7df4373fd9
tensor reduction touchup (#6402)
- fixing spacing
- use get_args to get valid Literal values and raise ValueError to match, and a test for that
- use `Y` to be consistent
2024-09-08 03:55:51 -04:00
Irakli Salia 2e01efc35f
tensor roll (#6375)
* tensor roll function and tests

* fix type annotations

* reduce line count

* more readable
2024-09-07 05:14:28 +08:00
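What roll does on one axis can be shown with a flat list (a pure-Python sketch matching torch.roll semantics, not the Tensor.roll implementation): elements shift right by `shift` positions and wrap around.

```python
# 1-D roll sketch: negative shifts roll left, shifts larger than the length wrap.
def roll(xs, shift):
    n = len(xs)
    if n == 0:
        return list(xs)
    k = shift % n  # normalize the shift into [0, n)
    return xs[-k:] + xs[:-k] if k else list(xs)
```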
Tim Becker dfb818788e
Support `reduction` parameter in more loss functions (#6302) 2024-09-07 05:11:20 +08:00
chenyu 26c5d8346a
remove Variable from UOp.DEFINE_VAR (#6393)
now it's just arg = (expr as str, min as UOp.const, max as UOp.const)
2024-09-06 05:55:19 -04:00
chenyu 9ed2b8b818
fix DEFINE_VAR setup in test_uop_graph [run_process_replay] (#6392)
making sure arg always have 3 items
2024-09-06 05:32:12 -04:00
George Hotz 282af21b95 hotfix: DEBUG_EXPAND -1 and NOOPT in benchmark schedule 2024-09-06 17:22:30 +08:00
chenyu 9a9fea7b8c
move DEFINE_VAR min/max from src to arg (#6388)
new arg is (Variable, min as CONST, max as CONST)
2024-09-06 05:01:02 -04:00
qazal f1bd2a5519
fix BUFFER_UOPS sts in verify_ast [run_process_replay] (#6389) 2024-09-06 16:59:22 +08:00
chenyu cc05016fa8
move test_pattern_matcher to test/unit (#6386) 2024-09-06 03:22:43 -04:00
George Hotz 86d34daac9
UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383) 2024-09-06 12:38:35 +08:00
chenyu 002303c145
fix output of truncate_fp16 (#6381)
make sure the non-inf path returns the truncated value
2024-09-05 22:55:43 -04:00
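The behavior this fix guards can be sketched with the stdlib's IEEE half-precision pack format (a hedged sketch, not tinygrad's truncate_fp16): a round-trip through fp16 must return the truncated value on the non-inf path, and overflowing magnitudes map to signed infinity.

```python
import math, struct

# fp16 truncation sketch: struct's "e" format is IEEE binary16; values too
# large to pack raise, and we map those to +/-inf like the overflow path.
def truncate_fp16(x: float) -> float:
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)
```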
George Hotz c88329244b
create rewrite.py [run_process_replay] (#6379)
* create rewrite.py [run_process_replay]

* fix tests

* not in rewrite or ops

* skip flaky test
2024-09-06 10:51:01 +08:00
George Hotz 66e7e51c79
Revert beam failure (#6376)
* Revert "late gate creation for STORE [run_process_replay] (#6373)"

This reverts commit c26744de9f.

* Revert "gated store rewrite to UOps.IF (#5976)"

This reverts commit 48061e8400.
2024-09-06 09:36:44 +08:00
ignaciosica c15506fc35
[WIP] amx support as TC (#5693)
* almost working with relu, even hackable... but acc size is wrong, fix needed

* upcast based on threads, change thread size to 4x4

* revert wrongfully commented assert

* fix tc load indexing

* modify for size 8

* fix bug for size 8

* Revert "fix bug for size 8"

This reverts commit cdb3f5df85b6116e8bef10214647a9201c400655.

* Revert "modify for size 8"

This reverts commit 3ef0904bd96291c7a3a351c702fba2905c196bcc.

* good kernel with changes in lowerer

* revert "good kernel with changes in lowerer"

This reverts commit 975e2b5a4ecfe475370e88ce9db78b2d42e4c4d4.

* good kernel for relu!

* refactor lowerer changes

* add amx context var to helper

* clean up amx flag

* improve lowerer changes readability

* improve check for amx

* revert lowerer if

* add float4 type rendering for clang

* add amx definitions

* enable indexing for clang if amx

* working amx example, wrong because of dims

* almost works for float 16, need to stop using double load in amx

* cleaner render_kernel

* revert changes in simple_matmul and delete env

* add new var upcast_offset to get_optimized_ast

* change axis for axes

* invert if in rendering phi

* fix some bugs

* fix linearizer tests

* fix vec/get pat for amx

* remove clang tc if amx is disabled

* add ops_python support

* refactor into one complementary function in ops_python

* add job for EMULATE_AMX

* improve checking for AMX in UPCAST and TC extra ops

* fix lint issue

* commit before refactor into autocontained AMX

* start refactor by removing special rendering for AMX

* all ready for amx handcoded kernel

* working poc, most straightforward amx support

* avoid local opts for tc if amx

* fix merge bugs

* skip test for clang

* skip tc hand-coded opts if amx

* remove hardcoded ops_python values

* remove hardcoded sizes for amx kernel

* fix ops_python bug where dim was hard-coded

* change contract for vectorize

* working without changes in lowerer

* revert changes in gep rendering

* fix ops_python

* modify comment

* skip test if clang for different type accumulation

* move rename and bug for separate pr

* fix wrong path for test

* addmm not implemented in torch for cpu

* change struct for vector; equally slow but cleaner

* revert modified test

* simplify wmma rendering

* minor change

* noqa:501

* add length 16 for AMX

* fix vectorized half issue

* fix error

* remove comment

* change set for dedup

* split test of tensor_core_extra_ops so that cases that dont require locals run for AMX

* add amx reference

* load acc into amx registers

* fix dtype rendering and remove noqa

* moved tests change into another pr

* add real AMX job for CI and fix bug

* fix ops_python bug

* fix test class

* remove real AMX tests and fix uops_stats test

* remove wrong test

* acc folding

* hotfix: bug

* fix float4 tests for amx

* hack for fixing flops counting

* hotfix: mypy

* add flop counts test for amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_unaligned_load_amx

* nits tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-06 09:01:10 +08:00
qazal c26744de9f
late gate creation for STORE [run_process_replay] (#6373) 2024-09-06 03:32:19 +08:00
Ian Paul 48061e8400
gated store rewrite to UOps.IF (#5976)
* Core change to gate stores in IFs

* Updates to cstyle renderer to handle IFs around STOREs

* Make uops asserts happy

* Add tests and fix newly broken tests

* make ruff happy

* make mypy happy

* Simplify renderer to have all gated stores use IF

* Revert some changes

* Make test_where_fold happy

* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out

* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE

* Re-change broken test

* Make ifs be grouped together

* get non-merged IFs working. All tests pass except grouping related ifs together

* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp

* Changes to uopgraph

* Simplify graph rewrite logic

* Changes to get test_padto_where_multireduce working

* Simplify uops.store renderer

* Make test_padto_where_multireduce pass but now other tests fail

* Clean up uopgraph from scratch work

* Ignore pseudo IF srcs when rendering

* Attempt to fix llvm tests

* rm comment

* reduce lines

* Add line to make mypy happy :(

* llvmir fix pt 1

* Mods after rebasing to master

* Fix llvmir

* Fix ptx tests

* Fix other ptx tests

* Move changes from uops.py to ops.py

* rm uops.py

* Fix TestGateStoreRewrite tests

* Get multireduce tests working

* reset to remote branch

* Fix linearizer tests

* uop_graph test patch

* Add comment to create_gate

* hotfix: uncomment those tests

* Attempt to fix ptx tests by including whitespace inside if block

* Patch from remote tinybox. Tests passing here

* Min changes to get some ptx tests passing

* Changes after rebase

* Exclude ifs and endifs from ptx

* IF conditional branching within ptx

* Save lines on delete_redundant_gates

* Simplify merge_gates

* rm noqa

* Remove unnecessary checks when merging gates

* Fix ops error msg

* Smarter check for if/endif in llvmir

* simplify delete redundant gates to only have 2 returns

* spacing

* Smarter check at beginning of merge_gates

* patches from comments

* Remove need for merge_gates

* include proper srcs in IF from the get-go

* test expand ifs dumb will result in 4 ifs, not 1 now

* Make tests happy

* Fix uops stats

* rm merge_gates method. Will add back in separate PR

* Spacing

* cleaner error msg

* Fix uops rendering when expanding. test_failure_43

* patch tests

* undo changes in delete_redundant_gates

* process replay attempt

* re-intro deletion of redundant gates

* fix addition of gates when they get nested in stores and loads

* patch tests

* smarter init of IF srcs when adding gate to STORE

* make ruff happy

* Resp to comment

* include all src[2]'s srcs in IF for gated store

* add reference of the storing value to the gate's src

* minor patch after rebasing

* change ptx renderer

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-06 01:05:30 +08:00
nimlgen a1a15b54c9
qcom cache flush (#6367)
* qcom cache flush

* bench

* linter

* move
2024-09-05 13:23:39 +03:00
chenyu 62f9f273f7
increase test_profile_multidev_transfer threshold (#6370)
flaky, bumped to 16000 for CI
2024-09-05 05:49:32 -04:00
George Hotz e882294c02
uops touchups [run_process_replay] (#6368)
* uops touchups [run_process_replay]

* those are classmethods

* oops, kwargs

* no kwargs there
2024-09-05 17:22:32 +08:00
George Hotz a28ed7ba4d
math trait [run_process_replay] (#6364)
* math trait [run_process_replay]

* const -> const_like

* Revert "const -> const_like"

This reverts commit 85727c83d38f59e153333a3dbfa68f87b3a5a6ce.

* add MathTrait to LazyBuffer

* clean up function

* fixup the rest of function

* fix custom function

* mlb math trait

* fix that test
2024-09-05 16:19:17 +08:00
George Hotz 4a51c28ee7
switch const to const_like [run_process_replay] (#6356)
* const like

* no more _const

* missed one

* mypy ops.py

* file missing

* const_like

* fix image and test uop graph [run_process_replay]

* fix ptx
2024-09-05 13:57:54 +08:00
George Hotz 0d6922edb4
faster local tests. copy torch permuted to default device [run_process_replay] (#6363) 2024-09-05 13:57:20 +08:00
chenyu 6fd24561d1
distribute MUL const into ADD for int (#6361)
pre-req for real_stride
2024-09-05 01:36:57 -04:00
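The distribution rule this commit adds can be sketched on toy expression nodes (illustrative tuples, not tinygrad UOps): multiplying a sum by an integer constant distributes the constant over both addends, which exposes terms that later stride math can fold.

```python
# (a + b) * c  ->  a*c + b*c, for int constant c.
def distribute(expr):
    op, lhs, c = expr
    if op == "mul" and isinstance(lhs, tuple) and lhs[0] == "add" and isinstance(c, int):
        _, a, b = lhs
        return ("add", ("mul", a, c), ("mul", b, c))
    return expr  # no match: leave the expression alone
```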
qazal e7f6b654ad
cleanup uop eq asserts for swizzle [run_process_replay] (#6362)
* cleanup uop eq asserts for swizzle [run_process_replay]

* more stuff
2024-09-05 13:36:36 +08:00
Oleg Rybalko 64f1384f5b
Einsum ellipsis support (#6333)
* working ellipsis expansion

* refactor

* fix commas in output

* add capital letters

* refactor
2024-09-05 10:08:55 +08:00
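The core of ellipsis expansion can be sketched in a few lines (a hedged illustration of the idea, not the PR's implementation): `...` stands for any leading batch dims, so it can be replaced with fresh letters before the regular einsum logic runs.

```python
import string

# Replace "..." in an einsum formula with batch_ndim unused letters.
def expand_ellipsis(formula: str, batch_ndim: int) -> str:
    used = set(formula) - set(".,->")
    fresh = [c for c in string.ascii_letters if c not in used][:batch_ndim]
    return formula.replace("...", "".join(fresh))

# "...ij,...jk->...ik" with 2 batch dims becomes "abij,abjk->abik"
```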
nimlgen 326a77336e
qcom remove some tests skips (#6353) 2024-09-04 15:38:18 +03:00
qazal 99018a4aa1
minor schedule differ utils [run_process_replay] (#6348)
* minor schedule differ utils [run_process_replay]

* rm
2024-09-04 03:41:38 +08:00
nimlgen 3adb76894d
validate image=2 float16=1 openpilot benchmark (#6346)
* validate image=2 float16=1 openpilot

* linter

* linter2
2024-09-03 20:13:40 +03:00
qazal 2f00bf0c78
conv bw in one kernel with graph_rewrite (#6330)
* double reduce merger

* add test_fold_conv_relu_backward_ast_rewrite

* a correctness test to iterate on

* merge axes the other way around

* better
2024-09-03 03:53:53 +08:00
Vyacheslav Pachkov 4c33192a8b
add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl also fixup ioctls to show numbers instead of _IOW macros

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants; input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts shader from open cl binary
sets input/output buffers
allocates stack
sets cs mode
runs shader

* use data64_le from helpers

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway

* extract offsets to kernel arguments from opencl binary

* extract constants values and offsets from opencl binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from open cl binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps to not fall down when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TODO: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id.
Will be useful for selecting commands e.g. CP_EVENT_WRITE vs
CP_EVENT_WRITE7

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fails with GPU=1 on qualcomm

* lint

* unmap mem in _gpu_free

* ctxt priority and preemption policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dont forget actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
George Hotz 406ec8240e hotfix: lin_fail_41 passes on my M3 Max 2024-08-31 11:46:46 -07:00
Roelof van Dijk ad4b3b457f
bump limit for test_llama_embedding_opt (#6332) 2024-08-31 10:03:43 -04:00
George Hotz 72939901fc hotfix: ebs print kernel names 2024-08-29 21:20:36 -07:00
George Hotz 365babe391
precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00
George Hotz 385904526f
remove more rules [run_process_replay] (#6326)
* remove more rules [run_process_replay]

* disable invalid test

* ptx needs that str
2024-08-29 16:27:10 -07:00
qazal 539654fbe1
graph_rewrite complexity tests [run_process_replay] (#6317) 2024-08-29 22:39:08 +03:00
qazal 07942ef361
Proposal: Better UOps.SWIZZLE (#6309)
* better UOps.SWIZZLE

* test_swizzle_rewrite

* add it to docs

* show a diff

* a lil more verbose

* two teeny notes

* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal dd4e5f1c8d
process replay rewrite (#6284)
* process replay rewrite

p2

* start some unittests + exceptions and exits

* shebang

* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro 7de4eac8f7
add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308)
* add `nearest` mode to interpolate

matching pytorch `nearest`, which is known to be buggy

+ relevant TestsOps

* add `nearest-exact` mode to interpolate

matching pytorch `nearest-exact`

+ relevant TestOps

* fix uint8 bilinear interpolation

by matching custom torch implementation

* implement uint8 lerp with torch interpolation trick

without converting it to float
2024-08-28 21:59:51 -07:00
qazal ec34d9ee36
start benchmarking ast graph rewrite (#6297)
* ast_rewrite to ctx var

* add external_benchmark_ast

* refactor to asts

* track lazybuffers

* more work

* record checkpoint

* cleanup
2024-08-27 18:18:44 +03:00
Max-We ab2714423b
Add einsum tests (#6286)
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00
chenyu b76f0c875e
lazy const fold idiv 1 (#6285) 2024-08-26 10:29:59 -04:00
chenyu af7c04ff57
Tensor.__floordiv__ (#6283)
support Tensor.__floordiv__ and friends
2024-08-26 09:43:40 -04:00
qazal d2f8eeed2e
make [compare_schedule] the default [run_process_replay] (#6273)
* make [compare_schedule] the default

* capture ctx

* logging

* set capture to false
2024-08-26 21:40:03 +08:00
CaltropHungerton 002f60b4c3
fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192)
* fix wmma flop counting on intel, add count tests

* half

* add half gemm

* Update test.yml

* one test

* Update test_uops_stats.py

* Update test_uops_stats.py

* Update test_uops_stats.py

* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
qazal f0cc8ca5f2
generic st_fixup in scheduler graph rewrite [compare_schedule] (#6278) 2024-08-25 11:02:17 +03:00
gswangg 3cf507ae7f
remove extra.ops and LazyOp support from Kernel (#6267)
* remove extra.ops and BufferOps

* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal ccb05d8baa
fixup neg tests [run_process_replay] (#6268) 2024-08-24 16:35:43 +03:00
gswangg ea76b93814
migrate test_linearizer_dumb.py to UOp AST (#6241)
* add imports and update test_unmerged_ifs to UOp AST

* test_max_simplify_and_cancel

* test_expander_new_srcs

* test_llama_embedding

* test_unaligns_idxs

* test_unrolled_float4_align

* test_upcasted_stores_out_of_order

* remove LazyOp

* remove extra/ops and replace ReduceOps.SUM with BinaryOps.ADD
2024-08-24 16:27:29 +03:00
gswangg e44653e25a
migrate test_linearizer_failures.py to UOp AST (#6240)
* add imports and update test_failure_1 to UOp AST

* update test_failure_2 with UOp AST

* update test_failure_3

* test_failure_5

* test_failure_6

* test_failure_7

* test_failure_8

* test_failure_9

* test_failure_10

* test_failure_11

* test_failure_12

* test_failure_12_multireduce

* uncomment skip and migrate test_failure_13

* test_failure_14

* test_failure_15

* test_failure_16

* test_failure_17

* test_failure_18

* test_failure_19

* test_failure_20

* test_failure_21

* test_failure_22

* test_failure_23

* test_failure_24

* test_failure_25

* test_failure_26

* test_failure_27

* test_failure_28

* test_failure_29

* test_failure_30

* test_failure_31

* test_failure_32

* test_failure_33

* test_failure_34

* test_failure_36

* test_failure_37

* test_failure_38

* test_update_39

* test_failure_40

* test_failure_41

* test_failure_42

* test_failure_43

* test_failure_44

* test_failure_45

* test_failure_46

* test_failure_47

* test_failure_48

* test_failure_49

* test_failure_50

* remove LazyOp

* reskip test_failure_22

* remove extra/ops

* replace ReduceOps with BinaryOps

* fixup that import

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-24 16:26:58 +03:00
gswangg 1dc6040877
migrate test_search.py to UOp AST (#6245)
* add imports and update test_kernel_count with UOp AST

* test_filter_global_buffer

* remove LazyOp

* remove extra.ops and ReduceOps

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-24 16:13:53 +03:00
qazal ae23540d6e
refresh process replay schedule ref in reset.py (#6265) 2024-08-24 16:12:51 +03:00
gswangg 7be5eede71
migrate test_linearizer_overflows.py to UOp AST (#6244)
* add imports, remove ConstBuffer, and update test_overflow_1 with UOp AST

* test_overflow_2

* test_overflow_3

* test_overflow_4

* test_overflow_5

* test_overflow_6

* test_overflow_7

* TestLinearizerOverflowAlt::test_overflow_1

* TestLinearizerOverflowAlt::test_overflow_2

* remove LazyOp

* remove extra.ops

* remove ReduceOps
2024-08-24 16:10:29 +03:00
chenyu 943ab97d24
fix Tensor.prod for multitensor (#6264) 2024-08-24 08:52:24 -04:00
qazal bcb2f1caa3
init REDUCE_AXIS with BinaryOps (#6256)
* REDUCE_AXIS arg with BinaryOps

* more work in kernel.py
fixup sops.gz

* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
chenyu da5cf11859
fix acc init value for MUL (#6263) 2024-08-23 23:19:44 -04:00
George Hotz 26498b322e add BEAM to external_benchmark_schedule.py 2024-08-23 18:10:46 -07:00
George Hotz 53a73038e3 hotfix: TestGraphRewriteEfficiency.test_create_many_uops 2024-08-23 15:51:57 -07:00
chenyu 590c0922b6
Tensor.prod (#6250)
* Tensor.prod

a new reduce op!

* onnx ReduceProd
2024-08-23 10:06:32 -04:00
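A pure-Python analog of the new reduce op (illustrative, not the Tensor API): prod multiplies all elements along an axis, which for a flat row is just `math.prod`.

```python
import math

# Reduce each row of a 2-D list of ints along the last axis with a product.
def prod_reduce(rows):
    return [math.prod(r) for r in rows]
```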
qazal 78d6bd8b41
start graph rewrite in the scheduler (#6248)
* start graph rewrite in the scheduler

* test: enable it

* test timings

* only fails in multi reduce

* more isolated tests
2024-08-23 13:15:55 +03:00
George Hotz 238896ca02
looking into graph rewrite speed (#6239)
* looking into graph rewrite speed

* track, replace is slow

* if all same, no permutations [run_process_replay]

* types so compile works

* no implied comprehension

* TRACK_MATCH_STATS=2
2024-08-22 13:17:55 -07:00
chenyu e745e16441
remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
nimlgen 6c4ddd6260
hcq skip tests when no multidev (#6235)
* hcq skip tests when no multidev

* linter

* a bit higher timeout
2024-08-22 18:27:16 +03:00
chenyu 08539f08b0
fix UOp repr with Variable in arg (#6236) 2024-08-22 11:06:33 -04:00
chenyu 3fc8203475
remove NEG from handwritten ast in tests (#6234)
* remove NEG from handwritten ast in tests

* test_linearizer_failures
2024-08-22 09:06:59 -04:00
chenyu 1c5ef5b793
format test_linearizer_failure (#6231)
made it easier to remove NEG
2024-08-21 21:10:56 -04:00
nimlgen 78c94abe9c
raise time limit for ci in test_profile_multidev_transfer (#6227) 2024-08-21 22:42:03 +03:00
gswangg c74b318458
migrate test_linearizer.py to UOp AST, pt. 2 (#6228) 2024-08-21 22:16:11 +03:00
George Hotz c3168952f0
wip: tracking pattern matcher [run_process_replay] (#6225)
* wip: tracking pattern matcher

* better

* proper dedup

* timing

* early reject

* mergable match stats

* TrackedPatternMatcher

* fix TrackedPatternMatcher

* cleanups

* clean that too

* remove early_reject

* Revert "remove early_reject"

This reverts commit dc2aef14b8f5da58f5ec9566daf252513cac394c.

* total

* sort by time

* match_stats cleanup
2024-08-21 11:57:26 -07:00
chenyu a666450e4d
UOp pattern x + x -> x * 2 (#6224)
* UOp pattern x + x -> x * 2

now there's no NEG, with this it covers all kinds of a*x+b*x

* can remove x-x
2024-08-21 12:06:19 -04:00
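The rewrite itself is tiny and can be sketched on toy expression tuples (illustrative, not tinygrad UOps): an ADD whose two operands are the same expression becomes a MUL by 2, which with NEG gone covers the `a*x + b*x` style folds.

```python
# x + x  ->  x * 2, when both operands compare equal.
def rewrite_add(expr):
    op, lhs, rhs = expr
    if op == "add" and lhs == rhs:
        return ("mul", lhs, 2)
    return expr  # no match: leave unchanged
```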
chenyu c9a9631818
no UnaryOps.NEG in generated UOp patterns (#6209)
* no UnaryOps.NEG in generated UOp patterns

removed pattern `x * (-1) -> -x`  and `x != True`

* those are fine because NEG became CMPNE and True

* fix sd validation L2 norm
2024-08-21 11:08:22 -04:00
qazal 3b8cc5a3e0
more multireduce tests prep for neg removal [run_process_replay] (#6220) 2024-08-21 12:45:24 +03:00
qazal f03e5a4b3b
test_multireduce const has a shape (#6218) 2024-08-21 11:02:45 +03:00
George Hotz 2c42e9c2c6
faster rewrite, no folder in expand/reduce [run_process_replay] (#6216)
* faster rewrite, no folder in expand/reduce [run_process_replay]

* is removing the expander there okay

* parens

* don't reconstruct exact match uop

* fast do_reduce

* expand pyint

* most of the parents gains with less lines
2024-08-20 23:36:58 -07:00
George Hotz 16f420f7a7
split full_graph_rewrite and linearize_uop [run_process_replay] (#6215)
* split full_graph_rewrite and linearize_uop

* fix tests

* graph rewrite in test uops

* add types
2024-08-20 20:12:33 -07:00
George Hotz 9faf205601
CIFAR trainer + various bugfixes / improvements (#6146)
* move cifar into datasets

* support for pathlib Tensors, tar_extract, and fetch gunzip

* too early for Device.DEFAULT

* simpler hlb_cifar + .to(None) is default

* new compiler failure, start beautiful_cifar

* beautiful cifar runs but is broken

* jit train step

* cleaner

* std_mean, not mean_std

* more correct

* fast indexing

* don't print that

* torch load broken

* add eval

* nicer bar

* decoraters are the way to do this

* bounds check the load

* a few ops

* batchnorm bugfix, if track_running_stats is False, use online estimate

* full timing

* fix fusion

* unneeded realize

* master tensor
2024-08-20 16:58:46 -07:00
madt2709 4bb98d8882
Fix track_running_stats in batchnorm (#6200)
* Fix track_running_stats in batchnorm

* Fix linter

* Update test_fold_conv_batchnorm_notrain to keep allowed at 1

* Add test_fold_conv_batchnorm_notrain_no_running_stats

* Save 1 line
2024-08-20 14:01:22 -07:00
George Hotz a5d79688db
fix indexing out of bounds (#6208)
* fix indexing out of bounds

* 5 ops per access is fine
2024-08-20 11:34:56 -07:00
chenyu 4451bcaf95
update test_arange test_llama_embedding_opt (#6207)
non CI uses larger embedding, still same orders of magnitude
2024-08-20 13:58:43 -04:00
qazal 074cf780dd
add option to only benchmark schedule [run_process_replay] (#6204) 2024-08-20 16:51:27 +03:00
gswangg 0e6f057eae
migrate test_linearizer.py to UOP AST (pt. 1) (#6150)
* migrate test_multioutput to UOP AST

* inline buf declarations

* migrate test_multireduce to UOp AST

* update test_mid_dim_multireduce to UOp AST

* update test_triple_multireduce with UOp AST

* make global definitions more concise

* update test_double_reduce_multireduce with UOp AST

* update test_multireduce_with_parallel with UOp AST

* update test_multiout_multireduce to UOp AST

* make gidx style consistent across updated tests

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-20 10:02:20 +03:00
chenyu 10330a41c7
add CMPNE tests in test_uops (#6196)
fixed the output_dtype for CMPNE and match the tests for CMPLT
2024-08-19 19:41:21 -04:00
chenyu 21d6739237
remove UnaryOps.NEG from lazy.py (#6193)
* remove UnaryOps.NEG from lazy.py

* neg is no longer unary
2024-08-19 18:41:28 -04:00
Gabe Caldwell bdd6325f31
default num_classes value for one_hot (#6182)
* num_classes=-1

If num_classes set to -1, the number of classes will be inferred as one greater than the largest class value in the input tensor.

* num_classes desc

comment to explain num_classes default and what that means.

* replacing ' with `
2024-08-19 12:07:14 -07:00
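The inferred default can be shown with a pure-Python analog of one_hot (illustrative, not the Tensor method): with `num_classes=-1` the class count becomes one greater than the largest class value in the input.

```python
# one_hot sketch: num_classes=-1 infers the width as max(xs) + 1.
def one_hot(xs, num_classes=-1):
    n = max(xs) + 1 if num_classes == -1 else num_classes
    return [[1 if i == x else 0 for i in range(n)] for x in xs]
```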
Alessandro Benetti 9328248610
support for std_mean and cross_entropy (#6181)
* support for std_mean and cross_entropy (#3)

* Cross entropy and std mean support

* remove extra examples
2024-08-19 12:06:44 -07:00
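Hedged pure-Python analogs of the two additions (not the Tensor implementations): std_mean returns `(std, mean)` in one call with torch-style ordering and sample std, and cross_entropy is the negative log-softmax of the target logit, computed stably via log-sum-exp.

```python
import math, statistics

def std_mean(xs):
    # torch-style ordering: standard deviation first, then mean
    return statistics.stdev(xs), statistics.mean(xs)

def cross_entropy(logits, target):
    # stable -log softmax[target]: subtract the max before exponentiating
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[target]
```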
Max-We 53b20afa3f
Write tar_extract (#6180)
* Add tar_extract

* Add tar_extract tests

* Fix dtype for initialization from path

* Tests for path initialization

* rm print

---------

Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-19 12:06:17 -07:00
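A hedged sketch of the tar_extract idea using only the stdlib (in tinygrad the values would become Tensors; plain bytes keep the sketch self-contained): read every regular file in a tar archive into a dict of name to contents.

```python
import io, tarfile

# Extract all regular files from a tar archive into {name: bytes}.
def tar_extract(fobj):
    with tarfile.open(fileobj=fobj) as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}

# build a one-file archive in memory to extract from
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"weights"
    info = tarfile.TarInfo("model.bin")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)
```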
Eitan Turok 8556d0c642
Support `gunzip` in `fetch` (#6176)
* init

* update

* clean

* add type

* clean

* fix import order

* shorten variable names
2024-08-19 12:04:40 -07:00
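The decompression half of a `gunzip` option on `fetch` can be sketched with the stdlib (network I/O omitted; `maybe_gunzip` is a hypothetical helper, not tinygrad's signature):

```python
import gzip

# Hypothetical helper mirroring a gunzip=True option on fetch: transparently
# decompress a gzip payload after download (network I/O omitted here).
def maybe_gunzip(data: bytes, gunzip: bool = False) -> bytes:
    return gzip.decompress(data) if gunzip else data

assert maybe_gunzip(gzip.compress(b"hello"), gunzip=True) == b"hello"
```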
samm393 5d742f7fe3
Missing features from rearrange (#6184)
* fixes and tests

* typo in test
2024-08-19 11:19:07 -07:00
qazal 2242ff84be
type verify intermediate UOps [run_process_replay] (#6140)
* type verify intermediate UOps [run_process_replay]

* merge asserts

* variable const
2024-08-19 20:59:01 +03:00
qazal 478145cb8e
lowering error in diff_schedule is fine [run_process_replay] (#6185) 2024-08-19 20:51:12 +03:00
chenyu 00578a021b
re:6125 switch real_size to use uops [run_process_replay] (#6138)
* switch real_size to use uops [run_process_replay]

* enough to pass

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2024-08-19 13:20:24 -04:00
qazal e28d29641f
more scheduler process replay tooling [run_process_replay] (#6178) 2024-08-19 15:35:51 +03:00
chenyu b36a7273c6
RUF018 assignment-in-assert [run_process_replay] (#6172)
assertions should not have side effects, or `-O` breaks them.

initially just wanted to fix the one in rearrange, but it also made some long lines less long
2024-08-19 00:34:52 -04:00
chenyu 9c60a27ece
lower float64 sin fuzzer threshold (#6173)
139216373.71875 failed
https://github.com/tinygrad/tinygrad/actions/runs/10446960642/job/28925156240
2024-08-19 00:25:42 -04:00
samm393 fd7c84c1c8
Rearrange (#6106)
* rearrange and tests

* tidy

* whitespace

* remove line

* -5 lines

* test fix

* static -> instance

* fix () & add more tests

* remove flags

* -1 line

* match einops

* whitespace

* repeated names
2024-08-18 20:22:28 -07:00
chenyu 2de174677a
threefry touchup [run_process_replay] (#6169)
also why is test_gc testing _rng_counter is allocated??
2024-08-18 23:01:24 -04:00
David González Martínez 724e408736
add support for retain_graph in backward (#6145)
* add support for retain_graph in backward

* fix: dont accumulate grad on non-leaf tensors

* fix order

* fix: do not delete grad on leafs

* fix linter

* fix: can't exactly match torch behaviour internally

* allow numerical room for test

* refactor
2024-08-18 16:08:31 -07:00
wozeparrot 0c5189de25
threefry half (#6154) 2024-08-18 15:23:12 -07:00
Timmy e3d14d1ccc
Lowerer Multireduce Grouping (#6097)
* grouping changes to codegen

* linters + tests

* fix identical store issue on PTX

* comment in grouping multireduce tests

* cleaning up diff

* cleaning up diff

* comments

* linters

* hotfix: dont change kernels

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-18 19:57:51 +03:00
qazal 1ba83cc7fa
split test_sgd_4convs_fuse [run_process_replay] (#6158) 2024-08-18 18:35:42 +03:00
qazal be6dda4093
hotfix: more lazyop rename to uop [run_process_replay] (#6157) 2024-08-18 17:28:44 +03:00
George Hotz 17a043edad
tensor inference (#6156)
* tensor inference

* test is even better name
2024-08-18 00:19:28 -07:00
chenyu f7950fc2b6
add E275 missing-whitespace-after-keyword linting rule (#6149)
requires space after keywords like `assert`, `not`, `return`, `else`
2024-08-17 16:44:34 -04:00
George Hotz 88edc2902d
axis_is_masked with graph_rewrite [run_process_replay] (#6144) 2024-08-17 10:28:49 -07:00
qazal 5a266d5d0c
type verify ImageDType and PtrDType [run_process_replay] (#6137)
* type verify ImageDType and PtrDType [run_process_replay]

* fix tests
2024-08-17 16:37:07 +03:00
qazal d1d41130cd
use membufs in ImageDType checks [run_process_replay] (#6136)
* use membufs in ImageDType checks

* set by key [run_process_replay]
2024-08-17 16:17:46 +03:00
qazal d9ce664350
add test_verify_ast [run_process_replay] (#6134) 2024-08-17 14:14:30 +03:00
George Hotz 3a2d724cb2
extra matcher from renderer [run_process_replay] (#6130)
* extra matcher from renderer

* cache_pm [run_process_replay]
2024-08-16 23:53:11 -07:00
George Hotz 5048066e79
st_arg, never -1 [run_process_replay] (#6128) 2024-08-16 22:46:56 -07:00
George Hotz d9cb45af09
only axis is masked [run_process_replay] (#6123) 2024-08-16 21:01:17 -07:00
George Hotz 94aa5f11b5
Revert "use vmax for real_size [run_process_replay] (#6120)" (#6122)
This reverts commit a6e3211444.
2024-08-16 20:33:19 -07:00
George Hotz a6e3211444
use vmax for real_size [run_process_replay] (#6120)
* use vmax for real_size [run_process_replay]

* axis is masked
2024-08-16 20:17:23 -07:00
George Hotz 912f01ed4b
UOpGraph -> linearize_uop [run_process_replay] (#6119) 2024-08-16 19:48:39 -07:00
George Hotz 89c7989659
no shapetracker in ops [run_process_replay] (#6117) 2024-08-16 17:23:27 -07:00
George Hotz 74ee9febec
remove iter from uopgraph (#6110)
* remove iter from uopgraph

* linearize returns uops

* fix tests

* linearize in linearize

* tests fix

* touchup

* test failures
2024-08-16 15:58:29 -07:00
qazal 28c75bf2a6
merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal d5e3217076
hotfix: scheduler differ (#6115)
* hotfix: scheduler differ

* add the test back

* track keys
2024-08-16 23:34:49 +03:00
qazal c23d44c779
AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
CaltropHungerton 38fb1e14a2
Intel XMX Tensor Core Support (#5622)
* fixed xmx demo

* i think i'm invoking the DPAS but it's slow

* compiler build arg to stop register spilling, indicated where to fix flop counter

* don't mind this

* do NOT mind me

* do not mind me

* do not view

* i will add bf16 later

* in process of figuring out tc fields

* we figured out the fields!!!

* added check for cl device vendor, added separate IntelRenderer

* remove tc thread_local_aliases

* cleaning debris before draft pr

* edits for linter

* deduping and checking device extensions

* i will find more line reductions in other places

* before merge upstream

* double grf size in compiler to fix register spilling (bandaid), device checking changes

* tc python emulation

* fixed emulation

* tests for emulated intel tensor core

* TC=0, 1 working on upstream, fixed perf

* test

* debris

* check for specialized cl device when we canonicalize device

* bf16 support, tc=3 test added

* address tests

* revert half2 loads on intel tc, cleanup

* linter

* fold_expanded revert

* lint, whitespace fix

* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too

* make line shorter, no need for noqa E501

* removed device intel

* fix python emulation

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
George Hotz 553ae9ebc0
bilinear interp uint8 fails (#6103)
* new test for e2e compile failures

* fix bug

* bilinear interp uint8 fails

* better tests
2024-08-15 19:34:39 -07:00
George Hotz c850e03758
new test for e2e compile failures (#6101)
* new test for e2e compile failures

* fix bug
2024-08-15 18:56:22 -07:00
chenyu 9ef82e1f2b
UOp pattern DEFINE_VAR with min==max is also CONST (#6095)
* UOp pattern DEFINE_VAR with min==max is also CONST

* fix tests
2024-08-15 12:09:44 -04:00
qazal 4d38fec8c1
rename lazyops to parents [run_process_replay] (#6091) 2024-08-15 17:27:32 +03:00
chenyu 5accfe26a0
rewrite bool ADD to OR and MUL to AND (#6084)
* rewrite bool ADD to OR and MUL to AND

fixed running `tinyphysics.onnx`, which contains a getitem from a boolean tensor.

only can repro through BEAM_COMPARE, which i think is a different bug in test_linearizer_failure

* fold those, and fix tests

* only for bool

* move dtypes.bool
2024-08-15 10:11:57 -04:00
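A quick truth-table check of the rewrite (standalone Python, not tinygrad internals): for booleans, ADD behaves like OR and MUL like AND.

```python
# Truth-table check: boolean ADD collapses to OR, boolean MUL to AND.
for a in (False, True):
    for b in (False, True):
        assert bool(a + b) == (a or b)
        assert bool(a * b) == (a and b)
```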
chenyu df03dca6e3
move % inside UOp mod_folding and remove deprecated tests (#6085)
[run_process_replay]
2024-08-14 23:25:10 -04:00
qazal 2bf7b56485
minor test fixups from the AST is UOp diff (#6081)
* add assert_equiv_uops cache

* dont expect lowering and schedule errors
2024-08-14 23:58:04 +03:00
George Hotz 64563abc90
add LSTMCell to nn (#6080)
* add LSTMCell to nn

* lstmcell works with no input on first

* fix no bias 0

* simpler
2024-08-14 12:08:42 -07:00
chenyu 6b3112d525
fix qcom process_replay for kernel diff (#6079)
* debug why qcom process_replay does not run

skipping the wrong exception?

* um-hum

* get_step_times was parsed incorrectly

* cleanup
2024-08-14 15:05:49 -04:00
chenyu 2fe9d62451
increase test_recursive_add time from 1s to 2s (#6078)
flaky https://github.com/chenyuxyz/tinygrad/actions/runs/10392144818/job/28776666700
2024-08-14 13:52:02 -04:00
samm393 2dc586ffe5
Shape change bitcast for more dtypes (#6047)
* bitcast & tests

* use to_dtype

* put disk tensor tests back

* tests

* bitmask

* no bitmask

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-14 10:03:34 -07:00
qazal 83a2543c74
spec for in order LOAD/STORE indexing (#6073)
* test_unaligns_idxs

* spec for in order LOAD/STORE indexing

* test UOps.SPECIAL

* check for supports_float4
2024-08-14 19:18:00 +03:00
chenyu 5048f9a4d5
test linearizer failure 49 (#6074)
with UOP_IS_SYMBOLIC=1, on METAL it breaks store fusion and have A+B and B+A being two different UOp
2024-08-14 11:29:10 -04:00
qazal 30035df5a4
add metal process replay back (#6068)
test this new one
2024-08-14 12:29:56 +03:00
chenyu 1782e4f64d
use div folding to do lt folding (#6065) 2024-08-13 16:59:05 -04:00
chenyu e3af273fa1
touchup cl_errors (#6058)
* touchup cl_errors

* update test
2024-08-13 13:06:59 -04:00
qazal 9145ad52ff
revert UOps eq, this needs to be isolated in realize.py (#6063)
This reverts commit dccca7f227.
2024-08-13 18:02:34 +03:00
Tobias Fischer 6e3eb50fd1
added fix and reg tests (#6060) 2024-08-12 21:00:48 -04:00
qazal dccca7f227
test: uop and lazyop have the same compare (#6053)
* test: uop and lazyop have the same compare

* typings

* self.assert_equiv_uops -> assertEqual

* hash dtype

* test nop too

* TestPatternMatcher never used this compare anyway

* nop eq and ne tests
2024-08-13 00:33:19 +03:00
chenyu 3f2d24a6ec
test_failure_48 for wrong truncation in idx on NV (#6055)
also added `RAWAST` to print pre-modified AST in DEBUG=3
2024-08-12 16:17:42 -04:00
chenyu 6ed9711898
UOps pattern (x%c)+(x//c)*c = x (#6051)
pretty cool that this is very easy to write now
2024-08-12 14:58:48 -04:00
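A standalone sanity check of the identity (it holds for Python's floor division and matching modulo, the same convention the symbolic rewrite assumes):

```python
# Exhaustive spot check of the folded identity (x % c) + (x // c) * c == x.
for x in range(-100, 100):
    for c in range(1, 20):
        assert (x % c) + (x // c) * c == x
```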
ignaciosica 777d6b3349
Fix compile error for max with inline const (#5840) 2024-08-12 23:40:39 +08:00
ignaciosica 164ca5632e
split tensor core tests (#6041) 2024-08-12 09:42:02 -04:00
chenyu 7ce716b3a0
bigint -> pyint [run_process_replay] (#6040)
it's a python int. priority should be higher than bool, but we are not using it in type promo now.
2024-08-12 09:12:23 -04:00
Timmy a00994b423
Lowerer Multireduce Uopgraph (#6007)
* uopgraph changes

* fixing for non-reducing ranges

* multireduce tests

* linters

* linters

* removing comments

* removing arg[1]

* linters

* prettier

* linters

* more linters

* use any instead of intersection
2024-08-12 15:16:07 +03:00
qazal 7d1f118731
use assertIs in test_schedule (#6035)
* use self.assertIs in test_schedule

* test_lazybuffer
2024-08-11 19:19:18 +03:00
qazal b918e3c255
cache assert_equiv_uops (#6033) 2024-08-11 12:17:05 +03:00
George Hotz 1b3443902c
don't use tgmath with clang (#6029)
* don't use tgmath with clang

* fix tests

* nostdlib for clang

* needs ffreestanding on OSX
2024-08-10 13:58:19 -07:00
chenyu 5820940d98
more relax rtol for test_arange_fuse_grouped_children (#6027)
one more https://github.com/chenyuxyz/tinygrad/actions/runs/10334072657/job/28607120462
2024-08-10 16:10:03 -04:00
chenyu 10374a2741
relax rtol for test_arange_fuse_grouped_children (#6026)
flaky https://github.com/tinygrad/tinygrad/actions/runs/10333939631/job/28606831006?pr=6023
2024-08-10 15:49:11 -04:00
George Hotz cf7d3c1eb8
fix tests locally on metal (#6025)
* remove contiguous child, it was breaking tests locally

* hmm, it's still needed

* include NOOPT in method cache key
2024-08-10 12:36:22 -07:00
chenyu e6c7c3e499
update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
George Hotz cfb04c67d1
run unit tests separate from others (and only once) (#6020)
* run unit tests separate from others

* ignore unit tests elsewhere
2024-08-10 11:17:56 -07:00
uuuvn ee3b015407
ELF loader strtab fix and tests (#6011)
* ELF loader strtab fix and tests

* ruff

* typos

* only one test
2024-08-10 10:13:16 -07:00
Jun Zhang 54e176fb4f
Ignore non-computational backends when overwriting the default (#5770) 2024-08-10 09:23:29 -07:00
qazal 3ef2788c4f
hotfix: run the entire test_conv_bw schedule (#6014) 2024-08-10 17:55:41 +03:00
qazal 0e62076cf5
more process replay cleanups (#6013)
* more process replay cleanups

* comma benchmark missing
2024-08-10 17:29:10 +03:00
chenyu 63a8bc29d4
addition divisor in UOp div_folding (#6002)
in addition to try gcd of all terms, also try least common divisor of all MULs
2024-08-09 20:09:05 -04:00
chenyu 5961faa4be
minor change to UOp div_fold (#6004)
remove an unnecessary gcd and swap the quo rem order, minimize diff for divisor pr
2024-08-09 17:09:59 -04:00
qazal 7373b05ee8
assert conv bw reduceops merge [compare_schedule] (#6001)
* assert conv bw reduceops merge [compare_schedule]

* diff with ref_commit_hash
2024-08-09 19:29:56 +03:00
qazal b67d521a07
assert test_conv_bw correctness (#6000)
* assert test_conv_bw correctness

* reorder half

* metal and clang still red
2024-08-09 18:30:36 +03:00
qazal a833f1a735
scheduler process replay with [compare_schedule] (#5997) 2024-08-09 16:58:22 +03:00
qazal 24c7c41ce0
diff LazyBuffer schedules in process replay (#5996)
* start diff printing

* this should be 2

* add to process_replay.py

* enable schedule capture

* arange diff is process replay
2024-08-09 14:16:43 +03:00
chenyu 1f1eb46af6
more failed simplified UOp div test case (#5992)
this speculative div was handled by "divisor" in symbolic.
2024-08-08 18:39:25 -04:00
chenyu c3e1ae2535
add failed simplified UOp div test case (#5990)
more cases!
2024-08-08 17:37:48 -04:00
nimlgen 38d5eecc68
hcq profiler support args (#5989)
* hcq profiler support args

* bytes -> _bytes

* fix

* add test

* mypy

* not f strings

* precision
2024-08-09 00:18:36 +03:00
qazal 45b1761175
smaller test_llama_embedding + assert correctness (#5986)
* smaller test_llama_embedding in CI

* test correctness
2024-08-08 22:11:29 +03:00
Timmy 8c99bdab08
More Multireduce Tests (#5968)
* multireduce tests

* linters

* more linters

* more linters

* seeing how it works with parallel
2024-08-08 22:04:08 +03:00
gswangg df44a4e861
Make vectorization of CONST explicit (#5322)
* remove test_const_vectorize_fold

* remove const folding UPat for VECTORIZE

* refactor cstyle render_const

* remove calls to dtype.scalar() in render_const

* add assert

* add vectorized const to UOp.const

* add UPat GEP-VECTORIZE-CONST -> CONST

* render_vectorize for DEFINE_ACC in cstyle

* add back missing render_cast in render_const

* generate vectorized consts as UOps for DEFINE_ACC

* update asserts for DEFINE_ACC with VECTORIZE src

* add UPats for PHI with VECTORIZE src

* use prev rendered vectorize in DEFINE_ACC render

* update DEFINE_ACC in python runtime

* update vectorized DEFINE_ACC in PTXRenderer

* rebase DEFINE_ACC changes on lowerer

* verbose rewrite of bad UPats

* simplify UOps.CONST implementation in ops_python

* update sum_collapse UPats for DEFINE_ACC-VECTORIZE

* revert linearizer to TOT

* fix DEFINE_ACC implementation in ops_python

* simplify DEFINE_ACC in cstyle

* Fix linter error

* support VECTORIZE in fold gated load/store UPat

* support VECTORIZE in other fold gated load UPats

* rewrite VECTORIZE in UPat for no input DEFINE_ACC

* simplify DEFINE_ACC render in cstyle

* make VECTORIZE rules more concise

* add more vectorize fold tests

* inline VECTORIZE-CONSTs in cstyle render

* revert VECTORIZE/GEP rule refactor

* revert cstyle render_const refactor

* inline VECTORIZE-CONSTs in cstyle render

* implicitly vectorized const rendering -> explicit

* WMMA VECTORIZE CONST process replay hacks

* VECTORIZE CONST NAN process_replay hacks

* more VECTORIZE CONST NAN hacks

* cleanup process_replay hacks

* isnan() -> not isfinite() cstyle VECTORIZE CONST

* tweak isnan and isfinite checks VECTORIZE CONST

* tweak for positive vs negative infinity VECTORIZE CONST

* add assert to PTX CONST render

* process_replay VECTORIZE CONST render parity for PTX STORE

* vmin/vmax for VECTORIZE'd CONST

* update WMMA folding rules

* add tests for WMMA VECTORIZE fold

* hack for cstyle half4 CONST zero process_replay parity

* revert PTX backend changes

* add back minimal DEFINE_ACC PTX change

* remove cstyle process_replay hacks

* remove dead code in PTX CONST render

* cleanup vmin/vmax logic for VECTORIZE'd CONSTs

* update vectorize fold tests to use DEFINE_VAR

* fix long line formatting in test

* remove unwanted merge artifact

* more vmin/vmax cleanup

* remove unnecessary asserts

* yet more vmin/vmax cleanup

* get rid of explicit VECTORIZE CONST logic in _min_max

* reuse CONST instead of creating a new one

* remove unneeded cast

* handle DType correctly in sconst

* improve readability of tests

* save a line

* save another line

* tuplize pats in src

* remove GEP-VECTORIZE pats

* add vec +0 fold

* HACK: fold only vec8 +0

* remove vectorized ALU fold hack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-08 20:59:05 +03:00
chenyu 62c77a2831
trim const in UOp div_folding (#5982)
simplify `(4*x+4*y+7)//16` to `(x+y+1)//4`.
fixed `GPU=1 UOP_IS_SYMBOLIC=1 IMAGE=2 python -m pytest test/test_ops.py -k conv`
2024-08-08 12:49:05 -04:00
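The example simplification from this commit can be spot-checked directly (for non-negative ints, the usual assumption for shape symbols):

```python
# Spot check of the commit's example folding:
# (4*x + 4*y + 7) // 16 == (x + y + 1) // 4 for non-negative ints.
for x in range(64):
    for y in range(64):
        assert (4 * x + 4 * y + 7) // 16 == (x + y + 1) // 4
```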
qazal e6d41b0ce7
hotfix: adjust test_backward_pass_diamond_model thresholds (#5981) 2024-08-09 00:20:53 +08:00
nimlgen 183c4c91a3
fix non-jitted transfers in profile (#5980)
* fix transfers in profile

* fix linter

* sync to be sure everything is recorded
2024-08-08 17:58:08 +03:00
George Hotz c5baa3d66b hotfix: don't run OOM test in CI 2024-08-07 22:19:29 -07:00
chenyu 859d0e4709
UOp simplify `(x+c0)*c1 -> x*c1+c0*c1` (#5973) 2024-08-07 21:25:22 -04:00
wozeparrot 97d708252a
remove realize from threefry (#5969) 2024-08-07 15:08:49 -07:00
George Hotz bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
nimlgen 8d8704af2d
fix amd exec_update for locals (#5966) 2024-08-07 21:02:56 +03:00
tyoc213 0c4e9dbe71
retrieve defined opencl error codes (#5792) 2024-08-07 10:46:24 -07:00
qazal d6f4a61c42
graph LBScheduleItem [run_process_replay] (#5960)
* add toposort key to LBScheduleItem

* use dedup

* graph LBScheduleItem

* make that comment beautiful again

* diff_schedule utils

* update fuzz_schedule
2024-08-07 19:59:11 +03:00
qazal 7677361d90
test pushing through different expands in 1 kernel (#5963)
* test pushing through different expands in 1 kernel

* realize eye

* back to test_example_matmul
2024-08-07 19:33:18 +03:00
qazal 39dda3d042
rename prescheduled items to lsi [run_process_replay] (#5959)
* rename to lsi

* fuzz_schedule more typings

* rename fuzz_schedule
2024-08-07 14:31:50 +03:00
qazal 728b7e189e
diff_schedule tests [run_process_replay] (#5958)
* diff_schedule tests [run_process_replay]

* ok to run serial
2024-08-07 13:50:27 +03:00
chenyu a7163b80d8
lower test_transcendental fuzz test threshold for sin float64 (#5956) 2024-08-07 02:04:37 -04:00
chenyu fa3a36e576
fancier UOp div gcd folding (#5953)
combine and cancel the remaining const based on gcd of other terms like SumNode.
2024-08-07 02:04:25 -04:00
chenyu aa7fd7ef74
Use `(-self).lt(-x+1)` for `UOp.ge` (#5955)
matched symbolic and fixed UOP_IS_SYMBOLIC=1 arange folding
2024-08-07 01:31:27 -04:00
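The rewrite relies on `<=` and `<` differing by exactly 1 on integers, so `x >= c` becomes `-x < -c + 1`; a quick integer check:

```python
# Integer check of the ge rewrite: x >= c  <=>  -x < -c + 1,
# i.e. the (-self).lt(-x + 1) form from the commit.
for x in range(-50, 50):
    for c in range(-50, 50):
        assert (x >= c) == (-x < -c + 1)
```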
George Hotz 658d58784b
embedding doesn't cast (#5952)
* embedding doesn't cast

* test the right thing

* too much annoying with that test
2024-08-06 17:49:14 -07:00
wozeparrot 30d0cb2a82
fix: fix transcendental flakiness on exp float with 9.96875 (#5951) 2024-08-06 17:32:13 -07:00
George Hotz 3a0515ea22 hotfix: process_replay/diff_schedule.py to LBScheduleItem 2024-08-06 17:01:05 -07:00
chenyu aee737bd9e
divide by gcd in UOp div folding (#5949)
* divide by gcd in UOp div folding

`(6x+6y)//16 -> (3x+3y)//8` etc
simpler version

* only factor out const

* don't apply for unsigned

* don't need that if

* space
2024-08-06 20:00:57 -04:00
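A standalone check of the commit's example: dividing the numerator terms and the divisor by their common factor 2 leaves the floor division unchanged.

```python
# Check of the example folding: (6*x + 6*y) // 16 == (3*x + 3*y) // 8,
# since 6*s/16 and 3*s/8 are the same rational for any s.
for x in range(-32, 32):
    for y in range(-32, 32):
        assert (6 * x + 6 * y) // 16 == (3 * x + 3 * y) // 8
```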
George Hotz 6d1fdcfce2
don't reduce the same thing in a vector (#5950)
* don't reduce the same thing over and over

* cleaner way to write it that doesn't loop
2024-08-06 16:59:15 -07:00
qazal d5d7f4e7b8
more TestIndexing correctness asserts [run_process_replay] (#5948)
* use torch in test_mnist_val

* more asserts
2024-08-07 01:50:42 +03:00
chenyu 794796256c
UOp.const_factor [run_process_replay] (#5945)
* UOp.const_factor [run_process_replay]

simplify mod and div folding

* test does not work now
2024-08-06 18:18:29 -04:00
George Hotz 73d4d51845
add LBScheduleItem type [run_process_replay] (#5944)
* add LBScheduleItem type [run_process_replay]

* minor cleanups

* fix

* fix fuzz tests

* add group cache type
2024-08-06 14:49:40 -07:00
qazal 7b6496f2e6
fix the reduceops cache breaking beautiful_mnist (#5938)
* fix the reduceops cache breaking beautiful_mnist

* test_sparse_categorical_crossentropy_simple

* starting tests

* atol from test_nn

* test_sparse_categorical_crossentropy_alt

* dont use torch
2024-08-07 00:02:54 +03:00
George Hotz 1417cc8df1
can reenable that test now (#5914) 2024-08-06 13:38:21 -07:00
chenyu 489575c3be
more UOp sum div with gcd tests (#5936)
* more UOp sum div with gcd tests

* one more
2024-08-06 12:50:10 -04:00
ignaciosica 81ae9fadc8
Float4 support for CLANG (#5915)
* float4 support on clang

* skip linearizer tests that require locals

* add aligned attribute
2024-08-06 07:50:12 -07:00
qazal a7db4c3ee9
show timings for DIFF_ARANGE=1 (#5935)
* show timings for DIFF_ARANGE=1

* always with DEBUG=2
2024-08-06 17:20:38 +03:00
qazal 102a8c184b
diff fused arange schedules with ARANGE_DIFF=1 (#5934)
* diff fused arange schedules with ARANGE_DIFF=1

* better llama diff
2024-08-06 16:52:26 +03:00
qazal 3d4742dd2e
override output shape in fused assign (#5930)
* override output shape in fused assign

This makes

```
FUSE_ARANGE=1 JIT=0 python3 examples/llama.py --gen 1 --prompt "Hello." --count 10 --temperature 0 --timing
```
work. In general we should assert ASSIGN doesn't change shape.

* merge asserts
2024-08-06 13:28:50 +03:00
chenyu 09b7722637
UOp generic div folding (#5896) 2024-08-05 21:38:43 -04:00
George Hotz 3e1336957d
test arange with all opts (#5923)
* test arange with all opts

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py

* Update test_arange.py
2024-08-05 18:38:25 -07:00
George Hotz 5d17f54e3c
fast mnist indexing (#5921)
* fast mnist indexing

* more tests

* remove those tests, new indexing rule
2024-08-05 13:55:15 -07:00
George Hotz e81c18f494
make the arange test check correctness [run_process_replay] (#5920) 2024-08-05 13:41:06 -07:00
George Hotz 8d1c884e78
capture the const pattern in both directions (#5919)
* capture the const pattern in both directions

* add regression test
2024-08-05 12:15:38 -07:00
George Hotz 42f599870c
unroll arange is broken (#5918)
* unroll arange is broken

* fix unrolled arange

* one more test
2024-08-05 12:15:07 -07:00
qazal 70949ea7e6
test cstyle compile error for max with inline const (#5838)
* test_failure_46

* GPU=1 fails too

* add test_renderer

* add failing platforms

* nv too

* assert return value
2024-08-05 19:02:16 +03:00
qazal e0c6520138
check arange fusing with VIEW and COPY (#5912)
* check arange fusing with VIEW and COPY

* gpu and clang
2024-08-05 17:09:21 +03:00
nimlgen 590b9ebb34
hcq copy queue is optional (#5909)
* hcq copy queue is optional

* one more

* this
2024-08-05 14:03:25 +03:00
George Hotz 159ac06b5b
remove unused reduce rules + improve unparented (#5908)
* remove unused reduce rules [run_process_replay]

* this work

* those tests are meaningless now
2024-08-04 18:18:27 -07:00
George Hotz d7387d31bf
remove useless reduce cases [run_process_replay] (#5907)
* remove useless reduce cases [run_process_replay]

* do_reduce cleanup

* more cleanups + no longer supported tests

* Revert "more cleanups + no longer supported tests"

This reverts commit e9f2f6ba7061f8697a308aacdc3442fa922a77f5.

* no longer supported tests

* switch ReduceOps.SUM -> BinaryOps.ADD
2024-08-04 17:11:08 -07:00
George Hotz be8958e26b
use CONTRACT before REDUCE (#5903)
* use CONTRACT before REDUCE [run_process_replay]

* support half expand

* EXPAND GEP
2024-08-04 16:17:33 -07:00
chenyu 4a65010de8
remove CUDACPU flag in tests [run_process_replay] (#5902)
no longer used
2024-08-04 16:06:38 -04:00
qazal aad9234e52
test fused precompute_freqs_cis (#5900)
* test_precompute_freqs_cis

* tiny for ci
2024-08-04 21:01:05 +03:00
chenyu c67e9887f7
support using str to specify dtype (#5897)
* support using str to specify dtype

in Tensor creation and args into `cast` and `bitcast`, and acc_dtype

* more tests
2024-08-04 12:56:28 -04:00
qazal 4c5ef2cc4f
setitem with arange fusion 1 (#5898) 2024-08-04 16:09:21 +03:00
chenyu da61dea1b2
simple failed UOp sub symbolic test case (#5894) 2024-08-03 14:27:23 -04:00
qazal 56ef9e453e
pad reduceops to the max of each dimension (#5889)
* early verify

* pad reduceops to the max of each dim

* remove the function
2024-08-03 14:03:30 +03:00
qazal 65fa86901a
indexing fusion 2 (#5888)
* arange fusion

* kernels that fuse

* tests
2024-08-03 13:13:39 +03:00
qazal af59b2eea9
tests from the indexing fusion branch (#5886) 2024-08-03 11:56:48 +03:00
chenyu d5de44340e
UOp add mod folding (#5862)
* UOp add mod folding

* that passes now
2024-08-02 18:31:46 -04:00
chenyu 41bbd3f4c1
update UOp mod reduction patterns (#5883)
prepare generic mod folding, also some test changes from mod folding pr
2024-08-02 17:43:40 -04:00
wozeparrot acadccf344
comma benchmark (#5518) 2024-08-02 14:36:54 -07:00
Elias Wahl 4a114756f6
New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00