* [WIP] Added an approximate implementation of Sin (FP32, FP64), passing all tests on the Clang runtime
* Map nan/-inf/inf to 1.0 to avoid doing as_const(math.inf)
* [WIP] Added support for LLVM IR
* cleaned up the code for mypy and the linter
* [WIP] Updated FP64 support (bitwise shift causes a compilation error), fixed a linter issue.
* [Add] added fast=True mode, which disables the slow Payne-Hanek reduction
* [Fix] fixed failure to compute elements when the shape includes zero
* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly
* [WIP] update the assembly for PTX
* Enables fast=True when the device is PTX, NV, or CUDA, to avoid slow bitwise ops (the level-3 reduction is not required).
* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)
* [Fix] Cyclic dependencies in xlog2
* [Fix] Cyclic dependency in the exp2 and log2 graphs (passing test_symbolic_ops.py)
* [Fix] keep using higher precision for exp2; the cyclic-graph issue remains to be fixed
* [Refactor] removed the is_metal option. xsin does not rely on FP64 in FP32 mode.
* [WIP] FP16 xsin implementation passing all tests (still needs refactoring)
* [WIP] Added an FP16 exp2 implementation
* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added an FP16 Log2 approximation.
* stashed the changes for FP16 sin
* [Fix] Patch for FP16 Sin/Exp2 (updated dtype_via, fp32_p, and lower)
* [Refactor] migrated to fastmath.py, simplified some code, renamed APIs in fastmath, etc.
* [Refactor] Added the polyN helper to clean up N-term polynomial approximations (see the sketch after this commit list)
* [Patch] Increase FP64 precision in ldexp3k where possible, and patch FP16 exp2
* [Patch] added bitcast_forward option
* [Patch] resolved the cyclic graph
* patch to fix the cyclic graph
* set bitcast_forward=True in ilogb2k
* bitcast_forward for multi.py
* E501
* Break into multiple small PRs
* [Patch] the FP16 -> FP64 upcast is no longer required since xlog2 uses a quad-precision polyN
* [Patch] NV still requires FP64 for xlog2
* updated schedule test
* updated the count of kernels
* [Update] Removed all bitwise ops (SHL/SHR), tweaked log2's NaN handling; passing all tests except on AMD.
* Bitcast: make them API-compatible
* [Update] force the use of bitcast
* updated the constant-folding count
* [Patch] Create the exp2 mask with x <= Inf, which is True as long as x is a real (non-NaN) value
* [Update] isnan(x)-free log2 algorithm; passes PTX tests, METAL with fastmath enabled handles NaN well, and the AMD backend no longer crashes.
* xsin avoids calling payne_hanek_reduction, which is slow to compile, so Stable Diffusion compiles in a realistic time
* minor simplifications to the Payne-Hanek reduction
* [refactor] refactored some redundant parts of the Payne-Hanek implementation
* [refactor] more readable Payne-Hanek implementation
* [refactor] improved the code consistency of the Payne-Hanek implementation
* [experiment] topological sort when doing _recursive_group (unclear whether this is the right approach, but it works)
* Revert "[experiment] topological sort when doing _recursive_group (unclear whether this is the right approach, but it works)"
This reverts commit 0eee08b87c9e46da8aec0a8edec5316634031a49.
* use allow_buffer_view
* let's support multilazytensor
* updated the count of kernels
* [test] added JIT tests for the approx ops
* keep the failing constant-folding tests in place, marked with expectedFailure
* make the timeout deadline explicit when testing the approx JIT timeout
* [WIP] Simplified the xsin implementation; it never times out
* [Refactor] Improved the consistency of the approx sin implementation; passing timeout tests
* integrated xexp2_base into xexp2
* Set switch_over=39800.0
* delete: is_buffer_fastmath_supported
* sin: compute against abs(x)
* some cleanups
* fix typo
* removed the space between param and dtype
* allow 514 kernels on CI for Stable Diffusion
* [refactor] no need to upcast at ldexp3k
* [refactor] added some comments and references to help with understanding the code
* [Fix] 1.0 ULP Sine Approximation for FP16
* [update] assume e != 0
* use pow2if instead of ldexp3k so the Payne-Hanek reduction fuses into one kernel
* check that the approximated sin/log2/exp are fused into one kernel
* clean up changes
* test amd exp
* some code cleanup and test sigmoid
* fix: enabled payne_hanek for FP16 to achieve higher accuracy
* fix: payne_hanek always accumulates the value in uint64, and FP16 sin is fused into a single kernel
* [Refactor] Rename: fastmath -> transcendental
* [Refactor] Added TRANSCENDENTAL and moved the gate function to function.py (see the ContextVar sketch after this commit list)
* updated const folding tests
* TRANSCENDENTAL as a ContextVar; removed the old Cody-Waite reduction test, added assertions, etc.
* Add: unittest.main()
* Import TRANSCENDENTAL instead of getenv
* Refactor: Added a dtype check when TRANSCENDENTAL=2, more ContextVar usage
* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for FP16 math
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
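For orientation, below is a minimal Python sketch of the idea behind the polyN helper and the polynomial sin approximation referenced above: evaluate coefficients in Horner form on a reduced argument. The coefficient values, the naive reduction, and the function names are illustrative assumptions; the actual kernels operate on tensor ops and use a Payne-Hanek reduction for large inputs.

```python
import math

def polyN(x: float, coeffs: list[float]) -> float:
  """Horner evaluation of coeffs[0]*x^(n-1) + ... + coeffs[-1] (scalar stand-in for the
  tensor-level helper named in the commits)."""
  acc = 0.0
  for c in coeffs:
    acc = acc * x + c
  return acc

def approx_sin(x: float) -> float:
  """Toy sin: reduce to roughly [-pi/2, pi/2], then evaluate an odd polynomial.
  The real kernels use a Payne-Hanek reduction for huge |x|; this naive reduction and the
  Taylor coefficients below are only illustrative."""
  q = round(x / math.pi)                                     # nearest multiple of pi
  r = x - q * math.pi                                        # reduced argument
  s = r * polyN(r * r, [-1.0/5040, 1.0/120, -1.0/6, 1.0])    # r - r^3/6 + r^5/120 - r^7/5040
  return -s if q % 2 else s                                  # sin(r + q*pi) = (-1)^q * sin(r)

print(approx_sin(1.0), math.sin(1.0))                        # close for moderate arguments
```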
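And a sketch of the TRANSCENDENTAL gate idea from the later commits: a context/environment switch that decides whether SIN/LOG2/EXP2 lower to the polynomial approximations. The level semantics, dtype set, and device list below are assumptions for illustration, not tinygrad's exact behavior.

```python
import os

# Hypothetical stand-in for the TRANSCENDENTAL ContextVar: 0 = never use the approximations,
# 1 = use them where the backend benefits, 2 = force them (guarded by a dtype check).
# These level semantics are assumptions for illustration only.
TRANSCENDENTAL = int(os.getenv("TRANSCENDENTAL", "1"))
SUPPORTED_DTYPES = {"half", "float", "double"}
PREFERS_APPROX = {"PTX", "NV", "CUDA"}   # devices where the slow bitwise path is avoided (assumed)

def use_transcendental(op: str, dtype: str, device: str) -> bool:
  """Gate deciding whether SIN/LOG2/EXP2 lower to the polynomial approximations."""
  if op not in {"SIN", "LOG2", "EXP2"} or dtype not in SUPPORTED_DTYPES: return False
  if TRANSCENDENTAL == 2: return True                    # forced, for supported float dtypes
  if TRANSCENDENTAL == 1: return device in PREFERS_APPROX
  return False

print(use_transcendental("SIN", "float", "CUDA"))
```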
It's possible to support TinyJit inside TinyJit, but there are edge cases, like two TinyJit functions sharing another TinyJit function, so just give a more precise error for now (a minimal sketch of such a guard follows).
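A minimal sketch of the kind of guard this describes, assuming a class-level "currently capturing" flag; the class and attribute names are hypothetical and this is not tinygrad's actual TinyJit.

```python
class TinyJitSketch:
  """Illustrative guard only: nesting one JIT call inside another raises a precise error
  instead of silently mis-capturing kernels."""
  _capturing = False                      # is some jitted function currently being traced?

  def __init__(self, fn):
    self.fn = fn

  def __call__(self, *args, **kwargs):
    if TinyJitSketch._capturing:
      raise RuntimeError("can't call a TinyJit-wrapped function from inside another TinyJit")
    TinyJitSketch._capturing = True
    try:
      return self.fn(*args, **kwargs)     # a real implementation would capture/replay here
    finally:
      TinyJitSketch._capturing = False

@TinyJitSketch
def inner(x): return x + 1

@TinyJitSketch
def outer(x): return inner(x) * 2

try:
  outer(1)
except RuntimeError as e:
  print(e)                                # the precise error, instead of a confusing capture bug
```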
* mockgpu nv
* works
* comment that out
* fix merge
* setup gpuocelot
* install packages
* not run all of them
* passes
* fix ci
* almost
* should pass
* linter
* linter 2
* try this?
* ugh, not supported
* ci
* remove ticket from description
* better descs
* fix _to_const_val and const folding around it
is_unrealized_contiguous_const is too strict and is almost never hit if the const is expanded.
It suffices to check that there's no pad (see the sketch after this commit list).
* that test is folded
* test_const_folding
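A self-contained sketch of the relaxed check described above, under the assumption that each view of the buffer's shape tracker carries an optional mask describing padding; the function and parameter names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Mask = Optional[Sequence[Tuple[int, int]]]  # per-dim (start, end) valid region; None = no pad

def is_foldable_const(base_op: str, base_realized: bool, view_masks: Sequence[Mask]) -> bool:
  """Relaxed const-folding check: the base must be an unrealized CONST, and no view may
  introduce padding (a non-None mask). Expanding or reshaping the const is fine; padding
  it is not. Illustrative only."""
  return base_op == "CONST" and not base_realized and all(m is None for m in view_masks)

# An expanded const with no pad folds; a padded one does not.
assert is_foldable_const("CONST", False, [None, None])
assert not is_foldable_const("CONST", False, [((0, 3),)])
```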
* feat: initial xor
* feat: initial threefry (see the reference sketch after this commit list)
* feat: remove custom random
* fix: really need to install precommit
* feat: forgot that this is a rotate, not a shift
* clean: put that there
* feat: numpy xor
* feat: quick test for xor
* feat: llvm xor
* feat: slightly working xor in torch
* feat: rand works in jit
* clean: save a line
* feat: match jax
* feat: maybe test against jax
* feat: requires_grad
* fix: fix test_symbolic_ops
* feat: lower alpha
* feat: just pad
* fix: maybe fix training tests?
* fix: fix some llvm stuff
* feat: cursed realize on the way out
* feat: testing jax
* fix: why is the jax install process not simple
* fix: maybe passing test
* fix: symbolic workarounds
* clean: still need that precommit
* fix: aaaa
* fix: more test fixes
* fix: quick fix for wgsl
* feat: need to set requires_grad on the final tensor
* feat: one more tensor
* feat: don't take forever
* feat: seeing why CI is broken
* feat: can't allocate 64GiB
* fix: fix this
* feat: hope this doesn't break something before I go to bed
* feat: don't destroy ram
* feat: int
* feat: remove jax
* feat: proper-ish workaround?
* feat: skip slow webgpu tests
* feat: no longer fails
* feat: use dtypes
* feat: real number
* fix: torch
* fix: don't test against reference for torch
* feat: to device
* feat: fix advanced indexing
* feat: correct casting
* feat: even rng_counter
* feat: match master
* feat: this was actually bad
* fix: maybe?
* feat: store
* feat: remove realizes
* feat: somehow this is important
* feat: somehow this is also important
* feat: save a line
* fix: don't need that anymore
* feat: restore this
* fix: linter
* feat: remove realizes
* fix: realized is in base now
* fix: add back cast
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: :(
* fix: :(
* fix: not being dumb
* feat: try changing less tests
* feat: shouldn't have to change that
* feat: contiguous bumps it by one
* fix: hmm
* fix: numpy memory moment
* fix: cl_khr_fp16
* fix: torch has different tensor count
* fix: missing contiguous
* hmm: hmm
* fix: some fixes
* fix: typing
* feat: don't do that
* feat: typing fixes
* feat: why is this realize required?
* feat: admittedly, kinda odd typing
* feat: oh
* feat: remove realizes
* feat: why is this realize required?
* fix: hacky patch for cudacpu
* fix: without this realize pytest crashes?????
* fix: shorter line
* fix: cudacpu fixes
* fix: cudacpu fixes
* feat: real buffer
* feat: don't search when searching lmao
* fix: can't use contiguous things
* fix: no more 100GB arrays
* fix: revert
* fix: skip 7 and 10
* feat: working-ish beam
* feat: minimize changes
* feat: seed 0 stable diffusion example changed
* fix: different on ci
* fix: no beam
* feat: make threefry optional
* fix: check value
* fix: unused import
* feat: threefry default
* fix: 5d
* feat: allow non upcast div
* fix: 5d better
* fix: 5d better
* fix: save all dtype
* feat: proper error
* feat: lazyop key
* fix: check float
* feat: try removing this realize now
* feat: disable threefry for uops hip tensor cores
* feat: don't need that
* feat: only check upcast
* fix: disable threefry for some metal tests
* feat: disable for metal tensor uops as well
* feat: disable for most uops
* fix: disable threefry for new uops tests
* feat: multitensor
* fix: typing
* feat: threefry default off
* feat: skip threefry half rand
* feat: restore old
* fix: bad git
* clean: ruff
* feat: bfloat16 fix
* fix: :|
* feat: restore old
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
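The commits above replace the NumPy-backed RNG with a counter-based Threefry generator (matched against JAX). As a reference point, here is a scalar pure-Python sketch of the standard Threefry-2x32 block function with 20 rounds; tinygrad's version runs as fused tensor ops, so this is only for understanding the math, and the key/counter argument order is an assumption.

```python
MASK32 = 0xffffffff

def _rotl32(x: int, d: int) -> int:
  return ((x << d) | (x >> (32 - d))) & MASK32

def threefry2x32(key: tuple[int, int], ctr: tuple[int, int]) -> tuple[int, int]:
  """Pure-Python reference of Threefry-2x32 (20 rounds): derives two pseudorandom 32-bit
  words from a 64-bit key and a 64-bit counter, no state carried between calls."""
  rot = (13, 15, 26, 6, 17, 29, 16, 24)                  # rotation schedule
  ks = (key[0], key[1], 0x1BD11BDA ^ key[0] ^ key[1])    # key schedule with parity constant
  x0, x1 = (ctr[0] + ks[0]) & MASK32, (ctr[1] + ks[1]) & MASK32
  for i in range(20):                                    # 5 groups of 4 mix rounds
    x0 = (x0 + x1) & MASK32
    x1 = _rotl32(x1, rot[i % 8]) ^ x0
    if i % 4 == 3:                                       # key injection after every 4 rounds
      g = i // 4 + 1
      x0 = (x0 + ks[g % 3]) & MASK32
      x1 = (x1 + ks[(g + 1) % 3] + g) & MASK32
  return x0, x1

# e.g. two 32-bit random words for (seed, counter) without any mutable RNG state:
print(threefry2x32((0, 42), (0, 1)))
```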
* simple LoadOps.ASSIGN (see the usage sketch after this commit list)
* skip that test
* don't assign in onnx ops gemm
* track cache usage
* recreate the lazybuffer to avoid the cache
* fix contigs
* skip that test
* lol
* better letters
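A minimal usage sketch of what a simple LoadOps.ASSIGN enables at the Tensor level, assuming tinygrad's Tensor.assign API: the update is written back into the already-realized buffer instead of allocating a new one.

```python
from tinygrad import Tensor

# Assumed API shape: assign schedules an ASSIGN into w's existing buffer.
w = Tensor.ones(4).contiguous().realize()   # realized destination buffer
w.assign(w + 1)                             # schedule the in-place update
w.realize()
print(w.numpy())                            # -> [2. 2. 2. 2.]
```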
* jit graph split
* update
* that's fine, not all buffers are there now
* use logarithmic though, seems good
* no, keep it simple
* add test
* simplify
* split the graph when a jit item cannot be graphed (see the sketch below)
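A generic sketch of the graph-splitting idea above: walk the captured jit items, batch consecutive graphable ones, and cut the batch whenever an item cannot be graphed. The helper name, the fixed max_batch, and the simple growth rule are assumptions (the commits mention a logarithmic scheme instead).

```python
from typing import Callable, Iterable, List

def split_into_graph_batches(items: Iterable[object], can_graph: Callable[[object], bool],
                             max_batch: int = 64) -> List[List[object]]:
  """Group consecutive graphable jit items into batches; a non-graphable item is emitted
  as its own single-item batch so it runs outside the graph executor (illustrative only)."""
  batches: List[List[object]] = []
  current: List[object] = []
  for it in items:
    if can_graph(it) and len(current) < max_batch:
      current.append(it)
    else:
      if current: batches.append(current); current = []
      if can_graph(it): current.append(it)       # start a new graphable batch
      else: batches.append([it])                 # run this item on its own
  if current: batches.append(current)
  return batches

# e.g. item 4 can't be graphed, so it splits the surrounding graphable runs:
print(split_into_graph_batches(range(1, 8), can_graph=lambda i: i != 4))
# -> [[1, 2, 3], [4], [5, 6, 7]]
```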
* Remove the rawbuffer copy in runtime/lib.py on line 44
* remove buffer view
* added metadata back, oops
* delayed cpu testcase
* whitespace
* whitespace
* buffer behavior as is
* Update test_jit.py
* models matrix
* fix typo and install gpu deps
* install llvm deps if needed
* fix
* testops with cuda
* remove pip cache since it doesn't work
* cuda env
* install cuda deps
* maybe it will work now
* i can't read
* all tests in matrix
* trim down more
* opencl stuff in matrix
* opencl pip cache
* test split
* change cuda test exclusion
* test
* fix cuda maybe
* add models
* add more n=auto
* third thing
* fix bug
* cache pip more
* change name
* update tests
* try again cause why not
* balance
* try again...
* try apt cache for cuda
* try on gpu:
* try cuda again
* update packages step
* replace libz-dev with zlib1g-dev
* only cache cuda
* why error
* fix gpuocelot bug
* apt cache err
* apt cache too slow?
* opt and image in single runner
* add a couple n=autos
* remove test matrix
* try cuda apt cache again
* libz-dev -> zlib1g-dev
* remove -s since it's not supported by xdist
* the cache takes too long and doesn't work
* combine webgpu and metal tests
* combine imagenet to c and cpu tests
* torch tests with linters
* torch back by itself
* small windows clang test with torch tests
* fix a goofy windows bug
* im dumb
* bro
* clang with linters
* fix pylint error
* linter doesn't work on windows
* try with clang again
* clang and imagenet?
* install deps
* fix
* fix quote
* clang by itself (windows too slow)
* env vars for imagenet
* cache pip for metal and webgpu tests
* try torch with metal and webgpu
* doesn't work, too long
* remove -v
* try -n=logical
* don't use logical
* revert accidental thing
* remove some prints unless CI
* fix print unless CI
* ignore speed tests for slow tests
* clang windows in matrix (ubuntu being tested in imagenet->c test)
* try manual pip cache
* fix windows pip cache path
* all manual pip cache
* fix pip cache dir for macos
* print_ci function in helpers
* CI as variable, no print_ci
* missed one
* cuda tests with docker image
* remove setup-python action for cuda
* python->python3?
* remove -s -v
* try fix pip cache
* maybe fix
* try to fix pip cache
* is this the path?
* maybe cache pip
* try again
* create wheels dir
* ?
* cuda pip deps in dockerfile
* disable pip cache for clang
* image from ghcr instead of docker hub
* why is clang like this
* fast deps
* try use different caches
* remove the fast thing
* try with lighter image
* remove setup python for cuda
* small docker and cuda fast deps
* ignore a few more tests
* cool docker thing (maybe)
* oops
* quotes
* fix docker command
* fix bug
* ignore train efficientnet test
* remove dockerfile (docker stuff takes too long)
* remove docker stuff and normal cuda
* oops
* ignore the tests for cuda
* does this work
* ignore test_train on slow backends
* add space
* llvm ignore same tests as cuda
* nvm
* ignore lr scheduler tests
* get some stats
* fix ignore bug
* remove extra '
* remove and
* ignore test for llvm
* change ignored tests and durations on all backends
* fix
* and -> or
* ignore some more cuda tests
* finally?
* does this fix it
* remove durations=0
* add some more tests to llvm
* make last pytest more readable
* fix
* don't train efficientnet on cpu
* try w/out pip cache
* pip cache seems to be generally better
* pytest file markers
* try apt fast for cuda
* use quick install for apt-fast
* apt-fast not worth it
* apt-get to apt
* fix typo
* suppress warnings
* register markers (see the conftest sketch after this list)
* disable debug on fuzz tests
* change marker names
* apt update and apt install in one command
* update marker names in test.yml
* webgpu pytest marker
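A minimal conftest.py sketch of the "register markers" step above: declaring custom markers so pytest doesn't warn about them, and letting CI jobs select subsets per backend with -m. The marker names here are illustrative, not the exact set used.

```python
# conftest.py -- register the custom markers used to split CI jobs.
def pytest_configure(config):
  for marker in ("webgpu: tests that need a WebGPU device",
                 "exclude_cuda: tests skipped on the CUDA runner",
                 "slow: long-running tests excluded from quick jobs"):
    config.addinivalue_line("markers", marker)

# A CI job can then select or exclude by marker, e.g.:
#   python -m pytest -n auto -m "not exclude_cuda and not slow" test/
```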