* tqdm replacement almost
* formatting
* formatting
* imports
* line len
* fix
* removed set description :(
* removed set description :(
* fix
* fix
* green check?
* rewrote as class, fixed several bugs
* types spacing
* removed imports
* fix
* iterable
* typing
* mypy disagreement
* imports
* more e2e tests vs tqdm
* removed seed setting
* robustness against time.sleep() flakiness
* flaky fix
* automatic bar closing when count==total
* cleanup
* clang error with tqdm
* tqdm back
* use os lib, print to stderr (fixes the clang bug, where the bar was leaking into the generated c program
* back to shutil
* unit_scale + unit_scale test
* custom unit to tests
* pretty
* clean
* removed flaky test
* less test iters
* empty line
* remove disable
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
* lars optimizer + tests
* fix skip list!
* use id to compare in skip list
* go back to using set
* Tensor(bool) * Tensor(bool) is and
* don't lint external/mlperf_resnet
* whitespace
* add external_test_optim to opencl tests
* give mlperf task a name
* mlperf under onnx
* remove track_gnorm
* contiguous instead of realize
* assert momentum and weight decay positive
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* move gpuctypes in tree
* fix mypy
* regex exclude
* autogen sh
* mypy exclude
* does that fix it
* fix mypy
* add hip confirm
* verify all autogens
* build clang2py
* opencl headers
* gpu on 22.04
* dtypes alu test
* those types don't exist in torch
* floats
* more tests
* disable those
* a couple unary tests
* skip float16 tests in CI for GPU
* fix LLVM bool add True+True=1+1=2 which truncates to False in native LLVM
* remove hardcoded float for LLVM ALU fns
* less sensitive atol for fp32, 1e-10 is flaky and sometimes failed even if you revert the merge commit for non-fp32 math, nothing has changed in our kernels for fp32.
* return on overflows
* fix CUDA exp2
* compute results of op regardless of bounds in a python backend
* skip fp16 in GPU and CUDACPU
* fuzz a smaller range in the float_midcast_int32 test
I sampled this and we overflow ~70% of the time.
because numpy behaves differently on different devices for overflows and Metal seems to do the same, I'm opting to eliminate the non-determinism here
* remove CUDA exp2 overload it's already there now
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* ops_gpu is go
* fix size 0
* fix image, and add more tests
* nerf openpilot test, doesn't test thneed
* run the schedule
* better
* oops, new inputs
* delete pyopencl
* Update ops_gpu.py
* cuda with gpuctypes
* hip gpuctypes
* graphs
* rename + linter happy
* use cpu_time_execution
* no ji in build_kernel_node_params
* remove hip_wrapper
* hip fix
* no arc
* smalle changes
* no clean moduke in cudacpu
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* i love the walrus so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
* need env.
* ugh, wrong path
* replace get_child
* Move ops_triton to runtime and remove errors from deprecated code
* Remove deprecated AST Kernel
* Remove deprecated buffer
* Add TritonProgram
* Triton Buffer
* Use RawCUDABuffer
* triton_compile
* Added new parameter
* pass _buf to program
* remove deprecated include
* Added triton tests
* Deprecated includes removed
* remove double print
* Disable float4 support
* Disable float4 support
* variable load fix
* Track local size
* Add pycuda to triton dependencies
* Merge test.yml
* install cuda packages for testing
* merge double package install
* remove emulated from triton tests
* upscale local index to power of 2 and add masking
* cuda envs
* Add TernaryOps
* ConstOp loading
* proper function name
* remove deprecated variables
* get global program from name
* const ops match local shape
* Enable test_nn
* remove deprecated import
* fix linter error
* Add wait logic
* Add local size override
* accumulate local shapes instead of using max shape
* Merge triton tests into global tests
* fix envs in testing
* Old testing routine
* split file into renderer and program
* remove print and starting whitespace
* pretty ptx print on debug 5
* linter errors
* ignore triton saturation tests
* ignore test example
* remove pytorch cpu extra index
* Add triton to existing testing routine
* use triton tests
* disable cuda backend in triton tests
* use cudacpu in tests
* print used device
* Print device default
* Remove print
* ensure we are running triton backend
* update variable signatures
* update dtypes for load
* infinity render fixed
* limit global size
* negative infinity now properly rendered
* split chain with parentheses for and node
* Add option to disable shared memory, disable for triton
* missing import
* Properly index and mask conditional load
* use mask only if not loading a block pointer
* nan support
* fix symbolic tests to include chain split
* proper masking for stores
* Implemented bool dtype
* Add mod
* fix loads for variables with valid range
* merge triton with cuda runtime
* merge from master
* run triton tests with cuda
* Correct target when running from triton
* conftest with triton compiler config
* use triton nightly
* verbose tests for triton
* capture stdout
* fix function depth when exiting multiple loops
* add render valid function for readabilty
* fix mask for local loops
* add _arg_int32 datatype
* fix dims for conditional loads
* enable non float stores
* correct variable dtypes
* fix type for arg_int32
* remove junk
* Added get max function for range based var.max
* remove deprecated code
* Fix triton ptxas path
* Fix testing for CI
* clamp local size by max local size instead of always running max
* Disable matmul test in triton cpu
* rerun tests
* Disable broken test in triton cpu
* whitespace removed
* rerun tests again
* Disable TestSymbolicOps for triton
* update to new uops
* linter fix
* ignore test/extra
* linting fix
* Update tinygrad/renderer/triton.py
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* remove deprecated line
* quotes type fix
* linter
* Remove unnecesary lines
* UnaryOps.NEG
* dont define constants
* Linting fix
* Disable tests that are broken in ocelot
* remove trailing whitespace
* reduce line count
* linting fix
* update to new uast
* New looping style
* Update to new uast
* make AST runner work with triton
* linting fix
* set renderer var for testing
* disable local for ocelot
* reenable all tests for ocelot
* Pass shared to cuda
* Don't group if the backend doesn't support shared mem
* use working gpuocelot branch
* enable all tests
* enable local for ocelot
* cleanup
* Update test.yml
* update cache key
* reenable test symbolic and extra
* Update test.yml
* Revert "Update test.yml" (rerun tests)
This reverts commit 98c0630ee5da4379e5c6b2437a5145fe87058c35.
* Revert "fix symbolic tests to include chain split"
This reverts commit 22a9a4c9cd14d23735e6540c8d90ee005ac4ea17.
* Revert "split chain with parentheses for and node"
This reverts commit 7499a7004ef4db785d0cd05cf292fdeff65ca90d.
* use global size from linearizer
* rename newvar to dtype to match other renderers
* join program start lines
* simplify code that adds axis to local dims
* assign r[u] in ssa
* We no longer need to replace target in src
* we no longer need to cast indices to int by hand
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>