* update cstyle renderers to take a dtype in code_for_op
* implement NEG for bools in LLVM
* update triton
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* cuda with gpuctypes
* hip gpuctypes
* graphs
* rename + linter happy
* use cpu_time_execution
* no ji in build_kernel_node_params
* remove hip_wrapper
* hip fix
* no arc
* small changes
* no clean module in cudacpu
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
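
A minimal conceptual sketch of what an LRU-style allocator like the one above does (illustrative only, not the actual tinygrad LRUAllocator; the `backend._alloc`/`_free` hooks are assumptions): freed buffers are cached by size and handed back on the next allocation of that size instead of going through the device allocator again.

```python
from collections import defaultdict

class LRUAllocatorSketch:
  def __init__(self, backend):
    self.backend = backend            # assumed: object exposing raw _alloc(size) / _free(buf)
    self.cache = defaultdict(list)    # size -> freed buffers kept around for reuse

  def alloc(self, size:int):
    # reuse a cached buffer of the same size if we have one
    if self.cache[size]: return self.cache[size].pop()
    return self.backend._alloc(size)

  def free(self, buf, size:int):
    # don't release to the device; keep it for the next alloc of this size
    self.cache[size].append(buf)

  def free_cache(self):
    # drop everything, e.g. when the device runs out of memory
    for bufs in self.cache.values():
      for buf in bufs: self.backend._free(buf)
    self.cache.clear()
```
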
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* bring hip graph back
* share with metal
* fix linter
* remove hasattrs
* Update ops_hip.py
* hip wrapper does not use _buf
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* I love the walrus operator so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
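
A sketch of the download-cache idea in the commits above (illustrative; names and layout are assumptions, not the real `fetch`): files are cached under a name derived from the URL, the cache directory embeds `DOWNLOAD_CACHE_VERSION` so it can be invalidated, and the walrus operator drives the read loop.

```python
import hashlib, pathlib, tempfile, urllib.request
from typing import Optional

DOWNLOAD_CACHE_VERSION = 1   # bump to invalidate everything cached so far

def fetch(url:str, name:Optional[str]=None) -> pathlib.Path:
  fn = name if name is not None else hashlib.md5(url.encode("utf-8")).hexdigest()
  fp = pathlib.Path(tempfile.gettempdir()) / f"downloads_v{DOWNLOAD_CACHE_VERSION}" / fn
  if not fp.is_file():
    fp.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as r, open(fp, "wb") as f:
      while chunk := r.read(16384):   # walrus: read and test in one expression
        f.write(chunk)
  return fp
```
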
* need env.
* ugh, wrong path
* replace get_child
* remove force_wait
* refactor
* get rid of stupid ASTRunner
* fix del in diskbuffer
* BufferOps.FROM_UNDERLYING
* put offset in the rawbuffer
* fix bugs
* use exec
* autopad shapetracker for BEAM
* OptOps.PADTO
* skip that test for now
* correct padding reduce axis
* just 32
* avoid more than double the FLOPs
* cleanups
* test case
* no support for triton and llvm yet
* typos
* symbolic shape would not work
* cannot PADTO with MAX kernel
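
Why PADTO is limited the way the commits above describe, shown with plain numpy (this illustrates the reasoning, not the kernel code): padding a reduce axis with zeros leaves a SUM unchanged, so the only cost is extra FLOPs (hence the "avoid more than double the FLOPs" guard), but it can silently change a MAX.

```python
import numpy as np

x = np.array([-3., -1., -2.])     # a reduce axis of length 3, all negative
padded = np.pad(x, (0, 1))        # PADTO-style zero pad up to length 4

assert padded.sum() == x.sum()    # zeros never change a SUM reduce
assert padded.max() != x.max()    # but they can change a MAX reduce (0 > -1)
```
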
* advance db version
* no breaking change - don't advance db version
* is triton just python?
* Revert "is triton just python?"
This reverts commit 17e776c25587615e33a3634c2fb0bb8591ce65d4.
* Revert "Revert "is triton just python?""
This reverts commit 6c434c01e1c4b0ea0431ec18632cd859fb3cf260.
* support llvm
* is it really passing in CI only?
* update tests
* oh triton test passed
* simpler
* revert that, with a test
* check if st are the same
* Revert "check if st are the same"
This reverts commit d2a5eac110a5da1af82a2728c883779ef69c3cad.
* update the db version
* rebase artifact
* replace all _dtypen with dtype.vec(n)
fix: print works
* conceptual refactor of cstyle render_load logic
* linearizer GEP is explicit that its dtype is the scalar version of localtype
* vectorized global_store and load don't need a conditional
* beautiful mnist
* beautiful mnist example
* from tinygrad import Tensor
* more beautiful
* the jit is super core tinygrad
* globalcounters reset on jit run
* symlinks and exclude
* beautiful_cartpole
* evaluate is its own function
* no symlinks
* more beautiful
* jit reset for double speed
* type hinting for JIT
* beautiful_mnist gets 98%
* beautiful_mnist < 4s with BEAM=2
* better cartpole
* use actor critic
* zero_grad got lost
* delete double relu
* stable cartpole with PPO
* beautiful_cartpole is more beautiful
* REPLAY_BUFFER
* beautiful stuff typechecks
* None support in shape
* hp tuning
* add back as_strided, move rebuilt mops to extra
* negative stride for ops_cpu
* Revert "negative stride for ops_cpu"
This reverts commit a13b6815ac31478d31ae71c26f4d4e4d274bf155.
* skip that
* style
* very close
* remove comment
* negative strides working
* almost everything passes
* calculate offset with list comprehension
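
The offset calculation mentioned above, in isolation (assumption: this is the standard base-offset formula for negative-stride views, not the exact committed code):

```python
def base_offset(shape, strides):
  # element (0, ..., 0) of a view with negative strides sits at the *end* of
  # each reversed axis, so shift the flat offset there
  return sum((s - 1) * -st for s, st in zip(shape, strides) if st < 0)

# a length-5 axis viewed with stride -1 starts at flat index 4
assert base_offset((5,), (-1,)) == 4
```
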
* some cleanup
* got disk load working
* review suggestions
* fix after merge
* overlap working
* did it
* clean
* fixed disk load
* lint
* mypy
* removed as_strided
* trying without simplify
* added back simplify
* make sure expanding to smaller shape
* cleanup
* removed comment
* removed env file
* trying whisper test again
* onnx test sqlite issue
* working on test
* finished test
* eliminate unnecessary shrink-then-pad
* don't shrink buffer
* added strides check
* added to ci under linters
* switch issue
* allow symbolic stride
* removed .env
* isinstance
* adjust strides for double expand
* cleanup
* needed to add type hint for mypy
* set pythonpath
* metal indirect command buffers
* sub 1ms gpt
* metal batch exec is good
* remove whitespace
* input_replace
* fix ci
* useResources
* very simple cacheallocator
* update_stats
* fix CI
* minor
* remove that from jit
* refactor/ci: delete many `# type: ignore`
* replace `axis.__class__ is int` with `isinstance(axis, int)` to make mypy happy
* add `--warn-unused-ignores` to mypy flag
refs #2240
* ci: move `--warn-unused-ignores` flag to mypy config
refs #2240
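
Why the `isinstance` swap makes mypy happy: `isinstance` participates in type narrowing, while `axis.__class__ is int` does not, so the checker can't prove the branch is an `int`. A tiny illustration:

```python
from typing import Tuple, Union

def normalize(axis: Union[int, Tuple[int, ...]]) -> Tuple[int, ...]:
  # `if axis.__class__ is int:` would leave axis typed as the Union here
  if isinstance(axis, int):   # mypy narrows axis to int in this branch
    return (axis,)
  return axis
```
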
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* remove arm64, caching for cuda
* caching in llvm
* switch cache_compiled to new cache
* fix clang
* caching for metal
* fix pylint
* cleanups
* perf_counter and binary
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* stable diffusion < 324ms
* revert swap action
* fix tests due to more sum splitting
* REDUCEOP_SPLIT_THRESHOLD env var
* added from unaligned np test (#2134)
* align cpu buffer before copy into cl buffer (#2135)
* remove shelve from handcode_resnet50_opt.py (#2139)
* Add dictionary keys to reduce db size (#2131)
* work
* ignore beam cache
* dictionary keys are generic
* minor db cleanups
* fix baseline and extract dataset
* fix training
* log likelihood
* more lin to feats
* sts
* training policynet
* net sort of works
* dedup
* refactor, stupid new actions
* fix uops deduping
* BEAM_ESTIMATE
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
* feat: move to hip
* feat: special path for RawBufferTransfer
* feat: initial rawbuffertransfer
* feat: hip ipc
* feat: working hip ipc
* feat: need to base device without args
* feat: close mem handle
* feat: modified test
* feat: more multihip stuff
* clean: cleanup
* feat: cleaner
* feat: don't crash
* feat: test more
* clean: way cleaner hip wrapper
* feat: barrier
* feat: barrier
* feat: this breaks stuff
* feat: we can use empty here
* feat: maybe fix tests
* feat: maybe fix tests again?
* fix: probably fix tests
* feat: no waiting here
* feat: wait here
* feat: much larger test
* feat: need to sync here
* feat: make this async
* feat: no waiting!
* feat: cut here
* feat: sync copy
* feat: random imports
* feat: much cleaner world
* feat: restore this
* feat: restore this
* clean: cleanup
* feat: set this
* Revert "disable flaky triton test"
This reverts commit 1e15fdaee7.
* Update test.yml
* check if has shared for matvec
* disable ocelot cache for triton
* disable ocelot cache
* disable ocelot cache
* pass shared to triton uops tests
* temporary debugs for CI crash
* Revert "temporary debugs for CI crash"
This reverts commit fee3ea96c818e83c19b935c2f8482e0ccc91a542.
* Revert "triton isn't tested, and allows this refactor (#2007)"
This reverts commit dea8bb0938.
* add runtime_args to every renderer, move triton local size override to runtime args
* Add binary to args, correct type returned
* update to new loops
* Update test.yml
* some cleanup
* move continue back
* more more more
* added to CI
* try
* try intentionally break some tests
* wtf
* del True for test
* yay tests broke, now pls no break
* try AGAIN
* gahy
* lol
* try
* move over constant
* moved over MORE
* move shrink over
* trailing lines
* try CUDA CI
* try again
* boom
* oops
* improved comments
* try: disable some flags and disable CUDA
* try breaking tests
* traceback has too much info so add --tb=no
* revert forced CI failure
* add comments and del unused imports
* oooooooo using regular debug try enable tb
* intentionally break tests
* added tb back. Maybe not too verbose
* strip whitespace
* missed something
* Shape op int32 -> int64
* oops missed something
* add some types
* get rid of crazy 1 liners in pad op
* actually test Split this time LOL
* strip that whitespace
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* start compile2
* tweak
* why are there two more kernels?
* minor cleanups
* don't break onnx tests
* add __metadata__ support to safetensors
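
For context on the `__metadata__` commit: a safetensors file starts with an 8-byte little-endian header length followed by a JSON header, and `__metadata__` is an optional string-to-string map inside that header. A small reader sketch (not the committed writer):

```python
import json, struct

def safetensors_metadata(fn:str) -> dict:
  with open(fn, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian size
    header = json.loads(f.read(header_len))          # JSON header with tensor entries
  return header.get("__metadata__", {})              # optional str -> str map
```
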
* no early realize in onnx
* cleanups
* bugfix
* clean up image type, add optimize
* opt to match old
* try that
* opt work
* run compile2
* optimizer
* print more
* prerealize
* imp
* NOLOCALS works
* no locals means no locals
* support fractional globals
* all locals welcome
* int that
* cleanups
* show gemv regression
* clean up diff
* use idx for the cond
* nolocals
---------
Co-authored-by: Comma Device <device@comma.ai>
* start work on auto opt
* lin failure
* not beating hcopt
* greedy
* timing is fast
* codegen.search
* greedy search in handcode_opt
* track running gflops
* clean up those files
* no failure
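
A conceptual sketch of the greedy search in the commits above (hedged: `get_actions` and `time_kernel` are stand-in names, and `copy`/`apply_opt` stand in for whatever the linearizer exposes; the real search has more guards):

```python
def greedy_search(lin, get_actions, time_kernel):
  best, best_time = lin, time_kernel(lin)
  improved = True
  while improved:
    improved = False
    for act in get_actions(best):
      try:
        cand = best.copy()
        cand.apply_opt(act)        # may raise if the action is invalid for this kernel
      except Exception:
        continue
      if (t := time_kernel(cand)) < best_time:
        best, best_time, improved = cand, t, True
  return best
```
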
* Allow multi-input model export
* Add model export unit test
* Fix efficientnet compilation
* Only run model export test on JIT supported devices
* Skip export model test if not EXPORT_SUPPORTED_DEVICE
* small changes
* expand in terms of substitute, directly expand g_idxs g_valid
* delete expand_ops
* don't compare using hash
* any instead of in
thanks gijskoning
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* support tc
* testing code
* no more create_rednode
* maxsize none in view/node
* oops
* undo
* typing
* oops
* oops
* lmao
* lmao
* add expand multi test
* Node.iter_idxs
* type
* type
* delete checks!
* clean up a little?
* expand_idx in symbolic
* un-golf
* play around with types >.>
* test_substitute and also remove an incorrect test?
* get rid of range
* Update symbolic.py
* split out view cache change
* split out flat components change
* reduce diff
* reduce diff
* add some float4 tests
* fix
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* init hip graph
* optimize args update
* cache symbolic in jit
* remove NOSTAT
* init BasicBatchExecutor
* symbolic infer cache per jit instance
* basicbatchexec is default for compiled
* batch_exec is taken from ASTRunner
* no infer cache
* batched execution of hip graph
* add comment about hip graph batches
* readable hip graph
* Move ops_triton to runtime and remove errors from deprecated code
* Remove deprecated AST Kernel
* Remove deprecated buffer
* Add TritonProgram
* Triton Buffer
* Use RawCUDABuffer
* triton_compile
* Added new parameter
* pass _buf to program
* remove deprecated include
* Added triton tests
* Deprecated includes removed
* remove double print
* Disable float4 support
* Disable float4 support
* variable load fix
* Track local size
* Add pycuda to triton dependencies
* Merge test.yml
* install cuda packages for testing
* merge double package install
* remove emulated from triton tests
* upscale local index to power of 2 and add masking
* cuda envs
* Add TernaryOps
* ConstOp loading
* proper function name
* remove deprecated variables
* get global program from name
* const ops match local shape
* Enable test_nn
* remove deprecated import
* fix linter error
* Add wait logic
* Add local size override
* accumulate local shapes instead of using max shape
* Merge triton tests into global tests
* fix envs in testing
* Old testing routine
* split file into renderer and program
* remove print and starting whitespace
* pretty ptx print on debug 5
* linter errors
* ignore triton saturation tests
* ignore test example
* remove pytorch cpu extra index
* Add triton to existing testing routine
* use triton tests
* disable cuda backend in triton tests
* use cudacpu in tests
* print used device
* Print device default
* Remove print
* ensure we are running triton backend
* update variable signatures
* update dtypes for load
* infinity render fixed
* limit global size
* negative infinity now properly rendered
* split chain with parentheses for and node
* Add option to disable shared memory, disable for triton
* missing import
* Properly index and mask conditional load
* use mask only if not loading a block pointer
* nan support
* fix symbolic tests to include chain split
* proper masking for stores
* Implemented bool dtype
* Add mod
* fix loads for variables with valid range
* merge triton with cuda runtime
* merge from master
* run triton tests with cuda
* Correct target when running from triton
* conftest with triton compiler config
* use triton nightly
* verbose tests for triton
* capture stdout
* fix function depth when exiting multiple loops
* add render valid function for readability
* fix mask for local loops
* add _arg_int32 datatype
* fix dims for conditional loads
* enable non float stores
* correct variable dtypes
* fix type for arg_int32
* remove junk
* Added get max function for range based var.max
* remove deprecated code
* Fix triton ptxas path
* Fix testing for CI
* clamp local size by max local size instead of always running max
* Disable matmul test in triton cpu
* rerun tests
* Disable broken test in triton cpu
* whitespace removed
* rerun tests again
* Disable TestSymbolicOps for triton
* update to new uops
* linter fix
* ignore test/extra
* linting fix
* Update tinygrad/renderer/triton.py
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* remove deprecated line
* quotes type fix
* linter
* Remove unnecessary lines
* UnaryOps.NEG
* don't define constants
* Linting fix
* Disable tests that are broken in ocelot
* remove trailing whitespace
* reduce line count
* linting fix
* update to new uast
* New looping style
* Update to new uast
* make AST runner work with triton
* linting fix
* set renderer var for testing
* disable local for ocelot
* reenable all tests for ocelot
* Pass shared to cuda
* Don't group if the backend doesn't support shared mem
* use working gpuocelot branch
* enable all tests
* enable local for ocelot
* cleanup
* Update test.yml
* update cache key
* reenable test symbolic and extra
* Update test.yml
* Revert "Update test.yml" (rerun tests)
This reverts commit 98c0630ee5da4379e5c6b2437a5145fe87058c35.
* Revert "fix symbolic tests to include chain split"
This reverts commit 22a9a4c9cd14d23735e6540c8d90ee005ac4ea17.
* Revert "split chain with parentheses for and node"
This reverts commit 7499a7004ef4db785d0cd05cf292fdeff65ca90d.
* use global size from linearizer
* rename newvar to dtype to match other renderers
* join program start lines
* simplify code that adds axis to local dims
* assign r[u] in ssa
* We no longer need to replace target in src
* we no longer need to cast indices to int by hand
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* 1
* 83 failed
* learning how git works
* lol idk
* zero shape aaaa
* space lol
* aaa
* test check
* haha
* fixed gather
* 73 failing
* 71 failing
* 68 failing
* added some debug
* fking resize
* lol
* 62 failing
* 58 failing, finally did nearest resize
* clean up
* 56 failing
* janitor duty
* lol
* 53 failing
* hi mom
* 50 failing
* added linear interp, but coord_trans is wrong
* did lin interpolation woohoo
* 43 failing
* 40 failing
* temporary Gather fix
* 39 failing
* fixed slice onnxver<10
* 37 failing
* 35 failing
* excluded tests that use float64
* 32 failing with hacks
* added _batchnorm() for 3D 5D batchnorm, 29 failing
* changed ALLOWED_KERNEL_COUNT from 199 to 207
* added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit
* support Round op
* added storage_order/indices maxpool, 27 failing
* support maxunpool, 25 failures
* support Gradient, 23 failures
* merged new where
* added Adam
* cleanups
* added Momentum and Nesterov Momentum
* added Adagrad
* support sequence_type, 20 failing
* ugh git
* I give up on cubic interp :D, 9 failing
* sexy 1 liner gather, much improved, wow
* polished gather to make it shine bright like a diamond
* clean 1 liner for gather
* improved readability of gather
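
One way to express gather as a "one-liner" out of plain broadcast/compare/sum ops, in the spirit of the commits above (numpy sketch; not the exact line that landed):

```python
import numpy as np

def gather_last_axis(x, idx):
  # one-hot mask: mask[..., j, k] == (idx[..., j] == k), then weight and reduce
  mask = idx[..., None] == np.arange(x.shape[-1])
  return (mask * x[..., None, :]).sum(axis=-1)

x = np.array([[10., 20., 30.]])
assert np.allclose(gather_last_axis(x, np.array([[2, 0]])), [[30., 10.]])
```
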
* uhh
* clean up
* more clean up
* WHITEspace
* implemented SoftmaxCrossEntropyLoss op
* added comments and cleaned up if statements
* update
* thank based wozeparrot for pow and new GatherElements
* CPU and TORCH all pass | cast float64 -> float32 for all fromCPU()
* _nearest_gather() failing on yolo
* reverted ops_cpu change and added assert in Resize
* added comments for resize for multiple channels
* oops
* merge
* test
* switched np.pad to Tensor.pad for constant padding
* gah
* gah2
* sexy reflect pad with movementops -> add
* delete commented out lines
* edge mode pad sexy as well
* trying out model_benchmark
* revert gitignore change lol
* init
* Revert "init"
This reverts commit 682bf2073a8b4eca111596c67cf6ebd79f59e585.
* wrote cast workaround for CPU, CPU and TORCH all pass
* wrote cast workaround for CPU, CPU and TORCH all pass
* skipped tests w/ 0 shape for METAL and GPU
* excluded tests for CLANG, CPU, TORCH, CLANG pass
* fixed hacky ConvTranspose
* gotta figure out autopad
* UOps.STORE support cast bool -> float
* small fix for fast gather
* reverted 0 shape skipped tests
* oops missed a file
* added comment
* fixed slice op hack
* First commit to pr
* More trig ops
* More trig ops
* format
* isinf support
* More ops
* changed onnx_ops to use our new gather :D
* Det op bug fix
* rebase
* fixed some tests
* det broken and slow
* fixed compress to use new gather
* implemented argmax argmin
* support variable types in type_proto
* support Upsample and Identity sequence
* we support float64 now and tinygrad supports automatic broadcasting
* added EyeLike op
* resize does support multiple channels now actually
* yolov8 onnx runs successfully
* added batch size 1
* oops
* finally fixed type_proto I think
* fixed some llvm bugs
* del whitespaces
* added ZenginU Format PR
* test
* oops
* added float64 exclude tests back
* more skipped tests
* try
* ok openpilot pass
* flake8 pass
* woooooohooo
* revert external_model_benchmark changes
* perf tested gather
* removed promote types from ops_cpu
* numerical errors from 1681 are fixed
---------
Co-authored-by: ZenginU <umutzengin00@gmail.com>
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* feat: train cifar using multigpu
* feat: split eval batch across 5
* feat: cleaner allreduce
* feat: 93.88%
* feat: cleaner batch chunking from bert
* feat: cleaner grad sync
* feat: tinygrad argmax
* feat: make it work with different gpu counts
* feat: move some stuff into the normal __init__
* feat: autodetect gpu count
* feat: move import inside
* move assembly, assembly_ptx
* successful but broken rendering of ptx asm
* clear ins before render asm
* slightly less broken :')
* we needed thread syncs
* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half
* Fix runtime_args for gpuocelot
* our casts were flipped on both ends
* more casting
* add ternary where op
* dealing with storing/loading bool
* add test for casting to bool from negative
* Fix args.valid on ConstOp
* add to CI, TODO: fix runtime_args for test_uops
* fix placement of runtime_args to work with lazy.Device
* undo ci changes so I can push
* fix lints
* start cleanup and fix things we broke fixing lints
* add checks for PTX specific asm instructions
* revert added test -- doesn't pass on llvm
* skip tests for underflow,overflow
* another fix for how we're setting runtime args
* Less broken cleanup
* add to CI
* add more env variables for ci test
* fix ci to install pycuda for ptx
* ci: copy cuda test command
* cleanup
* assert to make sure we're actually running ptx in ci
* remove test assert
* move is_ptx arg
* move assembly, assembly_ptx back to extras
* fix imports
* initial merge fixes
* clear registers, fix UOps.LOAD with invalid value
* draft merge fixes
* remove prints
* quick lint and merge fixes
* cleanup
* remove PTXProgram wrapper
* final cleanup
* temp change for ci rerun
* ci rerun
* rollback ISA version
* try to run commavq
* fix 0 dim, start implementing new ops
- Implement EmbedLayerNormalization
- Implement Attention
* SkipLayerNormalization and FastGelu
* use original torch model, cast inputs
* fix some ops:
- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu
* cleanup onnx_ops.py
* add validation option to benchmark
* cleanup imports
* add checks incase onnx2torch implements ops in future
* run onnx instead of original torch
* just skip gpu on m1
* reactivate the other models
* check for strange params & squash whitespace
* cleanup
* fix causal mask Attention
* Range doesn't need int cast
* embedding vocab_counter same dtype as input
* no need to cast
* always validate, fix PosixPath ort
---------
Co-authored-by: George Hotz <george@comma.ai>
* testing new memops
* better debugging
* testing padded conv
* branching with load
* refactoring a bit
* first try
* fixing bugs
* fixing some
* eq
* eq2
* do not use x's
* working
* fixing imm
* getting things working
* refactor
* pow not working
* working except one
* refactor: one store mem
* refactor: global load
* refactor: imm
* refactor: cleaning
* fixing big offsets
* refactor with ci
* try ci
* typo
* another typo
* ubuntu default
* forgot git
* do i need git?
* missing packages
* adding python-dev
* with cache?
* buildx action
* buildx name issue?
* maybe now?
* python3
* newline warning
* maybe now
* i actually need this
* ci should work now
* improved caching
* fixing cache
* maybe now it will cache
* this
* testing cache
* trying again
* load
* missing platform
* caching gha
* testing cache
* full testing
* typo
* now?
* why
* adding checkout back
* bad formatting
* fixing convention issues
* supporting python
* adding CI flag
* testing all
* better comments
* adding debugging
* takes 12x longer
* does it output progress now?
* ignore models for speed
* fixing merge
* excluding conv_transpose2d
* only 2 tests because it is too slow
* another approach
* let's see
* faster duh
* my bad
* T_T
* typo
* sup
* with output?
* comment test
* comment test
* comment test
* :?
* no comment
* with cache
* back to normal
* testing that ci works
* back to passing
* trying again
* does it create another entry
* does it create another entry?
* build local
* hey
* Revert "excluding conv_transpose2d"
This reverts commit cc7348de03033e032f47d69caff174e2f1a7bfea.
* does it cache if done before?
* does it cache?
* done
* adding test ops
* bad formatting
* no need for this
* working static mem
* sum 1d
* add ndim
* better reg import
* fix stack
* back to np
* working except for softmax
* 5 failing
* no pogress
* remove keystone
* remove keystone
* testops passing
* cleanups
* more cleanup
* typo
* ci
* ci2
* cond import
* ci3
* ci4
* ci4
* ci5
* ci5
* ci6
* alignment
* test all
* correct test
* err read_unmapped
* passing test
* ignore for speed
* ignore for speed
* ci7
* cleanup
* remove docker
* fixing merge
* fixing bugs
* add skipload for const ops
* comments
* First merge to master: Renderer
* fix emulation
* passing all tests arm64
* cleaning
* fix handcoded binary
* cleaning
* fix errs
* fix runtime arg binary
* clean git diff
* fix and clean
* fixing metal test
* cleaning
* fix metal test
* ci ~8 min
* fix pylint and clang
* cache the files in ops_clang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* feat: allreduce
* feat: test
* feat: need contiguous
* feat: test in ci
* feat: exit with correct code
* feat: don't need that
* feat: opencl wait_for just doesn't work
* feat: synchronize on out
* feat: try?
* feat: try again?
* feat: add extra realizes
* feat: print
* feat: seed
* feat: tol
* feat: test ones and zeros
* feat: remove print
* feat: are you just flaky
* feat: separate scatter and gather?
* feat: just try synchronizing
* feat: remove print again
* feat: bring back difference
* feat: no sync
* feat: revert that
* feat: back to wait_for
* fix: typo
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* add disk_tensor
* fix jit
* new baseline before whitening
* whitening through torch
* whitening done, currently at 91.65%
* 91.99%
* clean up mixup and 92.3%
* clean up 92.30%
* 92.49% before searching for new hyper-parameters
* fix CI
* fix white space
* add whitening init in test
* refactor, update hyperpara, 92.72%
* converting whitening to tinygrad operation
* update CI kernels count for CIFAR
* add pad reflect
* add random crop 92.53%
* update hyperpara 93%
* 93.15% on docker container, need to refactor the assignment for hyper param
* print out weights and bias to be separated
* bias/non-bias params separated
* fix whitespace
* clean up
* refactor hyper-param with dict
* refactor lr scheduler params
* fix whitespace
* fix cross entropy loss
* fix whitespace
* move opt hyp to hyp dict
* minor fixup
* adjust model, loss scaling
* 92.74% while using half of compute as before
* update hyp for cutmix
* random shuffle during batches
* clean up
* updating the model
* update ConvGroup
* disable gradients for batchnorm layer weights
* whitespace
* 93.92%
* clean up
* finally 94%!
* rewrite whitening to remove dependency on torch
* whitespace
* remove dependency on torch, 93.91%
* back to 94.03%
* clean up
* update test_real_world
* Rename in files
* Move files
* Moved to extra/datasets as suggested
* Changes to files
* Fixed stupid mistake
---------
Co-authored-by: terafo <terafo@protonmail.com>
* Fixes + improved test coverage for helpers.py
- added exception handling in `proc`; without it, if an exception was thrown, the thread would hang
- made `_early_exec_process` catch any Exception; before, if an exception was thrown before the process was started, it would hang the thread
* Made `_early_exec_process` catch any Exception
Otherwise, if an exception was thrown before the process was started, it would hang the thread. For example, a type error for an argument passed to `subprocess.check_output`.
* Fixed `from tinygrad.helpers import Timing` import
oops, for some reason my IDE cleaned that import from extra/helpers.
* Fixed import in llama.py
Another one that I skipped by accident, my bad
* Extracted a class for tests of early exec
* Normalize line endings; Windows uses \r\n
* Made `cross_process` not a daemon
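
The exception-forwarding pattern this PR describes, sketched (illustrative; the real `_early_exec_process` may differ in details): the worker always puts *something* on the output queue, so a failure before or during the subprocess call can no longer leave the caller blocked on `qout.get()`.

```python
import subprocess
from multiprocessing import Queue

def _early_exec_worker(qin: Queue, qout: Queue):
  while True:
    path, inp = qin.get()
    try:
      qout.put(subprocess.check_output(path, input=inp))
    except Exception as e:   # forward the failure; the caller checks for Exception and re-raises
      qout.put(e)
```
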
* fixed division by zero for fast operations
* made et closer to 0
* replace POW llop with SQRT
* updated mlops to swap SQRT and POW llops
* updated hlops to swap POW and SQRT
* added sqrt llop to cpu runtime
* added sqrt llop to cstyle codegen
* added POW llop to llvm ir codegen
* added SQRT llop to torch runtime
* moved pow from mlops to hlops
* found a better way to do reverse pow
* fixed indentation
* added SQRT llop to triton
* update docs to match new llops
* removed POW operator from assembly codegen
* added sqrt and rsqrt to pow hlop
* rewrote pow function in tensor.py
* Adjust tolerance
* Adjust for adamw
* Reduce for Adam too
* removed accidental leftover code
* removed all of accidental code
* added rsqrt test
* removed pow from mlops again
it was added back when resolving merge conflicts
---------
Co-authored-by: Jacky Lee <jla524@sfu.ca>
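
The idea behind the POW→SQRT swap in this PR, in scalar form (assuming x > 0 for the generic case; this is the math, not the tensor-level code): the cheap exponents get their own op, everything else goes through exp/log.

```python
import math

def hlop_pow(x: float, y: float) -> float:
  if y == 0.5:  return math.sqrt(x)          # the new SQRT llop
  if y == -0.5: return 1.0 / math.sqrt(x)    # rsqrt as reciprocal of SQRT
  return math.exp(y * math.log(x))           # generic pow for positive bases

assert math.isclose(hlop_pow(9.0, 0.5), 3.0)
assert math.isclose(hlop_pow(2.0, 10.0), 1024.0)
```
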
* fix syntax issues in imagenet_download.py
* use cloudpickle in cross_process to make it work in Python 3.9+
* add cross_process test
* prevent unpickling on every function call
* add cloudpickle to setup.py
* add support for args/kwargs
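
A sketch of a cloudpickle-based `cross_process` (assumes `fn` is a generator, as in the original helper; details are illustrative): cloudpickle, unlike the stdlib pickle used by `multiprocessing` spawn, can serialize lambdas and closures, and the payload is unpickled once in the child rather than on every call.

```python
import cloudpickle
from multiprocessing import Process, Queue

def _child(payload: bytes, q: Queue):
  fn, args, kwargs = cloudpickle.loads(payload)   # unpickle once in the child
  for item in fn(*args, **kwargs): q.put(item)
  q.put(None)                                     # sentinel: generator exhausted

def cross_process(fn, *args, **kwargs):
  q: Queue = Queue()
  p = Process(target=_child, args=(cloudpickle.dumps((fn, args, kwargs)), q))
  p.start()
  while (item := q.get()) is not None: yield item
  p.join()
```
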
* Use generators in any(..) instead of lists for better best-case
* Use generators in all(...) instead of lists
* enable R1729 in .pylintrc
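
What the any()/all() change buys: with a generator, `any`/`all` can short-circuit without ever materializing the list, which is what the R1729 check enabled above enforces.

```python
nums = range(10**6)

any([n > 5 for n in nums])   # builds the full million-element list first
any(n > 5 for n in nums)     # generator: any() stops after the 7th element
```
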
* revert import sorting
---------
Co-authored-by: Anselm Coogan <anselm@scandit.com>
* matrix strategy
* push env to GITHUB_ENV
* use printf instead of echo
* use temp helper function for cross os paths
* use path join
* switched to using temp helper function
* skip test on windows due to memory limit
* small fix
* removed semi
* touchups
* clean up
* separate tests
* test changes to test_utils on windows
* small refactor
* more cleanups
* undo helpers change
* only skip if in CI and WINDOWS
* Revert "Revert "ops rdna""
This reverts commit 0400315078.
* Revert "Revert "writing 2""
This reverts commit 325a3bf2cf.
* no dump
* 2x 2
* simple asm
* local size
* sub
* lil work
* support args != 3
* assembler work
* generate that
* ptx assembler
* begin index renderer
* max
* ptx loops
* gemms work
* valid works
* asm working a bit more
* close
* passing all ops tests
* ptx is a codegen only, not a backend
* ptx
* float16 support
* rdna goes here
* install types
* make amd disassemble
* ansilen for pretty print
* fix ptx log2/exp2
* assemblyinstruction
* new asm
* working gemm
* fix cmp
* more passing
* mod
* ptx works again
* rdna3 add works
* log exp
* sin is sin 2pi
* fix types
* progress
* loops work
* rdna xyz
* better addressing
* cleanups
* handle exception in early process
* div support
* rdna float4
* locals work
* fix neg index
* cast
* smaller diff
* yaml
* import only if selected
* fromimport
* types
* this all needs rewriting
* a few more
* resolved some slice test errors and added some more debugging logs
* use same device in cumsum
* increased float priority
* onnx debug output matches input
* ConstantOfShape ONNX test fixed.
* removed redundant if statement
* value is optional and should default to a float32 tensor with value of 0
* fixed: default parameters are created at function definition time, which is bad for mutable objects.
* Fix ONNX dropout and unify the implementation
* Use tensor rand method for dropout
* Change approach for RNG in ONNX Dropout
* Fix style
* Test legacy RNG seeding
* Remove the necessity for legacy RNG in Tensor class