removed DISABLE_DROPOUT=1.
updated BS to 54, which works on tinyboxes with dropout enabled.
used bert's sparse_categorical_crossentropy (which takes a Tensor ignore_index) in the accuracy method
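For context, here is a minimal numpy sketch of an accuracy computed with an ignore_index mask, in the spirit of the change above; the function name, shapes, and the -100 default are assumptions for illustration, not the repo's actual code.

```python
import numpy as np

# Sketch only: accuracy that skips positions marked with ignore_index.
def masked_accuracy(logits: np.ndarray, labels: np.ndarray, ignore_index: int = -100) -> float:
    preds = logits.argmax(axis=-1)       # predicted class id per position
    valid = labels != ignore_index       # drop padded / ignored positions
    return float((preds[valid] == labels[valid]).mean())

logits = np.random.randn(8, 5)                        # 8 positions, 5 classes
labels = np.array([0, 1, 2, -100, 4, 0, -100, 3])     # two ignored positions
print(masked_accuracy(logits, labels))
```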
* calling the Qualcomm DSP from python (ctypes sketch below)
* include so files
* add include file
* adsprpc.py
* running with adsprpc
* work
* 32-bit support in elf
* compilation works
* ion
* msm_ion
* working DSP backend
* getting 500 MFLOPS on matmul
* beam works with timing
* move to autogen
* disasm
* progress
* simple tests pass
* qcom_dsp
* more dsp autogen
* progress
* some progress
* works w/o lib
* checkpoint
* no lib
* ugh, better
* cleaner, but with lib. test good, but with the hack
* remove autogens
* small
* push
* simpler
* revert this
* run_3
* simpler
* android
* handle
* run it
* why?
* run2
* to gen
* cc
* cleaner
* elf
* part of autogen
* comment
* no lib
* autogen
* linter
* bug reproducer
* cleaner
* this repro is almost empty and doesn't work!!!!
* with this test_ops passes, no crashes anymore
* cleaner
* linter
* renames
* shorter
* remove contextlib
* ugh
* mypy
* cleaner
* cleaner
* remove import
* conn
* import
* revert this
* remove heavy .so
* shorter alloc
* not true anymore
---------
Co-authored-by: Comma Device <device@comma.ai>
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <george@comma.ai>
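The adsprpc.py commits above boil down to talking to the DSP's FastRPC stack directly from Python. A minimal ctypes sketch of that general approach follows; the library name and the remote_handle_open signature come from the public Hexagon SDK headers and are assumptions here, not necessarily what this PR's adsprpc.py actually does.

```python
import ctypes

# Sketch: load the vendor FastRPC userspace library and call into it directly.
try:
    lib = ctypes.CDLL("libadsprpc.so")   # only present on Qualcomm devices
except OSError:
    raise SystemExit("FastRPC library not found; run this on the device")

# assumed signature: int remote_handle_open(const char *name, remote_handle *ph)
lib.remote_handle_open.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_uint32)]
lib.remote_handle_open.restype = ctypes.c_int

handle = ctypes.c_uint32(0)
ret = lib.remote_handle_open(b"calculator", ctypes.byref(handle))   # hypothetical module name
print("remote_handle_open ->", ret, "handle:", handle.value)
```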
* unwrap_dtype maybe
* uopgraph stuff that hardcoded None
* test_ops passes
* dtypes.py fixups
* update test_linearizer and friends
* more ast updates
* test_beam and test_schedule too
* add void type to uop [run_process_replay] (sketch below)
* remove dumb casts
* start making it green
* more cast cleanups
* more cls methods to fix
* regenerate dataset
* split UOp and NOp const
* maybe that too
* fix docs
* update test_uop_symbolic
* test_verify_ast
* new sops with no diff
* meh, type_ignore is alright
* remove that assert
---------
Co-authored-by: qazal <qazal.software@gmail.com>
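A rough sketch of the idea behind "add void type to uop": rather than a dtype field that is sometimes None (which uopgraph then has to special-case), value-less ops get an explicit void dtype so every UOp carries a real DType. The names below are stand-ins, not tinygrad's actual classes.

```python
from dataclasses import dataclass
from enum import Enum, auto

class DType(Enum):
    void = auto()      # "no value" ops (stores, sinks, ...)
    float32 = auto()

@dataclass(frozen=True)
class UOp:
    op: str
    dtype: DType = DType.void   # no more Optional[DType] / hardcoded None

print(UOp("SINK"))              # UOp(op='SINK', dtype=<DType.void: 1>)
print(UOp("CONST", DType.float32))
```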
* add training set transforms
* add DICE cross-entropy loss (sketch below)
* convert pred and label to Tensor when calculating DICE score
* cleanups and allow train dataset batching
* fix DICE CE loss calculation
* jitted training step
* clean up DICE CE loss calculation
* initial support for sharding
* Revert "initial support for sharding"
This reverts commit e3670813b8a67469e7f694e09f2d15a8c40065da.
* minor updates
* cleanup imports
* add support for sharding
* apply temp patch to try to avoid OOM
* revert cstyle changes
* add gradient acc
* hotfix
* add FP16 support
* add ability to train on smaller image sizes
* add support for saving and loading checkpoints + clean up various modes
* fix issue with using smaller patch size + update W&B logging
* disable LR_WARMUP_EPOCHS
* updates
* minor cleanups
* cleanup
* update order of transformations
* more cleanups
* realize loss
* cleanup
* more cleanup
* some cleanups
* add RAM usage
* minor cleanups
* add support for gradient accumulation
* cleanup imports
* minor updates to not use GA_STEPS
* remove FP16 option since it's available now globally
* update multi-GPU setup
* add timing logs for training loop
* go back to using existing dataloader and add ability to preprocess data to save time
* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation
* free train and eval steps memory
* cleanups and scale batch size based on the number of GPUs
* fix GlobalCounters import
* fix seed
* fix W&B setup
* update default batch size
* add back metric divergence check
* put back JIT on UNet3d eval
* move dataset preprocessing inside training code
* add test for dice_loss
* add config logging support to W&B and other cleanups
* change how default float is getting retrieved
* remove TinyJit import duplicate
* update config logging to W&B and remove JIT on eval_step
* no need for caching preprocessed data anymore
* fix how evaluation is run and how often
* add support for LR scaling
* fix issue with gaussian being moved to scipy.signal.windows
* remove DICE loss unit test
* fix issue where loss isn't compatible with multi-GPU
* add individual BEAM control for train and eval steps
* fix ndimage scipy import
* add BENCHMARK
* cleanups on BENCHMARK + fix on rand_flip augmentation during training
* cleanup train and eval BEAM envs
* add checkpointing support after every eval
* cleanup model_eval
* disable grad during eval
* use new preprocessing dataset mechanism
* remove unused import
* use training and inference_mode contexts
* start eval after benchmarking
* add data fetching time
* cleanup decorators
* more cleanups on training script
* add message during benchmarking mode
* realize when reassigning LR on scheduler and update default number of epochs
* add JIT on eval step
* remove JIT on eval_step
* add train dataloader for unet3d
* move checkpointing to be done after every epoch
* revert removal of JIT on unet3d inference
* save checkpoint if metric is not successful
* Revert "add train dataloader for unet3d"
This reverts commit c166d129dfbe2e1c46d1937135a60b4ed25caa3d.
* Revert "Revert "add train dataloader for unet3d""
This reverts commit 36366c65d26f59ed1227acb670d5ce7b997606ae.
* hotfix: seed was defaulting to a value of 0
* fix SEED value
* remove the usage of context managers for setting BEAM and going from training to inference
* support new stack API for calculating eval loss and metric
* Revert "remove the usage of context managers for setting BEAM and going from training to inference"
This reverts commit 2c0ba8d322ec912bd8617cbe167c542e9ba229d9.
* check training and test preprocessed folders separately
* clean up imports and log FUSE_CONV_BW
* use train and val preprocessing constants
* add kits19 dataset setup script
* update to use the new test decorator for disabling grad
* update kits19 dataset setup script
* add docs on how to train the model
* set default value for BASEDIR
* add detailed instruction about BASEDIR usage
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
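A minimal numpy sketch of the combined Dice + cross-entropy loss mentioned throughout the UNet3D commits above; the shapes, epsilon, and equal weighting are assumptions, not the training script's exact implementation.

```python
import numpy as np

def dice_ce_loss(logits, target, eps=1e-6):
    # logits: (N, C, D, H, W) raw scores; target: (N, D, H, W) int class labels
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                      # softmax over C
    onehot = np.moveaxis(np.eye(logits.shape[1])[target], -1, 1)   # (N, C, D, H, W)
    spatial = tuple(range(2, logits.ndim))
    inter = (probs * onehot).sum(axis=spatial)
    denom = probs.sum(axis=spatial) + onehot.sum(axis=spatial)
    dice_loss = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    ce = -np.log(np.take_along_axis(probs, target[:, None], axis=1) + eps).mean()
    return 0.5 * dice_loss + 0.5 * ce                              # assumed 50/50 weighting

logits = np.random.randn(1, 3, 4, 4, 4)
target = np.random.randint(0, 3, size=(1, 4, 4, 4))
print(dice_ce_loss(logits, target))
```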
* try
* pass
* clean up
* done
* I'm becoming dumber
* clean up 2
* remove useless max
* useless but make computer brrr [run_process_replay]
* try process replay
* try again
* 1 less line, just use pad2d
* qcom: driver init
* autogen stubs for msm_kgsl also fixup ioctls to show numbers instead of _IOW macros
* autogen: add adreno commands and registers
* ops_qcom: QcomAllocator + signals
* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom
* qcom: we do not really need all these constants; input/output is enough
* qcom: perfctr for CS (do not really need all the rest)
* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max
* qcom: explicitly set instruction len based on the shader size
* ops_qcom: Program init
extracts the shader from the OpenCL binary
sets input/output buffers
allocates the stack
sets CS mode
runs the shader
* use data64_le from helpers
* ops_qcom: use fill_kernargs for filling i/o buffers (sketch below)
* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset
* new signals & fix exec
* add QCOM to the list of supported devices
* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM
* fix exec, synchronize before copyout
* correct setting num_units for ST_SHADER
* fix GPU hangs on signals with CP_MEM_WRITE; it is uncached mem anyway
* extract offsets to kernel arguments from opencl binary
* extract constants values and offsets from opencl binary
* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly
* align kernel name to 4 bytes when skipping kernel opencl struct
* skip to consts directly using an offset from opencl binary header
* fix alloc
* get halfreg and fullreg from opencl bin
* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE
* parse prg offset from the OpenCL binary
* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG
* support for vals in _fill_kernargs
* support 16-bit constants
* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts
this keeps the driver from timing out big kernels; from the kgsl source:
/* Don't time out if the context has disabled it */
if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
    return;
* minor changes to _exec
* QCOMRenderer
* disable HCQGraph for demo. TODO: support HCQ update api
* support HCQ
- remove copy queue
- add updates
- add strides for buffs and vars for QCOM
* bufs_stride
* clean ups
* linter
* call super().__init__(value) in QcomSignal
* disable=unused-import
* mypy
* type ignore when queue is on the device
* fix
* query gpu_id.
Will be useful for selecting commands, e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7
* working timestamps
* free context after device is done
* move gpu stack to the device
* reserve some space with lib_gpu for gpu to write to
this fixes test_interpolate_bilinear
* exclude tests that fail with GPU=1 on Qualcomm
* lint
* unmap mem in _gpu_free
* ctxt priority and preemption policy
* remove old qcom
* pass size to self.device.allocator.free
* skip tests only on qcom
* use kgsl and adreno defines instead of numeric vals
* use allocator for allocating lib_gpu
* update to QcomArgsState from master
* intermediate commit while conquering images
* enable image tests on qcom
* fix shader disasm size, dump textures stuff
* working images
* allow signals to be 0
* set branchstack from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* set shared memory size from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* update images in QcomArgsState & less loc for images
* set stack sizes from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* stack allocation based on OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* better autogen for kgsl and adreno. no more bitshifts
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* cleanup commit for parse cl lib
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* don't forget actual generated files
* refactor + less loc
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* device.py back
* lint
* ruff
* timestamp divisor
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* fix tex fmt & round global size
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* dtypes
* 19.2MHz
* -1 loc in _update_exec
* remove noqa
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
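One recurring theme in the QCOM commits above is parsing argument offsets out of the compiled OpenCL binary and then writing buffer addresses into a kernel-argument region (fill_kernargs). A minimal sketch of that pattern follows, with made-up offsets, pointer width, and endianness as assumptions:

```python
import struct

# Sketch: write 64-bit GPU addresses into a kernarg buffer at parsed offsets.
def fill_kernargs(kernargs: bytearray, arg_offsets, buf_addrs):
    for off, addr in zip(arg_offsets, buf_addrs):
        struct.pack_into("<Q", kernargs, off, addr)   # little-endian 8-byte pointer

kernargs = bytearray(0x100)                           # region the GPU reads args from
fill_kernargs(kernargs, arg_offsets=[0x0, 0x8], buf_addrs=[0xDEAD0000, 0xBEEF0000])
print(kernargs[:16].hex())
```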
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12c97e3466679884266b247cf9df46bc.
* fix benchmark
* remove extra dedup
* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for CL device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double GRF size in compiler to fix register spilling (band-aid), device checking changes
* tc python emulation (sketch below)
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* CUDA bf16 (the only one with bf16) is skipped in the tensor core tests, so I will skip Intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
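For the "tc python emulation" commits above, the idea is an emulated tensor-core tile op that tests can compare real TC output against. A rough numpy sketch, with tile shapes and dtypes as assumptions rather than the actual XMX/DPAS configuration:

```python
import numpy as np

# Sketch: emulate one tensor-core style tile op, acc += a @ b, accumulating in float32.
def emulate_tc_tile(a: np.ndarray, b: np.ndarray, acc: np.ndarray) -> np.ndarray:
    return acc + a.astype(np.float32) @ b.astype(np.float32)

a = np.random.rand(8, 16).astype(np.float16)      # assumed tile shapes, not real XMX sizes
b = np.random.rand(16, 8).astype(np.float16)
out = emulate_tc_tile(a, b, np.zeros((8, 8), dtype=np.float32))
print(out.shape, out.dtype)                       # (8, 8) float32
```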