* calling qualcomm dsp from python
* include so files
* add include file
* adsprpc.py
* running with adsprpc
* work
* 32-bit support in elf
* compilation works
* ion
* msm_ion
* working DSP backend
* getting 500 MFLOPS on matmul
* beam works with timing
* move to autogen
* disasm
* progress
* simple tests pass
* qcom_dsp
* more dsp autogen
* progress
* some progress
* works w/o lib
* checkpoint
* no lib
* ugh, better
* cleaner, but with lib. test good, but with the hack
* remove autogens
* small
* push
* simpler
* revert this
* run_3
* simpler
* android
* handle
* run it
* why?
* run2
* to gen
* cc
* cleaner
* elf
* part of autogen
* comment
* no lib
* autogen
* linter
* bug reproducer
* cleaner
* this repro is almost empty and doesn't work!!!!
* with this test_ops passes, no crashes anymore
* cleaner
* linter
* renames
* shorter
* remove contextlib
* ugh
* mypy
* cleaner
* cleaner
* remove import
* conn
* import
* revert this
* remove heavy .so
* shorter alloc
* not true anymore
---------
Co-authored-by: Comma Device <device@comma.ai>
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <george@comma.ai>
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
* wmma: enable METAL half tensor cores and clean up cstyle
* revert simple_matmul rand changes and break line in tensor
* added metal fp16->fp32 tensor core