* global -> group
* allow None for local_size in custom function
* lil local
* comment on shape
* fix cuda
* smart local cast
* better local heuristic
* fix ptx, and work_dim cleanup
* fix metal
* fix ops test
* fix openpilot jit
* no more optlocal
* might fix metal tests
* try metal now
* see generated metal code
* test free removal. REVERT THIS
* mergeable
* Minor improvements + cleanup to `ops_gpu.py`
* Add some previously undocumented environment variables from `ops_gpu.py` to `env_vars.md`
* Update debug print for OpenCL to print the devices that will be used post-filtering with `CL_EXCLUDE`
* Remove a couple unused or superfluous variables and assignments
* Use `from ... import` shorthand to shave off a couple precious LOC
* Couple small whitespace changes to clean things up
* Revert change to ordering of OpenCL devices
* Small refactor for OpenCL context creation
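For the `CL_EXCLUDE` debug-print change above, a minimal sketch of the idea: filter the device list first, then print only what survives. It assumes `CL_EXCLUDE` holds a comma-separated list of exact device names; that format, the helper names `filter_devices` and `usable_devices`, and the device names in the usage note are all hypothetical and not taken from `ops_gpu.py`.

```python
import os

def filter_devices(names, exclude_spec):
    # ASSUMPTION: exclude_spec is a comma-separated list of exact device names.
    excluded = {n for n in exclude_spec.split(",") if n}
    return [n for n in names if n not in excluded]

def usable_devices(names, env=os.environ, debug=False):
    # Filter first, then report, so the debug output matches what is actually used.
    kept = filter_devices(names, env.get("CL_EXCLUDE", ""))
    if debug:
        print(f"using devices: {kept}")
    return kept
```

Calling `usable_devices(["gpu0", "cpu0"], env={"CL_EXCLUDE": "cpu0"}, debug=True)` would report only `gpu0`, since the print happens after filtering.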
Replace the `if not exist` check with the `exist_ok=True` parameter, and add an environment variable check for whether to also download the training data
Add the new variable to `env_vars.md`
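
A minimal sketch of the `exist_ok=True` pattern, for illustration only. `os.makedirs(..., exist_ok=True)` is the standard-library call; the helper names and the `DOWNLOAD_TRAIN` variable name are hypothetical stand-ins, since the commit does not name the actual variable.

```python
import os

def ensure_dir(path):
    # Replaces the two-step "check, then create" pattern:
    #   if not os.path.exists(path): os.makedirs(path)
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)
    return path

def want_training_data(env=os.environ):
    # ASSUMPTION: "DOWNLOAD_TRAIN" is a made-up name; training data is opt-in.
    return env.get("DOWNLOAD_TRAIN", "0") == "1"
```

Calling `ensure_dir` twice on the same path does not raise, which is the point of dropping the explicit existence check.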