mirror of https://github.com/commaai/tinygrad.git
Google's TPU
We document the Google TPU v2/v3 in order to support it in tinygrad without the XLA compiler.
Creating a Google Cloud TPU VM
A TPU v2-8 machine, the cheapest VM, costs $4.50/hr.
gcloud alpha compute tpus tpu-vm create test --zone=us-central1-b --accelerator-type=v2-8 --version=v2-alpha
gcloud alpha compute tpus tpu-vm ssh test --zone us-central1-b
# and for when you are done
gcloud alpha compute tpus tpu-vm delete test --zone us-central1-b
gcloud alpha compute tpus tpu-vm list --zone us-central1-b
Aside from the usual VM stuff, there are 4 accelerators on the PCIe bus. (A v2-8 is 4 chips with 2 cores each.)
# lspci
00:04.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:05.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:06.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:07.0 Unassigned class [ff00]: Google, Inc. Device 0027
They show up in /sys/class/accel (tons of files here) and the driver lives in /lib/libtpu.so. The devices are in /dev/accel[0-3], and a bunch of stuff is mmaped. They are "ba16c7433" chips.
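To script the device discovery, the lspci listing above can be parsed directly. This is a minimal sketch: the helper name find_tpus and the regular expression are our own, and it assumes every TPU line has the same shape as the four lines shown (Google as vendor, device ID 0027).

```python
import re

# Sample copied verbatim from the lspci output above.
LSPCI_OUTPUT = """\
00:04.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:05.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:06.0 Unassigned class [ff00]: Google, Inc. Device 0027
00:07.0 Unassigned class [ff00]: Google, Inc. Device 0027
"""

# Groups: PCI slot, class code, vendor string, device ID.
LINE_RE = re.compile(r"^(\S+) .*\[([0-9a-f]{4})\]: (.+) Device ([0-9a-f]{4})$")

def find_tpus(lspci_text, device_id="0027"):
    """Return the PCI slots whose vendor is Google and whose device ID matches.

    find_tpus is a hypothetical helper for illustration, not part of any tool.
    """
    slots = []
    for line in lspci_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group(3).startswith("Google") and m.group(4) == device_id:
            slots.append(m.group(1))
    return slots

slots = find_tpus(LSPCI_OUTPUT)
print(slots)                    # one entry per chip
print(len(slots) * 2, "cores")  # 2 cores per chip on a v2-8
```

On a real VM the same function can be fed the output of lspci instead of the embedded sample.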
We grab the minimal TPU example from TensorFlow. When the compiler runs, it produces tons of great logs in /tmp/tpu_logs.
cd tfexample
gcc -o libtpu_client libtpu_client.c -ltpu
TPU_VLOG_LEVEL=99 ./libtpu_client
From these logs, we find the "LLO Instructions":
VLIW Instruction (322b VLIW bundle)
spare : 0 (0,1)
vex_mxu : 0 (1,1)
* 1 misc slot
msc_targ : 0 (2,3)
msc_opnd : 0 (5,3)
msc_op : 0 (8,5)
msc_pred : 31 (13,5)
* 2 matrix slots (push, pop)
vres_dest : 28 (18,2)
vres_op : 28 (20,2)
vres_pred : 31 (22,5)
vex_source : 28 (27,2)
vex_subop : 24 (29,3)
vex_op : 24 (32,3)
vex_pred : 31 (35,5)
* 4 vector slots (2 for load/store)
vld_ttu : 30 (40,1)
vld_stride : 24 (41,3)
vld_offset : 24 (44,2)
vld_base : 24 (46,2)
vld_submsk : 24 (48,3)
vld_dest : 0 (51,5)
vld_op : 0 (56,2)
vld_pred : 31 (58,5)
vst_ttu : 30 (63,1)
vst_iar : 30 (64,1)
vst_value_two : 24 (65,3)
vst_offset : 24 (68,2)
vst_base : 24 (70,2)
vst_value_one : 24 (72,3)
vst_source : 0 (75,5)
vst_op : 0 (80,5)
vst_pred : 31 (85,5)
* 4 vector slots (2 for ALU)
v1_dest : 0 (90,5)
v1_y_vreg : 0 (95,5)
v1_y_src : 0 (100,5)
v1_x : 0 (105,5)
v1_op : 0 (110,6)
v1_pred : 31 (116,5)
v0_dest : 0 (121,5)
v0_y_vreg : 0 (126,5)
v0_y_src : 0 (131,5)
v0_x : 0 (136,5)
v0_op : 0 (141,6)
v0_pred : 31 (147,5)
* 3 scalar registers copied into the vector units?
vs2 : 0 (152,5)
vs1 : 0 (157,5)
vs0 : 0 (162,5)
* 6 immediates (16-bit each, two can be merged for 32)
imm_5 : 0 (167,16)
imm_4 : 0 (183,16)
imm_3 : 0 (199,16)
imm_2 : 0 (215,16)
imm_1 : 0 (231,16)
imm_0 : 0 (247,16)
* ttu? what's a ttu?
ttu_set_btr : 0 (263,1)
ttu_iterate : 0 (264,1)
ttu_row : 0 (265,3)
* 2 scalar slots
s1_dest : 0 (268,5)
s1_y : 0 (273,6)
s1_x : 0 (279,5)
s1_op : 0 (284,6)
s1_pred : 31 (290,5)
s0_dest : 0 (295,5)
s0_y : 0 (300,6)
s0_x : 0 (306,5)
s0_op : 0 (311,6)
s0_pred : 15 (317,5)
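Each line in the dump reads as name : value (offset,width), so the layout can be pulled out of the log text and used to pack or unpack fields. The sketch below is an illustration under assumptions: it treats the pair as (bit offset, bit width), stores the bundle in a Python int with bit 0 as the least significant bit (the real hardware bit order is not confirmed here), and the names parse_layout, get_field and set_field are our own.

```python
import re

# A few lines copied from the dump above; the full dump parses the same way.
DUMP = """\
spare : 0 (0,1)
vex_mxu : 0 (1,1)
* 1 misc slot
msc_pred : 31 (13,5)
imm_0 : 0 (247,16)
s0_op : 0 (311,6)
s0_pred : 15 (317,5)
"""

FIELD_RE = re.compile(r"^(\w+)\s*:\s*(\d+)\s*\((\d+),(\d+)\)$")

def parse_layout(text):
    """Map field name -> (offset, width); '*' annotation lines do not match."""
    layout = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            layout[m.group(1)] = (int(m.group(3)), int(m.group(4)))
    return layout

def get_field(bundle, offset, width):
    """Extract `width` bits starting at `offset` (assumes bit 0 is the LSB)."""
    return (bundle >> offset) & ((1 << width) - 1)

def set_field(bundle, offset, width, value):
    """Return a copy of `bundle` with the field replaced by `value`."""
    mask = ((1 << width) - 1) << offset
    return (bundle & ~mask) | ((value << offset) & mask)

layout = parse_layout(DUMP)

# The last field ends at bit 317 + 5, which matches the "322b" in the header.
bundle_bits = max(offset + width for offset, width in layout.values())
print(bundle_bits)  # 322

bundle = 0
bundle = set_field(bundle, *layout["s0_pred"], 15)
bundle = set_field(bundle, *layout["msc_pred"], 31)
print(get_field(bundle, *layout["s0_pred"]))   # 15
print(get_field(bundle, *layout["msc_pred"]))  # 31
```

Keeping the layout as data parsed from the log means the table never has to be retyped by hand if a different dump has different fields.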
Running a Program (WIP)
Our goal is to run a program on the TPU without the driver.
...
openat(AT_FDCWD, "/dev/accel3", O_RDWR) = 184
mmap(NULL, 27799736, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_LOCKED, 184, 0) = 0x7f59a74b3000
# size is 0x1a830b8 bytes, about 27.8 MB (26.5 MiB)
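The two syscalls in the trace above can be replayed from Python as a first step toward poking at the device ourselves. This is only a sketch: map_region is a hypothetical helper, MAP_LOCKED from the trace is left out to keep it simple, and what actually lives in the mapped region is not known at this point.

```python
import mmap
import os

TPU_MAP_SIZE = 0x1a830b8  # 27799736 bytes, the length from the strace line above

def map_region(path, length=TPU_MAP_SIZE):
    """Open `path` read/write and map `length` bytes from offset 0 as shared.

    Hypothetical helper that mirrors the openat + mmap pair in the trace.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # MAP_LOCKED from the trace is omitted here (an assumption that the
        # sketch does not need pinned pages).
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # mmap keeps its own duplicate of the descriptor, so ours can go.
        os.close(fd)
```

On a TPU VM this would be called as map_region("/dev/accel3"). Whether the device accepts the mapping without the setup libtpu.so does beforehand is untested.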