SURF Week 4 - 5: Trace format and MGPUSim Investigation
For my research project, we need to generate hardware traces from AMD GPU hardware. However, because of limitations in the available software tools, doing so would require modifying the JIT compiler for vISA and mISA, which is well beyond my current knowledge. Instead, we switched to MGPUSim, an AMD GPU simulator that works at the mISA level, to generate the traces we need. It simulates the AMD GCN3 ISA and is very accurate in terms of performance.
Accel-Sim Trace Format
Even with a simulator ready, there is still work to be done to generate trace files in the Accel-Sim format. Accel-Sim needs two types of trace files: a kernelslist file and a set of kernel-[num].trace files. The kernelslist file looks like the following:
MemcpyHtoD,0x0000000000002000,16384
MemcpyHtoD,0x000000000000a000,64
kernel-1671.traceg
The kernelslist file serves as a meta file for a set of traces: it tells Accel-Sim how to perform memory operations and where to look for the individual kernel trace files, which are named kernel-[num].trace. A sample of such a file is listed below:
-kernel name = FIR
-kernel id = 1671
-warp size = 64
-isa type = GCN3
-shmem = 0
-nregs = 32
-binary version = 100
-cuda stream id = 0
-shmem base_addr = 0x0
-local mem base_addr = 0x0
-nvbit version = -1
-accelsim tracer version = 3
-grid dim = (16,1,1)
-block dim = (256,1,1)
#traces format = threadblock_x threadblock_y threadblock_z warpid_tb PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses]
0 0 0 0 0000b108 ffffffffffffffff 8 S12 S13 S14 S15 S16 S17 S18 S19 S_LOAD_DWORDX8 2 S6 S7 32 1 0xc000 0
0 0 0 0 0000b110 ffffffffffffffff 1 S0 S_LOAD_DWORD 2 S6 S7 4 1 0xc020 0
0 0 0 0 0000b118 ffffffffffffffff 2 S2 S3 S_LOAD_DWORDX2 2 S6 S7 8 1 0xc028 0
0 0 0 0 0000b120 ffffffffffffffff 1 S1 S_LOAD_DWORD 2 S4 S5 4 1 0xd004 0
0 0 0 0 0000b124 ffffffffffffffff 1 R2 V_MOV_B32_E32 0 0
0 0 0 0 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0
0 0 0 0 0000b130 ffffffffffffffff 1 S1 S_AND_B32 1 S1 0
0 0 0 0 0000b134 ffffffffffffffff 1 S8 S_MUL_I32 2 S8 S1 0
0 0 0 0 0000b138 ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S8 R0 0
0 0 0 0 0000b13c ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S2 R0 0
0 0 0 0 0000b140 ffffffffffffffff 0 S_CMP_EQ_U32 1 S0 0
0 0 0 0 0000b144 ffffffffffffffff 1 S1 S_MOV_B32 0 0
0 0 0 0 0000b148 ffffffffffffffff 0 S_CBRANCH_SCC1 0 0
0 0 0 0 0000b14c ffffffffffffffff 1 R4 V_MOV_B32_E32 0 0
0 0 0 0 0000b150 ffffffffffffffff 2 S2 S3 S_MOV_B64 2 S14 S15 0
0 0 0 0 0000b154 ffffffffffffffff 1 R1 V_MOV_B32_E32 1 R0 0
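To make the field layout in the #traces format line concrete, here is a minimal Python sketch that splits one raw trace line into named fields. The function name parse_trace_line and the dictionary keys are illustrative, not part of Accel-Sim or MGPUSim, and the sketch leaves everything after mem_width (the address-compression flag and the addresses) uninterpreted.

```python
def parse_trace_line(line):
    """Split one raw kernel-[num].trace line into named fields (sketch)."""
    tok = line.split()
    fields = {
        "threadblock": tuple(int(t) for t in tok[0:3]),
        "warpid_tb": int(tok[3]),
        "pc": int(tok[4], 16),      # PC is printed as hex without a 0x prefix
        "mask": int(tok[5], 16),    # active mask, one bit per lane
    }
    i = 6
    dest_num = int(tok[i]); i += 1
    fields["reg_dests"] = tok[i:i + dest_num]; i += dest_num
    fields["opcode"] = tok[i]; i += 1
    src_num = int(tok[i]); i += 1
    fields["reg_srcs"] = tok[i:i + src_num]; i += src_num
    fields["mem_width"] = int(tok[i]); i += 1
    # Assumption: whatever follows mem_width is memory-address information.
    # It is kept as raw tokens and not decoded here.
    fields["mem_info"] = tok[i:]
    return fields


sample = ("0 0 0 0 0000b108 ffffffffffffffff 8 S12 S13 S14 S15 S16 S17 S18 S19 "
          "S_LOAD_DWORDX8 2 S6 S7 32 1 0xc000 0")
parsed = parse_trace_line(sample)
print(parsed["opcode"], parsed["reg_srcs"], parsed["mem_width"])
```

Because the number of destination and source registers varies per instruction, the fields have to be read left to right using dest_num and src_num as counts rather than by fixed column positions.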
The kernel-[num].trace files have two parts: a header and the actual traces. The header (everything up to and including the #traces format line) gives the trace-driven part of Accel-Sim the meta information about this kernel. The actual traces are then run through a post-processing tool that groups traces from the same threadblock/workgroup together into a kernel-[num].traceg file:
-kernel name = FIR
-kernel id = 1671
-warp size = 64
-isa type = GCN3
-shmem = 0
-nregs = 32
-binary version = 100
-cuda stream id = 0
-shmem base_addr = 0x0
-local mem base_addr = 0x0
-nvbit version = -1
-accelsim tracer version = 3
-grid dim = (16,1,1)
-block dim = (256,1,1)
#traces format = threadblock_x threadblock_y threadblock_z warpid_tb PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses]
#BEGIN_TB
thread block = 0,0,0
warp = 0
insts = 392
0000b108 ffffffffffffffff 8 S12 S13 S14 S15 S16 S17 S18 S19 S_LOAD_DWORDX8 2 S6 S7 32 1 0xc000 0
0000b110 ffffffffffffffff 1 S0 S_LOAD_DWORD 2 S6 S7 4 1 0xc020 0
0000b118 ffffffffffffffff 2 S2 S3 S_LOAD_DWORDX2 2 S6 S7 8 1 0xc028 0
0000b120 ffffffffffffffff 1 S1 S_LOAD_DWORD 2 S4 S5 4 1 0xd004 0
0000b124 ffffffffffffffff 1 R2 V_MOV_B32_E32 0 0
0000b128 ffffffffffffffff 0 S_WAITCNT 0 0
0000b130 ffffffffffffffff 1 S1 S_AND_B32 1 S1 0
0000b134 ffffffffffffffff 1 S8 S_MUL_I32 2 S8 S1 0
0000b138 ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S8 R0 0
0000b13c ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S2 R0 0
0000b140 ffffffffffffffff 0 S_CMP_EQ_U32 1 S0 0
0000b144 ffffffffffffffff 1 S1 S_MOV_B32 0 0
0000b148 ffffffffffffffff 0 S_CBRANCH_SCC1 0 0
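The grouping step can be illustrated with a short Python sketch. It is not the actual Accel-Sim post-processing tool: the function name group_trace_lines is made up, and the output only mirrors the layout visible in the sample above (a #BEGIN_TB marker, the thread block, then a warp and insts line before each warp's instructions). Emitting one #BEGIN_TB per thread block with several warp sections inside it is an assumption, and anything else the real tool writes is not covered.

```python
def group_trace_lines(lines):
    """Group raw trace body lines by thread block, then by warp (sketch)."""
    blocks = {}  # (x, y, z) -> {warp id -> [instruction text]}, insertion-ordered
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and header lines, which start with '-' or '#'.
        if not stripped or stripped[0] in "-#":
            continue
        tok = stripped.split()
        tb = tuple(tok[0:3])
        warp = tok[3]
        # Drop the four leading fields; they move into the section headers.
        rest = " ".join(tok[4:])
        blocks.setdefault(tb, {}).setdefault(warp, []).append(rest)

    out = []
    for tb, warps in blocks.items():
        out.append("#BEGIN_TB")
        out.append("thread block = " + ",".join(tb))
        for warp, insts in warps.items():
            out.append("warp = " + warp)
            out.append("insts = " + str(len(insts)))
            out.extend(insts)
    return out


raw = [
    "0 0 0 0 0000b124 ffffffffffffffff 1 R2 V_MOV_B32_E32 0 0",
    "0 0 0 1 0000b124 ffffffffffffffff 1 R2 V_MOV_B32_E32 0 0",
    "0 0 0 0 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0",
]
print("\n".join(group_trace_lines(raw)))
```

Instructions keep their original order within each warp, so interleaved lines from different warps end up in separate, contiguous sections.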
MGPUSim Modifications
To generate the Accel-Sim trace files (*.trace), we made some modifications to MGPUSim. Most of the changes were made in driver/api.go, emu/isadebugger.go, emu/wavefront.go, insts/inst.go, and insts/decodetable.go; the changes to the last file were primarily to register names.
The first part of the modifications produces the kernelslist file, which was done by adding printouts on MemcpyHtoD calls in driver/api.go.
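The actual printouts live in MGPUSim's Go code, but the line format itself is easy to illustrate. The Python sketch below formats a MemcpyHtoD entry the way it appears in the kernelslist sample earlier (a 16-digit hex address followed by a decimal size) and reads such a file back. The function names format_memcpy and parse_kernelslist are hypothetical, and treating every non-MemcpyHtoD line as a kernel trace file name is an assumption based on that sample.

```python
def format_memcpy(address, size):
    """Format one host-to-device copy as a kernelslist line (sketch)."""
    return f"MemcpyHtoD,0x{address:016x},{size}"


def parse_kernelslist(lines):
    """Turn kernelslist lines into ('memcpy', addr, size) / ('kernel', file) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("MemcpyHtoD,"):
            _, addr, size = line.split(",")
            entries.append(("memcpy", int(addr, 16), int(size)))
        else:
            # Assumption: any other line names a kernel trace file.
            entries.append(("kernel", line))
    return entries


listing = [
    format_memcpy(0x2000, 16384),
    format_memcpy(0xa000, 64),
    "kernel-1671.traceg",
]
for entry in parse_kernelslist(listing):
    print(entry)
```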
The second part of the modifications generates the *.trace files. To do so, we used MGPUSim's existing isadebugger functionality: we replaced the logWholeWf function with our own to generate traces in the Accel-Sim format. The header information of the trace files is extracted from the wavefront launch packet and the kernel binary passed along with the wavefront. Additionally, to get the Accel-Sim format for each instruction, we rewrote the existing String() functions in insts/inst.go as well.
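As an illustration of the header part only, the following Python sketch builds the header lines from a handful of kernel properties. It is not the Go code in the modified MGPUSim: the function name format_header and its parameters are hypothetical, and the constant fields (binary version, stream id, base addresses, nvbit and tracer versions) are simply copied from the sample header shown earlier rather than derived from anything.

```python
TRACE_FORMAT = ("#traces format = threadblock_x threadblock_y threadblock_z "
                "warpid_tb PC mask dest_num [reg_dests] opcode src_num "
                "[reg_srcs] mem_width [adrrescompress?] [mem_addresses]")


def format_dim(dim):
    """Render a 3-tuple as (x,y,z) with no spaces, as in the sample header."""
    return "(" + ",".join(str(d) for d in dim) + ")"


def format_header(name, kernel_id, grid, block, nregs, shmem=0, warp_size=64):
    """Build the header lines of a kernel trace file (sketch)."""
    return [
        f"-kernel name = {name}",
        f"-kernel id = {kernel_id}",
        f"-warp size = {warp_size}",
        "-isa type = GCN3",
        f"-shmem = {shmem}",
        f"-nregs = {nregs}",
        # The values below are copied verbatim from the sample header.
        "-binary version = 100",
        "-cuda stream id = 0",
        "-shmem base_addr = 0x0",
        "-local mem base_addr = 0x0",
        "-nvbit version = -1",
        "-accelsim tracer version = 3",
        f"-grid dim = {format_dim(grid)}",
        f"-block dim = {format_dim(block)}",
        TRACE_FORMAT,
    ]


print("\n".join(format_header("FIR", 1671, (16, 1, 1), (256, 1, 1), nregs=32)))
```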
Since MGPUSim outputs traces per compute unit rather than per kernel, special care was taken to ensure the modified version operates correctly, especially in parallel mode. Traces for a single kernel are therefore first generated separately by each compute unit. These partial traces are then combined when the last compute unit finishes its assigned wavefront.
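The combine-on-last-finish idea can be sketched in a few lines of Python. This is a simplified illustration, not the code in the modified MGPUSim: the class KernelTraceCollector and its method names are hypothetical, each compute unit is assumed to report exactly once, and concatenating partial traces in compute-unit order is an arbitrary choice.

```python
import threading


class KernelTraceCollector:
    """Collect per-compute-unit partial traces for one kernel (sketch)."""

    def __init__(self, compute_unit_ids):
        self._pending = set(compute_unit_ids)  # units that have not reported yet
        self._partials = {}                    # cu id -> list of trace lines
        self._lock = threading.Lock()          # units may finish in parallel

    def finish(self, cu_id, lines):
        """Record one unit's partial trace.

        Returns the combined trace when the last unit reports, else None.
        """
        with self._lock:
            self._partials[cu_id] = list(lines)
            self._pending.discard(cu_id)
            if self._pending:
                return None
            merged = []
            for cu in sorted(self._partials):
                merged.extend(self._partials[cu])
            return merged


collector = KernelTraceCollector([0, 1])
print(collector.finish(1, ["0 0 0 1 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0"]))
print(collector.finish(0, ["0 0 0 0 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0"]))
```

The lock matters only when compute units run on separate threads, as in parallel mode; it makes sure exactly one caller observes the empty pending set and receives the merged trace.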
The modified MGPUSim is hosted on GitHub here.
To get Accel-Sim format traces from the modified MGPUSim, simply pass the -debug-isa flag when running the benchmark, and you should see the kernelslist and kernel-[num].trace files generated.