SURF Week 4 - 5: Trace format and MGPUSim Investigation

For my research project, we need to generate hardware traces from AMD GPU hardware. However, because of limitations in the available software tools, doing so would require modifying the JIT compiler for vISA and mISA, which is well beyond my current knowledge. Instead, we switched to generating the traces we need with MGPUSim, an mISA-level AMD GPU simulator that models the AMD GCN3 ISA and is very accurate in terms of performance.

Accel-Sim Trace Format

Even with a simulator ready, there is still work to be done to generate trace files in Accel-Sim format. Accel-Sim needs two types of trace files: a kernelslist file and a set of kernel-[num].trace files. The kernelslist file looks like the following:

MemcpyHtoD,0x0000000000002000,16384
MemcpyHtoD,0x000000000000a000,64
kernel-1671.traceg

The kernelslist file serves as a meta file for a set of traces: it tells Accel-Sim how to perform memory operations and where to look for the individual kernel trace files, which are named kernel-[num].trace. A sample of such a file is listed below:

-kernel name = FIR
-kernel id = 1671
-warp size = 64
-isa type = GCN3
-shmem = 0
-nregs = 32
-binary version = 100
-cuda stream id = 0
-shmem base_addr = 0x0
-local mem base_addr = 0x0
-nvbit version = -1
-accelsim tracer version = 3
-grid dim = (16,1,1)
-block dim = (256,1,1)
#traces format = threadblock_x threadblock_y threadblock_z warpid_tb PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses]
0 0 0 0 0000b108 ffffffffffffffff 8 S12 S13 S14 S15 S16 S17 S18 S19 S_LOAD_DWORDX8 2 S6 S7 32 1 0xc000 0
0 0 0 0 0000b110 ffffffffffffffff 1 S0 S_LOAD_DWORD 2 S6 S7 4 1 0xc020 0
0 0 0 0 0000b118 ffffffffffffffff 2 S2 S3 S_LOAD_DWORDX2 2 S6 S7 8 1 0xc028 0
0 0 0 0 0000b120 ffffffffffffffff 1 S1 S_LOAD_DWORD 2 S4 S5 4 1 0xd004 0
0 0 0 0 0000b124 ffffffffffffffff 1 R2 V_MOV_B32_E32 0  0
0 0 0 0 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0
0 0 0 0 0000b130 ffffffffffffffff 1 S1 S_AND_B32 1 S1 0
0 0 0 0 0000b134 ffffffffffffffff 1 S8 S_MUL_I32 2 S8 S1 0
0 0 0 0 0000b138 ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S8 R0 0
0 0 0 0 0000b13c ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S2 R0 0
0 0 0 0 0000b140 ffffffffffffffff 0 S_CMP_EQ_U32 1 S0 0
0 0 0 0 0000b144 ffffffffffffffff 1 S1 S_MOV_B32 0  0
0 0 0 0 0000b148 ffffffffffffffff 0 S_CBRANCH_SCC1 0 0
0 0 0 0 0000b14c ffffffffffffffff 1 R4 V_MOV_B32_E32 0  0
0 0 0 0 0000b150 ffffffffffffffff 2 S2 S3 S_MOV_B64 2 S14 S15 0
0 0 0 0 0000b154 ffffffffffffffff 1 R1 V_MOV_B32_E32 1 R0 0
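
The instruction lines in the sample above follow the field order given in the #traces format line. As a rough sketch (in Python, purely to illustrate the layout; it is not part of Accel-Sim or MGPUSim), one line could be split into fields like this. The names TraceInst and parse_trace_line are made up for this example, the PC and mask are assumed to be hexadecimal, and everything after mem_width is kept as raw tokens rather than interpreted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TraceInst:
    block: Tuple[int, int, int]  # threadblock_x, threadblock_y, threadblock_z
    warp: int                    # warpid_tb
    pc: int                      # assumed hexadecimal in the trace
    mask: int                    # active mask, assumed hexadecimal
    dests: List[str]             # destination registers
    opcode: str
    srcs: List[str]              # source registers
    mem_width: int               # 0 for non-memory instructions in the sample
    mem_tail: List[str]          # raw tokens after mem_width, left uninterpreted


def parse_trace_line(line: str) -> TraceInst:
    """Split one instruction line of a kernel-[num].trace file into fields."""
    tok = line.split()
    block = (int(tok[0]), int(tok[1]), int(tok[2]))
    warp = int(tok[3])
    pc = int(tok[4], 16)
    mask = int(tok[5], 16)
    i = 6
    n_dest = int(tok[i])
    i += 1
    dests = tok[i:i + n_dest]
    i += n_dest
    opcode = tok[i]
    i += 1
    n_src = int(tok[i])
    i += 1
    srcs = tok[i:i + n_src]
    i += n_src
    mem_width = int(tok[i])
    i += 1
    return TraceInst(block, warp, pc, mask, dests, opcode, srcs, mem_width, tok[i:])


if __name__ == "__main__":
    sample = "0 0 0 0 0000b110 ffffffffffffffff 1 S0 S_LOAD_DWORD 2 S6 S7 4 1 0xc020 0"
    print(parse_trace_line(sample))
```

Reading the register counts first and then slicing that many tokens is what makes the variable-length [reg_dests] and [reg_srcs] fields unambiguous.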

A kernel-[num].trace file has two parts: a header and the actual traces. The header (everything up to and including the #traces format line) gives the trace-driven part of Accel-Sim the meta information for this kernel. The actual traces are then passed through a post-processing tool that groups traces from the same threadblock/workgroup together into a kernel-[num].traceg file:

-kernel name = FIR
-kernel id = 1671
-warp size = 64
-isa type = GCN3
-shmem = 0
-nregs = 32
-binary version = 100
-cuda stream id = 0
-shmem base_addr = 0x0
-local mem base_addr = 0x0
-nvbit version = -1
-accelsim tracer version = 3
-grid dim = (16,1,1)
-block dim = (256,1,1)
#traces format = threadblock_x threadblock_y threadblock_z warpid_tb PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses]


#BEGIN_TB

thread block = 0,0,0

warp = 0
insts = 392
0000b108 ffffffffffffffff 8 S12 S13 S14 S15 S16 S17 S18 S19 S_LOAD_DWORDX8 2 S6 S7 32 1 0xc000 0
0000b110 ffffffffffffffff 1 S0 S_LOAD_DWORD 2 S6 S7 4 1 0xc020 0
0000b118 ffffffffffffffff 2 S2 S3 S_LOAD_DWORDX2 2 S6 S7 8 1 0xc028 0
0000b120 ffffffffffffffff 1 S1 S_LOAD_DWORD 2 S4 S5 4 1 0xd004 0
0000b124 ffffffffffffffff 1 R2 V_MOV_B32_E32 0  0
0000b128 ffffffffffffffff 0 S_WAITCNT 0 0
0000b130 ffffffffffffffff 1 S1 S_AND_B32 1 S1 0
0000b134 ffffffffffffffff 1 S8 S_MUL_I32 2 S8 S1 0
0000b138 ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S8 R0 0
0000b13c ffffffffffffffff 1 R0 V_ADD_U32_E32 2 S2 R0 0
0000b140 ffffffffffffffff 0 S_CMP_EQ_U32 1 S0 0
0000b144 ffffffffffffffff 1 S1 S_MOV_B32 0  0
0000b148 ffffffffffffffff 0 S_CBRANCH_SCC1 0 0
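
The grouping step can be pictured with a small Python sketch. This is not the actual Accel-Sim post-processing tool, and the function name group_by_block_and_warp is hypothetical; it only shows the idea of bucketing instruction lines by thread block and warp and dropping the four leading index fields, as in the listing above.

```python
from typing import Dict, Iterable, List, Tuple

Block = Tuple[int, int, int]


def group_by_block_and_warp(lines: Iterable[str]) -> Dict[Block, Dict[int, List[str]]]:
    """Group raw .trace instruction lines by (thread block, warp).

    Each stored line has its leading threadblock_x/y/z and warpid_tb
    fields removed, since the group it belongs to already carries them.
    """
    groups: Dict[Block, Dict[int, List[str]]] = {}
    for line in lines:
        tok = line.split()
        # Skip blank lines and header lines (which start with '-' or '#').
        if len(tok) < 5 or tok[0][0] in "-#":
            continue
        block = (int(tok[0]), int(tok[1]), int(tok[2]))
        warp = int(tok[3])
        groups.setdefault(block, {}).setdefault(warp, []).append(" ".join(tok[4:]))
    return groups


if __name__ == "__main__":
    raw = [
        "0 0 0 0 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0",
        "0 0 0 1 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0",
        "0 0 0 0 0000b130 ffffffffffffffff 1 S1 S_AND_B32 1 S1 0",
    ]
    for block, warps in group_by_block_and_warp(raw).items():
        for warp, insts in warps.items():
            # len(insts) plays the role of the "insts = N" line in the listing.
            print(block, warp, len(insts))
```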

MGPUSim Modifications

To generate the Accel-Sim trace files (*.trace), we made some modifications to MGPUSim. Most of the changes are in driver/api.go, emu/isadebugger.go, emu/wavefront.go, insts/inst.go, and insts/decodetable.go; the changes to the last file are primarily to register names.

The first set of modifications produces the kernelslist file, which was done by adding printouts to driver/api.go on MemcpyHtoD calls.
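
To make the kernelslist layout concrete, here is a small Python sketch of writing and reading its lines. It is not the Go code added to driver/api.go. The helper names format_memcpy_h2d and parse_kernelslist are invented for illustration, and the two numeric fields are assumed to be the destination address (16 hex digits, as in the sample) and the size in bytes.

```python
from typing import List, Tuple


def format_memcpy_h2d(dst_addr: int, num_bytes: int) -> str:
    """Format one MemcpyHtoD entry the way the sample kernelslist shows it."""
    return f"MemcpyHtoD,0x{dst_addr:016x},{num_bytes}"


def parse_kernelslist(text: str) -> List[Tuple]:
    """Turn kernelslist text into ("memcpy", addr, size) or ("kernel", filename) entries."""
    entries: List[Tuple] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("MemcpyHtoD,"):
            _, addr, size = line.split(",")
            entries.append(("memcpy", int(addr, 16), int(size)))
        else:
            # Any other non-empty line is treated as a kernel trace file name.
            entries.append(("kernel", line))
    return entries


if __name__ == "__main__":
    text = "\n".join([
        format_memcpy_h2d(0x2000, 16384),
        format_memcpy_h2d(0xA000, 64),
        "kernel-1671.traceg",
    ])
    print(parse_kernelslist(text))
```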

The second set of modifications generates the *.trace files, using the existing isadebugger functionality of MGPUSim. We replaced the logWholeWf function with our own version that generates Accel-Sim format traces. The header information in the trace files is extracted from the wavefront launch packet and the kernel binary passed along with the wavefront. Additionally, to produce the Accel-Sim format for each instruction, we rewrote the existing String() functions in insts/inst.go.
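
As an illustration of the header that has to be emitted, the Python sketch below renders the same fields as the FIR sample earlier in this post. It is not the modified logWholeWf code. The function name render_header and its parameters are hypothetical, and the defaults and fixed values (such as binary version 100 and nvbit version -1) are simply copied from that sample rather than derived from any specification.

```python
from typing import Tuple

Dim3 = Tuple[int, int, int]


def render_header(
    name: str,
    kernel_id: int,
    grid_dim: Dim3,
    block_dim: Dim3,
    nregs: int = 32,      # default copied from the sample header
    shmem: int = 0,       # default copied from the sample header
    warp_size: int = 64,  # wavefront size shown in the sample header
) -> str:
    """Render a kernel trace header with the fields seen in the sample."""
    lines = [
        f"-kernel name = {name}",
        f"-kernel id = {kernel_id}",
        f"-warp size = {warp_size}",
        "-isa type = GCN3",
        f"-shmem = {shmem}",
        f"-nregs = {nregs}",
        # The constants below are taken verbatim from the sample header.
        "-binary version = 100",
        "-cuda stream id = 0",
        "-shmem base_addr = 0x0",
        "-local mem base_addr = 0x0",
        "-nvbit version = -1",
        "-accelsim tracer version = 3",
        "-grid dim = (%d,%d,%d)" % grid_dim,
        "-block dim = (%d,%d,%d)" % block_dim,
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_header("FIR", 1671, (16, 1, 1), (256, 1, 1)))
```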

Since MGPUSim outputs the traces executed by each compute unit rather than by each kernel, special care was taken to ensure the modified version operates correctly, especially in parallel mode. Traces for a single kernel are first generated separately by each compute unit. These partial traces are then combined when the last compute unit finishes its assigned wavefront.
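
The combine-on-last-finish idea can be sketched in Python as follows. This is only a model of the approach described above, not the Go implementation in the modified MGPUSim. The class KernelTraceCollector and its methods are hypothetical, and a lock stands in for whatever synchronization the parallel mode actually uses.

```python
import threading
from typing import Dict, Hashable, Iterable, List, Optional


class KernelTraceCollector:
    """Collect per-compute-unit partial traces and merge them at the end."""

    def __init__(self, cu_ids: Iterable[Hashable]) -> None:
        ids = list(cu_ids)
        self._lock = threading.Lock()
        self._partial: Dict[Hashable, List[str]] = {cu: [] for cu in ids}
        self._pending = set(ids)

    def record(self, cu_id: Hashable, line: str) -> None:
        """Append one trace line to the partial trace of a compute unit."""
        with self._lock:
            self._partial[cu_id].append(line)

    def finish(self, cu_id: Hashable) -> Optional[List[str]]:
        """Mark a compute unit as done.

        Returns None while other compute units are still running, and the
        merged trace (partials concatenated in compute-unit order) once the
        last one has finished.
        """
        with self._lock:
            self._pending.discard(cu_id)
            if self._pending:
                return None
            merged: List[str] = []
            for cu in sorted(self._partial):
                merged.extend(self._partial[cu])
            return merged


if __name__ == "__main__":
    collector = KernelTraceCollector([0, 1])
    collector.record(1, "0 0 0 1 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0")
    collector.record(0, "0 0 0 0 0000b128 ffffffffffffffff 0 S_WAITCNT 0 0")
    print(collector.finish(0))  # other compute unit still pending
    print(collector.finish(1))  # last compute unit: merged trace
```

The order in which partial traces are concatenated should not matter much here, since the post-processing step regroups lines by thread block and warp anyway.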

The modified MGPUSim is hosted on GitHub here.

To get Accel-Sim format traces from the modified MGPUSim, simply pass the -debug-isa flag when running a benchmark; you should then see the kernelslist and kernel-[num].trace files generated.