This week, my primary task was to familiarize myself with the existing GPGPU-Sim framework and run some tests on it. Instead of focusing on lower-level details such as how to run the simulator (which is covered in the handout), I would like to discuss some thoughts on GPU simulators.

Why a simulator?

The first question is obvious: why should we build and use a simulator at all? My first answer is quicker prototyping, i.e., testing out new ideas before committing a lot of resources to them. From my previous experience building a very simple dual-core MIPS CPU, implementing a processor even at the Verilog level is very time consuming because of the numerous details that must be handled in the hardware description language itself. With a simulator, on the other hand, we can model processors in a much higher-level language such as C++ or Go instead of wrestling with the implementation details of an RTL design.

Secondly, with a simulator the verification suite is much easier to program, and we can easily collect statistics from the simulator for these verification suites or benchmarks rather than writing additional testbenches in Verilog.

Finally, I believe simulators allow researchers to focus more on abstract architecture design rather than on implementation details that are better handled by Verilog experts. Most of the time, researchers are not producing a fully functional, manufacturable Verilog design; instead, they want to investigate potential improvements to current GPU designs. By hiding and avoiding such details, a simulator lets researchers focus on that job.

Literature Review on GPU Simulators

Attached below is the literature review I have done on GPU simulators. It is by no means comprehensive, but it should provide some insight into the current state of publicly available GPU simulators.

With market pressure for better Graphics Processing Units (GPUs) and the slowing of Moore’s Law, GPU architecture has become a major research direction for both industry and academia, with the aim of improving current GPU designs for either faster performance or better energy efficiency [1]. However, without a calibrated baseline GPU simulator, it is difficult for researchers to compare their designs against state-of-the-art industrial GPU designs.

Since the introduction of the CUDA programming framework in 2007, multiple public GPU simulators for both NVIDIA and AMD GPUs have been developed by academia to facilitate research on possible improvements to GPU architecture [2]–[4]. GPGPU-Sim, introduced in 2009 [2], is one of the first GPU simulators for NVIDIA GPUs. By supporting the NVIDIA Parallel Thread Execution (PTX) virtual instruction set architecture (ISA) for both functional and timing simulation, GPGPU-Sim enables detailed analysis of CUDA programs in simulation and exposes potential optimizations to current GPUs [2]. A later version of GPGPU-Sim also adds support for power consumption simulation [5]. However, the base GPGPU-Sim and its successive versions up to 3.x focus mainly on PTX simulation [2], [5], and PTX is not a one-to-one representation of SASS, the machine ISA that actually runs on real GPUs. Although attempts have been made to implement SASS simulation in GPGPU-Sim [6], NVIDIA’s SASS is poorly documented and changes across GPU generations, so SASS simulation in GPGPU-Sim does not work for newer GPUs. This mismatch between PTX simulation and SASS, along with the constant evolution of the memory subsystem and architecture, is one of the causes of the deviation between GPGPU-Sim’s simulation results and actual hardware performance [7].
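To make the PTX/SASS gap concrete, the sketch below shows one way to inspect both representations of the same kernel with the standard CUDA toolchain. The kernel, the file name, and the sm_70 target are placeholders of my own, so treat this as a rough recipe rather than anything specific to GPGPU-Sim.

    // saxpy.cu -- a minimal kernel used only to illustrate the PTX/SASS gap
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Compare the two representations (assuming a CUDA toolkit is installed):
    //   nvcc -ptx saxpy.cu -o saxpy.ptx              # virtual ISA, stable across GPUs
    //   nvcc -arch=sm_70 -cubin saxpy.cu -o saxpy.cubin
    //   cuobjdump -sass saxpy.cubin                  # machine ISA, differs per generation

Recompiling with different -arch targets leaves the PTX essentially unchanged while the dumped SASS differs, which is exactly why a PTX-level timing model can drift away from real hardware behaviour.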

With the introduction of NVBit in 2019 [8], however, another kind of simulation arose: trace-driven simulation. Introduced in Accel-Sim, trace-driven simulation allows researchers to build simulators for state-of-the-art GPUs that closely match real silicon [7]. NVIDIA’s NVBit tool lets users insert custom instrumentation code down to the SASS level, which Accel-Sim uses to generate hardware traces [7]. With this trace information, the authors of Accel-Sim were able to produce more accurate timing simulations of real GPUs, exploit potential optimizations, and uncover false assumptions in academia about current GPU designs [7]. In addition, custom microbenchmarks are introduced in Accel-Sim to capture undocumented architectural changes in modern GPUs [7].
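As a purely illustrative sketch of what trace-driven simulation means here, the snippet below defines a hypothetical per-instruction trace record and a toy replay loop. The field names and latencies are assumptions of mine and do not reflect Accel-Sim’s actual trace format or timing model.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Hypothetical trace record: the kind of per-instruction information a
    // SASS-level instrumentation pass could emit (not Accel-Sim's real format).
    struct TraceRecord {
        uint64_t pc;           // instruction address
        uint32_t active_mask;  // which threads of the warp executed it
        bool     is_memory;    // memory vs. ALU instruction
        uint64_t address;      // effective address when is_memory is true
    };

    // Toy replay loop: charge a fixed latency per instruction class.  A real
    // trace-driven simulator replaces this with detailed pipeline/memory models.
    uint64_t replay(const std::vector<TraceRecord>& trace) {
        uint64_t cycles = 0;
        for (const auto& r : trace)
            cycles += r.is_memory ? 400 : 4;  // made-up latencies for illustration
        return cycles;
    }

    int main() {
        std::vector<TraceRecord> trace = {
            {0x000, 0xffffffff, false, 0},
            {0x010, 0xffffffff, true,  0x7f00},
        };
        std::cout << "estimated cycles: " << replay(trace) << "\n";
    }

The point is only that, once functional execution has been captured in a trace, the timing model can be refined and replayed repeatedly without re-running the program on the instrumented hardware.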

Despite this accurate simulation capability, because NVBit is only available for NVIDIA GPUs, Accel-Sim currently cannot perform trace simulation for GPUs from other vendors such as AMD, Intel, or Apple. Specifically for AMD GPUs, at the time of writing there is no tool like NVBit that can perform custom machine-ISA code instrumentation, which prevents Accel-Sim from using the same approach to model AMD GPUs.

However, unlike NVIDIA, AMD provides detailed documentation for its machine ISA, GCN [9], which has enabled detailed simulators for AMD GPUs running the GCN ISA to be built and used to analyze GPU workloads [3], [4], [10]. As the main goal of this project, these GCN-ISA simulators will be used to generate trace information for Accel-Sim to simulate AMD GPUs, and the results will be compared with hardware profiling results to quantify the deviation, as illustrated in Figure 1; a rough sketch of this trace-conversion step is given after the figure.


Figure 1: Trace information data flow
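
To sketch the data flow in Figure 1, the hypothetical converter below takes instruction records of the kind a GCN-level AMD simulator could emit and writes a simple one-instruction-per-line text trace for a trace-driven front end to consume. The struct fields and the output format are assumptions for illustration only, not actual MGPUSim, gem5, or Accel-Sim interfaces.

    #include <fstream>
    #include <string>
    #include <vector>

    // Hypothetical GCN-level record from an ISA-level AMD simulator;
    // the field names are assumptions, not a real simulator's API.
    struct GcnInstr {
        std::string        opcode;     // e.g. "v_add_f32"
        unsigned           wavefront;  // wavefront id
        bool               is_memory;
        unsigned long long address;    // effective address if is_memory
    };

    // Write one instruction per line in a simple text format that a
    // trace-driven timing front end could parse (not Accel-Sim's format).
    void write_trace(const std::vector<GcnInstr>& instrs, const std::string& path) {
        std::ofstream out(path);
        for (const auto& i : instrs)
            out << i.wavefront << ' ' << i.opcode << ' '
                << (i.is_memory ? i.address : 0) << '\n';
    }

    int main() {
        write_trace({{"v_add_f32", 0, false, 0},
                     {"flat_load_dword", 0, true, 0x1000}},
                    "kernel0.trace");
    }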

References

[1]      T. M. Aamodt, W. W. L. Fung, and T. G. Rogers, “General-purpose graphics processor architectures,” Synthesis Lectures on Computer Architecture, vol. 13, no. 2, pp. 1–140, 2018, Accessed: Jun. 02, 2021. [Online]. Available: https://skos.ii.uni.wroc.pl/pluginfile.php/28568/mod_resource/content/2/General-purpose-graphics-processor-architectures.pdf

[2]      A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in ISPASS 2009 - International Symposium on Performance Analysis of Systems and Software, 2009, pp. 163–174. doi: 10.1109/ISPASS.2009.4919648.

[3]      Y. Sun et al., “MGPUSim: Enabling multi-GPU performance modeling and optimization,” in Proceedings - International Symposium on Computer Architecture, Jun. 2019, pp. 197–209. doi: 10.1145/3307650.3322230.

[4]      T. Gutierrez, S. Puthoor, M. Sinclair, and B. Beckmann, “The AMD gem5 APU Simulator: Modeling GPUs Using the Machine ISA,” AMD Research, 2018.

[5]      J. Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013, pp. 487–498. doi: 10.1145/2485922.2485964.

[6]      T. M. Aamodt, W. W. L. Fung, and T. H. Hetherington, “GPGPU-Sim 3.x Manual.” http://gpgpu-sim.org/manual/index.php/Main_Page (accessed Jun. 10, 2021).

[7]      M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling,” in Proceedings - International Symposium on Computer Architecture, 2020, vol. 2020-May. doi: 10.1109/ISCA45697.2020.00047.

[8]      O. Villa, M. Stephenson, D. Nellans, and S. W. Keckler, “NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs,” in Proceedings of the Annual International Symposium on Microarchitecture, MICRO, Oct. 2019, pp. 372–383. doi: 10.1145/3352460.3358307.

[9]      AMD, “GCN ISA Manuals,” 2021. https://rocmdocs.amd.com/en/latest/GCN_ISA_Manuals/GCN-ISA-Manuals.html (accessed Jun. 10, 2021).

[10]    A. Gutierrez et al., “Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level,” in Proceedings - International Symposium on High-Performance Computer Architecture, Mar. 2018, vol. 2018-February, pp. 608–619. doi: 10.1109/HPCA.2018.00058.