Figure: computation graph from the original paper.

Pruning is a popular technique for reducing the size of deep neural networks without sacrificing accuracy. However, traditional pruning methods can be computationally expensive and often lack hardware support. In this paper, Towards Fully Sparse Training: Information Restoration with Spatial Similarity, the authors propose a new approach to structured pruning that is specifically designed to run efficiently on the sparse Tensor Cores of NVIDIA Ampere GPUs.

The authors begin by noting that most existing pruning methods rely on unstructured pruning, which removes individual weights or activations from the network. While unstructured pruning can be effective, it is also time-consuming because a pruning mask must be computed and updated for every layer of the network. To address this issue, the authors propose a structured pruning approach.

Specifically, the authors propose using the 2:4 sparsity pattern supported by Ampere GPUs to prune weights in a structured manner. This involves pruning two out of every four consecutive elements in a matrix, which halves the effective size of the weight matrix. By pruning weights and activations in this way, they are able to achieve a high compression rate with minimal loss of accuracy.
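As a rough illustration of the 2:4 pattern, here is a minimal sketch (not the authors' code) that zeros the two smallest-magnitude values in every group of four consecutive weights along a row. The magnitude-based selection criterion and the function name are my assumptions for illustration; the paper and NVIDIA's tooling may select the surviving pair differently.

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero 2 of every 4 consecutive elements per row (2:4 structured sparsity)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 to form 2:4 groups"
    groups = weight.reshape(rows, cols // 4, 4)                 # split each row into groups of 4
    # indices of the 2 smallest-magnitude entries in each group (assumed criterion)
    drop_idx = groups.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)                            # zero the 2 selected entries per group
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_to_4(w)
# Every group of 4 consecutive weights now contains exactly 2 zeros,
# which is the pattern Ampere sparse Tensor Cores can accelerate.
print((w_sparse.reshape(8, -1, 4) == 0).sum(dim=-1))
```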

To test their approach, the authors evaluate their method on two popular benchmarks: COCO17 for object detection and ImageNet for image classification. They report only about a 0.2% drop in accuracy on both tasks, together with roughly a 2x speedup over unpruned models. This suggests that the method can deliver significant computational savings at a small accuracy cost.

One interesting aspect of the paper is the authors' analysis of different pruning strategies. They evaluate four strategies for the backpropagation computations: pruning weights and gradients (WG), pruning weights and activations (WX), pruning gradients (GG), and pruning gradients and activations (GX). They find that pruning gradients is generally a bad idea, since gradients carry useful information about model performance and the resulting errors can accumulate across layers.

Pruning activations can be time-consuming because of the size of the activation tensor, so the authors propose a fixed pruning mask that prunes alternate columns of the tensor, as sketched below. They note that this works well because neighboring columns tend to be similar in natural images, and this spatial similarity is preserved throughout the network since the same filter is applied to different patches of the input. In addition, the authors introduce an information restoration block to compensate for the error introduced by the fixed mask.
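The sketch below illustrates the idea of a fixed, data-independent mask on alternate columns of an activation map, assuming an NCHW layout. The restore_from_left_neighbor function is only a crude stand-in of my own that copies each dropped column from its surviving neighbor; the paper's actual information restoration block is a dedicated module and is not reproduced here.

```python
import torch

def prune_alternate_columns(x: torch.Tensor) -> torch.Tensor:
    """Zero every odd-indexed column with a fixed mask (no per-step mask computation)."""
    mask = torch.ones(x.shape[-1], device=x.device)
    mask[1::2] = 0.0                      # drop columns 1, 3, 5, ...
    return x * mask                       # broadcasts over (N, C, H, W)

def restore_from_left_neighbor(x_pruned: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for information restoration: fill each dropped
    column from its left neighbor, which tends to be similar in natural images."""
    n_dropped = x_pruned.shape[-1] // 2
    x_restored = x_pruned.clone()
    x_restored[..., 1::2] = x_pruned[..., 0::2][..., :n_dropped]
    return x_restored

x = torch.randn(1, 3, 8, 8)               # an (N, C, H, W) activation tensor
x_pruned = prune_alternate_columns(x)
x_approx = restore_from_left_neighbor(x_pruned)
```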

Overall, the authors' structured pruning method shows promise for reducing the computational cost of deep neural networks on the NVIDIA Ampere architecture. By using a structured pruning approach that is tailored to the hardware, they achieve significant speedups with minimal loss of accuracy.

Generated by ChatGPT from my paper-reading notes and proofread by me.