Abstract
Convolutional Neural Network (CNN) models have become the mainstream method in Artificial Intelligence (AI) for computer vision tasks such as image classification and image segmentation. Deep CNNs contain a large volume of convolution computation, so training a CNN requires powerful GPU resources. Training a large CNN may take days or even weeks, which is time-consuming and costly. When multiple runs are needed to search for the optimal CNN hyperparameter settings, the search can take months on limited GPUs, which is unacceptable and hinders the development of CNNs. It is therefore essential to train CNNs faster. When no additional computing resources are available, there are two kinds of methods to do so. The first is model compression, which reduces training time by reducing architecture complexity, either by removing parameters or by representing the model with less storage. The second is to reduce the input data fed into the network without changing the network architecture.

Architecture complexity reduction is a popular research direction for faster CNN training. Nowadays, mobile devices such as smartphones and smart cars rely on deep CNNs to accomplish complex tasks like human body recognition and face recognition. Due to the high real-time demands and memory constraints of mobile applications, conventional large CNNs are not suitable, and model compression has become the trend for training deep CNN models at a lower computation cost. Many successful networks have been designed to address this problem, such as ResNeXt, MobileNet, ShuffleNet, and GhostNet; they replace the standard convolution with 1×1 convolution, depthwise convolution, or group convolution to reduce computation. However, there are fewer studies on the following questions. First, does the variety of convolution layers (whether the output channel number is larger or smaller than the input channel number) affect the performance of different compression strategies? Second, does the expansion ratio of a convolution layer (the output channel number over the input channel number when the output channel number is larger, or the input channel number over the output channel number when the input channel number is larger) affect the performance of different compression strategies? Third, does the compression ratio (the reduced parameter number/FLOPs over the original parameter number/FLOPs) affect the performance of different compression strategies? Current networks tend to use the same convolution strategy inside a basic network block, ignoring the variety of network layers. We have proposed a novel Conditional Reduction (CR) module to compress a single 1×1 convolution layer, a novel three-layer Conditional block (C-block) to compress CNN bottlenecks or inverted bottlenecks, and, building on the CR module and C-block, a novel Conditional Reduction Network (CRnet). We have tested CRnet on two image classification datasets, CIFAR-10 and CIFAR-100, with multiple network expansion ratios and compression ratios. The experiments verify the correctness of our methods and highlight the importance of the input-output channel pattern when selecting a compression strategy. They also show that the proposed CRnet achieves a better balance between model complexity and accuracy than state-of-the-art group convolution and Ghost module compression.
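To make the expansion and compression ratios above concrete, the following Python sketch (an illustration of the standard FLOPs arithmetic, not the thesis implementation; the layer sizes are assumed) counts the multiply-accumulate operations of a standard convolution and of a depthwise-separable replacement, then computes the two ratios as defined above.

```python
def conv_flops(h, w, c_in, c_out, k, groups=1):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    return h * w * (c_in // groups) * c_out * k * k

# Assumed example layer: 56x56 feature map, 64 -> 256 channels, 3x3 kernel.
standard = conv_flops(56, 56, 64, 256, 3)

# Depthwise-separable replacement: depthwise 3x3 followed by pointwise 1x1.
depthwise_separable = (conv_flops(56, 56, 64, 64, 3, groups=64)
                       + conv_flops(56, 56, 64, 256, 1))

expansion_ratio = 256 / 64  # output channels over input channels (output is larger here)
compression_ratio = (standard - depthwise_separable) / standard  # reduced FLOPs over original FLOPs

print(f"expansion ratio:   {expansion_ratio:.1f}")
print(f"compression ratio: {compression_ratio:.3f}")
```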
Data reduction, in contrast, reduces training time in a direct and simple way by dropping part of the training data. Existing works drop data based on sample importance ranking, but the ranking process itself takes extra time when the number of training samples is large. When we tune different network settings in search of an optimal configuration, we want a way to cut a large percentage of training time with little or no accuracy loss. Again, there are fewer studies on the following questions. First, what are suitable sampling ratios? Second, should the same sampling ratio be used for every training epoch? Third, does the sampling ratio perform differently on small and large datasets? We have proposed a flat reduced random sampling training strategy and a bottleneck reduced random sampling strategy, as well as a three-stage training method based on bottleneck reduced random sampling that accounts for the distinct behavior of early-stage and end-stage training. Furthermore, we have proved the data visibility of a sample over the whole training process and the theoretical reduced training time through four theorems and two corollaries. We have tested the two sampling strategies on three image classification datasets: CIFAR-10, CIFAR-100, and ImageNet. The experiments show that both sampling strategies reduce a significant percentage of training time at a very small accuracy loss.
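As a rough illustration of how the flat and bottleneck sampling strategies could be scheduled, the sketch below is our own hedged example rather than the thesis code: the ratios, stage boundaries, and helper names such as epoch_sampling_ratio are assumptions. It resamples a random subset of the training indices at the start of every epoch, using the full dataset in the early and late stages and a reduced ratio in the middle stage.

```python
import random

def epoch_sampling_ratio(epoch, total_epochs, bottleneck_ratio=0.5):
    """Return the fraction of the training set visible in this epoch.

    Flat strategy: return the same ratio for every epoch.
    Bottleneck strategy (sketched here, three stages): full data in the
    early and late stages, a smaller ratio in the middle stage.
    Stage boundaries and ratios are assumed values for illustration.
    """
    early, late = int(0.2 * total_epochs), int(0.8 * total_epochs)
    if epoch < early or epoch >= late:
        return 1.0                # early and end stages see all samples
    return bottleneck_ratio       # middle stage trains on a random subset

def sample_epoch_indices(num_samples, ratio, seed=None):
    """Draw a fresh random subset of training indices for one epoch."""
    rng = random.Random(seed)
    k = int(num_samples * ratio)
    return rng.sample(range(num_samples), k)

# Usage: resample the visible subset at the start of every epoch.
total_epochs, n_train = 100, 50_000
for epoch in range(total_epochs):
    ratio = epoch_sampling_ratio(epoch, total_epochs)
    indices = sample_epoch_indices(n_train, ratio, seed=epoch)
    # train_one_epoch(model, subset(train_set, indices))  # training loop omitted
```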