Dynamic Pruning and Quantization to Reduce Computational Demand
Problem Definition
Modern deep-learning models demand substantial computational power because of their size and complexity. This project aimed to reduce that cost by optimizing the models' layers dynamically with two key techniques, layer pruning and quantization, while applying Proximal Policy Optimization (PPO) to keep performance stable.
Understanding Static Pruning and Quantization
Static pruning and quantization apply fixed rules set ahead of time and held constant during training. Because these rules cannot adapt to shifts in the data distribution, they often leave efficiency on the table and can contribute to overfitting.
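To make the contrast concrete, a minimal sketch of static pruning (the function name and threshold are illustrative, not from the project) is a single magnitude threshold fixed up front and applied uniformly:

```python
import numpy as np

def static_prune(weights, threshold=0.05):
    """Static pruning: a fixed magnitude threshold chosen before training
    and applied uniformly, regardless of how the data distribution shifts."""
    return np.where(np.abs(weights) < threshold, 0.0, weights)
```

Whatever threshold is chosen, it stays fixed; the dynamic approach below instead adjusts such decisions during training.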
Dynamic Pruning and Quantization Approach
This method combines both pruning and quantization in a dynamic, unified framework. Instead of applying fixed rules, the system dynamically prunes and quantizes layers during training.
Pruning: Less important neurons are removed using the Taylor expansion criterion, which estimates how much the loss would change if each neuron were pruned.
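A common first-order form of the Taylor criterion scores each weight by |w · ∂L/∂w|, the first-order estimate of the loss change if that weight is zeroed. The sketch below (function names and the exact scoring form are assumptions, since the project's precise formulation isn't given) builds a pruning mask from those scores:

```python
import numpy as np

def taylor_importance(weights, grads):
    """First-order Taylor criterion: |w * dL/dw| approximates the
    change in the loss if the weight is removed (set to zero)."""
    return np.abs(weights * grads)

def prune_mask(weights, grads, prune_ratio):
    """Return a binary mask that zeroes the lowest-importance fraction."""
    scores = taylor_importance(weights, grads)
    k = int(prune_ratio * scores.size)
    if k == 0:
        return np.ones_like(weights)
    # Threshold at the k-th smallest score; everything at or below it is pruned.
    threshold = np.partition(scores.ravel(), k - 1)[k - 1]
    return (scores > threshold).astype(weights.dtype)
```

Neuron-level pruning would aggregate these scores over each neuron's weights before thresholding.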
Quantization: Weight precision is reduced, with KL divergence used to measure how far the quantized weight distribution drifts from the full-precision one, so that accuracy loss stays minimal.
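One way to realize this (a sketch under assumptions; the candidate bit-widths, KL budget, and helper names are illustrative, not from the project) is to quantize at several bit-widths and pick the smallest one whose quantized weight histogram stays within a KL-divergence budget of the full-precision histogram:

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization to the given bit-width."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale

def kl_divergence(p, q, eps=1e-8):
    """KL divergence between two (normalized) histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def pick_bitwidth(weights, candidates=(2, 4, 8), kl_budget=0.05, bins=64):
    """Choose the smallest bit-width whose quantized weight histogram
    stays within a KL budget of the full-precision histogram."""
    lo, hi = weights.min(), weights.max()
    p, _ = np.histogram(weights, bins=bins, range=(lo, hi))
    for bits in sorted(candidates):
        q, _ = np.histogram(quantize(weights, bits), bins=bins, range=(lo, hi))
        if kl_divergence(p.astype(float), q.astype(float)) <= kl_budget:
            return bits
    return max(candidates)
```

In the dynamic framework described here, the bit-width choice itself is driven by the learned policy rather than a fixed budget.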
Using PPO for Optimization
The Proximal Policy Optimization (PPO) algorithm was used to adjust pruning rates and quantization bit-widths dynamically. By clipping policy updates, PPO keeps changes to pruning and quantization stable and prevents extreme swings, balancing model-size reduction against accuracy.
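The stability comes from PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1 − ε, 1 + ε], so no single update (here, a change to the pruning-rate / bit-width policy) can move too far. A minimal sketch of that objective:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate. ratio = pi_new(a|s) / pi_old(a|s).
    Taking the minimum of the clipped and unclipped terms caps the
    incentive to move the policy far from its previous version."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```

In this setting the advantage would reflect the reward signal trading off accuracy against model size; the exact reward design used in the project is not specified here.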
Retraining with Lottery Ticket Hypothesis
After pruning and quantization, the model is retrained following the Lottery Ticket Hypothesis, which states that a large network contains smaller subnetworks ("winning tickets") that, when retrained from their original initialization, can match the full-size model's performance.
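Concretely, lottery-ticket retraining rewinds the surviving weights to their initial values and trains only them, with pruned weights held at zero. A minimal sketch (the training loop, `grad_fn`, and hyperparameters are hypothetical stand-ins for the project's actual retraining procedure):

```python
import numpy as np

def rewind_and_mask(init_weights, mask):
    """Reset surviving weights to their original initialization;
    pruned positions (mask == 0) stay at zero."""
    return init_weights * mask

def retrain_ticket(init_weights, mask, grad_fn, lr=0.1, steps=100):
    """Gradient-descent retraining of the winning ticket only:
    the mask blocks updates to pruned weights."""
    w = rewind_and_mask(init_weights, mask)
    for _ in range(steps):
        w = w - lr * grad_fn(w) * mask  # gradients only reach surviving weights
    return w
```

The key detail is the rewind: the subnetwork restarts from the same initialization it had in the full model, rather than from random new weights.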
Results
The method was tested on neural networks such as MobileNet and VGG-19. The dynamic approach led to:
6%-20% higher performance density (performance per unit of memory usage).
Reduction in model size while maintaining accuracy.
Future Applications
This dynamic pruning and quantization method can be applied to real-world use cases like healthcare, agriculture, and AI models deployed on resource-constrained devices.