A Study on Enhancing the Inference Efficiency of Generative Recommender Systems Using Deep Model Compression
Abstract
To improve the inference efficiency of generative recommender systems in practical deployments, this study constructs a deep model compression framework that integrates structured pruning, dynamic quantization, and knowledge distillation, and examines the combined impact of these coordinated strategies on model size, response latency, and recommendation accuracy. Using public datasets such as MovieLens-1M and Alibaba Tianchi as benchmarks, the performance and system bottlenecks of different compression combinations are analyzed. A hybrid parallel scheduling mechanism and a multi-level cache optimization system are built to strengthen the compressed model's execution capability at the inference stage. The results show that joint application of the three strategies reduces inference latency to 15.3 ms while compressing the model parameters to 22.8% of the original, with only a 0.4% drop in HR@10, significantly improving operational efficiency and deployment adaptability.
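To make the three-strategy pipeline concrete, the following is a minimal PyTorch sketch of how structured pruning, knowledge distillation, and dynamic quantization could be combined on a toy scoring head. The abstract does not specify the paper's architecture or hyperparameters, so the model class (ScoringHead), its dimensions, the pruning ratio, and the distillation temperature are all illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

# Hypothetical stand-in for the recommender's scoring head; the
# paper's actual generative architecture is not given in the abstract.
class ScoringHead(nn.Module):
    def __init__(self, dim_in=128, dim_hidden=256, n_items=1000):
        super().__init__()
        self.fc1 = nn.Linear(dim_in, dim_hidden)
        self.fc2 = nn.Linear(dim_hidden, n_items)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

teacher = ScoringHead()                 # full-size model used as teacher
student = ScoringHead(dim_hidden=128)   # smaller student to be compressed

# 1) Structured pruning: zero out 30% of fc1's output channels (whole
#    rows) by L2 norm, then bake the mask into the weights. A real
#    deployment would physically slim the layer afterward.
prune.ln_structured(student.fc1, name="weight", amount=0.3, n=2, dim=0)
prune.remove(student.fc1, "weight")

# 2) Knowledge distillation: soften both logit distributions at
#    temperature T and match them with KL divergence.
def distill_loss(student_logits, teacher_logits, T=2.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

x = torch.randn(32, 128)                # dummy batch of user/context embeddings
with torch.no_grad():
    t_logits = teacher(x)
loss = distill_loss(student(x), t_logits)
loss.backward()                         # one distillation step (optimizer omitted)

# 3) Dynamic quantization: convert the remaining Linear layers to int8
#    weights, with activations quantized on the fly at inference time.
student.eval()
compressed = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
print(compressed(x).shape)              # torch.Size([32, 1000])
```

The ordering here (prune, then distill, then quantize) is one plausible arrangement; the abstract reports only that the three strategies are applied jointly, not the sequence or schedule used in the paper's framework.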