Deep Learning with Zero Initialization: Revisiting Symmetry Breaking and Gradient Flow
Abstract
For decades, the artificial intelligence (AI) community has held that zero initialization is ineffective for neural networks. Our study challenges this misconception by introducing a method that enables successful learning even when all weights and biases are initialized to zero. Beyond this method, we also examine mixed initialization schemes in which zero and random initialization coexist across different layers or parameters, showing that learning remains effective even under such partially randomized settings. Experiments on MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet using multilayer perceptrons (MLPs), convolutional neural networks (CNNs), residual networks (ResNets), vision transformers (ViTs), and MLP-Mixers show that zero initialization can match or even surpass random initialization in certain scenarios, particularly with MLPs and CNNs. Notably, MLP-Mixers retain comparable performance despite having no randomly initialized parameters. These findings position random initialization as a special case of zero-centered symmetry breaking, refute the longstanding belief that zero initialization inherently degrades neural network performance, and open new possibilities for neural network training. To systematize these insights, we propose the "Seo Integrated Zero Initialization: Foundational Scheme (SIZIFS)", a unified conceptual structure that categorizes artificial neural network initialization strategies into weight-level, node-level, and context-dependent types. Implementation code is publicly available at: https://github.com/sjw007s/Deep-Learning-with-Zero-Initialization-Revisiting-Symmetry-Breaking-and-Gradient-Flow.
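
As a concrete illustration of the mixed initialization schemes described above, the sketch below zero-initializes only the final classifier layer of a small MLP while the hidden layers keep PyTorch's default random initialization. This is a minimal sketch under assumed tooling (PyTorch) and an illustrative layer assignment; it is not the paper's specific method, which is provided in the linked repository.

# Minimal sketch (assumption: PyTorch) of a mixed zero/random initialization,
# where zero and random initialization coexist across different layers.
# Illustrative only; see the linked repository for the paper's actual scheme.
import torch
import torch.nn as nn

class MixedInitMLP(nn.Module):
    def __init__(self, in_dim: int = 784, hidden: int = 256, num_classes: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)        # default (random) initialization
        self.fc2 = nn.Linear(hidden, hidden)        # default (random) initialization
        self.head = nn.Linear(hidden, num_classes)  # zero-initialized below
        nn.init.zeros_(self.head.weight)            # all classifier weights start at zero
        nn.init.zeros_(self.head.bias)              # all classifier biases start at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x.flatten(1)))
        x = torch.relu(self.fc2(x))
        return self.head(x)

model = MixedInitMLP()
# Report which parameter tensors start at exactly zero.
for name, p in model.named_parameters():
    print(f"{name}: all-zero = {bool((p == 0).all())}")

In this particular mixed scheme the zero-initialized classifier still receives nonzero gradients from the first step (its gradient depends on the hidden activations, which are nonzero), so training can proceed despite the zero start.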