Transferring Weight Distributions for Efficient Language Model Initialization

LLMs
Transfer Learning
Model Initialization
Author

Abhishek Upperwal

Published

2025-04-18

1 Introduction

Training large language models (LLMs) from scratch requires enormous amounts of compute and data. Typically, this involves initializing model weights randomly and then training them from the ground up. However, what if we could give our model a better starting point—one that reflects the patterns already learned by powerful existing models?

In this blog, we explore a method inspired by transfer learning: instead of transferring entire weights or architectures, we analyze the distribution of weights from an open-source pretrained model and use that distribution to initialize a new model. This new model could be smaller, differently structured, or even built for a new task—but it still benefits from the statistical “knowledge” embedded in the pretrained model’s weights.

The key motivation is simple: by initializing weights more intelligently, we can speed up training, improve performance, and reduce compute cost—all without needing to copy architectures or fine-tune a massive model directly.

2 Modeling Pretrained Weight Distributions in LLMs

Pretrained large language models (LLMs) exhibit characteristic weight distributions that can be modeled and transferred to new models. Studies have found that the distribution of trained weights often deviates from simple Gaussian noise – many layers show heavy-tailed or structured patterns. For example, in convolutional networks, learned filters display highly structured covariance matrices rather than independent Gaussian weights. In Transformers, the self-attention weight matrices (e.g. query/key projections) often develop distinct patterns (such as approximate identity structure on their diagonals) in well-trained models. These insights motivate distribution-based initialization: instead of random initialization, we fit statistical distributions (Gaussian, Laplace, etc., or even non-parametric estimates) to the weights of a pretrained model and sample new weights from these distributions for a new model. This approach treats the pretrained weights as a prior for initializing another network, even if the new network has a different architecture.

Researchers have explicitly modeled weight distributions of pretrained networks using both parametric and non-parametric methods. For instance, Trockman et al. (2023) analyzed the covariance of convolutional filter weights in vision models and showed that a multivariate Gaussian with the right covariance structure can capture much of the learned filter distribution. They observed that the covariance structure is model-independent to an extent – covariances measured on one model could be applied to initialize others of different depth or width. On the language model side, weight distributions also tend to be non-uniform. An IEEE study noted that over 80% of layers in various pretrained networks have weight distributions that significantly deviate from a pure Gaussian, exhibiting asymmetry or heavy tails (often better fit by α-stable distributions). Such findings suggest that simple random initializations (which assume independent Gaussian weights) might be suboptimal compared to distributions that incorporate the learned characteristics of pretrained weights.
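
As a quick illustration of this kind of analysis, the snippet below compares Gaussian and Laplace fits to a flattened weight tensor and reports its excess kurtosis. It is a minimal sketch: the synthetic “weights” stand in for a real layer tensor, and SciPy is assumed to be available.

```python
import numpy as np
from scipy import stats

def tail_report(weights: np.ndarray) -> None:
    """Compare Gaussian vs. Laplace fits for a flat array of layer weights."""
    w = weights.ravel()
    # Excess kurtosis: ~0 for a Gaussian, > 0 indicates heavier tails.
    print(f"excess kurtosis: {stats.kurtosis(w):.3f}")
    for name, dist in [("normal", stats.norm), ("laplace", stats.laplace)]:
        params = dist.fit(w)
        # Higher total log-likelihood means a better fit to the observed weights.
        ll = dist.logpdf(w, *params).sum()
        print(f"{name:>8}: params={tuple(round(p, 4) for p in params)}, log-lik={ll:.1f}")

# Toy usage with synthetic heavy-tailed "weights"; in practice pass a real
# layer's tensor, e.g. layer.weight.detach().cpu().numpy().
tail_report(np.random.standard_t(df=4, size=50_000) * 0.02)
```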

In practice, modeling a pretrained model’s weight distribution can be done at different granularities:

  • Global or Layer-Wise Histograms: Fit a single distribution to all of the weights, or fit a separate distribution per layer to capture layer-specific scales.
  • Full-Covariance Models: Use a multivariate Gaussian to capture the joint distribution of weight vectors, including their correlation structure.
  • Non-Parametric Density Estimates: Use KDEs or empirical histograms to preserve the exact shape of the pretrained weight distribution.
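
The sketch below illustrates the first and third options for a single tensor: a layer-wise Gaussian fit and a KDE-based sampler, either of which can fill a new tensor of arbitrary shape. The tensor names and sizes are placeholders, not taken from any real model.

```python
import math
import numpy as np
import torch
from scipy.stats import gaussian_kde

def init_from_gaussian(source: torch.Tensor, target_shape) -> torch.Tensor:
    """Layer-wise parametric option: match the source layer's mean and std."""
    mu, sigma = source.mean().item(), source.std().item()
    return torch.randn(target_shape) * sigma + mu

def init_from_kde(source: torch.Tensor, target_shape, max_points: int = 100_000) -> torch.Tensor:
    """Non-parametric option: sample from a KDE of the source layer's weights."""
    flat = source.detach().cpu().numpy().ravel()
    if flat.size > max_points:                        # subsample for KDE speed
        flat = np.random.default_rng(0).choice(flat, max_points, replace=False)
    kde = gaussian_kde(flat)                          # empirical density of source weights
    samples = kde.resample(math.prod(target_shape))   # shape (1, n)
    return torch.from_numpy(samples).reshape(target_shape).float()

# Example: initialize a smaller projection from a (stand-in) larger weight matrix.
teacher_w = torch.randn(2048, 2048) * 0.02
student_w = init_from_kde(teacher_w, (512, 512))
```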

Open LLMs such as Google’s Gemma-3 family (ranging from 1B to 27B parameters) provide publicly available weight sets that can be analyzed this way. One could fit, say, a Gaussian or mixture model to the weights of Gemma-3 27B and use that to initialize a new model. The key idea is to be structure-agnostic: rather than requiring the new model to inherit the exact architecture (and weights) of the source, we only transfer statistical properties of the weights.
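
To make this concrete, here is a minimal sketch of pulling per-layer statistics from an open Gemma checkpoint via Hugging Face transformers. The checkpoint id (a small 1B variant, chosen for practicality), the restriction to 2-D matrices, and the particular statistics computed are all illustrative assumptions; downloading the weights requires accepting the Gemma license.

```python
import torch
from transformers import AutoModelForCausalLM

# Checkpoint id is an assumption; any open pretrained LLM works the same way.
# (Gemma weights are gated, so Hugging Face authentication is required.)
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-pt", torch_dtype=torch.float32)

layer_stats = {}
for name, param in model.named_parameters():
    if param.ndim == 2:                       # restrict to projection/embedding matrices
        w = param.detach().float()
        std = w.std()
        layer_stats[name] = {
            "mean": w.mean().item(),
            "std": std.item(),
            # Rough skewness estimate; values far from 0 suggest asymmetry.
            "skew": (((w - w.mean()) / (std + 1e-12)) ** 3).mean().item(),
        }

# These per-layer statistics can parameterize the samplers sketched earlier.
for name, s in list(layer_stats.items())[:5]:
    print(name, s)
```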

3 Techniques for Weight Distribution–Based Initialization

3.1 Parametric Distribution Fitting of Weights

A straightforward approach is to fit a parametric distribution (like Gaussian or Laplace) to pretrained weights and sample new weights from it. This retains basic statistics (mean, variance, possibly covariance) of the source model’s weights in the new model’s initialization.
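
Before turning to specific published methods, here is a generic sketch of what fitting a parametric distribution with covariance can look like: model the rows of a pretrained weight matrix as draws from a multivariate Gaussian and sample new rows from it. This is an illustration under simplified assumptions (the new layer must share the source’s input width); it is not the exact recipe from any of the papers below.

```python
import numpy as np
import torch

def init_with_row_covariance(source: torch.Tensor, n_new_rows: int) -> torch.Tensor:
    """Fit a multivariate Gaussian to the rows of a pretrained weight matrix
    and sample new rows from it, preserving inter-column correlations."""
    rows = source.detach().cpu().numpy()            # shape: (out_features, in_features)
    mu = rows.mean(axis=0)
    cov = np.cov(rows, rowvar=False)                # (in_features, in_features) covariance
    cov += 1e-8 * np.eye(cov.shape[0])              # small ridge for numerical stability
    rng = np.random.default_rng(0)
    new_rows = rng.multivariate_normal(mu, cov, size=n_new_rows)
    return torch.from_numpy(new_rows).float()

# Example: a new layer with fewer output units but the same input width.
teacher_w = torch.randn(768, 768) * 0.02            # stand-in for a pretrained projection
student_w = init_with_row_covariance(teacher_w, n_new_rows=384)
```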

  • Covariance-Based Gaussian Initialization: Trockman et al. (ICLR 2023) introduced a covariance-aware initialization for conv nets, sampling filters from multivariate Gaussians whose covariances were measured from real pretrained models. Their method outperformed standard random initialization and, in some cases, reached strong performance with minimal training.

  • Mimetic Initialization for Transformers: Trockman & Kolter (ICML 2023) found that pretrained attention matrices often follow \(W_Q W_K^T \approx I\) and \(W_V W_O^T \approx -I\). They constructed closed-form weight matrices that preserve this structure for initialization. These led to faster and more accurate training.
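
As a toy illustration of that identity structure (a deliberate simplification, not a reimplementation of the paper’s recipe), one can reuse an orthogonal random matrix so that the relevant products come out proportional to \( \pm I \):

```python
import torch

def mimetic_like_qkvo(d_model: int, scale: float = 0.1):
    """Toy construction: sample orthogonal matrices and reuse them so that
    W_Q @ W_K.T = +scale^2 * I and W_V @ W_O.T = -scale^2 * I, mimicking the
    approximate-identity structure observed in pretrained attention layers."""
    z1 = torch.linalg.qr(torch.randn(d_model, d_model)).Q   # orthogonal factor
    z2 = torch.linalg.qr(torch.randn(d_model, d_model)).Q
    w_q, w_k = scale * z1, scale * z1                        # W_Q W_K^T = scale^2 * I
    w_v, w_o = scale * z2, -scale * z2                       # W_V W_O^T = -scale^2 * I
    return w_q, w_k, w_v, w_o

w_q, w_k, w_v, w_o = mimetic_like_qkvo(256)
print(torch.dist(w_q @ w_k.T, 0.01 * torch.eye(256)))        # ~0 up to numerics
```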

Other parametric methods include fitting α-stable distributions or Laplace models in layers where the source weights show significant skew or heavy tails.

3.2 Non-Parametric and Empirical Weight Reuse

  • Weight Selection (Subset Transfer): Xu et al. (ICLR 2024) proposed initializing smaller models by selecting a subset of actual weights from larger pretrained models: each student layer is filled with elements chosen from the corresponding teacher layer (a simplified version is sketched after this list).

  • Structured Pruning as Initialization (Sheared LLaMA): Xia et al. pruned LLaMA2-7B down to 2.7B and 1.3B parameters while retaining the surviving original weights, followed by a comparatively short continued-pretraining phase. The sheared models retained strong performance, showing that reusing actual pretrained weights can stand in for most of a full pretraining run.

  • Cross-Architecture Mapping: Other works transfer weights across architectures via layer alignment or function-preserving transformations (e.g., Net2Net). These approaches are less automated but demonstrate the same principle.

  • Weight Generative Models (e.g., Diffusion): Soro et al. train a latent diffusion model on weight vectors from a “model zoo” to generate new initialization weights conditioned on target metadata.
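
A minimal sketch of the subset-transfer idea from the first bullet above: pick uniformly spaced elements of the teacher tensor along each dimension to fill a smaller student tensor. The real method includes additional details (layer matching, handling mismatched depth), so treat this purely as an illustration.

```python
import torch

def uniform_element_selection(teacher_w: torch.Tensor, student_shape) -> torch.Tensor:
    """Initialize a smaller tensor by picking uniformly spaced elements from a
    larger pretrained tensor along every dimension (simplified weight selection)."""
    assert teacher_w.ndim == len(student_shape)
    w = teacher_w
    for dim, new_size in enumerate(student_shape):
        # Uniformly spaced indices into the teacher's dimension.
        idx = torch.linspace(0, w.shape[dim] - 1, steps=new_size).round().long()
        w = w.index_select(dim, idx)
    return w.clone()

# Example: shrink a (stand-in) 4096x4096 teacher projection to 1024x1024.
teacher_w = torch.randn(4096, 4096) * 0.02
student_w = uniform_element_selection(teacher_w, (1024, 1024))
```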

4 Effects on Convergence and Model Quality

Weight distribution–based initialization can improve both training efficiency and final performance:

  • Faster Convergence: Models start in a better region of parameter space, reducing training time. For example, Xu et al. report that a ViT-Tiny initialized by weight selection reached its target accuracy in roughly one-third of the training time.

  • Higher Final Accuracy: Initialization from real weight distributions can converge to better optima; e.g., weight selection reports ImageNet accuracy gains of about 1.6% over standard initialization.

  • Gradient Stability: Matching the scale and distribution of pretrained weights helps avoid unstable regimes such as exploding or vanishing gradients early in training.

5 Parametric vs. Non-Parametric Distributions

  • Parametric: Simpler and smooths out sampling noise; a Gaussian with the empirical mean and (co)variance of each layer is often sufficient.
  • Non-Parametric: More faithful to the true shape of the distribution but potentially overfitted to the source model; KDEs or direct weight copying preserve the exact shape but require a sensible mapping between source and target layers.
  • Hybrid: Combine the two, e.g. reuse empirical weight values with a small random perturbation, or mix a parametric prior with actual sampled values (sketched below).
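
A minimal sketch of one such hybrid: resample actual values from the source weights and add a small Gaussian perturbation tied to the source’s own scale. The function name and the 10% noise factor are arbitrary illustrative choices.

```python
import math
import torch

def hybrid_init(source: torch.Tensor, target_shape, noise_scale: float = 0.1) -> torch.Tensor:
    """Hybrid scheme: draw real values from the flattened source weights
    (non-parametric part) and add a small Gaussian jitter whose scale is tied
    to the source's own std (parametric part)."""
    flat = source.detach().flatten()
    n = math.prod(target_shape)
    idx = torch.randint(0, flat.numel(), (n,))        # resample actual weight values
    base = flat[idx].reshape(target_shape)
    jitter = torch.randn(target_shape) * source.std() * noise_scale
    return base + jitter

teacher_w = torch.randn(2048, 2048) * 0.02             # stand-in for pretrained weights
student_w = hybrid_init(teacher_w, (512, 512))
```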

6 Implementations and Open-Source Resources

  • Weight Selection (ICLR 2024): GitHub repo available with PyTorch examples.
  • Sheared LLaMA: Available on HuggingFace with pruned models.
  • ConvCov: GitHub repo for covariance-based Gaussian conv filter init.
  • Mimetic Init: Simple enough to add as a custom weight-initialization step in any Transformer codebase.
  • Weight Diffusion (Soro et al.): Likely to be open-sourced post-ICML.

7 Conclusion

Initializing new LLMs by sampling from distributions fitted to existing models’ weights is a promising form of structure-agnostic transfer learning. It is more flexible than strict fine-tuning, less expensive than full training, and grounded in real model statistics. As more pretrained models become openly available, this approach is likely to become part of the standard toolkit for efficient language model development.

8 References

  1. Trockman, A., & Kolter, J. Z. (2023). Mimetic Initialization of Self-Attention Layers. ICML.
  2. Trockman, A., Willmott, D., & Kolter, J. Z. (2023). Understanding the Covariance Structure of Convolutional Filters. ICLR.
  3. Soro, A., et al. (2024). Diffusion-Based Weight Generators. ICML (to appear).
  4. Xu, W., et al. (2024). Initializing Models with Larger Ones. ICLR.
  5. Xia, M., et al. (2023). Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning. arXiv:2310.06694.
  6. IEEE Transactions on Neural Networks. Heavy-Tailed Weight Distributions in Deep Nets.
  7. Czyżewski, J., et al. (2022). Cross-Architecture Weight Transfer.
  8. Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. ICLR.
  9. Google Gemma Models: https://ai.google.dev/gemma
  10. HuggingFace Sheared LLaMA: https://huggingface.co/collections/LLMs/sheared-llama