From Centralized to Composable: Advances in Distributed and Multimodal Language Modeling
Abstract
The advent of large language models (LLMs) has ushered in a new era of general-purpose artificial intelligence systems capable of language understanding, generation, and reasoning. Recent progress has further extended these capabilities to the multimodal domain, where models integrate textual, visual, auditory, and other sensory information. As the scale and complexity of both unimodal and multimodal LLMs grow, centralized training and inference become increasingly infeasible due to computational, memory, energy, and privacy constraints. Distributed architectures, spanning data parallelism, model parallelism, pipeline sharding, federated learning, and expert specialization, have emerged as necessary frameworks for scalable deployment and collaboration across modalities and domains. This survey provides a comprehensive review of the advances, methodologies, and challenges in distributed LLMs and multimodal large language models (MLLMs). We begin with a mathematical characterization of distributed LLM architectures, followed by a taxonomy of communication-efficient fusion mechanisms, parallel training strategies, and alignment techniques across modalities. We discuss the theoretical underpinnings of multimodal representation learning, including universal approximation, modality-aware attention, and conditional computation. We then examine training paradigms across decentralized and federated infrastructures, emphasizing optimization under partial supervision, low-bandwidth constraints, and heterogeneity in hardware and data distribution. The survey further explores real-world applications, deployment architectures, and constraints on inference latency, energy efficiency, and privacy. We highlight deployment patterns ranging from cloud-centric to fully edge-based systems and discuss model compression techniques including quantization, pruning, and mixture-of-experts routing.
The final sections identify open research challenges in scalability, cross-modal generalization, robustness to missing or adversarial modalities, and the interpretability and safety of distributed MLLMs. We conclude with future directions that emphasize sustainability, democratization, and theoretical convergence in multimodal distributed intelligence. This survey aims to serve as a foundational resource for researchers and practitioners building the next generation of distributed, multimodal, and human-aligned AI systems.