Multimodal and Distributed LLMs: Bridging Scalability and Cross-Modal Reasoning
Abstract
Large Language Models (LLMs) have emerged as a cornerstone of modern artificial intelligence, achieving remarkable capabilities in natural language understanding and generation. As their scale and utility have grown, two critical and complementary trends have defined their evolution: (1) the distributed systems and algorithms that enable efficient training of ultra-large models across massive compute infrastructures, and (2) the integration of multiple modalities, such as vision, audio, and structured data, into unified multimodal large language models (MLLMs). This survey provides a comprehensive examination of the state of the art along both dimensions.

We begin with the foundations of and advances in distributed training, including model parallelism, pipeline parallelism, memory optimization strategies, and the design of sparse and expert models. We assess system-level techniques such as ZeRO, DeepSpeed, and tensor sharding that enable scalable, memory-efficient training at the trillion-parameter scale. We then turn to multimodality, surveying architectures and training objectives that extend LLMs to process and generate content across diverse input types. We review contrastive learning, cross-attention fusion, and aligned token embeddings as key techniques for cross-modal reasoning, with illustrative examples from models such as Flamingo, CLIP, and GPT-4V.

Beyond current methodologies, we identify and formalize the core technical challenges facing distributed and multimodal LLMs, including memory bottlenecks, communication overhead, alignment in the absence of ground truth, robustness to modality shifts, and evaluation on open-ended tasks. To guide future research, we outline six key directions: unified memory-augmented architectures, modular and composable systems, self-aligning mechanisms, lifelong and continual learning agents, embodied multimodal cognition, and the emergence of general-purpose foundation agents.

Our goal is to synthesize recent progress while articulating a vision for the next generation of foundation models: models that are not only scalable and multimodal but also capable of reasoning, grounding, and adapting to complex, real-world environments. This survey serves both as a technical reference and as a roadmap for researchers and practitioners navigating the future of large-scale, multimodal, and distributed AI systems.
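To make the memory argument behind ZeRO-style sharding concrete, the sketch below partitions a flat parameter vector and its Adam optimizer states across simulated data-parallel ranks, so each rank keeps only about 1/N of the optimizer memory. This is a single-process illustration of the stage-1 idea under stated assumptions: the rank count, the helper names such as shard_params and local_adam_step, and the plain-NumPy Adam step are illustrative, not the DeepSpeed API.

```python
# Minimal sketch of the ZeRO stage-1 idea: optimizer states are partitioned
# across data-parallel ranks so no single rank holds the full Adam moments.
# Single-process simulation; helper names and sizes are illustrative assumptions.
import numpy as np

def shard_params(flat_params: np.ndarray, num_ranks: int) -> list[np.ndarray]:
    """Split a flat parameter vector into one shard per data-parallel rank."""
    return np.array_split(flat_params, num_ranks)

def local_adam_step(shard, grad_shard, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Each rank updates Adam moments only for its own shard (the memory savings)."""
    m = b1 * m + (1 - b1) * grad_shard
    v = b2 * v + (1 - b2) * grad_shard ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return shard - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Simulated setup: 1M parameters split across 4 data-parallel ranks.
num_ranks = 4
params = np.random.randn(1_000_000).astype(np.float32)
grads = np.random.randn(1_000_000).astype(np.float32)

param_shards = shard_params(params, num_ranks)
grad_shards = shard_params(grads, num_ranks)

updated_shards = []
for p_shard, g_shard in zip(param_shards, grad_shards):
    m = np.zeros_like(p_shard)  # first-moment state, held only on this rank
    v = np.zeros_like(p_shard)  # second-moment state, held only on this rank
    new_p, m, v = local_adam_step(p_shard, g_shard, m, v)
    updated_shards.append(new_p)

# In a real distributed system, an all-gather would reassemble the updated
# parameters on every rank; here we simply concatenate the shards.
updated_params = np.concatenate(updated_shards)
```

Because each simulated rank allocates moment buffers only for its shard, total optimizer-state memory per rank shrinks roughly linearly with the number of ranks, which is the core trade-off (memory for communication) that ZeRO-style partitioning exploits.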
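As a concrete illustration of the contrastive alignment objective popularized by CLIP, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired image and text embeddings, pulling matched pairs together and pushing mismatched pairs apart. The encoders are omitted, and the batch size, embedding dimension, and temperature are illustrative assumptions rather than any particular model's configuration.

```python
# Minimal sketch of CLIP-style contrastive alignment between image and text
# embeddings. Assumes precomputed features from hypothetical encoders.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))            # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of 8 image/text embedding pairs with dimension 512.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(contrastive_alignment_loss(image_emb, text_emb))
```

The temperature scales the sharpness of the softmax over in-batch negatives; a shared embedding space trained with this kind of objective is what lets downstream multimodal LLMs treat aligned image and text tokens interchangeably.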