At the World Artificial Intelligence Conference (WAIC) held in Shanghai from July 26th to 29th, Huawei unveiled the Ascend 384 Ultra-Node, also known as the Atlas 900 A3 SuperPoD, for the first time at booth H1-A301. This groundbreaking system, the largest ultra-node of its kind in the industry, has captured significant attention as a ‘gem of the exhibition’.
The Ascend 384 Ultra-Node represents a departure from the traditional CPU-centric Von Neumann architecture, introducing an innovative peer-to-peer computing model. This architecture extends the internal server bus to an entire rack, and even across multiple racks, fundamentally transforming data transmission and processing methods. Traditional AI training clusters, built by stacking servers, storage, and network devices, often suffer from low resource utilization and frequent failures, posing significant challenges to AI development.

The Ascend Ultra-Node, by connecting multiple NPUs (Neural Processing Units) via a high-speed bus, overcomes interconnection bottlenecks, enabling the ultra-node to function collaboratively as a single, powerful computing unit.
Key advancements include:
Communication Bandwidth Leap: Cross-node communication bandwidth has been increased by 15 times, leading to significantly faster data transfer speeds.
Communication Latency Reduction: Communication latency has been reduced tenfold, from 2μs to 0.2μs, minimizing data processing waiting times.
Superior Interconnection Capabilities: The system supports point-to-point interconnection of up to 384 NPUs at extreme bandwidth. Notably, it is the industry’s only product that can run all expert parallelism (EP) schemes for Mixture-of-Experts (MoE) models within a single ultra-node domain, making it an optimal solution for MoE training and inference and greatly enhancing efficiency.
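To see why keeping expert parallelism inside one ultra-node domain matters, consider how a MoE layer routes tokens: each device must ship its tokens to whichever device hosts the chosen expert, producing an all-to-all exchange whose cost is set by interconnect bandwidth and latency. The sketch below illustrates that routing pattern in plain Python; it uses no Ascend APIs, and all names and parameter values are hypothetical.

```python
import random

def moe_all_to_all(num_devices=8, experts_per_device=2,
                   tokens_per_device=16, seed=0):
    """Illustrative MoE expert-parallel dispatch: each device routes its
    tokens to the device hosting the selected expert (an all-to-all
    exchange). Hypothetical sketch -- not Ascend/CANN API code."""
    rng = random.Random(seed)
    num_experts = num_devices * experts_per_device
    # send_counts[src][dst] = tokens device `src` must ship to device `dst`
    send_counts = [[0] * num_devices for _ in range(num_devices)]
    for src in range(num_devices):
        for _ in range(tokens_per_device):
            expert = rng.randrange(num_experts)   # top-1 routing decision
            dst = expert // experts_per_device    # device hosting that expert
            send_counts[src][dst] += 1
    # Nearly every device has traffic for every other device, which is why
    # EP is so sensitive to cross-node bandwidth and latency.
    cross_device = sum(send_counts[s][d]
                       for s in range(num_devices)
                       for d in range(num_devices) if s != d)
    return send_counts, cross_device

counts, cross = moe_all_to_all()
print(f"tokens crossing devices: {cross} of {8 * 16}")
```

With random top-1 routing, roughly (num_devices − 1)/num_devices of all tokens leave their home device every MoE layer, which is why fitting the whole EP group inside one high-bandwidth ultra-node domain pays off.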

The Ascend 384 Ultra-Node boasts three primary advantages:
Massive Bandwidth: The communication bandwidth between any two AI processors within the ultra-node is 15 times higher than in traditional architectures. Furthermore, single-hop communication latency within the ultra-node is cut to one-tenth, ensuring smoother data interaction.
Ultra-Low Latency: The Ascend Ultra-Node supports unified global memory addressing, enabling more efficient memory-semantic communication. Its low-latency, instruction-level memory-semantic communication suits the small-packet traffic of large model training and inference, improving the efficiency of small-packet transfers and of the discrete random access patterns found in expert networks. Critically, the Ascend 384 Ultra-Node is reportedly the industry’s first solution to break the 15ms decode latency barrier, meeting the demands of real-time, in-depth reasoning user experiences.
Exceptional Performance: Actual tests indicate that on an Ascend Ultra-Node cluster, training performance for dense models with hundreds of billions of parameters, such as LLaMA 3, can exceed 2.5 times that of traditional clusters. For multimodal and MoE models such as Qwen and DeepSeek, which incur higher communication overheads, the improvement can exceed 3 times.
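The small-packet argument above can be made concrete with the standard first-order transfer-time model t ≈ latency + size / bandwidth: below some packet size, fixed latency dominates and extra bandwidth barely helps, which is why cutting per-hop latency (the quoted 2μs → 0.2μs) matters so much for the discrete random access of expert networks. The numbers below merely illustrate the model; the 100 GB/s bandwidth figure is an arbitrary illustrative value, not an Ascend measurement.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_GBps):
    """First-order model: transfer time = fixed latency + serialization time.
    bandwidth_GBps * 1e3 converts GB/s into bytes per microsecond."""
    return latency_us + size_bytes / (bandwidth_GBps * 1e3)

# Compare a small MoE routing packet with a large tensor transfer under the
# 2 us vs 0.2 us latencies quoted in the text (illustrative 100 GB/s link).
for size in (4 * 1024, 64 * 1024 * 1024):  # 4 KiB vs 64 MiB
    slow = transfer_time_us(size, latency_us=2.0, bandwidth_GBps=100)
    fast = transfer_time_us(size, latency_us=0.2, bandwidth_GBps=100)
    print(f"{size:>10} B: {slow:8.2f} us -> {fast:8.2f} us "
          f"({slow / fast:.1f}x faster)")
```

For the 4 KiB packet the tenfold latency cut speeds the transfer up several times over, while for the 64 MiB transfer it is negligible: small-packet workloads are latency-bound, large ones bandwidth-bound.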