
NVIDIA DGX GH200: Decoding the Language of Massive Memory

Feb 06, 2024


At COMPUTEX 2023, NVIDIA unveiled the NVIDIA DGX GH200, a groundbreaking advance in GPU-accelerated computing designed to handle the most demanding AI workloads. This article delves into the crucial aspects of the NVIDIA DGX GH200 architecture and explores the capabilities of NVIDIA Base Command, which streamlines rapid deployment, expedites user onboarding, and simplifies system management.

To empower scientists tackling the most demanding AI challenges, NVIDIA introduced the NVIDIA Grace Hopper Superchip and paired it with the NVLink Switch System, uniting up to 256 GPUs in a single NVIDIA DGX GH200 system. This configuration gives the DGX GH200 access to 144 terabytes of memory through the GPU shared memory programming model at high speed over NVLink.
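The 144-terabyte figure can be checked with a quick back-of-envelope calculation, assuming each of the 256 superchips contributes its 480 GB of LPDDR5 CPU memory plus 96 GB of HBM3, and using 1 TB = 1,024 GB (an illustrative sketch, not an official spec):

```python
# Sketch: total NVLink-addressable memory in a DGX GH200 system.
SUPERCHIPS = 256   # Grace Hopper Superchips per system
LPDDR5_GB = 480    # CPU memory per superchip
HBM3_GB = 96       # GPU memory per superchip

total_gb = SUPERCHIPS * (LPDDR5_GB + HBM3_GB)
total_tb = total_gb / 1024

print(f"{total_gb} GB = {total_tb:.0f} TB")  # 147456 GB = 144 TB
```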

Compared with a single NVIDIA DGX A100 320 GB system, the NVIDIA DGX GH200 offers a nearly 500x increase in memory available to the GPU shared memory programming model over NVLink, effectively forming one giant, data-center-sized GPU. Notably, the NVIDIA DGX GH200 achieves a historic milestone as the first supercomputer to surpass the 100-terabyte barrier for memory accessible to GPUs over NVLink.
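The "nearly 500x" claim lines up with the 320 GB of GPU memory in a single DGX A100 system, using the same memory totals as above (an illustrative ratio, not a benchmark result):

```python
# Sketch: memory ratio of DGX GH200 vs. a single DGX A100 320 GB system.
dgx_gh200_gb = 256 * (480 + 96)  # 147,456 GB addressable over NVLink
dgx_a100_gb = 320                # GPU memory of one DGX A100 320 GB

ratio = dgx_gh200_gb / dgx_a100_gb
print(f"{ratio:.0f}x")  # 461x, i.e. "nearly 500x"
```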

Advancements in NVLink technology lead to increased GPU memory capacity

The Architectural Framework of the NVIDIA DGX GH200 System

The foundational components of the NVIDIA DGX GH200 architecture consist of the NVIDIA Grace Hopper Superchip and the NVLink Switch System. The NVIDIA Grace Hopper Superchip integrates the Grace and Hopper architectures through NVIDIA NVLink-C2C, establishing a coherent memory model for both CPU and GPU. This innovative approach enhances connectivity and efficiency. The NVLink Switch System, leveraging the fourth generation of NVLink technology, extends NVLink connections across superchips, creating a seamless, high-bandwidth, multi-GPU system.

Within the NVIDIA DGX GH200, each NVIDIA Grace Hopper Superchip carries 480 GB of LPDDR5 CPU memory, which consumes roughly one-eighth the power of DDR5, alongside 96 GB of high-speed HBM3. The NVIDIA Grace CPU and Hopper GPU are interconnected via NVLink-C2C, delivering 7 times more bandwidth than PCIe Gen5 while consuming only one-fifth of the power.
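The 7x bandwidth claim is consistent with NVLink-C2C's 900 GB/s against a PCIe Gen5 x16 link, assuming roughly 128 GB/s of bidirectional PCIe bandwidth (an assumed round figure for illustration; exact PCIe throughput varies with encoding and protocol overhead):

```python
# Sketch: NVLink-C2C vs. PCIe Gen5 x16 bandwidth (illustrative numbers).
NVLINK_C2C_GBPS = 900     # total bidirectional bandwidth quoted for NVLink-C2C
PCIE_GEN5_X16_GBPS = 128  # ~64 GB/s per direction (assumed, pre-overhead)

print(f"{NVLINK_C2C_GBPS / PCIE_GEN5_X16_GBPS:.1f}x")  # ~7.0x
```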

The NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH200 system. This comprehensive interconnectivity ensures that every GPU in the DGX GH200 can access the memory of other GPUs, including the extended GPU memory of all NVIDIA Grace CPUs, operating at a remarkable speed of 900 GBps.
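Since every GPU can reach peer memory at 900 GBps, the fabric's aggregate injection bandwidth works out as follows (a back-of-envelope figure derived from the per-GPU rate above, not an official specification):

```python
# Sketch: aggregate NVLink injection bandwidth across all 256 GPUs.
GPUS = 256
PER_GPU_GBPS = 900  # per-GPU NVLink access speed quoted for DGX GH200

aggregate_tbps = GPUS * PER_GPU_GBPS / 1000
print(f"{aggregate_tbps:.1f} TB/s aggregate")  # 230.4 TB/s
```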

The compute baseboards hosting the Grace Hopper Superchips are linked to the NVLink Switch System through a custom cable harness, establishing the first layer of the NVLink fabric. LinkX cables then extend this connectivity in the second layer of the NVLink fabric, completing the intricate architecture of the NVIDIA DGX GH200. 

Topology of a fully connected NVIDIA NVLink Switch System in NVIDIA DGX GH200 with 256 GPUs

In the DGX GH200 system, GPU threads can access peer HBM3 and LPDDR5 memory from other Grace Hopper Superchips in the NVLink network using an NVLink page table. NVIDIA Magnum IO acceleration libraries optimize GPU communications for efficiency, enhancing application scaling across all 256 GPUs.

Each Grace Hopper Superchip in DGX GH200 is coupled with one NVIDIA ConnectX-7 network adapter and one NVIDIA BlueField-3 NIC. The DGX GH200 boasts a bi-section bandwidth of 128 TBps and 230.4 TFLOPS of NVIDIA SHARP in-network computing, accelerating collective operations common in AI. It effectively doubles the NVLink Network System's bandwidth by minimizing communication overheads in collective operations.

For scalability beyond 256 GPUs, ConnectX-7 adapters can interconnect multiple DGX GH200 systems, creating an even larger solution. The power of BlueField-3 DPUs transforms any enterprise computing environment into a secure and accelerated virtual private cloud, enabling organizations to run application workloads securely in multi-tenant environments.

Target Applications and Performance Benefits

The significant advancement in GPU memory enhances the performance of AI and HPC applications that were previously constrained by GPU memory size. Many mainstream AI and HPC workloads can now fully reside in the collective GPU memory of a single NVIDIA DGX H100, making it the most performance-efficient training solution for such tasks.

However, for more demanding workloads, such as deep learning recommendation models with terabytes of embedding tables, terabyte-scale graph neural network training, or large data analytics tasks, the DGX GH200 demonstrates notable speedups of 4x to 7x. This underscores the DGX GH200 as the preferred solution for advanced AI and HPC models that require extensive GPU shared memory.

Benchmarking Performance in Giant Memory AI Workloads

Tailoring for The Most Challenging Workloads

Each component in the DGX GH200 is meticulously chosen to minimize bottlenecks, optimize network performance for crucial workloads, and fully exploit the scale-up hardware capabilities. The result is linear scalability and efficient utilization of the extensive shared memory space.

To maximize the potential of this advanced system, NVIDIA has also engineered an exceptionally high-speed storage fabric that operates at peak capacity. This fabric handles diverse data types, such as text, tabular data, audio, and video, simultaneously while consistently delivering high performance.

Comprehensive NVIDIA Solution

DGX GH200 is equipped with NVIDIA Base Command, encompassing an AI workload-optimized operating system, a cluster manager, and libraries that enhance compute, storage, and network infrastructure, all tailored for the DGX GH200 system architecture.

Additionally, DGX GH200 incorporates NVIDIA AI Enterprise, offering a comprehensive set of software and frameworks meticulously optimized to simplify AI development and deployment. This end-to-end solution empowers customers to concentrate on innovation, alleviating concerns about the intricacies of managing their IT infrastructure.

The NVIDIA DGX GH200 AI supercomputer includes NVIDIA Base Command and NVIDIA AI Enterprise

Final Thoughts

At the forefront of delivering this exceptional supercomputer is GreenNode. Committed to making the DGX GH200 accessible, GreenNode is your gateway to harnessing the power of this first-of-its-kind system. With a dedication to advancing technology and overcoming the most complex challenges, GreenNode paves the way for a future where groundbreaking achievements in AI and HPC become more achievable than ever. Explore the possibilities with GreenNode and embark on a journey of unprecedented computational prowess.
