Designing and deploying GPU clusters requires understanding networking, storage, and orchestration at scale.
## Cluster Topologies
### Single-Node Multi-GPU

- 4-8 GPUs connected via NVLink or PCIe
- Shared CPU, memory, and storage
- Best for: training and fine-tuning models up to roughly 70B parameters (larger runs typically need memory-saving techniques or more nodes)
- Limitations: single point of failure; total VRAM capped by one node
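The "up to roughly 70B parameters" figure follows from a simple memory budget. The sketch below uses the common rule of thumb that mixed-precision training with Adam needs about 16 bytes per parameter (weights, gradients, and optimizer states; activations excluded) — these are rough estimates, not measurements:

```python
# Rough single-node VRAM budget check. The ~16 bytes/param figure and the
# assumption that state shards evenly across GPUs (e.g. ZeRO stage 3) are
# rules of thumb, not guarantees.

def training_memory_gb(params_billions: float, bytes_per_param: int = 16) -> float:
    """Approximate training memory for weights + gradients + Adam states."""
    return params_billions * 1e9 * bytes_per_param / 1e9

def fits_on_node(params_billions: float, gpus: int = 8, vram_gb: int = 80) -> bool:
    """Does the (sharded) training state fit in the node's aggregate VRAM?"""
    return training_memory_gb(params_billions) <= gpus * vram_gb

print(fits_on_node(7))    # a 7B model (~112 GB) fits in 8x80 GB easily
print(fits_on_node(70))   # a 70B model (~1120 GB) exceeds 640 GB
```

A full 70B training run therefore only fits on one 8x80 GB node with additional tricks (CPU/NVMe offload, parameter-efficient fine-tuning); otherwise it spills over to multi-node clusters.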
### Multi-Node Clusters

- Dozens to thousands of GPU nodes
- Connected via high-speed networking (InfiniBand, RoCE)
- Distributed storage (Lustre, GPFS, NFS)
- Best for: foundation-model training, large-scale inference
### Network Architecture

```
GPU Cluster Network Design:
┌─────────────────────────────────────┐
│ Management Network (1GbE)           │
├─────────────────────────────────────┤
│ Compute Network (200Gbps IB/RoCE)   │
├─────────────────────────────────────┤
│ Storage Network (100Gbps)           │
└─────────────────────────────────────┘

Each GPU node:
- 8x A100/H100 GPUs
- 2x AMD EPYC CPUs (128 cores total)
- 1TB+ RAM
- 4TB+ NVMe local storage
- Dual InfiniBand HCAs
```
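The 200 Gbps compute network matters because gradient synchronization is bandwidth-bound. A back-of-envelope estimate for a ring all-reduce (assumed algorithm; the text does not specify one) over that fabric:

```python
# Ring all-reduce moves roughly 2*(n-1)/n of the buffer over each link,
# so the per-link bandwidth bounds the synchronization time. Assumes the
# full 200 Gbps link rate is usable, which real fabrics rarely achieve.

def allreduce_seconds(size_gb: float, nodes: int, link_gbps: float = 200.0) -> float:
    size_gbit = size_gb * 8  # convert gigabytes to gigabits
    return 2 * (nodes - 1) / nodes * size_gbit / link_gbps

# Gradients for a 70B-parameter model in bf16 (~140 GB) across 16 nodes:
print(f"{allreduce_seconds(140, 16):.1f} s per naive full-gradient all-reduce")
```

In practice frameworks overlap communication with backward computation and shard gradients, so the wall-clock cost is much lower than this naive figure; the estimate mainly shows why 1GbE is relegated to the management network.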
## Resource Management
### Slurm for GPU Scheduling

```bash
# Configure Slurm for a GPU cluster
# /etc/slurm/slurm.conf
NodeName=gpu[001-128] Gres=gpu:a100:8 CPUs=128 RealMemory=1000000

# Submit a job: 16 nodes, 8 GPUs per node
sbatch --gres=gpu:8 --nodes=16 train_large_model.sh

# Check GPU allocation across the cluster
squeue --Format=jobid,username,nodelist,gres
```
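The submitted `train_large_model.sh` is not shown in the source; a minimal sketch (job name, time limit, and the `train.py` entry point are hypothetical) might look like:

```shell
#!/bin/bash
#SBATCH --job-name=train-large
#SBATCH --nodes=16
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --time=72:00:00

# One task per GPU; srun launches the command on every allocated node.
srun python train.py --config config.yaml
```

Requesting `--ntasks-per-node=8` to match the 8 GPUs per node is the usual pattern for one-process-per-GPU distributed training.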
### Kubernetes with GPU Operator

```yaml
# Pod requesting GPUs (requires the NVIDIA GPU Operator on the cluster)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 8
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
```
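The pod above only schedules once the GPU Operator has exposed `nvidia.com/gpu` resources to Kubernetes. Installation is typically done via NVIDIA's Helm chart; the release and namespace names below are conventional choices, not requirements:

```shell
# Install the NVIDIA GPU Operator from its Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```

The operator deploys the driver, container toolkit, and device plugin as DaemonSets, so GPU nodes need no manual per-node setup.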