Understanding GPU architectures and selecting the right hardware for your AI workloads is critical for performance and cost optimization.
## GPU Comparison for AI
### NVIDIA GPUs

| Model | VRAM | FP32 TFLOPS | Best For | Price Range |
|-------|------|-------------|----------|-------------|
| RTX 4090 | 24GB | 82.6 | Development, small models | $1,600 |
| A100 | 40/80GB | 19.5 | Training medium models | $10,000-15,000 |
| H100 | 80GB | 60 | Large model training | $30,000-40,000 |
| B200 (Blackwell) | 192GB | 90 | Foundation models | $40,000+ |
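Raw FP32 throughput per dollar can be read straight off the table. A minimal sketch, using the table's TFLOPS figures and assuming the midpoint of each price range:

```python
# Rough cost-per-TFLOP comparison using the table's figures.
# Midpoints assumed for price ranges; actual street prices vary.
gpus = {
    "RTX 4090": {"tflops": 82.6, "price": 1_600},
    "A100":     {"tflops": 19.5, "price": 12_500},
    "H100":     {"tflops": 60.0, "price": 35_000},
}

for name, spec in gpus.items():
    dollars_per_tflop = spec["price"] / spec["tflops"]
    print(f"{name}: ${dollars_per_tflop:,.0f} per FP32 TFLOP")
```

By this metric the consumer card wins easily, which is exactly why raw FP32 $/TFLOP is misleading on its own: it ignores VRAM capacity, multi-GPU interconnects, and tensor-core throughput at lower precisions, which is where datacenter GPUs earn their price.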
### AMD GPUs

- MI300X: 192GB HBM3, competitive with H100
- MI250X: 128GB, strong for research workloads
- ROCm software stack: open-source alternative to CUDA
### Google TPUs

- TPU v5e: cost-optimized for inference
- TPU v5p: high-performance training
- Cloud-only availability: integrated with Vertex AI
## Memory Architecture
### VRAM Considerations

- LLM rule of thumb: parameter count × 2 bytes (FP16) = minimum VRAM for the weights alone
- 70B parameter model: requires ~140GB VRAM just for FP16 weights
- Quantization: 4-bit cuts weight memory by 75% versus FP16
- Multi-GPU: split the model across devices when a single card's VRAM is too small
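The rule of thumb above is simple arithmetic: weight memory equals parameter count times bytes per parameter. A minimal sketch (the helper name is illustrative; this counts weights only, not activations, KV cache, or optimizer state):

```python
# VRAM rule of thumb: weight memory = parameter count × bytes per parameter.
# Weights only -- activations, KV cache, and optimizer state add more on top.
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Estimate GB of VRAM needed just to hold the model weights."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(weight_vram_gb(70, 16))  # FP16:  140.0 GB
print(weight_vram_gb(70, 4))   # 4-bit: 35.0 GB -- the 75% reduction
```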
### Memory Hierarchy

1. Registers: 1 cycle latency, tiny capacity
2. L1 Cache: ~4 cycles, 128KB per SM
3. L2 Cache: ~200 cycles, 40-60MB
4. HBM/VRAM: ~450 cycles, 24-192GB
5. System RAM: ~600+ cycles, capacity limited only by the platform
6. NVMe Storage: millisecond latency, effectively unlimited capacity
## PCIe and Interconnects
### PCIe Generations

- PCIe 3.0: 16 GB/s (x16 slot) - bottleneck for data transfer
- PCIe 4.0: 32 GB/s - minimum for modern GPUs
- PCIe 5.0: 64 GB/s - emerging standard
- NVLink: 900 GB/s bidirectional - essential for multi-GPU
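A back-of-envelope calculation shows why these numbers matter: moving the FP16 weights of a 70B model (~140GB, per the VRAM section) takes very different amounts of time over each link. This sketch assumes each interconnect hits its full theoretical bandwidth, which real transfers do not:

```python
# Time to move ~140GB of model weights (70B params, FP16) over each link,
# assuming full theoretical bandwidth (real transfers achieve less).
model_gb = 140

links = {  # GB/s, from the list above
    "PCIe 3.0": 16,
    "PCIe 4.0": 32,
    "PCIe 5.0": 64,
    "NVLink":   900,
}

for name, gb_per_s in links.items():
    print(f"{name}: {model_gb / gb_per_s:.1f} s")
```

Nine seconds over PCIe 3.0 versus a fraction of a second over NVLink is the difference between interconnects being invisible and being the bottleneck when tensor-parallel training shuttles weights and activations between GPUs every step.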
### Bandwidth Impact

```python
# Measure PCIe bandwidth by timing a host-to-device transfer
import time

import torch

tensor = torch.randn(1000, 1000, 1000)  # ~4GB of FP32 data

# CPU to GPU transfer
start = time.time()
tensor_gpu = tensor.cuda()
torch.cuda.synchronize()  # wait for the async copy to finish
pcie_time = time.time() - start

print(f"PCIe transfer: {4 / pcie_time:.2f} GB/s")
# PCIe 3.0: ~12-14 GB/s
# PCIe 4.0: ~24-28 GB/s
```