What computing performance do GPU servers offer?
Performance specs (using the RTX 4090 as an example): FP32 compute of 82.6 TFLOPS; memory bandwidth of 1008 GB/s; 16384 CUDA cores; 128 ray tracing cores; 512 Tensor cores. Real-world performance: BERT training runs 15x faster, and Stable Diffusion generates an image in 3 seconds.
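The headline FP32 and bandwidth figures are theoretical peaks, and they can be reproduced from lower-level specs. The sketch below assumes values not given in the answer above: a 2.52 GHz boost clock, a 384-bit memory bus, and a 21 Gbps per-pin data rate (the commonly published RTX 4090 figures), plus the usual convention of counting one fused multiply-add as 2 floating-point operations per core per cycle. The function names are illustrative only.

```python
def peak_fp32_tflops(cuda_cores, boost_clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical peak FP32 throughput in TFLOPS.

    Assumes each CUDA core retires one fused multiply-add per cycle,
    which is counted as 2 floating-point operations.
    """
    gflops = cuda_cores * flops_per_core_per_cycle * boost_clock_ghz
    return gflops / 1000.0


def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: width of the memory interface in bits
    data_rate_gbps: effective per-pin data rate in Gbit/s
    """
    return bus_width_bits * data_rate_gbps / 8


# Assumed RTX 4090 inputs (not stated in the answer above):
# 2.52 GHz boost clock, 384-bit bus, 21 Gbps effective data rate.
tflops = peak_fp32_tflops(cuda_cores=16384, boost_clock_ghz=2.52)
bandwidth = memory_bandwidth_gbs(bus_width_bits=384, data_rate_gbps=21.0)

print(f"Peak FP32: {tflops:.1f} TFLOPS")
print(f"Peak memory bandwidth: {bandwidth:.0f} GB/s")
```

Under these assumptions the results match the 82.6 TFLOPS and 1008 GB/s quoted above. Real workloads rarely sustain these peaks, so measured figures such as training speedups depend on the model, batch size, and precision used.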