Understanding GPU Server Performance

The first step is not optimization!

First, you need to understand how much optimization your server actually needs by benchmarking critical performance metrics and determining how far you can push the hardware.

Proper optimization can easily deliver 30-40% performance gains, getting far more out of the same high-end hardware.

The indicators that we’re speaking about are:

  • GPU Utilization
  • GPU Memory
  • Training Speed
  • Power Draw

Over the years, we’ve optimized more than a thousand servers, and proper configuration consistently lands in that 30-40% range.

Here’s how to identify whether your GPU server needs optimization:

| Metric | Unoptimized | Optimized | Improvement |
|---|---|---|---|
| GPU Utilization | 60-70% | 95-98% | +38% |
| Memory Bandwidth | 450 GB/s | 615 GB/s | +37% |
| Training Speed | 100 epochs/hr | 140 epochs/hr | +40% |
| Power Efficiency | 250W avg | 220W avg | -12% |

NVIDIA Driver Optimization: Installation & Setup

To optimize GPU performance, the first step is verifying that you have the correct NVIDIA driver, typically the latest version available for your GPU model. The driver is the fundamental link between your hardware and operating system, while the CUDA toolkit is what enables parallel computing and keeps GPU utilization stable. A driver mismatch can be the limiting factor in your entire GPU setup.

Here are three simplified steps to get your GPU drivers right:

1. Choose Driver Version

The first step is to choose the correct driver version for your GPU. Each graphics card generation is designed to work best with a specific CUDA driver and CUDA toolkit.

Here’s what you need to do:

  1. Visit NVIDIA Driver Downloads: First, go to the NVIDIA Driver Download page and select your GPU model, operating system, and CUDA version.
  2. Pick the right CUDA Toolkit: Then, align your CUDA toolkit with the version required by your development or runtime environment.
  3. Choose Certified Drivers (AI/ML): If you’re running AI/ML workloads, select a driver version validated for your framework, such as TensorFlow or PyTorch.
  4. Use Game Ready or Studio Drivers: For gaming servers or rendering workloads, choose a “Game Ready” or “Studio” driver.

2. Install NVIDIA Drivers

Once you’ve determined the correct driver for your GPU, it’s time to install it by following the steps for your operating system.

For Linux (Ubuntu/Debian)

1. First, you need to remove any previously installed drivers:

    sudo apt-get purge 'nvidia*'
    sudo apt-get autoremove

2. Then, you need to add the graphics drivers repository:

    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt update

3. Next, install the specific driver version:

    sudo apt install nvidia-driver-535

4. Finally, check whether the installation was successful:

    nvidia-smi

For Windows

Installing drivers on Windows is much more user-friendly:

  • First, navigate to the official NVIDIA Driver Download page.
  • Choose Custom installation, then select Clean Installation.
  • Once installed, verify via Command Prompt or PowerShell:

    nvidia-smi

Tip: You can also adjust settings using the NVIDIA Control Panel → Manage 3D Settings → Power Management Mode → Prefer Maximum Performance.

3. Driver Configuration

Once the GPU driver has been successfully installed on your server, it’s vital to change a few default settings to achieve maximum GPU performance. We’ve prepared some easy steps for Windows and Linux to keep things simple:

For Linux

Linux offers in-depth control over your GPU parameters through the “nvidia-smi” command in the Terminal. You can manage clock speeds, power limits, and persistence mode directly through the GPU driver. Here’s how:

1. First, enable “persistence mode” to keep the driver loaded at all times:

    sudo nvidia-smi -pm 1

2. Then, you can set the maximum power draw for your GPU (advanced):

    sudo nvidia-smi -pl 350

3. Next, you can adjust the application clocks (memory and graphics):

    sudo nvidia-smi -ac 1593,1400

4. Finally, you can check GPU usage, memory, and temperature:

    nvidia-smi

IMPORTANT: If the system becomes unstable or crashes due to incorrect power draw settings, use the following commands (in order) to revert everything to default:

    sudo nvidia-smi -rac
    sudo nvidia-smi -pl <default wattage>    # find the "Default Power Limit" with: nvidia-smi -q -d POWER
    sudo nvidia-smi -pm 0
    sudo reboot

You can also set environment variables such as:

    export CUDA_VISIBLE_DEVICES=0,1

This restricts your processes to specific GPUs, which is ideal for AI training, rendering, or other parallel computation workloads.

For Windows

If you’re on Windows, you can either use the NVIDIA Control Panel or open the Command Prompt (CMD) or PowerShell to input the same commands quickly.

Option #1 – Using PowerShell or CMD:

    nvidia-smi -pm 1
    nvidia-smi -pl 350
    nvidia-smi -ac 1593,1400

Note that persistence mode is a Linux feature; on Windows, nvidia-smi may report the -pm command as unsupported.

Option #2 – Using NVIDIA Control Panel:

1. First, locate and open the NVIDIA Control Panel (Desktop).
2. Head to Manage 3D Settings → Power Management Mode.
3. Then select “Prefer Maximum Performance”.

CUDA Toolkit Optimization for GPU Server Workloads

The next critical step to boost GPU performance is optimizing the CUDA toolkit, especially when dealing with compute-heavy projects. The CUDA configuration directly impacts your performance, memory bandwidth, and parallel efficiency.

Hence, understanding how each CUDA version works with your NVIDIA driver, hardware, operating system, and supported programming languages ensures every GPU instance performs at its peak.

Choose CUDA Version

Choosing the correct CUDA version is critical for GPU performance. Each CUDA release brings new features, GPU optimization improvements, and wider hardware support.

Hence, using a version unsupported by your graphics card or compiler can easily cause errors and severely reduced functionality.

Here’s what you’ll need to do to get the version right (see the sketch after this list for a quick runtime check):

  • #1 – First, verify CUDA driver compatibility using the CUDA Compatibility Guide.
  • #2 – Then, align the CUDA toolkit with your NVIDIA driver and GCC compiler versions.
  • #3 – Lastly, match the CUDA version with the software framework you will be using.
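
To sanity-check the pairing at runtime, you can compare the CUDA version the installed driver supports against the version your toolkit was built with. Here’s a minimal sketch using two standard CUDA runtime calls (the version math follows CUDA’s usual encoding, e.g. 12020 → 12.2):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int driverVer = 0, runtimeVer = 0;
        cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
        cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime this binary was built against
        printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
               driverVer / 1000, (driverVer % 1000) / 10,
               runtimeVer / 1000, (runtimeVer % 1000) / 10);
        if (runtimeVer > driverVer)
            printf("Warning: the toolkit is newer than the driver supports.\n");
        return 0;
    }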

Installation for Windows:

For Windows, the installation couldn’t be simpler. Just download the CUDA installer (.exe), start it, and during the installation, select “Custom Installation”. This ensures you install only the tools and versions that you actually need.

Installation for Linux:

    wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run

    sudo sh cuda_12.2.0_535.54.03_linux.run

    echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    source ~/.bashrc

Then, you can verify the installation using the following command:

    nvcc --version

Note: Change the CUDA version in the “wget” URL to the one you actually need, and configure any custom paths you would like to use.

Memory Management

The right memory management technique depends on what you’re going to do with the GPU, and the wrong choice can backfire. That’s why we’ve noted when to use each of the techniques shown below, based on specific purposes (read carefully).

1. Pinned Memory

Pinned memory keeps data in a static, also known as “page-locked,” physical location on the host, which works great when you’re aiming to reduce transfer latency.

⚠️When to Use: Use page-locked memory for real-time workloads, frequent CPU → GPU transfers, or high-throughput parallel computation tasks.

    float *hostPtr;
    cudaHostAlloc(&hostPtr, size, cudaHostAllocDefault);           // page-locked host allocation
    cudaMemcpy(devicePtr, hostPtr, size, cudaMemcpyHostToDevice);  // fast DMA transfer, no staging copy
    cudaFreeHost(hostPtr);                                         // release with the matching free

2. Unified Memory

Unified memory enables the CPU and GPU to share a single memory space, meaning the system automatically moves, or “migrates,” data as needed.

⚠️When to Use: When you need simpler code, shared access between CPU and GPU, or dynamic data migration across multiple GPU instances.

    float *data;
    cudaMallocManaged(&data, size);        // one pointer, valid on both CPU and GPU
    myKernel<<<blocks, threads>>>(data);   // pages migrate to the GPU on demand
    cudaDeviceSynchronize();               // wait before touching the data on the CPU again

3. Pooling & Reuse

Repeated memory allocation and deallocation can severely slow down GPU performance. Memory pooling resolves this by allocating a block once and reusing it for multiple tasks!

⚠️When to Use: Use memory pooling only in long-running applications or loops where GPU memory is repeatedly allocated and released.

    float *poolData;
    cudaMallocManaged(&poolData, poolSize);        // allocate the pool once, up front
    for (int i = 0; i < numIterations; ++i) {
        myKernel<<<blocks, threads>>>(poolData);   // reuse the same block every iteration
    }
    cudaFree(poolData);                            // release once at the end

Multi-GPU CUDA Setup

If your workload requires more than one GPU, you’re almost certainly running parallel computing projects, which require a few specific tweaks.

1. GPU Assignment

If you have multiple GPUs, you need to assign each of them correctly so your workloads don’t overlap on the same device. This guarantees that no resources sit idle, or in other words “inactive.”

⚠️When to Use: Only assign your GPUs when you want to manually control provisioning for specific applications on parallel computing platforms.

  • Linux:

    export CUDA_VISIBLE_DEVICES=0,1

  • Windows (CMD):

    set CUDA_VISIBLE_DEVICES=0,1

  • Windows (PowerShell):

    $env:CUDA_VISIBLE_DEVICES="0,1"

Note: For example, if you set this to 0,1, your code can use only the first and second GPUs.
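
To confirm the mask took effect, a short check with the CUDA runtime works well. This is a minimal sketch; with CUDA_VISIBLE_DEVICES=0,1 set, the runtime reports exactly two devices, renumbered 0 and 1:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);  // counts only the GPUs exposed by CUDA_VISIBLE_DEVICES
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s\n", i, prop.name);
        }
        return 0;
    }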

2. Data Parallelism

Another optimization for multi-GPU setups is “data parallelism”, which lets you split your load into different parts and execute them simultaneously. Frameworks like TensorFlow and PyTorch largely handle this parallelism for you, so only minimal setup is needed when you use them.

However, the concept is simple (see the sketch after this list):

  • Split your dataset into batches for each GPU.
  • Each GPU computes its batch independently.
  • Aggregate the results on the CPU or main GPU.
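
Here’s what that pattern looks like in raw CUDA, stripped down to the essentials. This is an illustrative sketch, not production code: processBatch is a hypothetical kernel, host memory should ideally be pinned for truly asynchronous copies, and error checking is omitted for brevity:

    #include <cuda_runtime.h>

    // Hypothetical kernel that processes one shard of the dataset.
    __global__ void processBatch(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
    }

    void runDataParallel(const float* hostIn, float* hostOut, int total) {
        int nDev = 0;
        cudaGetDeviceCount(&nDev);
        int shard = total / nDev;      // assume total divides evenly
        float *dIn[16], *dOut[16];     // assumes at most 16 GPUs

        // Launch one shard per GPU; async calls let all devices run at once.
        for (int dev = 0; dev < nDev; ++dev) {
            cudaSetDevice(dev);
            cudaMalloc(&dIn[dev], shard * sizeof(float));
            cudaMalloc(&dOut[dev], shard * sizeof(float));
            cudaMemcpyAsync(dIn[dev], hostIn + dev * shard,
                            shard * sizeof(float), cudaMemcpyHostToDevice);
            processBatch<<<(shard + 255) / 256, 256>>>(dIn[dev], dOut[dev], shard);
            cudaMemcpyAsync(hostOut + dev * shard, dOut[dev],
                            shard * sizeof(float), cudaMemcpyDeviceToHost);
        }
        // Wait for every GPU to finish, then release its buffers.
        for (int dev = 0; dev < nDev; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
            cudaFree(dIn[dev]);
            cudaFree(dOut[dev]);
        }
    }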

Inter-GPU communication is what allows multiple GPUs to share data, synchronize results, and operate as a single high-performance system. Optimizing this layer is key to parallel computation scalability.

⚠️When to Use: When your GPUs need to exchange data frequently during AI training, simulations, or rendering workloads.

Read Also: What is High Performance Computing (HPC)?

GPU Clusters: Network Optimizations

When going beyond a single node and scaling out GPU servers, you’re moving into clusters, where network speed becomes crucial for GPU performance. We’re talking about high-bandwidth, low-latency connections that ultimately speed up your GPU servers’ data transfer and synchronization.

Here’s a quick comparison to begin with:

Network Performance Comparison:

| Technology | Bandwidth | Latency | Use Case |
|---|---|---|---|
| Ethernet 10G | 10 Gbps | 5–10 µs | Small clusters |
| Ethernet 100G | 100 Gbps | 2–5 µs | Medium clusters |
| InfiniBand HDR | 200 Gbps | 0.6 µs | HPC clusters |
| NVLink 4.0 | 900 GB/s | < 1 µs | Intra-node links |

Here are the four fundamental network configurations:

1. RDMA Configuration

Remote Direct Memory Access (RDMA) allows one system’s GPU or CPU to directly read/write another system’s memory without involving the kernel or CPU. This reduces latency and improves throughput.

For Linux:

    sudo apt install rdma-core ibverbs-utils perftest
    sudo service rdma start
    ibv_devinfo

For Windows:
Install Mellanox WinOF-2 drivers, then verify using PowerShell:

    Get-NetAdapterRdma

Tip: Verify that your NIC supports RoCE v2 (RDMA over Converged Ethernet), the variant of RDMA that runs over Ethernet-based clusters.

2. InfiniBand Setup

The second optimization is the InfiniBand setup, the most popular interconnect for high-performance GPU cluster configurations due to its extremely low latency and high bandwidth.

It is commonly used in AI supercomputing and data centers!

For Linux:

    sudo apt install infiniband-diags
    sudo ibstat
    sudo ibstatus

For Windows:
First, install WinOF-2 for Windows and verify the InfiniBand links under Device Manager → Network Adapters.

3. GPUDirect Configuration

NVIDIA GPUDirect enables direct data exchange between GPUs and NICs, bypassing the CPU to minimize latency. This is absolutely essential for multi-node GPU clusters that rely on fast data sharing.

For Linux:

    wget https://content.mellanox.com/ofed/MLNX_OFED-5.8-1.0.1.1/MLNX_OFED_LINUX-5.8-1.0.1.1-ubuntu22.04-x86_64.tgz
    tar -xvf MLNX_OFED_LINUX-5.8-1.0.1.1-ubuntu22.04-x86_64.tgz
    cd MLNX_OFED_LINUX-5.8-1.0.1.1-ubuntu22.04-x86_64
    sudo ./mlnxofedinstall --add-kernel-support

    sudo modprobe nvidia-peermem

For Windows:
Currently, GPUDirect RDMA is officially supported only on Linux. However, GPUDirect Storage is partially supported on Windows Server with NVIDIA drivers ≥ 550.xx.

4. NCCL Optimization

The NVIDIA Collective Communication Library, known as NCCL, is designed for high-performance communication between multiple GPUs, both within and across nodes. Optimizing NCCL ensures maximum scaling efficiency for distributed workloads.

For Linux:

    # Set environment variables for NCCL
    export NCCL_DEBUG=INFO
    export NCCL_IB_HCA=mlx5_0
    export NCCL_SOCKET_IFNAME=eth0

For Windows:
NCCL is officially a Linux library. On Windows, frameworks like PyTorch and TensorFlow automatically fall back to other communication backends, so no manual NCCL configuration is needed.
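
To ground this on Linux, here’s a minimal single-process NCCL sketch that all-reduces one buffer per visible GPU. It uses the standard NCCL calls ncclCommInitAll and ncclAllReduce; the buffers are left uninitialized since the point is the communication pattern, and error checking is omitted for brevity. Compile with something like nvcc -lnccl:

    #include <nccl.h>
    #include <cuda_runtime.h>

    int main() {
        int nDev = 0;
        cudaGetDeviceCount(&nDev);

        ncclComm_t comms[16];                  // assumes at most 16 GPUs
        float *sendbuff[16], *recvbuff[16];
        cudaStream_t streams[16];
        const size_t count = 1 << 20;          // 1M floats per GPU

        // One buffer and stream per GPU
        for (int i = 0; i < nDev; ++i) {
            cudaSetDevice(i);
            cudaMalloc(&sendbuff[i], count * sizeof(float));
            cudaMalloc(&recvbuff[i], count * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }
        ncclCommInitAll(comms, nDev, NULL);    // NULL = use devices 0..nDev-1

        // Sum every GPU's buffer across all GPUs in one collective
        ncclGroupStart();
        for (int i = 0; i < nDev; ++i)
            ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < nDev; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            ncclCommDestroy(comms[i]);
        }
        return 0;
    }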

GPU Performance: Power Draw and Heat Management

High-performance GPU servers generate massive amounts of heat and draw significant power during compute-intensive workloads. Uncontrolled temperatures or voltages can lead to thermal throttling, reducing GPU usage and overall system performance.

THERMAL THROTTLING:

Thermal throttling occurs when a GPU (or CPU) automatically reduces its clock speeds and voltage to prevent overheating once it reaches a predefined temperature limit.

Optimal Operating Ranges:

Here are the optimal and maximum temperature ranges, as well as the potential action needed:

| Component | Optimal | Maximum | Action Required |
|---|---|---|---|
| GPU Core | 65-75°C | 🔥83°C | Increase cooling |
| Memory | 80-85°C | 🔥95°C | Check airflow |
| Hotspot | 75-85°C | 🔥110°C | Repaste/RMA |
| Power | 80% TDP | 100% TDP | Adjust limits |

Tip: You can monitor your GPU temperatures in real time using the NVIDIA Control Panel or third-party software to prevent overheating.
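
On servers, polling programmatically is often more practical. Here’s a minimal sketch using NVML, the monitoring library that ships with the NVIDIA driver and backs nvidia-smi (link with -lnvidia-ml; error handling omitted for brevity):

    #include <nvml.h>
    #include <cstdio>

    int main() {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);  // first GPU

        unsigned int temp = 0, powerMw = 0;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);  // core temperature, degrees C
        nvmlDeviceGetPowerUsage(dev, &powerMw);                      // power draw, milliwatts

        printf("GPU 0: %u C, %.1f W\n", temp, powerMw / 1000.0);
        nvmlShutdown();
        return 0;
    }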

How to Prevent Thermal Throttling?

Even the most powerful graphics cards slow down once they exceed safe operating temperatures and thermal throttling kicks in. So, proper power draw management, fan control, and cooling system configuration ensure stable performance under full load.

Here are the steps you need to undertake:

#1 – Set Power Draw Limits

Every GPU has a safe power threshold defined by the manufacturer. You can manually configure this limit to balance performance, temperature, and power efficiency. Use this only when your GPUs operate near or above 85-90°C under sustained load, or when your datacenter has restricted power availability.

For Linux:

    sudo nvidia-smi -pl 300

For Windows:

    nvidia-smi -pl 300

#2 – Fan Speed Optimization

Manually increasing the fan speed can prevent thermal throttling during long training or rendering sessions. While most server fans are auto-controlled, custom fan curves often deliver more stable thermals.

For Linux (nvidia-settings requires an X session):

    sudo nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=85"

For Windows:
You can manually set the fan speed via your GPU software under Performance → Device Settings → Cooling!

#3 – Cooling Optimizations

Beyond software controls, maintaining ideal thermals often comes down to the environment your GPU servers operate in. Even the best graphics cards can overheat in spaces without professional-grade airflow or climate management.

For maximum stability and cooling efficiency, consider hosting the infrastructure through ServerMania’s colocation for GPU infrastructure.

Our top-tier data centers are designed with high-density cooling systems, redundant power delivery, and optimized airflow to keep GPU usage high and temperatures low.

See Also: GPU Temperature Range Guide

Solving GPU Performance Problems

Even with the best GPU programming practices and optimized CUDA support, occasional issues can still affect your GPU performance. Understanding how to diagnose and correct them is important to keep your system stable and efficient.

In most cases, the root cause lies in driver mismatches, memory leaks, or cooling limitations, all of which can be fixed with the right instructions and configuration steps.

So, we’ll walk through the most common GPU performance issues and how to resolve them effectively!

Driver Conflict Resolution

A version mismatch between the installed NVIDIA driver and the CUDA toolkit is one of the most frequent performance killers. When these versions don’t align, you may encounter errors like:

  • “CUDA driver version is insufficient for CUDA runtime version.”

Here’s how to resolve it:

1. Uninstall the existing driver completely:

    sudo apt-get purge 'nvidia*'

2. Then, download the correct version directly from NVIDIA.

3. Run the installer manually if using Linux:

    sudo sh NVIDIA-Linux-x86_64-550.run

4. Finally, verify compatibility using:

    nvidia-smi

Tip: After upgrading, always reboot the machine and check the logs for potential kernel module errors!

CUDA Errors and Fixes

CUDA errors often appear during GPU programming when there’s an issue with shared memory, kernel launch parameters, or compiler versions. These issues can cause failed floating-point operations, incorrect output, or runtime crashes.

Here are some quick and easy fixes (see the error-checking sketch after this list):

  • Confirm that your CUDA toolkit and NVIDIA driver are compatible.
  • Try rebuilding the code with the correct compiler flags.
  • Adjust shared memory limits if your kernel relies on large data blocks.
  • When doing manual installations, always run the installers as root.
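
Many of these failures surface silently unless you check return codes. A common diagnostic pattern is a small error-checking macro wrapped around every CUDA call; here’s a minimal sketch (the macro name and structure are our own convention, not part of the CUDA API):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Print the failing call's error string and location, then abort.
    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                        cudaGetErrorString(err), __FILE__, __LINE__);   \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    int main() {
        float *d = nullptr;
        CUDA_CHECK(cudaMalloc(&d, 1 << 20));
        // ... launch kernels here, then surface async launch errors:
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaDeviceSynchronize());
        CUDA_CHECK(cudaFree(d));
        return 0;
    }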

Boost GPU Performance with ServerMania GPU Solutions

If you’re looking for GPU performance optimizations, you’ll need more than bare hardware. A good optimization requires an entire infrastructure built to handle the most demanding workloads and tasks.

That’s where ServerMania stands out. We deliver fully tuned GPU server hosting solutions designed for workloads including AI and ML, 3D rendering, scientific simulations, big data analysis, and much more.

Our team offers dedicated GPU servers, managed GPU optimization, custom cluster deployments, and even cloud GPU infrastructure that guarantees reliability.

You can also colocate your machine in one of our GPU-ready data centers for maximum control and uptime, while you focus on what matters most for your business.

If you have questions, get in touch with our 24/7 customer support or book a free consultation and deploy your GPU infrastructure today.