LLM APIs with Docker and GPU Infrastructure

Deploying a Large Language Model (LLM) and getting it ready for production requires a few things, all of which are absolutely critical for its performance: optimized infrastructure, a solid containerization strategy, and a set of optimization techniques that, combined, deliver the reproducibility and performance most AI teams are aiming for.

Containerization & GPU Acceleration

Containerization with Docker is perhaps the easiest way to deploy a large language model anywhere. It lets you package everything the model needs, such as software, libraries, files, and dependencies, and run it all with a single command, keeping your workflow consistent and predictable.

When Docker is paired with GPU acceleration, LLM inference becomes fast enough for production use, with response times kept to a minimum. This combination is what allows models with billions of parameters to serve multiple requests at once while keeping your infrastructure costs low.

LLM Deployment Methods Compared

Even though Docker is the easiest way to run LLMs, there are other deployment methods worth considering. Here is how Docker compares with the other popular options:

Method | Ideal Use Case | Difficulty | Scalability
Docker + GPU | Production APIs, inference services, and many self-hosted LLMs | Medium | High
Kubernetes | Large-scale or multi-model AI-driven systems | High | Very High
Local/Dev | Testing, fine-tuning, and prototype development | Low | Low
Serverless | Applications with variable demand and unpredictable traffic | Medium | Auto

As we can see, each of these deployment methods comes with its own advantages, but Docker keeps the lead in operational simplicity, which makes it the most popular choice.

GPU Server Preparation for Large Language Model Deployment

Before proceeding with the deployment, make sure your foundation is solid by checking the fundamental hardware and software requirements.

Hardware Requirements

All large models require serious computing power, and the hardware you choose must match the size of your workload, as this directly affects performance.

Here’s a quick GPU estimate (minimum and optimal) based on the size of the LLM:

Model Size | Minimum GPU | Optimal GPU | VRAM Required
7B parameters | NVIDIA L4 or NVIDIA RTX | NVIDIA L4 24GB | 16–24 GB
13B parameters | NVIDIA L4 | NVIDIA A100 40GB | 24–32 GB
70B parameters | NVIDIA A100 80GB | NVIDIA H100 80GB | 80+ GB
405B parameters | 8× A100 80GB | 8× H100 80GB | 640+ GB

Keep in mind that as your workload grows and the parameter count increases, so does the memory requirement. Smaller configurations can run local LLMs on a single GPU, but at some point you will need powerful GPUs with plenty of VRAM, especially for production-grade inference or for serving multiple LLMs.
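
If you want to sanity-check the table above, a rough rule of thumb (an assumption for capacity planning, not a vendor figure) is that fp16 weights occupy about 2 bytes per parameter, with the KV cache and runtime overhead adding more on top:

# Back-of-the-envelope VRAM estimate: fp16 weights at ~2 bytes per parameter
# (KV cache, activations, and runtime overhead are extra)
PARAMS_B=13                                            # model size in billions of parameters
echo "Weights alone: ~$((PARAMS_B * 2)) GB of VRAM"    # 13B -> ~26 GB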

See Also: What is the Best GPU Server for AI and Machine Learning?

Software Requirements

Linux is by far the best option for a self-hosted large language model setup. Ubuntu 20.04 and Rocky Linux 8+ are among the most preferred operating systems, since both of these systems are well-known for their compatibility with machine learning projects.

Along with the OS, you will also need Docker 24.0+, the NVIDIA Docker runtime (NVIDIA Container Toolkit), NVIDIA Driver 535+, and CUDA 12.0+ installed and configured on your server to allow GPU access within containers.
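
It's worth confirming those versions on the host before going further; the commands below are standard checks, and the exact output will vary by distribution:

docker --version                                              # expect 24.0 or newer
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # expect 535 or newer
nvcc --version                                                # CUDA toolkit version, if installed on the host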

To verify that your GPU is available, open the Linux Terminal and try:

nvidia-smi

For example, if you have an NVIDIA L4 24GB, the output should include a line like:

GPU 0: NVIDIA L4 (24GB)

This step confirms that the hardware is visible to the system, effectively checking it off the list. If you can't see your GPU, reinstall the driver from the official NVIDIA website and rerun the command.
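
On Ubuntu, for example, the driver can also be reinstalled through the package manager; nvidia-driver-535 below is only an example package name, so use the version NVIDIA recommends for your card:

sudo apt-get install --reinstall -y nvidia-driver-535   # example driver package
sudo reboot
nvidia-smi                                              # run again after the reboot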

Moving on to the next step, you must install the NVIDIA Container Toolkit to enable communication between your video card and Docker. Here are the steps:

1. Add the NVIDIA package repositories:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

2. Install the Nvidia Container Toolkit:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

3. Configure Docker to use the NVIDIA runtime:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

4. Test the GPU access within Docker:

docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

To conclude, if the output displays your GPU model, then the server is ready for the LLM deployment and everything works as intended.

If an error appears instead, we recommend rerunning all of the commands and verifying the versions of the NVIDIA driver and Docker.
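
A few quick checks usually narrow the problem down; the daemon.json path below is Docker's default configuration location:

nvidia-ctk --version              # confirms the toolkit is installed
cat /etc/docker/daemon.json       # should now contain an "nvidia" runtime entry
docker info | grep -i runtime     # lists the runtimes Docker actually sees
docker --version && nvidia-smi    # recheck the Docker and driver versions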

See Also: How To Install Docker on Ubuntu

Production LLM Deployment with vLLM

To successfully deploy a self-hosted LLM in production, you need to work through several important steps: creating a vLLM Dockerfile, configuring the GPU memory, and setting up the OpenAI-compatible API.

So, let’s go through each of these steps with in-depth tutorials!

Creating a vLLM Dockerfile

The first step, before writing a custom Dockerfile, is to tell Docker exactly which image to use and which model to load, and to connect it to your Hugging Face cache. The code shown below, a docker run command, launches a containerized vLLM instance, gives it access to all GPUs, and sets the critical parameters.

Those parameters include:

  • Memory
  • Parallelism
  • Maximum Token Length

Here’s how to get a local or production LLM server running with the vLLM OpenAI-compatible image and a Hugging Face model:

export MODEL_NAME="meta-llama/Llama-2-7b-chat-hf"
export HF_TOKEN="your_huggingface_token"

docker run -d \
  --name llm-api \
  --gpus all \
  --shm-size 16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model $MODEL_NAME \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --tensor-parallel-size 1
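
Once the container is up, it helps to follow the logs while the model downloads and then confirm the server is answering; the /v1/models route is part of vLLM's OpenAI-compatible API:

docker logs -f llm-api                  # watch the model download and the server start-up
curl http://localhost:8000/v1/models    # should list the model once the server is ready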

See Also: Docker Tutorials for Beginners: Learn How to Use Docker

Configuring the GPU Memory

If you’re aiming for a production LLM deployment, it's absolutely necessary to optimize GPU usage and tune the performance settings in your favor. A custom Dockerfile lets you bake these settings into the image, so let's see how to define the variables.

In addition, we’re going to enable health checks and prepare a dedicated cache!

FROM vllm/vllm-openai:latest

# Default model and performance settings; override them at build or run time as needed
ENV MODEL_NAME="meta-llama/Llama-2-7b-chat-hf"
ENV GPU_MEMORY_UTILIZATION="0.85"
ENV MAX_MODEL_LEN="4096"
ENV TENSOR_PARALLEL_SIZE="1"

# Dedicated cache directory for model weights
RUN mkdir -p /model-cache

# Mark the container unhealthy if the API stops responding
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

WORKDIR /app
EXPOSE 8000

# The base image ships its own ENTRYPOINT; clear it so the CMD below runs as written
ENTRYPOINT []
CMD python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --dtype float16 \
    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
    --max-model-len $MAX_MODEL_LEN \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --host 0.0.0.0 \
    --port 8000
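
Assuming the file above is saved as Dockerfile in the current directory, building and running it looks like the following; the llm-api-image name is just an example:

docker build -t llm-api-image .
docker run -d \
  --name llm-api \
  --gpus all \
  --shm-size 16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  llm-api-image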

Setting Up the OpenAI-Compatible API

Now that the container is running and the GPU memory is configured, it's time to interact with your LLM: test the API with a health check, send a completion request to generate text, and send a chat-style request for conversational AI.

# Health check
curl http://localhost:8000/health

# Text completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 100,
    "temperature": 0.7
  }'

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 150
  }'
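
The same endpoint also accepts the standard OpenAI stream flag, which returns tokens as they are generated instead of waiting for the full answer; here is a minimal sketch:

# Streaming chat completion (tokens arrive as server-sent events)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Summarize Docker in one sentence."}],
    "max_tokens": 100,
    "stream": true
  }'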

See Also: How to Use the Docker Build Command

Production LLM Deployment with Hugging Face TGI

LLM deployment with Hugging Face TGI (Text Generation Inference) is ideal for large models where stability matters above everything else. TGI handles everything from token limits to GPU acceleration and request batching, making it a strong fit for LLM servers and production deployments.

TGI Deployment Configuration

The first step, of course, is to set up and launch a basic TGI container with the preferred LLM model. The example code below will create a running instance that can easily handle requests and store the model locally for quick startup.

It’s really the simplest way to get a self-hosted LLM working with Docker!

# Deploy Llama 2 with TGI
docker run -d \
  --name tgi-llm \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /data/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 1 \
  --max-total-tokens 4096 \
  --max-input-length 3072
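
The first start can take a while because the model has to download and the shards have to initialize, so keep an eye on the logs and then hit the health and info routes that TGI exposes:

docker logs -f tgi-llm              # wait until the shards report ready
curl http://localhost:8080/health   # succeeds once the model is loaded
curl http://localhost:8080/info     # model id, dtype, and server settings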

Model Quantization Options

Another very important step, especially when running significantly larger models, is to use quantization, which will help you drastically reduce GPU memory usage.

The example code below shows a production-ready TGI deployment with quantization enabled, along with batching and concurrency settings:

# TGI with quantization and custom settings
docker run -d \
  --name tgi-optimized \
  --gpus all \
  --shm-size 2g \
  -p 8080:80 \
  -v /data/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-2-13B-chat-GPTQ \
  --quantize gptq \
  --max-batch-prefill-tokens 4096 \
  --max-total-tokens 8192 \
  --max-batch-total-tokens 16384 \
  --max-waiting-tokens 20 \
  --waiting-served-ratio 1.2 \
  --max-concurrent-requests 128
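
To confirm the savings, compare the GPU memory footprint before and after switching to the quantized model:

nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv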

Production Settings

The next step, especially helpful for long-running deployments, is to use Docker Compose to describe your LLM service, volumes, GPU allocation, and environment variables in a single file.

The sample code below shows you how to set this up in a production environment:

# docker-compose.yml for TGI
version: '3.8'

services:
  llm-inference:
    image: ghcr.io/huggingface/text-generation-inference:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: 2gb
    ports:
      - "8080:80"
    volumes:
      - model-cache:/data
      - ./config:/config
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - MODEL_ID=meta-llama/Llama-2-13b-chat-hf
      - MAX_TOTAL_TOKENS=8192
      - MAX_INPUT_LENGTH=4096
    # Note: ${VAR} in "command" is substituted by Compose from the host shell or an
    # .env file, not from the "environment" block above, so the values are repeated
    # literally here to keep the file self-contained.
    command: >
      --model-id meta-llama/Llama-2-13b-chat-hf
      --num-shard 1
      --max-total-tokens 8192
      --max-input-length 4096
    restart: unless-stopped
    
volumes:
  model-cache:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/llm-cache
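
With the file saved as docker-compose.yml, you can bring the stack up as follows; note that the bind-mounted cache directory has to exist before the first start:

sudo mkdir -p /data/llm-cache            # backing directory for the model-cache volume
export HF_TOKEN=your_huggingface_token
docker compose up -d
docker compose logs -f llm-inference     # follow the model download and start-up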

Testing TGI Endpoints

Once the deployment is finished, it's time to check whether the LLM API is responding correctly. You can test it with Python, or straight from the terminal, by sending prompts and retrieving the generated text.

import requests
import json

TGI_URL = "http://localhost:8080"

def generate_text(prompt, max_new_tokens=100):
    headers = {"Content-Type": "application/json"}
    data = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.7,
            "top_p": 0.95,
            "do_sample": True
        }
    }
    
    response = requests.post(
        f"{TGI_URL}/generate",
        headers=headers,
        data=json.dumps(data)
    )
    
    return response.json()

# Test the API
result = generate_text("Explain Docker containers:")
print(result['generated_text'])

Production LLM Deployment with NVIDIA NIM

Another way to deploy an LLM is via NVIDIA NIM (NVIDIA Inference Microservices), which lets you run large language models easily, but only on NVIDIA GPUs.

This framework is fully optimized and offers low-latency inference with exceptionally high throughput. NIM is aimed squarely at production environments and helps you manage scalable deployments and GPU memory utilization.

NIM Deployment Configuration

The first step is to set up NIM so it can recognize your GPU resources, apply the model parameters, and target the right deployment environment.

The example code below launches a basic containerized LLM with NIM:

# Deploy Llama 2 with NVIDIA NIM
docker run -d \
  --name nim-llm \
  --gpus all \
  --shm-size 4g \
  -p 8000:8000 \
  -v /data/nim-models:/models \
  -e NIM_API_KEY=$NIM_KEY \
  nvcr.io/nvidia/nim/nim-server:latest \
  --model-path /models/Llama-2-7b-chat \
  --max-batch-size 32 \
  --max-seq-length 4096 \
  --precision fp16

The code above starts a NIM server that loads your model into GPU memory, sets the precision to lower memory usage, and configures the batch size for inference.

It’s straightforward, nothing complex here!
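
Whichever serving framework you use, it's worth confirming that the container stayed up and that the model actually landed on the GPU:

docker ps --filter name=nim-llm     # the container should show as "Up"
docker logs --tail 50 nim-llm       # look for model-load and listening messages
nvidia-smi                          # the server process should appear with GPU memory in use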

Optimizing GPU and Memory Usage

When dealing with very large models or heavy request loads, you also need to tune the batch size, precision, and memory allocation manually to boost your throughput and response time. The code here shows how to deploy a quantized, multi-GPU NIM setup for models in the 70B+ range.

# NIM multi-GPU deployment with quantization
docker run -d \
  --name nim-llm-large \
  --gpus '"device=0,1"' \
  --shm-size 16g \
  -p 8000:8000 \
  -v /data/nim-models:/models \
  nvcr.io/nvidia/nim/nim-server:latest \
  --model-path /models/Llama-2-70b-chat \
  --max-batch-size 64 \
  --max-seq-length 4096 \
  --precision fp16 \
  --quantize awq
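
After start-up, a per-GPU query confirms that both devices are actually carrying part of the model:

nvidia-smi --query-gpu=index,name,memory.used,utilization.gpu --format=csv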

Testing NVIDIA NIM Endpoints

When the NIM server is up and running, it’s highly recommended to check whether the API endpoints will respond and confirm that the LLM is working correctly.

Again, with a short Python script we can send a few prompts and check that the expected generated text comes back. This verifies both the container setup and the model configuration.

import requests
import json

NIM_URL = "http://localhost:8000"

def generate_text(prompt, max_tokens=100):
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7
    }
    response = requests.post(f"{NIM_URL}/v1/generate", json=payload)
    return response.json()

# Test the NIM API
result = generate_text("Explain the benefits of GPU acceleration:")
print(result['text'])

See Also: How to Set Up and Optimize GPU Servers for AI Integration

Troubleshooting Local LLM Server Deployment Issues

Running large language models on a local LLM server can sometimes produce errors or slow response times, depending on the available compute, RAM, and GPU configuration.

This section covers the most common issues, gives tips for private LLMs, and shows how to keep control over your running models, whether you manage them through a web interface or the terminal. The solutions below are cost-effective and work across different setups, so pick what makes sense for your system.

Out of Memory Errors

Here are the primary symptoms of “Out of Memory” errors:

Your container fails with a CUDA out of memory error, or the system runs out of RAM. This happens when the memory needed to run the model exceeds what your GPU or host can provide.

Here are several ways to regain control:

# Solution 1: Reduce GPU memory utilization
docker run --gpus all \
  vllm/vllm-openai:latest \
  --gpu-memory-utilization 0.8  # Reduced from 0.9

# Solution 2: Enable quantization for smaller memory footprint
docker run --gpus all \
  vllm/vllm-openai:latest \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --quantization gptq

# Solution 3: Reduce max model length
docker run --gpus all \
  vllm/vllm-openai:latest \
  --max-model-len 2048  # Reduced from 4096

By adjusting the memory settings or using quantized models, you can keep running models on a local LLM server without upgrading the hardware, giving you greater control over performance and cost.

Slow Inference

Here are the primary symptoms of slow inference:

The web interface (for example, Open WebUI) feels sluggish, or response times are longer than expected. This usually happens when batch sizes, parallelism, or GPU utilization don't match the compute your workload needs.

Solutions vary depending on your setup and the features you need.

# Check GPU utilization
nvidia-smi dmon -s u

# Increase batch size to better use RAM and GPU
docker run --gpus all \
  vllm/vllm-openai:latest \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 512

# Enable tensor parallelism for multi-GPU setups
docker run --gpus '"device=0,1"' \
  vllm/vllm-openai:latest \
  --tensor-parallel-size 2

Adjusting batch sizes or enabling tensor parallelism gives you greater control over running models, helping you achieve faster inference while keeping your system resources under control.

Container Crashes

Here is the primary symptom of container crashes:

The container unexpectedly stops or won't start. The most common causes are insufficient RAM, too little shared memory, or a missing NVIDIA runtime.

# Check logs to understand the crash
docker logs llm-api --tail 100

# Increase shared memory to stabilize container
docker run --gpus all \
  --shm-size 32g \
  vllm/vllm-openai:latest

# Verify NVIDIA runtime is working
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
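
If the crashes persist, confirm that the host actually has the RAM and shared memory the container is asking for:

free -h            # total and available system memory
df -h /dev/shm     # current shared-memory allocation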

Why This Approach Makes Sense

By monitoring logs, adjusting RAM allocation, and ensuring proper GPU drivers, you maintain greater control over private LLMs and local LLM servers.

This approach is cost-effective, suitable for both open source models and production deployments, and helps you plan for future scaling without overloading your computer.

See Also: Docker vs Kubernetes, Which One is Right for You?

ServerMania GPU Solutions for LLM Deployment

A ServerMania CTA image listing the reasons why ServerMania stands out.

Now that we’ve gone through the basics behind deploying large language models (LLMs), we know how important a reliable infrastructure and powerful GPU are.

With ServerMania, you can deploy and run your own LLMs in a matter of days using our ready-to-go environment, whether that's a dedicated server, a virtual server on AraCloud, or a colocation service.

Why Choose ServerMania for LLM Hosting

Quick Deployment: We deploy servers quickly and efficiently, with many elements pre-configured around your preferred workflow.
Flexible Hardware: With ServerMania, you can easily scale from a single GPU to a multi-node server cluster as your LLM workload grows.
24/7 Expert Support: Our expert support team is at your disposal 24/7 to answer questions, help you troubleshoot, and optimize your AI/ML infrastructure.
99.99% Uptime SLA: We offer enterprise-grade reliability for critical production deployments in our top-tier data centers, delivering the lowest possible latency.

How to Get Started | 3 Easy Steps to Deploy an LLM Today

  1. Choose Your Configuration: Select a GPU server configuration, such as an NVIDIA L4, with a CPU that matches the scale of your LLM project, and customize your deployment as necessary.
  2. Select Data Center Location: Choose a ServerMania data center that is closest to your clients to optimize latency and user experience anywhere in the world.
  3. Deploy with Our Assistance: The ServerMania team of experts will guide you through the steps of deploying your first LLM infrastructure.

🗨️Ready to take your LLM deployment to the next level? Book a free consultation with an LLM expert to help you with solution planning and decision-making. We’re available right now!