
What you’ll build

This guide walks a developer from “I’ve run local LLMs on my PC” to a real, deployable multi-agent service running on a DigitalOcean GPU Droplet (RTX 6000 Ada or L40S-class), using the LangChain ecosystem (LangGraph + LangChain tools) and a GPU-backed inference server (vLLM or Ollama).

The “hello world” you’ll deploy is intentionally small but production-shaped:

  • GPU inference backend (OpenAI-compatible HTTP API via vLLM, or Ollama’s API).
  • multi-agent workflow with a supervisor delegating to specialized worker agents (e.g., Research + Analyst + Writer).
  • FastAPI service that exposes a single endpoint your apps can call.

Reference architecture (single-machine version):

Internet
  |
  |  HTTPS (optional)
  v
FastAPI "Agent API"  (LangGraph supervisor + tools)
  |
  |  internal network (Docker)
  v
vLLM OpenAI-compatible server  --> GPU (RTX 6000 Ada / L40S)

Why this shape? It mirrors how teams actually ship agentic systems: a web-facing orchestration/service layer, calling a model-serving layer optimized for GPU utilization. DigitalOcean specifically supports this path with AI/ML-ready images and an inference-optimized image designed for LLM serving.

Hardware, model, and cost planning

DigitalOcean’s single-GPU plans include both NVIDIA RTX 6000 Ada Generation and NVIDIA L40S, each listed at $1.57/GPU/hour on-demand with 48 GB GPU memory, 64 GiB system memory, 8 vCPUs, and a 500 GiB NVMe boot disk. 

A critical billing detail for cost control:

  • GPU Droplets are billed per second with a 5-minute minimum, and powering off does not stop billing (resources remain reserved) — billing ends only when you destroy the Droplet. 

So the “save money” workflow is not “shut down and come back later.” It’s:

Snapshot → Destroy Droplet → Recreate from Snapshot when needed 

Snapshot storage is billed separately at $0.06/GB-month (proportional to stored snapshot size).

Cost and time estimates you can actually use

As of March 7, 2026 (America/Los_Angeles), using the published on-demand price of $1.57/hr:

  • 1 hour: ~$1.57
  • 1 day (24h): ~$37.68
  • “Setup session” (2–3 hours): ~$3.14–$4.71
  • Weekend testing (8 hours total): ~$12.56

If you’ve seen “$41.28/day” in informal estimates, that would correspond to ~$1.72/hr; the current published on-demand price for the RTX 6000 Ada/L40S on DigitalOcean’s pricing page is $1.57/hr.
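
If you want to re-run these numbers whenever pricing changes, the arithmetic is simple enough to script. A minimal Python sketch, assuming the $1.57/hr rate above and the 5-minute billing minimum:

# Rough session-cost estimator for a single GPU Droplet.
HOURLY_RATE = 1.57            # $/GPU/hour (RTX 6000 Ada / L40S on-demand)
MIN_BILLED_SECONDS = 5 * 60   # per-second billing with a 5-minute minimum

def session_cost(hours: float) -> float:
    """Cost of one Droplet lifetime, honoring the 5-minute minimum."""
    billed_seconds = max(hours * 3600, MIN_BILLED_SECONDS)
    return billed_seconds * HOURLY_RATE / 3600

for label, hours in [("1 hour", 1), ("1 day", 24), ("setup (3h)", 3), ("weekend (8h)", 8)]:
    print(f"{label}: ~${session_cost(hours):.2f}")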

A realistic end-to-end setup time for a first run (assuming you’re comfortable with SSH/Docker):

  • Provision droplet + SSH: 10–20 min
  • Verify GPU + Docker runtime: 10–20 min
  • Pull model weights (varies heavily): 5–30+ min
  • Deploy agent API + test: 20–45 min

The single biggest variable is model download + cold start, which scales with model size (many GBs). This is why snapshots and caching matter.
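
One way to shrink that variable is to warm the cache before the server needs it. The Compose setup later in this guide mounts ~/.cache/huggingface into the vLLM container, so pre-pulling weights on the host means restarts and redeploys skip the download. A minimal sketch, assuming huggingface_hub is installed (and HF_TOKEN is set for gated models):

# Pre-download model weights into the host's Hugging Face cache so the
# vLLM container (which mounts ~/.cache/huggingface) starts warm.
from huggingface_hub import snapshot_download

local_path = snapshot_download("Qwen/Qwen2.5-7B-Instruct")
print(f"Weights cached at: {local_path}")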

GPU Droplet vs CPU-only Droplet (why you’re doing this)

LLM inference is dominated by large matrix operations; GPU acceleration via CUDA is specifically built to “dramatically speed up computing applications by harnessing the power of GPUs.” 

DigitalOcean’s GPU Droplets exist because CPU-only VMs are typically poor value for serious LLM inference/embedding workloads; GPU plans also include a dedicated boot disk (persistent) plus a scratch disk (non-persistent) for staging data.

RTX 6000 Ada vs L40S vs RTX 5090 (practical LLM implications)

This is the “why it feels different than your desktop GPU” portion.

VRAM capacity is the feature that most often gates which models you can run well.

  • RTX 6000 Ada: 48GB GDDR6 (ECC); memory bandwidth listed as 960 GB/s in NVIDIA’s datasheet. 
  • L40S: 48GB GDDR6 (ECC), plus datacenter positioning; one spec block lists 864 GB/s memory bandwidth and FP8 tensor throughput up to 1,466 TFLOPS (with sparsity). 
  • GeForce RTX 5090: NVIDIA’s page states 32GB GDDR7. Some board-partner specs list a 512-bit memory bus and 1,792 GB/s bandwidth, but VRAM remains 32GB. 

Translation for an LLM developer:

  • 48GB class GPUs (RTX 6000 Ada / L40S) let you run a wider range of mid-size and larger LLMs, and/or run smaller LLMs with bigger context windows, higher concurrency, or less aggressive quantization. 
  • 32GB class GPUs (RTX 5090) can be extremely fast, but you’ll hit VRAM ceilings sooner for bigger models or heavier concurrency; you’ll rely more on quantization/offloading tradeoffs. Quantization is a core technique for reducing memory and enabling larger model loads. 
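
A rough rule of thumb makes the VRAM ceiling concrete: model weights alone need about (parameter count) × (bytes per parameter), before you add KV cache, activations, and serving overhead. A back-of-the-envelope sketch:

# Weights-only VRAM estimate; real serving needs several GB more for
# KV cache, activations, and framework overhead.
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param  # 1B params at 1 byte ~= 1 GB

for name, params in [("7B", 7), ("14B", 14), ("32B", 32), ("70B", 70)]:
    fp16 = weight_vram_gb(params, 2.0)   # 16-bit weights
    int4 = weight_vram_gb(params, 0.5)   # ~4-bit quantized
    print(f"{name}: fp16 ~{fp16:.0f} GB, 4-bit ~{int4:.0f} GB")

# 70B at fp16 (~140 GB) is far beyond 48 GB; at ~4-bit (~35 GB) it can fit,
# but with little headroom left for KV cache and concurrency.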

Provisioning the GPU Droplet

Your screenshots match the current DigitalOcean control panel flow:

  • “Create GPU Droplet” page.
  • GPU plan selection (tabs for multiple GPU types; your example shows L40S at $1.57/hr).
  • Advanced options like Improved Metrics and Monitoring (Free).
  • “1-click Models” list (including very large models like “DeepSeek/R1 671B” showing why GPU selection/memory matters).

DigitalOcean’s official “How to Create GPU Droplets” doc confirms the key options you saw:

  • Use an AI/ML-ready image (drivers and software preinstalled), or
  • Use 1-Click Models powered by Hugging Face, or
  • Use a stock image and manually install drivers. 

Step-by-step creation in the control panel

  1. In the DigitalOcean control panel, click Create → GPU Droplets
  2. Choose a datacenter region where GPUs are available (your screenshot shows Toronto TOR1 as an example).
  3. Choose an image:
    • AI/ML-ready image when you want a clean, flexible base (recommended for learning + building). DigitalOcean documents that the NVIDIA AI/ML-ready image is Ubuntu 22.04 and includes CUDA drivers/toolkit and the NVIDIA Container Toolkit. 
    • Inference-optimized image if you want the fastest path to serving LLMs (it includes Docker and a vLLM container plus a helper script). 
    • 1-Click Models if you want a preconfigured model endpoint quickly (useful for demos and “I just want something responding”). 
  4. Choose your GPU plan:
    • RTX 6000 Ada or L40S are both listed at $1.57/GPU/hr on-demand. 
  5. Enable Improved Metrics and Monitoring (free) if you want GPU observability surfaced in DigitalOcean Insights. 
  6. Add your SSH key, name the Droplet, and create it.
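
If you’d rather script this than click through the panel, the same create flow is exposed by DigitalOcean’s API v2 (which doctl also wraps). A hedged Python sketch: the size and image slugs below are placeholders, so look up the current GPU plan slug and AI/ML-ready image slug (e.g. via GET /v2/sizes and GET /v2/images) before running it:

# Scripted provisioning via DigitalOcean API v2 -- a sketch, not a recipe.
# The size/image slugs are placeholders; verify the current slugs first.
import os
import requests

resp = requests.post(
    "https://api.digitalocean.com/v2/droplets",
    headers={"Authorization": f"Bearer {os.environ['DO_API_TOKEN']}"},
    json={
        "name": "agentic-gpu",
        "region": "tor1",                      # example region from the screenshots
        "size": "<gpu-plan-slug>",             # placeholder: L40S / RTX 6000 Ada plan
        "image": "<ai-ml-ready-image-slug>",   # placeholder: AI/ML-ready image
        "ssh_keys": [os.environ["DO_SSH_KEY_FINGERPRINT"]],
    },
    timeout=30,
)
resp.raise_for_status()
print("Droplet ID:", resp.json()["droplet"]["id"])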

Verifying the NVIDIA stack and hardening the host

DigitalOcean strongly recommends the AI/ML-ready images, and their “Recommended GPU setup” doc lists exactly what comes installed for NVIDIA GPU Droplets on that image:

  • nvidia-container-toolkit
  • CUDA keyring
  • CUDA driver package (cuda-drivers-575)
  • CUDA Toolkit (cuda-toolkit-12-9)

Step-by-step verification

  1. SSH in as root (or your configured user).
  2. Confirm the GPU is visible:

nvidia-smi

If the AI/ML-ready image was used, drivers/toolkit should already be present as above. 

  3. Confirm Docker can see the GPU (quick smoke test):

docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi

CUDA container images are designed specifically to provide CUDA runtimes/toolkit in containers for GPU workloads. 

Basic host security that prevents “oops I exposed my agent to the internet”

At minimum, do both:

  • DigitalOcean Cloud Firewall (network-level), and
  • ufw on the VM (host-level)

DigitalOcean shows how to configure firewall rules and notes that without inbound rules, traffic is blocked; the suggested inbound rule allows SSH access. 

If you use UFW, allow SSH before enabling to avoid locking yourself out. 

Example UFW setup (adjust ports to your plan):

sudo ufw allow OpenSSH
sudo ufw allow 8001/tcp   # example: your agent API port
sudo ufw enable
sudo ufw status

Deploying the GPU inference backend

You have two mainstream ways to serve a local model on your GPU Droplet for agent workflows:

  • vLLM (high-throughput inference; OpenAI-compatible server; great for multi-user services) 
  • Ollama (very friendly local dev ergonomics; also supports NVIDIA GPUs with appropriate drivers) 

This guide uses vLLM as the default because LangChain can connect via an OpenAI-compatible base URL, and DigitalOcean even provides an inference-optimized image that includes vLLM components out of the box. 

Option A: Inference-optimized image (fastest path)

DigitalOcean documents an Inference-Optimized Image for NVIDIA GPU Droplets that includes:

  • CUDA 12.9
  • NVIDIA driver version 575.51.03
  • Docker
  • vLLM container (vllm-openai container v0.9.0)
  • A guided run_model.sh script that prompts for model selection and configuration 

If you chose this image:

  1. SSH into the Droplet.
  2. Run the included script (per the doc) and follow prompts. 

Option B: Run vLLM via Docker Compose (portable, explicit)

vLLM documents official Docker images and provides a canonical docker run example using vllm/vllm-openai.

Create a working directory:

mkdir -p ~/agentic-gpu && cd ~/agentic-gpu

Create docker-compose.yml:

services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
    environment:
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ports:
      # For safety, bind to localhost unless you explicitly want public access.
      - "127.0.0.1:8000:8000"
    ipc: host

Bring it up:

export HF_TOKEN="your_huggingface_token_if_needed"
docker compose up -d
docker compose logs -f vllm

Test locally on the droplet:

curl http://127.0.0.1:8000/v1/models

This server is OpenAI-compatible, so your LangChain agent service can treat it like an OpenAI Chat Completions endpoint. 
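
Before wiring up LangChain, you can verify that compatibility directly with the official openai Python client (pip install openai). The api_key is a dummy value; vLLM doesn’t check it unless you configure it to:

# Smoke test: talk to the local vLLM server through the OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)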

Choosing a model that fits your GPU (and why quantization matters)

A helpful mental model:

  • Smaller models (7B–14B) = faster iteration, lower VRAM pressure, higher concurrency.
  • Larger models (30B–70B) = often better reasoning/writing quality, but much heavier VRAM demand.

Quantization techniques exist specifically to reduce memory requirements and make larger models feasible on limited VRAM, at the cost of some accuracy and sometimes speed. Hugging Face documents quantization as a way to reduce memory/computation cost and enable loading larger models; Transformers supports several approaches including 8-bit/4-bit and algorithms like AWQ/GPTQ. 
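
To make that concrete, here’s what 4-bit loading looks like with Transformers’ bitsandbytes integration (pip install transformers accelerate bitsandbytes). This is a sketch of the loading pattern for experimentation, not part of the vLLM deployment above; vLLM typically serves pre-quantized checkpoints (AWQ/GPTQ) instead:

# Load a model with 4-bit quantized weights to cut VRAM roughly 4x vs fp16.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # normalized-float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute still happens in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",   # place layers on the GPU automatically
)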

DigitalOcean’s own 1-Click models catalog emphasizes popular models like Llama and Mistral targeted at GPU Droplets, and their 1-Click documentation describes them as preconfigured and optimized for GPU Droplets. 

Deploying the LangChain multi-agent service

The multi-agent “hello world” here uses a supervisor pattern, where a central supervisor delegates to specialized agents. LangChain’s docs describe this pattern and why it helps (tool partitioning, domain separation, clearer iteration). 

For the orchestration layer, you have two good “LangChain ecosystem” choices:

  • Manual “supervisor + subagents as tools” approach (recommended by LangChain docs). 
  • LangGraph Supervisor library (a lightweight helper for hierarchical multi-agent graphs; the LangChain team announced it, and the repo shows end-to-end examples). 

This guide shows a practical implementation using the LangGraph Supervisor library because it yields a compact, demonstrable multi-agent workflow and still remains within the LangChain ecosystem. 

Step-by-step: create the Agent API service

Inside ~/agentic-gpu, create an agent_api folder:

mkdir -p agent_api
cd agent_api

Create requirements.txt:

fastapi
uvicorn[standard]
langchain
langchain-openai
langgraph
langgraph-supervisor
duckduckgo-search

Create app.py:

import os
from fastapi import FastAPI
from pydantic import BaseModel

from duckduckgo_search import DDGS

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph_supervisor import create_supervisor

# ---------
# Tools
# ---------
def web_search(query: str) -> str:
    """Lightweight web search tool using DuckDuckGo snippets."""
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=5))
    # Return condensed snippets for the agent to cite/ground itself.
    lines = []
    for r in results:
        title = r.get("title", "")
        href = r.get("href", "")
        body = r.get("body", "")
        lines.append(f"- {title}\n  {href}\n  {body}")
    return "\n".join(lines)

def safe_calc(expression: str) -> str:
    """Very small calculator tool (avoid arbitrary code execution)."""
    allowed = set("0123456789+-*/(). %")
    if any(ch not in allowed for ch in expression):
        return "Unsupported characters in expression."
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"Error: {e}"

# ---------
# Model: points to local vLLM OpenAI-compatible server
# ---------
VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://127.0.0.1:8000/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct")

llm = ChatOpenAI(
    model=MODEL_NAME,
    api_key=os.getenv("OPENAI_API_KEY", "not-used"),
    base_url=VLLM_BASE_URL,
    temperature=0,
)

# ---------
# Worker agents
# ---------
research_agent = create_react_agent(
    model=llm,
    tools=[web_search],
    name="research_expert",
    prompt="You are a research expert. Use web_search to gather facts and relevant links."
)

math_agent = create_react_agent(
    model=llm,
    tools=[safe_calc],
    name="math_expert",
    prompt="You are a math expert. Use safe_calc for numeric calculations."
)

# ---------
# Supervisor
# ---------
workflow = create_supervisor(
    [research_agent, math_agent],
    model=llm,
    prompt=(
        "You are a supervisor managing a research expert and a math expert. "
        "Delegate research questions to research_expert. "
        "Delegate calculations to math_expert. "
        "Synthesize a final answer for the user."
    ),
)

graph_app = workflow.compile()

# ---------
# FastAPI wrapper
# ---------
app = FastAPI(title="Agentic GPU API", version="0.1.0")

class ChatIn(BaseModel):
    message: str

@app.post("/chat")
def chat(payload: ChatIn):
    result = graph_app.invoke(
        {"messages": [{"role": "user", "content": payload.message}]}
    )
    # Return the last model message
    messages = result.get("messages", [])
    final = messages[-1].content if messages else ""
    return {"answer": final}

Why this works:

  • vLLM can run as an OpenAI-compatible server. 
  • LangChain can connect to OpenAI-compatible endpoints by setting base_url; their vLLM integration doc shows ChatOpenAI(..., base_url="http://localhost:8000/v1", ...).
  • The supervisor pattern is an established LangChain multi-agent approach. 
  • LangGraph’s “workflows and agents” material covers why graphs help with persistence/streaming/debugging in agentic systems. 

Containerize the Agent API

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

COPY app.py /app/app.py

ENV PORT=8001
EXPOSE 8001

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8001"]

Now go back to ~/agentic-gpu and add an agent_api service to your repo-level docker-compose.yml (or create a new compose file):

services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
    environment:
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ports:
      - "127.0.0.1:8000:8000"
    ipc: host

  agent_api:
    build: ./agent_api
    environment:
      # The vllm service name resolves on the Compose network.
      - VLLM_BASE_URL=http://vllm:8000/v1
      - MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
      - OPENAI_API_KEY=not-used
    ports:
      - "8001:8001"
    depends_on:
      - vllm

Notes:

  • Inside the Compose network, the Agent API reaches vLLM by its service name (http://vllm:8000/v1), even though the host port is bound to 127.0.0.1. If you instead run the Agent API outside Compose on Linux, note that host.docker.internal is not available there by default; you would need to publish vLLM on an interface the client can reach (e.g. the Docker bridge gateway, commonly 172.17.0.1). Keeping the host binding on localhost reduces accidental exposure.

Bring everything up:

cd ~/agentic-gpu
docker compose up -d --build

Test:

curl -X POST http://YOUR_DROPLET_IP:8001/chat \
  -H 'Content-Type: application/json' \
  -d '{"message":"Find recent notes on DigitalOcean GPU Droplet billing, then compute the cost of 6 hours at $1.57/hr."}'

Monitoring, reliability, and the money-saving shutdown workflow

Monitoring GPU usage

If you enabled “Improved Metrics and Monitoring” during creation, DigitalOcean notes that do-agent can detect the GPU and integrate with exporters (DCGM for NVIDIA) to surface GPU metrics in Insights. 

For more advanced monitoring outside Insights, DigitalOcean recommends using NVIDIA DCGM / DCGM Exporter for NVIDIA GPUs. 
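
For quick eyeballing during a test session (without standing up DCGM/Prometheus), nvidia-smi’s machine-readable query mode is enough. A throwaway Python sketch that polls it:

# Poll GPU utilization and memory every 5 seconds via nvidia-smi.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

while True:
    util, used, total = subprocess.check_output(QUERY, text=True).strip().split(", ")
    print(f"GPU util: {util}% | VRAM: {used}/{total} MiB")
    time.sleep(5)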

The correct “save money” workflow on DigitalOcean GPU Droplets

Because powered-off GPU Droplets are still billed until destroyed, your end-of-session checklist should look like this: 

Step-by-step

  1. Power off the Droplet (for data consistency).

    • DigitalOcean’s doctl snapshot command reference explicitly recommends powering off before snapshotting. 
  2. Take a snapshot:

    • Via UI: Droplet → Snapshots → “Take snapshot”
    • Via doctl (example):

doctl compute droplet-action snapshot <droplet-id> --snapshot-name "agentic-gpu-$(date +%Y-%m-%d)"

DigitalOcean’s docs describe snapshotting via doctl compute droplet-action snapshot and recommend powering off first. 

  3. Destroy the Droplet (this is what stops billing):
    • UI: Droplet → Destroy
    • doctl:

doctl compute droplet delete <droplet-id>

Destroying a Droplet is irreversible, and the control panel flow allows optionally destroying associated resources (snapshots, volumes) — be careful to keep the snapshot you intend to restore from. 

  4. Later, recreate from snapshot:

    • UI: Create → Droplets → Choose Image → Snapshots
    • Automation: create a Droplet using the snapshot image ID (DigitalOcean documents this conceptually). 
  5. Keep an eye on snapshot storage charges:

    • Snapshot storage is billed at $0.06/GB-month, so delete old snapshots you no longer need. 

A cost-optimized “real-world” pattern

If you want a more “production” cost posture than “one GPU box does everything”:

  • Run the Agent API + routing + auth on a small CPU Droplet (always-on, cheap).
  • Create the GPU Droplet only when inference is needed, attach to a predictable DNS name or service discovery, and destroy it afterward (with snapshots/cached model weights strategy). DigitalOcean emphasizes per-second billing and cost-control benefits of granular usage in their per-second billing announcement. 

That pattern is commonly how teams justify on-demand GPU spend for agentic workloads that are bursty rather than 24/7.

Where 1-Click Models fit in this picture

DigitalOcean’s 1-Click Models are designed to give you a model endpoint “immediately after provisioning,” with no extra setup, and they’re explicitly positioned as optimized for GPU Droplets. 

If your goal is “I want to confirm end-to-end agent orchestration first,” 1-Click Models can reduce the inference setup work. You still deploy the agent-service layer, but you skip the model-server wiring.
