triton-robot-inference

TensorRT/ONNX Runtime FastAPI microservice (Orin-first)

arm64 · FastAPI · ONNX · TensorRT · Jetson · REST
Overview
Primary hardware
NVIDIA Orin/Jetson (arm64)
What it does

Lightweight FastAPI microservice running ONNX Runtime/TensorRT for common vision models.

Why it saves time

One-command GPU inference server; avoids hand-building TensorRT engines or stitching CUDA libraries together yourself (see the run sketch below).

Get access

Use StreamDeploy to manage OTA updates, versioned configs, and rollbacks across fleets.
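A minimal run sketch, assuming the image has been built locally as robot-inference:latest (the tag is illustrative), a model sits on the host at /opt/models/model.onnx, and the NVIDIA container runtime is installed on the Jetson:

# Mount the model, publish the REST port, and enable GPU access on Jetson/Orin
docker run --rm --runtime nvidia \
  -v /opt/models/model.onnx:/models/model.onnx:ro \
  -e MODEL_PATH=/models/model.onnx \
  -p 8080:8080 \
  robot-inference:latest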

Dockerfile
# Jetson-friendly CUDA/TensorRT runtime image (override with --build-arg BASE_IMAGE=...)
ARG BASE_IMAGE=nvcr.io/nvidia/l4t-pytorch:r35.4.1-py3
FROM ${BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive \
    LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3-pip python3-dev build-essential curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# ONNX Runtime: Jetson GPU builds are distributed as NVIDIA-provided wheels rather
# than via PyPI; try the generic GPU package first and fall back to the CPU build.
RUN pip3 install --no-cache-dir \
    fastapi==0.111.0 "uvicorn[standard]==0.30.0" \
    numpy==1.26.4 pillow==10.3.0 onnx==1.16.0 && \
    (pip3 install --no-cache-dir onnxruntime-gpu==1.18.0 || \
     pip3 install --no-cache-dir onnxruntime==1.18.0)

# Place model(s) (mount your own at runtime or bake here)
WORKDIR /app
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
# Example: default ONNX path (override with MODEL_PATH env)
ENV MODEL_PATH=/models/model.onnx \
    SERVICE_HOST=0.0.0.0 \
    SERVICE_PORT=8080

EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s \
  CMD curl -fsS "http://localhost:${SERVICE_PORT}/health" || exit 1

ENTRYPOINT ["/app/entrypoint.sh"]
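Build sketch (the robot-inference:latest tag is illustrative; when overriding, pick an L4T base image that matches your JetPack release):

# Build against the default Jetson base image
docker build -t robot-inference:latest .

# Or point BASE_IMAGE at a different L4T runtime image
docker build --build-arg BASE_IMAGE=<other-l4t-runtime-image> -t robot-inference:latest .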
entrypoint.sh
#!/usr/bin/env bash
set -euo pipefail

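# Generate the FastAPI app inline at container start so the image needs no separate source tree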
cat > /app/server.py << 'PY'
import os, io
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from PIL import Image
import numpy as np, onnxruntime as ort

app = FastAPI()
model_path = os.getenv("MODEL_PATH","/models/model.onnx")
# Prefer CUDA when the installed ONNX Runtime build supports it; otherwise fall back to CPU
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider")
             if p in ort.get_available_providers()]
session = ort.InferenceSession(model_path, providers=providers)

@app.get("/health")
def health():
    return {"ok": True, "model": os.path.basename(model_path), "providers": session.get_providers()}

@app.post("/infer")
async def infer(file: UploadFile = File(...)):
    img_bytes = await file.read()
    # NOTE: assumes a model with a 640x640 RGB input (e.g. YOLO-style exports);
    # adjust the resize to match your model's expected input shape.
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB").resize((640, 640))
    x = np.asarray(img).astype(np.float32) / 255.0                    # HWC in [0, 1]
    x = np.ascontiguousarray(np.transpose(x, (2, 0, 1))[None, ...])   # NCHW, batch of 1
    outputs = session.run(None, {session.get_inputs()[0].name: x})
    return JSONResponse({"outputs": [o.tolist() for o in outputs]})
PY

exec python3 -m uvicorn server:app --host "${SERVICE_HOST}" --port "${SERVICE_PORT}"
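With the container running, both endpoints can be exercised with curl (host port 8080 and test.jpg are assumptions):

# Liveness and provider check
curl -fsS http://localhost:8080/health

# Multipart upload; the response returns the raw model outputs as JSON
curl -fsS -X POST -F "file=@test.jpg" http://localhost:8080/infer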