triton-robot-inference
TensorRT/ONNX Runtime FastAPI microservice (Orin-first)
arm64
FastAPI, ONNX, TensorRT, Jetson, REST
Overview
Primary hardware
NVIDIA Orin/Jetson (arm64)
What it does
Lightweight FastAPI microservice running ONNX Runtime/TensorRT for common vision models.
Why it saves time
One-command GPU inference server; avoids hand-building TensorRT engines or stitching together CUDA libraries yourself.
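A minimal launch sketch, assuming the image has been built and tagged triton-robot-inference and an ONNX model sits in a host directory named models (tag, port, and paths here are illustrative):
# Start the server on a Jetson/Orin with the NVIDIA container runtime enabled
docker run --rm --runtime nvidia -p 8080:8080 \
  -v "$PWD/models:/models" \
  -e MODEL_PATH=/models/model.onnx \
  triton-robot-inference:latest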
Get access
Use StreamDeploy to manage OTA updates, versioned configs, and rollbacks across fleets.
Dockerfile
# Jetson-friendly CUDA/TensorRT runtime image (override with --build-arg BASE_IMAGE=...)
ARG BASE_IMAGE=nvcr.io/nvidia/l4t-pytorch:r35.4.1-py3
FROM ${BASE_IMAGE}
ENV DEBIAN_FRONTEND=noninteractive \
LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
RUN apt-get update && apt-get install -y --no-install-recommends \
python3-pip python3-dev build-essential curl ca-certificates && \
rm -rf /var/lib/apt/lists/*
# ONNX Runtime: GPU support on Jetson needs NVIDIA's Jetson-specific onnxruntime-gpu wheel;
# the PyPI onnxruntime-gpu wheel targets x86_64, so fall back to the CPU wheel if it is unavailable.
RUN pip3 install --no-cache-dir \
    fastapi==0.111.0 "uvicorn[standard]==0.30.0" \
    numpy==1.26.4 pillow==10.3.0 onnx==1.16.0 && \
    (pip3 install --no-cache-dir onnxruntime-gpu==1.18.0 || \
     pip3 install --no-cache-dir onnxruntime==1.18.0)
# Place model(s) (mount your own at runtime or bake here)
WORKDIR /app
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
# Example: default ONNX path (override with MODEL_PATH env)
ENV MODEL_PATH=/models/model.onnx \
SERVICE_HOST=0.0.0.0 \
SERVICE_PORT=8080
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s \
CMD curl -fsS http://localhost:8080/health || exit 1
ENTRYPOINT ["/app/entrypoint.sh"]
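The base image can be swapped at build time to match the device's JetPack/L4T release; a sketch, with the image tag chosen here for illustration:
# Build the image, overriding BASE_IMAGE if your JetPack version needs a different L4T tag
docker build \
  --build-arg BASE_IMAGE=nvcr.io/nvidia/l4t-pytorch:r35.4.1-py3 \
  -t triton-robot-inference:latest .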
entrypoint.sh
#!/usr/bin/env bash
set -euo pipefail
cat > /app/server.py << 'PY'
import os, io
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from PIL import Image
import numpy as np, onnxruntime as ort

app = FastAPI()
model_path = os.getenv("MODEL_PATH", "/models/model.onnx")
# Prefer the CUDA execution provider; ONNX Runtime falls back to CPU when it is unavailable
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession(model_path, providers=providers)

@app.get("/health")
def health():
    return {"ok": True, "model": os.path.basename(model_path), "providers": session.get_providers()}

@app.post("/infer")
async def infer(file: UploadFile = File(...)):
    img_bytes = await file.read()
    # Decode, resize to 640x640 (a typical detector input), scale to [0, 1], and convert HWC -> NCHW
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB").resize((640, 640))
    x = np.asarray(img).astype(np.float32) / 255.0
    x = np.transpose(x, (2, 0, 1))[None, ...]
    outputs = session.run(None, {session.get_inputs()[0].name: x})
    return JSONResponse({"outputs": [o.tolist() for o in outputs]})
PY
exec python3 -m uvicorn server:app --app-dir /app --host "${SERVICE_HOST}" --port "${SERVICE_PORT}"
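Once the container is up, a quick smoke test of the two endpoints (host port and sample image name are illustrative):
# Liveness and execution-provider check
curl -fsS http://localhost:8080/health
# Single-image inference; the multipart field name must be "file"
curl -fsS -X POST -F "file=@sample.jpg" http://localhost:8080/infer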