Triton Inference Server

Use TritonExecutor to offload model inference to a remote Triton Inference Server while keeping pre/postprocessing local.

Install

pip install "tritonclient[grpc]"

Configure

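# app/adapters/http/deps.py
from functools import lru_cache

from app.execution.triton_executor import TritonExecutor


@lru_cache
def get_triton_executor() -> TritonExecutor:
    # Cached so repeated calls return the same executor instance.
    # 8001 is Triton's default gRPC port.
    return TritonExecutor(url="triton-host:8001", max_workers=8)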

Register it in ExecutionPolicy:

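# Assumes this lives alongside get_triton_executor, and that lru_cache,
# ExecutionPolicy and get_cpu_executor are already imported or defined there.
@lru_cache
def get_execution_policy() -> ExecutionPolicy:
    return ExecutionPolicy(
        executors={
            "cpu": get_cpu_executor(),
            "triton": get_triton_executor(),
        },
        # Route this model to Triton; unlisted models use the default.
        policy={"my_model:v1": "triton"},
        default="cpu",
    )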
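
The configuration above implies a simple lookup: a model key listed in `policy` is routed to the named executor, and any other key falls back to `default`. The sketch below illustrates that resolution with a hypothetical `resolve_executor_name` helper. It is an assumption about the behaviour, not the actual `ExecutionPolicy` implementation.

```python
from typing import Dict


def resolve_executor_name(policy: Dict[str, str], default: str, model_key: str) -> str:
    """Hypothetical helper: pick the executor name for a model key."""
    # Keys present in the policy win; everything else uses the default.
    return policy.get(model_key, default)


policy = {"my_model:v1": "triton"}
print(resolve_executor_name(policy, "cpu", "my_model:v1"))
print(resolve_executor_name(policy, "cpu", "other_model:v2"))
```

Under this reading, adding another model to Triton is a one-line change to the `policy` mapping.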

How it works

Preprocessing and postprocessing run locally in the inference engine. Only the model.predict() call is sent to Triton over gRPC, so the pipeline structure is unchanged.
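
The split described above can be sketched as follows. Everything here is illustrative: `Executor`, `LocalDoublingExecutor`, `GrpcSketchExecutor` and `run_pipeline` are hypothetical names, not the project's classes, and the tensor names `INPUT__0` / `OUTPUT__0` and the `FP32` datatype are assumptions that would have to match the model's configuration in Triton. Only the `tritonclient.grpc` calls are the real client API.

```python
from typing import Any, Callable, List, Protocol


class Executor(Protocol):
    """Hypothetical interface for whatever runs the predict step."""

    def predict(self, model_name: str, inputs: Any) -> Any: ...


class LocalDoublingExecutor:
    """Stand-in executor that runs in-process (doubles each value)."""

    def predict(self, model_name: str, inputs: List[float]) -> List[float]:
        return [2.0 * x for x in inputs]


class GrpcSketchExecutor:
    """Illustrative remote executor built on tritonclient's gRPC client."""

    def __init__(self, url: str, input_name: str = "INPUT__0",
                 output_name: str = "OUTPUT__0") -> None:
        # Imported lazily so the rest of the sketch runs without tritonclient.
        import tritonclient.grpc as grpcclient

        self._grpc = grpcclient
        self._client = grpcclient.InferenceServerClient(url=url)
        self._input_name = input_name    # assumed tensor name
        self._output_name = output_name  # assumed tensor name

    def predict(self, model_name: str, inputs: Any) -> Any:
        import numpy as np

        batch = np.asarray(inputs, dtype=np.float32)
        infer_input = self._grpc.InferInput(
            self._input_name, list(batch.shape), "FP32"
        )
        infer_input.set_data_from_numpy(batch)
        requested = self._grpc.InferRequestedOutput(self._output_name)
        result = self._client.infer(
            model_name=model_name, inputs=[infer_input], outputs=[requested]
        )
        return result.as_numpy(self._output_name)


def run_pipeline(
    executor: Executor,
    model_name: str,
    raw: Any,
    preprocess: Callable[[Any], Any],
    postprocess: Callable[[Any], Any],
) -> Any:
    features = preprocess(raw)                         # runs locally
    outputs = executor.predict(model_name, features)   # may run remotely
    return postprocess(outputs)                        # runs locally


# Swapping the executor changes where predict() runs, not the pipeline.
result = run_pipeline(
    LocalDoublingExecutor(),
    "my_model",
    "1 2 3",
    preprocess=lambda text: [float(t) for t in text.split()],
    postprocess=lambda values: sum(values),
)
print(result)
```

Passing `GrpcSketchExecutor("triton-host:8001")` instead of the local stand-in would send only the `predict` step over gRPC, which is the property the section describes.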

Requirements

Triton Inference Server running and accessible at the configured URL
Model loaded in Triton's model repository
tritonclient[grpc] installed in the inference engine environment
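
The first two requirements can be checked from the inference engine environment with the health calls on `tritonclient.grpc.InferenceServerClient`. The sketch below is one possible check, not part of the project: `check_triton_ready` and `main` are hypothetical helpers, and the Triton model name `my_model` is an assumption, since the mapping from the policy key `my_model:v1` to a Triton model name and version is not shown here.

```python
from typing import List


def check_triton_ready(client, model_name: str, model_version: str = "") -> List[str]:
    """Return a list of unmet requirements; an empty list means ready.

    `client` is expected to expose is_server_live(), is_server_ready() and
    is_model_ready(), as tritonclient.grpc.InferenceServerClient does.
    """
    problems: List[str] = []
    if not client.is_server_live():
        problems.append("server is not live")
    elif not client.is_server_ready():
        problems.append("server is not ready")
    elif not client.is_model_ready(model_name, model_version):
        problems.append(f"model {model_name!r} is not ready")
    return problems


def main() -> None:
    # Requires tritonclient[grpc]; an unreachable server raises
    # InferenceServerException rather than returning False.
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="triton-host:8001")
    problems = check_triton_ready(client, "my_model")  # assumed model name
    print(problems or "all requirements met")
```

Calling `main()` against a running server prints either the unmet requirements or a confirmation. An `ImportError` on the `tritonclient` import indicates the third requirement is not met.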