wendylabsinc/tensorrt-swift
A Swift package that provides Swift-first APIs for working with NVIDIA TensorRT on Linux, with a separate TensorRTLLM product for LLM-specific extensions.
System Requirements
Required Libraries
The package links against the following system libraries at build time and runtime:
| Library | Package | Purpose |
|---------|---------|---------|
| libnvinfer.so | TensorRT | Core inference engine |
| libnvinfer_plugin.so | TensorRT | Built-in plugins |
| libnvonnxparser.so | TensorRT | ONNX model import |
| libcuda.so | CUDA Driver | GPU access |
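All of these must be resolvable at runtime as well as at build time. As a quick smoke test from Swift (using the same probe call shown later in the Quick Start), something like the following should print a version once the libraries are on the loader path:

```swift
import TensorRT

// Probe the TensorRT runtime version; `try` because the probe can fail.
let version = try TensorRTRuntimeProbe.inferRuntimeVersion()
print("TensorRT runtime version: \(version)")
```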
Installation
Option 1: NVIDIA Container (Recommended)
Use the official TensorRT container which includes all dependencies:
```bash
docker run --gpus all -it nvcr.io/nvidia/tensorrt:24.08-py3
```
Option 1b: Jetson Container (Orin Nano, AGX Thor)
Jetson uses aarch64 containers and must match the host JetPack/L4T release. See docs/jetson-container.md for a full recipe.
Option 2: System Installation (Ubuntu/Debian)
```bash
# 1. Install CUDA 12.6
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6

# 2. Install TensorRT 10.x
sudo apt-get install -y libnvinfer10 libnvinfer-plugin10 libnvonnxparser10 libnvinfer-dev

# 3. Add CUDA to your path
export PATH=/usr/local/cuda-12.6/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH
```
Option 3: From NVIDIA Developer Downloads
- Download CUDA Toolkit 12.6
- Download TensorRT 10.x (requires NVIDIA Developer account)
- Follow NVIDIA's installation guides
Verifying Installation
```bash
# Check CUDA
nvcc --version

# Check TensorRT
dpkg -l | grep nvinfer
# or
ls /usr/lib/x86_64-linux-gnu/libnvinfer*
```
Swift Installation
Install Swift 6.2+ via Swiftly:
```bash
curl -L https://swiftlang.github.io/swiftly/swiftly-install.sh | bash
swiftly install 6.2
```
Development Workflow (macOS/Windows)
You can write code on macOS or Windows, but building and running must happen on Linux with TensorRT/CUDA libraries available. The recommended workflow is:
- Develop locally on macOS/Windows.
- Build/test inside a Linux container (Option 1 / 1b) or on a Linux host.
Cross-compiling from macOS/Windows to Linux is possible but fragile and not recommended.
What Works Today
Core APIs
| API | Description |
|-----|-------------|
| TensorRTRuntime.buildEngine(onnxURL:options:) | Build TensorRT engine from ONNX |
| TensorRTRuntime.deserializeEngine(from:) | Load serialized engine plan |
| Engine.save(to:) / Engine.load(from:) | Persist/load engines to disk |
| ExecutionContext.enqueue(_:) | Execute inference (host buffers) |
| ExecutionContext.enqueueDevice(...) | Execute with device pointers |
| ExecutionContext.warmup(iterations:) | Warmup for stable latency |
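For example, once an engine has been built and saved (see the Quick Start below), later runs can load it directly and skip the ONNX build. The following is a minimal sketch using only the calls listed above; exact signatures may differ slightly from this illustration:

```swift
import Foundation
import TensorRT

// Reuse an engine persisted earlier with buildEngine(...) + save(to:).
let engine = try Engine.load(from: URL(fileURLWithPath: "model.engine"))
let ctx = try engine.makeExecutionContext()

// A few warmup iterations for stable latency before real traffic.
_ = try await ctx.warmup(iterations: 10)

// Prepare a host-buffer input as in the Quick Start and run it.
let inputDesc = engine.description.inputs[0].descriptor
let input = [Float](repeating: 0, count: inputDesc.shape.elementCount)
let bytes = input.withUnsafeBufferPointer { Data(buffer: $0) }
let batch = InferenceBatch(inputs: [
    inputDesc.name: TensorValue(descriptor: inputDesc, storage: .host(bytes))
])
_ = try await ctx.enqueue(batch)
```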
GPU & Device APIs
| API | Description |
|-----|-------------|
| TensorRTSystem.cudaDeviceCount() | Number of available GPUs |
| TensorRTSystem.deviceProperties(device:) | GPU name, compute capability, memory |
| TensorRTSystem.memoryInfo(device:) | Free/total GPU memory |
| TensorRTSystem.CUDAStream | RAII stream wrapper |
| TensorRTSystem.CUDAEvent | RAII event wrapper |
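For example, these calls can be combined to inspect every visible GPU before deciding where to load an engine. A small sketch based on the table above; the argument labels follow the table, and whether cudaDeviceCount() throws is an assumption:

```swift
import TensorRT

// List each GPU with its compute capability and current free memory.
let count = try TensorRTSystem.cudaDeviceCount()
for device in 0..<count {
    let props = try TensorRTSystem.deviceProperties(device: device)
    let mem = try TensorRTSystem.memoryInfo(device: device)
    print("GPU \(device): \(props.name), compute \(props.computeCapability)")
    print("  free \(mem.free / 1_000_000_000) GB of \(mem.total / 1_000_000_000) GB")
}
```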
Dynamic Shapes & Profiles
| API | Description |
|-----|-------------|
| ExecutionContext.reshape(bindings:) | Set input shapes at runtime |
| ExecutionContext.setOptimizationProfile(named:) | Switch optimization profiles |
| OptimizationProfile | Define min/opt/max shapes |
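If an engine was built with several named profiles, switching between them at runtime might look like the sketch below. This assumes setOptimizationProfile is async and throwing like the other context calls, that the engine and input batch are prepared as in the dynamic-shapes example further down, and that a profile named "large_batch" exists (hypothetical name):

```swift
import TensorRT

// engine is assumed to have been built with multiple named profiles,
// e.g. "batch_range" (see the example below) and a hypothetical "large_batch".
let ctx = try engine.makeExecutionContext()

// Pick the profile that matches the incoming workload, then set the shape.
try await ctx.setOptimizationProfile(named: "large_batch")
try await ctx.reshape(bindings: ["input": TensorShape([32, 512])])
_ = try await ctx.enqueue(batch)
```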
LLM Extensions (TensorRTLLM)
| API | Description |
|-----|-------------|
| ExecutionContext.stream(...) | Streaming inference (AsyncSequence) |
| StreamingConfiguration | Configure token-by-token generation |
| StreamingInferenceStep | Per-step metadata and outputs |
Swift-y Conveniences
```swift
// TensorShape with array literal
let shape: TensorShape = [1, 3, 224, 224]
print(shape)     // "TensorShape[1, 3, 224, 224]"
print(shape[0])  // 1

// Engine persistence
try engine.save(to: URL(fileURLWithPath: "model.engine"))
let loaded = try Engine.load(from: URL(fileURLWithPath: "model.engine"))

// Query GPU before loading
let mem = try TensorRTSystem.memoryInfo()
print("Free GPU memory: \(mem.free / 1_000_000_000) GB")
```
Quick Start
Add the package to your Package.swift
```swift
// swift-tools-version: 6.2
import PackageDescription

let package = Package(
    name: "MyApp",
    dependencies: [
        .package(url: "https://github.com/wendylabsinc/tensorrt-swift", from: "0.0.1"),
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                .product(name: "TensorRT", package: "tensorrt-swift"),
            ]
        ),
    ]
)
```
To use the LLM extension module for streaming inference and other LLM utilities:
.product(name: "TensorRTLLM", package: "tensorrt-swift")Query GPU and TensorRT version
```swift
import TensorRT

// Check TensorRT version
let version = try TensorRTRuntimeProbe.inferRuntimeVersion()
print("TensorRT version: \(version)")

// Check GPU
let props = try TensorRTSystem.deviceProperties()
print("GPU: \(props.name)")
print("Compute Capability: \(props.computeCapability)")
print("Memory: \(props.totalMemory / 1_000_000_000) GB")

let mem = try TensorRTSystem.memoryInfo()
print("Free: \(mem.free / 1_000_000_000) GB / \(mem.total / 1_000_000_000) GB")
```
Build an engine from ONNX and run inference
```swift
import TensorRT

let runtime = TensorRTRuntime()
let engine = try runtime.buildEngine(
    onnxURL: URL(fileURLWithPath: "model.onnx"),
    options: EngineBuildOptions(
        precision: [.fp32],
        workspaceSizeBytes: 1 << 28
    )
)

// Save for later use (avoid rebuild)
try engine.save(to: URL(fileURLWithPath: "model.engine"))

let ctx = try engine.makeExecutionContext()

// Warmup for stable latency
let warmup = try await ctx.warmup(iterations: 10)
print("Warmup avg: \(warmup.average ?? .zero)")

// Run inference
let inputDesc = engine.description.inputs[0].descriptor
let input: [Float] = (0..<inputDesc.shape.elementCount).map(Float.init)
let inputBytes = input.withUnsafeBufferPointer { Data(buffer: $0) }
let batch = InferenceBatch(inputs: [
    inputDesc.name: TensorValue(descriptor: inputDesc, storage: .host(inputBytes))
])
let result = try await ctx.enqueue(batch)
```
Streaming inference (for LLMs)
```swift
import TensorRTLLM

let stream = context.stream(
    initialBatch: promptBatch,
    configuration: StreamingConfiguration(maxSteps: 100)
) { previousResult in
    // Transform previous output into next input (e.g., append generated token)
    return makeNextBatch(from: previousResult)
}

for try await step in stream {
    print("Step \(step.stepIndex), final: \(step.isFinal)")
    // Process each step as it arrives
    if step.isFinal { break }
}
```
Dynamic shapes with optimization profiles
```swift
import TensorRT

let profile = OptimizationProfile(
    name: "batch_range",
    axes: [:],
    bindingRanges: [
        "input": .init(
            min: TensorShape([1, 512]),
            optimal: TensorShape([8, 512]),
            max: TensorShape([32, 512])
        ),
    ]
)

let engine = try TensorRTRuntime().buildEngine(
    onnxURL: URL(fileURLWithPath: "dynamic.onnx"),
    options: EngineBuildOptions(precision: [.fp32], profiles: [profile])
)

let ctx = try engine.makeExecutionContext()
try await ctx.reshape(bindings: ["input": TensorShape([16, 512])])
let result = try await ctx.enqueue(batch)
```
Examples
The package includes 17 examples organized by difficulty level. Run any example with ./scripts/swiftw run <ExampleName>. The wrapper keeps build artifacts in /tmp by default; override with SWIFT_BUILD_PATH if needed.
Beginner Examples
| Example | Description | Command |
|---------|-------------|---------|
| HelloTensorRT | Minimal "hello world" - probe version, build identity engine, run inference | ./scripts/swiftw run HelloTensorRT |
| ONNXInference | Load ONNX model, build engine, run inference with throughput measurement | ./scripts/swiftw run ONNXInference |
| BatchProcessing | Process multiple batches, latency statistics (p50/p95/p99) | ./scripts/swiftw run BatchProcessing |
Intermediate Examples
| Example | Description | Command |
|---------|-------------|---------|
| DynamicBatching | Dynamic shapes for variable batch sizes at runtime | ./scripts/swiftw run DynamicBatching |
| MultiProfile | Multiple optimization profiles for different workloads | ./scripts/swiftw run MultiProfile |
| AsyncInference | Non-blocking inference with CUDA streams and events | ./scripts/swiftw run AsyncInference |
| ImageClassifier | End-to-end pipeline: preprocess → inference → postprocess | ./scripts/swiftw run ImageClassifier |
| DeviceMemoryPipeline | Keep tensors on GPU, avoid H2D/D2H transfers | ./scripts/swiftw run DeviceMemoryPipeline |
LLM Examples (TensorRTLLM)
LLM examples live under ExamplesLLM/.
| Example | Description | Command |
|---------|-------------|---------|
| StreamingLLM | Token-by-token generation with KV-cache pattern | ./scripts/swiftw run StreamingLLM |
Advanced Examples
| Example | Description | Command |
|---------|-------------|---------|
| MultiGPU | Distribute inference across multiple GPUs | ./scripts/swiftw run MultiGPU |
| CUDAEventPipelining | Overlap compute with data transfer using events | ./scripts/swiftw run CUDAEventPipelining |
| BenchmarkSuite | Comprehensive throughput/latency measurement | ./scripts/swiftw run BenchmarkSuite |
| FP16Quantization | Compare FP32 vs FP16 precision and performance | ./scripts/swiftw run FP16Quantization |
Real-World Examples
| Example | Description | Command |
|---------|-------------|---------|
| TextEmbedding | Sentence transformer for semantic search | ./scripts/swiftw run TextEmbedding |
| ObjectDetection | YOLO-style detection with NMS postprocessing | ./scripts/swiftw run ObjectDetection |
| WhisperTranscription | Audio transcription pipeline (encoder pattern) | ./scripts/swiftw run WhisperTranscription |
| VisionTransformer | ViT image classification with patch embeddings | ./scripts/swiftw run VisionTransformer |
Example Output: BenchmarkSuite
```
=== TensorRT Benchmark Suite ===
┌──────────┬────────────┬────────────┬────────────┬────────────┐
│ Elements │ Throughput │ p50        │ p95        │ p99        │
├──────────┼────────────┼────────────┼────────────┼────────────┤
│ 64       │ 91.0K      │ 10.4 µs    │ 12.5 µs    │ 22.8 µs    │
│ 1024     │ 75.5K      │ 11.5 µs    │ 22.1 µs    │ 23.1 µs    │
│ 16384    │ 31.3K      │ 31.8 µs    │ 33.2 µs    │ 37.1 µs    │
└──────────┴────────────┴────────────┴────────────┴────────────┘
```
Tests
Run:
```bash
./scripts/swiftw test
```
This wrapper keeps build artifacts in /tmp by default to avoid .build permission issues. Override with SWIFT_BUILD_PATH=/your/path ./scripts/swiftw test if needed.
The test suite includes end-to-end GPU tests that build engines (TensorRT builder and nvonnxparser), deserialize them, and run inference (host buffers, device pointers, external streams, and CUDA events).
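As an illustration of the shape of such a test (not the actual suite; the test name, fixture path, and the final check are hypothetical), a host-buffer round trip with Swift Testing might look like this:

```swift
import Foundation
import Testing
import TensorRT

@Test func buildsAnEngineAndRunsHostInference() async throws {
    let runtime = TensorRTRuntime()
    let engine = try runtime.buildEngine(
        onnxURL: URL(fileURLWithPath: "Tests/Fixtures/identity.onnx"),  // hypothetical fixture
        options: EngineBuildOptions(precision: [.fp32], workspaceSizeBytes: 1 << 28)
    )
    let ctx = try engine.makeExecutionContext()

    let inputDesc = engine.description.inputs[0].descriptor
    let input: [Float] = (0..<inputDesc.shape.elementCount).map(Float.init)
    let bytes = input.withUnsafeBufferPointer { Data(buffer: $0) }
    let batch = InferenceBatch(inputs: [
        inputDesc.name: TensorValue(descriptor: inputDesc, storage: .host(bytes))
    ])

    // The exact shape of the result type isn't shown in this README,
    // so the test simply checks that inference completes without throwing.
    _ = try await ctx.enqueue(batch)
}
```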
Troubleshooting
libnvinfer.so: cannot open shared object file
TensorRT libraries are not in your library path. Add them:
```bash
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
# or wherever TensorRT is installed
```
CUDA driver version is insufficient
Your NVIDIA driver is too old for CUDA 12.6. Update your driver:
```bash
sudo apt-get install nvidia-driver-550  # or newer
```
Swift can't find CUDA headers
Ensure CUDA is installed and the include path is correct:
```bash
ls /usr/local/cuda/include/cuda.h
# If not found, create symlink or adjust Package.swift
```
License
See LICENSE.txt.