Deep Learning Model Deployment: Production with ONNX, TensorRT and FastAPI

From Training to Production: Deploying Deep Learning Models at Scale

Figure 10. The end-to-end model deployment pipeline, from training to production serving

Moving deep learning models from experimentation to production presents unique challenges in performance, scalability, and maintainability. In this comprehensive guide, we'll explore model optimization with ONNX and TensorRT, building scalable APIs with FastAPI, and deploying to edge devices with TensorFlow Lite.

1. The Deployment Challenge

Production requirements differ significantly from research environments:

Requirement  | Research Focus   | Production Needs
Latency      | Batch processing | Real-time inference
Throughput   | Single GPU       | Horizontal scaling
Resource use | Maximum accuracy | Efficiency constraints
Reliability  | Experimental     | High availability

Key Insight: Deployment isn't just about the model; it's about building a reliable system around the model that meets business requirements.
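To make the latency/throughput trade-off concrete, here is a back-of-the-envelope capacity estimate using Little's law (a sketch; the 40 ms latency and 500 QPS target are made-up numbers):

```python
import math

def replicas_needed(target_qps: float, p99_latency_s: float,
                    per_replica_concurrency: int = 1) -> int:
    """Little's law: requests in flight = arrival rate * latency.
    Divide by how many requests one replica can serve concurrently."""
    in_flight = target_qps * p99_latency_s
    return math.ceil(in_flight / per_replica_concurrency)

# Example: 500 requests/s at a 40 ms p99, one request per GPU at a time
print(replicas_needed(500, 0.040))  # 20 replicas
```

Estimates like this set a baseline before any load testing: if the replica count is unaffordable, optimize the model first.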

2. Model Optimization

ONNX (Open Neural Network Exchange)

ONNX provides a standardized model format for framework interoperability:

Figure 10.1 ONNX enables framework-agnostic model representation, converting models between deep learning frameworks
# Exporting a PyTorch model to ONNX
import torch

model = ... # your trained model
model.eval() # switch to inference mode before export
dummy_input = torch.randn(1, 3, 224, 224) # example input shape

# Export the model
torch.onnx.export(
  model,
  dummy_input,
  "model.onnx",
  input_names=["input"],
  output_names=["output"],
  dynamic_axes={
    "input": {0: "batch_size"},
    "output": {0: "batch_size"}
  }
)

# Running inference with ONNX Runtime
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
inputs = {"input": dummy_input.numpy()}
outputs = sess.run(["output"], inputs)
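After export, it's worth confirming that ONNX Runtime reproduces the PyTorch outputs. A minimal NumPy comparison helper (a sketch; `torch_out` and `onnx_out` are assumed to be the two output arrays captured from the snippets above):

```python
import numpy as np

def outputs_match(torch_out: np.ndarray, onnx_out: np.ndarray,
                  rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Elementwise agreement within tolerance; floating-point accumulation
    order differs between runtimes, so exact equality is too strict."""
    return bool(np.allclose(torch_out, onnx_out, rtol=rtol, atol=atol))

# Synthetic example: identical logits with tiny float noise still match
a = np.random.rand(1, 1000).astype(np.float32)
print(outputs_match(a, a + 1e-7))  # True
```

Run this check once per export; a mismatch usually points at an unsupported operator or a preprocessing difference rather than ONNX itself.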

TensorRT Optimization

NVIDIA's TensorRT provides high-performance inference optimization:

  • Layer fusion: Combine operations to reduce overhead
  • Precision calibration: FP16/INT8 quantization
  • Kernel auto-tuning: Select best implementations
# TensorRT optimization pipeline
# 1. Convert the ONNX model to a TensorRT engine (shell command)
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16

# 2. Load and run the TensorRT engine in Python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit # initializes the CUDA context

# Load engine
with open("model.engine", "rb") as f:
  engine_data = f.read()
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(engine_data)

# Create execution context
context = engine.create_execution_context()

# Allocate buffers
inputs, outputs, bindings = [], [], []
stream = cuda.Stream()
for binding in engine: # legacy binding API (pre-TensorRT 8.5)
  size = trt.volume(engine.get_binding_shape(binding)) # assumes static shapes
  dtype = trt.nptype(engine.get_binding_dtype(binding))
  # Allocate host and device buffers
  host_mem = cuda.pagelocked_empty(size, dtype)
  device_mem = cuda.mem_alloc(host_mem.nbytes)
  bindings.append(int(device_mem))
  if engine.binding_is_input(binding):
    inputs.append({"host": host_mem, "device": device_mem})
  else:
    outputs.append({"host": host_mem, "device": device_mem})

# Run inference (input_data: a preprocessed NumPy array matching the input shape)
np.copyto(inputs[0]["host"], input_data.ravel())
cuda.memcpy_htod_async(inputs[0]["device"], inputs[0]["host"], stream)
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
cuda.memcpy_dtoh_async(outputs[0]["host"], outputs[0]["device"], stream)
stream.synchronize()
output = outputs[0]["host"]
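Once an engine runs, measure it: TensorRT's gains only matter if they show up in your latency percentiles. A runtime-agnostic micro-benchmark sketch (pure Python; `infer` is any inference callable, here a stand-in sleep):

```python
import time
import statistics

def benchmark(infer, warmup: int = 10, iters: int = 100) -> dict:
    """Time repeated calls and report latency percentiles in milliseconds."""
    for _ in range(warmup):  # warm-up pass: JIT, caches, clock ramp-up
        infer()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[min(iters - 1, int(iters * 0.99))],
        "mean_ms": statistics.fmean(samples),
    }

stats = benchmark(lambda: time.sleep(0.001), warmup=2, iters=20)
print(stats["p50_ms"] >= 1.0)  # the 1 ms sleep floor shows up in the median
```

Compare the ONNX Runtime and TensorRT paths with the same harness; report p99, not the mean, since tail latency is what users feel.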

3. Model Serving with FastAPI

FastAPI provides a modern Python framework for building model APIs:

Figure 10.2 FastAPI provides efficient model serving infrastructure for the request-handling and inference flow
# FastAPI model serving example
from fastapi import FastAPI, File, UploadFile
import numpy as np
from PIL import Image
import io

app = FastAPI()

# Load the model once at startup; load_model is a placeholder for your
# ONNX Runtime / TensorRT loading code
model = load_model("model.engine")

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
  # Read and decode the uploaded image
  contents = await file.read()
  image = Image.open(io.BytesIO(contents)).convert("RGB") # ensure 3 channels
  # Preprocess: resize, scale to [0, 1], channels-first, add batch dimension
  image = image.resize((224, 224))
  image_array = (np.array(image) / 255.0).astype(np.float32)
  image_array = np.transpose(image_array, (2, 0, 1))
  image_array = np.expand_dims(image_array, 0)
  # Predict
  prediction = model.predict(image_array)
  return {"prediction": prediction.tolist()}

# Run with: uvicorn api:app --host 0.0.0.0 --port 8000

Production Considerations

  • Scaling: Run multiple Uvicorn workers under Gunicorn
  • Monitoring: Expose a Prometheus metrics endpoint
  • Batching: Group concurrent requests into batched inference calls
  • Load Testing: Use Locust or k6 to verify latency under load
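The batching bullet deserves a concrete shape: collect requests for a few milliseconds, run one batched forward pass, and fan the results back out. A minimal asyncio micro-batcher sketch (framework-agnostic; `batch_fn` stands in for your model's batched call):

```python
import asyncio

class MicroBatcher:
    """Groups concurrent requests into one call to batch_fn(list) -> list."""
    def __init__(self, batch_fn, max_batch: int = 8, max_wait_s: float = 0.005):
        self.batch_fn, self.max_batch, self.max_wait_s = batch_fn, max_batch, max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self._worker = None

    async def submit(self, item):
        if self._worker is None:  # lazily start the background batching loop
            self._worker = asyncio.create_task(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def _run(self):
        while True:
            item, fut = await self.queue.get()
            batch = [(item, fut)]
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            # Keep collecting until the batch is full or the wait budget expires
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.batch_fn([x for x, _ in batch])  # one batched call
            for (_, f), r in zip(batch, results):
                f.set_result(r)

async def main():
    batcher = MicroBatcher(batch_fn=lambda xs: [x * 2 for x in xs])
    outs = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    print(outs)  # [0, 2, 4, 6, 8]

asyncio.run(main())
```

Batching trades a few milliseconds of queueing delay for much higher GPU utilization; tune `max_batch` and `max_wait_s` against your latency budget.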

4. Edge Deployment with TensorFlow Lite

For mobile and embedded devices, TensorFlow Lite provides a lightweight runtime and a converter that produces compact, hardware-friendly models:

Figure 10.3 TensorFlow Lite enables efficient edge deployment, from training through to the mobile device
# Converting to TensorFlow Lite
import tensorflow as tf

# Convert SavedModel to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16] # FP16 quantization
tflite_model = converter.convert()

# Save the model
with open("model.tflite", "wb") as f:
  f.write(tflite_model)

# Android inference example (Java)
/*
try (Interpreter interpreter = new Interpreter(loadModelFile(context))) {
  ByteBuffer input = convertInputImage(bitmap);
  float[][] output = new float[1][NUM_CLASSES];
  interpreter.run(input, output);
  // Process output...
}
*/
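The storage win from quantization is easy to estimate: each parameter drops from 4 bytes in FP32 to 2 in FP16 or 1 in INT8. A small sketch (the 25 M parameter count is a made-up, roughly ResNet-50-sized example):

```python
BYTES = {"fp32": 4, "fp16": 2, "int8": 1}

def model_size_mb(num_params: int, dtype: str) -> float:
    """Rough weight-storage size, ignoring metadata and activations."""
    return num_params * BYTES[dtype] / 1e6

params = 25_000_000  # hypothetical ~ResNet-50-sized model
for dtype in BYTES:
    print(f"{dtype}: {model_size_mb(params, dtype):.0f} MB")
# fp32: 100 MB, fp16: 50 MB, int8: 25 MB
```

On-device, the smaller footprint also improves load time and cache behavior, though INT8 requires calibration data to preserve accuracy.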

5. Deployment Architectures

Real-time Serving

  • FastAPI/Flask with model loaded in memory
  • Horizontal scaling with Kubernetes
  • GPU acceleration for low latency

Batch Processing

  • Airflow/Luigi pipelines
  • Spark or Dask for large datasets
  • Cost-effective spot instances

Edge Deployment

  • TFLite for mobile
  • ONNX Runtime for edge devices
  • Custom optimizations for specific hardware
Monitoring Tip: Track both system metrics (latency, throughput) and model metrics (input distribution, prediction confidence) to catch performance and data drift issues.
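One simple way to operationalize the drift half of that tip is the Population Stability Index: bin a reference sample of an input statistic, then compare live traffic against those bins. A NumPy sketch (the Gaussian samples are synthetic stand-ins for a real input feature):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference sample and live traffic.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) for empty bins
    e_frac, a_frac = np.clip(e_frac, 1e-6, None), np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)  # training-time input statistic
print(psi(ref, rng.normal(0, 1, 10_000)) < 0.1)   # same distribution: stable
print(psi(ref, rng.normal(1, 1, 10_000)) > 0.25)  # shifted mean: drift
```

Computed on a rolling window and exported as a metric, this turns silent data drift into an alertable signal alongside latency and throughput.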

6. Continuous Delivery for ML

ML systems require specialized CI/CD pipelines:

Figure 10.4 End-to-end ML CI/CD pipeline with validation and deployment stages

Key Components

  • Data Validation: Check for schema/skew issues
  • Model Testing: Accuracy, fairness, performance
  • Canary Deployment: Gradually roll out new models
  • Rollback Strategy: Quickly revert problematic models
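Canary routing can be as simple as deterministic hashing of a request key, so each user consistently hits the same model version while you dial the percentage up. A sketch (the version names are placeholders):

```python
import hashlib

def route(user_id: str, canary_pct: float) -> str:
    """Deterministically send ~canary_pct percent of users to the canary model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "model-canary" if bucket < canary_pct * 100 else "model-stable"

# The same user always lands on the same side of the split
assert route("user-42", 5.0) == route("user-42", 5.0)
share = sum(route(f"user-{i}", 5.0) == "model-canary" for i in range(10_000)) / 10_000
print(f"canary share ≈ {share:.1%}")  # close to 5%
```

Sticky assignment matters: it keeps per-user behavior consistent and makes canary metrics comparable to the stable cohort. Rollback is then just setting `canary_pct` to zero.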

Conclusion

Deploying deep learning models to production requires careful consideration of performance, scalability, and maintainability. By leveraging tools like ONNX and TensorRT for optimization, FastAPI for serving, and TensorFlow Lite for edge deployment, you can build robust systems that deliver model predictions efficiently and reliably.

This concludes our comprehensive deep learning series. We've covered everything from mathematical foundations to cutting-edge architectures and production deployment. To continue your learning journey, explore our recommended resources for staying current in this rapidly evolving field.

Figure 10.5 The future of AI deployment spans from data centers to edge devices, across diverse cloud and edge environments


Follow DrASR Deep Learning for more in-depth tutorials, fundamentals, and research-backed content in Deep Learning.

If you found this helpful, leave a comment or share it with your peers. Let’s grow together in AI learning!
