Deep Learning Model Deployment: Production with ONNX, TensorRT and FastAPI

From Training to Production: Deploying Deep Learning Models at Scale

Figure 10. The end-to-end model deployment pipeline, from training to production serving

Moving deep learning models from experimentation to production presents unique challenges in performance, scalability, and maintainability. In this comprehensive guide, we'll explore model optimization with ONNX and TensorRT, building scalable APIs with FastAPI, and deploying to edge devices with TensorFlow Lite.

1. The Deployment Challenge

Production requirements differ significantly from research environments:

Requirement  | Research Focus   | Production Needs
Latency      | Batch processing | Real-time inference
Throughput   | Single GPU       | Horizontal scaling
Resource use | Maximum accuracy | Efficiency constraints
Reliability  | Experimental     | High availability

Key Insight: Deployment isn't just about the model; it's about building a reliable system around the model that meets business requirements.
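To make the latency/throughput trade-off concrete, here is a back-of-the-envelope capacity estimate using Little's law (a sketch; the 40 ms latency and 500 QPS target are made-up numbers):

```python
import math

def replicas_needed(target_qps: float, p99_latency_s: float,
                    per_replica_concurrency: int = 1) -> int:
    """Little's law: requests in flight = arrival rate * latency.
    Divide by how many requests one replica can serve concurrently."""
    in_flight = target_qps * p99_latency_s
    return math.ceil(in_flight / per_replica_concurrency)

# Example: 500 requests/s at a 40 ms p99, one request per GPU at a time
print(replicas_needed(500, 0.040))  # 20 replicas
```

Estimates like this set a baseline before any load testing: if the replica count is unaffordable, optimize the model first.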

2. Model Optimization

ONNX (Open Neural Network Exchange)

ONNX provides a standardized model format for framework interoperability:

Figure 10.1 ONNX enables framework-agnostic model representation, converting models between deep learning frameworks
# Exporting a PyTorch model to ONNX
import torch

model = ... # your trained model
model.eval() # switch to inference mode before export
dummy_input = torch.randn(1, 3, 224, 224) # example input shape

# Export the model
torch.onnx.export(
  model,
  dummy_input,
  "model.onnx",
  input_names=["input"],
  output_names=["output"],
  dynamic_axes={
    "input": {0: "batch_size"},
    "output": {0: "batch_size"}
  }
)

# Running inference with ONNX Runtime
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
inputs = {"input": dummy_input.numpy()}
outputs = sess.run(["output"], inputs)
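After export, it's worth confirming that ONNX Runtime reproduces the PyTorch outputs. A minimal NumPy comparison helper (a sketch; `torch_out` and `onnx_out` are assumed to be the two output arrays captured from the snippets above):

```python
import numpy as np

def outputs_match(torch_out: np.ndarray, onnx_out: np.ndarray,
                  rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Elementwise agreement within tolerance; floating-point accumulation
    order differs between runtimes, so exact equality is too strict."""
    return bool(np.allclose(torch_out, onnx_out, rtol=rtol, atol=atol))

# Synthetic example: identical logits with tiny float noise still match
a = np.random.rand(1, 1000).astype(np.float32)
print(outputs_match(a, a + 1e-7))  # True
```

Run this check once per export; a mismatch usually points at an unsupported operator or a preprocessing difference rather than ONNX itself.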

TensorRT Optimization

NVIDIA's TensorRT provides high-performance inference optimization:

  • Layer fusion: Combine operations to reduce overhead
  • Precision calibration: FP16/INT8 quantization
  • Kernel auto-tuning: Select best implementations
# TensorRT optimization pipeline
# 1. Convert the ONNX model to a TensorRT engine (shell command)
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16

# 2. Load and run the TensorRT engine in Python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit # initializes the CUDA context

# Load engine
with open("model.engine", "rb") as f:
  engine_data = f.read()
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(engine_data)

# Create execution context
context = engine.create_execution_context()

# Allocate buffers
inputs, outputs, bindings = [], [], []
stream = cuda.Stream()
for binding in engine: # legacy binding API (pre-TensorRT 8.5)
  size = trt.volume(engine.get_binding_shape(binding)) # assumes static shapes
  dtype = trt.nptype(engine.get_binding_dtype(binding))
  # Allocate host and device buffers
  host_mem = cuda.pagelocked_empty(size, dtype)
  device_mem = cuda.mem_alloc(host_mem.nbytes)
  bindings.append(int(device_mem))
  if engine.binding_is_input(binding):
    inputs.append({"host": host_mem, "device": device_mem})
  else:
    outputs.append({"host": host_mem, "device": device_mem})

# Run inference (input_data: a preprocessed NumPy array matching the input shape)
np.copyto(inputs[0]["host"], input_data.ravel())
cuda.memcpy_htod_async(inputs[0]["device"], inputs[0]["host"], stream)
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
cuda.memcpy_dtoh_async(outputs[0]["host"], outputs[0]["device"], stream)
stream.synchronize()
output = outputs[0]["host"]
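Once an engine runs, measure it: TensorRT's gains only matter if they show up in your latency percentiles. A runtime-agnostic micro-benchmark sketch (pure Python; `infer` is any inference callable, here a stand-in sleep):

```python
import time
import statistics

def benchmark(infer, warmup: int = 10, iters: int = 100) -> dict:
    """Time repeated calls and report latency percentiles in milliseconds."""
    for _ in range(warmup):  # warm-up pass: JIT, caches, clock ramp-up
        infer()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[min(iters - 1, int(iters * 0.99))],
        "mean_ms": statistics.fmean(samples),
    }

stats = benchmark(lambda: time.sleep(0.001), warmup=2, iters=20)
print(stats["p50_ms"] >= 1.0)  # the 1 ms sleep floor shows up in the median
```

Compare the ONNX Runtime and TensorRT paths with the same harness; report p99, not the mean, since tail latency is what users feel.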

3. Model Serving with FastAPI

FastAPI provides a modern Python framework for building model APIs:

Figure 10.2 FastAPI provides efficient model serving infrastructure for the request-handling and inference flow
# FastAPI model serving example
from fastapi import FastAPI, File, UploadFile
import numpy as np
from PIL import Image
import io

app = FastAPI()

# Load the model once at startup; load_model is a placeholder for your
# ONNX Runtime / TensorRT loading code
model = load_model("model.engine")

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
  # Read and decode the uploaded image
  contents = await file.read()
  image = Image.open(io.BytesIO(contents)).convert("RGB") # ensure 3 channels
  # Preprocess: resize, scale to [0, 1], channels-first, add batch dimension
  image = image.resize((224, 224))
  image_array = (np.array(image) / 255.0).astype(np.float32)
  image_array = np.transpose(image_array, (2, 0, 1))
  image_array = np.expand_dims(image_array, 0)
  # Predict
  prediction = model.predict(image_array)
  return {"prediction": prediction.tolist()}

# Run with: uvicorn api:app --host 0.0.0.0 --port 8000

Production Considerations

  • Scaling: Run multiple Uvicorn workers under Gunicorn
  • Monitoring: Expose a Prometheus metrics endpoint
  • Batching: Group concurrent requests into batched inference calls
  • Load Testing: Use Locust or k6 to verify latency under load
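The batching bullet deserves a concrete shape: collect requests for a few milliseconds, run one batched forward pass, and fan the results back out. A minimal asyncio micro-batcher sketch (framework-agnostic; `batch_fn` stands in for your model's batched call):

```python
import asyncio

class MicroBatcher:
    """Groups concurrent requests into one call to batch_fn(list) -> list."""
    def __init__(self, batch_fn, max_batch: int = 8, max_wait_s: float = 0.005):
        self.batch_fn, self.max_batch, self.max_wait_s = batch_fn, max_batch, max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self._worker = None

    async def submit(self, item):
        if self._worker is None:  # lazily start the background batching loop
            self._worker = asyncio.create_task(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def _run(self):
        while True:
            item, fut = await self.queue.get()
            batch = [(item, fut)]
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            # Keep collecting until the batch is full or the wait budget expires
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.batch_fn([x for x, _ in batch])  # one batched call
            for (_, f), r in zip(batch, results):
                f.set_result(r)

async def main():
    batcher = MicroBatcher(batch_fn=lambda xs: [x * 2 for x in xs])
    outs = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    print(outs)  # [0, 2, 4, 6, 8]

asyncio.run(main())
```

Batching trades a few milliseconds of queueing delay for much higher GPU utilization; tune `max_batch` and `max_wait_s` against your latency budget.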

4. Edge Deployment with TensorFlow Lite

For mobile and embedded devices, TensorFlow Lite provides a lightweight runtime and a converter that produces compact, hardware-friendly models:

Figure 10.3 TensorFlow Lite enables efficient edge deployment, from training through to the mobile device
# Converting to TensorFlow Lite
import tensorflow as tf

# Convert SavedModel to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16] # FP16 quantization
tflite_model = converter.convert()

# Save the model
with open("model.tflite", "wb") as f:
  f.write(tflite_model)

# Android inference example (Java)
/*
try (Interpreter interpreter = new Interpreter(loadModelFile(context))) {
  ByteBuffer input = convertInputImage(bitmap);
  float[][] output = new float[1][NUM_CLASSES];
  interpreter.run(input, output);
  // Process output...
}
*/
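The storage win from quantization is easy to estimate: each parameter drops from 4 bytes in FP32 to 2 in FP16 or 1 in INT8. A small sketch (the 25 M parameter count is a made-up, roughly ResNet-50-sized example):

```python
BYTES = {"fp32": 4, "fp16": 2, "int8": 1}

def model_size_mb(num_params: int, dtype: str) -> float:
    """Rough weight-storage size, ignoring metadata and activations."""
    return num_params * BYTES[dtype] / 1e6

params = 25_000_000  # hypothetical ~ResNet-50-sized model
for dtype in BYTES:
    print(f"{dtype}: {model_size_mb(params, dtype):.0f} MB")
# fp32: 100 MB, fp16: 50 MB, int8: 25 MB
```

On-device, the smaller footprint also improves load time and cache behavior, though INT8 requires calibration data to preserve accuracy.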

5. Deployment Architectures

Real-time Serving

  • FastAPI/Flask with model loaded in memory
  • Horizontal scaling with Kubernetes
  • GPU acceleration for low latency

Batch Processing

  • Airflow/Luigi pipelines
  • Spark or Dask for large datasets
  • Cost-effective spot instances

Edge Deployment

  • TFLite for mobile
  • ONNX Runtime for edge devices
  • Custom optimizations for specific hardware
Monitoring Tip: Track both system metrics (latency, throughput) and model metrics (input distribution, prediction confidence) to catch performance and data drift issues.
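One simple way to operationalize the drift half of that tip is the Population Stability Index: bin a reference sample of an input statistic, then compare live traffic against those bins. A NumPy sketch (the Gaussian samples are synthetic stand-ins for a real input feature):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference sample and live traffic.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) for empty bins
    e_frac, a_frac = np.clip(e_frac, 1e-6, None), np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)  # training-time input statistic
print(psi(ref, rng.normal(0, 1, 10_000)) < 0.1)   # same distribution: stable
print(psi(ref, rng.normal(1, 1, 10_000)) > 0.25)  # shifted mean: drift
```

Computed on a rolling window and exported as a metric, this turns silent data drift into an alertable signal alongside latency and throughput.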

6. Continuous Delivery for ML

ML systems require specialized CI/CD pipelines:

Figure 10.4 End-to-end ML CI/CD pipeline with validation and deployment stages

Key Components

  • Data Validation: Check for schema/skew issues
  • Model Testing: Accuracy, fairness, performance
  • Canary Deployment: Gradually roll out new models
  • Rollback Strategy: Quickly revert problematic models
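Canary routing can be as simple as deterministic hashing of a request key, so each user consistently hits the same model version while you dial the percentage up. A sketch (the version names are placeholders):

```python
import hashlib

def route(user_id: str, canary_pct: float) -> str:
    """Deterministically send ~canary_pct percent of users to the canary model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "model-canary" if bucket < canary_pct * 100 else "model-stable"

# The same user always lands on the same side of the split
assert route("user-42", 5.0) == route("user-42", 5.0)
share = sum(route(f"user-{i}", 5.0) == "model-canary" for i in range(10_000)) / 10_000
print(f"canary share ≈ {share:.1%}")  # close to 5%
```

Sticky assignment matters: it keeps per-user behavior consistent and makes canary metrics comparable to the stable cohort. Rollback is then just setting `canary_pct` to zero.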

Conclusion

Deploying deep learning models to production requires careful consideration of performance, scalability, and maintainability. By leveraging tools like ONNX and TensorRT for optimization, FastAPI for serving, and TensorFlow Lite for edge deployment, you can build robust systems that deliver model predictions efficiently and reliably.

This concludes our comprehensive deep learning series. We've covered everything from mathematical foundations to cutting-edge architectures and production deployment. To continue your learning journey, explore our recommended resources for staying current in this rapidly evolving field.

Figure 10.5 The future of AI deployment spans from data centers to edge devices, across diverse cloud and edge environments


Follow DrASR Deep Learning for more in-depth tutorials, fundamentals, and research-backed content in Deep Learning.

If you found this helpful, leave a comment or share it with your peers. Let’s grow together in AI learning!
