As deep learning models grow in complexity, optimizing performance and reducing latency has become more critical than ever. Two standout tools address this need: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s TensorRT, a platform for high-performance deep learning inference. Both aim to enhance performance and reduce latency, but they serve different purposes and operate in different ways. Interestingly, in our tests on models such as LLama-7b, LLama-3-8b, Mistral-v0.1, phi-3, and phi-2, torch.compile delivered performance similar to TensorRT. This blog post dives into a detailed comparison of torch.compile and TensorRT, highlighting their features, use cases, and benchmarking insights to help you understand when and where to use each.
TL;DR
Both torch.compile and TensorRT offer powerful optimization features for deep learning workflows. In our tests on models such as LLama-7b, LLama-3-8b, Mistral-v0.1, phi-3, and phi-2, torch.compile matched or exceeded TensorRT's performance while being far easier to use. Unless you need TensorRT-specific features or work exclusively within NVIDIA's ecosystem, torch.compile is the better choice for optimizing PyTorch models.
Torch.compile
Introduced in PyTorch 2.0, torch.compile brings a dynamic and user-friendly approach to model optimization. It uses backend compilers like TorchInductor and other JIT compilation techniques to accelerate training and inference. Here are its key features:
Ease of use: Developers can optimize their models with a single line of code: model = torch.compile(model) (see the sketch after this list)
Dynamic graphs: True to PyTorch’s philosophy, torch.compile supports dynamic computation graphs, making it versatile for research and production
Backend agnostic: It integrates with various backend compilers, enabling optimizations tailored to different hardware setups
Training optimization: Unlike TensorRT, torch.compile can optimize training as well as inference.
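As a minimal sketch of that one-line workflow (the toy model, the CUDA device, and the mode flag are illustrative choices, not part of our benchmark setup):

```python
import torch
import torch.nn as nn

# A small stand-in model; any nn.Module can be compiled the same way.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()

# One line enables compilation; TorchInductor is the default backend.
# mode="reduce-overhead" and fullgraph=True are optional tuning knobs.
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 512, device="cuda")
out = compiled_model(x)  # the first call triggers compilation; later calls reuse it
```

The same call works for training: gradients flow through the compiled module just as they do through the original one.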
Use cases
Ideal for PyTorch users looking for a seamless integration into their training and inference workflows
Suitable for dynamic or complex models that require flexible graph execution
TensorRT
TensorRT is a highly specialized platform for deploying deep learning models on NVIDIA GPUs. It focuses on inference acceleration, leveraging hardware-specific optimizations to maximize performance. Key features include:
Hardware optimizations: TensorRT takes full advantage of NVIDIA GPU architectures, including support for Tensor Cores and FP16/INT8 precision
Inference-centric: Unlike torch.compile, TensorRT exclusively targets inference, with features like kernel fusion, layer merging, and reduced memory overhead
Model conversion: TensorRT requires models to be converted into its own engine format, typically by first exporting them to ONNX, which introduces an additional preprocessing step (see the sketch after this list)
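To make that extra step concrete, here is a hedged sketch of a common export path; the toy model, file names, and opset version are placeholders, and the exact trtexec flags depend on your TensorRT version:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
dummy_input = torch.randn(1, 512)

# Step 1: export the PyTorch model to ONNX.
torch.onnx.export(
    model, dummy_input, "model.onnx",
    opset_version=17, input_names=["input"], output_names=["output"],
)

# Step 2: build a serialized TensorRT engine from the ONNX file, for example
# with the trtexec CLI that ships with TensorRT (FP16 enabled here):
#   trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```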
Use cases
Best suited for high-throughput, latency-sensitive production systems
Ideal for models running on NVIDIA GPUs in a controlled, static graph environment
Benchmarking results
To evaluate the performance of torch.compile and TensorRT, we benchmarked popular models, including LLama-7b, LLama-3-8b, Mistral-v0.1, phi-3, and phi-2. The results, measured in tokens per second, are shown below:
As seen in the graph, torch.compile consistently outperformed TensorRT across all tested models. While the differences were marginal for larger models like LLama-7b and Mistral-v0.1, the gap became more noticeable for smaller models such as phi-3 and phi-2. These results highlight that torch.compile is not only easier to integrate but also delivers performance on par with or better than TensorRT for both dynamic and static model graphs.
Performance comparison
Flexibility vs. specialization
torch.compile excels in flexibility, seamlessly supporting both dynamic and static graphs (see the sketch after these points)
TensorRT specializes in static graphs, delivering highly tuned performance for predefined workflows
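As a rough illustration of that flexibility, torch.compile exposes a dynamic-shapes mode so that varying input sizes do not force a rebuild per shape (the toy model and batch sizes below are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()

# dynamic=True asks the compiler to generate shape-polymorphic kernels
# instead of specializing (and recompiling) for every new input shape.
compiled = torch.compile(model, dynamic=True)

for batch in (1, 4, 32):  # different batch sizes, no rebuild per shape
    out = compiled(torch.randn(batch, 512, device="cuda"))
```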
Hardware support
torch.compile supports various hardware platforms, including CPUs, GPUs, and TPUs, depending on the backend
TensorRT is tailored for NVIDIA GPUs, offering deep integration with CUDA and Tensor Cores
Precision and quantization
TensorRT provides robust support for mixed precision (FP16, INT8), enabling significant speedups with minimal accuracy loss
torch.compile supports quantization through PyTorch's native tooling, but it is less focused on hardware-specific precision optimizations (see the sketch after this list)
TensorRT involves additional steps, such as exporting models to ONNX and configuring precision modes, which can be more complex
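As one hedged illustration of the PyTorch-native route, the sketch below applies dynamic INT8 quantization to the Linear layers of a toy model (the model is a placeholder); on the TensorRT side, precision is instead chosen when the engine is built, e.g. via the --fp16 flag shown earlier:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# PyTorch-native route: dynamic INT8 quantization of the Linear layers (runs on CPU).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 512))
```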
Compilation
Importantly, TensorRT does not need to rebuild the model for every inference run: it reduces startup overhead by serializing the optimized engine and reusing it for inference
In contrast, torch.compile typically recompiles the model at the start of each new process, since compiled graphs are not serialized and reloaded by default, leading to higher startup latency (as sketched below)
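A hedged sketch of the two startup paths, assuming a previously built TensorRT engine file and a toy PyTorch model (both illustrative):

```python
import torch
import tensorrt as trt

# TensorRT: deserialize a previously built engine from disk, so no
# re-optimization is needed at startup (the engine file name is illustrative).
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# torch.compile: compilation happens lazily on the first call in a new
# process, so a warm-up pass is a common way to pay that cost up front.
model = torch.nn.Linear(512, 10).cuda()
compiled = torch.compile(model)
_ = compiled(torch.randn(1, 512, device="cuda"))    # warm-up: triggers compilation
out = compiled(torch.randn(1, 512, device="cuda"))  # steady-state call
```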
Conclusion
Based on our investigation, torch.compile not only simplifies the optimization process but also performs similarly to TensorRT in terms of speed for models like LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2. Given these findings, there is little reason to use TensorRT unless your application is tightly coupled with NVIDIA’s ecosystem and requires features exclusive to TensorRT. Torch.compile emerges as the more efficient and versatile tool, particularly for PyTorch users who value performance, ease of integration, and flexibility. Embracing torch.compile can help streamline your deep learning workflows without sacrificing speed or efficiency.