
Faster inference: torch.compile vs TensorRT

Vineet Suryan
December 19, 2024


As deep learning models grow in complexity, optimizing performance and reducing latency has become more critical than ever. Two tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s TensorRT, a platform for high-performance deep learning inference. Both aim to enhance performance and reduce latency, but they serve different purposes and operate in unique ways. Interestingly, in some of our tests on models like LLama-7b, LLama-3-8b, Mistral-v0.1, phi-3, and phi-2, torch.compile demonstrated performance similar to TensorRT. This blog post dives into a detailed comparison of torch.compile and TensorRT, highlighting their features, use cases, and benchmarking insights to help you understand when and where to use each.

TL;DR

Both torch.compile and TensorRT offer powerful optimization features for deep learning workflows. torch.compile outperforms TensorRT in terms of ease of use and performance in our tests on models like LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2. Unless you need TensorRT-specific features or work exclusively within NVIDIA's ecosystem, torch.compile is the better choice for optimizing PyTorch models.

Torch.compile

Introduced in PyTorch 2.0, torch.compile brings a dynamic and user-friendly approach to model optimization. It uses backend compilers like TorchInductor and other JIT compilation techniques to accelerate training and inference. Here are its key features:

  • Ease of use: Developers can optimize their models with a single line of code: model = torch.compile(model) (see the sketch after this list)
  • Dynamic graphs: True to PyTorch’s philosophy, torch.compile supports dynamic computation graphs, making it versatile for research and production
  • Backend agnostic: It integrates with various backend compilers, enabling optimizations tailored to different hardware setups
  • Training optimization: Unlike TensorRT, torch.compile optimizes both training and inference

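To make that one-line workflow concrete, here is a minimal sketch. The toy model, input shapes, and the optional arguments mentioned in the comments are illustrative assumptions, not settings taken from our benchmarks:

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# The one-line optimization: wrap the model and keep using it as before.
# Optional knobs such as mode="reduce-overhead" or dynamic=True exist,
# but the plain call below is all that is required.
compiled = torch.compile(model)

x = torch.randn(8, 128)
with torch.no_grad():
    out = compiled(x)  # the first call triggers JIT compilation; later calls reuse it
print(out.shape)       # torch.Size([8, 10])
```
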
Use cases

  • Ideal for PyTorch users looking for a seamless integration into their training and inference workflows
  • Suitable for dynamic or complex models that require flexible graph execution

TensorRT

TensorRT is a highly specialized platform for deploying deep learning models on NVIDIA GPUs. It focuses on inference acceleration, leveraging hardware-specific optimizations to maximize performance. Key features include:

  • Hardware optimizations: TensorRT takes full advantage of NVIDIA GPU architectures, including support for Tensor Cores and FP16/INT8 precision
  • Inference-centric: Unlike torch.compile, TensorRT exclusively targets inference, with features like kernel fusion, layer merging, and reduced memory overhead
  • Model conversion: TensorRT requires models to be converted into its own optimized engine format, typically via an intermediate representation such as ONNX, introducing an additional preprocessing step (see the export sketch after this list)

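To make that extra conversion step concrete, here is a minimal, hypothetical sketch of the usual first stage: exporting a PyTorch model to ONNX so TensorRT can consume it. The toy model and file names are placeholders, not the exact pipeline used in our benchmarks:

```python
import torch
import torch.nn as nn

# Placeholder for the model you want to deploy; any exportable nn.Module works.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
dummy_input = torch.randn(1, 16)

# Export to ONNX, the interchange format TensorRT's parser consumes.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
)
```

The resulting model.onnx is then compiled into a TensorRT engine, for example with the trtexec command-line tool (trtexec --onnx=model.onnx --saveEngine=model.engine) or with the Python builder API, a sketch of which appears later in this post.
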
Use cases

  • Best suited for high-throughput, latency-sensitive production systems
  • Ideal for models running on NVIDIA GPUs in a controlled, static graph environment

Benchmarking results

To evaluate the performance of torch.compile and TensorRT, we benchmarked popular models, including LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2.

  1. Tools
    • torch.compile was benchmarked using pytorch/gpt-fast
    • TensorRT was benchmarked using TensorRT-LLM, with the script modified to time the inference run and log tokens per second (a minimal timing sketch follows this list)
  2. Hardware
    • All benchmarks were performed on an NVIDIA RTX 4090 GPU using CUDA 12.4.1
    • We also ran the models on an RTX 3090 and an H100; the relative differences between torch.compile and TensorRT were about the same on those GPUs, so we primarily show the results for the 4090
  3. Setup
    • Default gpt-fast & tensorrt-llm generation settings
    • Prompt is set to "Hello, my name is"
    • Input token length: 6 tokens
    • Maximum output tokens: 200 tokens

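As a reference for how the tokens-per-second numbers were obtained, here is a simplified, hypothetical timing sketch. The generate() callable stands in for whichever generation entry point the benchmark script exposes; it is not the exact code added to gpt-fast or TensorRT-LLM:

```python
import time

def tokens_per_second(generate, prompt, max_new_tokens=200, warmup=1, runs=5):
    """Time a generation callable and report the average tokens/s.

    `generate` is assumed to return the list of generated token ids; it
    stands in for the gpt-fast / TensorRT-LLM generation entry point.
    """
    for _ in range(warmup):               # warm-up runs absorb compilation
        generate(prompt, max_new_tokens)  # and other one-time costs

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)

# Example: print(tokens_per_second(generate, "Hello, my name is"))
```

When the model runs on a GPU, the device should also be synchronized (for example with torch.cuda.synchronize()) before each clock read so that asynchronous kernels are fully accounted for.
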
The results, measured in tokens per second, are shown below:

[Figure: Collabora torch.compile vs TensorRT benchmark results for the 4090 and H100]

[Figure: Collabora torch.compile vs TensorRT weight-only benchmark results for the 4090]

As seen in the graphs above, torch.compile consistently outperformed TensorRT across all tested models. While the differences are marginal for larger models like LLama-7b and mistral-v0.1, the gap becomes more noticeable for smaller models such as phi-3 and phi-2. These results highlight that torch.compile is not only easier to integrate but also delivers comparable performance for both dynamic and static model graphs.

Performance comparison

  1. Flexibility vs. specialization
    • torch.compile excels in flexibility, supporting dynamic and static graphs seamlessly
    • TensorRT specializes in static graphs, delivering unmatched performance for predefined workflows
  2. Hardware support
    • torch.compile supports various hardware platforms, including CPUs, GPUs, and TPUs, depending on the backend
    • TensorRT is tailored for NVIDIA GPUs, offering deep integration with CUDA and Tensor Cores
  3. Precision and quantization
    • TensorRT provides robust support for mixed-precision (FP16, INT8), enabling significant speedups without compromising accuracy
    • torch.compile supports quantization through PyTorch’s native tooling, but it’s less focused on hardware-specific precision optimizations
  4. Ease of deployment
    • torch.compile simplifies optimization directly within PyTorch workflows, requiring minimal code changes.
    • TensorRT involves additional steps, such as exporting models to ONNX and configuring precision modes, which can be more complex
  5. Compilation
    • Importantly, TensorRT does not need to compile the model for every inference run; it reduces startup overhead by serializing the optimized engine and reusing it for inference (see the build-once sketch after this list)
    • In contrast, torch.compile requires recompilation for each inference session, as it does not support graph serialization, leading to higher startup latency

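To illustrate that build-once, reuse-many-times pattern, here is a minimal sketch using TensorRT's Python API. The file names are placeholders and error handling is omitted, so treat it as an outline of the workflow rather than a production script:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# One-time build: parse the ONNX model and serialize the optimized engine.
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))  # explicit batch;
    # recent TensorRT releases use this mode by default
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())  # a real script should check for parser errors here

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable mixed precision where supported
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)

# Every later run: skip the build and deserialize the cached engine instead.
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```

The same build step can be performed with trtexec; either way, the expensive optimization happens once, and subsequent runs only pay the cost of loading the serialized engine.
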
Conclusion

Based on our investigation, torch.compile not only simplifies the optimization process but also performs similarly to TensorRT in terms of speed for models like LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2. Given these findings, there is little reason to use TensorRT unless your application is tightly coupled with NVIDIA’s ecosystem and requires features exclusive to TensorRT. Torch.compile emerges as the more efficient and versatile tool, particularly for PyTorch users who value performance, ease of integration, and flexibility. Embracing torch.compile can help streamline your deep learning workflows without sacrificing speed or efficiency.
