
Faster inference: torch.compile vs TensorRT

Vineet Suryan
December 19, 2024

In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s TensorRT, a platform for high-performance deep learning inference. Both aim to enhance performance and reduce latency, but they serve different purposes and operate in different ways. Interestingly, in some of our tests on models like LLama-7b, LLama-3-8b, Mistral-v0.1, phi-3, and phi-2, torch.compile demonstrated performance on par with TensorRT. This blog post dives into a detailed comparison of torch.compile and TensorRT, helping you understand when and where to use each.

TL;DR

In our tests on models like LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2, torch.compile matched or outperformed TensorRT while being considerably easier to use. Unless you need TensorRT-specific features or work exclusively within NVIDIA's ecosystem, torch.compile is the better choice for optimizing PyTorch models.

Torch.compile

Introduced in PyTorch 2.0, torch.compile brings a dynamic and user-friendly approach to model optimization. It uses backend compilers like TorchInductor and other JIT compilation techniques to accelerate training and inference. Here are its key features:

  • Ease of use: Developers can optimize their models with a single line of code: model = torch.compile(model) (a minimal sketch follows this list)
  • Dynamic graphs: True to PyTorch’s philosophy, torch.compile supports dynamic computation graphs, making it versatile for research and production
  • Backend agnostic: It integrates with various backend compilers, enabling optimizations tailored to different hardware setups
  • Training optimization: Unlike TensorRT, which targets inference only, torch.compile optimizes both training and inference
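
To make the one-liner concrete, here is a minimal sketch of compiling a small PyTorch module for inference. The toy network, shapes, and values are purely illustrative assumptions, not part of our benchmark setup:

    import torch
    import torch.nn as nn

    # Toy model for illustration; any nn.Module can be compiled the same way.
    model = nn.Sequential(
        nn.Linear(128, 256),
        nn.ReLU(),
        nn.Linear(256, 10),
    ).eval()

    # One line enables compilation; TorchInductor is the default backend,
    # and other backends can be selected via torch.compile(model, backend=...).
    compiled_model = torch.compile(model)

    # The first call triggers JIT compilation; later calls reuse the compiled code.
    x = torch.randn(32, 128)
    with torch.no_grad():
        out = compiled_model(x)
    print(out.shape)  # torch.Size([32, 10])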

Use cases

  • Ideal for PyTorch users looking for a seamless integration into their training and inference workflows
  • Suitable for dynamic or complex models that require flexible graph execution

TensorRT

TensorRT is a highly specialized platform for deploying deep learning models on NVIDIA GPUs. It focuses on inference acceleration, leveraging hardware-specific optimizations to maximize performance. Key features include:

  • Hardware optimizations: TensorRT takes full advantage of NVIDIA GPU architectures, including support for Tensor Cores and FP16/INT8 precision
  • Inference-centric: Unlike torch.compile, TensorRT exclusively targets inference, with features like kernel fusion, layer merging, and reduced memory overhead
  • Model conversion: TensorRT requires models to be converted into its own engine format, typically by way of ONNX, which introduces an additional preprocessing step (a conversion sketch follows this list)
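
For comparison, here is a rough sketch of that extra conversion step: the model is first exported to ONNX, and the ONNX file is then turned into a TensorRT engine. The toy model, file names, and trtexec invocation are illustrative assumptions rather than the exact recipe used in our benchmarks:

    import torch
    import torch.nn as nn

    # Hypothetical toy model standing in for the real network.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
    dummy_input = torch.randn(1, 128)

    # Step 1: export the PyTorch model to ONNX.
    torch.onnx.export(
        model,
        dummy_input,
        "model.onnx",
        input_names=["input"],
        output_names=["output"],
    )

    # Step 2: build a TensorRT engine from the ONNX file, for example with the
    # trtexec tool that ships with TensorRT (FP16 enabled here):
    #
    #   trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
    #
    # The serialized engine is then loaded by the TensorRT runtime for inference.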

Use cases

  • Best suited for high-throughput, latency-sensitive production systems
  • Ideal for models running on NVIDIA GPUs in a controlled, static graph environment

Benchmarking results

To evaluate the performance of torch.compile and TensorRT, we benchmarked popular models, including LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2. The results, measured in tokens per second, are shown below:
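
As a rough illustration of how such a tokens-per-second measurement can be set up with torch.compile, here is a sketch using Hugging Face Transformers. The checkpoint, prompt, and token counts are placeholders, not the exact configuration behind the numbers below:

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder checkpoint; swap in any of the benchmarked models.
    model_id = "microsoft/phi-2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda").eval()

    # Compile the forward pass so generate() runs through the compiled graph.
    model.forward = torch.compile(model.forward)

    inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

    # Warm-up run so compilation time is excluded from the measurement.
    model.generate(**inputs, max_new_tokens=8)

    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{new_tokens / elapsed:.1f} tokens/s")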

[Figure: Collabora torch.compile vs TensorRT results (tokens per second); second chart: weight-only results]

As seen in the graph, torch.compile consistently outperformed TensorRT across all tested models. The differences are marginal for LLama-7b and mistral-v0.1, and more noticeable for phi-3 and phi-2. These results highlight that torch.compile is not only easier to integrate but also provides superior performance for both dynamic and static model graphs.

Performance comparison

  1. Flexibility vs. specialization
    • torch.compile excels in flexibility, supporting dynamic and static graphs seamlessly.
    • TensorRT specializes in static graphs, delivering highly tuned performance for predefined inference workflows.
  2. Hardware support
    • torch.compile supports various hardware platforms, including CPUs, GPUs, and TPUs, depending on the backend.
    • TensorRT is tailored for NVIDIA GPUs, offering deep integration with CUDA and Tensor Cores.
  3. Precision and quantization
    • TensorRT provides robust support for mixed precision (FP16, INT8), enabling significant speedups with little accuracy loss.
    • torch.compile supports quantization through PyTorch’s native tooling, but it is less focused on hardware-specific precision optimizations (a small mixed-precision sketch follows this list).
  4. Ease of deployment
    • torch.compile simplifies optimization directly within PyTorch workflows, requiring minimal code changes.
    • TensorRT involves additional steps, such as exporting models to ONNX and configuring precision modes, which can be more complex.
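
To illustrate the precision point above, here is a small sketch of pairing torch.compile with PyTorch’s native autocast for FP16 inference; the toy model and shapes are assumptions for illustration only:

    import torch
    import torch.nn as nn

    # Toy model used only for illustration.
    model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).cuda().eval()
    compiled = torch.compile(model)

    x = torch.randn(16, 512, device="cuda")

    # Mixed-precision inference via PyTorch's native autocast; torch.compile
    # traces through the autocast region and emits reduced-precision kernels
    # where the operators allow it.
    with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
        y = compiled(x)
    print(y.dtype)  # torch.float16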

Conclusion

Based on our investigation, torch.compile not only simplifies the optimization process but also matches, and in our tests often exceeds, TensorRT’s speed on models like LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2. Given these findings, there is little reason to use TensorRT unless your application is tightly coupled with NVIDIA’s ecosystem and requires features exclusive to TensorRT. Torch.compile emerges as the more efficient and versatile tool, particularly for PyTorch users who value performance, ease of integration, and flexibility. Embracing torch.compile can help streamline your deep learning workflows without sacrificing speed or efficiency.
