# ExecuTorch
**ExecuTorch** is PyTorch's unified solution for deploying AI models on-device—from smartphones to microcontrollers—built for privacy, performance, and portability. It powers Meta's on-device AI across **Instagram, WhatsApp, Quest 3, Ray-Ban Meta Smart Glasses**, and more.
## Why ExecuTorch?
- 🔒 Native PyTorch Export — Direct export from PyTorch. No .onnx, .tflite, or intermediate format conversions. Preserve model semantics.
- ⚡ Production-Proven — Powers billions of users at Meta with real-time on-device inference.
- 💾 Tiny Runtime — 50KB base footprint. Runs on microcontrollers to high-end smartphones.
- 🚀 12+ Hardware Backends — Open-source acceleration for Apple, Qualcomm, ARM, MediaTek, Vulkan, and more.
- 🎯 One Export, Multiple Backends — Switch hardware targets with a single line change. Deploy the same model everywhere.
## How It Works
ExecuTorch uses ahead-of-time (AOT) compilation to prepare PyTorch models for edge deployment:
- 🧩 Export — Capture your PyTorch model graph with `torch.export()`
- ⚙️ Compile — Quantize, optimize, and partition to hardware backends → `.pte`
- 🚀 Execute — Load the `.pte` on-device via the lightweight C++ runtime
Models use a standardized Core ATen operator set. Partitioners delegate subgraphs to specialized hardware (NPU/GPU) with CPU fallback.
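To see that split concretely, here is a minimal sketch (using a throwaway `TinyModel` and the same XNNPACK flow as the quick start below; the node check is an informal heuristic, not an official API) that counts how many subgraphs were delegated:

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x + 1.0)

ep = torch.export.export(TinyModel().eval(), (torch.randn(2, 2),))
edge = to_edge_transform_and_lower(ep, partitioner=[XnnpackPartitioner()])

# Delegated subgraphs appear as executorch_call_delegate nodes in the graph;
# anything not claimed by the partitioner stays as Core ATen ops on CPU.
nodes = edge.exported_program().graph.nodes
delegated = [n for n in nodes if "executorch_call_delegate" in str(n.target)]
print(f"{len(delegated)} delegated subgraph call(s)")
```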
Learn more: How ExecuTorch Works • Architecture Guide
## Quick Start
### Installation
```bash
pip install executorch
```

For platform-specific setup (Android, iOS, embedded systems), see the Quick Start documentation.
### Export and Deploy in 3 Steps
```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# 1. Export the model graph
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)

# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]  # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()

# 3. Save for deployment
with open("model.pte", "wb") as f:
    f.write(program.buffer)

# Test locally via ExecuTorch runtime's pybind API (optional)
from executorch.runtime import Runtime

runtime = Runtime.get()
method = runtime.load_program("model.pte").load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])
```

### Run on Device
**C++**
```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace executorch::extension;

Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward(tensor);
```

**Swift (iOS)**
```swift
import ExecuTorch

let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
let outputs = try module.forward(input)
```

**Kotlin (Android)**
```kotlin
val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))
```

### LLM Example: Llama
Export Llama models using the export_llm script or Optimum-ExecuTorch:
```bash
# Using export_llm
python -m executorch.extension.llm.export.export_llm --model llama3_2 --output llama.pte

# Using Optimum-ExecuTorch
optimum-cli export executorch \
  --model meta-llama/Llama-3.2-1B \
  --task text-generation \
  --recipe xnnpack \
  --output_dir llama_model
```

Run on-device with the LLM runner API:
**C++**
```cpp
#include <executorch/extension/llm/runner/text_llm_runner.h>

auto runner = create_llama_runner("llama.pte", "tiktoken.bin");
executorch::extension::llm::GenerationConfig config{
    .seq_len = 128, .temperature = 0.8f};
runner->generate("Hello, how are you?", config);
```

**Swift (iOS)**
```swift
import ExecuTorchLLM

let runner = TextRunner(modelPath: "llama.pte", tokenizerPath: "tiktoken.bin")
try runner.generate("Hello, how are you?", Config {
  $0.sequenceLength = 128
}) { token in
  print(token, terminator: "")
}
```

**Kotlin (Android)** — API Docs • Demo App
```kotlin
val llmModule = LlmModule("llama.pte", "tiktoken.bin", 0.8f)
llmModule.load()
llmModule.generate("Hello, how are you?", 128, object : LlmCallback {
    override fun onResult(result: String) { print(result) }
    override fun onStats(stats: String) { }
})
```

For multimodal models (vision, audio), use the MultiModal runner API, which extends the LLM runner to handle image and audio inputs alongside text. See the Llava and Voxtral examples.
See examples/models/llama for the complete workflow, including quantization, mobile deployment, and advanced options.
Next Steps:
- 📖 Step-by-step tutorial — Complete walkthrough for your first model
- ⚡ Colab notebook — Try ExecuTorch instantly in your browser
- 🤖 Deploy Llama models — LLM workflow with quantization and mobile demos
## Platform & Hardware Support
| Platform | Supported Backends |
|------------------|----------------------------------------------------------|
| Android | XNNPACK, Vulkan, Qualcomm, MediaTek, Samsung Exynos |
| iOS | XNNPACK, CoreML (Neural Engine), MPS (deprecated) |
| Linux / Windows | XNNPACK, OpenVINO, CUDA (experimental) |
| macOS | XNNPACK, Metal (experimental), MPS (deprecated) |
| Embedded / MCU | XNNPACK, ARM Ethos-U, NXP, Cadence DSP |
See Backend Documentation for detailed hardware requirements and optimization guides. For desktop/laptop GPU inference with CUDA and Metal, see the Desktop Guide. For Zephyr RTOS integration, see the Zephyr Guide.
## Production Deployments
ExecuTorch powers on-device AI at scale across Meta's family of apps, VR/AR devices, and partner deployments. View success stories →
## Examples & Models
- **LLMs:** Llama 3.2/3.1/3, Qwen 3, Phi-4-mini, LiquidAI LFM2
- **Multimodal:** Llava (vision-language), Voxtral (audio-language), Gemma (vision-language)
- **Vision/Speech:** MobileNetV2, DeepLabV3, YOLO26, Whisper
- **Resources:** examples/ directory • executorch-examples out-of-tree demos • Optimum-ExecuTorch for HuggingFace models • Unsloth for fine-tuned LLM deployment
## Key Features
ExecuTorch provides advanced capabilities for production deployment:
- Quantization — Built-in support via torchao for 8-bit, 4-bit, and dynamic quantization (see the sketch after this list)
- Memory Planning — Optimize memory usage with ahead-of-time allocation strategies
- Developer Tools — ETDump profiler, ETRecord inspector, and model debugger
- Selective Build — Strip unused operators to minimize binary size
- Custom Operators — Extend with domain-specific kernels
- Dynamic Shapes — Support variable input sizes with bounded ranges (see the export-time example below)
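As a concrete illustration of the quantization bullet above, here is a hedged sketch of the PT2E flow with the XNNPACK quantizer. Import paths for the quantizer have moved between releases, and `MyModel` is the placeholder from the quick start, so treat this as a sketch rather than the canonical recipe:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = MyModel().eval()  # placeholder model from the quick start
example_inputs = (torch.randn(1, 3, 224, 224),)

# Annotate the exported graph for 8-bit symmetric quantization.
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
graph = torch.export.export(model, example_inputs).module()
prepared = prepare_pt2e(graph, quantizer)
prepared(*example_inputs)  # calibrate with representative inputs
quantized = convert_pt2e(prepared)

# Re-export and lower the quantized graph exactly as in the quick start.
program = to_edge_transform_and_lower(
    torch.export.export(quantized, example_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()  # an ExecutorchBackendConfig can be passed here to tune memory planning
```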
See Advanced Topics for quantization techniques, custom backends, and compiler passes.
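Dynamic shapes from the list above are declared at export time via `torch.export.Dim`. A minimal sketch, again reusing the quick-start placeholder model and assuming its forward argument is named `x`:

```python
import torch
from torch.export import Dim

model = MyModel().eval()  # placeholder model from the quick start
example_inputs = (torch.randn(4, 3, 224, 224),)

# Allow the batch dimension to vary between 1 and 32 at runtime.
batch = Dim("batch", min=1, max=32)
exported = torch.export.export(
    model, example_inputs, dynamic_shapes={"x": {0: batch}}
)
```

The resulting exported program then goes through the same `to_edge_transform_and_lower(...)` flow as a static-shape model.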
## Documentation
- Documentation Home — Complete guides and tutorials
- API Reference — Python, C++, Java/Kotlin APIs
- Backend Integration — Build custom hardware backends
- Troubleshooting — Common issues and solutions
## Community & Contributing
We welcome contributions from the community!
- 💬 GitHub Discussions — Ask questions and share ideas
- 🎮 Discord — Chat with the team and community
- 🐛 Issues — Report bugs or request features
- 🤝 Contributing Guide — Guidelines and codebase structure
## License
ExecuTorch is BSD licensed, as found in the LICENSE file.
<br><br>
<div align="center"> <p><strong>Part of the PyTorch ecosystem</strong></p> <p> <a href="https://github.com/pytorch/executorch">GitHub</a> • <a href="https://docs.pytorch.org/executorch">Documentation</a> </p> </div>