# profclaw/swift-llama
Run LLMs on Apple devices with Swift. Wraps [llama.cpp](https://github.com/ggml-org/llama.cpp) with a native Swift API.
## Quick start
```swift
import SwiftLlama

let model = try LlamaModel(path: "path/to/model.gguf", config: .init(gpuLayers: .all))
let llama = try LlamaActor(model: model)

for try await chunk in llama.chat(messages: [.user("Hello!")], template: .gemma) {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): print("Tool: \(call.name)(\(call.arguments))")
    }
}
```

## Installation
Add to your `Package.swift`:

```swift
dependencies: [
    .package(url: "https://github.com/profclaw/swift-llama", from: "0.1.0"),
]
```

Then add `"SwiftLlama"` to your target dependencies.
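If you spell out the product explicitly, the target entry might look like this (`MyApp` is a placeholder target name):

```swift
targets: [
    .executableTarget(
        name: "MyApp", // placeholder; use your own target name
        dependencies: [
            .product(name: "SwiftLlama", package: "swift-llama")
        ]
    )
]
```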
## What's in the box

- Actor-isolated inference with `AsyncThrowingStream` for token streaming
- Chat templates for Gemma, Llama 3, Mistral, and ChatML (or write your own)
- A streaming tool call parser that detects function calls as tokens arrive (see the sketch after this list)
- Model downloader that grabs GGUFs from HuggingFace with progress and SHA256 checks
- A catalog of popular models with recommended configs
- Configurable sampling (temperature, top-p, top-k, repeat penalty, or just `.greedy`)
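As a rough sketch of the `handleTool` function used in the streaming example below: the `ToolCall` type name and the `get_weather` tool are assumptions for illustration; only the `name` and `arguments` fields appear in this README's examples.

```swift
// ASSUMPTION: the parser's payload type is called `ToolCall`; check the real API.
func handleTool(_ call: ToolCall) async {
    switch call.name {
    case "get_weather":
        // Hypothetical tool; `arguments` is treated as an opaque payload here.
        print("Would fetch weather with: \(call.arguments)")
    default:
        print("Unhandled tool: \(call.name)")
    }
}
```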
## Usage
### Loading a model
```swift
let model = try LlamaModel(
    path: "/path/to/model.gguf",
    config: ModelConfig(
        contextSize: 8192,
        gpuLayers: .all, // .all, .none, .count(20)
        threads: 8
    )
)
```

### Streaming chat
```swift
let llama = try LlamaActor(model: model, params: .balanced)

let stream = llama.chat(
    messages: [
        .system("You are a helpful assistant."),
        .user("What is Swift concurrency?")
    ],
    template: .gemma
)

for try await chunk in stream {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): await handleTool(call)
    }
}
```

### Chat templates
```swift
// Pick a built-in format
llama.chat(messages: messages, template: .gemma)
llama.chat(messages: messages, template: .llama3)
llama.chat(messages: messages, template: .mistral)
llama.chat(messages: messages, template: .chatML)

// Or bring your own
struct MyTemplate: ChatTemplateProtocol {
    let stopTokens = ["<|end|>"]
    func format(_ messages: [ChatMessage]) -> String { /* ... */ }
}
llama.chat(messages: messages, template: .custom(MyTemplate()))
```
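A fuller version of the custom template above, as a sketch: it assumes `ChatMessage` exposes `role` and `content` properties (the field names are an assumption, as is the `<|user|>`-style wire format).

```swift
struct MyTemplate: ChatTemplateProtocol {
    let stopTokens = ["<|end|>"]

    func format(_ messages: [ChatMessage]) -> String {
        // ASSUMPTION: `ChatMessage` exposes `role` and `content`; check the real API.
        var prompt = ""
        for message in messages {
            prompt += "<|\(message.role)|>\n\(message.content)<|end|>\n"
        }
        // Leave an open assistant turn for the model to complete.
        prompt += "<|assistant|>\n"
        return prompt
    }
}
```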
### Sampling

```swift
// Presets (pick one)
let greedy   = try LlamaActor(model: model, params: .greedy)
let creative = try LlamaActor(model: model, params: .creative)
let balanced = try LlamaActor(model: model, params: .balanced)

// Or tune it yourself
let params = SamplingParams(temperature: 0.8, topP: 0.95, topK: 50, maxTokens: 4096)
```
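Custom params plug into the same initializer as the presets:

```swift
let llama = try LlamaActor(model: model, params: params)
```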
### Downloading models

```swift
let downloader = ModelDownloader()

for await event in downloader.download(model: ModelCatalog.recommended[0], to: modelsDir) {
    switch event {
    case .progress(let percent, _, _): print("\(Int(percent))%")
    case .verifying: print("Verifying checksum...")
    case .completed(let url): print("Done: \(url.path)")
    case .failed(let error): print("Error: \(error)")
    }
}

// See what's available
let gemmaModels = ModelCatalog.models(for: .gemma)
```
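Putting the pieces together, a download-then-chat flow might look like this sketch (error and `.failed` handling elided; `modelsDir` is a local directory `URL` as above):

```swift
import Foundation
import SwiftLlama

func downloadAndChat(to modelsDir: URL) async throws {
    let downloader = ModelDownloader()
    var modelURL: URL?

    // Fetch the first recommended model from the catalog.
    for await event in downloader.download(model: ModelCatalog.recommended[0], to: modelsDir) {
        if case .completed(let url) = event { modelURL = url }
    }
    guard let modelURL else { return }

    // Load it and stream a short reply.
    let model = try LlamaModel(path: modelURL.path, config: .init(gpuLayers: .all))
    let llama = try LlamaActor(model: model)
    for try await chunk in llama.chat(messages: [.user("Hello!")], template: .gemma) {
        if case .text(let token) = chunk { print(token, terminator: "") }
    }
}
```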
## Supported models

| Model | Family | Size | Template |
|-------|--------|------|----------|
| Gemma 4 E2B | Gemma | 2.9 GB | `.gemma` |
| Gemma 4 E4B | Gemma | 5.4 GB | `.gemma` |
| Llama 3.2 3B | Llama | 2.0 GB | `.llama3` |
| Mistral 7B v0.3 | Mistral | 4.4 GB | `.mistral` |
| Phi-3.5 Mini | Phi | 2.4 GB | `.chatML` |
| Qwen 2.5 3B | Qwen | 2.1 GB | `.chatML` |
Any GGUF model works. The catalog is there so you don't have to hunt for HuggingFace URLs.
## Requirements
- macOS 14+ / iOS 17+ / visionOS 1+
- Swift 6.0+
- Apple Silicon recommended (Intel works, just slower)
## License
MIT. See LICENSE.
## Who made this
ProfClaw. We're building ProfClaw Studio, a native macOS AI assistant that uses this package for local inference.