profclaw/swift-llama

Run LLMs on Apple devices with Swift. Wraps [llama.cpp](https://github.com/ggml-org/llama.cpp) with a native Swift API.

Quick start

import SwiftLlama

let model = try LlamaModel(path: "path/to/model.gguf", config: .init(gpuLayers: .all))
let llama = try LlamaActor(model: model)

for try await chunk in llama.chat(messages: [.user("Hello!")], template: .gemma) {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): print("Tool: \(call.name)(\(call.arguments))")
    }
}

Installation

Add to your Package.swift:

dependencies: [
    .package(url: "https://github.com/profclaw/swift-llama", from: "0.1.0"),
]

Then add "SwiftLlama" to your target dependencies.
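
For example (the target name here is illustrative):

.target(
    name: "MyApp",
    dependencies: [
        .product(name: "SwiftLlama", package: "swift-llama")
    ]
)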

What's in the box

  • Actor-isolated inference with AsyncThrowingStream for token streaming
  • Chat templates for Gemma, Llama 3, Mistral, and ChatML (or write your own)
  • A streaming tool call parser that detects function calls as tokens arrive
  • Model downloader that grabs GGUFs from Hugging Face, with progress reporting and SHA-256 verification
  • A catalog of popular models with recommended configs
  • Configurable sampling (temperature, top-p, top-k, repeat penalty, or just .greedy)

Usage

Loading a model

let model = try LlamaModel(
    path: "/path/to/model.gguf",
    config: ModelConfig(
        contextSize: 8192,
        gpuLayers: .all,    // .all, .none, .count(20)
        threads: 8
    )
)

Streaming chat

let llama = try LlamaActor(model: model, params: .balanced)

let stream = llama.chat(
    messages: [
        .system("You are a helpful assistant."),
        .user("What is Swift concurrency?")
    ],
    template: .gemma
)

for try await chunk in stream {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): await handleTool(call)
    }
}
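
`handleTool` above is your own function. A minimal sketch, assuming the parsed call is a `ToolCall` value (the type name is an assumption; `name` and `arguments` match the quick-start example):

// Sketch of a tool handler. `ToolCall` is assumed to be the parsed call
// type; `name` and `arguments` are the properties used in the quick start.
func handleTool(_ call: ToolCall) async {
    switch call.name {
    case "search":
        // Run your implementation, then feed the result back as a new message.
        print("search called with \(call.arguments)")
    default:
        print("Unhandled tool: \(call.name)")
    }
}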

Chat templates

// Pick a built-in format
llama.chat(messages: messages, template: .gemma)
llama.chat(messages: messages, template: .llama3)
llama.chat(messages: messages, template: .mistral)
llama.chat(messages: messages, template: .chatML)

// Or bring your own
struct MyTemplate: ChatTemplateProtocol {
    let stopTokens = ["<|end|>"]
    func format(_ messages: [ChatMessage]) -> String { /* ... */ }
}
llama.chat(messages: messages, template: .custom(MyTemplate()))
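
A filled-in `format` might look like the sketch below. It assumes `ChatMessage` exposes `role` and `content` properties, which may differ from the real API:

struct PhiStyleTemplate: ChatTemplateProtocol {
    let stopTokens = ["<|end|>"]

    // Render each message in a ChatML-like layout, then open the
    // assistant turn so the model continues from there.
    func format(_ messages: [ChatMessage]) -> String {
        var prompt = ""
        for message in messages {
            prompt += "<|\(message.role)|>\n\(message.content)<|end|>\n"
        }
        prompt += "<|assistant|>\n"
        return prompt
    }
}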

Sampling

// Presets: .greedy, .creative, or .balanced
let llama = try LlamaActor(model: model, params: .balanced)

// Or tune it yourself
let custom = try LlamaActor(
    model: model,
    params: SamplingParams(temperature: 0.8, topP: 0.95, topK: 50, maxTokens: 4096)
)

Downloading models

let downloader = ModelDownloader()

for await event in downloader.download(model: ModelCatalog.recommended[0], to: modelsDir) {
    switch event {
    case .progress(let percent, _, _): print("\(Int(percent))%")
    case .verifying: print("Verifying checksum...")
    case .completed(let url): print("Done: \(url.path)")
    case .failed(let error): print("Error: \(error)")
    }
}

// See what's available
let gemmaModels = ModelCatalog.models(for: .gemma)

Supported models

| Model | Family | Size | Template |
|-------|--------|------|----------|
| Gemma 4 E2B | Gemma | 2.9 GB | .gemma |
| Gemma 4 E4B | Gemma | 5.4 GB | .gemma |
| Llama 3.2 3B | Llama | 2.0 GB | .llama3 |
| Mistral 7B v0.3 | Mistral | 4.4 GB | .mistral |
| Phi-3.5 Mini | Phi | 2.4 GB | .chatML |
| Qwen 2.5 3B | Qwen | 2.1 GB | .chatML |

Any GGUF model works. The catalog is there so you don't have to hunt for Hugging Face URLs.

Requirements

  • macOS 14+ / iOS 17+ / visionOS 1+
  • Swift 6.0+
  • Apple Silicon recommended (Intel works, just slower)

License

MIT. See LICENSE.

Who made this

ProfClaw. We're building ProfClaw Studio, a native macOS AI assistant that uses this package for local inference.
