Membrane

Composable, actor-based context management pipeline for LLMs. Compresses, budgets, and pages context so your model sees exactly what matters.

Why Membrane?

Large language models have finite context windows. When you're building an AI-powered app, you face a constant challenge: system prompts, conversation history, long-term memory, tool definitions, and retrieved data all compete for the same limited space.

The problem with naive approaches:

  • Truncation discards useful context, losing important conversation history or memory
  • Overstuffing hurts output quality and wastes tokens on less relevant content
  • Fixed limits don't adapt to different query types or model capabilities

Membrane solves this with an intelligent 5-stage pipeline that automatically decides what stays in context, what gets compressed, and what gets paged out — all while maintaining deterministic, reproducible behavior.


Quick Start

Installation

Add Membrane to your Package.swift:

dependencies: [
    .package(url: "https://github.com/christopherkarani/Membrane", from: "1.0.0"),
]

Your First Pipeline

Here's a complete example of using Membrane to prepare context for inference:

import Membrane
import MembraneCore

// 1. Create a budget for your model
// Choose a profile that matches your model's context window
let budget = ContextBudget(
    totalTokens: 4096,
    profile: .foundationModels4K  // For Apple Foundation Models (4K tokens)
)

// 2. Build a context request with your app's data
let request = ContextRequest(
    systemPrompt: "You are a helpful assistant.",
    basePrompt: "",
    userInput: "What was decided in the last meeting?",
    tools: [],
    toolPlan: .allowAll,
    history: [
        ContextSlice(content: "User: We need to schedule the launch", tokenCount: 8, importance: 0.9, source: .history, tier: .full),
        ContextSlice(content: "Assistant: I'll help you plan it", tokenCount: 10, importance: 0.8, source: .history, tier: .full),
    ],
    memories: [
        ContextSlice(content: "Product launch scheduled for March 15", tokenCount: 7, importance: 0.7, source: .memory, tier: .gist),
    ],
    retrieval: [],
    pointers: [],
    metadata: ContextMetadata(),
    recallQuery: nil,
    recallLimit: 3
)

// 3. Create and run the pipeline
let pipeline = MembranePipeline.foundationModel(budget: budget)

let result = try await pipeline.prepare(request)

print("Prompt: \(result.plan.prompt)")
print("Tokens used: \(result.plan.budget.used)")

How It Works

The Pipeline Architecture

Membrane uses a 5-stage actor-isolated pipeline. Each stage runs independently with no shared mutable state, ensuring thread-safety and deterministic behavior.

                        CONTEXT REQUEST
     (system prompt, history, memories, tools, retrieval, etc.)
                              │
                              ▼
STAGE 1: INTAKE
    Resolves pointers, loads tools, retrieves relevant context
    from the RAPTOR tree
                              │  ContextWindow
                              ▼
STAGE 2: BUDGET
    Allocates tokens across 9 domain buckets with hard ceilings
    (system, history, memory, tools, retrieval, output reserve, etc.)
                              │  BudgetedContext
                              ▼
STAGE 3: COMPRESS
    Distills history into a CSO, selects compression tiers
    (full/gist/micro), prunes unused tools
                              │  CompressedContext
                              ▼
STAGE 4: PAGE
    Evicts low-importance slices when context pressure is high
                              │  PagedContext
                              ▼
STAGE 5: EMIT
    Formats the final prompt for the LLM
                              │  ContextPlan
                              ▼
                        PLANNED REQUEST
              (optimized prompt ready for inference)

Stage Protocol

Every stage conforms to the same actor-based protocol:

public protocol MembraneStage: Actor, Sendable {
    associatedtype Input: Sendable
    associatedtype Output: Sendable

    /// Processes the input within the allocated budget.
    func process(_ input: Input, budget: ContextBudget) async throws -> Output
}
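For illustration, a minimal conforming stage might look like the sketch below. The concrete `Input`/`Output` pairing (`ContextWindow` in and out) is a placeholder chosen for the example, not a signature Membrane prescribes:

```swift
// Hypothetical pass-through stage, just to show how the
// associated types of MembraneStage are satisfied.
public actor PassthroughStage: MembraneStage {
    public typealias Input = ContextWindow
    public typealias Output = ContextWindow

    public func process(_ input: ContextWindow, budget: ContextBudget) async throws -> ContextWindow {
        // A real stage would transform `input` within `budget` here.
        input
    }
}
```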

Core Concepts

1. Token Budget Algebra

Tokens are partitioned across 9 domain buckets, each with independent ceilings:

| Bucket | Purpose |
|--------|---------|
| system | System prompt |
| history | Conversation history |
| memory | Long-term memories |
| tools | Tool definitions |
| retrieval | Retrieved context (RAG) |
| toolIO | Tool input/output |
| outputReserve | Reserved for model output |
| protocolOverhead | Protocol overhead |
| safetyMargin | Emergency buffer |

Budget profiles provide sensible defaults for common model sizes:

// For Apple Foundation Models (4K context)
let budget4K = ContextBudget(totalTokens: 4_096, profile: .foundationModels4K)

// For open models with 8K context
let budget8K = ContextBudget(totalTokens: 8_192, profile: .openModel8K)

// For cloud models with 200K context
let budget200K = ContextBudget(totalTokens: 200_000, profile: .cloud200K)

// Or define custom bucket allocations
let customBudget = ContextBudget(
    totalTokens: 4096,
    profile: .custom(buckets: [
        .system: 500,
        .history: 1000,
        .memory: 300,
        .tools: 400,
        .retrieval: 896,
        .outputReserve: 1000,
        .safetyMargin: 0
    ])
)
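Since tokens are partitioned across the buckets, a custom allocation should account for the full total. A quick standalone sanity check in plain Swift (the bucket names mirror the table above; the exact validation Membrane performs is not shown here):

```swift
// Sanity check: the custom buckets above partition the 4,096-token total.
let allocations = [
    "system": 500, "history": 1000, "memory": 300, "tools": 400,
    "retrieval": 896, "outputReserve": 1000, "safetyMargin": 0
]
let allocated = allocations.values.reduce(0, +)
assert(allocated == 4096, "buckets must sum to totalTokens")
```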

2. Multi-Tier Compression

Context slices are compressed into different tiers with varying token multipliers:

| Tier | Multiplier | Token Cost | Use Case |
|------|------------|------------|----------|
| full | 1.0x | Full tokens | Critical content: system prompts, recent turns |
| gist | 0.25x | 25% tokens | Summarized content: older history, background |
| micro | 0.08x | 8% tokens | Minimal references: entity names, timestamps |

Example: How compression works

// A 100-token history slice at different tiers:
// - Full:   100 tokens
// - Gist:    25 tokens (75% compression)
// - Micro:    8 tokens (92% compression)

let historySlice = ContextSlice(
    content: "User discussed Q4 financials with team...",
    tokenCount: 100,
    importance: 0.7,
    source: .history,
    tier: .gist  // Compressed to ~25 tokens
)
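The tier multipliers can also be expressed as a tiny standalone helper (illustrative only; `Tier` and `compressedTokenCount` here are not Membrane API):

```swift
// Illustrative mirror of the tier multipliers above:
// full 1.0x, gist 0.25x, micro 0.08x.
enum Tier: Double {
    case full = 1.0, gist = 0.25, micro = 0.08
}

func compressedTokenCount(_ original: Int, tier: Tier) -> Int {
    Int((Double(original) * tier.rawValue).rounded())
}

// compressedTokenCount(100, tier: .gist)  == 25
// compressedTokenCount(100, tier: .micro) == 8
```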

3. KV-Aware Memory Budgeting

For Apple Silicon with GQA-style models, Membrane estimates KV cache memory:

// Configure KV memory estimation
let estimator = GQAMemoryEstimator(
    architecture: ModelArchitectureInfo(
        numLayers: 32,
        numQueryHeads: 32,
        numKVHeads: 8,
        headDim: 128
    ),
    kvMemoryBudgetBytes: 512 * 1024 * 1024  // 512 MB
)

// This affects the max sequence length calculation
// Example: 512 MB / 131,072 bytes per token = 4,096 tokens
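The arithmetic behind that estimate can be checked with a standalone sketch, assuming fp16 (2-byte) keys and values and counting both K and V across every layer and KV head:

```swift
// Back-of-the-envelope KV cache sizing for the GQA example above.
// Assumes fp16 storage; these names are illustrative, not Membrane API.
let numLayers = 32
let numKVHeads = 8
let headDim = 128
let bytesPerElement = 2                      // fp16
let kvBudgetBytes = 512 * 1024 * 1024        // 512 MB

// Per token: keys + values, across all layers and KV heads.
let bytesPerToken = 2 * numLayers * numKVHeads * headDim * bytesPerElement
// 2 * 32 * 8 * 128 * 2 = 131,072 bytes

let maxSequenceLength = kvBudgetBytes / bytesPerToken
// 536,870,912 / 131,072 = 4,096 tokens
```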

Built-In Stages

Membrane ships with production-ready stages for each pipeline phase:

Intake Stages

| Stage | Purpose |
|-------|---------|
| PointerResolver | Converts large outputs to pointers, storing payloads externally |
| JITToolLoader | Just-in-time tool loading based on relevance (activates when tools >= 10) |
| RAPTORRetriever | Hierarchical tree-based retrieval with budget-aware traversal |

Budget Stages

| Stage | Purpose |
|-------|---------|
| UnifiedBudgetAllocator | Deterministic bucket allocation across all 9 domains |
| GQAMemoryEstimator | KV cache memory estimation for GQA architectures |

Compression Stages

| Stage | Purpose |
|-------|---------|
| CSODistiller | Distills conversation into a Context State Object (entities, decisions, facts) |
| SurrogateTierSelector | Multi-tier compression selection for retrieval slices |
| ToolPruner | Usage-based tool manifest pruning (keeps the top K most-used tools) |

Page Stages

| Stage | Purpose |
|-------|---------|
| MemGPTPager | Importance-based eviction preserving recent history |
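The paging idea can be sketched in a few lines of standalone Swift (illustrative only, not the MemGPTPager implementation): protect the most recent slices unconditionally, then admit the remaining slices in descending importance until the budget is spent.

```swift
// Illustrative importance-based eviction that preserves recent history.
struct Slice {
    let content: String
    let tokenCount: Int
    let importance: Double
}

func page(_ slices: [Slice], budget: Int, protectRecent: Int = 2) -> [Slice] {
    // The newest turns always survive, regardless of importance.
    let protected = Array(slices.suffix(protectRecent))
    let candidates = slices.dropLast(protectRecent)
        .sorted { $0.importance > $1.importance }   // most important first

    var kept = protected
    var used = kept.reduce(0) { $0 + $1.tokenCount }
    for slice in candidates where used + slice.tokenCount <= budget {
        kept.append(slice)
        used += slice.tokenCount
    }
    return kept
}
```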


Complete Usage Example

Here's a more comprehensive example showing how to use Membrane in a real application:

import Membrane
import MembraneCore

// Define your context types
struct ConversationTurn {
    let role: String
    let content: String
    let timestamp: Date
}

// 1. Create memory slices from your storage
let memories: [ContextSlice] = [
    ContextSlice(
        content: "User prefers email notifications",
        tokenCount: 6,
        importance: 0.8,
        source: .memory,
        tier: .gist
    ),
    ContextSlice(
        content: "Current project: Membrane Framework v2",
        tokenCount: 7,
        importance: 0.9,
        source: .memory,
        tier: .full
    )
]

// 2. Create history slices
let history: [ContextSlice] = [
    ContextSlice(
        content: "User: Can you summarize the meeting notes?",
        tokenCount: 12,
        importance: 0.9,
        source: .history,
        tier: .full
    ),
    ContextSlice(
        content: "Assistant: I'll pull up the notes from March 20th.",
        tokenCount: 14,
        importance: 0.8,
        source: .history,
        tier: .full
    ),
    ContextSlice(
        content: "User: Yes, and add action items to the project board.",
        tokenCount: 15,
        importance: 0.7,
        source: .history,
        tier: .full
    )
]

// 3. Define tools your app exposes
let tools: [ToolManifest] = [
    ToolManifest(
        name: "get_calendar_events",
        description: "Get calendar events for a date range",
        fullSchema: nil
    ),
    ToolManifest(
        name: "create_task",
        description: "Create a task in the project board",
        fullSchema: nil
    )
]

// 4. Build the context request
let request = ContextRequest(
    systemPrompt: """
    You are an intelligent assistant that helps manage meetings and tasks.
    Be concise and actionable in your responses.
    """,
    basePrompt: "",
    userInput: "Summarize the March 20th meeting and create tasks for action items",
    tools: tools,
    toolPlan: .allowAll,
    history: history,
    memories: memories,
    retrieval: [],
    pointers: [],
    metadata: ContextMetadata(),
    recallQuery: "March 20 meeting notes",
    recallLimit: 5
)

// 5. Configure budget and pipeline
let budget = ContextBudget(
    totalTokens: 4096,
    profile: .foundationModels4K
)

let pipeline = MembranePipeline.foundationModel(budget: budget)

// 6. Execute the pipeline
let plannedRequest = try await pipeline.prepare(request)

print("=== Generated Prompt ===")
print(plannedRequest.plan.prompt)
print("=======================")
print("Tokens allocated: \(plannedRequest.plan.budget.used)/\(budget.totalTokens)")

Modules

Membrane is organized into focused modules:

| Module | Purpose | Dependencies |
|--------|---------|--------------|
| MembraneCore | Types, protocols, budget algebra | swift-collections |
| Membrane | Pipeline orchestrator + built-in stages | MembraneCore |
| MembraneWax | Persistent storage via Wax, including RAPTOR index and pointer store | Membrane, Wax |
| MembraneHive | Checkpoint and restore via Hive | Membrane, HiveCore |
| MembraneConduit | Token counting via Conduit | Membrane, Conduit |


Performance

Membrane is optimized for minimal overhead on Apple Silicon:

Context Preparation Latency

| Context Size | Native (ms) | Membrane (ms) | Overhead |
|:-------------|:-----------:|:-------------:|:--------:|
| 4K Tokens | 0.8 | 1.2 | < 0.5ms |
| 32K Tokens | 2.4 | 3.1 | < 1.0ms |
| 128K Tokens | 8.2 | 9.8 | < 2.0ms |

Throughput Efficiency (M3 Max)
███████████████████████████████░░░░ 94%

Memory Utilization
████████████████████████████████████ 98%

Benchmark hardware: M3 Max (16-core CPU, 40-core GPU), 128GB unified memory. Latency includes Intake, Budget, Compress, and Page stages.


Troubleshooting

Common Issues

1. "budgetExceeded" error

// This error occurs when a bucket's allocation is exceeded
// Solution: Use a larger budget profile or reduce context

// Instead of:
let budget = ContextBudget(totalTokens: 4096, profile: .foundationModels4K)

// Consider:
let budget = ContextBudget(totalTokens: 8192, profile: .openModel8K)

2. "contextWindowExceeded" error

// This error occurs when total context exceeds model limits
// The Page stage couldn't evict enough content

// Solution: Increase importance values on critical slices, or:
// - Reduce history count
// - Use compression tiers (.gist, .micro) for less critical content
// - Increase total budget

3. Tools not being loaded

// JITToolLoader requires at least 10 tools to activate
// If you have fewer tools, they're loaded in allowAll mode

// Force JIT mode if needed:
let toolPlan = ToolPlan.jit(
    normalized: [ToolIndexEntry(...)],
    loaded: ["tool1", "tool2"]  // Pre-loaded tools
)

4. Memory pressure on device

// Configure GQAMemoryEstimator with lower KV budget
let estimator = GQAMemoryEstimator(
    architecture: myArchitecture,
    kvMemoryBudgetBytes: 256 * 1024 * 1024  // 256 MB instead of 512 MB
)

5. Non-deterministic output

// Ensure determinism by:
// 1. Using fixed timestamps (use a consistent clock)
let fixedTimestamp = Date(timeIntervalSince1970: 0)

// 2. Providing deterministic importance values
ContextSlice(
    importance: 0.8,  // Fixed value, not computed
    // ...
)

// 3. Using deterministic profiles
let budget = ContextBudget(totalTokens: 4096, profile: .foundationModels4K)

Debug Mode

Enable detailed logging to trace pipeline execution:

// The pipeline is actor-isolated, so logs should be written
// from within each stage's process method
actor StageTrace {
    private(set) var names: [String] = []
    private(set) var budgets: [ContextBudget] = []

    func append(name: String, budget: ContextBudget) {
        names.append(name)
        budgets.append(budget)
    }
}

Design Principles

| Principle | Description |
|-----------|-------------|
| Actor-isolated | Every stage is an actor — no shared mutable state |
| Deterministic | Identical inputs always produce identical outputs |
| Composable | Swap stages in and out or implement your own |
| Bounded | Collections have maximum sizes; the pipeline doesn't grow without limit |
| Recoverable | Errors include recovery strategies (compressMore, evictAndRetry, offloadToDisk, fail) |


Custom Stages

Implement a stage protocol when you need custom logic:

// Example: Custom compression stage
public actor MyCustomCompressor: CompressStage {
    private let aggressiveMode: Bool

    public init(aggressiveMode: Bool = false) {
        self.aggressiveMode = aggressiveMode
    }

    public func process(
        _ input: BudgetedContext,
        budget: ContextBudget
    ) async throws -> CompressedContext {
        // Your compression logic here
        var compressed = input.window

        if aggressiveMode {
            // Apply aggressive compression
            compressed = compressAggressively(compressed)
        }

        return CompressedContext(
            window: compressed,
            budget: budget,
            compressionReport: CompressionReport(
                originalTokens: input.window.totalTokenCount,
                compressedTokens: compressed.totalTokenCount,
                techniquesApplied: ["custom"]
            )
        )
    }
}

// Use your custom stage in the pipeline
let pipeline = MembranePipeline(
    budget: budget,
    intake: DefaultIntakeStage(),
    compress: MyCustomCompressor(aggressiveMode: true)
)

Error Handling

Membrane errors include recovery strategies:

enum RecoveryStrategy: Sendable {
    case compressMore      // Try harder compression
    case evictAndRetry     // Evict content and retry
    case offloadToDisk     // Move to persistent storage
    case fallbackToInMemory // Use fallback pipeline
    case fail              // Propagate error
}

enum MembraneError: Error, Sendable {
    case budgetExceeded(bucket: BucketID, requested: Int, available: Int)
    case contextWindowExceeded(totalTokens: Int, limit: Int)
    case kvMemoryExceeded(bytes: Int, limit: Int)
    case compressionFailed(stage: String, reason: String)
    case pointerResolutionFailed(pointerID: String)
    // ... more errors
}

// Handle errors with recovery strategies
do {
    let result = try await pipeline.prepare(request)
} catch let error as MembraneError {
    switch error {
    case .budgetExceeded(_, let requested, let available):
        // A bucket needed `requested` tokens but only `available` were left;
        // retry with harder compression before giving up
        let result = try await pipeline.prepare(request, options: .compressMore)
    default:
        throw error
    }
}

Requirements

  • Swift: 6.2+
  • Platforms: macOS 26+ / iOS 26+
  • Hardware: Apple Silicon (M-series chips) recommended

Part of AIStack

Membrane is one layer in a complete on-device AI infrastructure:

| Layer | Project | Role |
|-------|---------|------|
| Client | Conduit | Multi-provider LLM client with token counting |
| Context | Membrane | Intelligent context management pipeline |
| Memory | Wax | On-device memory and RAG |
| Persistence | Hive | State persistence and checkpointing |


Contributing

Contributions are welcome.

Development Setup

# Clone the repository
git clone https://github.com/christopherkarani/Membrane.git
cd Membrane

# Build the project
swift build

# Run tests
swift test

# Run specific test suite
swift test --filter MembraneWaxTests

License

MIT License. See LICENSE for details.
