jaredhowland/warc-swift

A production-ready Swift 6.2 library for creating [WARC 1.1](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) web archives. Crawling is powered by [wget-at](https://github.com/ArchiveTeam/wget-lua) (ArchiveTeam's wget-lua).

Features

  • Full WARC 1.1 compliance: all 8 record types, all named fields, correct CRLF framing
  • Single-page capture: archive(url:options:) fetches a page and all its inline assets
  • Recursive crawling: crawl(url:options:) with configurable depth, domain filters, and rate limits
  • wget-at integration: robots.txt, deduplication, Lua scripting, battle-tested crawling
  • Per-record GZIP: .warc.gz output following WARC 1.1 Annex D best practice
  • Swift WARC I/O: read and write WARC files independently with WARCReader / WARCWriter
  • Async/await: Swift 6.2 strict concurrency; WARCArchiver is a safe actor

Requirements

  • Swift 6.2+
  • macOS 13+ or Linux
  • wget or wget-at: see Installation
  • zlib: brew install zlib (macOS) / apt-get install zlib1g-dev (Linux)

Installation

Swift Package Manager

.package(url: "https://github.com/jaredhowland/warc-swift", from: "1.0.0")

Add "warc-swift" to your target dependencies.
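For context, a complete manifest might look roughly like the sketch below. The package name MyArchiver is hypothetical, the product name "warc-swift" is assumed from the instruction above, and the URL assumes the repository named at the top of this README is hosted on GitHub.

```swift
// swift-tools-version: 6.2
import PackageDescription

// Hypothetical consumer package; only the dependency lines matter here.
let package = Package(
    name: "MyArchiver",
    platforms: [.macOS(.v13)],
    dependencies: [
        .package(url: "https://github.com/jaredhowland/warc-swift", from: "1.0.0"),
    ],
    targets: [
        .executableTarget(
            name: "MyArchiver",
            dependencies: [
                // Assumed product name; check the library's Package.swift.
                .product(name: "warc-swift", package: "warc-swift"),
            ]
        ),
    ]
)
```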

wget Installation

warc-swift delegates all HTTP crawling to wget (or the enhanced wget-at from ArchiveTeam).

macOS (standard wget via Homebrew):

brew install wget

Ubuntu/Debian:

apt-get install wget

Build ArchiveTeam's wget-lua (adds Lua scripting, advanced WARC options):

chmod +x Scripts/build-wget-at.sh
./Scripts/build-wget-at.sh

This builds GNU Wget 1.21.3-at with +ssl/openssl +lua/luajit +psl and installs it into Sources/warc-swift/Resources/Binaries/wget-at-darwin-arm64 (macOS arm64).

Note: The bundled wget-at binary links against Homebrew dylibs (openssl@3, luajit, libpsl). Install them with: brew install openssl@3 luajit libpsl

Bundle into the package (fallback; copies the system wget):

chmod +x Scripts/fetch-binaries.sh
./Scripts/fetch-binaries.sh

Quick Start

import Foundation
import warc_swift

let archiver = WARCArchiver()

// Archive a single page (fetches page + all inline assets: images, CSS, JS)
let warcURL = try await archiver.archive(
    url: URL(string: "https://example.com")!,
    options: ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
)
print("Saved WARC to: \(warcURL.path)")

// Recursive site crawl
var opts = ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
opts.maxDepth = 2
opts.allowedDomains = ["example.com"]
opts.rateLimit = "500k"           // 500 KB/s max
opts.userAgent = "MyArchive/1.0"
opts.retries = 3
let warcURL2 = try await archiver.crawl(
    url: URL(string: "https://example.com")!,
    options: opts
)

Configuration Reference

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| outputPath | URL | required | Output directory (created if absent) |
| compress | Bool | true | Per-record GZIP → .warc.gz |
| maxDepth | Int? | nil | Max crawl recursion depth |
| allowedDomains | [String] | [] | Restrict crawl to these domains |
| rateLimit | String? | nil | e.g. "100k", "1m" |
| timeout | TimeInterval | 30 | Per-connection timeout (seconds) |
| retries | Int | 3 | Retry count on failure |
| userAgent | String? | nil | Custom User-Agent header |
| additionalHeaders | [String: String] | [:] | Extra HTTP headers |
| urlFilter | String? | nil | Reject pattern, e.g. ".mp4,.zip" |
| luaScripts | [URL] | [] | wget-lua Lua script paths |
| username / password | String? | nil | HTTP Basic Auth |
| wgetAtPath | URL? | nil | Override wget-at binary path |
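To give a feel for what these options control, here is a minimal sketch of how a subset of them could translate into wget command-line arguments. The SketchOptions type and wgetArguments function are hypothetical and written only for illustration; the flags are standard wget options, but the library's real mapping is not documented here and may differ (for example, wget's --tries counts total attempts rather than retries).

```swift
import Foundation

/// Hypothetical mirror of a few ArchiveOptions fields; not the library's type.
struct SketchOptions {
    var maxDepth: Int? = nil
    var allowedDomains: [String] = []
    var rateLimit: String? = nil
    var timeout: TimeInterval = 30
    var retries: Int = 3
    var userAgent: String? = nil
    var additionalHeaders: [String: String] = [:]
    var urlFilter: String? = nil
}

/// Builds an argument array suitable for Process.arguments (no shell quoting needed).
func wgetArguments(for url: URL, options: SketchOptions, warcFile: String) -> [String] {
    var args = ["--warc-file=\(warcFile)", "--page-requisites"]
    if let depth = options.maxDepth {
        args += ["--recursive", "--level=\(depth)"]
    }
    if !options.allowedDomains.isEmpty {
        args.append("--domains=\(options.allowedDomains.joined(separator: ","))")
    }
    if let rate = options.rateLimit { args.append("--limit-rate=\(rate)") }
    args.append("--timeout=\(Int(options.timeout))")
    args.append("--tries=\(options.retries)")  // assumption: retries passed through as-is
    if let agent = options.userAgent { args.append("--user-agent=\(agent)") }
    // Sort headers so the argument order is deterministic.
    for (name, value) in options.additionalHeaders.sorted(by: { $0.key < $1.key }) {
        args.append("--header=\(name): \(value)")
    }
    if let reject = options.urlFilter { args.append("--reject=\(reject)") }
    args.append(url.absoluteString)
    return args
}

var sketch = SketchOptions()
sketch.maxDepth = 2
sketch.allowedDomains = ["example.com"]
sketch.rateLimit = "500k"
let sketchArgs = wgetArguments(
    for: URL(string: "https://example.com")!,
    options: sketch,
    warcFile: "output/archive"
)
print(sketchArgs.joined(separator: " "))
```

Passing the arguments as an array avoids shell interpolation of header values or URLs.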

Reading WARC Files

let reader = try WARCReader(path: URL(fileURLWithPath: "archive.warc.gz"))
for try await record in reader {
    print("\(record.type.rawValue): \(record.targetURI?.absoluteString ?? "-")")
    print("  Size: \(record.contentLength) bytes, Date: \(WARCDate.string(from: record.date))")
}
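As background for the fields printed above, the following sketch shows how the header of one uncompressed WARC record breaks down into a version line and named fields. parseWARCHeader is a hypothetical helper written for this illustration, not part of WARCReader, and it skips details such as line folding.

```swift
import Foundation

/// Parses one record header: a "WARC/x.y" version line followed by
/// "Name: value" lines, all separated by CRLF.
/// Fields are kept as an ordered list because WARC-Concurrent-To may repeat.
func parseWARCHeader(_ header: String) -> (version: String, fields: [(name: String, value: String)])? {
    let lines = header.components(separatedBy: "\r\n").filter { !$0.isEmpty }
    guard let version = lines.first, version.hasPrefix("WARC/") else { return nil }
    var fields: [(name: String, value: String)] = []
    for line in lines.dropFirst() {
        // Split at the first colon only; values such as URIs contain colons too.
        guard let colon = line.firstIndex(of: ":") else { return nil }
        let name = String(line[..<colon])
        let value = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)
        fields.append((name: name, value: value))
    }
    return (version: version, fields: fields)
}

let sampleHeader = "WARC/1.1\r\nWARC-Type: response\r\nWARC-Target-URI: https://example.com/\r\nContent-Length: 0\r\n"
if let parsed = parseWARCHeader(sampleHeader) {
    for field in parsed.fields {
        print("\(field.name) = \(field.value)")
    }
}
```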

Writing WARC Files

let writer = try WARCWriter(path: URL(fileURLWithPath: "output.warc.gz"), compress: true)

// Always start with a warcinfo record (recommended by WARC 1.1)
try writer.write(.warcinfo(block: Data("software: my-app/1.0\r\noperator: me\r\n".utf8)))

// Write a response record
let httpResponse = Data("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>…</html>".utf8)
let url = URL(string: "https://example.com/")!
var r = WARCRecord.response(targetURI: url, block: httpResponse)
r.blockDigest = WARCDigest.sha256(httpResponse)
try writer.write(r)

try writer.close()
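For readers curious about the CRLF framing a writer has to produce, here is a minimal sketch of the byte layout of one uncompressed record: version line, named fields, a blank line, the block, then two CRLFs. serializeRecord is a hypothetical function, not WARCWriter's implementation, and a valid record would also need WARC-Record-ID and WARC-Date fields.

```swift
import Foundation

/// Serializes one uncompressed WARC record with CRLF framing.
/// Content-Length is derived from the block so it cannot drift out of sync.
func serializeRecord(fields: [(name: String, value: String)], block: Data) -> Data {
    var header = "WARC/1.1\r\n"
    for field in fields {
        header += "\(field.name): \(field.value)\r\n"
    }
    header += "Content-Length: \(block.count)\r\n"
    header += "\r\n"                       // blank line ends the header
    var record = Data(header.utf8)
    record.append(block)
    record.append(Data("\r\n\r\n".utf8))   // two CRLFs terminate the record
    return record
}

let framed = serializeRecord(
    fields: [(name: "WARC-Type", value: "resource")],
    block: Data("hello".utf8)
)
print(framed.count)
```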

Error Handling

do {
    let warcURL = try await archiver.archive(url: url, options: opts)
} catch WARCArchiverError.binaryNotFound {
    print("wget not found — install via 'brew install wget'")
} catch WARCArchiverError.crawlFailed(let code, let stderr) {
    print("wget exited \(code): \(stderr)")
} catch WARCArchiverError.invalidURL(let url) {
    print("Invalid URL: \(url)")
}

Platform Notes

  • macOS arm64 / x86_64: Fully supported. Install wget via Homebrew.
  • Linux x86_64: Fully supported. Install wget via apt/yum.
  • GZIP: Per-record compression uses zlib (available on all platforms); install zlib1g-dev on Linux.
  • Swift concurrency: WARCArchiver is an actor; WARCWriter and WARCReader are safe for single-task use.
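If several tasks need to feed one writer, one common pattern is to put the single-task type behind an actor. The sketch below uses a hypothetical stand-in class (SingleTaskWriter) rather than WARCWriter itself, so it illustrates only the pattern and makes no claim about how the library handles this.

```swift
/// Stand-in for a writer that is only safe to use from one task at a time.
final class SingleTaskWriter {
    private(set) var lines: [String] = []
    func write(_ line: String) { lines.append(line) }
}

/// The actor serializes every call, so the wrapped writer never sees concurrent access.
actor SerializedWriter {
    private let writer = SingleTaskWriter()
    func write(_ line: String) { writer.write(line) }
    func snapshot() -> [String] { writer.lines }
}

/// Writes `count` lines from concurrent child tasks and returns what was stored.
func writeConcurrently(count: Int) async -> [String] {
    let shared = SerializedWriter()
    await withTaskGroup(of: Void.self) { group in
        for i in 0..<count {
            group.addTask { await shared.write("record-\(i)") }
        }
    }
    return await shared.snapshot()
}
```

The order in which lines arrive is not deterministic; only the absence of data races is guaranteed.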

WARC Implementation Notes

  • Record IDs use the <urn:uuid:UUID4> scheme (WARC spec section 5.2).
  • Dates are ISO 8601 UTC (YYYY-MM-DDThh:mm:ssZ), fractional seconds supported.
  • Digests use sha256:BASE32 (RFC 4648, no padding) via CryptoKit.
  • WARC-Concurrent-To is modelled as [String] — the only field that may repeat.
  • Per-record GZIP means each GZIP member is independently decompressable (Annex D).
  • Recommended WARC file size limit is 1 GB (Annex C).
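The three formats listed above (record ID, date, digest) can be sketched with Foundation alone, plus CryptoKit where it is available. The helper names below are hypothetical and do not reproduce the library's WARCDate or WARCDigest code.

```swift
import Foundation
#if canImport(CryptoKit)
import CryptoKit
#endif

/// RFC 4648 Base32 (alphabet A-Z, 2-7) without "=" padding.
func base32NoPadding(_ bytes: [UInt8]) -> String {
    let alphabet = Array("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567")
    var output = ""
    var buffer: UInt32 = 0   // bit accumulator
    var bitCount = 0         // number of valid bits in `buffer`
    for byte in bytes {
        buffer = (buffer << 8) | UInt32(byte)
        bitCount += 8
        while bitCount >= 5 {
            bitCount -= 5
            output.append(alphabet[Int((buffer >> bitCount) & 0x1F)])
        }
        buffer &= (1 << bitCount) - 1   // keep only the unconsumed bits
    }
    if bitCount > 0 {
        // Left-align the remaining bits in a final 5-bit group.
        output.append(alphabet[Int((buffer << (5 - bitCount)) & 0x1F)])
    }
    return output
}

/// Record ID in the <urn:uuid:...> form, using a random (version 4) UUID.
func makeRecordID() -> String {
    "<urn:uuid:\(UUID().uuidString.lowercased())>"
}

/// UTC timestamp in the YYYY-MM-DDThh:mm:ssZ form.
func warcDateString(_ date: Date) -> String {
    ISO8601DateFormatter().string(from: date)
}

#if canImport(CryptoKit)
/// Digest label in the sha256:BASE32 form.
func sha256Label(_ data: Data) -> String {
    "sha256:" + base32NoPadding(Array(SHA256.hash(data: data)))
}
#endif

print(makeRecordID())
print(warcDateString(Date(timeIntervalSince1970: 0)))  // 1970-01-01T00:00:00Z
print(base32NoPadding(Array("foobar".utf8)))           // MZXW6YTBOI
```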

Examples

The Examples/ directory contains seven runnable examples, from minimal to advanced:

| # | Target | What it demonstrates |
|---|--------|----------------------|
| 01 | SinglePageArchive | Archive one URL with default options (the minimal use case) |
| 02 | RecursiveCrawl | Recursive site crawl with depth, domain allow-list, rate limit, timeout, retries, user-agent, and URL filter |
| 03 | AuthAndHeaders | HTTP Basic Auth + arbitrary custom request headers + browser-spoof user-agent |
| 04 | CustomBinaryAndLua | Explicit wget-at binary path and Lua script injection (ArchiveTeam's wget-lua features) |
| 05 | ReadWARC | Read any .warc / .warc.gz, print per-record summaries, extract fields |
| 06 | WriteWARCManually | Construct a WARC from scratch using WARCWriter; covers all 8 record types, digests, concurrent-to linking, revisit dedup, and segmentation |
| 07 | ErrorHandling | Exhaustive handling of every WARCArchiverError case plus I/O edge cases |

Run any example with:

swift run <TargetName> [arguments]

# Examples:
swift run SinglePageArchive https://example.com ./output
swift run RecursiveCrawl   https://example.com ./output
swift run AuthAndHeaders   https://private.example.com ./output myuser mypass
swift run CustomBinaryAndLua https://example.com ./output /usr/local/bin/wget-at
swift run ReadWARC         ./output/archive.warc.gz
swift run WriteWARCManually ./output.warc.gz
swift run ErrorHandling

License

MIT

Package Metadata

Repository: jaredhowland/warc-swift

Default branch: main

README: README.md