jaredhowland/warc-swift
A production-ready Swift 6.2 library for creating [WARC 1.1](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) web archives. Powered by [wget-at](https://github.com/ArchiveTeam/wget-lua) (ArchiveTeam's wget-lua) for crawling.
Features
- Full WARC 1.1 compliance: all 8 record types, all named fields, correct CRLF framing
- Single-page capture: `archive(url:options:)` fetches a page and all its inline assets
- Recursive crawling: `crawl(url:options:)` with configurable depth, domain filters, rate limits
- wget-at integration: robots.txt, deduplication, Lua scripting, battle-tested crawling
- Per-record GZIP: `.warc.gz` output following WARC 1.1 Annex D best practice
- Swift WARC I/O: read and write WARC files independently with `WARCReader`/`WARCWriter`
- Async/await: Swift 6.2 strict concurrency, `WARCArchiver` is a safe `actor`
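To make "correct CRLF framing" concrete, here is a minimal sketch of how a single WARC 1.1 record is laid out on disk: a version line, named fields, a blank line, the block, and two trailing CRLFs. The `serializeRecord` helper and the sample field values are hypothetical illustrations written for this README; they are not the library's `WARCWriter` implementation.

```swift
import Foundation

/// Hypothetical helper (not part of warc-swift): frames one uncompressed WARC 1.1 record.
func serializeRecord(fields: [(name: String, value: String)], block: Data) -> Data {
    // Every header line, including the version line, ends with CRLF.
    var header = "WARC/1.1\r\n"
    for field in fields {
        header += "\(field.name): \(field.value)\r\n"
    }
    // Content-Length counts the bytes of the block only.
    header += "Content-Length: \(block.count)\r\n"
    // A blank line separates the header from the block.
    header += "\r\n"

    var record = Data(header.utf8)
    record.append(block)
    // Two CRLFs terminate the record.
    record.append(Data("\r\n\r\n".utf8))
    return record
}

// Sample values below are placeholders chosen for illustration.
let block = Data("software: my-app/1.0\r\n".utf8)
let record = serializeRecord(
    fields: [
        (name: "WARC-Type", value: "warcinfo"),
        (name: "WARC-Date", value: "2024-01-01T00:00:00Z"),
        (name: "WARC-Record-ID", value: "<urn:uuid:00000000-0000-4000-8000-000000000000>"),
        (name: "Content-Type", value: "application/warc-fields"),
    ],
    block: block
)
print(String(decoding: record, as: UTF8.self))
```

With per-record GZIP, each record framed this way would be compressed as its own GZIP member before being appended to the file.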
Requirements
- Swift 6.2+
- macOS 13+ or Linux
- wget or wget-at: see Installation
- zlib: `brew install zlib` (macOS) / `apt-get install zlib1g-dev` (Linux)
Installation
Swift Package Manager
    .package(url: "https://github.com/jaredhowland/warc-swift", from: "1.0.0")

Add "warc-swift" to your target dependencies.
wget Installation
warc-swift delegates all HTTP crawling to wget (or the enhanced wget-at from ArchiveTeam).
macOS (standard wget via Homebrew):

    brew install wget

Ubuntu/Debian:

    apt-get install wget

Build ArchiveTeam's wget-lua (adds Lua scripting, advanced WARC options):

    chmod +x Scripts/build-wget-at.sh
    ./Scripts/build-wget-at.sh

This builds GNU Wget 1.21.3-at with +ssl/openssl +lua/luajit +psl and installs it into `Sources/warc-swift/Resources/Binaries/wget-at-darwin-arm64` (macOS arm64).
Note: The bundled wget-at binary links against Homebrew dylibs (`openssl@3`, `luajit`, `libpsl`). Install them with: `brew install openssl@3 luajit libpsl`
Bundle into the package (fallback that copies the system wget):

    chmod +x Scripts/fetch-binaries.sh
    ./Scripts/fetch-binaries.sh

Quick Start
    import Foundation
    import warc_swift

    let archiver = WARCArchiver()

    // Archive a single page (fetches page + all inline assets: images, CSS, JS)
    let warcURL = try await archiver.archive(
        url: URL(string: "https://example.com")!,
        options: ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
    )
    print("Saved WARC to: \(warcURL.path)")

    // Recursive site crawl
    var opts = ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
    opts.maxDepth = 2
    opts.allowedDomains = ["example.com"]
    opts.rateLimit = "500k" // 500 KB/s max
    opts.userAgent = "MyArchive/1.0"
    opts.retries = 3
    let warcURL2 = try await archiver.crawl(
        url: URL(string: "https://example.com")!,
        options: opts
    )

Configuration Reference
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| outputPath | URL | required | Output directory (created if absent) |
| compress | Bool | true | Per-record GZIP → .warc.gz |
| maxDepth | Int? | nil | Max crawl recursion depth |
| allowedDomains | [String] | [] | Restrict crawl to these domains |
| rateLimit | String? | nil | e.g. "100k", "1m" |
| timeout | TimeInterval | 30 | Per-connection timeout (seconds) |
| retries | Int | 3 | Retry count on failure |
| userAgent | String? | nil | Custom User-Agent header |
| additionalHeaders | [String: String] | [:] | Extra HTTP headers |
| urlFilter | String? | nil | Reject pattern, e.g. ".mp4,.zip" |
| luaScripts | [URL] | [] | wget-lua Lua script paths |
| username / password | String? | nil | HTTP Basic Auth |
| wgetAtPath | URL? | nil | Override wget-at binary path |
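Because crawling is delegated to wget, most of these options plausibly correspond to standard GNU wget command-line flags. The sketch below shows one such mapping. `SketchOptions`, `outputPrefix`, and `wgetArguments(for:options:)` are hypothetical names invented for illustration, and the actual argument list the library builds may differ (for example, treating a `nil` depth as a single-page capture is an assumption made here).

```swift
import Foundation

/// Hypothetical stand-in for a subset of the options table; not the library's type.
struct SketchOptions {
    var outputPrefix: String            // assumed WARC file name prefix
    var maxDepth: Int? = nil
    var allowedDomains: [String] = []
    var rateLimit: String? = nil
    var timeout: TimeInterval = 30
    var retries: Int = 3
    var userAgent: String? = nil
    var urlFilter: String? = nil
}

/// Translates the sketch options into GNU wget flags (illustrative mapping only).
func wgetArguments(for url: URL, options: SketchOptions) -> [String] {
    var args = [
        "--warc-file=\(options.outputPrefix)",
        "--timeout=\(Int(options.timeout))",
        "--tries=\(options.retries)",
    ]
    if let depth = options.maxDepth {
        // Recursive crawl limited to the given depth.
        args += ["--recursive", "--level=\(depth)"]
    } else {
        // Assumption: no depth means one page plus its inline assets.
        args.append("--page-requisites")
    }
    if !options.allowedDomains.isEmpty {
        args.append("--domains=\(options.allowedDomains.joined(separator: ","))")
    }
    if let rate = options.rateLimit { args.append("--limit-rate=\(rate)") }
    if let agent = options.userAgent { args.append("--user-agent=\(agent)") }
    if let reject = options.urlFilter { args.append("--reject=\(reject)") }
    args.append(url.absoluteString)
    return args
}

var sketch = SketchOptions(outputPrefix: "output/example")
sketch.maxDepth = 2
sketch.allowedDomains = ["example.com"]
sketch.rateLimit = "500k"
let args = wgetArguments(for: URL(string: "https://example.com")!, options: sketch)
print(args.joined(separator: " "))
```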
Reading WARC Files
    let reader = try WARCReader(path: URL(fileURLWithPath: "archive.warc.gz"))
    for try await record in reader {
        print("\(record.type.rawValue): \(record.targetURI?.absoluteString ?? "-")")
        print("  Size: \(record.contentLength) bytes, Date: \(WARCDate.string(from: record.date))")
    }

Writing WARC Files
    let writer = try WARCWriter(path: URL(fileURLWithPath: "output.warc.gz"), compress: true)

    // Always start with a warcinfo record (recommended by WARC 1.1)
    try writer.write(.warcinfo(block: Data("software: my-app/1.0\r\noperator: me\r\n".utf8)))

    // Write a response record
    let url = URL(string: "https://example.com/")! // target URI of the captured response (example value)
    let httpResponse = Data("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>…</html>".utf8)
    var r = WARCRecord.response(targetURI: url, block: httpResponse)
    r.blockDigest = WARCDigest.sha256(httpResponse)
    try writer.write(r)
    try writer.close()

Error Handling
    do {
        let warcURL = try await archiver.archive(url: url, options: opts)
    } catch WARCArchiverError.binaryNotFound {
        print("wget not found: install via 'brew install wget'")
    } catch WARCArchiverError.crawlFailed(let code, let stderr) {
        print("wget exited \(code): \(stderr)")
    } catch WARCArchiverError.invalidURL(let url) {
        print("Invalid URL: \(url)")
    } catch {
        // Catch-all so the do/catch is exhaustive for any other error
        print("Unexpected error: \(error)")
    }

Platform Notes
- macOS arm64 / x86_64: Fully supported. Install wget via Homebrew.
- Linux x86_64: Fully supported. Install wget via apt/yum.
- GZIP: Per-record compression uses zlib (available on all platforms); link `zlib1g-dev` on Linux.
- Swift concurrency: `WARCArchiver` is an `actor`; `WARCWriter` and `WARCReader` are safe for single-task use.
WARC Implementation Notes
- Record IDs use `<urn:uuid:UUID4>` scheme (WARC spec section 5.1).
- Dates are ISO 8601 UTC (`YYYY-MM-DDThh:mm:ssZ`), fractional seconds supported.
- Digests use `sha256:BASE32` (RFC 4648, no padding) via CryptoKit.
- `WARC-Concurrent-To` is modelled as `[String]`, the only field that may repeat.
- Per-record GZIP means each GZIP member is independently decompressible (Annex D).
- Recommended WARC file size limit is 1 GB (Annex C).
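The field formats in the notes above can be sketched with Foundation alone. The helper names below (`makeRecordID`, `warcDateString`, `base32NoPadding`) are hypothetical and are not the library's API. The Base32 routine is a plain RFC 4648 encoder with padding omitted, and the digest bytes it would encode in practice would come from a SHA-256 implementation, which is left out here to keep the sketch self-contained.

```swift
import Foundation

/// Hypothetical helper: a record ID in the <urn:uuid:...> form using a random UUID.
func makeRecordID() -> String {
    "<urn:uuid:\(UUID().uuidString.lowercased())>"
}

/// Hypothetical helper: ISO 8601 UTC timestamp without fractional seconds.
func warcDateString(from date: Date) -> String {
    // The default formatter options produce YYYY-MM-DDThh:mm:ssZ in GMT.
    let formatter = ISO8601DateFormatter()
    return formatter.string(from: date)
}

/// Hypothetical helper: RFC 4648 Base32 without "=" padding.
func base32NoPadding(_ bytes: [UInt8]) -> String {
    let alphabet = Array("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567")
    var output = ""
    var buffer: UInt32 = 0
    var bitsInBuffer = 0
    for byte in bytes {
        buffer = (buffer << 8) | UInt32(byte)
        bitsInBuffer += 8
        // Emit one character for every complete 5-bit group.
        while bitsInBuffer >= 5 {
            bitsInBuffer -= 5
            let index = Int((buffer >> UInt32(bitsInBuffer)) & 0x1F)
            output.append(alphabet[index])
        }
        // Keep only the bits that have not been emitted yet.
        buffer &= (1 << UInt32(bitsInBuffer)) - 1
    }
    if bitsInBuffer > 0 {
        // Left-align the remaining bits in a final 5-bit group.
        let index = Int((buffer << UInt32(5 - bitsInBuffer)) & 0x1F)
        output.append(alphabet[index])
    }
    return output
}

print(makeRecordID())
print(warcDateString(from: Date(timeIntervalSince1970: 0)))
// Sample input only; a real digest would encode the 32 bytes of a SHA-256 hash.
print("sha256:" + base32NoPadding(Array("foobar".utf8)))
```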
Examples
The Examples/ directory contains seven runnable examples, from minimal to advanced:
| # | Target | What it demonstrates |
|---|--------|----------------------|
| 01 | SinglePageArchive | Archive one URL with default options (the minimal use case) |
| 02 | RecursiveCrawl | Recursive site crawl with depth, domain allow-list, rate limit, timeout, retries, user-agent, and URL filter |
| 03 | AuthAndHeaders | HTTP Basic Auth + arbitrary custom request headers + browser-spoof user-agent |
| 04 | CustomBinaryAndLua | Explicit wget-at binary path and Lua script injection (ArchiveTeam's wget-lua features) |
| 05 | ReadWARC | Read any .warc / .warc.gz, print per-record summaries, extract fields |
| 06 | WriteWARCManually | Construct a WARC from scratch using WARCWriter: covers all 8 record types, digests, concurrent-to linking, revisit dedup, and segmentation |
| 07 | ErrorHandling | Exhaustive handling of every WARCArchiverError case plus I/O edge cases |
Run any example with:
    swift run <TargetName> [arguments]

    # Examples:
    swift run SinglePageArchive https://example.com ./output
    swift run RecursiveCrawl https://example.com ./output
    swift run AuthAndHeaders https://private.example.com ./output myuser mypass
    swift run CustomBinaryAndLua https://example.com ./output /usr/local/bin/wget-at
    swift run ReadWARC ./output/archive.warc.gz
    swift run WriteWARCManually ./output.warc.gz
    swift run ErrorHandling

License
Package Metadata
Repository: jaredhowland/warc-swift
Default branch: main
README: README.md