kylehowells/swift-justhtml
A dependency-free HTML5 parser for Swift, following the WHATWG HTML parsing specification.
Features
- Full HTML5 Compliance - Passes all 1,831 html5lib-tests tree construction tests
- Zero Dependencies - Pure Swift using only standard library and Foundation
- Cross-Platform - macOS, iOS, tvOS, watchOS, visionOS, and Linux
- CSS Selectors - Query documents using standard CSS selector syntax
- Multiple Output Formats - Serialize to HTML, plain text, or Markdown
- Streaming API - Memory-efficient event-based parsing
- Fragment Parsing - Parse HTML fragments in specific contexts
Installation
Swift Package Manager
Add swift-justhtml to your Package.swift:
dependencies: [
.package(url: "https://github.com/kylehowells/swift-justhtml.git", from: "0.4.0")
]Then add it to your target's dependencies:
targets: [
.target(
name: "YourTarget",
dependencies: ["justhtml"]
)
]Xcode
- File > Add Package Dependencies...
- Enter:
https://github.com/kylehowells/swift-justhtml.git - Select version: 0.4.0 or later
Usage
Basic Parsing
import justhtml
// Parse an HTML document
let doc = try JustHTML("<html><body><p>Hello, World!</p></body></html>")
// Access the document tree
print(doc.root.children) // [<html>]CSS Selectors
// Query with CSS selectors
let paragraphs = try doc.query("p")
let byClass = try doc.query(".intro")
let byId = try doc.query("#header")
let complex = try doc.query("nav > ul > li > a[href]")
// Check if a node matches a selector
let matches = try node.matches(".highlight")Extracting Content
// Get plain text
let text = doc.toText()
// Serialize to HTML
let html = doc.toHTML()
let prettyHtml = doc.toHTML(pretty: true, indentSize: 4)
// Convert to Markdown
let markdown = doc.toMarkdown()Fragment Parsing
// Parse HTML as if inside a specific element
let ctx = FragmentContext("tbody")
let fragment = try JustHTML("<tr><td>Cell</td></tr>", fragmentContext: ctx)Streaming API
// Memory-efficient event-based parsing
for event in HTMLStream("<p>Hello</p>") {
switch event {
case .start(let tag, let attrs):
print("Start: \(tag)")
case .end(let tag):
print("End: \(tag)")
case .text(let content):
print("Text: \(content)")
case .comment(let text):
print("Comment: \(text)")
case .doctype(let name, let publicId, let systemId):
print("Doctype: \(name)")
}
}Error Handling
// Strict mode - throws on first parse error
do {
let doc = try JustHTML("<p>Unclosed", strict: true)
} catch let error as StrictModeError {
print("Error: \(error.parseError.code)")
}
// Collect errors without throwing
let doc = try JustHTML("<p>Unclosed", collectErrors: true)
for error in doc.errors {
print("\(error.line):\(error.column): \(error.code)")
}DoS Protection
swift-justhtml includes configurable limits to protect against denial-of-service attacks from malicious HTML input:
// Default limits are applied automatically (recommended)
let doc = try JustHTML(untrustedHTML)
// Custom limits for servers with more resources
var limits = ParserLimits()
limits.maxNestingDepth = 2048
let doc = try JustHTML(html, limits: limits)
// Stricter limits for resource-constrained devices
let doc = try JustHTML(html, limits: .strict)
// Disable limits for trusted content only
let doc = try JustHTML(trustedHTML, limits: .unlimited)Default limits:
maxEntityNameLength: 255 characters (prevents memory attacks from&aaaa...)maxNestingDepth: 512 levels (prevents stack overflow from deep nesting)
See the DoS Protection Guide for details.
Spec Compliance
swift-justhtml implements the WHATWG HTML parsing specification exactly and passes all tests from the official html5lib-tests suite (used by browser vendors), the same as justhtml.
Test Results
| Test Suite | Passed | Failed | |------------|--------|--------| | Tree Construction | 1,831 | 0 | | Tokenizer | 6,810 | 0 | | Serializer | 230 | 0 | | Encoding | 82 | 0 | | Total | 8,953 | 0 |
Fuzz Testing
The parser has been fuzz tested with millions of randomized and malformed HTML documents to ensure it never crashes or hangs on any input:
- Random data fuzzing with varying document sizes
- Fragment context fuzzing
- Deep nesting stress tests
- Malformed tag and entity sequences
Run the fuzzer: swift test --filter fuzzTest
Performance
swift-justhtml is optimized for performance, matching or exceeding JavaScript implementations:
Parse Time
| Implementation | Parse Time | Comparison | |----------------|-----------|------------| | Swift | 97ms | - | | JavaScript | 99ms | 1.02x slower | | Python | 398ms | 4.1x slower |
Benchmark: Parsing 2.5MB of HTML across 5 Wikipedia articles
See Benchmarks/BENCHMARK_RESULTS.md for detailed performance comparison.
Memory Usage
| Implementation | Peak RSS | Comparison | |----------------|----------|------------| | Swift | 103 MB | - | | Python | 106 MB | 1.03x more | | JavaScript | 226 MB | 2.2x more |
Benchmark: Average peak memory across 6 test files including 20MB synthetic HTML
See Benchmarks/MEMORY_RESULTS.md for detailed memory comparison.
Swift Library Comparison
| Library | html5lib Pass Rate | Crashes/Hangs | Dependencies | |---------|-------------------|---------------|--------------| | swift-justhtml | 100% (1831/1831) | None | None | | Kanna | 94.4% (1542/1633) | None | libxml2 | | SwiftSoup | 87.9% (1436/1633) | Infinite loop on 197 tests | swift-atomics | | LilHTML | 47.4% (775/1634) | Crashes on 855 tests | libxml2 |
See notes/comparison.md for detailed library comparison.
Platform Support
| Platform | Minimum Version | |----------|-----------------| | macOS | 13.0+ | | iOS | 16.0+ | | tvOS | 16.0+ | | watchOS | 9.0+ | | visionOS | 1.0+ | | Linux | Swift 6.0+ |
Documentation
License
MIT License - see LICENSE for details.
Credits
- Original Python implementation: justhtml by Emil Stenström
- JavaScript port: justjshtml by Simon Willison
- Test suite: html5lib-tests
Package Metadata
Repository: kylehowells/swift-justhtml
Default branch: master
README: README.md