scinfu/swiftsoup

---

Swift

Swift 5 ``>=2.0.0``

Swift 4.2 ``1.7.4``

## Installation

### CocoaPods

SwiftSoup is available through CocoaPods. To install it, add the following line to your Podfile:

```ruby
pod 'SwiftSoup'
```

### Carthage

SwiftSoup is also available through Carthage. To install it, add the following line to your Cartfile:

```
github "scinfu/SwiftSoup"
```

### Swift Package Manager

SwiftSoup is also available through Swift Package Manager. To install it, add the dependency to your `Package.swift` file:

```swift
...
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0"),
],
targets: [
    .target( name: "YourTarget", dependencies: ["SwiftSoup"]),
]
...
```
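
For context, the fragment above might sit in a complete manifest like the following minimal sketch. The package name `YourPackage`, the tools version, and the choice of an executable target are assumptions made for illustration, not requirements of SwiftSoup.

```swift
// swift-tools-version:5.9
// Minimal manifest sketch; "YourPackage" and "YourTarget" are placeholder names.
import PackageDescription

let package = Package(
    name: "YourPackage",
    dependencies: [
        // `from:` accepts any release from 2.6.0 up to (not including) the next major version.
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0"),
    ],
    targets: [
        .executableTarget(name: "YourTarget", dependencies: ["SwiftSoup"]),
    ]
)
```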

## Usage Examples

### Parse an HTML Document

```swift
import SwiftSoup

let html = """
<html><head><title>Example</title></head>
<body><p>Hello, SwiftSoup!</p></body></html>
"""

let document: Document = try SwiftSoup.parse(html)
print(try document.title()) // Output: Example
```

### Automatic Format Detection

`SwiftSoup.parse(...)` automatically detects XML input by looking for an `<?xml` declaration at the start of the content. When one is found, the XML parser is used; otherwise the HTML parser is applied. This means feeds, OPML, and other XML documents with a standard XML declaration "just work":

```swift
import SwiftSoup

let xml = """
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <body>
    <link>I'm link</link>
    <img>I'm img</img>
  </body>
</opml>
"""

let document = try SwiftSoup.parse(xml) // auto-detects XML
print(try document.select("link").first()?.text() ?? "") // Output: I'm link
print(try document.select("body > img").first()?.text() ?? "") // Output: I'm img
```

### Explicit Parse Modes

Use `parseXML(...)` or `parseHTML(...)` when you want to force a specific parser regardless of the content:

```swift
// Force XML parsing (no HTML5 tag normalization)
let xmlDoc = try SwiftSoup.parseXML(xmlString)

// Force HTML parsing (always applies HTML5 rules, even if input has <?xml>)
let htmlDoc = try SwiftSoup.parseHTML(htmlString)

// Explicit parser argument (unchanged from before)
let doc = try SwiftSoup.parse(input, baseUri, Parser.xmlParser())
```

### Parse HTML from a URL

If Foundation cannot determine a page's text encoding, avoid `String(contentsOf:)` and parse the raw response bytes instead:

```swift
import Foundation // provides URL
import SwiftSoup

let url = URL(string: "https://example.com")!
let document = try SwiftSoup.parse(url)
print(try document.title())
```
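
As a cautious alternative that relies only on the string-based `parse(_:_:)` overload, the bytes can be decoded explicitly before parsing. This is a minimal sketch, not the library's recommended path: it assumes the page is UTF-8, and the inline `Data` value is a stand-in for bytes that would normally come from a network response (for example via `URLSession`).

```swift
import Foundation
import SwiftSoup

// Stand-in for raw response bytes; in real code these come from the network.
let data = Data("<html><head><title>Example</title></head><body></body></html>".utf8)

// Assumption: the content is UTF-8. String(decoding:as:) never fails;
// it replaces invalid byte sequences with U+FFFD instead.
let decodedHtml = String(decoding: data, as: UTF8.self)

// The second argument is the base URI used to resolve relative links.
let parsed = try SwiftSoup.parse(decodedHtml, "https://example.com")
print(try parsed.title()) // Output: Example
```

If the server declares a different charset, decode with that encoding instead of UTF-8.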

## Profiling

SwiftSoup includes a lightweight profiler (gated by a compile-time flag) and a small CLI harness for parsing benchmarks.

### CLI parse benchmark
This uses the `SwiftSoupProfile` executable target to parse a fixture corpus and report wall time:

```bash
swift run -c release SwiftSoupProfile --fixtures /path/to/fixtures
```

Add `--text` to include `Document.text()` in the workload.

### In-code profiler
The `Profiler` type is only compiled when the `PROFILE` flag is set. Build with:

```bash
swift run -c release -Xswiftc -DPROFILE SwiftSoupProfile --fixtures /path/to/fixtures
```

The CLI then prints the profiler summary at the end of the run.

---

### Select Elements with CSS Query

```swift
let html = """
<html><body>
<p class='message'>SwiftSoup is powerful!</p>
<p class='message'>Parsing HTML in Swift</p>
</body></html>
"""

let document = try SwiftSoup.parse(html)
let messages = try document.select("p.message")

for message in messages {
    print(try message.text())
}
// Output:
// SwiftSoup is powerful!
// Parsing HTML in Swift
```

---

### Extract Text and Attributes

```swift
let html = "<a href='https://example.com'>Visit the site</a>"
let document = try SwiftSoup.parse(html)
let link = try document.select("a").first()

if let link = link {
    print(try link.text()) // Output: Visit the site
    print(try link.attr("href")) // Output: https://example.com
}
```

---

### Modify the DOM

```swift
let document = try SwiftSoup.parse("<div id='content'></div>")
let div = try document.select("#content").first()
try div?.append("<p>New content added!</p>")
print(try document.html())
// Output (line breaks and indentation may differ, since output is pretty-printed by default):
// <html><head></head><body><div id="content"><p>New content added!</p></div></body></html>
```

---

### Clean HTML for Security (Whitelist)

```swift
let dirtyHtml = "<script>alert('Hacked!')</script><b>Important text</b>"
let cleanHtml = try SwiftSoup.clean(dirtyHtml, Whitelist.basic())
print(cleanHtml) // Output: <b>Important text</b>
```

```swift
let dirtyHtml = #"<p style="color:red; position:absolute">Styled text</p>"#
let whitelist = try Whitelist()
    .addTags("p")
    .addAttributes("p", "style")
    .addCSSProperties("p", "color")
let cleanHtml = try SwiftSoup.clean(dirtyHtml, whitelist)
print(cleanHtml) // Output: <p style="color:red">Styled text</p>
```

---
### Use CSS selectors to find elements
(from [jsoup](https://jsoup.org/cookbook/extracting-data/selector-syntax))

#### Selector overview

- `tagname`: find elements by tag, e.g. `div`
- `#id`: find elements by ID, e.g. `#logo`
- `.class`: find elements by class name, e.g. `.masthead`
- `[attribute]`: elements with attribute, e.g. `[href]`
- `[^attrPrefix]`: elements with an attribute name prefix, e.g. `[^data-]` finds elements with HTML5 dataset attributes
- `[attr=value]`: elements with attribute value, e.g. `[width=500]` (also quotable, like `[data-name='launch sequence']`)
- `[attr^=value]`, `[attr$=value]`, `[attr*=value]`: elements with attributes that start with, end with, or contain the value, e.g. `[href*=/path/]`
- `[attr~=regex]`: elements with attribute values that match the regular expression; e.g. `img[src~=(?i)\.(png|jpe?g)]`
- `*`: all elements, e.g. `*`
- `[*]`: elements that have any attribute, e.g. `p[*]` finds paragraphs with at least one attribute, and `p:not([*])` finds those with no attributes
- `ns|tag`: find elements by tag in a namespace prefix, e.g. `dc|name` finds `<dc:name>` elements
- `*|tag`: find elements by tag in any namespace prefix, e.g. `*|name` finds `<dc:name>` and `<name>` elements
- `:empty`: selects elements that have no children (ignoring blank text nodes, comments, etc.); e.g. `li:empty`
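
A few of the selectors above, applied to a small made-up document. This is an illustrative sketch; the HTML and the variable names are invented for the example.

```swift
import SwiftSoup

let overviewHtml = """
<div id="logo" class="masthead" data-role="banner">Logo</div>
<a href="/path/to/page">Internal link</a>
<img src="photo.JPG" width="500">
"""
let overviewDoc = try SwiftSoup.parse(overviewHtml)

print(try overviewDoc.select("#logo").size())          // 1 (by ID)
print(try overviewDoc.select(".masthead").size())      // 1 (by class)
print(try overviewDoc.select("[^data-]").size())       // 1 (attribute name prefix)
print(try overviewDoc.select("[href*=/path/]").size()) // 1 (attribute value contains)
print(try overviewDoc.select("[width=500]").size())    // 1 (exact attribute value)

// The backslash is doubled because this is an ordinary Swift string literal.
print(try overviewDoc.select("img[src~=(?i)\\.(png|jpe?g)]").size()) // 1 (regex match)
```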

#### Selector combinations

- `el#id`: elements with ID, e.g. `div#logo`
- `el.class`: elements with class, e.g. `div.masthead`
- `el[attr]`: elements with attribute, e.g. `a[href]`
- Any combination, e.g. `a[href].highlight`
- `ancestor child`: child elements that descend from ancestor, e.g. `.body p` finds `p` elements anywhere under a block with class "body"
- `parent > child`: child elements that descend directly from parent, e.g. `div.content > p` finds `p` elements; and `body > *` finds the direct children of the body tag
- `siblingA + siblingB`: finds sibling B element immediately preceded by sibling A, e.g. `div.head + div`
- `siblingA ~ siblingX`: finds sibling X element preceded by sibling A, e.g. `h1 ~ p`
- `el, el, el`: group multiple selectors, find unique elements that match any of the selectors; e.g. `div.masthead, div.logo`
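
The difference between the combinators is easiest to see side by side. The sketch below uses an invented document and a small hypothetical helper, `texts(_:in:)`, that collects the text of each match.

```swift
import SwiftSoup

// Hypothetical helper: runs a query and returns the text of every match.
func texts(_ query: String, in root: Element) throws -> [String] {
    try root.select(query).array().map { try $0.text() }
}

let combinatorHtml = """
<div class="content">
  <h1>Title</h1>
  <p>First</p>
  <section><p>Nested</p></section>
  <p>Last</p>
</div>
"""
let combinatorDoc = try SwiftSoup.parse(combinatorHtml)

print(try texts("div.content p", in: combinatorDoc))   // ["First", "Nested", "Last"] (any depth)
print(try texts("div.content > p", in: combinatorDoc)) // ["First", "Last"] (direct children only)
print(try texts("h1 + p", in: combinatorDoc))          // ["First"] (immediately following sibling)
print(try texts("h1 ~ p", in: combinatorDoc))          // ["First", "Last"] (any following sibling)
print(try texts("h1, section", in: combinatorDoc))     // ["Title", "Nested"] (either selector)
```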

#### Pseudo selectors

- `:has(selector)`: find elements that contain elements matching the selector; e.g. `div:has(p)`
- `:is(selector)`: find elements that match any of the selectors in the selector list; e.g. `:is(h1, h2, h3, h4, h5, h6)` finds any heading element
- `:not(selector)`: find elements that do not match the selector; e.g. `div:not(.logo)`
- `:lt(n)`: find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than `n`; e.g. `td:lt(3)`
- `:gt(n)`: find elements whose sibling index is greater than `n`; e.g. `div p:gt(2)`
- `:eq(n)`: find elements whose sibling index is equal to `n`; e.g. `form input:eq(1)`
- Note that the above indexed pseudo-selectors are 0-based: the first element is at index 0, the second at 1, and so on.
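
A short sketch of the indexed and relational pseudo-selectors above, using an invented document; the `texts(_:in:)` helper is hypothetical and only collects the text of each match.

```swift
import SwiftSoup

// Hypothetical helper: runs a query and returns the text of every match.
func texts(_ query: String, in root: Element) throws -> [String] {
    try root.select(query).array().map { try $0.text() }
}

let pseudoHtml = """
<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>
<div class="logo"><p>Logo</p></div>
<div><span>No paragraph</span></div>
"""
let pseudoDoc = try SwiftSoup.parse(pseudoHtml)

print(try texts("td:lt(2)", in: pseudoDoc))       // ["a", "b"] (sibling index 0 and 1)
print(try texts("td:gt(2)", in: pseudoDoc))       // ["d"] (sibling index 3)
print(try texts("td:eq(1)", in: pseudoDoc))       // ["b"] (sibling index 1)
print(try texts("div:has(p)", in: pseudoDoc))     // ["Logo"]
print(try texts("div:not(.logo)", in: pseudoDoc)) // ["No paragraph"]
```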

#### Text content pseudo selectors

- `:contains(text)`: find elements that contain (directly or via children) the given normalized text. The search is case-insensitive; e.g. `div:contains(jsoup)`
- `:containsOwn(text)`: find elements whose own text directly contains the given text; e.g. `p:containsOwn(jsoup)`
- `:containsData(text)`: selects elements that contain the specified data (e.g. within `<script>`, `<style>`, or comments); e.g. `script:containsData(jsoup)`
- `:containsWholeText(text)`: selects elements that contain the exact, non-normalized whole text (case sensitive, preserving whitespace/newlines); e.g. `p:containsWholeText(jsoup The Java HTML Parser)`
- `:containsWholeOwnText(text)`: selects elements whose own text exactly matches the given non-normalized text (case sensitive); e.g. `p:containsWholeOwnText(jsoup The Java HTML Parser)`
- `:matches(regex)`: find elements whose text matches the specified regular expression; e.g. `div:matches((?i)login)`
- `:matchesOwn(regex)`: find elements whose own text matches the specified regular expression
- `:matchesWholeText(regex)`: selects elements whose entire, non-normalized text matches the specified regex; e.g. `div:matchesWholeText(\d{3}-\d{2}-\d{4})`
- `:matchesWholeOwnText(regex)`: selects elements whose own non-normalized text matches the regex; e.g. `span:matchesWholeOwnText(\w+)`
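
The distinction between an element's full text and its own text can be sketched as follows. The document is invented for the example, and `texts(_:in:)` is a hypothetical helper.

```swift
import SwiftSoup

// Hypothetical helper: runs a query and returns the text of every match.
func texts(_ query: String, in root: Element) throws -> [String] {
    try root.select(query).array().map { try $0.text() }
}

let textHtml = """
<div><p>Welcome to jsoup</p><p>Please <b>Login</b> here</p></div>
<script>var lib = "jsoup";</script>
"""
let textDoc = try SwiftSoup.parse(textHtml)

// :contains is case-insensitive and includes text from child elements.
print(try texts("p:contains(login)", in: textDoc))    // ["Please Login here"]
// :containsOwn ignores text that belongs to children such as <b>.
print(try texts("p:containsOwn(login)", in: textDoc)) // []
print(try texts("b:containsOwn(login)", in: textDoc)) // ["Login"]
print(try texts("p:matches((?i)login)", in: textDoc)) // ["Please Login here"]
// Script content is data rather than text, so count the matches instead.
print(try textDoc.select("script:containsData(jsoup)").size()) // 1
```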

#### Structural pseudo selectors

- `:root`: selects the root element of the document (in HTML, the `<html>` element); e.g. `:root`
- `:nth-child(an+b)`: selects elements with an+b–1 preceding siblings; supports expressions like `2n+1` for odd elements; e.g. `tr:nth-child(2n+1)`
- `:nth-last-child(an+b)`: selects elements with an+b–1 following siblings; e.g. `tr:nth-last-child(-n+2)`
- `:nth-of-type(an+b)`: selects elements based on their position among siblings of the same type; e.g. `img:nth-of-type(2n+1)`
- `:nth-last-of-type(an+b)`: selects elements based on their position among siblings of the same type, counting from the end; e.g. `img:nth-last-of-type(2n+1)`
- `:first-child`: selects elements that are the first child of their parent; e.g. `div > p:first-child`
- `:last-child`: selects elements that are the last child of their parent; e.g. `ol > li:last-child`
- `:first-of-type`: selects the first element of its type among its siblings; e.g. `dl dt:first-of-type`
- `:last-of-type`: selects the last element of its type among its siblings; e.g. `tr > td:last-of-type`
- `:only-child`: selects elements that are the only child of their parent; e.g. `div:only-child`
- `:only-of-type`: selects elements that are the only element of their type among their siblings; e.g. `span:only-of-type`
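
A brief sketch of the structural pseudo-selectors on an invented four-item list; `texts(_:in:)` is a hypothetical helper. Unlike `:lt`, `:gt`, and `:eq`, the `:nth-child` family counts from 1.

```swift
import SwiftSoup

// Hypothetical helper: runs a query and returns the text of every match.
func texts(_ query: String, in root: Element) throws -> [String] {
    try root.select(query).array().map { try $0.text() }
}

let listHtml = "<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>"
let listDoc = try SwiftSoup.parse(listHtml)

print(try texts("li:first-child", in: listDoc))     // ["one"]
print(try texts("li:last-child", in: listDoc))      // ["four"]
print(try texts("li:nth-child(2)", in: listDoc))    // ["two"]
print(try texts("li:nth-child(2n+1)", in: listDoc)) // ["one", "three"] (odd positions)
print(try texts("li:only-child", in: listDoc))      // [] (each li has siblings)
```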

#### Optimize repeated queries

SwiftSoup automatically caches parsed CSS queries, which speeds up repeated queries as well as the parsing of related queries.

The cache is controlled through the static property `QueryParser.cache`. By default, it is initialized with a reasonable size limit.
You may replace the cache at any time; however, assigning a new cache instance will discard all previously cached values.

```swift
// Remove any cache limits.
QueryParser.cache = QueryParser.DefaultCache(limit: .unlimited)
// Limit to 1000 items. See also documentation for ``QueryParserCache/set(_:_:)``.
QueryParser.cache = QueryParser.DefaultCache(limit: .count(1000))
```

An alternative is to parse the query up front and pass an `Evaluator` instead of a query string.
Since `Evaluator` instances are immutable, they are safe to store in (static) properties or pass across isolation boundaries.

```swift
let elements: Elements = …
let eval = try QueryParser.parse("div > p")
for element in elements {
    print(try element.select(eval).text())
}
```

---

## Author

Nabil Chatbi, scinfu@gmail.com

Current maintainer: Alex Ehlke, available for hire for SwiftSoup related work or other iOS projects: alex dot ehlke at gmail

## Note

SwiftSoup was ported to Swift from the Java library jsoup.

## License

SwiftSoup is available under the MIT license. See the LICENSE file for more info.

Package Metadata

Repository: scinfu/swiftsoup

Default branch: master

README: README.md