Unicode.UTF8.ValidationError

The kind and location of a UTF-8 encoding error.

Declaration

@frozen struct ValidationError

Overview

Valid UTF-8 is represented by this table:

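| Scalar value       | Byte 0 | Byte 1 | Byte 2 | Byte 3 |
| ------------------ | ------ | ------ | ------ | ------ |
| U+0000..U+007F     | 00..7F |        |        |        |
| U+0080..U+07FF     | C2..DF | 80..BF |        |        |
| U+0800..U+0FFF     | E0     | A0..BF | 80..BF |        |
| U+1000..U+CFFF     | E1..EC | 80..BF | 80..BF |        |
| U+D000..U+D7FF     | ED     | 80..9F | 80..BF |        |
| U+E000..U+FFFF     | EE..EF | 80..BF | 80..BF |        |
| U+10000..U+3FFFF   | F0     | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF   | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4     | 80..8F | 80..BF | 80..BF |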
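The table can be read as a lookup from the lead byte to the sequence length and the allowed range for the second byte; every later byte must be in 80..BF. The following is a minimal sketch of that reading, not the standard library's implementation. The names `SequenceShape`, `sequenceShape(forLead:)`, and `isWellFormedScalar(_:)` are hypothetical and exist only for illustration.

```swift
/// Hypothetical description of one row of the table.
struct SequenceShape {
    let length: Int                      // total number of bytes in the sequence
    let secondByte: ClosedRange<UInt8>?  // allowed range for byte 1, if any
}

/// Returns the table row selected by `lead`, or nil when the byte can
/// never start a scalar (80..BF, C0, C1, F5..FF).
func sequenceShape(forLead lead: UInt8) -> SequenceShape? {
    switch lead {
    case 0x00...0x7F: return SequenceShape(length: 1, secondByte: nil)
    case 0xC2...0xDF: return SequenceShape(length: 2, secondByte: 0x80...0xBF)
    case 0xE0:        return SequenceShape(length: 3, secondByte: 0xA0...0xBF)
    case 0xE1...0xEC: return SequenceShape(length: 3, secondByte: 0x80...0xBF)
    case 0xED:        return SequenceShape(length: 3, secondByte: 0x80...0x9F)
    case 0xEE...0xEF: return SequenceShape(length: 3, secondByte: 0x80...0xBF)
    case 0xF0:        return SequenceShape(length: 4, secondByte: 0x90...0xBF)
    case 0xF1...0xF3: return SequenceShape(length: 4, secondByte: 0x80...0xBF)
    case 0xF4:        return SequenceShape(length: 4, secondByte: 0x80...0x8F)
    default:          return nil
    }
}

/// Returns true when `bytes` is exactly one well-formed scalar.
func isWellFormedScalar(_ bytes: [UInt8]) -> Bool {
    let continuation: ClosedRange<UInt8> = 0x80...0xBF
    guard let lead = bytes.first,
          let shape = sequenceShape(forLead: lead),
          bytes.count == shape.length else { return false }
    for index in 1..<bytes.count {
        // Byte 1 has a lead-specific range; bytes 2 and 3 are plain continuations.
        let allowed = (index == 1) ? (shape.secondByte ?? continuation) : continuation
        if !allowed.contains(bytes[index]) { return false }
    }
    return true
}
```

Restricting the second byte per lead byte is what rules out overlong forms (E0 80..9F, F0 80..8F), surrogates (ED A0..BF), and values above U+10FFFF (F4 90..BF) without decoding the scalar first.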

Classifying errors

An unexpected continuation is when a continuation byte (10xxxxxx) occurs in a position that should be the start of a new scalar value. Unexpected continuations can often occur when the input contains arbitrary data instead of textual content. An unexpected continuation at the start of input might mean that the input was not correctly sliced along scalar boundaries or that it does not contain UTF-8.

A truncated scalar is a multi-byte sequence that is the start of a valid multi-byte scalar but is cut off before ending correctly. A truncated scalar at the end of the input might mean that only part of the entire input was received.

A surrogate code point (U+D800..U+DFFF) is invalid UTF-8. Surrogate code points are used by UTF-16 to encode scalars in the supplementary planes. Their presence may mean the input was encoded in a different 8-bit encoding, such as CESU-8, WTF-8, or Java’s Modified UTF-8.

An invalid non-surrogate code point is any code point higher than U+10FFFF. This can often occur when the input is arbitrary data instead of textual content.

An overlong encoding occurs when a scalar value that could have been encoded using fewer bytes is encoded in a longer byte sequence. Overlong encodings are invalid UTF-8 and can lead to security issues if not correctly detected.

An overlong encoding of NUL, 0xC0 0x80, is used in Java’s Modified UTF-8 but is invalid UTF-8. Overlong encoding errors often catch attempts to bypass security measures.
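The categories above can be told apart by looking at the lead byte and, at most, the second byte. The sketch below illustrates one way to do that. The enum `SketchErrorKind` and the function `classifyFirstError(in:)` are hypothetical names, not the API of this type, and treating the bytes F5..FF as invalid non-surrogate code points is an assumption of this sketch.

```swift
/// Hypothetical error categories, named after the classification above.
enum SketchErrorKind: Equatable {
    case unexpectedContinuationByte
    case truncatedScalar
    case surrogateCodePointByte
    case invalidNonSurrogateCodePointByte
    case overlongEncodingByte
}

/// Classifies the error, if any, at the start of `bytes`.
/// Returns nil when `bytes` is empty or begins with a well-formed scalar.
func classifyFirstError(in bytes: [UInt8]) -> SketchErrorKind? {
    guard let lead = bytes.first else { return nil }
    let continuation: ClosedRange<UInt8> = 0x80...0xBF

    let length: Int
    switch lead {
    case 0x00...0x7F: return nil
    case 0x80...0xBF: return .unexpectedContinuationByte
    case 0xC0...0xC1: return .overlongEncodingByte   // two-byte form of an ASCII value
    case 0xC2...0xDF: length = 2
    case 0xE0...0xEF: length = 3
    case 0xF0...0xF4: length = 4
    default:          return .invalidNonSurrogateCodePointByte  // F5..FF (assumption)
    }

    // Overlong, surrogate, and out-of-range errors are visible by the second byte.
    if bytes.count > 1 {
        switch (lead, bytes[1]) {
        case (0xE0, 0x80...0x9F), (0xF0, 0x80...0x8F):
            return .overlongEncodingByte
        case (0xED, 0xA0...0xBF):
            return .surrogateCodePointByte
        case (0xF4, 0x90...0xBF):
            return .invalidNonSurrogateCodePointByte
        default:
            break
        }
    }

    // Otherwise the only remaining failure is a sequence that is cut short.
    guard bytes.count >= length else { return .truncatedScalar }
    for index in 1..<length where !continuation.contains(bytes[index]) {
        return .truncatedScalar
    }
    return nil
}
```

For instance, `classifyFirstError(in: [0xC0, 0x80])` yields `.overlongEncodingByte`, which corresponds to the Modified UTF-8 encoding of NUL mentioned above.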

Reporting the range of the error

The reported error range follows the “maximal subpart of an ill-formed subsequence” algorithm, in which each error is either one byte long or ends before the first byte that is disallowed. See “U+FFFD Substitution of Maximal Subparts” in the Unicode Standard. Unicode has recommended this algorithm since version 6, and the W3C has adopted it.

The maximal subpart algorithm will produce a single multi-byte range for a truncated scalar (a multi-byte sequence that is the start of a valid multi-byte scalar but is cut off before ending correctly). For all other errors (including overlong encodings, surrogates, and invalid code points), it will produce an error per byte.

Since overlong encodings, surrogates, and invalid code points are erroneous by the second byte (at the latest), the above definition produces the same ranges as defining such a sequence as a truncated scalar error followed by unexpected continuation byte errors. Of these two equivalent descriptions, the more semantically rich classification is the one reported.

For example, a surrogate code point sequence ED A0 80 will be reported as three .surrogateCodePointByte errors rather than a .truncatedScalar followed by two .unexpectedContinuationByte errors.
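The reporting rules in this section can be sketched as a single scan over the bytes. This is an illustrative reconstruction under stated assumptions, not the standard library's implementation. `ByteErrorKind`, `ByteError`, and `findErrors(in:)` are hypothetical names, and reporting a byte in F5..FF as a one-byte error with no attached continuation bytes is a simplification chosen for this sketch.

```swift
enum ByteErrorKind: Equatable {
    case unexpectedContinuationByte, truncatedScalar, surrogateCodePointByte
    case invalidNonSurrogateCodePointByte, overlongEncodingByte
}

/// Hypothetical error record: a kind plus the byte offsets it covers.
struct ByteError: Equatable {
    let kind: ByteErrorKind
    let byteOffsets: Range<Int>
}

func findErrors(in bytes: [UInt8]) -> [ByteError] {
    let continuation: ClosedRange<UInt8> = 0x80...0xBF
    var errors: [ByteError] = []
    var i = 0
    while i < bytes.count {
        let lead = bytes[i]
        var length = 1                      // expected sequence length
        var semantic: ByteErrorKind? = nil  // set for every error except truncation
        switch lead {
        case 0x00...0x7F: break
        case 0x80...0xBF: semantic = .unexpectedContinuationByte
        case 0xC0...0xC1: length = 2; semantic = .overlongEncodingByte
        case 0xC2...0xDF: length = 2
        case 0xE0...0xEF: length = 3
        case 0xF0...0xF4: length = 4
        default:          semantic = .invalidNonSurrogateCodePointByte  // F5..FF
        }
        if i + 1 < bytes.count {
            // These errors are detectable by the second byte at the latest.
            switch (lead, bytes[i + 1]) {
            case (0xE0, 0x80...0x9F), (0xF0, 0x80...0x8F): semantic = .overlongEncodingByte
            case (0xED, 0xA0...0xBF): semantic = .surrogateCodePointByte
            case (0xF4, 0x90...0xBF): semantic = .invalidNonSurrogateCodePointByte
            default: break
            }
        }
        // Consume the continuation bytes that belong to this sequence.
        var end = i + 1
        while end < bytes.count, end < i + length, continuation.contains(bytes[end]) {
            end += 1
        }
        if let kind = semantic {
            // One error per byte, carrying the richer classification.
            for offset in i..<end {
                errors.append(ByteError(kind: kind, byteOffsets: offset..<offset + 1))
            }
        } else if end - i < length {
            // A single multi-byte range for a truncated scalar.
            errors.append(ByteError(kind: .truncatedScalar, byteOffsets: i..<end))
        }
        i = end
    }
    return errors
}
```

With this sketch, `findErrors(in: [0xED, 0xA0, 0x80])` yields three one-byte `.surrogateCodePointByte` errors, while a cut-off sequence such as `F1 80 80` yields one `.truncatedScalar` error covering all three bytes.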

Other commonly reported error ranges can be constructed from this result. For example, PEP 383’s error-per-byte can be constructed by mapping over the reported range. Similarly, a single error for the longest invalid byte range can be constructed by joining adjacent error ranges.

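| Algorithm       | 61   | F1  | 80  | 80  | E1  | 80  | C2  | 62   |
| --------------- | ---- | --- | --- | --- | --- | --- | --- | ---- |
| Longest range   | U+61 | err |     |     |     |     |     | U+62 |
| Maximal subpart | U+61 | err |     |     | err |     | err | U+62 |
| Error per byte  | U+61 | err | err | err | err | err | err | U+62 |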
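The other two rows of the table can be derived from the maximal subpart row. The sketch below works on plain byte-offset ranges and assumes they are sorted and non-overlapping. The function names `errorPerByte(_:)` and `longestRanges(_:)` are hypothetical, and the input ranges are written out by hand for the bytes 61 F1 80 80 E1 80 C2 62.

```swift
// Maximal subpart ranges for 61 F1 80 80 E1 80 C2 62:
// F1 80 80 (offsets 1..<4), E1 80 (offsets 4..<6), C2 (offset 6..<7).
let maximalSubparts: [Range<Int>] = [1..<4, 4..<6, 6..<7]

/// Error per byte: split every reported range into one-byte ranges.
func errorPerByte(_ ranges: [Range<Int>]) -> [Range<Int>] {
    var result: [Range<Int>] = []
    for range in ranges {
        for offset in range {
            result.append(offset..<offset + 1)
        }
    }
    return result
}

/// Longest range: join ranges where one ends exactly where the next begins.
func longestRanges(_ ranges: [Range<Int>]) -> [Range<Int>] {
    var merged: [Range<Int>] = []
    for range in ranges {
        if let last = merged.last, last.upperBound == range.lowerBound {
            merged[merged.count - 1] = last.lowerBound..<range.upperBound
        } else {
            merged.append(range)
        }
    }
    return merged
}

let perByte = errorPerByte(maximalSubparts)  // six one-byte ranges, offsets 1 through 6
let longest = longestRanges(maximalSubparts) // a single range, 1..<7
```

Both derivations only need the reported ranges, which is why the maximal subpart result is a convenient common starting point.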
