Enum regex_syntax::hir::Class
pub enum Class {
    Unicode(ClassUnicode),
    Bytes(ClassBytes),
}
The high-level intermediate representation of a character class.
A character class corresponds to a set of characters. A character is either
defined by a Unicode scalar value or a byte. Unicode characters are used
by default, while bytes are used when Unicode mode (via the u flag) is
disabled.
A character class, regardless of its character type, is represented by a sequence of non-overlapping non-adjacent ranges of characters.
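The range invariant described above can be illustrated with a small standalone sketch. It uses only the standard library; `canonicalize` is a hypothetical helper written for this example, not a function of regex-syntax, and it works on plain `u32` code points without validating that they are Unicode scalar values.

```rust
/// Hypothetical helper (not part of regex-syntax): merges inclusive
/// `(start, end)` ranges into a sorted list whose ranges neither overlap
/// nor sit directly next to each other.
/// Assumes `start <= end` for every input range.
fn canonicalize(mut ranges: Vec<(u32, u32)>) -> Vec<(u32, u32)> {
    ranges.sort();
    let mut out: Vec<(u32, u32)> = Vec::new();
    for (start, end) in ranges {
        if let Some(last) = out.last_mut() {
            // Merge when the new range overlaps the previous one or
            // begins immediately after it (adjacent).
            if start <= last.1.saturating_add(1) {
                last.1 = last.1.max(end);
                continue;
            }
        }
        out.push((start, end));
    }
    out
}
```

Under these assumptions, the ranges a-c, d-f and b-e collapse into the single range a-f, while a separate range x-z stays as its own entry.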
Note that unlike Literal, a Bytes variant may
be produced even when it exclusively matches valid UTF-8. This is because
a Bytes variant represents an intention by the author of the regular
expression to disable Unicode mode, which in turn impacts the semantics of
case insensitive matching. For example, (?i)k and (?i-u)k will not
match the same set of strings.
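One character that commonly accounts for this difference is U+212A KELVIN SIGN, which Unicode case mappings relate to k but which ASCII-only case rules leave alone. The sketch below shows the distinction using only the standard library. Both helper names are hypothetical, and lowercasing via `char::to_lowercase` is used here as a stand-in for simple case folding; it does not consult the tables used by regex-syntax.

```rust
/// Hypothetical helper: compares two characters after Unicode lowercasing.
/// This approximates Unicode-aware case insensitivity for illustration.
fn eq_unicode_caseless(a: char, b: char) -> bool {
    a.to_lowercase().eq(b.to_lowercase())
}

/// Hypothetical helper: compares two characters using ASCII case rules
/// only, which is the behavior expected when Unicode mode is disabled.
fn eq_ascii_caseless(a: char, b: char) -> bool {
    a.eq_ignore_ascii_case(&b)
}
```

With these helpers, `eq_unicode_caseless('\u{212A}', 'k')` holds while `eq_ascii_caseless('\u{212A}', 'k')` does not, and both agree that K and k are equivalent.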
Variants
Unicode(ClassUnicode)
A set of characters represented by Unicode scalar values.
Bytes(ClassBytes)
A set of characters represented by arbitrary bytes (one byte per character).
Implementations
impl Class
pub fn case_fold_simple(&mut self)
Apply Unicode simple case folding to this character class, in place. The character class will be expanded to include all simple case folded character variants.
If this is a byte oriented character class, then this will be limited
to the ASCII ranges A-Z and a-z.
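For the byte oriented case, the effect can be sketched on plain byte ranges. `ascii_case_fold` is a hypothetical standalone function, not the library's implementation. It assumes inclusive `(start, end)` ranges with `start <= end`, and it returns the ranges sorted but not merged.

```rust
/// Hypothetical sketch (not regex-syntax code): extends a set of inclusive
/// byte ranges with the ASCII case counterparts of any letters they hold.
/// Bytes outside A-Z and a-z are left untouched.
fn ascii_case_fold(ranges: &[(u8, u8)]) -> Vec<(u8, u8)> {
    let mut out = ranges.to_vec();
    for &(start, end) in ranges {
        // Part of the range inside a-z: add the uppercase counterpart.
        let (lo, hi) = (start.max(b'a'), end.min(b'z'));
        if lo <= hi {
            out.push((lo - 32, hi - 32));
        }
        // Part of the range inside A-Z: add the lowercase counterpart.
        let (lo, hi) = (start.max(b'A'), end.min(b'Z'));
        if lo <= hi {
            out.push((lo + 32, hi + 32));
        }
    }
    out.sort();
    out
}
```

Under these assumptions, folding the range a-c yields A-C and a-c, and a range of non-ASCII bytes such as 0x80-0xFF comes back unchanged.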
pub fn negate(&mut self)
Negate this character class in place.
After completion, this character class will contain precisely the characters that weren’t previously in the class.
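Negation over a range-based representation amounts to collecting the gaps between the existing ranges. The following is a minimal sketch for byte classes only, using a hypothetical free function named `negate` that returns a new list instead of mutating in place. It assumes the input is sorted, non-overlapping and non-adjacent, as described for this type.

```rust
/// Hypothetical helper (not regex-syntax code): returns the complement of
/// a sorted, non-overlapping list of inclusive byte ranges over 0x00-0xFF.
fn negate(ranges: &[(u8, u8)]) -> Vec<(u8, u8)> {
    let mut out = Vec::new();
    // First byte not yet accounted for. A u16 is used so the value can
    // reach 256 once the range ending at 0xFF has been seen.
    let mut next: u16 = 0;
    for &(start, end) in ranges {
        if u16::from(start) > next {
            // The gap before this range belongs to the complement.
            out.push((next as u8, start - 1));
        }
        next = u16::from(end) + 1;
    }
    if next <= 0xFF {
        // Everything after the last range belongs to the complement.
        out.push((next as u8, 0xFF));
    }
    out
}
```

In this sketch, negating a-z gives 0x00-0x60 and 0x7B-0xFF, and negating that result gives a-z back.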
pub fn is_always_utf8(&self) -> bool
Returns true if and only if this character class will only ever match valid UTF-8.
A character class can match invalid UTF-8 only when the following conditions are met:
- The translator was configured to permit generating an expression that can match invalid UTF-8. (By default, this is disabled.)
- Unicode mode (via the u flag) was disabled either in the concrete syntax or in the parser builder. By default, Unicode mode is enabled.
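To see why a byte oriented class can step outside valid UTF-8, note that any single byte at or above 0x80 is not valid UTF-8 on its own. The sketch below is one plausible way to express that check on plain byte ranges. The function name is hypothetical, and treating "all ranges are ASCII" as the criterion is an assumption made for illustration, not a statement about how the library computes its answer.

```rust
/// Hypothetical check (not regex-syntax code): a class that matches one
/// byte per character can only ever match valid UTF-8 if every byte it
/// contains is ASCII. Assumes inclusive ranges with start <= end.
fn byte_class_is_always_utf8(ranges: &[(u8, u8)]) -> bool {
    ranges.iter().all(|&(_, end)| end <= 0x7F)
}

/// Hypothetical helper: reports whether a single byte, taken by itself,
/// forms valid UTF-8 according to the standard library.
fn single_byte_is_utf8(byte: u8) -> bool {
    std::str::from_utf8(&[byte]).is_ok()
}
```

Under this assumption, a class such as a-z passes the check, while a class containing 0x80-0xFF does not, which agrees with `single_byte_is_utf8(0xFF)` returning false.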