Match regular expressions on arbitrary bytes.
This module provides a nearly identical API to the one found in the top-level of this crate. There are two important differences:
- Matching is done on &[u8] instead of &str. Additionally, Vec<u8> is used where String would have been used.
- Unicode support can be disabled even when disabling it would result in matching invalid UTF-8 bytes.
Example: match null terminated string
This shows how to find all null-terminated strings in a slice of bytes:
use regex::bytes::Regex;
let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";
// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
Example: selectively enable Unicode support
This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded string (e.g., to extract a title from a Matroska file):
use std::str;
use regex::bytes::Regex;
let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
let caps = re.captures(text).unwrap();
// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes.
let mat = caps.get(1).unwrap();
assert_eq!((7, 10), (mat.start(), mat.end()));
// If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
let title = str::from_utf8(&caps[1]).unwrap();
assert_eq!("☃", title);
In general, if the Unicode flag is enabled in a capture group and that capture is part of the overall match, then the capture is guaranteed to be valid UTF-8.
Syntax
The supported syntax is nearly identical to the syntax for Unicode regular expressions, with a few changes that make sense for matching arbitrary bytes:
- The u flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the u flag is disabled, the regex is said to be in “ASCII compatible” mode.
- In ASCII compatible mode, neither Unicode scalar values nor Unicode character classes are allowed.
- In ASCII compatible mode, Perl character classes (\w, \d and \s) revert to their typical ASCII definition. \w maps to [[:word:]], \d maps to [[:digit:]] and \s maps to [[:space:]].
- In ASCII compatible mode, word boundaries use the ASCII compatible \w to determine whether a byte is a word byte or not.
- Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is a Unicode codepoint that matches its UTF-8 encoding of \xC3\xBF. Similarly for octal notation when enabled.
- In ASCII compatible mode, . matches any byte except for \n. When the s flag is additionally enabled, . matches any byte.
Performance
In general, one should expect performance on &[u8] to be roughly similar to performance on &str.
Structs
- CaptureLocations is a low level representation of the raw offsets of each submatch.
- An iterator that yields all non-overlapping capture groups matching a particular regular expression.
- An iterator over the names of all possible captures.
- Captures represents a group of captured byte strings for a single match.
- Match represents a single match of a regex in a haystack.
- An iterator over all non-overlapping matches for a particular string.
- NoExpand indicates literal byte string replacement.
- A compiled regular expression for matching arbitrary bytes.
- A configurable builder for a regular expression.
- Match multiple (possibly overlapping) regular expressions in a single scan.
- A configurable builder for a set of regular expressions.
- By-reference adaptor for a Replacer
- A set of matches returned by a regex set.
- An owned iterator over the set of matches from a regex set.
- A borrowed iterator over the set of matches from a regex set.
- Yields all substrings delimited by a regular expression match.
- Yields at most N substrings delimited by a regular expression match.
- An iterator that yields all capturing matches in the order in which they appear in the regex.
Traits
- Replacer describes types that can be used to replace matches in a byte string.