Regular expressions
Regular expressions (also called regex or regexp) are a tool used by programmers, developers, and advanced users to search text quickly. A regular expression is a sequence of characters that specifies a match pattern in a text, a file, or a group of files. Such patterns are often used by string-search functions, programs, and algorithms for "find" or "find and replace" operations, e.g., in text files. A related pattern that may be familiar is the wildcard: in shell file-name patterns an asterisk stands for any run of characters, so "g*e" matches names such as "glute" or "google". In regular expression syntax the equivalent pattern is "g.*e", because there the asterisk is a quantifier that repeats the preceding element rather than a wildcard in its own right. Regular expressions are common in Linux search tools, such as the find and grep commands, and in text manipulation programs like sed and awk.
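The "find" and "find and replace" operations described above can be sketched in Python with the standard `re` module. The sample sentence and variable names are made up for illustration; other tools and languages expose the same two operations under different names.

```python
import re

text = "grep finds lines; sed edits lines."

# "Find": locate the first word that starts with "l" and ends with "s".
match = re.search(r"l\w+s", text)
first_hit = match.group(0) if match else None

# "Find and replace": swap every occurrence of "lines" for "rows".
replaced = re.sub(r"lines", "rows", text)

print(first_hit)   # lines
print(replaced)    # grep finds rows; sed edits rows.
```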
Regular expression syntax consists of the following types of characters.
- Regular text characters
- Special characters for specific search functions
- Escape sequences (which indicate, e.g., that a certain character should be interpreted literally, or is meant as a regular character rather than as a special character)
- Grouping symbols
- Quantifiers (to indicate how many of a certain character is desired)
- Position anchors (to indicate a specific position in a line or word)
- Character classes, or metacharacters to match multiple characters of one type at once
- Operators, e.g., symbols for OR and AND functions
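As a rough illustration of how these kinds of characters combine, the sketch below annotates a single pattern piece by piece. It assumes Python's `re` module and a hypothetical phone-number format; the name `PHONE` is invented for this example.

```python
import re

# re.VERBOSE ignores whitespace inside the pattern and allows comments.
PHONE = re.compile(r"""
    ^                   # position anchor: start of the string
    (\d{3}|\(\d{3}\))   # group holding two alternatives joined by the OR operator
    -                   # regular character: a literal hyphen
    \d{3}               # character class \d with the quantifier {3}
    -
    \d{4}
    $                   # position anchor: end of the string
""", re.VERBOSE)

plain = PHONE.search("806-867-5309") is not None
bracketed = PHONE.search("(806)-867-5309") is not None
too_short = PHONE.search("806-867-530") is not None

print(plain, bracketed, too_short)   # True True False
```

Inside the group, `\(` and `\)` are escape sequences that stand for literal parentheses, while the unescaped parentheses do the grouping.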
1 Regular characters
Regular characters are ordinary text characters that have no special function, that is, characters that do not normally act as anchors, special characters, metacharacters, escape characters, grouping symbols, quantifiers, or operators. This includes most Latin letters, Arabic numerals, and non-alphanumeric symbols that have no assigned special regex function.
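A minimal sketch of regular characters matching themselves, assuming Python's `re` module. The sample strings are made up; `re.escape` is a standard-library helper that backslash-escapes special characters so they are treated as regular text.

```python
import re

# Regular characters match themselves; matching is case-sensitive by default,
# so "Cat" is not found.
hits = re.findall(r"cat", "cat concatenate Cat")

# re.escape() turns "3.14" into a pattern in which the period is literal.
literal = re.escape("3.14")
dot_hits = re.findall(literal, "3.14 3x14")

print(hits)       # ['cat', 'cat']
print(dot_hits)   # ['3.14']
```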
2 Character classes
A character class matches any one character from a set of characters.
Character class | Description | Example regex | Example result |
---|---|---|---|
[char_group] | Match any single character in the group of characters within the square brackets; matching is case-sensitive. | [ae] [ae] [OE] | "a", "e" in "graze"; "a", "e" in "Dane"; "O", "E", "O" in "COELOCANTH" |
[^char_group] | Negation: Match any character that is not in the specified group in brackets | [^aei] | "f", "g", "n" in "feign" |
[first-last] | Character range: Match any single character in the specified range from first to last. | [A-Z] [0-9] | "A", "B", "C", "D" in "ABCD123456"; "1", "2", "3", "4", "5", "6" in "ABCD123456" |
. | Wildcard: Match any single character except \n (newline). To match a literal period character (.), an escape character must be used (\.). | a.e a.e e\. | "ave" in "crave"; "ate" in "pollinate"; "e." in "alienate." |
\p{name} | Match any single character in a Unicode general category or a named Unicode character block (see below) | \p{IsCyrillic} | Each letter of "Калининград" in "Калининград City" |
\P{name} | Match any single character that is not in the specified Unicode general category or character block | \P{IsCyrillic} | " ", "C", "i", "t", "y" in "Калининград City" |
\w | Match any word character, i.e., any alphanumeric character or underscore | \w | "4", "I", "D", "B", "3", "1", "4" in "4 ID $B.&3.14" |
\W | Match any non-word or non-alphanumeric character | \W | " ", "." in "IC B2.6" |
\s | Match any white-space character | \w\s | "C " in "IC B2.6" |
\S | Match any non-white-space character | \S | Every character except the space in "intl. __int" |
\d | Match any numerical digit | \d | "6" in "ab.6$ = IC" |
\D | Match any character other than a numerical digit. | \D | "a", "b", ".", "$", " ", "=", " ", "I", "C" in "ab.6$ = IC" |
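Several rows of the table can be checked with a short script. This sketch assumes Python's `re` module, whose `findall` returns each match as a separate list item; note that `re` does not support the `\p{name}` and `\P{name}` classes, so those rows are left out.

```python
import re

vowels = re.findall(r"[ae]", "graze")            # bracketed character group
not_aei = re.findall(r"[^aei]", "feign")         # negated group
uppercase = re.findall(r"[A-Z]", "ABCD123456")   # character range
wildcard = re.findall(r"a.e", "crave")           # "." matches any one character
word_space = re.findall(r"\w\s", "IC B2.6")      # word character, then white space
digits = re.findall(r"\d", "ab.6$ = IC")
non_digits = re.findall(r"\D", "ab.6$ = IC")

print(vowels)       # ['a', 'e']
print(not_aei)      # ['f', 'g', 'n']
print(uppercase)    # ['A', 'B', 'C', 'D']
print(wildcard)     # ['ave']
print(word_space)   # ['C ']
print(digits)       # ['6']
print(non_digits)   # ['a', 'b', '.', '$', ' ', '=', ' ', 'I', 'C']
```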
3 Special characters
These symbols have special functions in searches.
Special character | Description |
---|---|
\ | Escape character: Indicates that the following character is a literal or non-special character; e.g., \\ indicates a literal backslash |
\n | New line |
\r | Carriage return (not the same as \n) |
\s | White space |
\t | Tab spacing |
\v | Vertical tab |
\f | Form feed |
\yy | Octal character 'yy', e.g., \044 = octal character 44 |
\xhh | Hexadecimal character 'hh', e.g., \xcd = hex character cd |
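The escapes in this table can be tried out as follows. This is a sketch assuming Python's `re` module and a made-up sample string; in that module an octal escape needs a leading zero or three digits (for example `\044`) so that it is not read as a reference to a group.

```python
import re

# The sample contains a tab, a carriage return plus newline, one backslash, and "$".
sample = "name\tvalue\r\nC:\\temp $5"

tabs = re.findall(r"\t", sample)          # tab
line_end = re.findall(r"\r\n", sample)    # carriage return followed by newline
backslash = re.findall(r"\\", sample)     # escaped backslash matches one literal backslash
hex_hit = re.findall(r"\x24", sample)     # hexadecimal 24 is "$"
octal_hit = re.findall(r"\044", sample)   # octal 44 is also "$"

print(len(tabs), len(line_end), len(backslash))   # 1 1 1
print(hex_hit, octal_hit)                         # ['$'] ['$']
```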
4 Quantifiers
A quantifier specifies how many instances of a previous element (such as a single character or a group) must be present in the input string for a valid match.
Quantifier | Description | Example regex | Example result |
---|---|---|---|
* | Match the previous element zero or more times. | a.*c | "ababc" in "ababcdefg" |
+ | Match the previous element one or more times. | "ab+" | "ab" (twice) in "abab", "ab" in "abjad" |
? | Match the previous element zero or one time. | "rex?" | "rex" in "rex romanus" |
{n} | Match the previous element exactly n times. | ",\d{3}" | ",086" in "1,086.4", ",514", ",175", and ",278" in "7,514,175,278" |
{n,} | Match the previous element at least n times. | "\d{2,}" | "42", "1040" in "ab1.42.1040xz" |
{n,m} | Match the previous element between n and m times. | "\d{3,5}" | "31415", "00000", "000" in "ab1.42.31415.xy00000000" |
*? | Match the previous element zero or more times, but as few times as possible. | a.*?c | "abc" in "abcbc" |
+? | Match the previous element one or more times, but as few times as possible. | "be+?" | "be" in "been", "be" in "bent" |
?? | Match the previous element zero or one time, but as few times as possible. | "rei??" | "re" in "rein" |
{n}? | Match the preceding element exactly n times. | ",\d{3}?" | ",023" in "1,023.6", and ",514", ",271", and ",314" in "9,514,271,314" |
{n,}? | Match the previous element at least n times, but as few times as possible. | "\d{2,}?" | "51", "27", "31" in "9,514,271,314" |
{n,m}? | Match the previous element between n and m times, but as few times as possible. | "\d{3,5}?" | "213", "045" in "213045" |
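The difference between the default (greedy) quantifiers and their "as few times as possible" (lazy) counterparts is easiest to see side by side. The sketch below assumes Python's `re` module and reuses sample strings from the table.

```python
import re

# Greedy: ".*" takes as much as it can while still letting the final "c" match.
greedy = re.findall(r"a.*c", "abcbc")
# Lazy: ".*?" stops at the first "c" it can reach.
lazy = re.findall(r"a.*?c", "abcbc")

at_least_two = re.findall(r"\d{2,}", "ab1.42.1040xz")
lazy_two = re.findall(r"\d{2,}?", "9,514,271,314")
three_to_five = re.findall(r"\d{3,5}", "ab1.42.31415.xy00000000")
lazy_plus = re.findall(r"be+?", "been")

print(greedy)          # ['abcbc']
print(lazy)            # ['abc']
print(at_least_two)    # ['42', '1040']
print(lazy_two)        # ['51', '27', '31']
print(three_to_five)   # ['31415', '00000', '000']
print(lazy_plus)       # ['be']
```

A lazy quantifier still has to satisfy its minimum, which is why `\d{2,}?` returns two digits at a time rather than one.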
5 Anchors
Anchor | Description | Example regex | Example result |
---|---|---|---|
^ | Search from beginning of line or string | ^\d{3} | "806" in "806-867-5309" |
$ | End of line or string, or before \n | -\d{4}$ | "-5309" in "806-867-5309" |
\A | Search from start of string. | \A\d{3} | "806" in "806-867-5309" |
\Z | End of string or before \n at end of string | -\d{4}\Z | "-5309" in "806-867-5309" |
\z | Search at very end of string. | -\d{4}\z | "-5309" in "806-867-5309" |
\G | Search from the point where a previous match ended (or if there was no previous match, then from the position where matching started) | \G\(\d\) | "(1)", "(3)", "(5)" in "(1)(3)(5)[7](9)" |
\b | Search starting from boundary between a \w (alphanumeric) and a \W (nonalphanumeric) character | \b\w+\s\w+\b | "this that", "these those" in "this that these those" |
\B | Search for a match not occurring on a \b boundary. | \Bend\w*\b | "ends", "ender" in "end rends obdurate lender" |
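Most of these anchors can be tried in Python, with some caveats: the standard `re` module has no `\G` anchor, and its `\Z` matches only at the very end of the string, so this sketch sticks to `^`, `$`, `\A`, `\b`, and `\B`. The `re.MULTILINE` flag is a Python-specific way to make `^` and `$` work line by line.

```python
import re

phone = "806-867-5309"
start = re.findall(r"^\d{3}", phone)     # anchored at the start of the string
end = re.findall(r"-\d{4}$", phone)      # anchored at the end; the hyphen is part of the match

pairs = re.findall(r"\b\w+\s\w+\b", "this that these those")
inner = re.findall(r"\Bend\w*\b", "end rends obdurate lender")

# With re.MULTILINE, ^ and $ match at every line; \A still matches only at the string start.
per_line = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)
string_start = re.findall(r"\A\w+", "one\ntwo", re.MULTILINE)

print(start)          # ['806']
print(end)            # ['-5309']
print(pairs)          # ['this that', 'these those']
print(inner)          # ['ends', 'ender']
print(per_line)       # ['one', 'two']
print(string_start)   # ['one']
```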
6 Grouping symbols
Grouping symbols mark off subexpressions of a regular expression and typically capture the substrings of the input string that those subexpressions match.
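A minimal sketch of grouping with parentheses, assuming Python's `re` module. The sample text is made up, and the `(?P<name>...)` form for named groups is Python's spelling; other regex flavors write named groups differently.

```python
import re

# Each pair of parentheses captures the substring its subexpression matched.
match = re.search(r"(\d{3})-(\d{3})-(\d{4})", "call 806-867-5309 now")
area, exchange, line = match.groups()

# A named group can be read back by name instead of by position.
named = re.search(r"(?P<area>\d{3})-\d{3}-\d{4}", "call 806-867-5309 now")
named_area = named.group("area")

# A quantifier placed after a group applies to the whole group.
repeated = re.search(r"(ab)+", "ababab").group(0)

print(area, exchange, line)   # 806 867 5309
print(named_area)             # 806
print(repeated)               # ababab
```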