Overview
Regular expressions are indispensable when looking for a specific information, for example while peering thru log files. When used in programming languages, it's highly versatile to filter out information or validating inputs without needing to write lots of complex code.
The following websites contain in depth information and examples:
The following websites contain interactive tutorials:
The following websites provide test harness to test regex expressions and even debug.
The following websites facilitate compiling and running code samples in C++ and other platforms
Notepad++ supports regular expression search and replace.
Validation
Regular expressions can be used to validate an input. Some examples are below.
Regular Expression | Description | Inputs | Demo |
\d{3}-\d{2}-\d{4} | US social security number | 123-45-6789 | Example |
\((\d{3})\)\d{3}-\d{4} | US phone number | (408)333-4444 | Example 2 |
9505[0-6] | santa clara zip codes | 95056 | Example 3 |
(?:0?\d|1[0-2]|jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[ \/-](?:0?\d|1\d|2\d|3[0-1])[ \/-]\d{4} | date in different formats with limited validation | 1/1/1987 09-21-2018 mar 8 1969 7 17 1975
| Example 4 |
Regular expressions can be used to extract information from the input. To extract information, capture groups can be used where the information inside () will be extracted. Some examples are below.
Regular Expression | Description | Inputs | Demo |
---|
(\(\d{3}\))-\d{3}-\d{4} | extract area code from a US phone number | (408)-333-4444 | Example 5 |
\s+(\w+)\s+\1 | find all the duplicate words | state of of the art | Example 6 |
Basics
A regular expression basically searches for a pattern. A pattern can be simple text comprising of a few letters or numbers. Example: Bangalore 560082
A pattern can be complex comprising of character classes, quantifiers, grouping etc. as shown in the examples above. The structure of such pattern is governed by a set of regular expression grammar rules. The grammar uses these metacharacters <([{\^-=$!|]})?*+.>
The following topics describe each feature of the grammar in detail
Dot character
Patterns can use . to map any character except some control characters. Note that it has no effect in a character class construct and maps to decimal point.
Some examples are
cat,
Example etc.
Example 7
Character class
The basic ingredient of the regular expression grammar is a character class. The structure of a character class is defined as below. Note that the yellow background characters indicates matches in the input text.
Construct | Description | Matches | Demo |
[ae] | a or e (simple class) | gray grey | Example 8 |
[^aeiou] | Any character except aeiou (negation) | marcial | Example 9 |
[a-zA-Z] | a through z, or A through Z inclusive (range) | Khri$ha | Example 10 |
[^7-9] | any number other than 7-9 | 95056 | Example 11 |
Predefined character classes or Shorthands
To reduce clutter, shorthands to character classes are provided. These can be freely used in another character class or even in pattern. The shorthands and their expansions are below. Note that the yellow background characters indicates matches in the input text.
Construct | Description | Matches | Demo |
\d | A digit: [0-9] | $10.99 | Example 12 |
\D | A non-digit: [^0-9] | $10.99 | Example 13 |
\s | A whitespace character: [ \t\n\x0B\f\r] | try it! | Example 14 |
\S | A non-whitespace character: [^\s] | try it! | Example 15 |
\w | A word character: [a-zA-Z_0-9] | try it! | Example 16 |
\W | A non-word character: [^\w] | try it! | Example 17 |
Escaping
Some times a meta character needs to be escaped in a pattern or a character class. Escaping is done by placing \ in front of the meta character.
Examples: \[ or \] escapes[ and ] meta characters and matches [] in the pattern as in
[1,2,3
]. \. escapes decimal point as in 10
.99
Example 18
Anchors and boundary markers
Anchors and boundary markers marks special locations such as beginning or ending of the lines, word boundaries etc. These can be used only in the pattern and not in character classes. Note that the yellow background indicates markers in the input text: "Hello, World!"
Construct | Description | Matches | Demo |
^ | The beginning of a line | Hello, World! | Example 19 |
$ | The end of a line | Hello, World! | Example 20 |
\b | A word boundary | Hello , World ! | Example 21 |
\B | A non-word boundary | H e l l o, W o r l d! | Example 22 |
\A | The beginning of the input | Hello, World! | Example 23 |
\G | The end of the previous match | Hello, World! | Example 24 |
\Z | The end of the input but for the final terminator, if any | Hello, World! | Example 25 |
\z | The end of the input | Hello, World! | Example 26 |
Quantifiers
Quantifiers determine repetitiveness of a token in the pattern. Quantifiers applies to any token in the pattern only. The table below lists in detail. Note that <empty> means blank or no matches were found. Matches are highlighted in yellow.
Greedy, Lazy and Possessive Quantifiers
The results of the same pattern for the same input but different quantifiers can be surprisingly different.
This is more pronounced when the pattern has a dot (.) followed by ? or * or +.
This has to do with the how much the regular expression engine grabs the input text for matching and then backtracks when a match is not found. During backtracking, the regular expression engine looses one token from the grabbed text and tries again. This repeats till the grabbed text is empty or a match is found.
In case of greedy, the entire input is grabbed and when no match is found, backtracking happens.
In case of lazy, a few tokens are grabbed and when no match is found, backtracking happens.
In case of Possessive, the entire input is grabbed and when no match is found, no backtracking happens.
By default, quantifiers are greedy. They can be made lazy by adding ? or Possessive by adding + as shown below.
Greedy | Lazy | Possessive | Meaning |
X* | X*? | X*+ | X, zero or more times |
X+ | X+? | X++ | X, one or more times |
X{n} | X{n}? | X{n}+ | X, exactly n times |
X{n,} | X{n,}? | X{n,}+ | X, at least n times |
X{n,m} | X{n,m}? | X{n,m}+ | X, at least n but not more than m times |
For example, consider text "This is a <B> bold </B> text" . Using the pattern <.*>, the expectation is to match, <B> and </B>. However Greedy matches more and Possessive matches none. Only lazy matches correctly.
Pattern | Remark | Match | Demo |
---|
<.*> | Greedy | This is a <B> bold </B> example | Example 33 |
<.*?> | Lazy | This is a <B> bold </B> example | Example 34 |
<.*+> | Possessive | This is a <B> bold </B> example | Example 35 |
Capture groups
The capture groups are one of the key aspects of the regular expression. It enables capturing a specific information such as area code as seen in the example below. The groups are defined and enclosed in ().
For example, the pattern \((\d{3})\)\d{3}-\d{4} matches
(408)333-4444. Example 36 Following topics discuss different features of the same in depth,
OR (|) Operator
Suppose the phone numbers of city of Los Angeles needs to be filtered, the OR operator(|) can be used in the capture groups. For example, the pattern below has a group setup as (213|323).
For example, the pattern \((213|323)\)\d{3}-\d{4} matches
(213)123-4567 or
(323)456-1234 Example 37
Non capturing groups
Non capturing groups are used for efficiency and optimization. As the name indicates, contents of non capturing groups are discarded. The Non capturing groups are defined and enclosed in (?:)
For example, the pattern
(\w\w) (\d{5})-(?:\d{4}) discards the last 4 numbers of the zip code
CA 95131-3059
Example 38
References
Captured groups in a pattern are internally labeled as \1, \2 , \3 etc. as shown below.
([0-9])([-/ ])[a-z][-/ ]([0-9])
|--1--||--2--| |--3--|
Nested references are labeled differently as shown below.
(([0-9])([-/ ]))([a-z])
|--2--||--3--|
|-------1------||--4--|
A reference refers to a previously captured group in the pattern. These are useful when looking for duplicate words in a text. There can be 3 different kinds of references.
Back reference
Back references are located after the captured group. For example, the pattern
\s(\w*)\s\1 looks for adjacent duplicate words in a text. Here
\1 refers to the first global captured group. For the input "This
is is a test" a match is made as highlighted in blue.
Example 39
Forward reference
Forward references are located before the captured group. For example, the pattern
(\2two|(one)) looks for the text in the second global captured group "one" in a text. For the input "
oneonetwo" a match is made as highlighted in blue.
Example 40
Nested reference
Nested reference are defined with in a captured group and refers to sub captured group defined with in it. For example,
(\1two|(one)),
\1 refers to the first relative captured group with in the outer captured group. For the input "
oneonetwo" a match is made as highlighted in blue.
Example 41
In addition to the above, references can also be named and used with \k switch. Relative referencing is also supported with negative numbers using \k switch.
Named references
Captured groups can have names and they can be used instead of numbers. For example, the back reference pattern discussed earlier can be rewritten as
\s(?'dup'\w*\s)\k'dup'.
Example 43 Note that the captured group name is defined as dup. The syntax is
?'name'. It can be referenced as
\k'name'. Alternatively, \g can also be used.
\g'name' for referencing.
Example 42 Duplicate names are allowed however the last captured group with the same name is used for matching.
Relative references
References to relatively placed capture groups are allowed with \kn. Here n needs to be negative number starting with -1. For example for the pattern (a)(b)(c)\k-3 matches abca. Similarly
(a)(b)(c)\k-1 matches
abcc. Another pattern
(a)(b)(c\k-2) matches
abcb.
Example 44
Advanced Features
Flags
The behavior of the regular expression engine can be changed by setting flags in the pattern. For example, to make the search case insensitive. The syntax is
(?gi) turns on global flag and case insensitive flag.
Example 45 Similarly,
(?gi-mx) turns on global flag and case insensitive flag and turns off multiline and skipping whitespace.
The following is a partial list supported by most engines.
g: matches the pattern multiple times
i: makes the regex case insensitive
m: enables multi-line mode. Where ^ and $ match the start and end of the entire string. Without this, multi-line strings match the beginning and end of each line.
u: enables support for unicode
s: short for single line, it causes the . to also match new line characters
x: ignore whitespace
U:make quantifiers lazy
Unicode support
First some basic information. Unicode standard describes how to represent characters of all the languages in the world - living or dead. In Unicode, a codepoint describe an unique artifact of the script. It can be a character or it can be combining mark. For example a (U+61) and combining grave accent ò. (U+300) An unit of readable representation of the script is called grapheme cluster. It can consists of one or more codepoints. For example a (U+61) or à (U+61 and U+300). Note that à can also be represented as single code point (U+E0).
The dot (.) equivalent of unicode is \X except it also matches line breaks. A single codepoint can also be represented \x{FFFF} where FFFF is the codepoint.
Unicode categories
Unicode also defines categories that are represented as \p{xxx} where xxx can be languages(\p{L}), mark(\p{M}), numbers(\pN), currencies(\p{Sc}) etc. \P{xxx} matches anything that does not belong to that category.
Examples:
\p{M}*\p{L}* matches kannada script "ಖ್ರಿಷಾ" as six different code points
ಖ ್ ರ ಿ ಷ ಾ Example 47 Here \p{L} maps to ಖ ರ ಷ and \p{M} maps to ್ ಿ ಾ
combining code points ಖ and ್ yields ಖ್
combining code points ರ and ಿ yields ರಿ
combining ಖ್ and ರಿ yields ಖ್ರಿ
combining code points ಷ and ಾ yields ಷಾ
combining ಖ್ರಿ and ಷಾ yield ಖ್ರಿಷಾ
Branch Reset Groups
Consider a pattern
(1a)|(2a)|(1b)\1. This defines three capture groups. For the input 1a1a, it is expected to match, however it does not. The solution is to use a branch reset group. The pattern
(?|(1a)|(2a)|(1b))\1 defines one capture group that matches inputs
1a1a or
2a2a or
1b1b. Example 48
LookAround
There are 4 types look around, positive/negative look ahead/behind. Collectively they are called lookaround, are zero-length assertions just like the start and end of line, or start and end of word anchors. The difference is that lookaround actually matches characters, but discards it, returning only the result: match or no match.
Negative lookahead can be used if you want to match something not followed by something else. For example q(?!u) matches words like qack but not quit.
Positive lookahead works just the opposite. For example q(?=u) matches words like quit but not qack.
Lookbehind works backwards. Negative lookbehind can be used if you want to match something not preceded by something else. For example, (?<!a)b matches a “b” that is not preceded by an “a”. It doesn’t match cab, but matches the b (and only the b) in bed or debt. whereas positive lookbehind
(?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
Detailed example
Let's say there is an inventory of different writing items in different colors as below:
black pen
black pencil
red pen
red crayon
purple crayon
The Look around functions can be used to filter out unique items based on certain criteria as discussed below.
Lookaround | Pattern | Description | Demo |
---|
Positive Look ahead | (\w+) (?=pen\s) | extract all the colors of all the pen in the inventory. (black, red) | Example 49 |
Negative Look ahead | (\w+) (?=pen\s) | extract all the colors of all the items in the inventory that are not pen. (black, red, purple) | Example 50 |
Positive Look behind | (\w+) (?=pen\s) | extract all the black color items in the inventory. (pen, pencil) | Example 51 |
Negative Look behind | (\w+) (?=pen\s) | extract all the items in the inventory that are not black color (pen, crayon) | Example 52 |
LookBehind with \K
Due to certain restriction in matching expression of positive lookbehind i.e., <=expression, as an alternative to positive lookbehind, \K switch can be used. For example, the pattern
Atomic Grouping
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround groups are also atomic.
Example: The pattern
a(?>bc|b)c matches
abcc but not
abc.
Example 54 When applied to abc, a matches to a, bc to bc, and then c will fail to match at the end of the string. In otherwords, backtracking will not happen as in case capture group and failure is reported.
If-Then-Else Conditionals
If-Then-Else is a special construct allows creation of conditional regular expressions. If the if condition evaluates to true, then the regex engine will attempt to match the then part. Otherwise, the else part is attempted instead. The syntax is as below:
(?(condition)then|else)
The else part is optional. The condition can be the number of the group set or a lookaround etc.
Example:
Consider (?:(a)|(b)|(c))(?(n)x|y) where n can be 1 or 2 or 3.
Recursion
Suppose the task is to find out if random number of open and close braces such as () or {} match, regular expression recursion comes to the rescue.
The syntax for recursion is (?R)
For example,
\{(?R)?\} matches
input {{{}}} but fails {{{}}}} Example 58Here { are matched with equal number of }. First, a matches the first { in the string. Then the regex engine reaches (?R). This tells the engine to attempt the whole regex again at the present position in the string. Now, { matches the second { in the string. The engine reaches (?R) again. On the second recursion, { matches the third {. On the third recursion, a fails to match the first } in the string with {. This causes (?R) to fail. But the regex uses a quantifier to make (?R) optional. So the engine continues with } which matches the first } in the string.
Now, the regex engine has reached the end of the regex. But since it’s two levels deep in recursion, it hasn’t found an overall match yet. It only has found a match for (?R). Exiting the recursion after a successful match, the engine also reaches }. It now matches the second } in the string. The engine is still one level deep in recursion, from which it exits with a successful match. Finally, } matches the third } in the string. The engine is again at the end of the regex. This time, it’s not inside any recursion. Thus, it returns {{{}}} as the overall regex match.
The main purpose of recursion is to match balanced constructs or nested constructs as shown below.
Example: The pattern \((?>[^()]|(?R))*\) matches the input
Subroutines
Subroutines are applied to the capture groups. These are very similar to regular expression recursion. Instead of matching the entire regular expression again, a subroutine call only matches the regular expression inside a capturing group. A subroutine call can be made to any capturing group from anywhere in the pattern. A call made to same capturing group leads to recursion.
Recursion can be called in different ways. For example, (?1) calls a numbered group, (?+1) to call the next group, (?-1) to call the preceding group, (?&name) to call a named group.
For example,
(?+1)(?'name'[abc])(?1)(?-1)(?&name) matches a string that is five letters long and consists only of the first three letters of the alphabet such as
abcab,
cabab etc.
Example 60This regex is exactly the same as [abc](?'name'[abc])[abc][abc][abc]
Another example would be ([abc])(?1){4} matches cabab Example 61 Recursion into a capturing group is a more flexible way of matching balanced constructs than recursion of the whole regex. We can wrap the regex in a capturing group, recurse into the capturing group instead of the whole regex, and add anchors outside the capturing group.
The above example of matching equation can be written as
(\((?>[^()]|(?1))*\))
This matches inputs such as ( 10 + 9 ) * ( 13 *7 ) + ( 6 * ( 9 ) * 7 ) Example 62 Another example is to match palindromes
(?'word'(?'letter'[a-z])(?&word)\k'letter'|[a-z]?)
This matches inputs such as radar , dad ,
abba etc
Example 63
Using Regular expressions in C++
It's possible to replace captured groups or entire match. It's discussed below.
Regular expressions is a part of in C++ 11 standard library, however it does not support many features discussed here. An alternate would be to use boost libraries which seems compatible with feature rich perl.
As noted earlier,
wandbox can be used to try out compile and run the example below. The examples discussed here use latest C++ lang compiler along latest boost library.
Majorly, regex programming involves three operations.
Match
To match if the input text is an valid input. Examples: date, email address, phone numbers, SSN etc
Search or extract
Extract certain information from the text. Examples : date, email address, phone numbers, SSN etc
Replace
Replacing certain information from the text. Examples : date, email address, phone numbers, SSN etc
The following
example 64 demonstrates all the three above operations in detail as seen in its output.
Summary of Examples