Modern C++ 11,14: Quick tour with examples: String and Literals


Monday, November 13, 2023

basic_string

Overview

Strings are an essential part of any programming language. In C and C++, a string is represented as a sequence of characters terminated by a null character.

A string can be created as a const char* or as a char[]. An array decays to a pointer when it is passed by value, so the receiving function sees it as const char*, as shown below. Note that the compiler implicitly appends the null character to a string literal. These strings are also known as C strings (cstring).

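#include <iostream>
#include <typeinfo>
using namespace std;

// Prints the implementation-defined name of the deduced parameter type.
template <typename T>
void printtype(T)
{
    cout << typeid(T).name() << endl;
}

//Example
int main()
{
    const char *str = "hello,";
    const char str2[9] = "world!";
    printtype(str);   //prints char const * (MSVC; GCC and Clang print the mangled name PKc)
    printtype(str2);  //prints char const *, because the array decays to a pointer
}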

The example 6 depicts the usage.

Details

The C runtime library (CRT) provides a plethora of functions to handle C strings. There are separate functions for getting the length, appending, copying etc. (for example strlen, strcat and strcpy).

The std::basic_string class wraps strings in an object so that they are easier to use. string is a class template, defined as below.

syntax

//charT - one of the character types char, wchar_t, char16_t or char32_t
//traits - defines null character, comparison and other properties
template <typename charT,
typename traits = char_traits<charT>,
typename Allocator = allocator<charT> >
class basic_string;

//Predefined string classes 
typedef basic_string<char> string;
typedef basic_string<wchar_t> wstring;
typedef basic_string<char16_t> u16string;
typedef basic_string<char32_t> u32string;

char_traits

The char_traits classes define common behavior such as comparison, assignment and copy, as well as other aspects such as the eof value, offset type and position type.

Buffering

Internally, the characters of a string are stored in a buffer that is allocated on the heap by default. It's possible to change the storage by supplying a custom allocator. Example 15 allocates a 100 MB string using a shared memory based custom allocator class.

Constants

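npos is a static member constant holding the largest value representable by size_t. It is used in the constructor and in the string modification functions, where it means until the end of the string. The find functions return it to indicate that nothing was found.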

Constructors

The following constructors are available. Note [khrisha rao] denotes a string object. "khrisha rao" denotes a string literal.

Name	Description
string()	default constructor
string(const string&)	copy constructor
string(string&&)	move constructor
string(initializer_list<char>)	initializer list constructor.
string(const string& s, size_t p, size_t n=npos)	from s, copy n chars, starting from p
string(const char* s)	copy all chars from c string s
string(const char* s, size_t n)	copy n chars from c string s
string(size_t n, char c)	fill n chars with char c
string(InputIterator b, InputIterator e)	copy chars from iterator b to iterator e

The example 7 depicts the usage.

Access using Iterators

Following functions return one of the iterators - random access iterators, const access iterators, reverse iterators and const access reverse iterators.

Name	Description
iterator begin()	Return iterator to beginning
iterator end()	Return iterator to end
reverse_iterator rbegin()	Return reverse iterator to reverse beginning
reverse_iterator rend()	Return reverse iterator to reverse end
const_iterator cbegin()	Return const_iterator to beginning
const_iterator cend()	Return const_iterator to end
const_reverse_iterator crbegin()	Return const_reverse_iterator to reverse beginning
const_reverse_iterator crend()	Return const_reverse_iterator to reverse end

The example 8 depicts the usage.

Storage and length

The following functions return or change the length and capacity of the underlying string buffer.

Name	Description
const char* c_str()	Get C string equivalent
const char* data()	Same as c_str(). Since C++11 the returned buffer is also null-terminated.
size_t size()	Return length of string
size_t length()	Return length of string
size_t max_size()	Return maximum size of string
void resize(size_t n) void resize(size_t n, char c)	Resize string to size n. If n is smaller than the current size, the string is truncated; if larger, null characters are appended. Resize string to size n, and fill any added positions with char c.
size_t capacity()	Return size of allocated storage
void reserve(size_t n=0)	Change capacity to size n or minimum
void clear()	Clear string
bool empty()	Test if string is empty
void shrink_to_fit()	Shrink to fit

The example 9 depicts the usage.

Access using Index

Access individual elements in the string

Name	Description
char& at(size_t p) const char& at(size_t p) const	Get character in string at position p
char& back() const char& back() const	Access last character
char& front() const char& front() const	Access first character
char& operator[] (size_t p) const char& operator[] (size_t p) const	indexed access to characters of the string

String Modifications

Perform operations such as insert, append, replace, erase, assign and swap

Name	Description
string& append(const string& s) string& append(const string& s, size_t p, size_t n) string& append(const char* s) string& append(const char* s, size_t n) string& append(size_t n, char c) string& append(InputIterator b, InputIterator e) string& append(initializer_list<char> il)	Append string s Append substring from s Append c string s Append n chars from c string s Append char c, n times Append chars from iterators Append chars from initializer list
string& assign(const string& s) string& assign(const string& s, size_t p, size_t n) string& assign(const char* s) string& assign(const char* s, size_t n) string& assign(size_t n, char c) string& assign(InputIterator b, InputIterator e) string& assign(initializer_list<char> il) string& assign (string&& s)	Assign string s Assign substring from s Assign c string s Assign n chars from c string s Assign char c, n times Assign chars from iterators Assign chars from initializer list Assign string s
string& insert (size_t p, const string& s) string& insert (size_t p, const string& s, size_t sp, size_t n) string& insert (size_t p, const char* s) string& insert (size_t p, const char* s, size_t n) string& insert (size_t p, size_t n, char c) iterator insert (const_iterator p, size_t n, char c) iterator insert (const_iterator p, char c) iterator insert (iterator p, InputIterator b, InputIterator e) iterator insert (const_iterator p, initializer_list<char> )	insert string s at position p insert substring from s at position p insert c string s at position p insert n chars c string s at position p insert char c, n times at position p insert char c, n times at iterator p insert char c, at iterator p insert chars from b to e, at iterator p insert chars from initializer list at iterator p
string& erase (size_t p = 0, size_t n = npos) iterator erase (const_iterator p) iterator erase (const_iterator b, const_iterator e)	Erase n chars at p Erase char at p Erase chars from b to e
string& replace (size_t p, size_t n, const string& s) string& replace (const_iterator i1, const_iterator i2, const string& s) string& replace (size_t p, size_t n, const string& s, size_t sp, size_t sn) string& replace (size_t p, size_t n, const char* s) string& replace (const_iterator i1, const_iterator i2, const char* s) string& replace (size_t p, size_t n, const char* s, size_t sn) string& replace (const_iterator i1, const_iterator i2, const char* s, size_t n) string& replace (size_t pos, size_t len, size_t n, char c) string& replace (const_iterator i1, const_iterator i2, size_t n, char c) string& replace (const_iterator i1, const_iterator i2, InputIterator b, InputIterator e) string& replace (const_iterator i1, const_iterator i2, initializer_list<char>)	replace n chars from string s at position p replace chars from i1 to i2 from string s replace n chars at position p from substring of string s replace n chars at position p from cstring s replace chars from i1 to i2 from cstring s replace n chars or less at position p from cstring s replace n chars from i1 to i2 from cstring s replace n chars with char c at position p replace n chars from i1 to i2 with char c replace chars from i1 to i2 with chars from iterators b to e replace chars from i1 to i2 with chars from initializer list
void push_back(char c)	Append character to string
void pop_back()	Delete last character
void swap (string& x, string& y)	Exchanges the values of two strings
void swap(string& s)	Swap string values with s

The example 10 depicts the usage.

String Operations

Perform operations such as extract string buffer, copy, find, compare and substring. The find functions return the position of the match, otherwise, npos.

The compare functions compare the full string or a substring with another string or its substring, or with a cstring or part of a cstring.

The return values are

Result	Description
0	compare equal.
< 0	either the value of the first character that does not match is lower in the compared string, or all compared characters match but the compared string is shorter.
> 0	either the value of the first character that does not match is greater in the compared string, or all compared characters match but the compared string is longer.

Name	Description
size_t copy (char* s, size_t n, size_t p = 0)	Copy n chars from position p to buffer s
size_t find (const string& s, size_t p = 0) size_t find (const char* s, size_t p = 0) size_t find (const char* s, size_t p, size_t n) size_t find (char c, size_t p = 0)	Find string s from position p Find cstring s from position p Find n chars in cstring s from position p Find char c from position p
size_t rfind (const string& s, size_t p = npos) size_t rfind (const char* s, size_t p =npos) size_t rfind (const char* s, size_t p, size_t n) size_t rfind (char c, size_t p=npos)	Reverse find string s from position p Reverse find cstring s from position p Reverse find n chars in cstring s from position p Reverse find char c from position p
size_t find_first_of(const string& s, size_t p = 0) size_t find_first_of(const char* s, size_t p = 0) size_t find_first_of(const char* s, size_t p, size_t n) size_t find_first_of(char c, size_t p = 0)	Find any char in string s from position p Find any char in cstring s from position p Find any of n chars in cstring s from position p Find char c from position p
size_t find_last_of(const string& s, size_t p = npos) size_t find_last_of(const char* s, size_t p = npos) size_t find_last_of(const char* s, size_t p, size_t n) size_t find_last_of(char c, size_t p = npos)	Reverse Find any char in string s from position p Reverse Find any char in cstring s from position p Reverse Find any of n chars in cstring s from position p Reverse Find char c from position p
size_t find_first_not_of(const string& s, size_t p = 0) size_t find_first_not_of(const char* s, size_t p = 0) size_t find_first_not_of(const char* s, size_t p, size_t n) size_t find_first_not_of(char c, size_t p = 0)	Find any char not in string s from position p Find any char not in cstring s from position p Find any of n chars not in cstring s from position p Find any char but not char c from position p
size_t find_last_not_of(const string& s, size_t p = npos) size_t find_last_not_of(const char* s, size_t p = npos) size_t find_last_not_of(const char* s, size_t p, size_t n) size_t find_last_not_of(char c, size_t p = npos)	Reverse Find any char not in string s from position p Reverse Find any char not in cstring s from position p Reverse Find any of n chars not in cstring s from position p Reverse Find any char but not char c from position p
int compare (const string& s) int compare (size_t p, size_t n, const string& s) int compare (size_t p, size_t n, const string& s, size_t sp, size_t sn) int compare (const char* s) int compare (size_t p, size_t n, const char* s) int compare (size_t p, size_t n, const char* s, size_t sn)	Compare with string s Compare substring with string s Compare substring with substring of string s Compare with cstring s Compare substring with cstring s Compare substring with first sn chars of cstring s
string substr (size_t p = 0, size_t n = npos)	Generate substring

The example 11 depicts the usage.

Overloaded Operators

The following overloaded operators are implemented as non-member functions (operator+) and as member functions (operator= and operator+=).

Name	Description
string operator+ (const string& lhs, const string& rhs) string operator+ (const string& lhs, const char* rhs) string operator+ (const char * lhs, const string& rhs) string operator+ (const string& lhs, char rhs) string operator+ (char lhs, const string& rhs)	add two strings. return the result. add string and cstring. return the result. add cstring and string. return the result. add string and char. return the result. add char and string. return the result.
string& operator=(const string& str) string& operator=(const char* s) string& operator=(char c) string& operator=(initializer_list<char>) string& operator=(string&& s)	assign string s assign cstring s assign char c assign from initializer list assign and move string s
string& operator+= (const string& str) string& operator+= (const char * s) string& operator+= (char c)	Append a string Append a cstring Append a char

Relational Operators

These operators are externally implemented for string comparison.

Name	Description
bool operator== (const string& lhs, const string& rhs) bool operator== (const char * lhs, const string& rhs) bool operator== (const string& lhs, const char * rhs)	Equality operator for string
bool operator!= (const string& lhs, const string& rhs) bool operator!= (const char * lhs, const string& rhs) bool operator!= (const string& lhs, const char * rhs)	Non Equality operator for string
bool operator< (const string& lhs, const string& rhs) bool operator< (const char * lhs, const string& rhs) bool operator< (const string& lhs, const char * rhs)	Less than operator for string
bool operator<= (const string& lhs, const string& rhs) bool operator<= (const char * lhs, const string& rhs) bool operator<= (const string& lhs, const char * rhs)	Less than equal operator for string
bool operator> (const string& lhs, const string& rhs) bool operator> (const char * lhs, const string& rhs), bool operator> (const string& lhs, const char* rhs)	Greater than operator for string
bool operator>= (const string& lhs, const string& rhs) bool operator>=(const char * lhs, const string& rhs), bool operator>= (const string& lhs, const char * rhs)	Greater than equal operator for string

Stream operators and functions

The following overloaded function can be used with stream read and write operations.

Name	Description
istream& operator>> (istream& is, string& str)	Extract string from stream
ostream& operator<< (ostream& os, const string& str)	Insert string into stream
istream& getline (istream& is, string& str, char delim) istream& getline (istream& is, string& str)	Externally overloaded function, reads a line from stream into string

The example 12 depicts the usage.

Convert strings to arithmetic types

The following functions convert strings to arithmetic types. The supported string types are string and wstring.

The value of the parameter const string& str can start with optional whitespace followed by a numerical value, which can be trailed by any characters. Examples: "99", " 99namaskara" etc.

The value of the parameter size_t* idx , is set by the function to the position of the next character in str after the numerical value. This parameter can also be a null pointer, in which case it is not used.

The value of the parameter int base represents numerical base (radix) that determines the valid characters and their interpretation.

If this is 0, the base used is determined by the format in the sequence. Notice that by default this argument is 10, not 0.

Name	Description
int stoi (const string& str, size_t* idx = 0, int base = 10)	Convert string to integer
long stol (const string& str, size_t* idx = 0, int base = 10)	Convert string to long int
unsigned long stoul (const string& str, size_t* idx = 0, int base = 10)	Convert string to unsigned long
long long stoll (const string& str, size_t* idx = 0, int base = 10)	Convert string to long long
unsigned long long stoull (const string& str, size_t* idx = 0, int base = 10)	Convert string to unsigned long long
float stof (const string& str, size_t* idx = 0)	Convert string to float
double stod (const string& str, size_t* idx = 0)	Convert string to double
long double stold (const string& str, size_t* idx = 0)	Convert string to long double

The example 13 depicts the usage.

Convert arithmetic types to string

The following functions convert arithmetic types to strings. to_string returns a string and to_wstring returns a wstring.

Name	Description
string to_string (int val) wstring to_wstring (int val)	Convert integer to string / wstring
string to_string (long val) wstring to_wstring (long val)	Convert long int to string / wstring
string to_string (long long val) wstring to_wstring (long long val)	Convert long long to string / wstring
string to_string (unsigned val) wstring to_wstring (unsigned val)	Convert unsigned int to string / wstring
string to_string (unsigned long val) wstring to_wstring (unsigned long val)	Convert unsigned long to string / wstring
string to_string (unsigned long long val) wstring to_wstring (unsigned long long val)	Convert unsigned long long to string / wstring
string to_string (float val) wstring to_wstring (float val)	Convert float to string / wstring
string to_string (double val) wstring to_wstring (double val)	Convert double to string / wstring
string to_string (long double val) wstring to_wstring (long double val)	Convert long double to string / wstring

The example 14 depicts the usage.

Monday, November 6, 2023

Regular Expression

Overview

Regular expressions are indispensable when looking for specific information, for example while searching through log files. When used in programming languages, they are highly versatile for filtering information or validating inputs without writing lots of complex code.

The following websites contain in depth information and examples:

Regular expressions info
rexegg

The following websites contain interactive tutorials:

regexone

The following websites provide a test harness for testing and even debugging regular expressions:

regex101

The following websites facilitate compiling and running code samples in C++ and other platforms

wandbox

Notepad++ supports regular expression search and replace.

Validation

Regular expressions can be used to validate an input. Some examples are below.

Regular Expression	Description	Inputs	Demo
\d{3}-\d{2}-\d{4}	US social security number	123-45-6789	Example
\((\d{3})\)\d{3}-\d{4}	US phone number	(408)333-4444	Example 2
9505[0-6]	Santa Clara zip codes	95056	Example 3
(?:0?\d|1[0-2]|jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[ \/-](?:0?\d|1\d|2\d|3[0-1])[ \/-]\d{4}	date in different formats with limited validation	1/1/1987 09-21-2018 mar 8 1969 7 17 1975	Example 4

Extraction

Regular expressions can be used to extract information from the input. To extract information, capture groups can be used where the information inside () will be extracted. Some examples are below.

Regular Expression	Description	Inputs	Demo
(\(\d{3}\))-\d{3}-\d{4}	extract area code from a US phone number	(408)-333-4444	Example 5
\s+(\w+)\s+\1	find all the duplicate words	state of of the art	Example 6

Basics

A regular expression basically searches for a pattern. A pattern can be simple text consisting of a few letters or numbers. Example: Bangalore 560082

A pattern can be complex, consisting of character classes, quantifiers, grouping etc. as shown in the examples above. The structure of such a pattern is governed by a set of regular expression grammar rules. The grammar uses these metacharacters <([{\^-=$!|]})?*+.>

The following topics describe each feature of the grammar in detail

Dot character

Patterns can use . to match any character except line break characters. Note that it has no special meaning inside a character class, where it matches a literal dot such as a decimal point.

Some examples are cat, Example etc. Example 7

Character class

The basic ingredient of the regular expression grammar is a character class. The structure of a character class is defined as below. Note that the yellow background characters indicates matches in the input text.

Construct	Description	Matches	Demo
[ae]	a or e (simple class)	gray grey	Example 8
[^aeiou]	Any character except aeiou (negation)	marcial	Example 9
[a-zA-Z]	a through z, or A through Z inclusive (range)	Khri$ha	Example 10
[^7-9]	any character other than 7 through 9	95056	Example 11

Predefined character classes or Shorthands

To reduce clutter, shorthands to character classes are provided. These can be freely used in another character class or even in pattern. The shorthands and their expansions are below. Note that the yellow background characters indicates matches in the input text.

Construct	Description	Matches	Demo
\d	A digit: [0-9]	$10.99	Example 12
\D	A non-digit: [^0-9]	$10.99	Example 13
\s	A whitespace character: [ \t\n\x0B\f\r]	try it!	Example 14
\S	A non-whitespace character: [^\s]	try it!	Example 15
\w	A word character: [a-zA-Z_0-9]	try it!	Example 16
\W	A non-word character: [^\w]	try it!	Example 17

Escaping

Sometimes a metacharacter needs to be escaped in a pattern or a character class. Escaping is done by placing \ in front of the metacharacter.

Examples: \[ and \] escape the [ and ] metacharacters and match the brackets in [1,2,3]. \. escapes the dot and matches the decimal point in 10.99. Example 18

Anchors and boundary markers

Anchors and boundary markers mark special locations such as the beginning or end of a line, word boundaries etc. These can be used only in the pattern and not in character classes. Note that the yellow background indicates markers in the input text: "Hello, World!"

Construct	Description	Matches	Demo
^	The beginning of a line	Hello, World!	Example 19
$	The end of a line	Hello, World!	Example 20
\b	A word boundary	Hello , World !	Example 21
\B	A non-word boundary	H e l l o, W o r l d!	Example 22
\A	The beginning of the input	Hello, World!	Example 23
\G	The end of the previous match	Hello, World!	Example 24
\Z	The end of the input but for the final terminator, if any	Hello, World!	Example 25
\z	The end of the input	Hello, World!	Example 26

Quantifiers

Quantifiers determine how many times a token in the pattern is repeated. A quantifier applies only to the token that precedes it in the pattern. The table below lists them in detail. Note that <empty> means blank or no matches were found. Matches are highlighted in yellow.

Construct	Description	Pattern	Matches	Demo
?	Matches zero or one time	S?	s	Example 27
+	Matches once or more	S+	s sss	Example 28
*	Matches 0 or more	S*	sss	Example 29
{n}	Exactly n times	S{3}	sss	Example 30
{m,}	Minimum m times or more	S{2,}	s ss sss	Example 31
{m,n}	Minimum m times and Maximum n times	S{2,3}	s ss sss	Example 32

Greedy, Lazy and Possessive Quantifiers

The results of the same pattern for the same input but different quantifiers can be surprisingly different.

This is more pronounced when the pattern has a dot (.) followed by ? or * or +.

This has to do with how much of the input text the regular expression engine grabs for matching and how it backtracks when a match is not found. During backtracking, the regular expression engine gives up one token from the grabbed text and tries again. This repeats till the grabbed text is empty or a match is found.

In case of greedy, as much of the input as possible is grabbed and when no match is found, backtracking happens.

In case of lazy, as few tokens as possible are grabbed and when no match is found, the engine grabs one more token and tries again.

In case of possessive, as much of the input as possible is grabbed and when no match is found, no backtracking happens.

By default, quantifiers are greedy. They can be made lazy by adding ? or Possessive by adding + as shown below.

Greedy	Lazy	Possessive	Meaning
X*	X*?	X*+	X, zero or more times
X+	X+?	X++	X, one or more times
X{n}	X{n}?	X{n}+	X, exactly n times
X{n,}	X{n,}?	X{n,}+	X, at least n times
X{n,m}	X{n,m}?	X{n,m}+	X, at least n but not more than m times

For example, consider the text "This is a <B> bold </B> example". Using the pattern <.*>, the expectation is to match <B> and </B>. However, greedy matches more (everything from <B> to </B>) and possessive matches nothing. Only lazy matches as expected.

Pattern	Remark	Match	Demo
<.*>	Greedy	This is a <B> bold </B> example	Example 33
<.*?>	Lazy	This is a <B> bold </B> example	Example 34
<.*+>	Possessive	This is a <B> bold </B> example	Example 35

Capture groups

Capture groups are one of the key aspects of regular expressions. They enable capturing specific information, such as the area code in the example below. The groups are defined by enclosing them in ().

For example, the pattern \((\d{3})\)\d{3}-\d{4} matches (408)333-4444 and captures 408. Example 36

The following topics discuss different features of capture groups in depth.

OR (|) Operator

Suppose the phone numbers of the city of Los Angeles need to be filtered. The OR operator (|) can be used in a capture group. For example, the pattern below has a group set up as (213|323).

The pattern \((213|323)\)\d{3}-\d{4} matches (213)123-4567 or (323)456-1234. Example 37

Non capturing groups

Non capturing groups are used for efficiency and optimization. As the name indicates, contents of non capturing groups are discarded. The Non capturing groups are defined and enclosed in (?:)

For example, the pattern (\w\w) (\d{5})-(?:\d{4}) discards the last 4 numbers of the zip code CA 95131-3059 Example 38

References

Captured groups in a pattern are internally labeled as \1, \2 , \3 etc. as shown below.

([0-9])([-/ ])[a-z][-/ ]([0-9])

|--1--||--2--| |--3--|

Nested references are labeled differently as shown below.

(([0-9])([-/ ]))([a-z])

|--2--||--3--|

|-------1------||--4--|

A reference refers to a previously captured group in the pattern. These are useful when looking for duplicate words in a text. There can be 3 different kinds of references.

Back reference

Back references are located after the captured group. For example, the pattern \s(\w*)\s\1 looks for adjacent duplicate words in a text. Here \1 refers to the first global captured group. For the input "This is is a test" a match is made as highlighted in blue. Example 39

Forward reference

Forward references are located before the captured group. For example, the pattern (\2two|(one)) looks for the text in the second global captured group "one" in a text. For the input "oneonetwo" a match is made as highlighted in blue. Example 40

Nested reference

Nested references are defined within a captured group and refer to a sub captured group defined within it. For example, in (\1two|(one)), \1 refers to the first relative captured group within the outer captured group. For the input "oneonetwo" a match is made as highlighted in blue. Example 41

In addition to the above, references can also be named and used with \k switch. Relative referencing is also supported with negative numbers using \k switch.

Named references

Captured groups can have names and they can be used instead of numbers. For example, the back reference pattern discussed earlier can be rewritten as \s(?'dup'\w*\s)\k'dup'. Example 43 Note that the captured group name is defined as dup. The syntax is ?'name'. It can be referenced as \k'name'. Alternatively, \g can also be used. \g'name' for referencing. Example 42

Duplicate names are allowed however the last captured group with the same name is used for matching.

Relative references

References to relatively placed capture groups are allowed with \kn, where n is a negative number starting with -1. For example, the pattern (a)(b)(c)\k-3 matches abca. Similarly, (a)(b)(c)\k-1 matches abcc. Another pattern, (a)(b)(c\k-2), matches abcb. Example 44

Advanced Features

Flags

The behavior of the regular expression engine can be changed by setting flags in the pattern, for example to make the search case insensitive. The syntax (?gi) turns on the global flag and the case insensitive flag. Example 45 Similarly, (?gi-mx) turns on the global and case insensitive flags and turns off multiline mode and whitespace skipping.

The following is a partial list supported by most engines.

g: matches the pattern multiple times

i: makes the regex case insensitive

m: enables multi-line mode, where ^ and $ match the start and end of each line. Without this, ^ and $ match only the start and end of the entire string.

u: enables support for unicode

s: short for single line, it causes the . to also match new line characters

x: ignore whitespace

U: makes quantifiers lazy

Unicode support

First, some basic information. The Unicode standard describes how to represent the characters of all the languages in the world, living or dead. In Unicode, a code point describes a unique artifact of the script. It can be a character or a combining mark, for example a (U+61) or the combining grave accent (U+300). A unit of readable representation of the script is called a grapheme cluster. It can consist of one or more code points, for example a (U+61) or à (U+61 and U+300). Note that à can also be represented as a single code point (U+E0).

The dot (.) equivalent of unicode is \X except it also matches line breaks. A single codepoint can also be represented \x{FFFF} where FFFF is the codepoint.

Unicode categories

Unicode also defines categories that are represented as \p{xxx}, where xxx can be letters (\p{L}), marks (\p{M}), numbers (\p{N}), currency symbols (\p{Sc}) etc. \P{xxx} matches anything that does not belong to that category.

Examples:

\p{Sc} matches "Prices: $2, €1, ¥9" Example 46

\p{M}*\p{L}* matches kannada script "ಖ್ರಿಷಾ" as six different code points ಖ ್ ರ ಿ ಷ ಾ Example 47

Here \p{L} maps to ಖ ರ ಷ and \p{M} maps to ್ ಿ ಾ

combining code points ಖ and ್ yields ಖ್

combining code points ರ and ಿ yields ರಿ

combining ಖ್ and ರಿ yields ಖ್ರಿ

combining code points ಷ and ಾ yields ಷಾ

combining ಖ್ರಿ and ಷಾ yield ಖ್ರಿಷಾ

Branch Reset Groups

Consider a pattern (1a)|(2a)|(1b)\1. This defines three capture groups. For the input 1a1a, it is expected to match, however it does not. The solution is to use a branch reset group. The pattern (?|(1a)|(2a)|(1b))\1 defines one capture group that matches inputs 1a1a or 2a2a or 1b1b. Example 48

LookAround

There are 4 types of lookaround: positive and negative lookahead, and positive and negative lookbehind. Lookarounds are zero-length assertions, just like the start and end of line or start and end of word anchors. The difference is that a lookaround actually matches characters but then discards them, returning only the result: match or no match.

Negative lookahead can be used if you want to match something not followed by something else. For example q(?!u) matches words like qack but not quit.

Positive lookahead works just the opposite. For example q(?=u) matches words like quit but not qack.

Lookbehind works backwards. Negative lookbehind can be used if you want to match something not preceded by something else. For example, (?<!a)b matches a “b” that is not preceded by an “a”. It doesn’t match cab, but matches the b (and only the b) in bed or debt. Positive lookbehind works the opposite way: (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.

Detailed example

Let's say there is an inventory of different writing items in different colors as below:

black pen

black pencil

red pen

red crayon

purple crayon

The Look around functions can be used to filter out unique items based on certain criteria as discussed below.

Lookaround	Pattern	Description	Demo
Positive Look ahead	(\w+) (?=pen\s)	extract the colors of all the pens in the inventory. (black, red)	Example 49
Negative Look ahead	(\w+) (?!pen\s)	extract the colors of all the items in the inventory that are not pens. (black, red, purple)	Example 50
Positive Look behind	(?<=black )(\w+)	extract all the black color items in the inventory. (pen, pencil)	Example 51
Negative Look behind	(?<!black) (\w+)	extract all the items in the inventory that are not black. (pen, crayon)	Example 52

LookBehind with \K

Because of certain restrictions on the expression allowed inside a positive lookbehind, i.e. (?<=expression), the \K switch can be used as an alternative. \K drops everything matched before it from the reported match. For example, the pattern (ab\Kc|d\Ke)f matches in abcf and def, reporting cf and ef. Example 53

Atomic Grouping

An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround groups are also atomic.

Example: The pattern a(?>bc|b)c matches abcc but not abc. Example 54 When applied to abc, a matches a, bc matches bc, and then c fails to match at the end of the string. Unlike with an ordinary capture group, the engine cannot backtrack into the group to try the alternative b, so a failure is reported.

If-Then-Else Conditionals

If-Then-Else is a special construct that allows the creation of conditional regular expressions. If the condition evaluates to true, the regex engine attempts to match the then part. Otherwise, the else part is attempted instead. The syntax is as below:

(?(condition)then|else)

The else part is optional. The condition can be the number of a capture group (true if that group took part in the match), a lookaround, etc.

Example:

Consider (?:(a)|(b)|(c))(?(n)x|y) where n can be 1 or 2 or 3.

Pattern	Matched through the then part	Matched through the else part	Demo
(?:(a)|(b)|(c))(?(1)x|y)	ax	by, cy	Example 55
(?:(a)|(b)|(c))(?(2)x|y)	bx	ay, cy	Example 56
(?:(a)|(b)|(c))(?(3)x|y)	cx	ay, by	Example 57

Recursion

Suppose the task is to check whether an arbitrary number of opening and closing braces, such as () or {}, are balanced. Regular expression recursion comes to the rescue.

The syntax for recursion is (?R)

For example, \{(?R)?\} matches the whole input {{{}}} but not the whole of {{{}}}} Example 58

Here each { is matched with a corresponding }. First, \{ matches the first { in the string. Then the regex engine reaches (?R). This tells the engine to attempt the whole regex again at the present position in the string. Now, \{ matches the second { in the string. The engine reaches (?R) again. On the second recursion, \{ matches the third {. On the third recursion, \{ fails to match the first } in the string. This causes (?R) to fail. But the regex uses a quantifier to make (?R) optional. So the engine continues with \}, which matches the first } in the string.

Now, the regex engine has reached the end of the regex. But since it’s two levels deep in recursion, it hasn’t found an overall match yet. It has only found a match for (?R). Exiting the recursion after a successful match, the engine again reaches \}. It now matches the second } in the string. The engine is still one level deep in recursion, from which it exits with a successful match. Finally, \} matches the third } in the string. The engine is again at the end of the regex. This time, it’s not inside any recursion. Thus, it returns {{{}}} as the overall regex match.

The main purpose of recursion is to match balanced constructs or nested constructs as shown below.

Example: The pattern \((?>[^()]|(?R))*\) matches the input

( 1000 - ( 22 / ( 7 + 4 ) * 8 ) * 9 ) Example 59

Subroutines

Subroutines are applied to the capture groups. These are very similar to regular expression recursion. Instead of matching the entire regular expression again, a subroutine call only matches the regular expression inside a capturing group. A subroutine call can be made to any capturing group from anywhere in the pattern. A call made to same capturing group leads to recursion.

A subroutine can be called in different ways: (?1) calls a numbered group, (?+1) calls the next group, (?-1) calls the preceding group, and (?&name) calls a named group.

For example, (?+1)(?'name'[abc])(?1)(?-1)(?&name) matches a string that is five letters long and consists only of the first three letters of the alphabet such as abcab, cabab etc. Example 60

This regex is exactly the same as [abc](?'name'[abc])[abc][abc][abc]

Another example is ([abc])(?1){4}, which matches cabab Example 61

Recursion into a capturing group is a more flexible way of matching balanced constructs than recursion of the whole regex. We can wrap the regex in a capturing group, recurse into the capturing group instead of the whole regex, and add anchors outside the capturing group.

The above example of matching an equation can be written as

(\((?>[^()]|(?1))*\))

This matches the balanced parenthesized groups in inputs such as ( 10 + 9 ) * ( 13 *7 ) + ( 6 * ( 9 ) * 7 ) Example 62

Another example is to match palindromes

(?'word'(?'letter'[a-z])(?&word)\k'letter'|[a-z]?)

This matches inputs such as radar, dad, abba etc. Example 63

Using Regular expressions in C++

Captured groups or the entire match can also be replaced, as discussed below.

Regular expressions are part of the C++11 standard library; however, the standard library does not support many of the features discussed here. An alternative is to use the Boost libraries, which appear to be compatible with the feature-rich Perl syntax.

As noted earlier, wandbox can be used to compile and run the examples below. The examples discussed here use the latest C++ compiler along with the latest Boost library.

Broadly, regex programming involves three operations.

Match

Check whether the input text is valid. Examples: date, email address, phone numbers, SSN etc.

Search or extract

Extract certain information from the text. Examples: date, email address, phone numbers, SSN etc.

Replace

Replace certain information in the text. Examples: date, email address, phone numbers, SSN etc.

The following example 64 demonstrates all three of the above operations in detail, as seen in its output.

Summary of Examples

Name	Category	Description	Github	Regex101/WandBox
Example 1	Validations	SSN	output	source + output
Example 2	Validations	Phone #	output	source + output
Example 3	Validations	Zip code	output	source + output
Example 4	Validations	Date Formats	output	source + output
Example 5	Extractions	Area code	output	source + output
Example 6	Extractions	Duplicate words	output	source + output
Example 7	Dot Character	Usage	output	source + output
Example 8	Character class	Simple	output	source + output
Example 9	Character class	Negation	output	source + output
Example 10	Character class	Range	output	source + output
Example 11	Character class	Range Negation	output	source + output
Example 12	Shorthands	Digits	output	source + output
Example 13	Shorthands	Non Digits	output	source + output
Example 14	Shorthands	White Space	output	source + output
Example 15	Shorthands	Non White Space	output	source + output
Example 16	Shorthands	Word	output	source + output
Example 17	Shorthands	Non Word	output	source + output
Example 18	Escaping	Usage	output	source + output
Example 19	Anchors and boundary markers	Line Begin	output	source + output
Example 20	Anchors and boundary markers	Line End	output	source + output
Example 21	Anchors and boundary markers	Word Boundary	output	source + output
Example 22	Anchors and boundary markers	Non Word Boundary	output	source + output
Example 23	Anchors and boundary markers	Begin of Input	output	source + output
Example 24	Anchors and boundary markers	End of Previous Match	output	source + output
Example 25	Anchors and boundary markers	End of Input	output	source + output
Example 26	Anchors and boundary markers	End of Input2	output	source + output
Example 27	Quantifiers	?	output	source + output
Example 28	Quantifiers	+	output	source + output
Example 29	Quantifiers	*	output	source + output
Example 30	Quantifiers	{n}	output	source + output
Example 31	Quantifiers	{m,}	output	source + output
Example 32	Quantifiers	{m,n}	output	source + output
Example 33	Greedy, Lazy and Possessive Quantifiers	Greedy	output	source + output
Example 34	Greedy, Lazy and Possessive Quantifiers	Lazy	output	source + output
Example 35	Greedy, Lazy and Possessive Quantifiers	Capture with OR (|) Operator	output	source + output
Example 36	Capture groups	Capture	output	source + output
Example 37	Capture groups	Usage	output	source + output
Example 38	Non capturing groups	Source	output	source + output
Example 39	References	Back reference	output	source + output
Example 40	References	Forward reference	output	source + output
Example 41	References	Nested reference	output	source + output
Example 42	References	Named reference	output	source + output
Example 43	References	Named reference 2	output	source + output
Example 44	References	Relative reference	source output	source + output
Example 45	Flags	Source	output	source + output
Example 46	Unicode	Currency	output	source + output
Example 47	Unicode	International	output	source + output
Example 48	Branch Reset Groups	Source	output	source + output
Example 49	Look Around	Positive Look Ahead	output	source + output
Example 50	Look Around	Negative Look Ahead	output	source + output
Example 51	Look Around	Positive Look Behind	output	source + output
Example 52	Look Around	Negative Look Behind	output	source + output
Example 53	Look Around	LookBehind with \K	output	source + output
Example 54	Atomic Grouping	Source	output	source + output
Example 55	If-Then-Else Conditionals	Capture Group 1	output	source + output
Example 56	If-Then-Else Conditionals	Capture Group 2	output	source + output
Example 57	If-Then-Else Conditionals	Capture Group 3	output	source + output
Example 58	Recursion	Braces	output	source + output
Example 59	Recursion	Equation	output	source + output
Example 60	Subroutines	Three Letter	output	source + output
Example 61	Subroutines	Three Letter 2	output	source + output
Example 62	Subroutines	Equation	output	source + output
Example 63	Subroutines	Palindromes	output	source + output
Example 64	Search and Replacement	Source	source output	source + output

Regular Expression	Description	Inputs	Demo
(\(\d{3}\))-\d{3}-\d{4}	extract area code from a US phone number	(408)-333-4444	Example 5
\s+(\w+)\s+\1	find all the duplicate words	state of of the art	Example 6
