Overview
Strings and literals are an essential part of any programming language. The following describes the facilities and classes available to handle strings and literals.
Details
The following describes new character types, string literal types and classes.
char16_t and char32_t
char and wchar_t are widely supported data types for character representation. However, the size of wchar_t is platform dependent: it is 32 bits on Linux platforms and 16 bits on Windows.
char16_t and char32_t were introduced to provide uniformity, among other things. The size of char16_t is 16 bits and the size of char32_t is 32 bits on all mainstream platforms (the standard only guarantees at least these widths).
The prefix for char16_t literals is u, and the prefix for char32_t literals is U.
char16_t is used for encoding UTF-16 strings. Similarly, char32_t is used for encoding UTF-32 strings.
char16_t and char32_t are not supported in IOStreams. However, the u16string and u32string classes are defined to support string operations.
u32string fn {U"khrisha"};
u16string ln {u"rao"};
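The points above can be checked with a few compile-time and run-time assertions. This is a minimal sketch; the function names first_name, last_name and check_names are made up for illustration.

```cpp
#include <cassert>
#include <climits>      // CHAR_BIT
#include <string>
#include <type_traits>

// Character literals: the u prefix yields char16_t, the U prefix yields char32_t.
static_assert(std::is_same<decltype(u'k'), char16_t>::value, "u prefix gives char16_t");
static_assert(std::is_same<decltype(U'k'), char32_t>::value, "U prefix gives char32_t");

// Both types are wide enough for one UTF-16 / UTF-32 code unit.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds a UTF-32 code unit");

// Hypothetical helpers mirroring the two declarations above.
std::u32string first_name() { return U"khrisha"; }
std::u16string last_name()  { return u"rao"; }

void check_names() {
    assert(first_name().size() == 7);   // size() counts code units
    assert(last_name().size() == 3);
    assert(last_name()[0] == u'r');
}
```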
The mbrtoc16() and c16rtomb() APIs convert a narrow multibyte character to UTF-16 encoding and back. Similarly, the mbrtoc32() and c32rtomb() APIs convert a narrow multibyte character to UTF-32 encoding and back. The codecvt-based facets also provide a similar facility.
Example 3 and its console output demonstrate conversion from narrow multibyte characters to UTF-16/UTF-32 and back using the above APIs.
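As a rough illustration of the UTF-16 pair of functions (this is not the code of Example 3), the sketch below converts a whole string in each direction. The helper names narrow_to_utf16 and utf16_to_narrow are made up, and the narrow encoding is whatever the current C locale uses. Plain ASCII works in the default "C" locale, while other text is assumed to need a prior std::setlocale call with a suitable locale. Handling of surrogate pairs by these functions is also assumed to vary between C libraries.

```cpp
#include <cassert>
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <cuchar>    // std::mbrtoc16, std::c16rtomb
#include <cwchar>    // std::mbstate_t
#include <string>

// Narrow multibyte string (current C locale encoding) -> UTF-16.
// Stops at the first invalid or truncated sequence.
std::u16string narrow_to_utf16(const std::string& in) {
    std::u16string out;
    std::mbstate_t state{};
    const char* p = in.data();
    const char* const end = p + in.size();
    bool pending = false;  // true while a low surrogate is still to be fetched
    while (p < end || pending) {
        char16_t c16 = 0;
        const std::size_t rc =
            std::mbrtoc16(&c16, p, static_cast<std::size_t>(end - p), &state);
        if (rc == static_cast<std::size_t>(-1) || rc == static_cast<std::size_t>(-2))
            break;                                   // invalid or incomplete input
        out.push_back(c16);
        if (rc == static_cast<std::size_t>(-3)) {    // low surrogate, no input consumed
            pending = false;
            continue;
        }
        p += (rc == 0 ? 1 : rc);                     // rc == 0: a null character was converted
        pending = (c16 >= 0xD800 && c16 <= 0xDBFF);  // high surrogate just produced
    }
    return out;
}

// UTF-16 -> narrow multibyte string (current C locale encoding).
std::string utf16_to_narrow(const std::u16string& in) {
    std::string out;
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    for (const char16_t c16 : in) {
        const std::size_t rc = std::c16rtomb(buf, c16, &state);
        if (rc == static_cast<std::size_t>(-1)) break;  // not representable
        out.append(buf, rc);  // rc can be 0 after the first half of a surrogate pair
    }
    return out;
}

void round_trip_demo() {
    assert(narrow_to_utf16("rao") == u"rao");
    assert(utf16_to_narrow(u"rao") == "rao");
}
```

The UTF-32 functions mbrtoc32() and c32rtomb() follow the same pattern, without the surrogate handling.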
Unicode String Literals
String literals prefixed with u8 are encoded as UTF-8. The difference between an ordinary C string literal and a u8 string literal is the way they are encoded: the encoding of an ordinary literal can vary by platform, as discussed below.
Consider the following code.
char a[] = "Khrisha Rao👸";
char b[] = u8"Khrisha Rao👸";
In a Windows environment, strlen returns 13 for a[] and 15 for b[]. This is because, based on the active code page (e.g., 1252), the compiler emits 0x3f, 0x3f (that is, ??) for the character 👸 in a[], whereas it emits 0xf0, 0x9f, 0x91, 0xb8 for the same character in b[] irrespective of the active code page.
In a POSIX environment, strlen returns 15 for both a[] and b[], because the compiler emits 0xf0, 0x9f, 0x91, 0xb8 for the character 👸 in both arrays. The u8 prefix therefore enables writing portable applications where strings are serialized and deserialized the same way across platforms.
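The bytes quoted above can be inspected directly. The sketch below uses a hypothetical helper, hex_bytes, and writes 👸 as the universal character name \U0001F478 so that the result does not depend on how the source file itself is encoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats the code units of a u8 literal (without its terminating null) as hex.
// Templated on the element type because u8 literals are arrays of char up to
// C++17 and arrays of char8_t from C++20.
template <class CharT, std::size_t N>
std::string hex_bytes(const CharT (&literal)[N]) {
    static_assert(sizeof(CharT) == 1, "expects one-byte code units");
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const unsigned value = static_cast<unsigned char>(literal[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

void emoji_bytes_demo() {
    assert(hex_bytes(u8"\U0001F478") == "f0 9f 91 b8");
}
```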
The prefix for UTF-16 encoded strings is u; char16_t is used for encoding UTF-16 strings.
Similarly, the prefix for UTF-32 encoded strings is U; char32_t is used for encoding UTF-32 strings.
Again, the encoding is consistent across all platforms.
The table below summarizes the same.
data type | storage | encoding | prefix | Example | Length (code units) |
---|---|---|---|---|---|
char | string | UTF-8 | u8 | u8"ಖ್ರಿಷಾ Rao👸" | 26 |
char16_t | u16string | UTF-16 | u | u"ಖ್ರಿಷಾ Rao👸" | 12 |
char32_t | u32string | UTF-32 | U | U"ಖ್ರಿಷಾ Rao👸" | 11 |
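The length of a literal, counted in code units, can be checked at compile time. The sketch below does this for the Latin-script string from the earlier strlen example, with 👸 written as \U0001F478; code_units is a hypothetical helper.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// 11 ASCII characters plus one emoji:
static_assert(code_units(u8"Khrisha Rao\U0001F478") == 15, "UTF-8: emoji takes 4 bytes");
static_assert(code_units(u"Khrisha Rao\U0001F478") == 13, "UTF-16: emoji is a surrogate pair");
static_assert(code_units(U"Khrisha Rao\U0001F478") == 12, "UTF-32: one unit per code point");
```

The same text therefore has a different length in each encoding, which is why lengths should always be stated in code units of a specific encoding.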
raw string
Assigning free-flowing text to a string variable has been a challenge, especially if the text contains embedded " characters. Escaping special characters and adding newline characters is tedious and reduces the readability of the text.
For example,
//<div class = "separator" style="clear: both; text-align: center;">
string html = "<div class = \"separator\" style=\"clear: both; text-align: center;\">";
/*
		#version 330 core
		layout (location = 0) in vec3 vVertex;
		layout (location = 1) in vec2 vTexCrd;
*/
string shader = "\t\t#version 330 core\n"
                "\t\tlayout (location = 0) in vec3 vVertex;\n"
                "\t\tlayout (location = 1) in vec2 vTexCrd;\n";
The raw string feature enables declaring strings containing special characters such as newlines and tabs as is, without escaping. It is defined as below:
Syntax
//The delim is a character sequence of at most 16 basic characters
//except the backslash, whitespaces, and parentheses.
//The delim is optional, but if the string contains )", a delim should be used.
R"delim(...)delim"
Example
// no need to escape "
string html = R"(<div class = "separator" style="clear: both; text-align: center;">)";
// preserves tabs and newlines
std::string shader = R"(
		#version 330 core
		layout (location = 0) in vec3 vVertex;
		layout (location = 1) in vec2 vTexCrd;
)";
//"(Orange)", "(Apple)"
// a delimiter is needed only if the string contains )"
// delimiter is ! below
std::string fruits = R"!("(Orange)", "(Apple)")!";
Example 4 depicts the usage.
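To make the equivalence concrete, the sketch below (not the code of Example 4; all function names are made up) compares raw string literals with their escaped counterparts.

```cpp
#include <cassert>
#include <string>

// The same text written with escapes and as a raw string literal.
std::string html_escaped() { return "<div class = \"separator\">"; }
std::string html_raw()     { return R"(<div class = "separator">)"; }

// A line break inside a raw string literal becomes '\n' in the string.
std::string lines_escaped() { return "line 1\nline 2"; }
std::string lines_raw() {
    return R"(line 1
line 2)";
}

// This text contains a closing parenthesis followed by a double quote, which
// would terminate an undelimited raw string too early; the delimiter ! avoids that.
std::string fruits_raw() { return R"!("(Orange)", "(Apple)")!"; }

void raw_string_demo() {
    assert(html_raw() == html_escaped());
    assert(lines_raw() == lines_escaped());
    assert(fruits_raw() == "\"(Orange)\", \"(Apple)\"");
}
```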
Raw strings can be used for other character types as below.
string type | prefix | example |
---|---|---|
char | R" | R"(test)" |
wchar_t | LR" | LR"(test)" |
char16_t | uR" | uR"(test)" |
char32_t | UR" | UR"(test)" |
UTF-8 | u8R" | u8R"(test)" |
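A short sketch of the prefixes in the table, using a Windows-style path as sample text because its backslash would otherwise need escaping. The function names are hypothetical.

```cpp
#include <cassert>
#include <string>

std::string    narrow_path() { return R"(C:\temp)"; }
std::wstring   wide_path()   { return LR"(C:\temp)"; }
std::u16string utf16_path()  { return uR"(C:\temp)"; }
std::u32string utf32_path()  { return UR"(C:\temp)"; }

// The u8R form yields one-byte code units (char up to C++17, char8_t from
// C++20), so only its size is checked here: 7 characters plus the null.
static_assert(sizeof(u8R"(C:\temp)") == 8, "backslash and t stay two characters");

void raw_prefix_demo() {
    assert(narrow_path() == "C:\\temp");
    assert(wide_path() == L"C:\\temp");
    assert(utf16_path() == u"C:\\temp");
    assert(utf32_path() == U"C:\\temp");
}
```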
user defined literals
In C++, literals of arithmetic types such as long and float can be written with a suffix, as in 1L, 1.2f etc. C++11 extends this to standard library and user-defined types through user-defined literals. For example, 100_ms (a duration of 100 milliseconds), 100_b (binary form of 4) etc.
A user-defined literal is implemented as a user-defined function, called a literal operator.
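As a minimal sketch of the duration example, a _ms suffix could be defined as below. The operator and the helper matches_standard_suffix are illustrative only. The comparison with the standard library's own ms suffix assumes C++14 or later, where std::chrono_literals is available.

```cpp
#include <chrono>

// Hypothetical user-defined suffix: 100_ms builds a std::chrono::milliseconds.
constexpr std::chrono::milliseconds operator""_ms(unsigned long long n) {
    return std::chrono::milliseconds(
        static_cast<std::chrono::milliseconds::rep>(n));
}

// Parentheses are needed: without them, 100_ms.count would lex as one token.
static_assert((100_ms).count() == 100, "100_ms is 100 milliseconds");

// The standard library suffix has no leading underscore.
constexpr bool matches_standard_suffix() {
    using namespace std::chrono_literals;
    return 100ms == 100_ms;
}
static_assert(matches_standard_suffix(), "100ms and 100_ms are the same duration");
```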
Syntax
return_type operator ""_<name>(parameter_list)
return_type
Can be any predefined or user-defined type.
name
The suffix of a user-defined literal is written after "" and should begin with _, followed by a user-chosen name.
Suffixes that do not begin with _ are reserved for the standard library.
parameter_list
Only the following parameter types are allowed.
- const char*
- unsigned long long int
- long double
- char
- wchar_t
- char16_t
- char32_t
- const char*, std::size_t
- const wchar_t*, std::size_t
- const char16_t*, std::size_t
- const char32_t*, std::size_t
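The first form in the list, const char*, is the raw literal operator: for a numeric literal it receives the characters of the literal exactly as written. The sketch below uses it for a hypothetical _b suffix that reads its digits as binary. It assumes C++14 or later, because the constexpr function contains a loop.

```cpp
#include <stdexcept>

// Raw literal operator: 100_b calls operator""_b("100").
constexpr unsigned long long operator""_b(const char* digits) {
    unsigned long long value = 0;
    for (const char* p = digits; *p != '\0'; ++p) {
        if (*p != '0' && *p != '1')
            throw std::invalid_argument("_b expects only the digits 0 and 1");
        value = value * 2 + static_cast<unsigned long long>(*p - '0');
    }
    return value;
}

static_assert(100_b == 4, "binary 100 is 4");
static_assert(1000_b == 8, "binary 1000 is 8");
```

When such a literal is used in a constant expression, an invalid digit (for example 102_b) is reported at compile time, because the throw cannot be evaluated in a constant expression.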
The following example demonstrates using user-defined literals to represent memory sizes such as KB, MB, GB, or TB.
unsigned long long operator ""_kb(unsigned long long int n) {return n*1024;}
unsigned long long operator ""_mb(unsigned long long int n) {return n*1024_kb;}
unsigned long long operator ""_gb(unsigned long long int n) {return n*1024_mb;}
unsigned long long operator ""_tb(unsigned long long int n) {return n*1024_gb;}
//prints 5120
cout << 5_kb << endl;
//prints 6291456
cout << 6_mb << endl;
//prints 8589934592
cout << 8_gb << endl;
//prints 9895604649984
cout << 9_tb << endl;
The usual rules of function overloading also apply. For string literals, the length of the string is also passed to the function; however, the function can ignore it.
The following example shows the details. Here the user-defined literal _w is overloaded to take different inputs. Notice that the length of the string is used in the 2nd overload, while the 3rd is a raw literal operator that receives the characters of the number as a null-terminated string without a length.
long double operator ""_w(long double d) {return d;}
u16string operator ""_w(const char16_t* s, size_t n) {return u16string(s,n);}
unsigned operator ""_w(const char* str) {return atoi(str);}
// calls operator ""_w(1.2L)
//d=1.2
auto d = 1.2_w;
// calls operator ""_w(u"one", 3)
//s = u"one"
auto s = u"one"_w;
// calls operator ""_w("12")
//n=12
auto n = 12_w;
char_traits classes define common behavior such as comparison, assignment and copy, as well as other aspects such as the eof value, the offset type and the position type.
The CRT provides a plethora of functions to handle strings: there are separate functions for getting the length, appending, copying, finding etc. The std::basic_string class wraps strings in an object so that they are easier to use.
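The contrast can be sketched as follows: the first function works on a bare character array through std::char_traits<char>, while the second lets std::string (that is, std::basic_string<char>) manage length and storage. Both function names are made up for illustration and are not taken from the examples listed below.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Index of the first occurrence of ch in a null-terminated string,
// or std::string::npos, using only char_traits operations.
std::size_t find_with_traits(const char* text, char ch) {
    using traits = std::char_traits<char>;
    const std::size_t len = traits::length(text);    // comparable to strlen
    const char* hit = traits::find(text, len, ch);   // comparable to memchr
    return hit ? static_cast<std::size_t>(hit - text) : std::string::npos;
}

// The same kind of work through std::string: the object tracks its own
// length and grows its storage as needed.
std::string full_name(const std::string& first, const std::string& last) {
    std::string out = first;  // copy
    out += ' ';               // append a character
    out += last;              // append a string
    return out;
}

void traits_and_string_demo() {
    assert(find_with_traits("khrisha", 'r') == 2);
    assert(find_with_traits("rao", 'z') == std::string::npos);
    assert(full_name("khrisha", "rao") == "khrisha rao");
    assert(full_name("khrisha", "rao").find("rao") == 8);
}
```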
Summary of Examples
Name | Category | Description | Github | Wandbox |
---|---|---|---|---|
Example | char_traits | Copy and Find | source output | source + output |
Example 2 | char_traits | Custom Character Traits | source output | source + output |
Example 3 | char16_t,char32_t | Usage | source output | source + output |
Example 4 | raw strings literals | Usage | source output | source + output |
Example 5 | user defined literals | Celsius to Fahrenheit | source output | source + output |
Example 6 | basic_string | cstring | source output | source + output |
Example 7 | basic_string | Constructor | source output | source + output |
Example 8 | basic_string | Iterators | source output | source + output |
Example 9 | basic_string | Storage and Length | source output | source + output |
Example 10 | basic_string | String Modifications | source output | source + output |
Example 11 | basic_string | String Operations | source output | source + output |
Example 12 | basic_string | Stream Operators and Functions | source output | source + output |
Example 13 | basic_string | Strings to Arithmetic type conversion | source output | source + output |
Example 14 | basic_string | Arithmetic type to Strings conversion | source output | source + output |
Example 15 | basic_string | Shared memory custom allocator (windows) | source output | source + output |