Literals and Strings

Overview
Strings and literals are an essential part of any programming language. This post describes the facilities and classes available in C++ to handle them.

Details
The following describes new character types, string literal types and classes.

char16_t and char32_t
char and wchar_t are the widely supported data types for character representation. However, wchar_t is 32 bits on Linux platforms and 16 bits on Windows.
char16_t and char32_t were introduced to provide uniformity, among other things. On all mainstream platforms char16_t is 16 bits and char32_t is 32 bits; the standard guarantees that they are at least that wide.
The prefix for char16_t literals is u, and the prefix for char32_t literals is U.
char16_t is used for encoding UTF-16 strings. Similarly, char32_t is used for encoding UTF-32 strings.
char16_t and char32_t are not supported by IOStreams. However, the u16string and u32string classes are defined to support string operations.
u32string fn {U"khrisha"};
u16string ln {u"rao"};
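As a small illustration of the two types (a sketch, not taken from Example 3; the helper names are made up for this post), the following shows that a character outside the Basic Multilingual Plane, here U+1F478, occupies two char16_t code units (a surrogate pair) but only one char32_t code unit.

```cpp
#include <climits>
#include <string>

// Both types must be wide enough for one code unit of their encoding.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds a UTF-32 code unit");

// Hypothetical helpers: the same character, U+1F478, in both encodings.
inline std::u16string princess_utf16() { return u"\U0001F478"; }
inline std::u32string princess_utf32() { return U"\U0001F478"; }
```

Calling size() on the two results gives 2 and 1 respectively, which is why the length of a u16string is not always the number of characters.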

The mbrtoc16() and c16rtomb() APIs were introduced to convert a narrow multibyte character to UTF-16 encoding and back. Similarly, the mbrtoc32() and c32rtomb() APIs were introduced to convert a narrow multibyte character to UTF-32 encoding and back. The codecvt-based facets also provide a similar facility.
Example 3 and its console output demonstrate conversion from narrow multibyte strings to UTF-16/UTF-32 and back using the above APIs.
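Independently of Example 3, a round trip through UTF-32 might be sketched as below. The helper names to_utf32 and to_narrow are hypothetical. The narrow encoding is whatever the current C locale uses, so non-ASCII input only converts as expected after something like std::setlocale(LC_ALL, "") has selected a UTF-8 locale; that locale being available is an assumption. The header <cuchar> is also missing from some older standard library implementations.

```cpp
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Decode a narrow multibyte string (current C locale encoding) to UTF-32.
inline std::u32string to_utf32(const std::string& narrow) {
    std::u32string out;
    std::mbstate_t state{};
    const char* p = narrow.data();
    const char* end = p + narrow.size();
    while (p < end) {
        char32_t c = 0;
        std::size_t rc = std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        // (size_t)-1 is an encoding error, (size_t)-2 an incomplete sequence.
        if (rc == static_cast<std::size_t>(-1) || rc == static_cast<std::size_t>(-2)) break;
        if (rc == 0) rc = 1;  // an embedded '\0' was converted; step over it
        out.push_back(c);
        p += rc;
    }
    return out;
}

// Encode UTF-32 back to the narrow multibyte encoding of the current locale.
inline std::string to_narrow(const std::u32string& wide) {
    std::string out;
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    for (char32_t c : wide) {
        std::size_t rc = std::c32rtomb(buf, c, &state);
        if (rc == static_cast<std::size_t>(-1)) break;  // not representable
        out.append(buf, rc);
    }
    return out;
}
```

Both helpers stop at the first character that cannot be converted; a real program would probably report that error instead of silently truncating.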

Unicode String Literals
String literals prefixed with u8 are encoded as UTF-8. The difference between an ordinary C string literal and a u8 string literal is the way they are encoded: the encoding of an ordinary literal can vary by platform, as discussed below.
Consider the following code.
char a[] = "Khrisha Rao👸";
char b[] = u8"Khrisha Rao👸";

In a Windows environment, the strlen calls on a[] and b[] return 13 and 15 respectively. This is because, based on the active code page (e.g., 1252), the compiler emits 0x3f, 0x3f (that is, ??) for the character 👸 in a[], whereas it emits 0xf0, 0x9f, 0x91, 0xb8 for the same character in b[] irrespective of the active code page.

In a POSIX environment, the strlen calls on a[] and b[] both return 15, because the compiler emits 0xf0, 0x9f, 0x91, 0xb8 for the character 👸 in both a[] and b[]. Using the u8 prefix therefore enables writing portable applications where strings can be serialized and deserialized the same way across platforms.
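The byte counts above can be checked with a short sketch. Two things here are assumptions of this example rather than part of the original code: the character is written as the universal character name \U0001F478 so that the result does not depend on the encoding of the source file, and a reinterpret_cast is used because from C++20 onwards a u8 literal has type const char8_t[], so char b[] = u8"..." no longer compiles there. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// The cast is a no-op before C++20 (u8 literals are char based) and
// converts from char8_t from C++20 onwards.
inline const char* utf8_name() {
    return reinterpret_cast<const char*>(u8"Khrisha Rao\U0001F478");
}

// Number of bytes before the terminating '\0'.
inline std::size_t utf8_name_length() { return std::strlen(utf8_name()); }
```

utf8_name_length() is 15 on every platform: 11 bytes for "Khrisha Rao" plus the four bytes 0xf0, 0x9f, 0x91, 0xb8.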

The prefix for UTF-16 encoded strings is u; char16_t is used for encoding UTF-16 strings.
Similarly, the prefix for UTF-32 encoded strings is U; char32_t is used for encoding UTF-32 strings.
Again, the encoding is consistent across all platforms.

The table below summarizes the same.
data type | storage | encoding | prefix | example | length (code units)
char | string | UTF-8 | u8 | u8"ಖ್ರಿಷಾ Rao👸" | 26
char16_t | u16string | UTF-16 | u | u"ಖ್ರಿಷಾ Rao👸" | 12
char32_t | u32string | UTF-32 | U | U"ಖ್ರಿಷಾ Rao👸" | 11

Raw Strings
Assigning free-flowing text to a string variable has been a challenge, especially if it contains embedded " characters. Escaping special characters and adding newline characters is tedious and reduces the readability of the text.
For example,
//<div class = "separator" style="clear: both; text-align: center;">
string html = "<div class = \"separator\" style=\"clear: both; text-align: center;\">";
/*
		#version 330 core
		layout (location = 0) in vec3 vVertex;
		layout (location = 1) in vec2 vTexCrd;
*/
string shader = 
	"\t\t#version 330 core\n"
	"\t\tlayout (location = 0) in vec3 vVertex;\n"
	"\t\tlayout (location = 1) in vec2 vTexCrd;\n";


Output
<div class = "separator" style="clear: both; text-align: center;">

                #version 330 core
                layout (location = 0) in vec3 vVertex;
                layout (location = 1) in vec2 vTexCrd;
The raw string feature enables declaring strings containing special characters such as newlines, tabs etc. as is, without escaping. It is defined as below:
Syntax
//The delim is a character sequence of at most 16 basic characters,
//excluding the backslash, whitespace, and parentheses.
//The delim is optional, but if the string contains )", a delim should be used.
R"delim(...)delim" 

Example
    // no need to escape "
    string html = R"(<div class = "separator" style="clear: both; text-align: center;">)";
    
    // preserves tab, newlines
    std::string shader = R"(
		#version 330 core
		layout (location = 0) in vec3 vVertex;
		layout (location = 1) in vec2 vTexCrd;
        )";

    //"(Orange)", "(Apple)"
    // a delimiter is needed only if the string contains )"
    // delimiter is ! below
    std::string fruits = R"!("(Orange)", "(Apple)")!";

Output
<div class = "separator" style="clear: both; text-align: center;">

                #version 330 core
                layout (location = 0) in vec3 vVertex;
                layout (location = 1) in vec2 vTexCrd;

"(Orange)", "(Apple)"
Example 4 demonstrates the usage.
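To make the equivalence with escaped literals concrete, here is a minimal sketch (the function names and the delimiter x are arbitrary choices for this illustration) that pairs each raw string with the ordinary literal it stands for.

```cpp
#include <string>

// Backslashes need no doubling inside a raw string.
inline std::string raw_path()     { return R"(C:\temp\notes.txt)"; }
inline std::string escaped_path() { return "C:\\temp\\notes.txt"; }

// The text contains )" so a delimiter is required; x is an arbitrary choice.
inline std::string raw_quote()     { return R"x(say "(hi)")x"; }
inline std::string escaped_quote() { return "say \"(hi)\""; }

// A line break inside the literal becomes a '\n' in the string.
inline std::string raw_two_lines() { return R"(first
second)"; }
```

Each raw function returns exactly the same string as its escaped counterpart, and raw_two_lines() equals "first\nsecond".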

Raw strings can be used with the other character types as below.
string type | prefix | example
char | R" | R"(test)"
wchar_t | LR" | LR"(test)"
char16_t | uR" | uR"(test)"
char32_t | UR" | UR"(test)"
UTF-8 | u8R" | u8R"(test)"
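A brief sketch of the prefixed forms follows, using a regular expression pattern as the text because it is full of backslashes. The names are hypothetical. For the u8R form only the size is taken, since the element type of a u8 literal changed to char8_t in C++20.

```cpp
#include <cstddef>
#include <string>

// The same pattern, \d+\.\d+, in each character type.
inline std::wstring   wide_pattern()  { return LR"(\d+\.\d+)"; }
inline std::u16string utf16_pattern() { return uR"(\d+\.\d+)"; }
inline std::u32string utf32_pattern() { return UR"(\d+\.\d+)"; }

// Code units in the UTF-8 form, not counting the terminating null.
constexpr std::size_t utf8_pattern_units = sizeof(u8R"(\d+\.\d+)") - 1;
```

All four forms hold the same eight characters; only the code unit type differs.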

User Defined Literals
In C++, literals of arithmetic types such as long and float can be written with a suffix, as in 1L, 1.2f etc. C++11 extends this to standard library and user defined types through user defined literals. For example, 100_ms (a duration of 100 milliseconds), 100_b (the value 4 written in binary) etc.
A user defined literal is implemented as a user defined function, known as a literal operator.

Syntax
return_type operator ""_<name>(parameter_list)

return_type
can be any predefined or user defined type.

name
The literal operator is named by writing "" followed by the suffix; a user defined suffix must begin with an underscore (_).
Suffixes that do not begin with _ are reserved for the standard library.

parameter_list
Only the following parameter types are allowed.
  • const char*
  • unsigned long long int
  • long double
  • char
  • wchar_t
  • char16_t
  • char32_t
  • const char*, std::size_t
  • const wchar_t*, std::size_t
  • const char16_t*, std::size_t
  • const char32_t*, std::size_t
The following example uses user defined literals to represent memory sizes such as KB, MB, GB, or TB.
unsigned long long operator ""_kb(unsigned long long int n) {return n*1024;}
unsigned long long operator ""_mb(unsigned long long int n) {return n*1024_kb;}
unsigned long long operator ""_gb(unsigned long long int n) {return n*1024_mb;}
unsigned long long operator ""_tb(unsigned long long int n) {return n*1024_gb;}

    //prints 5120
    cout << 5_kb << endl;
    
    //prints 6291456
    cout << 6_mb << endl;
    
    //prints 8589934592
    cout << 8_gb << endl;

    //prints 9895604649984
    cout << 9_tb << endl;
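The _ms and _b suffixes mentioned at the start of this section could be written in the same spirit. This is only a sketch: both operators are hypothetical, not standard library facilities. _ms takes the integer value of the literal, while _b takes the const char* form from the parameter list above, so it sees the digits as text and can read them as base 2.

```cpp
#include <chrono>
#include <stdexcept>

// Hypothetical _ms suffix: an integer literal becomes a duration.
constexpr std::chrono::milliseconds operator""_ms(unsigned long long n) {
    return std::chrono::milliseconds(
        static_cast<std::chrono::milliseconds::rep>(n));
}

// Hypothetical _b suffix: the digits of the literal arrive as a
// null-terminated string and are interpreted as a binary number.
inline unsigned long long operator""_b(const char* digits) {
    unsigned long long value = 0;
    for (const char* p = digits; *p != '\0'; ++p) {
        if (*p != '0' && *p != '1') {
            throw std::invalid_argument("_b expects only the digits 0 and 1");
        }
        value = value * 2 + static_cast<unsigned long long>(*p - '0');
    }
    return value;
}
```

With these definitions, (250_ms).count() is 250, 1000_b is 8, and 101_b is 5. A literal such as 12_b compiles but throws at run time, which is one reason a real implementation might prefer a compile-time check.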

The rules of function overloading also apply. In the case of character strings, the length of the string is also passed to the function; the function can, however, ignore it.

The following example illustrates the details. Here the user defined literal _w is overloaded to take different inputs. Notice that the 2nd overload uses the string length it receives, while the 3rd is a raw literal operator: it receives the characters of a numeric literal as a null-terminated string and no length.
long double operator ""_w(long double d) {return d;}
u16string operator ""_w(const char16_t* s, size_t n) {return u16string(s,n);}
unsigned operator ""_w(const char* str) {return atoi(str);}
 

    // calls operator ""_w(1.2L)
    //d=1.2
    auto d = 1.2_w;
    
    // calls operator ""_w(u"one", 3)
    //s = u"one"
    auto s = u"one"_w;
    
    // calls operator ""_w("12")
    //n=12
    auto n = 12_w;  
Example 5 demonstrates temperature conversion from Celsius to Fahrenheit.
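Separately from Example 5, a temperature literal might look like the sketch below. The _c suffix is a hypothetical name chosen here; it takes a value in degrees Celsius and evaluates to degrees Fahrenheit. Two overloads are provided so that both 37.0_c and 100_c are accepted.

```cpp
// Hypothetical _c suffix: a Celsius literal that evaluates to Fahrenheit.
constexpr long double operator""_c(long double celsius) {
    return celsius * 9.0L / 5.0L + 32.0L;
}

// Overload so that integer literals such as 100_c are accepted too.
constexpr long double operator""_c(unsigned long long celsius) {
    return static_cast<long double>(celsius) * 9.0L / 5.0L + 32.0L;
}
```

100_c evaluates to 212 and 0_c to 32. One caveat: the minus sign is never part of the literal, so -40.0_c is parsed as -(40.0_c) and gives -104, not -40. Negative temperatures therefore need a different design, for example a small temperature type with its own unary minus.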

The char_traits classes define common behavior such as comparison, assignment, copy etc., and also other aspects such as the eof type, offset type, position type etc.
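Because basic_string delegates comparison to its traits class, swapping the traits changes how strings compare. The following is a minimal sketch of a case-insensitive string, assuming ASCII text; ci_traits and ci_string are names invented for this illustration and are not taken from Example 2.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: compares characters after lower-casing them.
struct ci_traits : std::char_traits<char> {
    static char lower(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return lower(a) == lower(b); }
    static bool lt(char a, char b) { return lower(a) < lower(b); }
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
};

// A string type whose comparisons ignore case.
using ci_string = std::basic_string<char, ci_traits>;
```

With this type, ci_string("Rao") == ci_string("RAO") is true. Only the comparison functions are overridden here, so searching with find still matches case-sensitively.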

The CRT provides a plethora of functions to handle strings. There are different functions for getting the length, appending, copying, finding etc. The std::basic_string class attempts to objectify strings so that they are easier to use.
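The contrast can be seen in a small side-by-side sketch. The function names join_c and join_cpp are hypothetical. The C runtime version needs a caller-supplied buffer and an explicit size check, while the std::string version manages its own storage.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// C runtime style: the caller supplies the buffer and its capacity.
inline bool join_c(char* out, std::size_t cap, const char* first, const char* last) {
    // Room is needed for both names, one space, and the terminating '\0'.
    if (std::strlen(first) + 1 + std::strlen(last) + 1 > cap) return false;
    std::strcpy(out, first);
    std::strcat(out, " ");
    std::strcat(out, last);
    return true;
}

// basic_string style: storage grows as needed.
inline std::string join_cpp(const std::string& first, const std::string& last) {
    return first + " " + last;
}
```

Both produce "khrisha rao" for the inputs "khrisha" and "rao", but join_c fails when the buffer is too small, and forgetting that check is a classic source of buffer overruns.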
Summary of Examples
Name | Category | Description | Github | Wandbox
Example | char_traits | Copy and Find | source, output | source + output
Example 2 | char_traits | Custom Character Traits | source, output | source + output
Example 3 | char16_t, char32_t | Usage | source, output | source + output
Example 4 | raw string literals | Usage | source, output | source + output
Example 5 | user defined literals | Celsius to Fahrenheit | source, output | source + output
Example 6 | basic_string | cstring | source, output | source + output
Example 7 | basic_string | Constructor | source, output | source + output
Example 8 | basic_string | Iterators | source, output | source + output
Example 9 | basic_string | Storage and Length | source, output | source + output
Example 10 | basic_string | String Modifications | source, output | source + output
Example 11 | basic_string | String Operations | source, output | source + output
Example 12 | basic_string | Stream Operators and Functions | source, output | source + output
Example 13 | basic_string | Strings to Arithmetic type conversion | source, output | source + output
Example 14 | basic_string | Arithmetic type to Strings conversion | source, output | source + output
Example 15 | basic_string | Shared memory custom allocator (windows) | source, output | source + output
