Overview
The facets described here are codecvt-based and convert between Unicode encodings, for example from UTF-8 to UCS-2, UTF-16, or UTF-32, and back.
Details
codecvt_mode
codecvt_mode is a bitmask enumeration, defined as follows:

    enum codecvt_mode { consume_header = 4, generate_header = 2, little_endian = 1 };

The value 0 is also valid and means that no flag is set.
The facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16 accept an optional codecvt_mode value as their third template argument. It selects optional features of the Unicode conversion:
Name | Description |
---|---|
little_endian | Assume the input is in little-endian byte order (applies to UTF-16 input only, the default is big-endian). |
generate_header | Output the BOM at the start of the output sequence. |
consume_header | Consume the BOM, if present at the start of input sequence, and (in case of UTF-16), rely on the byte order it specifies for decoding the rest of the input. |
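Because the flags are bit values, several of them can be combined into one Mode argument. The sketch below is only an illustration: the helper name decode_utf16le and the constant kConsumeLe are made up for this example, and it uses std::wstring_convert (deprecated since C++17, so a compiler may warn). It decodes little-endian UTF-16 bytes that may or may not start with a matching BOM.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// operator| on the unscoped enum yields an int, so cast back to codecvt_mode.
constexpr std::codecvt_mode kConsumeLe =
    static_cast<std::codecvt_mode>(std::consume_header | std::little_endian);

// Hypothetical helper: decode little-endian UTF-16 bytes to UTF-32,
// skipping a leading little-endian BOM if one is present.
std::u32string decode_utf16le(const std::string& bytes) {
    assert(bytes.size() % 2 == 0);  // UTF-16 code units are two bytes each
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, kConsumeLe>, char32_t> conv;
    return conv.from_bytes(bytes);  // throws std::range_error on invalid input
}
```

Calling the helper with the bytes ff fe 52 00 or with 52 00 alone should yield the single character U+0052 in both cases. The case where the BOM contradicts little_endian is deliberately not exercised here.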
BOM types
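A byte order mark (BOM) is the code point U+FEFF encoded at the very start of a Unicode stream. The resulting byte pattern identifies the encoding and, for UTF-16, the byte order of the data that follows.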
Encoding | BOM bytes |
---|---|
UTF-16 big-endian | 0xfe 0xff |
UTF-16 little-endian | 0xff 0xfe |
UTF-8 | 0xef 0xbb 0xbf |
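The byte patterns in the table can be checked by hand before choosing a facet. The following is a minimal sketch, not part of the library: the Bom enumeration and the detect_bom function are names invented for this example, and only the three BOMs listed above are recognized (UTF-32 BOMs are not handled).

```cpp
#include <cstddef>
#include <string>

enum class Bom { none, utf8, utf16_be, utf16_le };

// Hypothetical helper: classify the BOM, if any, at the start of a byte buffer.
Bom detect_bom(const std::string& bytes) {
    // True when the buffer begins with the n bytes pointed to by sig.
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.compare(0, n, sig, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3)) return Bom::utf8;
    if (starts_with("\xFE\xFF", 2)) return Bom::utf16_be;
    if (starts_with("\xFF\xFE", 2)) return Bom::utf16_le;
    return Bom::none;
}
```

A caller could use the result to decide between codecvt_utf8 and codecvt_utf16, and whether to pass little_endian.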
Example
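    #include <algorithm>
    #include <codecvt>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <locale>
    using namespace std;

    int main() {
        // write UTF-8 data preceded by a BOM (assumes a UTF-8 encoded source file)
        ofstream{"text.txt"} << "\xef\xbb\xbfಖ್ರಿಷಾ Rao👸";
        // read the UTF-8 file, skipping the BOM
        wifstream fin{"text.txt"};
        fin.imbue(locale(fin.getloc(), new codecvt_utf8<wchar_t, 0x10ffff, consume_header>));
        istream_iterator<wchar_t, wchar_t> it(fin), end;
        cout << showbase << hex;
        // prints (32-bit wchar_t; istream_iterator skips the space):
        // 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x52 0x61 0x6f 0x1f478
        for_each(it, end, [](wchar_t c) { cout << (unsigned long long)c << ' '; });
    }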
codecvt_utf8
codecvt_utf8 is a codecvt facet that converts between a UTF-8 encoded byte string and a UCS-2 or UCS-4 (UTF-32) character string, depending on the type of Elem. It can be used to read and write UTF-8 files, both text and binary.
Note that UCS-2 can only represent code points up to U+FFFF, so characters above that limit (such as 👸, U+1F478) cannot be converted to it.
Syntax
    template< class Elem,
              unsigned long Maxcode = 0x10ffff,
              std::codecvt_mode Mode = (std::codecvt_mode)0 >
    class codecvt_utf8 : public std::codecvt<Elem, char, std::mbstate_t>;

Template parameters
Name | Description |
---|---|
Elem | Aliased as member intern_type. This can be: wchar_t, char16_t or char32_t. |
Maxcode | The largest code point that will be translated without reporting a conversion error. |
Mode | A constant of type codecvt_mode. |
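The effect of Maxcode can be observed by lowering it. This is a hedged sketch: the function name converts_with_bmp_limit is invented for the example, and it relies on std::wstring_convert (deprecated since C++17) throwing std::range_error when the facet reports a conversion error. With Maxcode set to 0xffff, any code point above U+FFFF is rejected.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if the UTF-8 input converts cleanly
// when the facet is limited to code points <= U+FFFF.
// Note: malformed UTF-8 also makes this return false.
bool converts_with_bmp_limit(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t, 0xffff>, char32_t> conv;
    try {
        conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;  // conversion error reported by the facet
    }
}
```

The bytes f0 9f 91 b8 (U+1F478) should be rejected, while plain ASCII or the three-byte sequence e0 b2 96 (U+0C96) should pass.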
Member types
Name | Description |
---|---|
intern_type | Elem |
extern_type | char |
state_type | codecvt::state_type |
result | codecvt_base::result |
Example
    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <string>
    using namespace std;

    int main() {
        char str8[] = u8"ಖ್ರಿಷಾ Rao👸";  // u8 literals are char-based before C++20
        wstring_convert<codecvt_utf8<char32_t>, char32_t> u8to32;
        u32string str32 = u8to32.from_bytes(str8);
        cout << std::showbase << std::hex;
        // prints: 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x20 0x52 0x61 0x6f 0x1f478
        for (char32_t c : str32)
            cout << static_cast<unsigned long long>(c) << ' ';
    }
codecvt_utf16
codecvt_utf16 is a codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS-2 or UTF-32 character string (depending on the type of Elem). This codecvt facet can be used to read and write UTF-16 files in binary mode.
Syntax
    template< class Elem,
              unsigned long Maxcode = 0x10ffff,
              std::codecvt_mode Mode = (std::codecvt_mode)0 >
    class codecvt_utf16 : public std::codecvt<Elem, char, std::mbstate_t>;

Template parameters
Name | Description |
---|---|
Elem | Aliased as member intern_type. This can be: wchar_t, char16_t or char32_t. |
Maxcode | The largest code point that will be translated without reporting a conversion error. |
Mode | A constant of type codecvt_mode. |
Member types
Name | Description |
---|---|
intern_type | Elem |
extern_type | char |
state_type | codecvt::state_type |
result | codecvt_base::result |
Example
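    #include <algorithm>
    #include <codecvt>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <locale>
    using namespace std;

    int main() {
        // u"" holds native-endian code units; a little-endian host is assumed
        char16_t utf16le[] = u"ಖ್ರಿಷಾ Rao👸";
        ofstream oss("test.txt", ios::binary);
        // sizeof includes the terminating null character
        oss.write(reinterpret_cast<char*>(utf16le), sizeof utf16le);
        oss.close();
        wifstream wiss("test.txt", ios::binary);
        wiss.imbue(locale(wiss.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>));
        istream_iterator<wchar_t, wchar_t> oit(wiss), end;
        cout << showbase << hex;
        // prints (32-bit wchar_t): 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x52 0x61 0x6f 0x1f478 0
        for_each(oit, end, [](wchar_t c) { cout << (unsigned long long)c << ' '; });
    }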
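The facet can also be used in memory, without a file stream. The sketch below assumes std::wstring_convert (deprecated since C++17) and uses two made-up helper names, to_utf16be_bytes and from_utf16be_bytes. With the default Mode of 0 the byte string is big-endian UTF-16.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: encode UTF-32 text as big-endian UTF-16 bytes
// (big-endian is the facet's default byte order).
std::string to_utf16be_bytes(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}

// Hypothetical helper: decode big-endian UTF-16 bytes back to UTF-32.
std::u32string from_utf16be_bytes(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);  // throws std::range_error on invalid input
}
```

For U+1F478 the encoder should produce the four bytes d8 3d dc 78, which is the surrogate pair 0xd83d 0xdc78 written high byte first.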
codecvt_utf8_utf16
codecvt_utf8_utf16 is a codecvt facet which encapsulates conversion between a UTF-8 encoded byte string and UTF-16 encoded character string. If Elem is a 32-bit type, one UTF-16 code unit will be stored in each 32-bit character of the output sequence.
UTF-16 is a variable-width encoding: code points above U+FFFF are stored as a surrogate pair of two 16-bit code units, so a UTF-16 string can represent every Unicode code point.
This is an N:M conversion facet, and cannot be used with std::basic_filebuf (which only permits 1:N conversions, such as UTF-32/UTF-8, between the internal and the external encodings). This facet can be used with std::wstring_convert.
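The variable-width nature of UTF-16 comes down to simple arithmetic on code points above U+FFFF. The following sketch shows that arithmetic on its own; to_surrogate_pair is a hypothetical helper written for this example and is not part of the facet's interface.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogates.
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);  // only supplementary code points
    const char32_t v = cp - 0x10000;          // 20-bit offset
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    const char16_t low = static_cast<char16_t>(0xDC00 + (v & 0x3FF));   // low 10 bits
    return {high, low};
}
```

For U+1F478 this gives 0xd83d and 0xdc78, the same two code units that codecvt_utf8_utf16 produces for that character.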
Syntax
    template< class Elem,
              unsigned long Maxcode = 0x10ffff,
              std::codecvt_mode Mode = (std::codecvt_mode)0 >
    class codecvt_utf8_utf16 : public std::codecvt<Elem, char, std::mbstate_t>;

Template parameters
Name | Description |
---|---|
Elem | Aliased as member intern_type. This can be: wchar_t, char16_t or char32_t. |
Maxcode | The largest code point that will be translated without reporting a conversion error. |
Mode | A constant of type codecvt_mode. |
Member types
Name | Description |
---|---|
intern_type | Elem |
extern_type | char |
state_type | codecvt::state_type |
result | codecvt_base::result |
Example
    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <string>
    using namespace std;

    int main() {
        char str8[] = u8"ಖ್ರಿಷಾ Rao👸";  // u8 literals are char-based before C++20
        wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> u8to16;
        u16string str16 = u8to16.from_bytes(str8);
        cout << std::showbase << std::hex;
        // prints: 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x20 0x52 0x61 0x6f 0xd83d 0xdc78
        for (char16_t c : str16)
            cout << static_cast<unsigned>(c) << ' ';
    }
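The same converter type also works in the opposite direction through to_bytes. This is a small sketch under the same assumptions as the example above (std::wstring_convert is available, although deprecated since C++17); the function name utf16_to_utf8 is hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert a UTF-16 string to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(text);  // throws std::range_error on invalid input
}
```

A surrogate pair such as 0xd83d 0xdc78 should collapse into the single four-byte UTF-8 sequence f0 9f 91 b8.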