Modern C++ 11,14: Quick tour with examples: Unicode conversion facets

Overview

The following are codecvt-based facets, declared in the header <codecvt>, that convert between Unicode encodings: for example from UTF-8 to UCS-2, UTF-16, or UTF-32, and back again. Note that this header was deprecated in C++17.

Details

Topics
codecvt_mode
codecvt_utf8
codecvt_utf16
codecvt_utf8_utf16

codecvt_mode

This is a bitmask-style enum defined as follows:

enum codecvt_mode { consume_header = 4, generate_header = 2, little_endian = 1 };

Note that 0 is also a valid value that represents the absence of all flags.

The facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16 accept an optional value of type codecvt_mode as a template argument, which selects optional features of the Unicode string conversion.

Name	Description
little_endian	Assume the input is in little-endian byte order (applies to UTF-16 input only, the default is big-endian).
generate_header	Output the BOM at the start of the output sequence.
consume_header	Consume the BOM, if present at the start of input sequence, and (in case of UTF-16), rely on the byte order it specifies for decoding the rest of the input.
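Flags can be combined with bitwise OR. Since OR-ing two enumerators of an unscoped enum yields an int, the result has to be cast back to codecvt_mode before it is used as a template argument. The following is a minimal sketch of that idea; the helper name to_utf16le_with_bom is made up for illustration and is not part of the library.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: encode UTF-32 text as UTF-16 little-endian bytes,
// prefixed with a BOM.
std::string to_utf16le_with_bom(const std::u32string& text)
{
    // OR-ing two enumerators yields an int, so cast back to codecvt_mode.
    constexpr auto mode = static_cast<std::codecvt_mode>(
        std::generate_header | std::little_endian);

    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, mode>,
                         char32_t> conv;
    return conv.to_bytes(text);
}
```

For the input U"A" the result is the four bytes 0xff 0xfe 0x41 0x00: the little-endian BOM followed by the code unit 0x0041 in little-endian order.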

BOM types

A BOM (byte order mark) is the code point U+FEFF encoded at the very start of a stream. The resulting byte pattern identifies the encoding and, for UTF-16, the byte order of the data that follows.

Name	Values
UTF-16 big-endian	0xfe 0xff
UTF-16 little-endian	0xff 0xfe
UTF-8	0xef 0xbb 0xbf
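The table above can be turned into a small check on the first bytes of a buffer. This is only a sketch: the enum Bom and the function detect_bom are hypothetical names, and UTF-32 BOMs are deliberately ignored to match the table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only.
enum class Bom { none, utf8, utf16_be, utf16_le };

// Report which BOM from the table, if any, the buffer starts with.
Bom detect_bom(const std::string& bytes)
{
    // Read a byte as unsigned so comparisons with 0xef etc. work
    // even where plain char is signed.
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xef && at(1) == 0xbb && at(2) == 0xbf)
        return Bom::utf8;
    if (bytes.size() >= 2 && at(0) == 0xfe && at(1) == 0xff)
        return Bom::utf16_be;
    if (bytes.size() >= 2 && at(0) == 0xff && at(1) == 0xfe)
        return Bom::utf16_le;
    return Bom::none;
}
```

A buffer without any of these prefixes yields Bom::none, in which case the encoding has to be known by other means.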

Example

    // needs <algorithm>, <codecvt>, <fstream>, <iostream>, <iterator>, <locale>
    // write UTF-8 data that starts with a BOM
    ofstream{"text.txt"} << "\xef\xbb\xbfಖ್ರಿಷಾ Rao👸";

    // read the UTF-8 file, skipping the BOM
    wifstream fin{"text.txt"};
    fin.imbue(locale(fin.getloc(),
        new codecvt_utf8<wchar_t, 0x10ffff, consume_header>));

    istream_iterator<wchar_t, wchar_t> it(fin), end;
    cout << showbase << hex;
    // prints (with a 32-bit wchar_t; the space is skipped as whitespace):
    // 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x52 0x61 0x6f 0x1f478
    for_each(it, end, [](wchar_t c){ cout << (unsigned long long)c << ' '; });

codecvt_utf8

codecvt_utf8 is a codecvt facet which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 (UTF-32) character string, depending on the type of Elem. This codecvt facet can be used to read and write UTF-8 files, both text and binary.

Note that UCS-2 can only represent code points up to 0xFFFF (the Basic Multilingual Plane). Characters above that range, such as 👸 (U+1F478), cannot be converted when Elem is a 16-bit type.
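The UCS-2 limit can be observed directly: with Elem set to char16_t, a code point above 0xFFFF makes the facet report an error, which wstring_convert surfaces as std::range_error. A minimal sketch, assuming a hypothetical helper named fits_in_ucs2:

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if the UTF-8 input converts cleanly to UCS-2,
// i.e. it is valid and every code point is at or below U+FFFF.
bool fits_in_ucs2(const std::string& utf8)
{
    // With char16_t, codecvt_utf8 converts to UCS-2, not UTF-16.
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> conv;
    try {
        conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        // Thrown by wstring_convert when the facet reports an error.
        return false;
    }
}
```

Note that this sketch also returns false for malformed UTF-8, since that is reported through the same exception.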

Syntax

template<
    class Elem,
    unsigned long Maxcode = 0x10ffff,
    codecvt_mode Mode = (codecvt_mode)0 >
class codecvt_utf8
    : public std::codecvt<Elem, char, std::mbstate_t>;

template parameters

Name	Description
Elem	Aliased as member intern_type. This can be: wchar_t, char16_t or char32_t.
Maxcode	The largest code point that will be translated without reporting a conversion error.
Mode	A constant of type codecvt_mode.
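To see what Maxcode does, it can be lowered below its default. The sketch below sets it to 0x7f so that only ASCII is accepted; the helper name decodes_as_ascii is hypothetical and chosen only for this illustration.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32, but treat any code point
// above U+007F as a conversion error by setting Maxcode to 0x7f.
bool decodes_as_ascii(const std::string& utf8, std::u32string& out)
{
    std::wstring_convert<std::codecvt_utf8<char32_t, 0x7f>, char32_t> conv;
    try {
        out = conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        // The input was malformed or held a code point above Maxcode.
        return false;
    }
}
```

With this converter "Rao" is accepted, while the two-byte sequence for é (U+00E9) is rejected even though it is valid UTF-8.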

member types

Name	Description
intern_type	Elem
extern_type	char
state_type	codecvt::state_type
result	codecvt_base::result

Example

    char str8[] = u8"ಖ್ರಿಷಾ Rao👸";
    wstring_convert<codecvt_utf8<char32_t>, char32_t> u8to32;
    u32string str32 = u8to32.from_bytes(str8);
    cout << std::showbase << std::hex;
    //prints:0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x20 0x52 0x61 0x6f 0x1f478 
    for (char32_t c : str32)
        cout << static_cast<unsigned long long>(c) << ' ';
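The same converter type also works in the reverse direction through to_bytes. A short sketch of both directions, wrapped in two hypothetical helper functions (utf8_to_utf32 and utf32_to_utf8 are names invented here); the UTF-8 input is spelled as byte escapes so that it does not depend on the source file encoding.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes to UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: UTF-32 code points back to UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

For example, utf32_to_utf8(U"\U0001F478") gives the four bytes 0xf0 0x9f 0x91 0xb8, and passing those bytes to utf8_to_utf32 gives back the single code point U+1F478.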

codecvt_utf16

codecvt_utf16 is a codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS-2 or UTF-32 character string (depending on the type of Elem). This codecvt facet can be used to read and write UTF-16 files in binary mode.

Syntax

template<
    class Elem,
    unsigned long Maxcode = 0x10ffff,
    std::codecvt_mode Mode = (std::codecvt_mode)0 >
class codecvt_utf16
    : public std::codecvt<Elem, char, std::mbstate_t>;

template parameters

Name	Description
Elem	Aliased as member intern_type. This can be: wchar_t, char16_t or char32_t.
Maxcode	The largest code point that will be translated without reporting a conversion error.
Mode	A constant of type codecvt_mode.

member types

Name	Description
intern_type	Elem
extern_type	char
state_type	codecvt::state_type
result	codecvt_base::result

Example

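    // needs <algorithm>, <codecvt>, <fstream>, <iostream>, <iterator>, <locale>
    // the bytes of a char16_t array are UTF-16LE only on a little-endian host
    char16_t utf16le[] = u"ಖ್ರಿಷಾ Rao👸";
    ofstream oss("test.txt", ios::binary);  // binary: no newline translation
    oss.write(reinterpret_cast<char*>(utf16le), sizeof utf16le);
    oss.close();

    wifstream wiss("test.txt", ios::binary | ios::in);
    wiss.imbue(locale(wiss.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>));
    istream_iterator<wchar_t, wchar_t> oit(wiss), end;
    cout << showbase << hex;
    // prints (the space is skipped; the final 0 is the terminating null that sizeof includes):
    // 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x52 0x61 0x6f 0x1f478 0
    for_each(oit, end, [](wchar_t c){ cout << (unsigned long long)c << ' '; });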
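The facet is not tied to file streams. UTF-16 bytes that are already in memory can be decoded with wstring_convert as well, and consume_header lets a leading BOM pick the byte order. A minimal sketch under those assumptions; decode_utf16_bytes is a hypothetical helper name.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode a UTF-16 byte buffer into UTF-32.
// consume_header lets a leading BOM choose the byte order; without a BOM
// the facet falls back to its default, big-endian.
std::u32string decode_utf16_bytes(const std::string& bytes)
{
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, std::consume_header>,
        char32_t> conv;
    return conv.from_bytes(bytes);
}
```

Because UTF-16 data usually contains zero bytes, the buffer should be built with an explicit length, for example std::string("\xff\xfe\x41\x00", 4), which decodes to U"A".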

codecvt_utf8_utf16

codecvt_utf8_utf16 is a codecvt facet which encapsulates conversion between a UTF-8 encoded byte string and UTF-16 encoded character string. If Elem is a 32-bit type, one UTF-16 code unit will be stored in each 32-bit character of the output sequence.

Because UTF-16 is a variable-width encoding, the resulting string can represent every Unicode code point: code points above 0xFFFF are stored as a surrogate pair of two code units.

This is an N:M conversion facet, and cannot be used with std::basic_filebuf (which only permits 1:N conversions, such as UTF-32/UTF-8, between the internal and the external encodings). This facet can be used with std::wstring_convert.
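The N:M nature of the conversion shows up in the lengths: a 3-byte UTF-8 sequence becomes one UTF-16 code unit, while a 4-byte sequence becomes two. A small sketch with wstring_convert, using two hypothetical helpers (utf8_to_utf16 and utf16_to_utf8 are illustrative names only):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: UTF-16 code units back to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Here utf8_to_utf16("\xf0\x9f\x91\xb8") returns the surrogate pair 0xd83d 0xdc78 for U+1F478, and utf16_to_utf8 turns that pair back into the original four bytes.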

Syntax

template<
    class Elem,
    unsigned long Maxcode = 0x10ffff,
    std::codecvt_mode Mode = (std::codecvt_mode)0 >
class codecvt_utf8_utf16
    : public std::codecvt<Elem, char, std::mbstate_t>;

template parameters

Name	Description
Elem	Aliased as member intern_type. This can be: wchar_t, char16_t or char32_t.
Maxcode	The largest code point that will be translated without reporting a conversion error.
Mode	A constant of type codecvt_mode.

member types

Name	Description
intern_type	Elem
extern_type	char
state_type	codecvt::state_type
result	codecvt_base::result

Example

    char str8[] = u8"ಖ್ರಿಷಾ Rao👸";
    wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> u8to16;
    u16string str16 = u8to16.from_bytes(str8);
    cout << std::showbase << std::hex;
    //prints:0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x20 0x52 0x61 0x6f 0xd83d 0xdc78
    for (char16_t c : str16)
        cout << static_cast<unsigned> (c) << ' ';

Wednesday, February 12, 2025
