Wednesday, February 12, 2025

Unicode conversion facets

Overview
The following are codecvt-based facets, declared in the <codecvt> header (deprecated since C++17), that convert between Unicode encodings: for example, from UTF-8 to UCS-2, UTF-16, or UTF-32, and back again.
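As a rough illustration of converting in both directions, here is a minimal sketch that round-trips a string through UTF-32 with std::wstring_convert. The helper names utf8_to_utf32 and utf32_to_utf8 are made up for this example and are not part of the library.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: encode UTF-32 code points back into UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

Passing the result of utf8_to_utf32 straight into utf32_to_utf8 should give back the original bytes for any valid UTF-8 input.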

Details


codecvt_mode
This is a bitmask-style enum defined as follows:
enum codecvt_mode { consume_header = 4, generate_header = 2, little_endian = 1 };
Note that 0 is also a valid value; it represents the absence of all flags.

The facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16 accept an optional value of type codecvt_mode as a template argument, which selects optional features of the Unicode string conversion.
Name              Description
little_endian     Assume the input is in little-endian byte order (applies to UTF-16 input only, the default is big-endian).
generate_header   Output the BOM at the start of the output sequence.
consume_header    Consume the BOM, if present at the start of the input sequence, and (in the case of UTF-16) rely on the byte order it specifies for decoding the rest of the input.
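Flags can be combined with bitwise OR. Because codecvt_mode is an unscoped enum, the OR of two flags is an int and has to be cast back before it can be used as the Mode template argument. The sketch below shows one way to do that; the names bom_or_little_endian and decode_utf16 are illustrative only.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// OR-ing two codecvt_mode flags yields an int, so cast the result back.
constexpr std::codecvt_mode bom_or_little_endian =
    static_cast<std::codecvt_mode>(std::consume_header | std::little_endian);

static_assert(bom_or_little_endian == 5, "consume_header (4) | little_endian (1)");

// Hypothetical helper: decode UTF-16 bytes to UTF-32, skipping a leading
// BOM if there is one and otherwise assuming little-endian input.
std::u32string decode_utf16(const std::string& bytes) {
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, bom_or_little_endian>,
        char32_t> conv;
    return conv.from_bytes(bytes);
}
```

With this mode, the little-endian bytes 0x41 0x00 decode to U+0041 whether or not they are preceded by the little-endian BOM 0xff 0xfe.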

BOM types
A BOM (byte order mark) is a short byte sequence at the very start of a Unicode stream that identifies its encoding and, for UTF-16, its byte order.
Name                    Values
UTF-16 big-endian       0xfe 0xff
UTF-16 little-endian    0xff 0xfe
UTF-8                   0xef 0xbb 0xbf
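To show how these byte sequences can be recognized by hand, here is a small sketch in plain C++ that checks the start of a buffer against the three BOMs in the table. The Bom enum and the detect_bom function are invented for this illustration; they are not library facilities.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for the sketch.
enum class Bom { none, utf8, utf16_be, utf16_le };

// Report which of the three BOMs listed above, if any, the buffer starts with.
Bom detect_bom(const std::string& bytes) {
    // compare() clips the substring to the buffer size, so a buffer that is
    // shorter than the signature simply compares unequal.
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.compare(0, n, sig, n) == 0;
    };
    if (starts_with("\xef\xbb\xbf", 3)) return Bom::utf8;
    if (starts_with("\xfe\xff", 2)) return Bom::utf16_be;
    if (starts_with("\xff\xfe", 2)) return Bom::utf16_le;
    return Bom::none;
}
```

A buffer without any of these prefixes yields Bom::none, which is the case the default big-endian assumption of codecvt_utf16 is meant for.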

Example
    ofstream{"text.txt"} << "\xef\xbb\xbfಖ್ರಿಷಾ Rao👸";
 
    // read the UTF-8 file, skipping the BOM
    wifstream fin{"text.txt"};
    fin.imbue(locale(fin.getloc(),
        new codecvt_utf8<wchar_t, 0x10ffff, consume_header>));
 
    istream_iterator<wchar_t, wchar_t> it(fin), end;
    cout << showbase << hex;
    // prints: 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x52 0x61 0x6f 0x1f478
    // (istream_iterator skips the space; assumes a 32-bit wchar_t)
    for_each(it, end, [](wchar_t c){ cout << (unsigned long long)c << ' '; });

codecvt_utf8
codecvt_utf8 is a codecvt facet that encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 (UTF-32) character string, depending on the type of Elem. This facet can be used to read and write UTF-8 files, both text and binary.
Note that UCS-2 can only represent code points up to 0xFFFF, so it cannot hold characters outside the Basic Multilingual Plane, such as the emoji used in the examples here.

Syntax
template<
    class Elem,
    unsigned long Maxcode = 0x10ffff,
    codecvt_mode Mode = (codecvt_mode)0 >
class codecvt_utf8
    : public std::codecvt<Elem, char, std::mbstate_t>;

Template parameters
Name       Description
Elem       The internal character type, aliased as the member intern_type. One of wchar_t, char16_t, or char32_t.
Maxcode    The largest code point that will be translated without reporting a conversion error.
Mode       A constant of type codecvt_mode.
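The effect of Elem and Maxcode on what can be converted can be probed with a small sketch. The can_decode helper below is hypothetical; it relies on std::wstring_convert throwing std::range_error when a conversion fails and no error string was passed to its constructor.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if every code point in `utf8` can be converted
// for the given element type and Maxcode.
template <class Elem, unsigned long Maxcode = 0x10ffff>
bool can_decode(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<Elem, Maxcode>, Elem> conv;
    try {
        conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;  // a code point was out of range for this facet
    }
}
```

For instance, can_decode<char32_t> accepts the UTF-8 bytes of U+1F478, while can_decode<char16_t> rejects them because the code point does not fit in UCS-2. Lowering Maxcode to 0x7f restricts the facet to ASCII in the same way.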

Member types
Name           Description
intern_type    Elem
extern_type    char
state_type     codecvt::state_type
result         codecvt_base::result

Example
    // Since C++20, u8"..." has type const char8_t[], so this line compiles only as C++11/14/17
    char str8[] = u8"ಖ್ರಿಷಾ Rao👸";
    wstring_convert<codecvt_utf8<char32_t>, char32_t> u8to32;
    u32string str32 = u8to32.from_bytes(str8);
    cout << std::showbase << std::hex;
    //prints:0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x20 0x52 0x61 0x6f 0x1f478 
    for (char32_t c : str32)
        cout << static_cast<unsigned long long>(c) << ' ';

codecvt_utf16
codecvt_utf16 is a codecvt facet that encapsulates conversion between a UTF-16 encoded byte string and a UCS-2 or UTF-32 character string, depending on the type of Elem. This facet can be used to read and write UTF-16 files in binary mode.

Syntax
template<
    class Elem,
    unsigned long Maxcode = 0x10ffff,
    std::codecvt_mode Mode = (std::codecvt_mode)0 >
class codecvt_utf16
    : public std::codecvt<Elem, char, std::mbstate_t>;

Template parameters
Name       Description
Elem       The internal character type, aliased as the member intern_type. One of wchar_t, char16_t, or char32_t.
Maxcode    The largest code point that will be translated without reporting a conversion error.
Mode       A constant of type codecvt_mode.

Member types
Name           Description
intern_type    Elem
extern_type    char
state_type     codecvt::state_type
result         codecvt_base::result

Example
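    // UTF-16 in the host's byte order; the name assumes a little-endian machine
    char16_t utf16le[] = u"ಖ್ರಿಷಾ Rao👸";
    ofstream oss("test.txt", ios::binary);  // binary mode, so no byte gets translated
    // sizeof includes the terminating null, which is why a 0 is printed at the end
    oss.write(reinterpret_cast<char*>(utf16le), sizeof utf16le);
    oss.close();

    wifstream wiss("test.txt", ios::binary | ios_base::in);
    wiss.imbue(locale(wiss.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>));
    istream_iterator<wchar_t, wchar_t> oit(wiss), end;
    cout << showbase << hex;
    // prints: 0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x52 0x61 0x6f 0x1f478 0
    // (istream_iterator skips the space; assumes a 32-bit wchar_t)
    for_each(oit, end, [](wchar_t c){ cout << (unsigned long long)c << ' '; });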
    

codecvt_utf8_utf16
codecvt_utf8_utf16 is a codecvt facet that encapsulates conversion between a UTF-8 encoded byte string and a UTF-16 encoded character string. If Elem is a 32-bit type, one UTF-16 code unit is stored in each 32-bit character of the output sequence.
Because UTF-16 is a variable-width encoding, it can represent every Unicode code point.
This is an N:M conversion facet, so it cannot be used with std::basic_filebuf, which only permits 1:N conversions (such as UTF-32/UTF-8) between the internal and the external encodings. It can be used with std::wstring_convert.
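To make the variable-width point concrete: a code point above 0xFFFF occupies four bytes in UTF-8 and two code units (a surrogate pair) in UTF-16. The arithmetic behind the pair can be sketched in plain C++ as below; to_surrogates is a name invented for this example.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point above U+FFFF into its UTF-16
// surrogate pair. Precondition: 0x10000 <= cp <= 0x10FFFF.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    const char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return {high, low};
}
```

For U+1F478 this gives the pair 0xd83d 0xdc78.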

Syntax
template<
    class Elem,
    unsigned long Maxcode = 0x10ffff,
    std::codecvt_mode Mode = (std::codecvt_mode)0 >
class codecvt_utf8_utf16
    : public std::codecvt<Elem, char, std::mbstate_t>;

Template parameters
Name       Description
Elem       The internal character type, aliased as the member intern_type. One of wchar_t, char16_t, or char32_t.
Maxcode    The largest code point that will be translated without reporting a conversion error.
Mode       A constant of type codecvt_mode.

Member types
Name           Description
intern_type    Elem
extern_type    char
state_type     codecvt::state_type
result         codecvt_base::result

Example
    // Since C++20, u8"..." has type const char8_t[], so this line compiles only as C++11/14/17
    char str8[] = u8"ಖ್ರಿಷಾ Rao👸";
    wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> u8to16;
    u16string str16 = u8to16.from_bytes(str8);
    cout << std::showbase << std::hex;
    //prints:0xc96 0xccd 0xcb0 0xcbf 0xcb7 0xcbe 0x20 0x52 0x61 0x6f 0xd83d 0xdc78
    for (char16_t c : str16)
        cout << static_cast<unsigned> (c) << ' ';     
