Tuesday, January 14, 2025

codecvt facet

Overview
The codecvt facet belongs to the ctype category. codecvt (codeset conversion) converts characters between an external encoding, typically multibyte, and an internal encoding, typically wide characters, in both directions.

Details
codecvt_base 
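Declares the enumeration type result, defined as below. The codecvt conversion functions return these values to report the outcome of an operation.
Name             Description
result::ok       Conversion successful
result::partial  Partial conversion: more input or more output space is needed
result::error    Conversion error: the input contains a character that cannot be converted
result::noconv   No conversion was needed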
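
The four values can be observed with a small sketch. It assumes the standard codecvt<char32_t, char, mbstate_t> specialization (UTF-8 to UTF-32, deprecated since C++20 but usable without a named locale). decode_status and copy_status are hypothetical helpers written for this illustration, not library functions.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into a buffer holding at most
// `capacity` code points and reports the result value returned by in().
std::codecvt_base::result decode_status(const std::string& bytes, std::size_t capacity) {
    using Cvt = std::codecvt<char32_t, char, std::mbstate_t>;
    const Cvt& f = std::use_facet<Cvt>(std::locale::classic());

    std::vector<char32_t> buf(capacity + 1);  // +1 keeps data() valid when capacity is 0
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;
    return f.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
                buf.data(), buf.data() + capacity, to_next);
}

// Hypothetical helper: the char-to-char facet never converts, so in() reports noconv.
std::codecvt_base::result copy_status(const std::string& bytes) {
    using Cvt = std::codecvt<char, char, std::mbstate_t>;
    const Cvt& f = std::use_facet<Cvt>(std::locale::classic());

    std::string buf(bytes.size() + 1, '\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    char* to_next = nullptr;
    return f.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
                &buf[0], &buf[0] + bytes.size(), to_next);
}
```

With these helpers, decode_status("Rao", 8) reports ok, decode_status("Rao", 2) reports partial because the output buffer is too small, decode_status("\xFF", 8) reports error because 0xFF is never valid in UTF-8, and copy_status("Rao") reports noconv.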

codecvt
The template class codecvt encapsulates conversion of character strings, including wide and multibyte, from one encoding to another. All file I/O operations performed through file stream objects use the codecvt facet of the locale imbued in the stream.

Syntax
template <class internT, class externT, class stateT> 
class codecvt: public locale::facet, public codecvt_base

types
Name    Description
intern_type (internT)  First template parameter. Internal character type, which can be char, wchar_t, char16_t or char32_t. Typically, this is the wide character type.
extern_type (externT)  Second template parameter. External character type, which is char in the standard specializations. Typically, this is the multibyte character type.
state_type (stateT)  Third template parameter. A state type: typically, this is an object able to keep track of the state of the conversion, such as mbstate_t (or, more generically, char_traits<externT>::state_type).
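
As a quick check of how the template parameters map to the member types, here is a compile-time sketch. The alias WideCvt is a name chosen for this illustration only.

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <type_traits>

// Illustrative alias for the wide/narrow specialization.
using WideCvt = std::codecvt<wchar_t, char, std::mbstate_t>;

// Each template parameter is exposed as a member type.
static_assert(std::is_same<WideCvt::intern_type, wchar_t>::value,
              "intern_type is the first template parameter");
static_assert(std::is_same<WideCvt::extern_type, char>::value,
              "extern_type is the second template parameter");
static_assert(std::is_same<WideCvt::state_type, std::mbstate_t>::value,
              "state_type is the third template parameter");

// For char, the generic spelling of the state type is mbstate_t as well.
static_assert(std::is_same<std::char_traits<char>::state_type, std::mbstate_t>::value,
              "char_traits<char>::state_type is mbstate_t");
```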

Specializations
//performs no conversion
codecvt<char,char,mbstate_t>

//converts between native wide and narrow character sets
codecvt<wchar_t,char,mbstate_t>	

//converts between UTF-16 and UTF-8 encodings (deprecated since C++20)
codecvt<char16_t,char,mbstate_t>

//converts between UTF-32 and UTF-8 encodings (deprecated since C++20)
codecvt<char32_t,char,mbstate_t>

codecvt is derived from the classes locale::facet and codecvt_base.
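
A minimal sketch to confirm that a locale carries these specializations. standard_codecvt_count is a hypothetical helper. On C++20 and later, compilers may warn that the char16_t and char32_t specializations are deprecated.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: counts how many of the four specializations
// listed above are present in the given locale.
int standard_codecvt_count(const std::locale& loc) {
    int n = 0;
    if (std::has_facet<std::codecvt<char, char, std::mbstate_t>>(loc)) ++n;
    if (std::has_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc)) ++n;
    if (std::has_facet<std::codecvt<char16_t, char, std::mbstate_t>>(loc)) ++n;
    if (std::has_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc)) ++n;
    return n;
}
```

Calling standard_codecvt_count(std::locale::classic()) is expected to return 4 on a conforming implementation, since the standard requires every locale to contain these facets.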

Fields
Name    Description
static locale::id id  The identifier of the facet. use_facet and has_facet use it to look the facet up in a locale. The facet itself belongs to the ctype category.

Constructor
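Name    Description
explicit codecvt(size_t refs = 0)  Creates a codecvt facet and forwards the starting reference count refs to the base class constructor, locale::facet::facet(). With refs equal to 0, the locale that holds the facet destroys it when it is no longer needed. With refs equal to 1, the caller manages the lifetime of the facet.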
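
The effect of refs can be sketched with a derived facet. The destructor of codecvt is protected, so a derived class is the usual way to create one directly. passthrough_codecvt and the two functions are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical derived facet: it only forwards refs to the codecvt constructor.
class passthrough_codecvt : public std::codecvt<char, char, std::mbstate_t> {
public:
    explicit passthrough_codecvt(std::size_t refs = 0)
        : std::codecvt<char, char, std::mbstate_t>(refs) {}
};

// refs == 0: the locale takes ownership and deletes the facet when it is done.
bool locale_uses_installed_facet() {
    std::locale loc(std::locale::classic(), new passthrough_codecvt);
    const std::codecvt<char, char, std::mbstate_t>& f =
        std::use_facet<std::codecvt<char, char, std::mbstate_t>>(loc);
    // The facet found in loc is the object installed above.
    return dynamic_cast<const passthrough_codecvt*>(&f) != nullptr;
}

// refs == 1: the facet is an ordinary object whose lifetime the caller manages.
bool standalone_facet_never_converts() {
    passthrough_codecvt f(1);
    return f.always_noconv();
}
```

Both functions are expected to return true: the first because the installed facet replaces the default one under the same id, the second because the char-to-char conversion is always a plain copy.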

Methods
conversion functions
Name    Description
result in(state_type& state,
          const extern_type* from, const extern_type* from_end,
          const extern_type*& from_next,
          intern_type* to, intern_type* to_limit,
          intern_type*& to_next) const
If this codecvt facet defines a conversion, translates the external characters in the source range [from, from_end) to internal characters, placing the results in consecutive locations starting at to. Converts no more than from_end - from external characters and writes no more than to_limit - to internal characters. On return, from_next points one past the last external character consumed and to_next points one past the last internal character written.
Returns a value of the result enumeration type indicating the outcome of the operation.
If this codecvt facet does not define a conversion, no characters are converted, to_next is set equal to to, state is unchanged, and codecvt_base::noconv is returned.
Example
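    // Assumes an "en_US.utf8" locale is installed; locale("...") throws std::runtime_error otherwise.
    string from = reinterpret_cast<const char*>(u8"ಖ್ರಿಷಾ Rao👸");
    locale::global(locale("en_US.utf8"));
    auto& f = use_facet<codecvt<wchar_t, char, mbstate_t>>(locale());

    const char* from_begin = from.data();
    const char* from_end = from_begin + from.size();

    // length() reports how many bytes would be consumed. Every wide character
    // needs at least one byte, so this is also a safe output size.
    mbstate_t mb = mbstate_t();
    auto len = f.length(mb, from_begin, from_end, from.length());

    const char *from_next = nullptr;
    wchar_t* to_next = nullptr;
    wstring to(len+1, L'\0');

    mb = mbstate_t();   // restart from the initial conversion state
    auto r = f.in(mb, from_begin, from_end, from_next, &to[0], &to[0] + to.size(), to_next);
    if (r == codecvt_base::ok)
    {
        // Trim the buffer to the characters actually written.
        to.resize(to_next - to.data());
        //prints ಖ್ರಿಷಾ Rao👸
        wcout << to;
    }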
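
When the output buffer is smaller than the converted text, in returns partial and the call has to be repeated from from_next. The following is a minimal sketch of such a loop, assuming the UTF-8 to UTF-32 specialization codecvt<char32_t, char, mbstate_t> (deprecated since C++20) so that no named locale is required. utf8_to_utf32 is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-32 through a deliberately small
// buffer, so in() returns partial and has to be called repeatedly.
std::u32string utf8_to_utf32(const std::string& utf8) {
    using Cvt = std::codecvt<char32_t, char, std::mbstate_t>;
    const Cvt& f = std::use_facet<Cvt>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    std::u32string result;
    char32_t buf[4];
    const char* from = utf8.data();
    const char* from_end = from + utf8.size();

    while (from != from_end) {
        const char* from_next = from;
        char32_t* to_next = buf;
        Cvt::result r = f.in(state, from, from_end, from_next, buf, buf + 4, to_next);
        if (r == Cvt::error)
            throw std::runtime_error("invalid UTF-8 input");
        if (from_next == from)  // no progress: the input ends inside a character
            throw std::runtime_error("truncated UTF-8 input");
        result.append(buf, static_cast<std::size_t>(to_next - buf));
        from = from_next;       // continue where the previous call stopped
    }
    return result;
}
```

For example, utf8_to_utf32("\xF0\x9F\x91\xB8") yields a single code point, U+1F478.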
result out(state_type& state,
           const intern_type* from, const intern_type* from_end,
           const intern_type*& from_next,
           extern_type* to, extern_type* to_limit,
           extern_type*& to_next) const
If this codecvt facet defines a conversion, translates the internal characters in the source range [from, from_end) to external characters, placing the results in consecutive locations starting at to. Converts no more than from_end - from internal characters and writes no more than to_limit - to external characters. On return, from_next points one past the last internal character consumed and to_next points one past the last external character written.
Returns a value of the result enumeration type indicating the outcome of the operation.
If this codecvt facet does not define a conversion, no characters are converted, to_next is set equal to to, state is unchanged, and codecvt_base::noconv is returned.
Example
    // Assumes an "en_US.utf8" locale is installed; locale("...") throws std::runtime_error otherwise.
    wstring from = L"ಖ್ರಿಷಾ Rao👸";

    locale::global(locale("en_US.utf8"));
    auto& f = use_facet<codecvt<wchar_t, char, mbstate_t>>(locale());

    // Worst case: every wide character expands to max_length() bytes.
    auto len = f.max_length()*from.length();

    const wchar_t *from_next = nullptr;
    char* to_next = nullptr;
    string to(len+1,'\0');

    auto mb = mbstate_t();
    auto r = f.out(mb, from.data(), from.data() + from.size(), from_next,
                   &to[0], &to[0] + to.size(), to_next);
    if (r == codecvt_base::ok)
    {
        // Trim the buffer to the bytes actually written.
        to.resize(to_next - to.data());
        cout << len << " " << to.length() << endl;
        //prints ಖ್ರಿಷಾ Rao👸
        cout << to;
    }
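
The same sizing technique works without a named locale. Below is a sketch that assumes the UTF-32 to UTF-8 specialization codecvt<char32_t, char, mbstate_t> (deprecated since C++20). utf32_to_utf8 is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-32 to UTF-8 in a single call to out().
std::string utf32_to_utf8(const std::u32string& text) {
    using Cvt = std::codecvt<char32_t, char, std::mbstate_t>;
    const Cvt& f = std::use_facet<Cvt>(std::locale::classic());

    // Worst case: every code point expands to max_length() bytes.
    std::string bytes(text.size() * static_cast<std::size_t>(f.max_length()), '\0');

    std::mbstate_t state = std::mbstate_t();
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    Cvt::result r = f.out(state, text.data(), text.data() + text.size(), from_next,
                          &bytes[0], &bytes[0] + bytes.size(), to_next);
    if (r != Cvt::ok)
        throw std::runtime_error("conversion to UTF-8 failed");

    // Trim the buffer to the bytes actually written.
    bytes.resize(static_cast<std::size_t>(to_next - &bytes[0]));
    return bytes;
}
```

Sizing the buffer for the worst case keeps the example to one call. A caller that prefers a small fixed buffer would loop on partial instead.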
result unshift(state_type& state,
               extern_type* to, extern_type* to_limit,
               extern_type*& to_next) const
Writes, starting at to, the sequence of characters needed to return state to the initial shift state.

During a character encoding translation (such as the one performed by codecvt::out), a state-dependent encoding may leave state in a shift state other than the initial one. Calling this function at the end of the output writes the characters needed to return to the initial shift state, storing no more than to_limit - to external characters.

When the function returns, to_next points one past the last element successfully written.
Returns a value of the result enumeration type. codecvt_base::noconv means that no terminating sequence is needed for this state.
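
A minimal sketch of where unshift fits at the end of an output sequence. append_unshift is a hypothetical helper, and the 16-byte buffer is an assumption about the longest terminating sequence, not a guaranteed bound.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: after the last call to out(), appends whatever bytes
// are needed to bring `state` back to the initial shift state.
template <class Cvt>
std::codecvt_base::result append_unshift(const Cvt& f,
                                         typename Cvt::state_type& state,
                                         std::string& out) {
    char buf[16];  // assumed to be large enough for the terminating sequence
    char* to_next = buf;
    std::codecvt_base::result r = f.unshift(state, buf, buf + sizeof buf, to_next);
    if (r == std::codecvt_base::ok)
        out.append(buf, static_cast<std::size_t>(to_next - buf));
    return r;  // noconv means nothing had to be written
}
```

For the stateless codecvt<char, char, mbstate_t> facet the call reports noconv and leaves the output untouched. A state-dependent encoding would append its terminating sequence here.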

Character encoding properties
Name    Description
bool always_noconv()
Returns true if conversions between the internal and external types, in either direction, always yield a copy without any real conversion. Otherwise false.

Example
    //b:1
    bool b = use_facet<codecvt<char, char, mbstate_t>>(locale()).always_noconv();
    //b:0
    b = use_facet<codecvt<wchar_t, char, mbstate_t>>(locale()).always_noconv();
int encoding()
Returns the number of external characters needed to produce one internal character.
  • A fixed positive value if every internal character corresponds to the same number of external characters.
  • 0 if the number varies, e.g., 1 to 4 bytes per character in UTF-8.
  • -1 if the encoding of an external sequence is state-dependent.
Example
    //i:1
    int i = use_facet<codecvt<wchar_t, char, mbstate_t>>(locale::classic()).encoding();
    //i:0
    i = use_facet<codecvt<wchar_t, char, mbstate_t>>(locale("en_US.UTF-8")).encoding();

int length(state_type& state,
           const extern_type* from, const extern_type* from_end,
           size_t max) const
Returns the number of external characters in the range [from, from_end) that would be consumed when translating to at most max internal characters, as if by calling codecvt::in.
state is also updated as if codecvt::in had been called with an output buffer of max internal characters.

Example
    string s = reinterpret_cast<const char*>(u8"ಖ್ರಿಷಾ Rao👸");

    auto l = locale("en_US.UTF-8");
    auto& f = use_facet<codecvt<wchar_t, char, std::mbstate_t>>(l);
    mbstate_t mb = std::mbstate_t();
    //len:26, the whole byte length of s, because max is large enough to cover every character
    auto len = f.length(mb, s.data(), s.data() + s.size(), s.length());
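
Because length counts external characters, it can answer questions such as "how many bytes do the first n characters occupy". The sketch below assumes the UTF-8 to UTF-32 specialization codecvt<char32_t, char, mbstate_t> (deprecated since C++20), so it runs without a named locale. utf8_prefix_bytes is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: returns how many bytes of `utf8` make up its first
// `max_chars` code points (or the whole string if it is shorter).
std::size_t utf8_prefix_bytes(const std::string& utf8, std::size_t max_chars) {
    using Cvt = std::codecvt<char32_t, char, std::mbstate_t>;
    const Cvt& f = std::use_facet<Cvt>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    int consumed = f.length(state, utf8.data(), utf8.data() + utf8.size(), max_chars);
    return static_cast<std::size_t>(consumed);
}
```

For the 6-byte string "\xE0\xB2\x96" "Rao" (one 3-byte character followed by three ASCII characters), a limit of 1 gives 3, a limit of 2 gives 4, and any limit of 4 or more gives 6.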
int max_length()
Returns the maximum number of external characters that may be consumed to produce one internal character. The value can differ between implementations and locales.

Example
    locale loc;

    //n:1
    auto n = use_facet<codecvt<char,char,mbstate_t> >(loc).max_length();
    //n:5
    n = use_facet<codecvt<wchar_t,char,mbstate_t> >(loc).max_length();
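
The three properties can be queried together, for instance to decide whether a conversion step can be skipped and how large a buffer to reserve. conversion_properties and query_properties are hypothetical names introduced for this sketch.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical aggregate holding the three encoding properties of a facet.
struct conversion_properties {
    bool always_noconv;
    int encoding;
    int max_length;
};

// Hypothetical helper: looks up the facet Cvt in loc and reads its properties.
template <class Cvt>
conversion_properties query_properties(const std::locale& loc) {
    const Cvt& f = std::use_facet<Cvt>(loc);
    return conversion_properties{f.always_noconv(), f.encoding(), f.max_length()};
}
```

For codecvt<char, char, mbstate_t> in the classic locale this reports always_noconv true, encoding 1 and max_length 1. For the UTF-8 based codecvt<char32_t, char, mbstate_t> (deprecated since C++20), always_noconv is false and encoding is 0 because UTF-8 is a variable-width encoding.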
Example 6 demonstrates the usage of the codecvt facet, as seen in its console output.
