Modern C++ 11,14: Quick tour with examples: Localization

Overview

In the beginning, digital computers used 7 bits to store characters, which limited text largely to English. Today, Unicode uses up to 21 bits to represent characters from any language, dead or alive.

The world is a diverse place of many countries, each having one or more native languages and writing systems, along with its own numbering symbols, number grouping, currency symbols, date and time formats, and so on.

On computers, operating systems provide support for all these aspects so that it can be integrated into the software used daily by people, governments, and businesses.

Details

Character Set

A character set is a set of characters identifying a particular writing system of a particular region. A character set consists of letters, numbers, special characters, and other elements used to represent information, as well as control characters such as escape, line feed, tab, and end of file. Examples of character sets include ASCII, ISO 8859-1, JIS, and Unicode.

Each character in a character set needs to be uniquely identified by a number, usually written as a hexadecimal value. Character sets with such an assignment are called coded character sets.

Encoding

Character encoding is the process of assigning numbers to the characters of the character set, allowing them to be stored, transmitted, and transformed using digital computers so that they can be reinterpreted back correctly.

This is discussed in detail.

Code pages

A code page is basically a coded character set. The term originally referred to the page number in IBM's standard character set manual, which contained character mappings for a plethora of character sets. The term has also been adopted by other vendors such as Microsoft and Oracle. A code page now has both a name and a number. For example, UTF-8 is code page 1208 at IBM and 65001 at Microsoft.

The following lists some of the common Microsoft code pages used in the Windows operating system.

ANSI code pages

1252 - Windows Western European

This is a single-byte character encoding of the Latin alphabet that was used by default in Microsoft Windows for English and many Romance and Germanic languages including Spanish, Portuguese, French, and German (though missing uppercase ẞ). This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa.

The encoding used by these single-byte code pages is called a single-byte character set (SBCS) encoding.

Multi Byte code pages

These code pages represent character encodings for Chinese, Japanese, and Korean, collectively called the CJK languages. In addition, they can also contain ASCII characters.

The following code pages are called double-byte character set (DBCS) encodings, as each character takes 1 or 2 bytes.

932 – Japanese Shift-JIS

936 – Simplified Chinese GBK

949 – Korean Unified Hangul Code

950 – Traditional Chinese Big5

The following code page uses an EUC encoding, in which each character can take 1 to 3 bytes.

20932 - EUC-JP

Unicode code pages

The following are some of the code pages used for encodings of the Unicode character set.

1200 – UTF-16LE Unicode (little-endian)

1201 – UTF-16BE Unicode (big-endian)

12000 – UTF-32LE Unicode (little-endian)

12001 – UTF-32BE Unicode (big-endian)

65001 – UTF-8 Unicode

Localization

Internationalization is the process of designing and developing software or applications to be accessible and adaptable to different languages, cultures, and regions. Localization is the process of adapting software or applications to meet the language, cultural, and other requirements of a specific target market or locale.

While internationalization creates a framework that supports multiple languages and regions, localization involves customizing the software or application for a specific target market or region.

For example, while designing an MFC application for different markets, the application might catalogue all the messages that the GUI displays to the user in the local language. As a precursor to this work, designers might assign a unique code to each message, enabling the GUI to load the message from a resource file for that local language. This part of the work is called internationalization.

Localization involves creating resource files that contain the messages in the local language for each unique code. It might also require saving the resource files in a special encoding.

Locales

A locale is a collection of information pertaining to a culture, region, and language. In POSIX environments, locales are physically stored in external files. In Windows, they are located in the registry. A locale includes information about the following:

Formatting numbers, currency, dates, and time
Classifying characters (letter, digit, punctuation, etc.)
Converting characters from uppercase to lowercase and vice versa
Sorting text (e.g., is 'A' less than, equal to, or greater than 'Å'?)
Message catalogs (for translations of strings that an application uses)

Locale identifier

On POSIX systems, a locale identifier is a string composed of up to four elements: a language, the territory in which that language is used, a codeset (encoding), and an optional modifier. It is represented as below:

[language[_territory][.codeset][@modifier]]

The component parts are as follows:

language is a two-letter language code from ISO 639
_territory is a two-letter country/region/subdivision code from ISO 3166
.codeset is any pre-defined name for a character set or encoding identifier (for example, iso885915 or UTF-8)
@modifier specifies an adjustment to the default locale behavior such as date/time/currency formatting or sorting method

Some examples are

Locale Name	Language	Region/Country
de_AT.UTF-8	German	Austria
sv_SE.UTF-8	Swedish	Sweden
en_US.UTF-8	English	USA
ja_JP.UTF-8	Japanese	Japan
kn_IN.UTF-8	Kannada	India

In Windows, locales are identified by an LCID (locale identifier). Recent versions of Windows also support locale names based on BCP 47.

Some examples are:

Locale Name	LCID (hex)	POSIX equivalent	Description
en-US	0x0409	en_US	English for USA
es-ES-u-co-trad	0x040A	es_ES@traditional	Spanish for Spain, specifying the traditional sort order
kn-IN	0x044B	kn_IN.UTF-8	Kannada for India

The following special locale names are available

Locale name	Description	Availability
"C"	Minimal "C" locale (the same as locale::classic())	POSIX and Windows
""	The environment's default locale	POSIX and Windows

Locales in C

Categories

Internally, the information contained in a locale is classified into the categories below.

Category	Content
LC_NUMERIC	Rules and symbols for numbers.
LC_TIME	Values for date and time information.
LC_MONETARY	Rules and symbols for monetary information.
LC_CTYPE	Character classification and case conversion.
LC_COLLATE	Collation sequence.
LC_MESSAGES	Formats and values of messages (a POSIX extension, not part of standard C).
LC_ALL	All the above.

setlocale()

An application can load a locale by calling the setlocale() API. setlocale() accepts two parameters: a category and a locale name.

The following lists various examples:

// POSIX and Windows: set the "C" locale
setlocale(LC_ALL, "C");

// POSIX and Windows: set the environment's default locale
setlocale(LC_ALL, "");

// Windows: set a specific locale
setlocale(LC_ALL, "kn-IN");

// POSIX: set a specific locale
setlocale(LC_ALL, "en_US.ISO8859-1");

This is depicted in the setlocale example listed in the Summary of Examples.

locale class

In C++, a locale is represented by a std::locale object, which is an indexed collection of facets. Streams and basic_regex objects each hold a locale object, which can be replaced by calling imbue().

Facets

The following facets are defined for each category. It's possible to override the default behavior of a facet, as demonstrated in Example 3.

Category	Facet	Description
collate	collate	Used for locale-specific comparison (collation) of strings.
ctype	ctype_base	Base class for the ctype facet. It lists the character classification categories that are inherited by the ctype facet.
ctype	ctype	Encapsulates character classification features.
ctype	codecvt_base	Declares the result enum type, which is used internally by the codecvt methods.
ctype	codecvt	Encapsulates conversion of character strings, including wide and multibyte, from one encoding to another.
monetary	money_base	Base class that defines the enum part and the money format pattern used internally by the money_get and money_put facets.
monetary	moneypunct	Encapsulates monetary value format preferences.
monetary	money_get	Encapsulates monetary parsing rules.
monetary	money_put	Encapsulates monetary formatting rules.
numeric	numpunct	Encapsulates number value format preferences.
numeric	num_get	Encapsulates number parsing rules.
numeric	num_put	Encapsulates number formatting rules.
time	time_base	Base class that defines the enum type dateorder used internally by the time_get and time_put facets.
time	time_get	Encapsulates date and time parsing rules.
time	time_put	Encapsulates date and time formatting rules.
messages	messages_base	Base class for the messages facet. It defines the member type catalog (an integer type) inherited by the derived classes.
messages	messages	Used to read individual strings from a message catalog.

Character classification and conversion

The CRT provides a set of functions to classify characters and change their case to upper or lower. The standard library also provides template-based functions that offer the same facility.

Unicode Conversion Facets

The following are codecvt-based facets that convert between Unicode encodings, for example from UTF-8 to UCS-2, UTF-16, or UTF-32, and the reverse.

codecvt_mode

This is a bitmask-based enum used by the Unicode conversion facets.

codecvt_utf8

Converts between a UTF-8 encoded byte string and UCS-2 or UCS-4 (UTF-32).

codecvt_utf16

Converts between a UTF-16 encoded byte string and UCS-2 or UCS-4 (UTF-32).

Summary of Examples

Name	Category	Description	Github	WandBox/Coliru
Example	Locales	setlocale	source
Example 2	Locales	imbue	source
Example 3	Facet	Override standard behavior	source output	source + output
Example 4	collate	collate	source output	source + output
Example 5	ctype	ctype	source output
Example 6	ctype	codecvt	source output	source + output
Example 7	money	money_get	source output	source + output
Example 8	money	money_put	source output	source + output
Example 9	money	moneypunct	source output	source + output
Example 10	numeric	num_get	source output	source + output
Example 11	numeric	num_put	source output	source + output
Example 12	numeric	numpunct	source output	source + output
Example 13	time	time_get	source output	source + output
Example 14	time	time_put	source output	source + output
