Localization

 
Overview
Early digital computers used 7 bits to store a character, which was enough for English but for little else. Today, up to 21 bits can be used to store a character from virtually any language, dead or alive.
The world is a diverse place of many countries. Each country has one or more native languages and writing systems, along with its own numbering symbols, number grouping, currency symbols, date and time formats, and so on.
On computers, operating systems provide support for all these aspects so that they can be integrated into the software used daily by people, governments, and businesses.

Details
Character Set
A character set is a set of characters identifying a particular writing system of a particular region. It consists of letters, numbers, special characters, and other elements used to represent information, as well as control characters such as escape, line feed, tab, and end of file. Examples of character sets include ASCII, ISO 8859-1, JIS, and Unicode.
Each character in a character set needs to be uniquely represented by a number, usually written as a hexadecimal value. A character set with such a mapping is called a coded character set.

Encoding
Character encoding is the process of assigning numbers to the characters of the character set, allowing them to be stored, transmitted, and transformed using digital computers so that they can be reinterpreted back correctly. 
This is discussed in detail separately.
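To make the idea of assigning numbers to characters concrete, here is a minimal sketch of how one particular encoding, UTF-8, turns a character number (a Unicode code point of up to 21 bits) into one to four bytes. The function name encode_utf8 is made up for this illustration and is not part of any library.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Returns an empty string for values that are not valid Unicode scalar values.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // beyond the Unicode range, or a surrogate
    }
    if (cp < 0x80) {                      // 7 bits: one byte, same as ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // up to 11 bits: two bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // up to 16 bits: three bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // up to 21 bits: four bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, encode_utf8(0x20AC) (the euro sign) yields the three bytes E2 82 AC, while encode_utf8(0x41) yields the single byte for "A".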

Code pages
A code page is basically a coded character set. The term initially referred to the page number in the IBM standard character set manual, which contained character mappings for a plethora of character sets. The term has since been adopted by other vendors such as Microsoft and Oracle. It now refers to a name and also a number. For example, UTF-8 is identified as code page 1208 at IBM and 65001 at Microsoft.

The following are some of the common Microsoft code pages used in the Windows operating system.

ANSI code pages
1252 - Windows Western European
This is a single-byte character encoding of the Latin alphabet that was used by default in Microsoft Windows for English and many Romance and Germanic languages including Spanish, Portuguese, French, and German (though missing uppercase ẞ). This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa.
The encoding used by such single-byte code pages is called single-byte character set (SBCS) encoding.

Multibyte code pages
These code pages represent character encodings for Chinese, Japanese, and Korean, collectively called the CJK languages. In addition, they can also contain ASCII characters.

The following code pages use double-byte character set (DBCS) encoding, in which each character takes 1 or 2 bytes.
932 - Japanese Shift-JIS
936 - Simplified Chinese GBK
949 - Korean Unified Hangul Code
950 - Traditional Chinese Big5
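Because a DBCS character can take 1 or 2 bytes, the number of bytes in a string is not the number of characters. The following is a rough sketch of counting characters in a code page 932 (Shift-JIS) string. The helper names are invented for this illustration, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are an assumption about code page 932; other DBCS code pages use different ranges.

```cpp
#include <cstddef>
#include <string>

// Assumption: in code page 932, bytes 0x81-0x9F and 0xE0-0xFC start a 2-byte character.
bool is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: counts characters (not bytes) in a code page 932 byte string.
std::size_t count_cp932_characters(const std::string& bytes) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes.size(); ++count) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        // A lead byte followed by another byte forms one 2-byte character;
        // anything else (ASCII, or a dangling lead byte) is counted as one byte.
        i += (is_cp932_lead_byte(b) && i + 1 < bytes.size()) ? 2 : 1;
    }
    return count;
}
```

With this sketch, the four bytes 93 FA 96 7B followed by an ASCII "A" count as three characters, even though the string is five bytes long.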

The following code page uses EUC encoding, in which each character takes 1 to 3 bytes.
20932 - EUC-JP

Unicode code pages
The following are some of the code pages that are used for encoding Unicode character sets.
1200 - UTF-16LE Unicode (little-endian)
1201 - UTF-16BE Unicode (big-endian)
12000 - UTF-32LE Unicode (little-endian)
12001 - UTF-32BE Unicode (big-endian)
65001 - UTF-8 Unicode

Localization
Internationalization is the process of designing and developing software or applications to be accessible and adaptable to different languages, cultures, and regions. Localization is the process of adapting software or applications to meet the language, cultural, and other requirements of a specific target market or locale. 
While internationalization creates a framework that supports multiple languages and regions, localization involves customizing the software or application for a specific target market or region.
For example, while designing an MFC application for different markets, the application might catalogue all the messages that the GUI displays to the user in the local language. As a precursor to this work, designers might assign a unique code to each message, enabling the GUI to load the message from a resource file for that local language. This part of the work is called internationalization.
Localization involves creating the resource files that contain the messages in the local language for each unique code. It might also require saving the resource files in a special encoding.
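The split between the two activities can be sketched in a few lines. In the sketch below, the message codes and the lookup function stand for the internationalization work, and the per-language tables stand for the localized resource files. Everything here (MessageId, catalogs, load_message, and the in-memory tables) is hypothetical and only imitates what a real resource file mechanism such as the one in MFC would do.

```cpp
#include <map>
#include <string>

// Internationalization: every user-visible message gets a unique code.
enum class MessageId { Greeting, Farewell };

using Catalog = std::map<MessageId, std::string>;

// Localization: one catalog per language. An in-memory table stands in
// for the resource files here (an assumption made for brevity).
const std::map<std::string, Catalog>& catalogs() {
    static const std::map<std::string, Catalog> data = {
        {"en-US", {{MessageId::Greeting, "Hello"}, {MessageId::Farewell, "Goodbye"}}},
        {"de-AT", {{MessageId::Greeting, "Hallo"}, {MessageId::Farewell, "Auf Wiedersehen"}}},
    };
    return data;
}

// Loads a message by code for the given locale, falling back to en-US
// when no catalog exists for that locale.
std::string load_message(const std::string& locale_name, MessageId id) {
    auto cat = catalogs().find(locale_name);
    if (cat == catalogs().end()) {
        cat = catalogs().find("en-US");
    }
    auto msg = cat->second.find(id);
    return msg != cat->second.end() ? msg->second : std::string();
}
```

The application code only ever mentions MessageId::Greeting; adding a new language means adding a catalog, not touching the code.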

Locales
A locale is a collection of information pertaining to a culture, region, and language. Locales are physically stored in external files in POSIX environments. In Windows, they are located in the registry. The locale includes information about
  • Formatting numbers, currency, dates, and time
  • Classifying characters (letter, digit, punctuation, etc.)
  • Converting characters from uppercase to lowercase and vice versa
  • Sorting text (e.g., is 'A' less than, equal to, or greater than 'Å'?)
  • Message catalogs (for translations of strings that an application uses)
Locale identifier
On POSIX systems, a locale identifier is a string composed of up to four elements: a language, the territory in which that language is used, a codeset (encoding), and an optional modifier. It is represented as follows:
[language[_territory][.codeset][@modifier]]
The component parts are as follows:
  • language is a two-letter language code from ISO 639
  • _territory is a two-letter country/region/subdivision code from ISO 3166
  • .codeset is any pre-defined name for a character set or encoding identifier (for example, iso885915 or UTF-8)
  • @modifier specifies an adjustment to the default locale behavior such as date/time/currency formatting or sorting method
Some examples are:
Locale Name    Language    Region/Country
de_AT.UTF-8    German      Austria
sv_SE.UTF-8    Swedish     Sweden
en_US.UTF-8    English     USA
ja_JP.UTF-8    Japanese    Japan
kn_IN.UTF-8    Kannada     India
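The structure of a POSIX locale identifier can be illustrated by splitting a name into its parts. This is only a sketch: LocaleParts and parse_posix_locale are names invented here, and the code does no validation against ISO 639 or ISO 3166.

```cpp
#include <string>

// Hypothetical holder for the parts of language[_territory][.codeset][@modifier].
struct LocaleParts {
    std::string language;
    std::string territory;
    std::string codeset;
    std::string modifier;
};

// Splits a POSIX-style locale name into its parts. Missing parts stay empty.
LocaleParts parse_posix_locale(const std::string& name) {
    LocaleParts parts;
    std::string rest = name;

    // Peel the optional parts off from right to left.
    std::string::size_type pos = rest.find('@');
    if (pos != std::string::npos) {
        parts.modifier = rest.substr(pos + 1);
        rest.erase(pos);
    }
    pos = rest.find('.');
    if (pos != std::string::npos) {
        parts.codeset = rest.substr(pos + 1);
        rest.erase(pos);
    }
    pos = rest.find('_');
    if (pos != std::string::npos) {
        parts.territory = rest.substr(pos + 1);
        rest.erase(pos);
    }
    parts.language = rest;
    return parts;
}
```

For example, parse_posix_locale("de_AT.UTF-8") gives language "de", territory "AT", and codeset "UTF-8", with an empty modifier.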

In Windows, locales are identified by an LCID (locale ID). Recent versions of Windows also support locale names based on BCP 47.
Some examples are:
Locale Name        LCID             POSIX equivalent     Description
en-US              0x0409           en_US                English for USA
es-ES-u-co-trad    0x040A (1034)    es_ES@traditional    Spanish for Spain, specifying the traditional sort order
kn-IN              0x044B           kn_IN.UTF-8          Kannada for India

The following special locale names are available:
Locale name    Description                                         Availability
"C"            Minimal "C" locale (the same as locale::classic)    POSIX and Windows
""             The environment's default locale                    POSIX

Locales in C
Categories
Internally, the information contained in a locale is classified into the following categories.
Category       Content
LC_NUMERIC     Rules and symbols for numbers.
LC_TIME        Values for date and time information.
LC_MONETARY    Rules and symbols for monetary information.
LC_CTYPE       Character classification and case conversion.
LC_COLLATE     Collation sequence.
LC_MESSAGES    Formats and values of messages.
LC_ALL         All of the above.
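Each category can be queried on its own. The sketch below, with helper names made up for this post, asks the C runtime which locale is in effect for a category and reads one piece of LC_NUMERIC information through localeconv(). Passing a null pointer as the locale argument of setlocale() only queries; it does not change anything.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: name of the locale currently in effect for one category.
std::string current_locale_name(int category) {
    const char* name = std::setlocale(category, nullptr);  // null = query only
    return name ? name : "";
}

// Hypothetical helper: decimal point used by the current LC_NUMERIC locale.
std::string current_decimal_point() {
    return std::localeconv()->decimal_point;
}
```

In the "C" locale, which every program starts in, current_locale_name(LC_NUMERIC) returns "C" and current_decimal_point() returns ".".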

setlocale()
An application can load a locale by calling the setlocale() API. setlocale() accepts two parameters: a category and a locale identifier.

The following lists various examples:
// POSIX and Windows
// sets the C locale
setlocale(LC_ALL, "C");

// POSIX
// sets the default locale
setlocale(LC_ALL, "");

// Windows
// sets a specific locale
setlocale(LC_ALL, "kn-IN");

// POSIX
// sets a specific locale
setlocale(LC_ALL, "en_US.ISO8859-1");

This is depicted in the setlocale example listed in the Summary of Examples.
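Since locale names differ between POSIX and Windows, and a given locale may simply not be installed, it is worth checking the return value of setlocale(), which is a null pointer on failure. The following is one possible sketch that tries several names in turn. The helper set_first_available_locale is hypothetical, not a library function.

```cpp
#include <clocale>
#include <initializer_list>
#include <string>

// Hypothetical helper: tries each candidate name in order and returns the
// first one that setlocale() accepts, or an empty string if none is accepted.
std::string set_first_available_locale(int category,
                                       std::initializer_list<const char*> candidates) {
    for (const char* name : candidates) {
        if (std::setlocale(category, name) != nullptr) {
            return name;  // the locale is now in effect for this category
        }
    }
    return "";  // nothing changed
}
```

A call such as set_first_available_locale(LC_ALL, {"kn_IN.UTF-8", "kn-IN", "C"}) would pick the POSIX name where it exists, the Windows name where that exists, and fall back to the "C" locale otherwise.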

Locales in C++
In C++, a locale is represented by a locale object, which is a collection of indexed facets. Streams and basic_regex objects each hold a locale, which can be changed with imbue().
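A small sketch of working with a locale object: imbuing a stream with it and looking up one of its facets with use_facet. The two function names are made up for this illustration. Only the classic "C" locale is guaranteed to exist everywhere; constructing a named locale such as std::locale("de_AT.UTF-8") throws std::runtime_error when that locale is not available.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Formats a value through a stream that has been imbued with the given locale.
std::string format_with(const std::locale& loc, double value) {
    std::ostringstream out;
    out.imbue(loc);  // the stream now formats according to loc's facets
    out << value;
    return out.str();
}

// Looks up the numpunct facet of a locale and returns its decimal point.
char decimal_point_of(const std::locale& loc) {
    return std::use_facet<std::numpunct<char>>(loc).decimal_point();
}
```

With std::locale::classic(), format_with(loc, 1.5) returns "1.5" and decimal_point_of(loc) returns '.'.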

Facets
The following facets are defined for each category. It is possible to override the default behavior of a facet, as demonstrated in Example 3.
collate
  • collate: used for locale-aware comparison (collation) of strings.
ctype
  • ctype_base: base class for the ctype facet. It lists the character classification categories, which the ctype facet inherits.
  • ctype: encapsulates character classification features.
  • codecvt_base: declares the result enum type, which is used internally by the codecvt methods.
  • codecvt: encapsulates conversion of character strings, including wide and multibyte, from one encoding to another.
monetary
  • money_base: base class that defines the enum part and the money format pattern used internally by the money_get and money_put facets.
  • moneypunct: encapsulates monetary value format preferences.
  • money_get: encapsulates monetary parsing rules.
  • money_put: encapsulates monetary formatting rules.
numeric
  • numpunct: encapsulates number value format preferences.
  • num_get: encapsulates number parsing rules.
  • num_put: encapsulates number formatting rules.
time
  • time_base: base class that defines the enum type dateorder, used internally by the time_get and time_put facets.
  • time_get: encapsulates date and time parsing rules.
  • time_put: encapsulates date and time formatting rules.
messages
  • messages_base: base class of the messages facet. It defines the member type catalog (int catalog), inherited by all the derived classes.
  • messages: used to read individual strings from a message catalog.
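As an illustration of overriding a facet's default behavior, here is a minimal sketch (not the post's own Example 3) that derives from numpunct to group digits in threes. The names grouped_numpunct and format_grouped are invented for this sketch.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: overrides two virtual members of numpunct<char>
// so that integers are printed with a comma between groups of three digits.
class grouped_numpunct : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }  // groups of 3 digits
};

// Formats a number using the classic locale with the numpunct facet replaced.
std::string format_grouped(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when no longer used.
    out.imbue(std::locale(std::locale::classic(), new grouped_numpunct));
    out << value;
    return out.str();
}
```

With this facet in place, format_grouped(1234567) produces "1,234,567", whereas the unmodified classic locale would print "1234567".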

The CRT provides a set of functions to classify characters and to change their case to upper or lower. The standard library also provides template-based functions that offer the same facility.
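The template-based versions live in the locale header and take the locale as a second argument, so they do not depend on the global C locale. A short sketch, with helper names chosen just for this illustration:

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Returns a copy of text with every character upper-cased according to loc.
std::string to_upper_copy(const std::string& text, const std::locale& loc) {
    std::string result = text;
    for (char& c : result) {
        c = std::toupper(c, loc);  // two-argument template from <locale>
    }
    return result;
}

// Counts the characters that loc classifies as letters.
std::size_t count_letters(const std::string& text, const std::locale& loc) {
    std::size_t count = 0;
    for (char c : text) {
        if (std::isalpha(c, loc)) {
            ++count;
        }
    }
    return count;
}
```

Both functions delegate to the ctype facet of the locale they are given, so passing a different locale can change the result for characters outside ASCII.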

Unicode Conversion Facets
The following are codecvt-based facets for converting between Unicode encodings, for example from UTF-8 to UCS-2, UTF-16, or UTF-32, and the reverse.

codecvt_mode
A bitmask-based enum used by the Unicode conversion facets.

codecvt_utf8
Converts between a UTF-8 encoded byte string and UCS-2 or UCS-4 (UTF-32).

codecvt_utf16
Converts between a UTF-16 encoded byte string and UCS-2 or UCS-4 (UTF-32).

codecvt_utf8_utf16
Converts between a UTF-8 encoded byte string and UTF-16.

String and stream conversions
The following class templates perform conversions between wide strings and byte strings (in either direction) using a codecvt facet as the conversion object.

wstring_convert
Performs conversions between a wide string and a byte string.

wbuffer_convert
Performs conversion between a byte stream buffer and a wide stream buffer.
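As a rough sketch of how the conversion facets and the string converter fit together, the functions below convert between UTF-8 and UTF-16 using std::wstring_convert with std::codecvt_utf8_utf16. The function names are made up here. Note that these facilities are deprecated since C++17, so a compiler may emit deprecation warnings; from_bytes and to_bytes throw std::range_error when the input is malformed.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Converts a UTF-8 encoded byte string to UTF-16.
std::u16string utf8_to_utf16(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(bytes);
}

// Converts a UTF-16 string back to a UTF-8 encoded byte string.
std::string utf16_to_utf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.to_bytes(text);
}
```

The three UTF-8 bytes E2 82 AC (the euro sign) become a single UTF-16 code unit, 0x20AC, and converting that back returns the original three bytes.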


Summary of Examples
Name          Category    Description                   Github            WandBox/Coliru
Example 1     Locales     setlocale                     source
Example 2     Locales     imbue                         source
Example 3     Facet       Override standard behavior    source, output    source + output
Example 4     collate     collate                       source, output    source + output
Example 5     ctype       ctype                         source, output
Example 6     ctype       codecvt                       source, output    source + output
Example 7     money       money_get                     source, output    source + output
Example 8     money       money_put                     source, output    source + output
Example 9     money       moneypunct                    source, output    source + output
Example 10    numeric     num_get                       source, output    source + output
Example 11    numeric     num_put                       source, output    source + output
Example 12    numeric     numpunct                      source, output    source + output
Example 13    time        time_get                      source, output    source + output
Example 14    time        time_put                      source, output    source + output
