In the beginning, digital computers used 7 bits to store characters, which limited them to English and a few other Western European languages. Today, up to 21 bits can be used to store characters from any language, living or dead.
The world is a diverse place of many countries. Each country has one or more native languages and writing systems, along with its own numbering symbols, number grouping, currency symbols, date and time formats, and so on.
On computers, operating systems provide support for all these aspects so that they can be integrated into the software used in the daily lives of people, governments, and businesses.
Details
Character Set
A character set is a set of characters identifying a particular writing system of a particular region. A character set consists of letters, numbers, special characters, and other elements used to represent information, as well as control characters such as escape, line feed, tab, and end of file. Examples of character sets include ASCII, ISO 8859-1, JIS, and Unicode.
Each character in a character set needs to be uniquely identified by a number, usually written as a hexadecimal value. A character set whose characters have such numbers assigned is called a coded character set.
Encoding
Character encoding is the process of assigning numbers to the characters of the character set, allowing them to be stored, transmitted, and transformed using digital computers so that they can be reinterpreted back correctly.
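As an illustration of how a character's number becomes stored bytes, the following is a minimal sketch of UTF-8 encoding for a single Unicode code point. The function name `encode_utf8` is an assumption made for this example and is not part of any library.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF) as a UTF-8 byte sequence.
// Returns an empty string for values outside the Unicode range and for
// surrogate code points, which cannot be encoded on their own.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;
    }
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0x20AC)` (the euro sign) yields the three bytes E2 82 AC, while `encode_utf8(0x41)` yields the single byte 41, the same value ASCII uses for 'A'.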
This is discussed in detail.
Code pages
A code page is essentially a coded character set. The term originally referred to the page number in the IBM standard character set manual, which contained the character mappings for a plethora of character sets. The term was later adopted by other vendors such as Microsoft and Oracle. Today a code page is identified by both a name and a number. For example, UTF-8 is code page 1208 at IBM and 65001 at Microsoft.
The following lists some of the common Microsoft code pages used in the Windows operating system.
ANSI code pages
1252 - Windows Western European
This is a single-byte character encoding of the Latin alphabet that was used by default in Microsoft Windows for English and many Romance and Germanic languages, including Spanish, Portuguese, French, and German (though it lacks the uppercase ẞ). This encoding is used throughout the Americas, Western Europe, Oceania, and much of Africa.
The encoding used by these single-byte code pages is called single-byte character set (SBCS) encoding.
Multi Byte code pages
These code pages represent character encodings for Chinese, Japanese, and Korean, collectively called the CJK languages. They can also contain ASCII characters.
The following code pages are called double-byte character set (DBCS) encodings, as each character takes 1 or 2 bytes.
932 – Japanese Shift-JIS
936 – Simplified Chinese GBK
949 – Korean Unified Hangul Code
950 – Traditional Chinese Big5
The following code page uses EUC encoding, in which each character can take 1 to 3 bytes.
20932 - EUC-JP
Unicode code pages
The following are some of the code pages that are used for encoding Unicode character sets.
1200 – UTF-16LE Unicode (little-endian)
1201 – UTF-16BE Unicode (big-endian)
12000 – UTF-32LE Unicode (little-endian)
12001 – UTF-32BE Unicode (big-endian)
65001 – UTF-8 Unicode
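The little-endian and big-endian variants listed above differ only in the order in which the bytes of each code unit are stored. The following is a small sketch for one UTF-16 code unit; the helper name `utf16_unit_bytes` is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Serialize one UTF-16 code unit as two bytes in the requested byte order.
std::vector<std::uint8_t> utf16_unit_bytes(std::uint16_t unit, bool little_endian) {
    const std::uint8_t low = static_cast<std::uint8_t>(unit & 0xFF);
    const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
    if (little_endian) {
        return {low, high};   // UTF-16LE: least significant byte first
    }
    return {high, low};       // UTF-16BE: most significant byte first
}
```

For the euro sign (U+20AC), the little-endian form is the byte pair AC 20 and the big-endian form is 20 AC.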
Localization
Internationalization is the process of designing and developing software or applications to be accessible and adaptable to different languages, cultures, and regions. Localization is the process of adapting software or applications to meet the language, cultural, and other requirements of a specific target market or locale.
While internationalization creates a framework that supports multiple languages and regions, localization involves customizing the software or application for a specific target market or region.
For example, while designing an MFC application for different markets, the application might catalogue all the messages that the GUI displays to the user in the local language. As a precursor to this work, designers might assign a unique code to each message, enabling the GUI to load the message for that code from a resource file for the local language. This part of the work is called internationalization.
Localization involves creating resource files that contain the messages in the local language for each unique code. It might also require saving the resource files in a specific encoding.
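The split between the two activities can be sketched in a few lines. In this sketch, the message codes, the in-memory catalog, and the `message_for` function are all hypothetical stand-ins for real resource files; an actual MFC application would typically use string table resources instead.

```cpp
#include <map>
#include <string>

// Internationalization: every user-visible message gets a unique code.
enum class MessageId { FileNotFound, SaveCompleted };

// Localization: one catalog of translated strings per locale.
// This in-memory table stands in for per-locale resource files.
using Catalog = std::map<MessageId, std::string>;

const std::map<std::string, Catalog>& catalogs() {
    static const std::map<std::string, Catalog> data = {
        {"en-US", {{MessageId::FileNotFound, "File not found"},
                   {MessageId::SaveCompleted, "Save completed"}}},
        {"de-AT", {{MessageId::FileNotFound, "Datei nicht gefunden"},
                   {MessageId::SaveCompleted, "Speichern abgeschlossen"}}},
    };
    return data;
}

// Look up a message for a locale, falling back to en-US when the
// requested locale has no catalog.
std::string message_for(const std::string& locale, MessageId id) {
    const auto& all = catalogs();
    auto cat = all.find(locale);
    if (cat == all.end()) {
        cat = all.find("en-US");
    }
    auto msg = cat->second.find(id);
    return msg != cat->second.end() ? msg->second : std::string();
}
```

Adding a new market then means adding a new catalog, without touching the code that calls `message_for`.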
Locales
A locale is a collection of information pertaining to a culture, region, and language. In POSIX environments, locales are physically stored in external files. In Windows, they are located in the registry. A locale includes information about:
- Formatting numbers, currency, dates, and time
- Classifying characters (letter, digit, punctuation, etc.)
- Converting characters from uppercase to lowercase and vice versa
- Sorting text (e.g., is 'A' less than, equal to, or greater than 'Å'?)
- Message catalogs (for translations of strings that an application uses)
Locale identifier
On POSIX systems, a locale identifier is a string composed of up to four elements: a language, the region in which that language is used, an encoding, and an optional variant. It is represented as below:
[language[_territory][.codeset][@modifier]].
The component parts are as follows:
- language is a two-letter language code from ISO 639
- _territory is a two-letter country/region/subdivision code from ISO 3166
- .codeset is any pre-defined name for a character set or encoding identifier (for example, iso885915 or UTF-8)
- @modifier specifies an adjustment to the default locale behavior such as date/time/currency formatting or sorting method
Some examples are:
Locale Name | Language | Region/Country |
---|---|---|
de_AT.UTF-8 | German | Austria |
sv_SE.UTF-8 | Swedish | Sweden |
en_US.UTF-8 | English | USA |
ja_JP.UTF-8 | Japanese | Japan |
kn_IN.UTF-8 | Kannada | India |
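The identifier layout described above can be taken apart with plain string operations. The following is a minimal sketch; the `LocaleParts` struct and the `parse_posix_locale` function are names invented for this example, and no validation against ISO 639 or ISO 3166 is attempted.

```cpp
#include <cstddef>
#include <string>

struct LocaleParts {
    std::string language;
    std::string territory;
    std::string codeset;
    std::string modifier;
};

// Split "language[_territory][.codeset][@modifier]" into its parts.
// Parts that are absent from the identifier are left empty.
LocaleParts parse_posix_locale(const std::string& id) {
    LocaleParts parts;
    std::string rest = id;

    // The modifier is everything after '@'.
    std::size_t at = rest.find('@');
    if (at != std::string::npos) {
        parts.modifier = rest.substr(at + 1);
        rest.erase(at);
    }
    // The codeset is everything after '.'.
    std::size_t dot = rest.find('.');
    if (dot != std::string::npos) {
        parts.codeset = rest.substr(dot + 1);
        rest.erase(dot);
    }
    // The territory is everything after '_'.
    std::size_t underscore = rest.find('_');
    if (underscore != std::string::npos) {
        parts.territory = rest.substr(underscore + 1);
        rest.erase(underscore);
    }
    parts.language = rest;
    return parts;
}
```

Parsing `de_AT.UTF-8` gives the language `de`, the territory `AT`, and the codeset `UTF-8`, with an empty modifier.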
In Windows, locales are identified by locale IDs (LCIDs). Recent versions of Windows also support locale names based on BCP 47.
Some examples are:
Locale Name | LCID (hex) | POSIX equivalent | Description |
---|---|---|---|
en-US | 0x0409 | en_US | English for USA |
es-ES-u-co-trad | 0x040A | es_ES@traditional | Spanish for Spain, specifying the traditional sort order |
kn-IN | 0x044B | kn_IN.UTF-8 | Kannada for India |
The following special locale names are available:
Locale name | Description | Availability |
---|---|---|
"C" | Minimal "C" locale (the same as std::locale::classic()) | POSIX and Windows |
"" | The environment's default locale | POSIX and Windows |
Locales in C
category
Internally, the information contained in a locale is classified into the following categories.
Category | Content |
---|---|
LC_NUMERIC | Rules and symbols for numbers. |
LC_TIME | Values for date and time information. |
LC_MONETARY | Rules and symbols for monetary information. |
LC_CTYPE | Character classification and case conversion. |
LC_COLLATE | Collation sequence. |
LC_MESSAGES | Formats and values of messages (a POSIX extension, not part of standard C). |
LC_ALL | All of the above. |
setlocale()
An application can load a locale by calling the setlocale() API. setlocale() accepts two parameters: a category and a locale name.
The following lists various examples:
#include <locale.h>
//POSIX and WINDOWS
//sets C locale
setlocale(LC_ALL,"C");
//POSIX and WINDOWS
//sets the environment's default locale
setlocale(LC_ALL,"");
//WINDOWS
//sets a specific locale
setlocale(LC_ALL,"kn-IN");
//POSIX
//sets a specific locale
setlocale(LC_ALL,"en_US.ISO8859-1");
This is demonstrated in the setlocale example listed in the summary of examples below.
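Because setlocale() returns a null pointer when the requested locale is not installed, it is worth checking its result. The following is a minimal sketch using the C++ header `<clocale>`; the wrapper names `select_locale` and `current_decimal_point` are assumptions made for this example.

```cpp
#include <clocale>
#include <string>

// Switch the whole program to the given locale and report what was selected.
// std::setlocale returns a null pointer when the request cannot be honored,
// in which case the current locale stays unchanged.
std::string select_locale(const char* name) {
    const char* result = std::setlocale(LC_ALL, name);
    return result ? std::string(result) : std::string("<unavailable>");
}

// Read the decimal separator defined by the active LC_NUMERIC category.
std::string current_decimal_point() {
    return std::localeconv()->decimal_point;
}
```

After `select_locale("C")`, `current_decimal_point()` returns "." on every platform. For other locale names the result depends on which locales are installed on the machine.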
In C++, a locale is represented by a std::locale object, which is a collection of indexed facets. Streams and basic_regex objects each hold a locale object, which can be replaced by calling imbue().
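The following is a short sketch of working with a locale object: attaching it to a stream with imbue() and asking whether it contains a given facet. The helper names `format_with` and `has_number_punctuation` are hypothetical.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Format a number through a stream that uses the given locale.
std::string format_with(const std::locale& loc, double value) {
    std::ostringstream out;
    out.imbue(loc);   // the stream now formats with the facets of loc
    out << value;
    return out.str();
}

// Check whether a locale carries the standard numpunct<char> facet.
bool has_number_punctuation(const std::locale& loc) {
    return std::has_facet<std::numpunct<char>>(loc);
}
```

With `std::locale::classic()`, `format_with` produces the same text as the "C" locale would, for example "1234.5" for the value 1234.5.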
Facets
The following facets are defined for each category. It is possible to override the default behavior of a facet, as demonstrated in Example 3.
Category | Facet | Description |
---|---|---|
collate | collate | The collate facet can be used for locale-specific comparison (collation) of strings. |
ctype | ctype_base | Base class for the ctype facet. It lists the character classification categories, which are inherited by the ctype facet. |
ctype | ctype | Encapsulates character classification features. |
ctype | codecvt_base | Declares the result enum type, which is used internally by the codecvt methods. |
ctype | codecvt | Encapsulates conversion of character strings, including wide and multibyte, from one encoding to another. |
monetary | money_base | Base class that defines the part enum and the money format pattern used internally by the money_get and money_put facets. |
monetary | moneypunct | Encapsulates monetary value format preferences. |
monetary | money_get | Encapsulates monetary parsing rules. |
monetary | money_put | Encapsulates monetary formatting rules. |
numeric | numpunct | Encapsulates number value format preferences. |
numeric | num_get | Encapsulates number parsing rules. |
numeric | num_put | Encapsulates number formatting rules. |
time | time_base | Base class that defines the dateorder enum type, used internally by the time_get and time_put facets. |
time | time_get | Encapsulates date and time parsing rules. |
time | time_put | Encapsulates date and time formatting rules. |
messages | messages_base | Base class for the messages facet. It defines the integer member type catalog, which is inherited by all derived classes. |
messages | messages | Used to read individual strings from a message catalog. |
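Overriding the default behavior of a facet usually means deriving from it and replacing its protected virtual functions. The following is a minimal sketch, not the code of the linked examples: the class `CommaDecimal` and the function `format_european` are hypothetical names, and the chosen separators are only an illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: comma as decimal point, dot as thousands separator,
// digits grouped in threes.
class CommaDecimal : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
};

std::string format_european(double value) {
    std::ostringstream out;
    // Build a locale that copies the classic locale but replaces numpunct.
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new CommaDecimal));
    out.setf(std::ios::fixed, std::ios::floatfield);
    out.precision(2);
    out << value;
    return out.str();
}
```

With this facet installed, `format_european(1234567.891)` produces "1.234.567,89", while all other facets keep their classic behavior.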
The CRT provides a set of functions to classify characters and to change their case to upper or lower. The C++ standard library also provides template-based functions that offer the same facility.
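The template-based versions live in `<locale>` and take the locale as a second argument, delegating to that locale's ctype facet. The following is a small sketch; the helper names `to_upper` and `count_letters` are assumptions made for this example.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Upper-case a string with the locale-aware std::toupper template.
std::string to_upper(const std::string& text, const std::locale& loc) {
    std::string result;
    for (char ch : text) {
        result += std::toupper(ch, loc);
    }
    return result;
}

// Count the characters that the locale's ctype facet classifies as letters.
std::size_t count_letters(const std::string& text, const std::locale& loc) {
    std::size_t count = 0;
    for (char ch : text) {
        if (std::isalpha(ch, loc)) {
            ++count;
        }
    }
    return count;
}
```

With `std::locale::classic()`, `to_upper("abc-123", loc)` returns "ABC-123" and `count_letters("abc-123", loc)` returns 3.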
Unicode Conversion Facets
The following are codecvt-based facets that convert between Unicode encoding forms, for example from UTF-8 to UCS-2, UTF-16, or UTF-32, and back. All of these facilities are deprecated since C++17.
codecvt_utf8: Converts between a UTF-8 encoded byte string and UCS-2 or UCS-4 (UTF-32).
codecvt_utf8_utf16: Converts between a UTF-8 encoded byte string and UTF-16.
String and stream conversions
wstring_convert: Performs conversions between wide strings and byte strings (in either direction) using a conversion object of type Codecvt, the facet passed as a template argument.
wbuffer_convert: Performs conversion between a byte stream buffer and a wide stream buffer.
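The following is a minimal sketch of a round trip between UTF-8 and UTF-16 using std::wstring_convert with std::codecvt_utf8_utf16. The function names are hypothetical. Both standard classes are deprecated since C++17, so a compiler may emit deprecation warnings, and invalid input makes the conversion throw std::range_error.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 encoded byte string to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(utf8);
}

// Convert UTF-16 code units back to a UTF-8 encoded byte string.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.to_bytes(utf16);
}
```

The three UTF-8 bytes E2 82 AC of the euro sign become the single UTF-16 code unit 0x20AC, while a character outside the Basic Multilingual Plane becomes a surrogate pair of two code units.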
Summary of Examples
Name | Category | Description | Github | WandBox/Coliru |
---|---|---|---|---|
Example | Locales | setlocale | source | |
Example 2 | Locales | imbue | source | |
Example 3 | Facet | Override standard behavior | source output | source + output |
Example 4 | collate | collate | source output | source + output |
Example 5 | ctype | ctype | source output | |
Example 6 | ctype | codecvt | source output | source + output |
Example 7 | money | money_get | source output | source + output |
Example 8 | money | money_put | source output | source + output |
Example 9 | money | moneypunct | source output | source + output |
Example 10 | numeric | num_get | source output | source + output |
Example 11 | numeric | num_put | source output | source + output |
Example 12 | numeric | numpunct | source output | source + output |
Example 13 | time | time_get | source output | source + output |
Example 14 | time | time_put | source output | source + output |