Modern C++ 11,14: Quick tour with examples: Encoding

Overview

Character encoding is the process of assigning numbers to the characters of a character set, so that they can be stored, transmitted, and transformed by digital computers and later interpreted correctly.

Details

Two important things to consider when choosing an encoding are:

compactness - less storage is used
performance - reading and writing do not use a lot of CPU cycles

Sometimes the terms character set and encoding are used interchangeably, since the two aspects are so closely related.

The following encoding schemes are popular:

SBCS (Single Byte Character Set)

This describes an encoding in which each byte value maps directly to a single character. Up to 256 characters can be defined, although not every byte value necessarily has a meaning. ASCII and EBCDIC are examples of such character sets.
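To illustrate that the same byte value stands for different characters in different single-byte character sets, here is a minimal sketch that maps a few EBCDIC bytes to their ASCII counterparts. The function name ebcdic_to_ascii is hypothetical, and only space, A-Z, and 0-9 are covered; a real converter would use a full 256-entry table for a specific EBCDIC code page.

```cpp
#include <cstdint>

// Hypothetical helper: translate one EBCDIC byte to the ASCII byte for the
// same character. Only space, A-Z, and 0-9 are covered in this sketch; every
// other value is reported as '?' (0x3F).
std::uint8_t ebcdic_to_ascii(std::uint8_t b) {
    if (b == 0x40) return 0x20;  // space
    if (b >= 0xC1 && b <= 0xC9) return static_cast<std::uint8_t>(0x41 + (b - 0xC1));  // A-I
    if (b >= 0xD1 && b <= 0xD9) return static_cast<std::uint8_t>(0x4A + (b - 0xD1));  // J-R
    if (b >= 0xE2 && b <= 0xE9) return static_cast<std::uint8_t>(0x53 + (b - 0xE2));  // S-Z
    if (b >= 0xF0 && b <= 0xF9) return static_cast<std::uint8_t>(0x30 + (b - 0xF0));  // 0-9
    return 0x3F;  // not handled by this sketch
}
```

For instance, the letter A is stored as 0xC1 in EBCDIC and as 0x41 in ASCII, so ebcdic_to_ascii(0xC1) yields 0x41.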

DBCS (Double Byte Character Set)

This describes an encoding in which some byte values are recognized as the first byte of a two-byte sequence that collectively identifies a particular character. This kind of encoding allows far more characters to be defined; in theory up to 65536. JIS X 0208-1990 is an example of such a character set.

One example is Shift-JIS. In this encoding scheme, each byte is inspected to see if it is a one-byte character or the first byte of a two-byte character. This is determined by reserving a set of byte values for certain purposes.

The encoding scheme used for Japanese characters is as follows:

Any byte having a value in the range 0x21-7E is assumed to be a one-byte ASCII/JIS Roman character.
Any byte having a value in the range 0xA1-DF is assumed to be a one-byte half-width katakana character.
Any byte having a value in the range 0x81-9F or 0xE0-EF is assumed to be the first byte of a two-byte character from the set JIS X 0208-1990. The second byte must have a value in the range 0x40-7E or 0x80-FC.

The example shows Shift-JIS encoding where each JIS X 0208 character is represented in 2 bytes.
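The Shift-JIS byte ranges listed above can be turned into a small classifier. This is only a sketch: the names sjis_char_length, sjis_is_trail_byte, and sjis_count_chars are hypothetical, and the code follows the listed ranges strictly, so control characters and the space (below 0x21) are reported as invalid.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: number of bytes (1 or 2) occupied by the character
// that starts with `lead`, or 0 if `lead` is outside the ranges in the text.
std::size_t sjis_char_length(std::uint8_t lead) {
    if (lead >= 0x21 && lead <= 0x7E) return 1;  // ASCII / JIS Roman
    if (lead >= 0xA1 && lead <= 0xDF) return 1;  // half-width katakana
    if ((lead >= 0x81 && lead <= 0x9F) ||
        (lead >= 0xE0 && lead <= 0xEF)) return 2;  // first byte of JIS X 0208
    return 0;
}

// True if `b` may be the second byte of a two-byte character.
bool sjis_is_trail_byte(std::uint8_t b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Count the characters in a Shift-JIS byte string; returns -1 on malformed input.
long sjis_count_chars(const std::string& bytes) {
    long count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::size_t len =
            sjis_char_length(static_cast<std::uint8_t>(bytes[i]));
        if (len == 0) return -1;
        if (len == 2 &&
            (i + 1 >= bytes.size() ||
             !sjis_is_trail_byte(static_cast<std::uint8_t>(bytes[i + 1])))) {
            return -1;  // missing or invalid second byte
        }
        i += len;
        ++count;
    }
    return count;
}
```

Note that the byte 0x41 is the letter A on its own but is also a legal second byte, so the bytes 41 83 41 B1 form three characters, not four. This is why each byte has to be inspected in order from the start of the text.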

MBCS (Multi Byte Character Set)

This describes an encoding in which byte sequences of varying length are associated with particular characters. Multiple character sets can be supported.

JIS

The JIS (Japanese Industrial Standard) encoding supports a number of standard Japanese character sets, some requiring one byte, others two. Escape sequences are required to shift between one- and two-byte modes.

Escape sequences, also referred to as shift sequences, are sequences of control characters. Control characters do not belong to any of the alphabets. They are artificial characters that do not have a visual representation. However, they are part of the encoding scheme, where they serve as separators between different character sets, and indicate a switch in the way a character sequence is interpreted. The use of the shift sequence is demonstrated below.

For encoding schemes containing shift sequences, like JIS, it is necessary to maintain a shift state while parsing a character sequence. In the example above, we are in some initial shift state at the start of the sequence. Here it is ASCII. Therefore, characters are assumed to be one-byte ASCII codes until the shift sequence <ESC>$B is seen. This switches us to two-byte mode, as defined by JIS X 0208-1983. The shift sequence <ESC>(B then switches us back to ASCII mode.
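Maintaining a shift state while parsing can be sketched as follows. The names ShiftState and jis_count_chars are hypothetical, and the code handles only the two shift sequences mentioned above, <ESC>$B and <ESC>(B; any other escape sequence is treated as an error.

```cpp
#include <cstddef>
#include <string>

// The two shift states used in the text.
enum class ShiftState { Ascii, Jisx0208 };

// Count the characters in a JIS-encoded byte string while tracking the shift
// state. Returns -1 on an unknown escape sequence or a truncated character.
long jis_count_chars(const std::string& bytes) {
    ShiftState state = ShiftState::Ascii;  // initial shift state
    long count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        if (bytes[i] == '\x1B') {  // ESC introduces a shift sequence
            if (bytes.compare(i, 3, "\x1B$B") == 0) {
                state = ShiftState::Jisx0208;  // switch to two-byte mode
            } else if (bytes.compare(i, 3, "\x1B(B") == 0) {
                state = ShiftState::Ascii;     // switch back to one-byte mode
            } else {
                return -1;
            }
            i += 3;
            continue;  // shift sequences are not characters themselves
        }
        const std::size_t width = (state == ShiftState::Ascii) ? 1 : 2;
        if (i + width > bytes.size()) return -1;  // truncated character
        i += width;
        ++count;
    }
    return count;
}
```

With the input "AB", <ESC>$B, four bytes, <ESC>(B, "C", the function counts 2 + 2 + 1 = 5 characters. The four bytes in the middle are read as two characters only because of the current shift state.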

EUC

Extended UNIX Code (EUC) is not peculiar to Japanese encoding. It was developed as a method for handling multiple character sets, Japanese or otherwise, within a single text stream.

The EUC encoding is more extensible than Shift-JIS and other DBCS schemes since it allows characters of more than two bytes: each character can be represented in 1-3 bytes.

The encoding scheme used for Japanese characters is as follows:

Any byte having a value in the range 0x21-7E is assumed to be a one-byte ASCII/JIS Roman character.
Any byte having a value in the range 0xA1-FE is assumed to be the first byte of a two-byte character from the set JIS X 0208-1990. The second byte must also have a value in that range.
Any byte having the value 0x8E is assumed to be followed by a second byte with a value in the range 0xA1-DF, which represents a half-width katakana character.
Any byte having the value 0x8F is assumed to be followed by two more bytes with values in the range 0xA1-FE, which together represent a character from the set JIS X 0212-1990.

The example shows EUC encoding where each character is represented in 2-3 bytes.
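The EUC rules above map naturally onto a function that reads the first byte and decides how long the character is. The names eucjp_char_length and eucjp_count_chars are hypothetical, and the sketch follows the listed ranges strictly (bytes below 0x21 are rejected).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: number of bytes (1-3) occupied by the character that
// starts with `lead`, or 0 if `lead` is outside the ranges in the text.
std::size_t eucjp_char_length(std::uint8_t lead) {
    if (lead >= 0x21 && lead <= 0x7E) return 1;  // ASCII / JIS Roman
    if (lead == 0x8E) return 2;                  // half-width katakana
    if (lead == 0x8F) return 3;                  // JIS X 0212
    if (lead >= 0xA1 && lead <= 0xFE) return 2;  // JIS X 0208
    return 0;
}

// Count the characters in an EUC byte string; returns -1 on malformed input.
long eucjp_count_chars(const std::string& bytes) {
    long count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::uint8_t lead = static_cast<std::uint8_t>(bytes[i]);
        const std::size_t len = eucjp_char_length(lead);
        if (len == 0 || i + len > bytes.size()) return -1;
        // Following bytes must be 0xA1-DF after 0x8E, otherwise 0xA1-FE.
        const std::uint8_t max_trail = (lead == 0x8E) ? 0xDF : 0xFE;
        for (std::size_t k = 1; k < len; ++k) {
            const std::uint8_t t = static_cast<std::uint8_t>(bytes[i + k]);
            if (t < 0xA1 || t > max_trail) return -1;
        }
        i += len;
        ++count;
    }
    return count;
}
```

Unlike JIS, no shift state is needed here: the first byte of each character alone determines its length.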

Unicode

Unicode is a text encoding standard designed to support the use of text written in all of the world's major writing systems. A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given a representation in the Unicode standard.

A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8 bits, in UTF-16 16 bits, and in UTF-32 32 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃, U+2603) is a single code point that takes three code units in UTF-8 but only one code unit in UTF-16 or UTF-32.

C++11 introduces new character types, along with their character traits, to support UTF-16 (char16_t) and UTF-32 (char32_t).
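The code unit counts can be checked with the C++11 string literal prefixes u8, u, and U. The function names below are hypothetical; the snowman is written as the universal character name \u2603 so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Code units needed for the snowman (U+2603) in each encoding form.
// sizeof on the u8 literal includes the terminating null, hence the -1.
std::size_t snowman_utf8_units()  { return sizeof(u8"\u2603") - 1; }
std::size_t snowman_utf16_units() { return std::u16string(u"\u2603").size(); }
std::size_t snowman_utf32_units() { return std::u32string(U"\u2603").size(); }

// A code point above U+FFFF needs two UTF-16 code units but one in UTF-32.
std::size_t princess_utf16_units() { return std::u16string(u"\U0001F478").size(); }
std::size_t princess_utf32_units() { return std::u32string(U"\U0001F478").size(); }
```

Here std::u16string and std::u32string are the standard typedefs for std::basic_string<char16_t> and std::basic_string<char32_t>, and size() counts code units, not code points.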

Storing code points in a stream is called encoding, and there are two methods:

Fixed width encoding - Every code point takes a fixed number of code units. Examples are UCS-2 and UTF-32, where a code point always takes 1 code unit. UCS-2 is an obsolete predecessor of UTF-16 that can only encode code points in the range U+0000-U+FFFF.
Variable width encoding - Every code point takes a variable number of code units. In UTF-8, a code point takes 1-4 code units (octets); in UTF-16, it takes 1 or 2. For example, the umlaut ä (U+00E4) has the value 0xE4, which fits in a single byte in a single-byte character set such as ISO 8859-1. In UTF-8, however, it is encoded as the two octets 0xC3 and 0xA4.
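To show how a code point ends up as 1-4 octets in UTF-8, here is a minimal encoder sketch. The function name encode_utf8 is hypothetical; the bit layout follows the UTF-8 definition, and invalid input (surrogates or values above U+10FFFF) yields an empty string in this sketch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 octets.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                      // 1 octet: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {              // 2 octets: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {             // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00E4 the value is above 0x7F, so the two-octet form applies: 0xC0 | (0xE4 >> 6) gives 0xC3 and 0x80 | (0xE4 & 0x3F) gives 0xA4.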

The following table shows the Unicode encodings of the string "ಖ್ರಿಷಾ Rao👸":

ಖ್ರಿ				ಷಾ		(space)	R	a	o	👸	Grapheme cluster
ಖ	್	ರ	ಿ	ಷ	ಾ	(space)	R	a	o	👸	Code point
e0 b2 96	e0 b3 8d	e0 b2 b0	e0 b2 bf	e0 b2 b7	e0 b2 be	20	52	61	6f	f0 9f 91 b8	UTF-8
c96	ccd	cb0	cbf	cb7	cbe	20	52	61	6f	d83d dc78	UTF-16
c96	ccd	cb0	cbf	cb7	cbe	20	52	61	6f	1f478	UTF-32
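The two UTF-16 code units shown for 👸 (U+1F478) form a surrogate pair. A minimal sketch of how such a pair is computed, with encode_utf16 as a hypothetical name; invalid input yields an empty string here.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // reserved for surrogates
        out += static_cast<char16_t>(cp);              // one code unit
    } else if (cp <= 0x10FFFF) {
        const char32_t v = cp - 0x10000;               // 20-bit value
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate: top 10 bits
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate: bottom 10 bits
    }
    return out;
}
```

For U+1F478, v is 0xF478; its top 10 bits are 0x3D and its bottom 10 bits are 0x78, which gives d83d dc78 as in the table. Code points up to U+FFFF, such as ಖ (U+0C96), are stored unchanged in a single code unit.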

Streams of Unicode-encoded text may begin with a byte order mark (BOM) to indicate whether the text is big endian (BE) or little endian (LE).
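A BOM is the code point U+FEFF written at the very start of the stream, so the byte order can be read off the first few bytes. A minimal detection sketch, with the names Bom and detect_bom being hypothetical:

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Hypothetical helper: inspect the first bytes of a stream for a BOM.
Bom detect_bom(const std::string& bytes) {
    auto starts_with = [&](const char* sig, std::size_t n) {
        return bytes.compare(0, n, sig, n) == 0;
    };
    // UTF-32 is checked first because the UTF-32LE BOM (FF FE 00 00) begins
    // with the UTF-16LE BOM (FF FE). UTF-16LE text whose first character is
    // U+0000 would be misread as UTF-32LE by this sketch.
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (starts_with("\xEF\xBB\xBF", 3))     return Bom::Utf8;
    if (starts_with("\xFE\xFF", 2))         return Bom::Utf16BE;
    if (starts_with("\xFF\xFE", 2))         return Bom::Utf16LE;
    return Bom::None;
}
```

UTF-8 has no byte order, so its BOM (EF BB BF) only marks the text as UTF-8. A result of Bom::None means the encoding has to be known some other way.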

Saturday, January 11, 2025
