Wednesday 13 March 2013

Character set encoding

Character set:- When we write a program, we express the source files as lines of text made up of characters from the source character set. When the program executes in the target environment, it uses characters from the target character set. These character sets are related, but they need not have the same encoding or all the same members.
               Every character set contains a distinct code value for each character in the basic C character set. A character set can also contain additional characters with other code values. For example:

  • The character constant 'x' becomes the value of the code for the character corresponding to x in the target character set.
  • The string literal "xyz" becomes a sequence of character constants stored in successive bytes of memory, followed by a byte containing the value zero: {'x','y','z','\0'}.

Character encoding:- A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a bit pattern, a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or the storage of data. Common examples of character encoding systems include Morse code, the Baudot code, ASCII and Unicode.
Here are some important terms used in character set encoding:-
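
The two bullet points above can be checked with a small C++ sketch. The names kX, kXyz, kXyzBytes and endsWithZero are made up for this illustration. The actual code value of 'x' depends on the target character set (it is 120 on ASCII-based targets), so the sketch does not rely on it.

```cpp
#include <cstddef>

// A character constant has the code of that character in the target
// (execution) character set.
constexpr char kX = 'x';

// A string literal is stored as an array of char with a zero byte appended.
constexpr char kXyz[] = "xyz";

// Number of bytes stored for the literal, including the terminating zero.
constexpr std::size_t kXyzBytes = sizeof(kXyz);

// True when the last stored byte of the literal is the value zero.
constexpr bool endsWithZero() { return kXyz[kXyzBytes - 1] == '\0'; }
```

Here sizeof(kXyz) counts three characters plus the terminating zero byte.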

ASCII:- ASCII stands for American Standard Code for Information Interchange. It is a character encoding scheme originally based on the English alphabet. ASCII codes represent text in computers, communication equipment and other devices that use text. Most modern character encoding schemes are based on ASCII, though they support many additional characters. ASCII developed from telegraphic codes. Its first commercial use was as a seven-bit teleprinter code. ASCII includes definitions for 128 characters: 33 are non-printing control characters that affect how text and space are processed, and 95 are printable characters, including the space, which is considered an invisible graphic. Extended ASCII encodings use 8 bits and comprise 256 code points in the range 00 (hex) to FF (hex).
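
The split into 33 control characters and 95 printable characters can be verified by counting. This is a minimal sketch that assumes the usual ASCII layout (codes 0 to 31 and 127 are control codes, codes 32 to 126 are printable). The function names are made up for this illustration, and it needs C++14 or later because of the loops inside constexpr functions.

```cpp
// Control codes in ASCII: 0-31 and 127 (DEL).
constexpr bool isAsciiControl(int code) {
    return (code >= 0 && code <= 31) || code == 127;
}

// Printable codes in ASCII: 32 (space) through 126 (~).
constexpr bool isAsciiPrintable(int code) {
    return code >= 32 && code <= 126;
}

// Count the control codes among the 128 seven-bit values.
constexpr int countAsciiControl() {
    int n = 0;
    for (int c = 0; c < 128; ++c) {
        if (isAsciiControl(c)) ++n;
    }
    return n;
}

// Count the printable codes among the 128 seven-bit values.
constexpr int countAsciiPrintable() {
    int n = 0;
    for (int c = 0; c < 128; ++c) {
        if (isAsciiPrintable(c)) ++n;
    }
    return n;
}
```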

EBCDIC:- EBCDIC stands for Extended Binary Coded Decimal Interchange Code. It is an 8-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It was devised in 1963 and 1964, at about the same time as ASCII, and descends from the older punched-card codes. EBCDIC has no technical advantage over ASCII-based code pages.

Unicode:- Unicode is a computer industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. It is a worldwide character encoding standard that provides a unique number to represent each character used in modern computing, including technical symbols and special characters used in publishing.



UNICODE
Universal character names (where N is a hexadecimal digit)

Universal character name        ISO/IEC 10646 short name
\UNNNNNNNN                      NNNNNNNN
\uNNNN                          0000NNNN

UTF literals

Syntax                          Explanation
u'character'                    Denotes a UTF-16 character.
u"character-sequence"           Denotes an array of UTF-16 characters.
U'character'                    Denotes a UTF-32 character.
U"character-sequence"           Denotes an array of UTF-32 characters.


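examples:

u'\u1234'

u"\u1234\u8189"

U'\U0001F600'

U"\U0001F600\U00010348"

A universal character name must designate a code point no greater than 10FFFF (hex), so the \U forms above use values inside that range.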
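
As a rough sketch of how these literals behave in C++11 and later, the snippet below stores a few of them and counts their code units. The constant names and the helper unitCount are made up for this illustration, and the code points are chosen to lie inside the Unicode range. It assumes the compiler encodes u"" literals as UTF-16 and U"" literals as UTF-32, which is what the common compilers do.

```cpp
#include <cstddef>

// Single UTF-16 and UTF-32 characters.
constexpr char16_t kUnit16 = u'\u1234';
constexpr char32_t kUnit32 = U'\U0001F600';

// Arrays of UTF-16 and UTF-32 code units (a zero terminator is appended).
constexpr char16_t kText16[] = u"\u1234\u8189";
constexpr char32_t kText32[] = U"\U0001F600\U00010348";

// A code point above FFFF (hex) needs two UTF-16 code units.
constexpr char16_t kPair16[] = u"\U0001F600";

// Number of code units in a literal, not counting the terminator.
template <typename T, std::size_t N>
constexpr std::size_t unitCount(const T (&)[N]) {
    return N - 1;
}
```

Note that unitCount(kPair16) is 2 even though the literal holds a single character, because UTF-16 stores it as a pair of code units.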




                 Unicode comprises 1,114,112 code points in the range 0 (hex) to 10FFFF (hex). The Unicode code space is divided into seventeen planes (the Basic Multilingual Plane and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17*65,536 = 1,114,112. Unicode is required by modern standards such as XML and JavaScript. It is supported by many operating systems and by all modern browsers. It has two mapping methods:-
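
The arithmetic can be written down directly. In this small sketch the constant names and the helper planeOf are made up for illustration. Each plane holds 2^16 code points, so the plane number of a code point is simply its value shifted right by 16 bits.

```cpp
// Seventeen planes: the Basic Multilingual Plane plus 16 supplementary planes.
constexpr unsigned long kPlaneCount = 17;

// Each plane holds 2^16 = 65,536 code points.
constexpr unsigned long kPointsPerPlane = 1UL << 16;

// Total size of the Unicode code space.
constexpr unsigned long kCodeSpaceSize = kPlaneCount * kPointsPerPlane;

// Largest code point (code points start at zero).
constexpr unsigned long kMaxCodePoint = kCodeSpaceSize - 1;

// Plane number (0 to 16) of a code point inside the code space.
constexpr unsigned planeOf(unsigned long codePoint) {
    return static_cast<unsigned>(codePoint >> 16);
}
```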

1) UTF:- It stands for Unicode Transformation Format. For the UTF encodings, the number in the name of the encoding indicates the number of bits in one code unit.
  a) UTF-8:- UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all web pages. UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs and software applications.
  b) UTF-16:- UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding 1,112,064 code points in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
  c) UTF-32:- UTF-32 is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point. All other Unicode transformation formats use variable-length encodings. The main advantage of UTF-32 over variable-length encodings is that Unicode code points are directly indexable.
  d) UTF-EBCDIC:- UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly.
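
To make "variable width" concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The struct Utf8 and the function encodeUtf8 are names made up for this illustration and are not part of any library. It rejects the surrogate range and values above 10FFFF (hex) by returning a length of 0, and it needs C++14 or later.

```cpp
#include <cstdint>

// Result of encoding one code point: up to four bytes.
struct Utf8 {
    std::uint8_t bytes[4];
    int length;  // number of bytes used; 0 means the input was not encodable
};

// Encode one Unicode code point as UTF-8 (hypothetical helper).
constexpr Utf8 encodeUtf8(std::uint32_t cp) {
    Utf8 out{{0, 0, 0, 0}, 0};
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // outside the code space, or a surrogate
    }
    if (cp < 0x80) {             // 7 bits: one byte, identical to ASCII
        out.bytes[0] = static_cast<std::uint8_t>(cp);
        out.length = 1;
    } else if (cp < 0x800) {     // 11 bits: two bytes
        out.bytes[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
        out.bytes[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        out.length = 2;
    } else if (cp < 0x10000) {   // 16 bits: three bytes
        out.bytes[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
        out.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        out.length = 3;
    } else {                     // 21 bits: four bytes
        out.bytes[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
        out.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
        out.bytes[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        out.length = 4;
    }
    return out;
}
```

For example, encodeUtf8(0x41) gives the single byte 41 (hex), the same as ASCII, while encodeUtf8(0x20AC) gives the three bytes E2 82 AC (hex).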

2) UCS:- The Universal Character Set (UCS) is a standard set of characters upon which many character encodings are based. The UCS contains more than one hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point. Characters (letters, numbers, symbols, ideograms, logograms etc.) from the many languages, scripts and traditions of the world are represented in the UCS with unique code points. It has two forms:-

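a) UCS-2:- UCS-2 is similar to UTF-16, but it always uses exactly one 16-bit code unit per character, so it can only represent code points in the Basic Multilingual Plane (0 to FFFF hex). UTF-16 extends it with pairs of code units (surrogate pairs) for the supplementary planes.

b) UCS-4:- UCS-4 is similar to UTF-32. It uses one 32-bit code unit per code point.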
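
The difference between UCS-2 and UTF-16 shows up for code points above FFFF (hex). The sketch below uses made-up names (Utf16, fitsInUcs2, encodeUtf16, kEncodableCodePoints). It encodes one code point as UTF-16 and reports whether UCS-2 could have held it. It needs C++14 or later.

```cpp
#include <cstdint>

// Result of encoding one code point: one or two 16-bit code units.
struct Utf16 {
    char16_t units[2];
    int length;  // number of code units used; 0 means not encodable
};

// Code points UTF-16 can encode: the whole code space minus 2,048 surrogates.
constexpr std::uint32_t kEncodableCodePoints = 0x110000 - 0x800;

// True when a single 16-bit code unit is enough (what UCS-2 can represent).
constexpr bool fitsInUcs2(std::uint32_t cp) {
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encode one Unicode code point as UTF-16 (hypothetical helper).
constexpr Utf16 encodeUtf16(std::uint32_t cp) {
    Utf16 out{{0, 0}, 0};
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // outside the code space, or a surrogate
    }
    if (cp <= 0xFFFF) {
        out.units[0] = static_cast<char16_t>(cp);  // same value as in UCS-2
        out.length = 1;
    } else {
        const std::uint32_t v = cp - 0x10000;      // 20-bit offset
        out.units[0] = static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out.units[1] = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
        out.length = 2;
    }
    return out;
}
```

For U+1F600 this produces the pair D83D DE00 (hex), which has no UCS-2 equivalent.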

Multibyte Character Encoding:- A multibyte character encoding uses a varying number of bytes to encode different characters. Multibyte encodings are usually the result of a need to increase the number of characters that can be encoded without breaking backward compatibility with an existing constraint.

Wide-character:- A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets. The term "wide character" refers only to the size of the datatype in memory. It does not state how each value in a character set is defined. Those values are instead defined by character sets, with UCS and Unicode simply being two common character sets that contain more characters than an 8-bit value would allow.
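
In C and C++ the wide character type is wchar_t, and its size is left to the implementation (commonly 16 bits on Windows and 32 bits on Linux). The following sketch only measures the sizes. The constant names are made up for this illustration.

```cpp
#include <climits>
#include <cstddef>

// Width in bits of the ordinary and the wide character types on this target.
constexpr std::size_t kCharBits = sizeof(char) * CHAR_BIT;
constexpr std::size_t kWideCharBits = sizeof(wchar_t) * CHAR_BIT;

// A wide string literal: an array of wchar_t with a zero terminator appended.
constexpr wchar_t kWideText[] = L"xyz";

// Bytes occupied by the wide literal, including the terminator.
constexpr std::size_t kWideTextBytes = sizeof(kWideText);
```

The same three letters take 4 bytes as "xyz" but 4 * sizeof(wchar_t) bytes as L"xyz".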

Here are some questions whose answers are useful to know:-

1) What is the default value of the character encoding in Visual Studio?
Ans:- Character Set - Use Multi-Byte Character Set
2) What are the possible values of the character encoding in VS?
Ans:- 1) Use Multi-Byte Character Set
        2) Not Set
        3) Use Unicode Character Set
3) How can you change this value in VS?
Ans:- We can change this value with the following procedure:-
        a) Go to the project properties in Visual Studio.
        b) There is an option for Character Set under Project Defaults, where "Use Multi-Byte Character Set"
           is already selected. We can change it by clicking the drop-down button beside it.
        c) From there we can choose any option among the possible values.
        d) Then click OK or Apply to set it.
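
In Visual C++ the Character Set property works through preprocessor macros: "Use Unicode Character Set" defines _UNICODE and UNICODE, and "Use Multi-Byte Character Set" defines _MBCS. The sketch below reports which setting a build was compiled with. The enum CharacterSet and the two functions are names made up for this illustration. On other compilers none of these macros is normally defined, so the result there is "Not Set".

```cpp
// The three values offered by the Character Set property.
enum class CharacterSet { NotSet, MultiByte, Unicode };

// Detect the setting from the macros the project property defines.
constexpr CharacterSet configuredCharacterSet() {
#if defined(_UNICODE) || defined(UNICODE)
    return CharacterSet::Unicode;
#elif defined(_MBCS)
    return CharacterSet::MultiByte;
#else
    return CharacterSet::NotSet;
#endif
}

// Name of a setting as it appears in the property drop-down.
constexpr const char* settingName(CharacterSet cs) {
    return cs == CharacterSet::Unicode     ? "Use Unicode Character Set"
           : cs == CharacterSet::MultiByte ? "Use Multi-Byte Character Set"
                                           : "Not Set";
}
```

Calling settingName(configuredCharacterSet()) in a program gives the name of the option the project was built with.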

4) What is the code unit size for each character set?
Ans:- Here is the code unit size for each character set:-
1) US-ASCII       - 7 bits
2) UTF-8          - 8 bits
3) EBCDIC         - 8 bits
4) UTF-16         - 16 bits
5) UTF-32         - 32 bits
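
The code unit size is not the same thing as the size of a character: in UTF-8 and UTF-16 a single code point may need several code units. The sketch below counts code units and bytes per code point. The function names are made up for this illustration, and it assumes the argument is a valid code point (not a surrogate, not above 10FFFF hex).

```cpp
#include <cstdint>

// Code units needed for one code point in each encoding.
constexpr int utf8Units(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
constexpr int utf16Units(std::uint32_t cp) { return cp < 0x10000 ? 1 : 2; }
constexpr int utf32Units(std::uint32_t) { return 1; }

// Bytes needed: code units times the code unit size (8, 16 or 32 bits).
constexpr int utf8Bytes(std::uint32_t cp) { return utf8Units(cp) * 1; }
constexpr int utf16Bytes(std::uint32_t cp) { return utf16Units(cp) * 2; }
constexpr int utf32Bytes(std::uint32_t cp) { return utf32Units(cp) * 4; }
```

For the letter A (41 hex) the three encodings need 1, 2 and 4 bytes. For U+1F600 all three need 4 bytes.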


