Traditionally, we used ASCII (American Standard Code for Information Interchange), a 7-bit code that defines 128 characters, to represent alphanumeric characters.
ASCII was later extended to 8 bits, adding another 128 characters. This allows some German characters such as ä, ö, and ü, and the British currency symbol £, to be included.
However, not every language uses Latin letters. Chinese, Korean, and Thai are examples of languages that use different character sets.
Unicode attempts to include all characters in all languages in the world into one single character set.
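To make the limits of ASCII and its 8-bit extensions concrete, here is a minimal Python sketch. It assumes Latin-1 (ISO-8859-1) as one representative 8-bit extension; the text above does not name a specific one, and the helper `can_encode` is a hypothetical name used only for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Latin-1 is an assumed example of an 8-bit extension of ASCII.
samples = ["A", "ä", "£", "中", "한", "ก"]

for ch in samples:
    # ord() gives the Unicode code point, which exists for every sample.
    print(
        f"{ch!r} U+{ord(ch):04X} "
        f"ascii={can_encode(ch, 'ascii')} "
        f"latin-1={can_encode(ch, 'latin-1')}"
    )
```

Only the letter A fits in ASCII, ä and £ additionally fit in the 8-bit extension, and the Chinese, Korean, and Thai characters fit in neither, although each of them has a Unicode code point.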
The Unicode Consortium defines three encoding forms for this character set:
- UTF-8 (Unicode Transformation Format, 8-bit encoding form). Each Unicode character is encoded as a variable-length sequence of one to four bytes (8-bit code units). This encoding is popular for HTML and for network protocols.
- UTF-16 (Unicode Transformation Format, 16-bit encoding form). In this encoding, the more commonly used characters each fit into a single 16-bit code unit; all other characters take two code units (a surrogate pair). The .NET Framework uses this encoding.
- UTF-32 (Unicode Transformation Format, 32-bit encoding form). This encoding uses a single 32-bit code unit for every character.
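The difference between the three encoding forms is easiest to see by counting bytes. The following is a small sketch in Python; the function name `encoded_sizes` and the sample characters are our own choices for illustration, not part of any standard.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants fix the byte order so that Python does not
        # prepend a byte order mark, which would inflate the count.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for ch in ["A", "ä", "中", "😀"]:
    print(ch, encoded_sizes(ch))
```

UTF-8 grows from one byte for A to four bytes for the emoji, UTF-16 uses two bytes for the first three characters and four (a surrogate pair) for the emoji, and UTF-32 always uses four bytes.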
In HTML and web applications, we generally use UTF-8 encoding.
If we create an HTML template, it will generally look like the one below, which declares UTF-8 as the character encoding:
<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
</body>
</html>
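To see why the declared encoding matters, the sketch below simulates what happens when bytes saved as UTF-8 are read with a different encoding. It is an illustration in Python, not a model of how any particular browser behaves: Latin-1 is assumed here as a stand-in for a wrong guess, and `decode_as` is a hypothetical helper.

```python
def decode_as(text, actual="utf-8", assumed="latin-1"):
    """Encode `text` with `actual`, then decode the bytes as `assumed`."""
    return text.encode(actual).decode(assumed)


# Bytes written as UTF-8 but read as Latin-1 (an assumed wrong guess).
print(decode_as("£5"))

# Bytes written as UTF-8 and read as UTF-8, as the meta tag declares.
print(decode_as("£5", assumed="utf-8"))
```

With the wrong encoding, the two UTF-8 bytes of £ are read as two separate characters and the text appears as Â£5; with the matching encoding, the original £5 is recovered.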