Character Encoding Guide - ASCII, Unicode, and UTF-8 Explained

Published 2025-03-15 · ToolNest

Garbled text and encoding errors are common frustrations in development. Understanding character encoding eliminates most of these problems.

Why Encoding Exists

Computers only process numbers. To represent characters, we need an agreed mapping between characters and numbers. That mapping is a character encoding.
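As a small illustration of that mapping, here is a sketch in Python (chosen only as a convenient scratchpad; the variable names are made up for this example). The built-ins ord() and chr() convert between a character and its number:

```python
# ord() maps a character to its number; chr() maps the number back.
letter_number = ord("A")   # 65
number_letter = chr(65)    # "A"

print(letter_number, number_letter)
```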

History of Encoding

1. ASCII (1963)

  • 7-bit, 128 characters
  • English letters, digits, punctuation, control characters
  • Only enough for English

2. Extended ASCII / Latin-1

  • 8-bit, 256 characters
  • Added European language characters
  • Still no Chinese support
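The limits of these two encodings are easy to probe. The sketch below uses a hypothetical helper, can_encode, that is not part of any library; it simply reports whether Python's codec for a given encoding accepts every character in a string:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

e_acute = "\u00e9"  # "é", written as an escape to avoid normalization surprises

print(can_encode("hello", "ascii"))     # True
print(can_encode(e_acute, "ascii"))     # False: outside ASCII's 128 characters
print(can_encode(e_acute, "latin-1"))   # True: Latin-1 stores it as byte 0xE9
print(can_encode("\u4e2d", "latin-1"))  # False: "中" is not in Latin-1
```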

3. GB2312 / GBK / GB18030 (Chinese)

  • GB2312 and GBK use 2 bytes per Chinese character; GB18030 adds 4-byte sequences
  • Chinese national standards
  • Not compatible with other regional encodings: the same bytes mean different characters
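To make the byte counts concrete, the following sketch encodes the same character with Python's built-in gbk and utf-8 codecs and compares the results. The variable names are illustrative only:

```python
char = "\u4e2d"  # "中"

gbk_bytes = char.encode("gbk")
utf8_bytes = char.encode("utf-8")

print(gbk_bytes.hex(" ").upper())   # D6 D0
print(utf8_bytes.hex(" ").upper())  # E4 B8 AD

# GB18030 falls back to 4-byte sequences for characters outside the
# 2-byte range, such as this emoji (U+1F600).
emoji_len = len("\U0001F600".encode("gb18030"))
print(emoji_len)                    # 4
```

The same character yields entirely different bytes in each encoding, which is why a reader must know which encoding was used.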

4. Unicode

  • Goal: cover every character in the world
  • More than 150,000 characters as of Unicode 16.0, and counting
  • Each character has a unique code point (e.g., "中" = U+4E2D)
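Code points can be inspected directly. This is a minimal sketch using Python's standard unicodedata module; describe is a hypothetical helper written for this example:

```python
import unicodedata

def describe(char: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

print(describe("\u4e2d"))  # U+4E2D CJK UNIFIED IDEOGRAPH-4E2D
print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
```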

5. UTF-8 (a Unicode Encoding)

  • Variable-length encoding: 1-4 bytes
  • ASCII characters: 1 byte (backward compatible)
  • Chinese characters: typically 3 bytes
  • The de facto standard for the web
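The variable length is easy to observe. The sketch below measures the UTF-8 byte length of one sample character from each size class; the sample characters are this example's own choice:

```python
# One sample per length class: "A", "é", "中", and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")
print(ascii_compatible)  # True
```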

UTF-8 Encoding Rules

Code Point Range     Bytes   Format
U+0000-U+007F        1       0xxxxxxx
U+0080-U+07FF        2       110xxxxx 10xxxxxx
U+0800-U+FFFF        3       1110xxxx 10xxxxxx 10xxxxxx
U+10000-U+10FFFF     4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
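The table above translates almost directly into code. The following is a teaching sketch only; utf8_encode is a hypothetical function, and real programs should use str.encode("utf-8") instead:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the table's bit layouts."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0x4E2D).hex(" ").upper())  # E4 B8 AD
```

Each branch copies the code point's bits into the x positions of the matching row, six bits per continuation byte.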

Example: "中" (U+4E2D) → UTF-8 bytes: E4 B8 AD

This is why "中" becomes %E4%B8%AD when percent-encoded in a URL: each UTF-8 byte is written as % followed by its two hex digits.
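The URL form can be reproduced with Python's standard urllib.parse module, which percent-encodes using UTF-8 by default. A short sketch, with variable names chosen for this example:

```python
from urllib.parse import quote, unquote

encoded = quote("\u4e2d")    # percent-encode "中" using its UTF-8 bytes
decoded = unquote(encoded)   # reverse the process

print(encoded)  # %E4%B8%AD
print(decoded)  # 中
```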

Common Causes of Garbled Text

1. Encoding Mismatch

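A file saved as UTF-8 but decoded as GBK (or the reverse) produces garbled characters, because the same bytes map to different characters in each encoding.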
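A mismatch like this can be reproduced in a few lines. The sketch below assumes a lenient reader, simulated here with errors="replace", for the UTF-8-as-GBK case, and shows that the opposite mismatch is rejected outright by Python's strict UTF-8 decoder:

```python
text = "\u4e2d\u6587"  # "中文"

# UTF-8 bytes read with the wrong decoder. errors="replace" keeps going
# instead of raising, roughly what a lenient editor or browser does.
garbled = text.encode("utf-8").decode("gbk", errors="replace")
print(garbled)  # unreadable characters, not the original text

# The opposite mismatch: GBK bytes are not valid UTF-8, so strict
# decoding raises instead of silently producing garbage.
try:
    text.encode("gbk").decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(mismatch_detected)  # True
```

Decoding with the encoding that was actually used to write the bytes is the only reliable fix.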

2. BOM Issues

A UTF-8 file may start with a BOM (bytes EF BB BF). Programs that don't recognize it show stray characters such as ï»¿ at the beginning of the text.
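The effect of the BOM depends on the decoder. This sketch decodes the same bytes three ways using Python's standard codecs; the utf-8-sig codec strips a leading BOM if one is present:

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")  # EF BB BF 68 65 6C 6C 6F

as_latin1 = raw.decode("latin-1")      # BOM bytes appear as three visible characters
as_utf8 = raw.decode("utf-8")          # BOM survives as an invisible U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is removed

print(repr(as_latin1))    # 'ï»¿hello'
print(repr(as_utf8))      # '\ufeffhello'
print(repr(as_utf8_sig))  # 'hello'
```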

3. MySQL Character Set

  • MySQL's utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so 4-byte characters such as most emoji fail
  • utf8mb4 is true UTF-8; always use it
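One way to see which strings are affected is to check for code points above U+FFFF, since those are the ones that need 4 bytes in UTF-8. The helper below, needs_utf8mb4, is a hypothetical name for this sketch and does not talk to MySQL at all:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("\u4e2d\u6587"))   # False: "中文" is 3 bytes per character
print(needs_utf8mb4("ok \U0001F600"))  # True: the emoji takes 4 bytes
```

A check like this can help when auditing existing data before a migration, but switching the column to utf8mb4 remains the actual fix.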

Why Base64-Encoding Chinese Text Fails

JavaScript's btoa() only accepts characters in the Latin-1 range (code points 0-255). Chinese characters fall outside this range, so btoa() throws an InvalidCharacterError.

Fix: Use TextEncoder to convert to UTF-8 bytes first, then encode to Base64. See our Base64 guide for details.
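The same two-step idea (text to UTF-8 bytes, then bytes to Base64) can be sketched in Python. This is an analogue of the JavaScript fix, not the JavaScript code itself; the Latin-1 check only mimics btoa()'s restriction:

```python
import base64

text = "\u4e2d\u6587"  # "中文"

# Mimic btoa()'s limit: the text cannot be represented in Latin-1.
try:
    text.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# Step 1: text -> UTF-8 bytes. Step 2: bytes -> Base64.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
# Reverse both steps to decode.
decoded = base64.b64decode(encoded).decode("utf-8")

print(fits_latin1)  # False
print(encoded)      # 5Lit5paH
print(decoded)      # 中文
```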

Best Practices

  1. Always use UTF-8 everywhere: files, databases, HTTP headers
  2. HTML declaration: <meta charset="UTF-8"> as the first element in <head>
  3. HTTP header: Content-Type: text/html; charset=utf-8
  4. Database: Use utf8mb4, not utf8
  5. Editor: Default to UTF-8 without BOM
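For files, "UTF-8 everywhere" means never relying on a platform default. The sketch below, in Python with a temporary file and made-up names, passes the encoding explicitly on both write and read and confirms that no BOM was written:

```python
import tempfile
from pathlib import Path

original = "\u4e2d\u6587 caf\u00e9"  # "中文 café"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    # Name the encoding explicitly instead of trusting the OS default.
    path.write_text(original, encoding="utf-8")
    raw = path.read_bytes()
    round_tripped = path.read_text(encoding="utf-8")

has_bom = raw.startswith(b"\xef\xbb\xbf")

print(round_tripped == original)  # True
print(has_bom)                    # False: the plain "utf-8" codec writes no BOM
```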

Use the ToolNest Base64 tool for UTF-8-safe encoding and decoding.
