Character Encoding Guide - ASCII, Unicode, and UTF-8 Explained
Garbled text and encoding errors are common frustrations in development. Understanding character encoding eliminates most of these problems.
Why Encoding Exists
Computers only process numbers. To represent characters, we need a mapping between numbers and characters — that's encoding.
History of Encoding
1. ASCII (1963)
- 7-bit, 128 characters
- English letters, digits, punctuation, control characters
- Only enough for English
2. Extended ASCII / Latin-1
- 8-bit, 256 characters
- Added European language characters
- Still no Chinese support
3. GB2312 / GBK / GB18030 (Chinese)
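- GB2312 and GBK use 2 bytes per Chinese character; GB18030 extends this to as many as 4 bytes
- Chinese national standards
- Not compatible with the encodings used for other languages: the same bytes decode to different characters elsewhere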
4. Unicode
- Goal: cover every character in the world
- 140,000+ characters and counting
- Each character has a unique code point (e.g., "中" = U+4E2D)
5. UTF-8 (a Unicode Encoding)
- Variable-length encoding: 1-4 bytes
- ASCII characters: 1 byte (backward compatible)
- Chinese characters: typically 3 bytes
- The de facto standard for the web
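The distinction between a code point (Unicode) and its bytes (an encoding such as UTF-8 or GBK) is easy to see in a few lines. The following is a minimal sketch in Python, chosen here only for illustration; it uses nothing beyond the built-in ord and str.encode.

```python
# A code point is just a number; an encoding decides how it becomes bytes.
ch = "中"

code_point = ord(ch)               # Unicode code point as an integer
utf8_bytes = ch.encode("utf-8")    # 3 bytes in UTF-8
gbk_bytes = ch.encode("gbk")       # 2 bytes in GBK

print(f"U+{code_point:04X}")       # U+4E2D
print(utf8_bytes.hex(" "))         # e4 b8 ad
print(gbk_bytes.hex(" "))          # d6 d0

# ASCII text is byte-for-byte identical in UTF-8 (backward compatibility).
ascii_same = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(ascii_same)                  # True
```

The same character yields different bytes under different encodings, which is why the decoder has to know which encoding was used.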
UTF-8 Encoding Rules
| Code Point Range | Bytes | Format |
|---|---|---|
| U+0000-007F | 1 | 0xxxxxxx |
| U+0080-07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800-FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000-10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Example: "中" (U+4E2D) → UTF-8 bytes: E4 B8 AD
This is why "中" in URL encoding becomes %E4%B8%AD.
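The rules in the table can be applied by hand with a few bit shifts. The sketch below is illustrative only: utf8_encode is a hypothetical helper written for this guide, and in real code the built-in str.encode does the same job. It is checked against the built-in encoder and against the standard library's URL quoting.

```python
from urllib.parse import quote


def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point by hand, following the table above."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0x7F:        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x4E2D)
print(encoded.hex(" ").upper())           # E4 B8 AD
print(encoded == "中".encode("utf-8"))     # True
print(quote("中"))                         # %E4%B8%AD
```

URL encoding simply writes each UTF-8 byte as a percent sign followed by two hex digits, so the two outputs line up byte for byte.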
Common Causes of Garbled Text
1. Encoding Mismatch
File is UTF-8 but decoded as GBK → garbled characters.
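A mismatch like this is easy to reproduce. The sketch below, in Python for illustration, decodes UTF-8 bytes with the wrong codec; errors="replace" is an assumption made here so the demo never raises on byte pairs GBK cannot map.

```python
text = "中文编码"
raw = text.encode("utf-8")                      # the bytes as stored on disk

# Wrong codec: the bytes are valid UTF-8 but are interpreted as GBK.
garbled = raw.decode("gbk", errors="replace")
# Right codec: the original text comes back intact.
correct = raw.decode("utf-8")

print(garbled)          # mojibake, not the original text
print(correct)          # 中文编码

# The opposite mismatch (GBK bytes read as UTF-8) fails loudly here,
# because the GBK bytes for "中" are not a valid UTF-8 sequence.
try:
    "中".encode("gbk").decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)    # True
```

The bytes themselves are never damaged by a wrong decode; re-reading them with the correct encoding restores the text.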
2. BOM Issues
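A UTF-8 BOM (bytes EF BB BF) may sit at the start of a file. Programs that don't recognize it treat those bytes as text, which leaves stray characters (typically displayed as ï»¿) at the beginning.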
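One way to handle a possible BOM when reading, sketched in Python: the standard utf-8-sig codec strips the BOM if it is present and otherwise behaves like plain UTF-8. The sample data is made up for the demo.

```python
import codecs

# Sample bytes: a UTF-8 BOM (EF BB BF) followed by ordinary text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

naive = data.decode("utf-8")        # BOM survives as the invisible U+FEFF
clean = data.decode("utf-8-sig")    # BOM is stripped when present

print(repr(naive))                  # '\ufeffhello'
print(repr(clean))                  # 'hello'

# A program that assumes Latin-1 shows the BOM bytes as visible debris.
print(data.decode("latin-1"))       # ï»¿hello
```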
3. MySQL Character Set
- MySQL's utf8 only supports characters of up to 3 bytes (some emoji fail)
- utf8mb4 is true UTF-8; always use this
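To see which text is affected, it is enough to check whether any character needs a fourth UTF-8 byte, that is, whether its code point lies above U+FFFF. The helper below is hypothetical and written only for this illustration; it is not part of MySQL or of any MySQL driver.

```python
def fits_in_three_bytes(text: str) -> bool:
    """Hypothetical check: True if every character encodes to at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(len("中".encode("utf-8")))        # 3
print(len("😀".encode("utf-8")))        # 4
print(fits_in_three_bytes("中文"))       # True
print(fits_in_three_bytes("ok 😀"))      # False
```

Text for which this returns False needs a utf8mb4 column to be stored intact.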
Why Chinese Base64 Fails
JavaScript's btoa() only accepts Latin-1 characters (code points 0-255). Chinese characters fall outside this range, so btoa() throws an error.
Fix: Use TextEncoder to convert to UTF-8 bytes first, then encode to Base64. See our Base64 guide for details.
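The same two-step idea (text to UTF-8 bytes, then bytes to Base64) is sketched below in Python rather than JavaScript, to match the other examples in this guide. In the browser the first step would be TextEncoder, as described above. Only the standard base64 module is assumed.

```python
import base64

text = "中"

# Step 1: text -> UTF-8 bytes. Step 2: bytes -> Base64 (ASCII-only output).
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")

# Decoding reverses both steps: Base64 -> bytes -> text.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)   # 5Lit
print(decoded)   # 中
```

Base64 operates on bytes, not characters, so the encoding step has to come first in any language.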
Best Practices
- Always use UTF-8 everywhere: files, databases, HTTP headers
- HTML declaration: <meta charset="UTF-8"> as first element in <head>
- HTTP header: Content-Type: text/html; charset=utf-8
- Database: Use utf8mb4, not utf8
- Editor: Default to UTF-8 without BOM
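For files, "UTF-8 everywhere" mostly means never relying on the platform default. A minimal Python sketch, assuming a throwaway temporary file created just for the demo:

```python
import os
import tempfile

text = "编码 encoding ✓"
# Throwaway location for the demo; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Always name the encoding instead of trusting the system default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Python's "utf-8" codec writes no BOM, matching the advice above.
with open(path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"

print(round_trip == text)   # True
print(has_bom)              # False
```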
Use the ToolNest Base64 tool for UTF-8-safe encoding and decoding.