Character Encoding Guide - ASCII, Unicode, and UTF-8 Explained

Published 2025-03-15 · ToolNest

Garbled text and encoding errors are common frustrations in development. Understanding character encoding eliminates most of these problems.

Why Encoding Exists

Computers only process numbers. To represent characters, we need a mapping between numbers and characters — that's encoding.

History of Encoding

1. ASCII (1963)

7-bit, 128 characters
English letters, digits, punctuation, control characters
Only enough for English

2. Extended ASCII / Latin-1

8-bit, 256 characters
Added Western European letters such as é, ü, and ñ
Still no Chinese support

3. GB2312 / GBK / GB18030 (Chinese)

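GB2312 and GBK: 2 bytes per Chinese character; GB18030 adds 4-byte sequences
Chinese national standards, backward compatible with ASCII
Incompatible with other regional encodings such as Big5 and Shift_JIS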
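
To make the byte counts concrete, here is a small Python sketch using the standard library's gbk and gb18030 codecs. The sample characters are chosen here purely for illustration; note that GB18030 also defines 4-byte sequences for characters outside the Basic Multilingual Plane.

```python
# Compare how the same character is stored in GBK and in UTF-8.
han = "中"

gbk_bytes = han.encode("gbk")        # 2 bytes in GBK
utf8_bytes = han.encode("utf-8")     # 3 bytes in UTF-8

# GB18030 needs 4 bytes for characters outside the Basic Multilingual Plane.
emoji_gb18030 = "😀".encode("gb18030")

print(gbk_bytes.hex(" "), len(gbk_bytes))     # d6 d0 2
print(utf8_bytes.hex(" "), len(utf8_bytes))   # e4 b8 ad 3
print(len(emoji_gb18030))                     # 4
```

The same character gets different bytes in each encoding, which is why a decoder has to know which one was used.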

4. Unicode

Goal: cover every character in the world
More than 150,000 characters as of Unicode 16.0, and counting
Each character has a unique code point (e.g., "中" = U+4E2D)
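
A code point is just an integer. As a minimal sketch, Python's built-in ord() and chr() convert between a character and its code point; the variable names are illustrative.

```python
# ord() maps a character to its code point; chr() maps it back.
code_point = ord("中")
label = f"U+{code_point:04X}"   # conventional U+XXXX notation

print(label)          # U+4E2D
print(chr(0x4E2D))    # 中
```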

5. UTF-8 (Unicode Encoding)

Variable-length encoding: 1-4 bytes
ASCII characters: 1 byte (backward compatible)
Chinese characters: typically 3 bytes
The de facto standard for the web
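
The variable length is easy to observe. The sketch below, with sample characters picked for illustration, counts the UTF-8 bytes for one character from each length class and checks that ASCII text is byte-for-byte identical in both encodings.

```python
# One sample character per UTF-8 length class (1 to 4 bytes).
samples = ["A", "\u00e9", "中", "😀"]   # "\u00e9" is é

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_lengths)   # {'A': 1, 'é': 2, '中': 3, '😀': 4}

# Backward compatibility: pure ASCII text encodes to the same bytes.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(ascii_compatible)   # True
```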

UTF-8 Encoding Rules

Code Point Range	Bytes	Format
U+0000-007F	1	0xxxxxxx
U+0080-07FF	2	110xxxxx 10xxxxxx
U+0800-FFFF	3	1110xxxx 10xxxxxx 10xxxxxx
U+10000-10FFFF	4	11110xxx ...
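
The rules in the table can be sketched as a few bit operations. The function below is a hypothetical helper written for illustration (in practice str.encode("utf-8") does this for you); it assumes the 4-byte form continues with the same 10xxxxxx continuation bytes as the shorter forms, as defined in RFC 3629.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the bit layouts from the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx followed by three continuation bytes
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x4E2D).hex(" "))   # e4 b8 ad
```

Each continuation byte carries 6 bits of the code point, and the leading byte's prefix tells the decoder how many bytes follow.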

Example: "中" (U+4E2D) → UTF-8 bytes: E4 B8 AD

This is why "中" in URL encoding becomes %E4%B8%AD.
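
Percent-encoding can be reproduced with Python's standard urllib.parse module, which encodes to UTF-8 by default. A minimal sketch:

```python
from urllib.parse import quote, unquote

# quote() converts the text to UTF-8 bytes and writes each byte as %XX.
encoded = quote("中")
print(encoded)            # %E4%B8%AD

# unquote() reverses the process, again assuming UTF-8.
decoded = unquote(encoded)
print(decoded)            # 中
```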

Common Causes of Garbled Text

1. Encoding Mismatch

A file saved as UTF-8 but decoded as GBK (or the reverse) → garbled characters, because the same bytes map to different characters in each encoding.
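
The mismatch is easy to reproduce. The sketch below encodes a sample string as UTF-8 and then decodes it with the wrong codec; errors="replace" is used as a precaution in case a byte pair is not valid GBK. The sample text is an assumption made for illustration.

```python
text = "中文"
utf8_bytes = text.encode("utf-8")

# Wrong: the bytes are UTF-8, but they are decoded as GBK.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Right: decode with the encoding the bytes were written in.
restored = utf8_bytes.decode("utf-8")

print(garbled)     # unreadable characters, not the original text
print(restored)    # 中文
```

The bytes themselves are never damaged here; only the interpretation is wrong, so decoding again with the correct encoding recovers the text.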

2. BOM Issues

A UTF-8 file may start with a byte order mark (BOM): the bytes EF BB BF. Programs that don't recognize it and decode the file as Latin-1 or Windows-1252 display ï»¿ at the beginning.
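
Here is a short sketch of the three outcomes for a BOM-prefixed file, using Python's standard codecs. The payload "hello" is illustrative.

```python
import codecs

# Simulate the contents of a file saved as "UTF-8 with BOM".
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

as_latin1 = data.decode("latin-1")      # BOM bytes become three visible characters
as_utf8 = data.decode("utf-8")          # BOM survives as the invisible U+FEFF
as_utf8_sig = data.decode("utf-8-sig")  # BOM is detected and stripped

print(repr(as_latin1))     # 'ï»¿hello'
print(repr(as_utf8))       # '\ufeffhello'
print(repr(as_utf8_sig))   # 'hello'
```

When reading files of unknown origin in Python, utf-8-sig is a reasonable choice because it accepts UTF-8 both with and without a BOM.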

3. MySQL Character Set

MySQL's utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so 4-byte characters, including most emoji, cannot be stored
utf8mb4 supports the full 1-4 byte range of UTF-8; always use this
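
To see which strings would be affected by the 3-byte limit, one option is to check for code points above U+FFFF before writing to the database. The helper below is hypothetical and written only for illustration; it does not talk to MySQL.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Code points up to U+FFFF encode to 1-3 bytes; anything above needs 4.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_utf8mb3("中文"))       # True
print(fits_in_utf8mb3("hi 😀"))     # False
```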

Why Base64 Fails on Chinese Text

JavaScript's btoa() only accepts characters in the Latin-1 range (code points 0-255). Chinese characters fall outside this range, so btoa() throws an InvalidCharacterError.

Fix: Use TextEncoder to convert to UTF-8 bytes first, then encode to Base64. See our Base64 guide for details.
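
The same principle applies in any language: Base64 works on bytes, so the text has to be converted to UTF-8 bytes first. The sketch below shows the idea in Python as an analogy; it is not the browser API, and the Latin-1 check only mimics the restriction that btoa() imposes.

```python
import base64

text = "中"

# Analogous to btoa(): Latin-1 cannot represent this character.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# The fix: encode to UTF-8 bytes first, then Base64-encode the bytes.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(encoded)    # 5Lit

# Decoding reverses both steps.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)    # 中
```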

Best Practices

Always use UTF-8 everywhere: files, databases, HTTP headers
HTML declaration: <meta charset="UTF-8"> as first element in <head>
HTTP header: Content-Type: text/html; charset=utf-8
Database: Use utf8mb4, not utf8
Editor: Default to UTF-8 without BOM
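
In application code, the "UTF-8 everywhere" rule mostly means never relying on a platform default. As a minimal sketch, assuming a throwaway temporary file, this is how it looks with Python's open():

```python
import os
import tempfile

# Hypothetical file path used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
content = "中文 and emoji 😀"

# Always pass encoding explicitly instead of relying on the OS default.
with open(path, "w", encoding="utf-8") as f:
    f.write(content)

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# The plain "utf-8" codec writes no BOM at the start of the file.
with open(path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"

print(round_trip == content, has_bom)   # True False
```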

Use ToolNest Base64 tool for UTF-8-safe encoding and decoding.
