Few things are more frustrating than seeing garbled characters where your carefully crafted content should be. Those mysterious question marks, strange symbols like "’", or empty boxes are almost always encoding issues. This guide will help you understand and prevent them.
The Encoding Fundamentals
At its core, text encoding is the mapping between human-readable characters and the bytes computers store. Different encodings use different mappings, which is why mixing them causes problems.
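To make that mapping concrete, here is a minimal Python sketch that encodes one character under three encodings and then decodes it with the wrong one. The sample character is chosen purely for illustration.

```python
# The same character maps to different bytes under different encodings.
char = "é"

utf8_bytes = char.encode("utf-8")        # two bytes in UTF-8
latin1_bytes = char.encode("latin-1")    # one byte in ISO-8859-1
utf16_bytes = char.encode("utf-16-le")   # two bytes in UTF-16 (little-endian, no BOM)

print(utf8_bytes.hex(), latin1_bytes.hex(), utf16_bytes.hex())  # c3a9 e9 e900

# Decoding bytes with a different mapping than the one that produced them
# gives the wrong characters instead of an error.
wrong = utf8_bytes.decode("latin-1")
print(wrong)  # Ã©
```

The last line is the kind of mix-up the rest of this guide is about: the bytes are intact, but they are read through the wrong mapping.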
The Evolution
- ASCII (1963): 128 characters, English only. Still foundational—consult an ASCII table when working with legacy systems.
- Extended ASCII / ISO-8859: 256 characters, added European languages
- Unicode (1991+): 149,000+ characters, all languages
- UTF-8 (1993): Variable-width Unicode encoding, the web standard
UTF-8: The Modern Standard
UTF-8 should be your default choice for web development. It's:
- Backward compatible with ASCII
- Space-efficient for English text
- Capable of encoding any Unicode character
- The encoding of 98%+ of web pages
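A short Python sketch can illustrate the first three points. The sample characters are arbitrary picks, one from each UTF-8 width class.

```python
# UTF-8 is variable-width: each character takes between 1 and 4 bytes.
samples = ["A", "é", "中", "🎉"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(widths)  # {'A': 1, 'é': 2, '中': 3, '🎉': 4}

# Backward compatibility: pure ASCII text has identical bytes in both encodings.
ascii_text = "Hello"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # True
```

English text stays at one byte per character, which is where the space efficiency comes from.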
Always declare it in your HTML:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<!-- The charset declaration must appear within the first 1024 bytes of the document -->
Common Encoding Scenarios
URL Encoding (Percent Encoding)
URLs can only contain a limited set of ASCII characters. Everything else, including spaces and non-ASCII characters, must be percent-encoded:
Original: Hello World! £100
Encoded: Hello%20World%21%20%C2%A3100
Space → %20
! → %21
£ → %C2%A3 (UTF-8 bytes)
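In JavaScript (note that encodeURIComponent leaves "!" unescaped, which is also valid in a URL component):
const encoded = encodeURIComponent("Hello World! £100"); // "Hello%20World!%20%C2%A3100"
const decoded = decodeURIComponent(encoded); // "Hello World! £100"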
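The same round trip can be sketched in Python with the standard library's urllib.parse. Passing safe="" is a choice made here so that every reserved character, including "/" and "!", gets encoded.

```python
from urllib.parse import quote, unquote

original = "Hello World! £100"

# safe="" encodes everything except letters, digits and "_.-~".
encoded = quote(original, safe="")
print(encoded)  # Hello%20World%21%20%C2%A3100

# unquote decodes percent sequences as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)  # Hello World! £100
```

With the default safe="/", slashes would be left alone, which suits full paths but not individual query values.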
Base64 Encoding
Base64 converts binary data to ASCII text, useful for:
- Embedding images in CSS/HTML
- Sending binary data in JSON
- Email attachments (MIME)
// JavaScript (btoa only accepts characters up to U+00FF; convert other text to UTF-8 bytes first)
const encoded = btoa("Hello World"); // "SGVsbG8gV29ybGQ="
const decoded = atob(encoded); // "Hello World"
# Python
import base64
encoded = base64.b64encode(b"Hello World")
decoded = base64.b64decode(encoded)
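Base64 works on bytes, so non-ASCII text has to be encoded first. The sketch below shows one way to do that and to carry the result in JSON. The helper names and the "data" key are made up for this example.

```python
import base64
import json


def to_base64_text(text: str) -> str:
    """Encode text as UTF-8 bytes, then as a Base64 ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_base64_text(encoded: str) -> str:
    """Reverse of to_base64_text: Base64 string -> UTF-8 bytes -> text."""
    return base64.b64decode(encoded).decode("utf-8")


token = to_base64_text("café")
print(token)  # Y2Fmw6k=

# The Base64 string is plain ASCII, so it is safe inside a JSON document.
payload = json.dumps({"data": token})
restored = from_base64_text(json.loads(payload)["data"])
print(restored)  # café
```

Both sides must agree on the text encoding used before Base64 is applied. UTF-8 is assumed here.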
HTML Entity Encoding
HTML entities represent characters that have special meaning in HTML:
< → &lt;
> → &gt;
& → &amp;
" → &quot;
' → &#39;
Escape these characters whenever you insert user-supplied content into HTML. This helps prevent XSS attacks.
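In Python, the standard library's html.escape performs this substitution. The render_comment helper below is a hypothetical name used only for illustration. Real templating engines usually escape for you.

```python
import html


def render_comment(user_text: str) -> str:
    """Wrap user text in a paragraph, escaping HTML special characters."""
    return f"<p>{html.escape(user_text)}</p>"


safe = render_comment('<script>alert("hi")</script>')
print(safe)  # <p>&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;</p>
```

html.escape also escapes both quote characters by default, which matters when the value ends up inside an attribute.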
Debugging Encoding Issues
The "Mojibake" Problem
When you see "cafÃ©" instead of "café", that's mojibake: UTF-8 bytes interpreted as Latin-1 (or the closely related Windows-1252). Common causes:
- Database connection not set to UTF-8
- File saved in wrong encoding
- Missing charset declaration
- Copy-pasting from Word (curly quotes)
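The UTF-8-read-as-Latin-1 failure can be reproduced, and often reversed, in a few lines of Python. This is a minimal sketch with hypothetical helper names. It only handles the single-step case, not text that has been mangled several times over.

```python
def make_mojibake(text: str) -> str:
    """Simulate the bug: UTF-8 bytes decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Try to undo the bug. Return the input unchanged if it does not apply."""
    try:
        return garbled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


garbled = make_mojibake("café")
print(garbled)                   # cafÃ©
print(repair_mojibake(garbled))  # café
print(repair_mojibake("café"))   # café (already correct, left alone)
```

A repair like this is a stopgap. The lasting fix is to correct whichever layer applied the wrong encoding.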
Diagnostic Steps
# Check file encoding (Linux; on macOS use file -I instead)
$ file -i document.txt
document.txt: text/plain; charset=utf-8
# Convert encoding
$ iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
# Hex dump to see actual bytes
$ xxd document.txt | head
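When those command-line tools are not available, a rough equivalent can be sketched in Python. The function names are invented for this example. It checks whether some bytes are valid UTF-8 and prints them in hex.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first invalid UTF-8 byte, or None if valid."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc.start


def hex_dump(data: bytes) -> str:
    """Space-separated hex bytes, similar in spirit to xxd output."""
    return " ".join(f"{byte:02x}" for byte in data)


good = "café".encode("utf-8")
bad = "café".encode("latin-1")

print(hex_dump(good), first_invalid_utf8_offset(good))  # 63 61 66 c3 a9 None
print(hex_dump(bad), first_invalid_utf8_offset(bad))    # 63 61 66 e9 3
```

Valid UTF-8 is a strong hint but not proof: pure ASCII files pass the check under almost any encoding.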
Database Checklist
-- MySQL: Check current settings
SHOW VARIABLES LIKE 'character_set%';
-- Set connection encoding
SET NAMES 'utf8mb4';
-- Create table with proper encoding
CREATE TABLE posts (
content TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
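The reason for utf8mb4 rather than MySQL's older utf8 character set (an alias for utf8mb3) is that the older one stores at most three bytes per character. The Python sketch below, with a hypothetical helper name, finds the characters in a string that would not fit.

```python
def chars_needing_utf8mb4(text: str) -> list:
    """Return the characters whose UTF-8 encoding takes four bytes."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


found = chars_needing_utf8mb4("Launch day 🎉 café 中")
print(found)  # ['🎉']
```

Emoji are the usual culprits, but any character outside the Basic Multilingual Plane is affected.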
Language-Specific Tips
Python
# Python 3 strings are Unicode by default
text = "café" # Works fine
# Reading files
with open('file.txt', 'r', encoding='utf-8') as f:
content = f.read()
# HTTP responses (e.g. with the requests library): set the encoding before reading response.text
response.encoding = 'utf-8'
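The errors parameter of open() controls what happens when a file does not match the encoding you asked for. The sketch below writes a Latin-1 file to a temporary directory and reads it back as UTF-8 in two ways. The file name is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "legacy.txt")

# Simulate a legacy file: "café" saved as Latin-1, not UTF-8.
with open(path, "wb") as f:
    f.write("café".encode("latin-1"))

# Strict decoding (the default) fails loudly on the invalid byte.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for each undecodable byte instead.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    lenient = f.read()

print(strict_failed)  # True
print(lenient)        # caf followed by the replacement character U+FFFD
```

Failing loudly is usually the better default, because replacement characters silently destroy the original data.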
JavaScript
// Fetch with encoding
const response = await fetch(url);
const text = await response.text(); // Always decoded as UTF-8
// Explicit TextDecoder for raw bytes (arrayBuffer is assumed to hold them already)
const decoder = new TextDecoder('utf-8');
const decodedText = decoder.decode(arrayBuffer);
PHP
// Set default encoding
mb_internal_encoding('UTF-8');
// String functions (use mb_ versions)
$length = mb_strlen($string);
$upper = mb_strtoupper($string);
Best Practices Summary
- Use UTF-8 everywhere: Files, databases, HTTP headers, HTML
- Declare encoding early: First element in <head>
- Be explicit: Never assume encoding, always specify
- Test with special characters: Include é, £, 中, 🎉 in test data
- Keep references handy: Bookmark tools like ascii.co.uk for quick character lookups
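The testing advice can be turned into a small round-trip check. This is a minimal Python sketch. The round_trips helper is hypothetical, and in a real test suite the encode and decode steps would be replaced by whatever your stack does (database write and read, HTTP request and response, and so on).

```python
# One character from each UTF-8 width class above ASCII.
TEST_STRING = "é £ 中 🎉"


def round_trips(text: str, encoding: str = "utf-8") -> bool:
    """Return True if text survives an encode/decode cycle unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


print(round_trips(TEST_STRING))             # True
print(round_trips(TEST_STRING, "latin-1"))  # False
print(round_trips(TEST_STRING, "ascii"))    # False
```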
Understanding encoding transforms mysterious bugs into straightforward fixes. When in doubt, the answer is almost always: use UTF-8 and be explicit about it everywhere in your stack.