The Complete Guide to Text Encoding for Web Developers

Few things are more frustrating than seeing garbled characters where your carefully crafted content should be. Those mysterious question marks, strange symbols like "â€™", or empty boxes are almost always encoding issues. This guide will help you understand and prevent them.

The Encoding Fundamentals

At its core, text encoding is the mapping between human-readable characters and the bytes computers store. Different encodings use different mappings, which is why mixing them causes problems.
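To make the mapping concrete, here is a minimal Python sketch (the sample character is an arbitrary choice) showing that one character becomes different bytes under different encodings, and that reading those bytes with a different encoding changes the text.

```python
# One character, three encodings, three different byte sequences.
char = "é"

utf8_bytes = char.encode("utf-8")        # two bytes
latin1_bytes = char.encode("latin-1")    # one byte
utf16_bytes = char.encode("utf-16-le")   # two bytes, different values

print(utf8_bytes.hex(" "))    # c3 a9
print(latin1_bytes.hex(" "))  # e9
print(utf16_bytes.hex(" "))   # e9 00

# Decoding with a different encoding than the one used to encode
# yields different text from the same bytes.
misread = utf8_bytes.decode("latin-1")
print(misread)  # Ã©
```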

The Evolution

ASCII (1963): 128 characters, English only. Still foundational—consult an ASCII table when working with legacy systems.
Extended ASCII / ISO-8859: 256 characters per code page, with separate pages for Western European and other regional alphabets
Unicode (1991+): 149,000+ characters, covering virtually all writing systems in use
UTF-8 (1993): Variable-width Unicode encoding, the web standard

UTF-8: The Modern Standard

UTF-8 should be your default choice for web development. It's:

Backward compatible with ASCII
Space-efficient for English text
Capable of encoding any Unicode character
The encoding of 98%+ of web pages
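The first three properties can be checked directly. The following is a small Python sketch, with sample characters chosen only for illustration, that prints how many bytes UTF-8 uses per character and confirms that ASCII text encodes to identical bytes.

```python
samples = ["A", "é", "中", "🎉"]

# UTF-8 uses 1 to 4 bytes per character depending on the code point.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(widths)  # {'A': 1, 'é': 2, '中': 3, '🎉': 4}

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
ascii_text = "Hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```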

Always declare it in your HTML:

<!DOCTYPE html>
<html lang="en">
<head>
    <!-- Must appear within the first 1024 bytes of the document -->
    <meta charset="UTF-8">
</head>
</html>
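The meta tag covers the document itself; the server can state the same charset in the HTTP Content-Type header. Below is a minimal sketch assuming a Python WSGI setup. The function name `app` and the page content are placeholders, not part of any particular framework.

```python
# Hypothetical WSGI application; `app` and the page content are
# placeholders for illustration.
def app(environ, start_response):
    html = (
        '<!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"></head>'
        "<body>café</body></html>"
    )
    body = html.encode("utf-8")  # encode with the same charset we declare
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),  # length in bytes, not characters
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, one option is the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.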

Common Encoding Scenarios

URL Encoding (Percent Encoding)

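URLs may contain only a limited set of ASCII characters. Spaces, reserved characters, and all non-ASCII characters must be percent-encoded, with non-ASCII characters written as their UTF-8 bytes: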

Original: Hello World! £100
Encoded:  Hello%20World%21%20%C2%A3100

Space  → %20
!      → %21
£      → %C2%A3 (UTF-8 bytes)

In JavaScript:

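// Note: encodeURIComponent leaves ! ' ( ) * unescaped, so "!" is not turned into %21 here
const encoded = encodeURIComponent("Hello World! £100");  // "Hello%20World!%20%C2%A3100"
const decoded = decodeURIComponent(encoded);               // "Hello World! £100"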
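For comparison, a Python sketch of the same round trip using the standard library's `urllib.parse`. Passing `safe=""` is a choice made here so that every reserved character is escaped; with it, the output matches the encoded string shown earlier in this section.

```python
from urllib.parse import quote, unquote

original = "Hello World! £100"

# safe="" asks quote() to escape every reserved character, including "/".
encoded = quote(original, safe="")
print(encoded)  # Hello%20World%21%20%C2%A3100

decoded = unquote(encoded)
print(decoded == original)  # True
```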

Quick tip: For URL encoding reference, the URL encoding tool at ascii.co.uk can help debug encoding issues.

Base64 Encoding

Base64 converts binary data to ASCII text, useful for:

Embedding images in CSS/HTML
Sending binary data in JSON
Email attachments (MIME)

// JavaScript (btoa only accepts characters up to U+00FF; convert other text to UTF-8 bytes first)
const encoded = btoa("Hello World");  // "SGVsbG8gV29ybGQ="
const decoded = atob(encoded);         // "Hello World"

# Python
import base64
encoded = base64.b64encode(b"Hello World")  # b'SGVsbG8gV29ybGQ='
decoded = base64.b64decode(encoded)         # b'Hello World'

Warning: Base64 increases data size by ~33%. Don't use it for large files where you could serve binary directly.
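A short Python sketch of two points from this section: text has to become bytes (here via UTF-8) before it can be Base64-encoded, and the output is about a third larger than the input. The 300-byte payload is an arbitrary stand-in for binary data.

```python
import base64

# Text must be turned into bytes first; UTF-8 handles any character.
text = "café"
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(encoded)  # Y2Fmw6k=

decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == text)  # True

# Every 3 input bytes become 4 output characters.
payload = bytes(300)  # 300 zero bytes as stand-in binary data
encoded_size = len(base64.b64encode(payload))
print(encoded_size)  # 400
```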

HTML Entity Encoding

HTML entities represent characters that have special meaning in HTML:

<    →  &lt;
>    →  &gt;
&    →  &amp;
"     →  &quot;
'     →  &#39;

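Escape these characters whenever you insert user-supplied content into HTML; it is a primary defense against XSS attacks. Content placed inside attributes, scripts, or URLs needs escaping suited to that context.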
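In Python, the standard library's `html.escape` performs this substitution. A minimal sketch follows; the input string is an invented example of hostile user content.

```python
from html import escape

user_input = '<script>alert("hi")</script> & more'

# quote=True is the default, so both " and ' are escaped as well.
safe = escape(user_input)
print(safe)
# &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt; &amp; more

# Python emits the hexadecimal form &#x27; for the apostrophe,
# which is equivalent to the decimal &#39; listed above.
apostrophe = escape("it's")
print(apostrophe)  # it&#x27;s
```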

Debugging Encoding Issues

The "Mojibake" Problem

When you see "cafÃ©" instead of "café", that's mojibake: UTF-8 bytes interpreted as Latin-1 (or the closely related Windows-1252). Common causes:

Database connection not set to UTF-8
File saved in wrong encoding
Missing charset declaration
Copy-pasting from Word (curly quotes)
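The mechanism is easy to reproduce. This Python sketch simulates the misreading and then reverses it. The reversal only works when no bytes were lost or replaced along the way, which is an assumption that often fails with real data.

```python
original = "café"

# Simulate the bug: UTF-8 bytes read back as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reverse it: recover the original bytes, then decode them correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# The curly apostrophe (U+2019) read as Windows-1252 gives the
# sequence quoted in the introduction.
curly = "\u2019".encode("utf-8").decode("cp1252")
print(curly)  # â€™
```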

Diagnostic Steps

# Check file encoding (Linux; on macOS use file -I instead)
$ file -i document.txt
document.txt: text/plain; charset=utf-8

# Convert encoding
$ iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt

# Hex dump to see actual bytes
$ xxd document.txt | head
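When these tools are not available, a similar check can be scripted. Here is a rough Python sketch; `looks_like_utf8` is a hypothetical helper name. It only reports whether bytes decode cleanly as UTF-8 and cannot identify which other encoding was used.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_sample = "café".encode("utf-8")
latin1_sample = "café".encode("latin-1")

print(looks_like_utf8(utf8_sample))    # True
print(looks_like_utf8(latin1_sample))  # False

# Hex view of the raw bytes, similar in spirit to xxd.
print(utf8_sample.hex(" "))    # 63 61 66 c3 a9
print(latin1_sample.hex(" "))  # 63 61 66 e9

# Equivalent of the iconv command above, done in memory.
converted = latin1_sample.decode("iso-8859-1").encode("utf-8")
print(converted == utf8_sample)  # True
```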

Database Checklist

-- MySQL: Check current settings
SHOW VARIABLES LIKE 'character_set%';

-- Set connection encoding
SET NAMES 'utf8mb4';

-- Create table with proper encoding
CREATE TABLE posts (
    content TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
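The statements above use utf8mb4 rather than MySQL's older utf8 charset (an alias for utf8mb3) because the older one stores at most three bytes per character, so emoji and other characters above U+FFFF do not fit. A small Python sketch for spotting such text; the helper name `needs_utf8mb4` is hypothetical.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("café 中"))   # False
print(needs_utf8mb4("party 🎉"))  # True
```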

Language-Specific Tips

Python

# Python 3 strings are Unicode by default
text = "café"  # Works fine

# Reading files
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# HTTP responses (assumes `response` is a requests.Response object)
response.encoding = 'utf-8'  # set before reading response.text
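A slightly fuller sketch of the file-handling advice, using a throwaway temporary file. It shows that a mismatched encoding raises an error by default and that the `errors` parameter of `open` lets you choose a lossy fallback deliberately.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # throwaway location

# Write with an explicit encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write("café £100")

# Reading with the wrong encoding raises rather than silently corrupting...
try:
    with open(path, "r", encoding="ascii") as f:
        f.read()
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)

# ...unless an error handler is chosen deliberately.
with open(path, "r", encoding="ascii", errors="replace") as f:
    lossy = f.read()
print(lossy)  # each undecodable byte becomes U+FFFD
```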

JavaScript

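// Fetch: response.text() always decodes the body as UTF-8
const response = await fetch(url);
const text = await response.text();

// Explicit TextDecoder (pass another label, e.g. 'windows-1252', for non-UTF-8 data)
// A body can be read only once, so this uses a separate request
const buffer = await (await fetch(url)).arrayBuffer();
const decoder = new TextDecoder('utf-8');
const decodedText = decoder.decode(buffer);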

PHP

// Set default encoding
mb_internal_encoding('UTF-8');

// String functions (use mb_ versions)
$length = mb_strlen($string);
$upper = mb_strtoupper($string);

Best Practices Summary

Use UTF-8 everywhere: Files, databases, HTTP headers, HTML
Declare encoding early: First element in <head>
Be explicit: Never assume encoding, always specify
Test with special characters: Include é, £, 中, 🎉 in test data
Keep references handy: Bookmark tools like ascii.co.uk for quick character lookups
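As one way to act on the testing advice, the sketch below runs the suggested characters through the encodings covered in this guide and checks that each comes back unchanged. The helper name `round_trips` and the exact set of checks are illustrative choices, not a complete test suite.

```python
import base64
from urllib.parse import quote, unquote

test_strings = ["é", "£", "中", "🎉", "Hello World! £100"]

def round_trips(text: str) -> bool:
    """Check that text survives UTF-8, percent-encoding, and Base64 round trips."""
    via_utf8 = text.encode("utf-8").decode("utf-8")
    via_url = unquote(quote(text, safe=""))
    via_b64 = base64.b64decode(base64.b64encode(text.encode("utf-8"))).decode("utf-8")
    return via_utf8 == via_url == via_b64 == text

for s in test_strings:
    assert round_trips(s), s
print("all round trips OK")
```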

Understanding encoding transforms mysterious bugs into straightforward fixes. When in doubt, the answer is almost always: use UTF-8 and be explicit about it everywhere in your stack.