Regular expressions (regex) provide a declarative language for pattern matching, text searching, validation, and transformation. Supported natively in most modern programming languages and text editors, regex engines compile a concise pattern string into an internal matcher that is then run against the input text.
1. Introduction
A regular expression is a sequence of characters that defines a search pattern. While traditionally rooted in formal language theory (regular languages), modern implementations extend far beyond theoretical limits with features like backreferences, lookarounds, and recursive matching.
Most contemporary engines follow either PCRE (Perl-Compatible Regular Expressions) conventions or the ECMAScript specification. This article documents the shared core syntax, noting engine-specific variations where applicable.
2. Basic Metacharacters
Metacharacters carry special meaning within a regex pattern. To match them literally, they must be escaped with a backslash (\).
| Symbol | Meaning | Example |
|---|---|---|
| . | Any character except newline | c.t → cat, cot, cut |
| ^ | Start of string/line | ^Hello → Hello at beginning |
| $ | End of string/line | end$ → end at conclusion |
| \ | Escape character | \. → literal dot |
| \| | Alternation (OR) | cat\|dog → matches either |
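The table rows can be tried out with a minimal Python sketch. The sample strings and variable names are illustrative assumptions, not part of the original article.

```python
import re

# "." matches any single character except a newline.
dot_matches = re.findall(r"c.t", "cat cot cut ct")

# "^" and "$" anchor the match to the start and end of the string.
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
ends_with_end = re.search(r"end$", "the end") is not None

# A backslash makes the dot literal, so "3x14" is not matched.
literal_dot = re.findall(r"3\.14", "3.14 3x14")

# "|" matches either alternative.
pets = re.findall(r"cat|dog", "cat, dog, bird")

print(dot_matches)   # ['cat', 'cot', 'cut']
print(literal_dot)   # ['3.14']
print(pets)          # ['cat', 'dog']
```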
3. Character Classes & Ranges
Character classes match a single character from a specified set. Square brackets [] define custom classes, while shorthand sequences provide convenience.
- [a-zA-Z0-9] → alphanumeric
- [0-9] → digits only
- [^abc] → negation: any character except a, b, c
- \w \d \s → word, digit, whitespace
- \W \D \S → negated counterparts
\w behavior varies: in ASCII mode it matches [a-zA-Z0-9_], while Unicode-aware engines include accented characters and non-Latin scripts.
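The difference between ASCII and Unicode-aware \w can be seen in Python, where str patterns are Unicode-aware by default and the re.ASCII flag switches to the ASCII definition. This is a small sketch; the sample text is an assumption chosen to contain accented letters.

```python
import re

# "café_1 naïve", written with escapes so the accented letters are unambiguous.
sample = "caf\u00e9_1 na\u00efve"

# Default (Unicode-aware) matching: accented letters count as word characters.
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# A negated class matches any character that is not listed.
not_abc = re.findall(r"[^abc]", "aXbYc")

print(unicode_words)  # ['café_1', 'naïve']
print(ascii_words)    # ['caf', '_1', 'na', 've']
print(not_abc)        # ['X', 'Y']
```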
4. Quantifiers & Repetition
Quantifiers specify how many times the preceding token should repeat. By default, they are greedy (match as much as possible). Append ? to make them lazy (match as little as possible).
| Quantifier | Matches | Lazy Variant |
|---|---|---|
| * | 0 or more | *? |
| + | 1 or more | +? |
| ? | 0 or 1 | ?? |
| {n} | exactly n | n/a |
| {n,} | n or more | {n,}? |
| {n,m} | between n and m | {n,m}? |
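The practical difference between greedy and lazy quantifiers shows up when a delimiter occurs more than once. The following Python sketch uses a made-up HTML fragment and digit string as input.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last "</b>", swallowing everything in between.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first "</b>" that lets the match succeed.
lazy = re.findall(r"<b>.*?</b>", html)

# Bounded repetition: greedy takes up to 3 digits, lazy settles for 2.
bounded_greedy = re.findall(r"\d{2,3}", "1 22 333 4444")
bounded_lazy = re.findall(r"\d{2,3}?", "1 22 333 4444")

print(greedy)          # ['<b>one</b> and <b>two</b>']
print(lazy)            # ['<b>one</b>', '<b>two</b>']
print(bounded_greedy)  # ['22', '333', '444']
print(bounded_lazy)    # ['22', '33', '44', '44']
```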
5. Anchors & Word Boundaries
Anchors match positions rather than characters. They are zero-width assertions that constrain where a pattern can match.
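- \b → word boundary (between \w and \W)
- \B → non-word boundary
- \A → absolute start of string (not available in ECMAScript)
- \Z → absolute end of string (not available in ECMAScript; in PCRE, Java, and .NET it also matches before a final newline, and \z is the strict end)
- (?=X) → positive lookahead (X follows)
- (?!X) → negative lookahead (X does not follow)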
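Because anchors match positions, their effect is easiest to see by comparing what they allow and what they reject. This is a short Python sketch with illustrative input strings.

```python
import re

text = "cat concat cats"

# \b on both sides: only the standalone word "cat" matches.
whole_word = re.findall(r"\bcat\b", text)

# \B before "cat": only the occurrence inside "concat" matches (index 7).
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

lines = "first\nsecond"

# With re.MULTILINE, ^ matches at the start of every line...
line_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)

# ...while \A still matches only at the very start of the string.
string_start = re.findall(r"\A\w+", lines, flags=re.MULTILINE)

print(whole_word)    # ['cat']
print(inside_word)   # [7]
print(line_starts)   # ['first', 'second']
print(string_start)  # ['first']
```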
6. Groups & Capturing
Parentheses () group subpatterns and capture matched text for later reference. Non-capturing groups (?:...) group without capturing, which avoids unnecessary capture bookkeeping when the matched text isn't needed later.
const text = "2025-11-12";
const match = text.match(/(\d{4})-(\d{2})-(\d{2})/);
// match[1] → "2025", match[2] → "11", match[3] → "12"
// Non-capturing:
/(?:http|https):\/\//i
// Named groups (PCRE/JS ES2018+):
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
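For comparison, here is a sketch of the same grouping ideas in Python. Note that Python's re module spells named groups (?P<name>...) rather than (?<name>...). The doubled-word example and variable names are assumptions added for illustration.

```python
import re

# Named groups: Python uses the (?P<name>...) spelling.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = date_pattern.fullmatch("2025-11-12")
parts = match.groupdict()   # access by name
numbered = match.groups()   # the same captures by position

# Non-capturing group: groups the alternation but records nothing.
scheme = re.compile(r"(?:http|https)://", re.IGNORECASE)
scheme_match = scheme.match("HTTPS://example.com")

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine")

print(parts)                  # {'year': '2025', 'month': '11', 'day': '12'}
print(numbered)               # ('2025', '11', '12')
print(scheme_match.groups())  # ()
print(doubled.group(1))       # was
```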
7. Lookarounds & Assertions
Lookarounds allow conditional matching without consuming characters. They are essential for context-aware extraction.
| Assertion | Description | Example |
|---|---|---|
| (?=pattern) | Positive lookahead | \d+(?=px) → number before "px" |
| (?!pattern) | Negative lookahead | \b(?!foo)\w+\b → words not starting with "foo" |
| (?<=pattern) | Positive lookbehind | (?<=\$)\d+\.\d{2} → dollar amounts |
| (?<!pattern) | Negative lookbehind | (?<!\w)error → "error" at word start |
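Lookbehind support varies by engine: some accept only fixed-width patterns (Python's built-in re module, for example), while others (such as the third-party regex module for Python) offer more flexible variable-length support.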
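The four table examples can be checked with Python's built-in re module, since all of the lookbehinds used here are fixed-width. The input strings in this sketch are made up for illustration.

```python
import re

# Positive lookahead: digits only when "px" follows (the "px" is not consumed).
px_values = re.findall(r"\d+(?=px)", "width: 100px; height: 50em")

# Negative lookahead: whole words that do not start with "foo".
not_foo = re.findall(r"\b(?!foo)\w+\b", "foobar bar food baz")

# Positive lookbehind: amounts immediately preceded by "$".
prices = re.findall(r"(?<=\$)\d+\.\d{2}", "cost $19.99, weight 19.99kg")

# Negative lookbehind: "error" not preceded by a word character,
# so the "error" inside "terror" is skipped.
log = "error: terror and errors"
error_starts = [m.start() for m in re.finditer(r"(?<!\w)error", log)]

print(px_values)     # ['100']
print(not_foo)       # ['bar', 'baz']
print(prices)        # ['19.99']
print(error_starts)  # [0, 18]
```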
8. Practical Examples
Email Validation (Basic)
/^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$/
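Note: RFC 5322 allows far more complex email formats. This pattern trades strictness for readability: it rejects some valid addresses (for example, plus-addressed ones such as user+tag@example.com) and is best treated as a basic sanity check.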
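A minimal Python sketch of how the basic pattern might be wrapped in a helper. The function name looks_like_email and the test addresses are hypothetical. fullmatch is used here because in Python a bare $ also matches just before a trailing newline.

```python
import re

# Same pattern as above, compiled once for reuse.
EMAIL_RE = re.compile(r"^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$")


def looks_like_email(candidate: str) -> bool:
    """Return True if candidate passes the basic shape check (not RFC 5322)."""
    # fullmatch requires the whole string to match, including any trailing newline.
    return EMAIL_RE.fullmatch(candidate) is not None


print(looks_like_email("user@example.com"))       # True
print(looks_like_email("user@localhost"))         # False: no dot-separated TLD
print(looks_like_email("user+tag@example.com"))   # False: "+" is not in the class
```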
Extracting Hex Colors
import re
html = "<div style='color: #ff5733; background: #1a1a2e;'>"
colors = re.findall(r'#[0-9a-fA-F]{6}', html)
# ['#ff5733', '#1a1a2e']
Log Timestamp Parsing
/\[(?<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?<level>\w+): (?<msg>.*)/
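The log pattern can be sketched in Python as follows, using the (?P<name>...) spelling that the re module requires for named groups. The sample log line is an assumption made up for this example.

```python
import re

# Python port of the pattern above: (?<name>...) becomes (?P<name>...).
LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?P<level>\w+): (?P<msg>.*)"
)

# Hypothetical log line in the format the pattern expects.
line = "[2025-11-12 08:30:00] ERROR: disk quota exceeded"

m = LOG_RE.match(line)
fields = m.groupdict() if m else {}

print(fields["ts"])     # 2025-11-12 08:30:00
print(fields["level"])  # ERROR
print(fields["msg"])    # disk quota exceeded
```

Lines that do not fit the format make match return None, so the sketch falls back to an empty dict rather than raising.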
9. Best Practices & Performance
- Avoid catastrophic backtracking by anchoring patterns or using possessive quantifiers ++/*+ (PCRE/Java) or atomic groups (?>).
- Escape user input before injecting into regex to prevent ReDoS (Regular Expression Denial of Service) attacks.
- Precompile patterns in loops or high-frequency operations to leverage engine caching.
- Use non-capturing groups (?:) when you don't need extracted substrings.
- Test with edge cases: empty strings, overlapping matches, and locale-specific characters.
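Two of the practices above, escaping user input and precompiling, can be sketched in Python with re.escape and re.compile. The helper names count_literal and word_counts are hypothetical and exist only for this illustration.

```python
import re


def count_literal(needle: str, haystack: str) -> int:
    """Count occurrences of needle, treating it as literal text."""
    # re.escape neutralises metacharacters, so "." in user input stays a dot.
    pattern = re.compile(re.escape(needle))
    return len(pattern.findall(haystack))


# Compiled once at import time and reused on every call below.
WORD_RE = re.compile(r"\b\w+\b")


def word_counts(lines):
    """Return the number of words on each line."""
    return [len(WORD_RE.findall(line)) for line in lines]


print(count_literal("a.b", "a.b axb a.b"))       # 2
print(len(re.findall("a.b", "a.b axb a.b")))     # 3 (unescaped "." matches "x")
print(word_counts(["one two", "", "a b c"]))     # [2, 0, 3]
```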