Regular Expression Syntax

Regular expressions (regex) provide a declarative language for pattern matching, text searching, validation, and transformation. Most modern programming languages and text editors support them natively: the engine compiles a concise pattern string into an internal matcher, typically a backtracking program or a finite automaton.

1. Introduction

A regular expression is a sequence of characters that defines a search pattern. While traditionally rooted in formal language theory (regular languages), modern implementations extend far beyond theoretical limits with features like backreferences, lookarounds, and recursive matching.

Most contemporary engines follow either the Perl-derived syntax popularized by the PCRE (Perl-Compatible Regular Expressions) library or the syntax defined by the ECMAScript standard. This article documents the shared core syntax, noting engine-specific variations where applicable.

2. Basic Metacharacters

Metacharacters carry special meaning within a regex pattern. To match one literally, escape it with a backslash (\).

Symbol	Meaning	Example
`.`	Any character except newline	`c.t` → cat, cot, cut
`^`	Start of string (or of each line in multiline mode)	`^Hello` → "Hello" at the beginning
`$`	End of string (or of each line in multiline mode)	`end$` → "end" at the very end
`\`	Escape character	`\.` → literal dot
`|`	Alternation (OR)	`cat|dog` → matches either
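
The table rows can be checked with a short Python sketch. It assumes Python's built-in re module; the sample strings and variable names are made up for illustration.

```python
import re

# "." matches any single character except a newline, so "c\nt" is skipped.
dot_matches = re.findall(r"c.t", "cat cot cut c\nt")

# An escaped dot matches only a literal ".".
literal_dot = re.findall(r"\.", "v1.2.3")

# By default "^" and "$" anchor to the start and end of the whole string.
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
ends_with_end = re.search(r"end$", "the end") is not None

# With re.MULTILINE, "^" also matches at the start of every line.
first_words_default = re.findall(r"^\w+", "one\ntwo")
first_words_multiline = re.findall(r"^\w+", "one\ntwo", flags=re.MULTILINE)

# "|" matches either alternative.
pets = re.findall(r"cat|dog", "cat, dog, bird")

print(dot_matches, literal_dot, pets)
print(first_words_default, first_words_multiline)
```

Other engines expose the multiline switch differently (for example the m flag in JavaScript), but the anchoring behavior is the same idea.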

3. Character Classes & Ranges

Character classes match a single character from a specified set. Square brackets [] define custom classes, while shorthand sequences provide convenience.

Regex

[a-zA-Z0-9]     → alphanumeric
[0-9]          → digits only
[^abc]         → negation: any except a, b, c
\w \d \s       → word, digit, whitespace
\W \D \S       → negated counterparts

💡 Engine Note

\w behavior varies: in ASCII mode it matches [a-zA-Z0-9_], while Unicode-aware engines include accented characters and non-Latin scripts.
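
As a concrete illustration of this note, the sketch below uses Python 3, where string patterns are Unicode-aware by default and the re.ASCII flag restricts \w to [a-zA-Z0-9_]. The sample text is made up.

```python
import re

# "naïve café_42", written with escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9_42"

# Default (Unicode-aware) mode: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# ASCII mode: accented letters split the words apart.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```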

4. Quantifiers & Repetition

Quantifiers specify how many times the preceding token should repeat. By default, they are greedy (match as much as possible). Append ? to make them lazy (match as little as possible).

Quantifier	Matches	Lazy Variant
`*`	0 or more	`*?`
`+`	1 or more	`+?`
`?`	0 or 1	`??`
`{n}`	exactly n	n/a
`{n,}`	n or more	`{n,}?`
`{n,m}`	between n and m	`{n,m}?`
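
The difference between greedy and lazy matching is easiest to see side by side. The following is a minimal sketch using Python's re module; the sample strings are illustrative only.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" runs to the last ">" in the string.
greedy = re.findall(r"<.+>", html)

# Lazy: ".+?" stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", html)

numbers = "1 22 333 4444"

# Greedy {2,3} takes three digits when it can.
bounded_greedy = re.findall(r"\d{2,3}", numbers)

# Lazy {2,3}? settles for two digits.
bounded_lazy = re.findall(r"\d{2,3}?", numbers)

print(greedy)
print(lazy)
print(bounded_greedy, bounded_lazy)
```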

5. Anchors & Word Boundaries

Anchors match positions rather than characters. They are zero-width assertions that constrain where a pattern can match.

Regex

\b       → word boundary (between \w and \W)
\B       → non-word boundary
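\A       → absolute start of string (not available in JavaScript)
\Z       → end of string; in PCRE, Java, and .NET it also matches just before a final newline
\z       → absolute end of string (PCRE, Java, .NET); Python uses \Z for this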
(?=X)    → positive lookahead (X follows)
(?!X)    → negative lookahead (X does not follow)
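
The sketch below shows how these position assertions behave in Python's re module specifically. The \Z behavior shown is Python's and differs in other engines; all sample strings are made up.

```python
import re

# \b keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# \B requires that there is no word boundary before "cat".
inside_words = re.findall(r"\Bcat", "cat concat")

text = "first\nsecond"

# In multiline mode "^" matches at every line start, but \A does not.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# In Python, "$" tolerates one trailing newline; \Z does not.
dollar_match = re.search(r"end$", "the end\n") is not None
strict_end_match = re.search(r"end\Z", "the end\n") is not None

print(whole_words, inside_words)
print(line_starts, string_start)
print(dollar_match, strict_end_match)
```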

6. Groups & Capturing

Parentheses () group subpatterns and capture the matched text for later reference. Non-capturing groups (?:...) group without capturing, which avoids the small cost of storing text you do not need and leaves the numbering of the remaining groups unchanged.

JavaScript

const text = "2025-11-12";
const match = text.match(/(\d{4})-(\d{2})-(\d{2})/);
// match[1] → "2025", match[2] → "11", match[3] → "12"

// Non-capturing:
/(?:http|https):\/\//i

// Named groups (PCRE/JS ES2018+):
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
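
For comparison, here is a sketch of the same ideas in Python. Note that Python's built-in re module spells named groups (?P<name>...) rather than (?<name>...). The sample strings are illustrative.

```python
import re

# Named groups: Python uses the (?P<name>...) spelling.
date_match = re.fullmatch(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2025-11-12"
)
date_parts = date_match.groupdict()

# Non-capturing group: findall returns only the one capturing group.
hosts = re.findall(
    r"(?:http|https)://(\S+)",
    "see https://example.com and http://example.org",
)

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine")
doubled_word = doubled.group(1)

print(date_parts)
print(hosts)
print(doubled_word)
```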

7. Lookarounds & Assertions

Lookarounds allow conditional matching without consuming characters. They are essential for context-aware extraction.

Assertion	Description	Example
`(?=pattern)`	Positive lookahead	`\d+(?=px)` → number before "px"
`(?!pattern)`	Negative lookahead	`\b(?!foo)\w+\b` → words not starting with "foo"
`(?<=pattern)`	Positive lookbehind	`(?<=\$)\d+\.\d{2}` → dollar amounts
`(?<!pattern)`	Negative lookbehind	`(?<!\w)error` → "error" at word start

⚠️ Compatibility

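Lookbehind assertions were not supported in JavaScript until ES2018. Many engines restrict lookbehind to fixed-length patterns: Python's built-in re module requires a fixed width, and PCRE has traditionally required each alternative to have a fixed length. JavaScript (ES2018+), .NET, and Python's third-party regex module accept variable-length lookbehind.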
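
The four table examples can be run as-is in Python. The sketch below uses the built-in re module with made-up sample strings, and also shows that re rejects a variable-width lookbehind at compile time.

```python
import re

# Positive lookahead: digits only when followed by "px".
px_values = re.findall(r"\d+(?=px)", "10px 2em 300px")

# Negative lookahead: whole words that do not start with "foo".
not_foo = re.findall(r"\b(?!foo)\w+\b", "foo foobar bar food baz")

# Positive lookbehind: amounts directly preceded by "$".
dollars = re.findall(r"(?<=\$)\d+\.\d{2}", "cost $19.99, was 24.50 or $5.00")

# Negative lookbehind: "error" not preceded by a word character.
errors = re.findall(r"(?<!\w)error", "error: an_error, terror, error")

# The built-in re module only accepts fixed-width lookbehind.
try:
    re.compile(r"(?<=\$+)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(px_values, not_foo, dollars, errors, variable_width_ok)
```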

8. Practical Examples

Email Validation (Basic)

Regex

/^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$/

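Note: RFC 5322 allows far more complex email formats. This pattern is a readable sanity check rather than a full validator: it rejects some valid addresses (for example those containing "+") and accepts some invalid ones (for example consecutive dots).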
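
One way to apply this pattern in Python is sketched below. The function name is_plausible_email is hypothetical, and two choices are assumptions of this sketch rather than part of the pattern above: re.fullmatch replaces the ^ and $ anchors (in Python, $ would also accept a trailing newline), and re.ASCII keeps \w from matching non-ASCII letters.

```python
import re

# Same character classes as the pattern above; fullmatch supplies the anchoring.
EMAIL_PATTERN = re.compile(r"[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}", flags=re.ASCII)


def is_plausible_email(candidate: str) -> bool:
    """Return True if the whole string has the rough shape of an email address."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None


print(is_plausible_email("user.name@example.com"))
print(is_plausible_email("user@localhost"))
print(is_plausible_email("user+tag@example.com"))
```

The last call shows a valid address that this basic pattern rejects, because "+" is not in the local-part character class.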

Extracting Hex Colors

Python

import re
html = "<div style='color: #ff5733; background: #1a1a2e;'>"
colors = re.findall(r'#[0-9a-fA-F]{6}', html)
# ['#ff5733', '#1a1a2e']

Log Timestamp Parsing

Regex

/\[(?<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?<level>\w+): (?<msg>.*)/
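
A Python version of this parser might look like the sketch below. It assumes the (?P<name>...) group syntax required by Python's re module. The function name parse_log_line and the sample log line are hypothetical.

```python
import re
from datetime import datetime

# Same pattern as above, with Python's named-group spelling.
LOG_LINE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?P<level>\w+): (?P<msg>.*)"
)


def parse_log_line(line: str):
    """Return a dict with ts (datetime), level, and msg, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S")
    return fields


entry = parse_log_line("[2025-11-12 08:15:30] ERROR: Disk quota exceeded")
print(entry)
```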

9. Best Practices & Performance

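Avoid catastrophic backtracking: do not nest quantifiers over overlapping subpatterns such as (a+)+, anchor patterns where possible to limit the starting positions tried, and where the engine supports them use possessive quantifiers ++/*+ (PCRE/Java) or atomic groups (?>...).
Escape user input before inserting it into a pattern so that it is matched literally; unescaped input allows regex injection, including ReDoS (Regular Expression Denial of Service) attacks.
Precompile patterns used in loops or high-frequency operations instead of rebuilding them on every iteration.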
Use non-capturing groups (?:) when you don't need extracted substrings.
Test with edge cases: empty strings, overlapping matches, and locale-specific characters.
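
Two of these practices, escaping input and precompiling, can be sketched in Python as follows. The helper names count_literal and word_counts are hypothetical; re.escape and re.compile are the standard library calls.

```python
import re


def count_literal(needle: str, haystack: str) -> int:
    """Count occurrences of needle as literal text, not as a pattern."""
    pattern = re.compile(re.escape(needle))
    return len(pattern.findall(haystack))


# Unescaped, the "." in "a.b" is a wildcard and also matches "axb".
unescaped_hits = len(re.findall("a.b", "a.b axb a.b"))
escaped_hits = count_literal("a.b", "a.b axb a.b")

# Compile once at module level, then reuse the object inside the loop.
WORD = re.compile(r"\b\w+\b")


def word_counts(lines):
    """Return the number of words on each line using the precompiled pattern."""
    return [len(WORD.findall(line)) for line in lines]


print(unescaped_hits, escaped_hits)
print(word_counts(["one two", "", "three"]))
```

Escaping makes the input literal, but it does not protect a pattern that is itself prone to backtracking, so the first point in the list still applies.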

