Pratt Parsing
The definitive guide
Table of Contents
Hundreds of Pratt parsing posts exist. I hope this one is relatively clear, concise, and comprehensive.
Learn how Pratt parsers work by writing one. I assume familiarity with grammars, parsers and TypeScript.
The code is public domain. Send feedback by email.
Motivation
This is an intuitive expression grammar.
expr = expr ("+" | "-" | "*" | "/" | "^") expr
| "-" expr
| number
The problem is it's ambiguous. 1 + 2 * 3
can be parsed as (1 + 2) * 3
or
1 + (2 * 3)
.
The textbook way to resolve ambiguity is to convolute the grammar with associativity and precedence rules.
expr = expr ("+" | "-") term
| term
term = term ("*" | "/") factor
| factor
factor = factor "^" number
| "-" factor
| number
This grammar is not intuitive. It gets worse if your parser cannot handle left recursion. And worse as you add more operators.
The resulting syntax trees are cumbersome.
Overview
Pratt parsing uses grammars of this form.
expr = head tail*
head = number
| "-" expr
tail = ("+" | "-" | "*" | "/" | "^") expr
And produces syntax trees of this form.
expr = expr ("+" | "-" | "*" | "/" | "^") expr
| "-" expr
| number
The parsing routine applies precedence and associativity rules to resolve grammar ambiguity.
Pratt parsers are easy to understand, implement, and integrate.
We'll build a series of gradually better parsers to explain how it works.
Precedence and Associativity
Left-Associative Parser
Start simple. Assume all operators are left-associative and without precedence.
Parse 1 + 2 + 3 * 4 * 5
as ((((1 + 2) + 3) * 4) * 5)
.
The expression grammar supports this with a repeated tail that immediately applies the operator to expand the left expression.
expr = number (tail_op number)*
| | | |
└head┘ └──────tail─────┘
The implementation reflects the grammar.
function expr(ctx) {
let left_expr = number(next_token(ctx));
while (has_token(ctx)) {
const op = tail_op(next_token(ctx));
const right_expr = number(next_token(ctx));
left_expr = infix(op, left_expr, right_expr);
}
return left_expr;
}
Right-Associative Parser
Now assume all operators are right-associative and without precedence
Parse 1 + 2 + 3 * 4 * 5
as (1 + (2 + (3 * (4 * 5))))
.
The expression grammar supports this with a right-recursive tail that expands the expression to the right before applying the operator.
expr = number (tail_op expr)?
The implementation reflects the grammar.
function expr(ctx) {
let left_expr = number(next_token(ctx));
if (has_token(ctx)) {
const op = tail_op(next_token(ctx));
const right_expr = expr(ctx);
left_expr = infix(op, left_expr, right_expr);
}
return left_expr;
}
Mixed-Associative Grammar
Sometimes the parser should apply an operator immediately as in the first parser. Other times it should wait until later operators are applied as in the second parser.
Merge the grammars to create an ambiguous grammar that enables this choice:
expr = number (tail_op number)* // Parser 1: left-associative
expr = number (tail_op expr )? // Parser 2: right-associative
expr = number (tail_op expr )* // Merged: mixed-associative
Take the second parser and rewrite if
to while
.
function expr(ctx) {
let left_expr = number(next_token(ctx));
while (has_token(ctx)) {
const op = tail_op(next_token(ctx));
const right_expr = expr(ctx);
left_expr = infix(op, left_expr, right_expr);
}
return left_expr;
}
This parser resolves the grammar's ambiguity by always choosing to recurse, which makes it indistinguishable from the second parser.
Respecting Associativity
From experience we know right-associative operators should recurse.
Left-associative operators should instead return the current expression to become part of the parent expression.
function expr(ctx, parent_op) {
let left_expr = number(next_token(ctx));
while (has_token(ctx)) {
const op = tail_op(peek_token(ctx));
if (assoc(op) === "left") break;
next_token(ctx);
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
return left_expr;
}
Now the parser respects associativity but not precedence.
Respecting Precedence
Pratt parsing uses nested levels of recursion to represent nested expressions of equal or increasing precedence.
By definition, a level exits when an operator with lower precedence appears.
By definition, associativity applies when operators have equal precedence.
The parent operator determines a level's precedence. The root expression has no parent, which means minimum precedence.
// Compares the precedence of two operators
function cmp_precedence(op1, op2): "<" | "=" | ">";
function expr(ctx, parent_op) {
let left_expr = number(next_token(ctx));
while (has_token(ctx)) {
const op = tail_op(peek_token(ctx));
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=" && assoc(op) === "left") break;
next_token(ctx);
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
return left_expr;
}
expr(ctx, Op.Root); // Expected syntax for parsing an expression.
This parser respects precedence and associativity.
Adding Operators
Head and Tail Parsing
To simplify future extensions, split the grammar into head and tail rules.
expr = expr_head expr_tail*
expr_head = number
expr_tail = tail_op expr
Refactor the parser accordingly.
function expr(ctx, parent_op) {
const left_expr = expr_head(ctx);
return expr_tail(ctx, parent_op, left_expr);
}
function expr_head(ctx) {
return number(next_token(ctx));
}
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
const op = tail_op(peek_token(ctx));
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=" && assoc(op) === "left") break;
next_token(ctx);
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
return left_expr;
}
Prefix Operators
Extend expr_head
to support prefix operators.
expr_head = number
| prefix_op expr
Operators like -
can have overloaded infix and prefix definitions because
prefix_op
and tail_op
don't conflict in the grammar.
function expr_head(ctx) {
const token = next_token(ctx);
if (is_number_token(token)) {
return number(token);
} else {
const op = prefix_op(token);
const right_expr = expr(ctx, op);
return prefix(op, right_expr);
}
}
Postfix Operators
Update expr_tail
to support postfix operators.
expr_tail = postfix_op
| infix_op expr
If an operator is both postfix and infix, the parser must look ahead to differentiate the two cases. Disallow this to simplify parsing.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
const op = tail_op(peek_token(ctx));
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=" && assoc(op) === "left") break;
next_token(ctx);
if (is_postfix_op(op)) {
left_expr = postfix(op, left_expr);
} else {
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
}
return left_expr;
}
Parentheses
Extend expr_head
to support parenthesized expressions.
expr_head = "(" expr ")"
| number
| prefix_op expr
Parentheses prevent other operators from interacting with the enclosed expression, so precedence is reset.
function expr_head(ctx) {
const token = next_token(ctx);
if (token === "(") {
const right_expr = expr(ctx, Op.Root);
next_token(ctx); // consume ")"
return right_expr;
} else if (is_number_token(token)) {
return number(token);
} else {
const op = prefix_op(token);
const right_expr = expr(ctx, op);
return prefix(op, right_expr);
}
}
)
tokens appear where the parser otherwise expects tail operators. They mark
the end of an expression.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
const token = peek_token(ctx);
if (token === ")") break;
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=" && assoc(op) === "left") break;
next_token(ctx);
if (is_postfix_op(op)) {
left_expr = postfix(op, left_expr);
} else {
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
}
return left_expr;
}
Subscript Operator
The (array) subscript operator [
resembles a postfix operator because there is
no right argument.
Brackets prevent other operators from interacting with the enclosed expression.
expr_tail = infix_op expr
| postfix_op
| subscript_op
subscript_op = "[" expr "]"
Like )
tokens, ]
tokens mark the end of an expression.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
const token = peek_token(ctx);
if (token === ")" || token === "]") break;
const op = tail_op(token);
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=" && assoc(op) === "left") break;
next_token(ctx);
if (op === Op.Subscript) {
const middle_expr = expr(ctx, Op.Root);
next_token(ctx); // consume "]"
left_expr = subscript(left_expr, middle_expr);
} else if (is_postfix_op(op)) {
left_expr = postfix(op, left_expr);
} else {
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
}
return left_expr;
}
Ternary Operator
The ternary operator left ? middle : right
resembles an infix operator because
it has an unenclosed right argument.
The middle expression is enclosed so other operators cannot affect how it is parsed.
expr_tail = infix_op expr
| postfix_op
| subscript_op
| ternary_op
ternary_op = "?" expr ":" expr
:
tokens mark the end of an expression.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
const token = peek_token(ctx);
if (token === ")" || token === "]" || token === ":") break;
const op = tail_op(token);
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=" && assoc(op) === "left") break;
next_token(ctx);
if (op === Op.Ternary) {
const middle_expr = expr(ctx, Op.Root);
next_token(ctx); // consume ":"
const right_expr = expr(ctx, op);
left_expr = ternary(left_expr, middle_expr, right_expr);
} else if (op === Op.Subscript) {
const middle_expr = expr(ctx, Op.Root);
next_token(ctx); // consume "]"
left_expr = subscript(left_expr, middle_expr);
} else if (is_postfix_op(op)) {
left_expr = postfix(op, left_expr);
} else {
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
}
return left_expr;
}
Ternary Operator with Optional Else
Modify the ternary operator so : right
is optional.
Pair :
to the closest unpaired ?
so that a ? b ? c : d
parses as
a ? (b ? c : d)
.
ternary_op = "?" expr (":" expr)?
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
// ...
if (op === Op.Ternary) {
const middle_expr = expr(ctx, Op.Root);
let right_expr = null;
if (peek_token(ctx) === ":") {
next_token(ctx); // consume ":"
right_expr = expr(ctx, op);
}
left_expr = ternary(left_expr, middle_expr, right_expr);
}
// ...
}
return left_expr;
}
Non-Associative Operators and Relative Precedence
Non-Associative Operators
Is <<
left-associative or right-associative?
Since it's not obvious, forgo associativity and instead require disambiguating parentheses.
a << b
passes because it has one interpretation but a << b << c
fails as
ambiguous and requires parentheses to fix.
function error_assoc(op) {
return new Error(
`Operator "${op}" is not associative and requires parentheses`,
);
}
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
// ...
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=") {
if (assoc(op) === "left") break;
if (assoc(op) === "none") throw error_assoc(op);
}
// ...
}
return left_expr;
}
Relative Precedence
Is <<
or **
higher precedence?
Since global precedence is confusing, let's instead define relative precedence between operator pairs and require parentheses for unrelated operators.
For example, if Precedence(**) > Precedence(/)
is the only known precedence
relationship, then expect the following parse results:
Expression | Parse Result | Parse Trees |
---|---|---|
a << b |
Ok | (a << b) |
a ** b / c |
Ok | ((a ** b) / c) |
a ** b << c |
Ambiguous | (a ** (b << c)) and ((a ** b) << c) |
a ** (b << c) |
Ok | (a + (b << c)) |
The parser's cmp_precedence()
now returns "!"
for unrelated operators.
function error_unrelated_ops(op1, op2) {
return new Error(`"${op1}" is unrelated to "${op2}"`);
}
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
// ...
const order = cmp_precedence(op, parent_op);
if (order === "!") throw error_unrelated_ops(op, parent_op);
if (order === "<") break;
if (order === "=") {
if (assoc(op) === "left") break;
if (assoc(op) === "none") throw error_assoc(op);
}
// ...
}
return left_expr;
}
Precedence Groups
Defining the relative precedence between every pair of operators is unscalable in languages with many operators.
One solution is to organize operators into groups that share the same associativity and relative precedence.
Let's define a variety of operators, group them, and define each group's relative precedence.
Precedence Group | Operators | Associativity | Greater Precedence Than |
---|---|---|---|
Postfix | ++ -- [ | none | Prefix |
Prefix | ++ -- + - ! | none | BitwiseShift, Exponentiation |
BitwiseShift | << >> | none | Comparison |
Exponentiation | ** | right | Multiplication |
Multiplication | * / & % | left | Addition |
Addition | + - | | left | Comparison |
Comparison | == != < <= > >= | none | Conjunction |
Conjunction | && | left | Disjunction |
Disjunction | || | left | Ternary |
Ternary | ? | right | Root |
Root | none |
Given precedence P
, P(Multiplication) > P(Comparison)
because precedence is
transitive: P(A) > P(B)
and P(B) > P(C)
imply P(A) > P(C)
.
Precedence groups form a digraph. Note how BitwiseShift is unrelated to Exponentiation, Multiplication, and Addition.
Implementation
Encode the precedence table in TypeScript.
const enum Op {
PostfixPlusPlus,
PostfixMinusMinus,
Subscript,
// ... remaining operators
}
const enum Group {
Postfix,
Prefix,
BitwiseShift,
// ... remaining groups
}
const groups = [
{
id: Group.Postfix,
ops: [Op.PostfixPlusPlus, Op.PostfixMinusMinus /* , ... */],
assoc: "none",
gt: [Group.Prefix],
},
{
id: Group.Prefix,
ops: [Op.PrefixPlusPlus, Op.PrefixMinusMinus /* , ... */],
assoc: "none",
gt: [Group.BitwiseShift, Group.Exponentiation],
},
{
id: Group.BitwiseShift,
ops: [Op.BitwiseShiftLeft, Op.BitwiseShiftRight],
assoc: "none",
gt: [Group.Comparison],
},
// ... remaining groups
];
Precompute mappings from operators to groups and associativities.
const map_op_to_group = new Map<Op, Group>();
const map_op_to_assoc = new Map<Op, "left" | "right" | "none">();
for (const group of groups) {
for (const op of group.ops) {
map_op_to_group.set(op, group.id);
map_op_to_assoc.set(op, group.assoc);
}
}
assoc()
is a map lookup.
function assoc(op) {
return map_op_to_assoc.get(op)!;
}
For each precedence group, precompute lower precedence groups using depth-first search.
const precedence_gt = new Map<Group, Set<Group>>();
for (const src of groups) precedence_gt.set(src.id, new Set(src.gt));
for (const [_, dsts] of precedence_gt) {
const explore = [...dsts];
while (explore.length) {
const dst = explore.pop()!;
dsts.add(dst);
const next_dsts = precedence_gt.get(dst)!;
for (const new_dst of next_dsts) {
if (!dsts.has(new_dst)) {
explore.push(new_dst);
}
}
}
}
cmp_precedence()
refers to the map.
function cmp_precedence(op1, op2) {
const group1 = map_op_to_group.get(op1)!;
const group2 = map_op_to_group.get(op2)!;
if (group1 === group2) return "=";
if (precedence_gt.get(group1)!.has(group2)) return ">";
if (precedence_gt.get(group2)!.has(group1)) return "<";
return "!";
}
Wrap Up
Error Handling
Handle errors by checking for unexpected tokens.
Start with the head parser.
function error_bad_token(token) {
return new Error(`Bad token "${token}"`);
}
function expr_head(ctx) {
const token = next_token(ctx);
if (token === "(") {
const right_expr = expr(ctx, Op.Root);
const rparen = next_token(ctx);
if (rparen !== ")") throw error_bad_token(rparen, ")");
return right_expr;
} else if (is_number_token(token)) {
return number(token);
} else if (is_prefix_op_token(token)) {
const op = prefix_op(token);
const right_expr = expr(ctx, op);
return prefix(op, right_expr);
} else {
throw error_bad_token(token ?? "EOF");
}
}
And now the tail parser.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
const token = peek_token(ctx);
if (token === ")" || token === "]" || token === ":") break;
const op = tail_op(token);
if (op === undefined) throw error_bad_token(token);
const order = cmp_precedence(op, parent_op);
if (order === "!") throw error_unrelated_ops(op, parent_op);
if (order === "<") break;
if (order === "=") {
if (assoc(op) === "left") break;
if (assoc(op) === "none") throw error_assoc(op);
}
next_token(ctx);
if (op === Op.Ternary) {
const middle_expr = expr(ctx, Op.Root);
let right_expr = undefined;
if (peek_token(ctx) === ":") {
next_token(ctx);
right_expr = expr(ctx, op);
}
left_expr = ternary(left_expr, middle_expr, right_expr);
} else if (op === Op.Subscript) {
const middle_expr = expr(ctx, Op.Root);
const rbracket = next_token(ctx);
if (rbracket !== "]") throw error_bad_token(rbracket);
left_expr = subscript(left_expr, middle_expr);
} else if (is_postfix_op(op)) {
left_expr = postfix(op, left_expr);
} else if (is_infix_op(op)) {
const right_expr = expr(ctx, op);
left_expr = infix(op, left_expr, right_expr);
}
}
return left_expr;
}
Grammar
This is the final grammar.
expr = expr_head expr_tail*
expr_head = number
| prefix_op expr
expr_tail = infix_op expr
| postfix_op
| "[" expr "]"
| "?" expr (":" expr)?
The parser produces syntax trees of this form.
expr = number
| prefix
| postfix
| infix
| ternary
| subscript
prefix = prefix_op expr
postfix = expr postfix_op
infix = expr infix_op expr
ternary = expr "?" expr (":" expr)?
subscript = expr "[" expr "]"
Parser
Code for the final parser is here It includes a scanner and makes greater use of types and enums.
Run tests with deno test parser.ts
.
Binding Power
Other posts explain Pratt parsing using "binding power."
Each operator is assigned a left and right binding power based on its precedence and associativity:
Operator | Left Binding Power | Right Binding Power | Notes |
---|---|---|---|
** | 6 | 5 | Right-associative operators have higher left binding power |
* | 3 | 4 | Left-associative operators have higher right binding power |
+ | 1 | 2 | Operators with lower precedence have lower binding powers |
- | 1 | 2 | Operators with the same precedence and associativity have the same binding powers |
The condition for exiting a level is simplified to a binding power comparison.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
// ...
if (binding_power_right(op) < binding_power_left(parent_op)) break;
// ...
}
return left_expr;
}
I think binding power is less intuitive than the equivalent precedence and associativity checks.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
// ...
const order = cmp_precedence(op, parent_op);
if (order === "<") break;
if (order === "=" && assoc(op) === "left") break;
// ...
}
return left_expr;
}
I think requiring parentheses on expressions that are ambiguous to humans is good language design.
function expr_tail(ctx, parent_op, left_expr) {
while (has_token(ctx)) {
// ...
const order = cmp_precedence(op, parent_op);
if (order === "!") throw error_unrelated_ops(op, parent_op);
if (order === "<") break;
if (order === "=") {
if (assoc(op) === "left") break;
if (assoc(op) === "none") throw error_assoc(op);
}
// ...
}
return left_expr;
}
But binding power requires operators to have global precedence and associativity.