LPeg: a New Pattern-Matching Library for Lua

Introduction · Basic Constructions · Grammars · Captures · Some Examples · Download

Disclaimer: this is version 0.3! The purpose of this version is to present the main ideas behind the library. Everything presented here may change without further notice.


Introduction

LPeg is a new pattern-matching library for Lua, based on Parsing Expression Grammars (PEGs). I assume that readers are familiar with PEGs. If you are not, you can get a quick start reading Section 2 of Parsing Expression Grammars: A Recognition-Based Syntactic Foundation (the section has only one page) or the Wikipedia Entry for PEGs. The nice thing about PEGs is that it has a formal basis (instead of being an ad-hoc set of features), allows an efficient and simple implementation, and does most things we expect from a pattern-matching library (and more, as we can define entire grammars).

Following the Snobol tradition, LPeg defines patterns as first-class objects. That is, patterns are regular Lua values (represented by userdata). The library offers several functions to create and compose patterns. With the use of metamethods, several of these functions are provided as infix or prefix operators. The result is usually much more verbose than the typical encoding of patterns using the so called regular expressions (which usually are not regular expressions in the formal sense). On the other hand, first-class patterns allow much better documentation (as it is easy to comment the code, to use auxiliary variables to break complex definitions, etc.) and are extensible: we can define in Lua new functions to create and compose patterns.

For those not convinced by the previous arguments, there is also a library that implements patterns following a regular-expression style. (This library is less than 200 lines of Lua code, and of course uses LPeg to parse regular expressions.) This library is even more green than LPeg, and currently supports only very basic captures. But it provides a good source for those looking for extended examples of LPeg definitions.

Basic Constructions

Most of the following operations build patterns. All operations that expect a pattern as an argument may receive also strings, tables, or numbers, which are translated to patterns according to the rules of function lpeg.P.

lpeg.match (subject, pattern [, init])

Main (and currently only) matching function. It attempts to match the given subject string to the given pattern. If the match succeeds, returns the index in the subject of the first character after the match. or the values of captured values (if the pattern captured any value).

An optional numeric argument init makes the match starts at that position in the subject string. As usual in Lua libraries, a negative value counts from the end.

Unlike typical pattern-matching functions, match works only in anchored mode; that is, it tries to match the pattern with a prefix of the given subject string (at position init), not with an arbitrary substring of the subject. So, if we want to find a pattern anywhere in a string, we must either write a loop in Lua or write a pattern that matches anywhere. This second approach is easy and quite efficient; see examples.

lpeg.P (value)

Converts the given value into a proper pattern, according to the following rules:

lpeg.R (char1, char2)

Returns a pattern that matches any single character with a code between the codes of char1 and char2 (both inclusive). The characters are given as strings of length 1. (The R stands for Range.)

As an example, the pattern lpeg.R("0", "9") matches any digit.

lpeg.S (string)

Returns a pattern that matches any single character that appears in the given string. (The S stands for Set.)

As an example, the pattern lpeg.S("0123456789abcdefABCDEF") matches any hexadecimal digit.

Note that, if s is a character (that is, a string of length 1), then lpeg.P(s) is equivalent to lpeg.S(s) which is equivalent to lpeg.R(s, s). Note also that lpeg.S("") is a pattern that always fails.

lpeg.F (function)

Returns a pattern that matches according to the given function. Each time there is an attempt for a match against this pattern, function is called, always with two arguments: the original subject string, and the current position in the subject. If function returns a valid number, the match succeeds and the returned number becomes the new current position. If function returns false or nil or an invalid number, the match fails.

If the function is called with parameters s and i, its result is valid if i <= result <= len(s) + 1.

#patt

Returns a pattern equivalent to &patt. This is a pattern that matches only if the input string does match patt, but without consuming any input, independently of success or failure.

-patt

Returns a pattern equivalent to !patt. This pattern matches only if the input string does not match patt. It does not consume any input, independently of success or failure.

As an example, the pattern -1 matches only the end of string.

patt1 + patt2

Returns a pattern equivalent to patt1 / patt2. It matches either patt1 or patt2 (with no backtracking once one of them succeed).

If both patt1 and patt2 are character sets, this operation is equivalent to set union:

lower = lpeg.R("a", "z")
upper = lpeg.R("A", "Z")
letter = lower + upper

patt1 - patt2

Returns a pattern equivalent to !patt2 patt1. This pattern asserts that the input does not match patt2 and then matches patt1.

If both patt1 and patt2 are character sets, this operation is equivalent to set difference. Note that -patt is equivalent to "" - patt (or 0 - patt). If patt is a character set, 1 - patt is its complement.

patt1 * patt2

Returns a pattern equivalent to patt1 patt2. It matches patt1, and then matches patt2 starting where patt1 finished.

(We used the * operator both because it has the right priority and because in formal languages it is common to use a dot for denoting concatenation.)

patt^n

If n is nonnegative, this pattern is equivalent to pattn patt*. It matches at least n occurrences of patt.

Otherwise, when n is negative, this pattern is equivalent to (patt?)-n. That is, it matches at most -n occurrences of patt.

In particular, patt^0 is equivalent to patt*, patt^1 is equivalent to patt+, and patt^-1 is equivalent to patt?.

Grammars

With the use of Lua variables, it is possible to define patterns incrementally, with each new pattern using previously defined ones. However, this technique does not allow the definition of recursive patterns. For recursive patterns, we need real grammars.

Grammars are represented by tables, where each entry is a rule. Currently, LPeg allows only numbers as "non-terminals"; more specifically, grammars must be lists (arrays), with rules indexed by consecutive integer keys starting from 1.

When a table is converted to a pattern, the result is a pattern that matches the first rule of the table.

lpeg.V (n)

Creates a pattern equivalent to Vn. This pattern represents the n-th nonterminal (or variable) in a grammar.

Because the grammar still does not exist when this function is evaluated, the result is an open reference to the respective rule.

A table is fixed when it is converted to a grammar. Then every open reference created by lpeg.V(n) is corrected to refer to the n-th rule of the table.

Captures

Captures specify what a match operation should return (the so called semantic information). LPeg offers several kinds of captures, which build values based on matches and combine them to create new values.

A capture pattern captures a value every time it succeeds. A capture inside a loop generates as many values as matched by the loop. A capture generates a value only when it succeeds. A pattern like lpeg.C(lpeg.P'a'^-1) captures the empty string when there is no 'a' (because the pattern 'a'? succeeds), while a pattern like lpeg.C(lpeg.P'a')^-1 does not capture any value when there is no 'a' (because the pattern 'a' fails).

Some captures may have an optional label, which can be any Lua value (except nil). When a capture has a label, its label is returned by the match operation before its captured value. (There is a good chance that this feature will be removed from future versions!!)

lpeg.C (patt [, label])

Creates a simple capture, which captures the substring of the subject that matches patt. If patt has other string or position captures, their values are returned after this one. The captured value is a string.

lpeg.Ca (patt)

Creates an accumulator capture. This capture assumes that patt should produce at least one captured value of any kind, which becomes the initial value of an accumulator. Pattern patt then may produce zero or more function captures. Each of these functions in these captures is called having the accumulator as its first argument (followed by any other arguments provided by its own pattern), and the value returned by the function becomes the new value of the accumulator. The final value of this accumulator is the sole result of the whole capture.

lpeg.Cc (value)

Creates a constant capture. This pattern matches the empty string and produces value as a captured value.

lpeg.Cs (patt)

Creates a substitution capture, which captures the substring of the subject that matches patt, with substitutions. For any capture inside patt, the substring that matched the capture is replaced by the capture value (which should be a string). The capture values from patt are not returned independently (only as substrings in the resulting string).

lpeg.Cp ([label])

Creates a position capture. It matches the empty string and captures the position in the subject where the match occurs. The captured value is a number.

lpeg.Ct (patt [, label])

Creates a table capture. This capture creates a table and puts all captures made by patt inside this table. Anonymous captures are stored in successive integer keys, starting at 1. Labeled captures are stored with the label as the key.

The captured value is only this table. The captures made by patt are not returned independently (only as table elements).

patt / function

Creates a function capture. It calls the given function passing all captures made by patt as arguments, or the whole match if patt made no capture. The value returned by the function is the final value of the capture.

If function returns nil (or no value), there is no captured value. Everything works as if there was no capture.

patt / string

Creates a string capture. It creates a capture string based on string. The captured value is a copy of string, except that the character % works as an escape character: any sequence in repl of the form %n, with n between 1 and 9, stands for the match of the n-th capture in patt. (Note that it uses the match, not the capture value. We strongly recommend that patt contains only single string captures, wherein the match and capture value are the same.) The sequence %0 stands for the whole match. The sequence %% stands for a single %.

patt / table

Creates a query capture. It indexes the given table using as key the value of the first capture of patt, or the whole match if patt made no capture. The value at that index is the final value of the capture.

If the table does not have that key, there is no captured value. Everything works as if there was no capture.

Some Examples

Splitting a String

The following code splits a string using a given pattern sep as a separator:

function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = elem * (sep * elem)^0
  return lpeg.match(s, p)
end

First the function ensures that sep is a proper pattern. The pattern elem is a repetition of zero of more arbitrary characters as long as there is not a match against the separator. It also captures its result. The pattern p matches a list of elements separated by sep.

If the split results in too many values, it may overflow the maximum number of values that can be returned by a Lua function. In this case, we should collect these values in a table:

function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)   -- make a table capture
  return lpeg.match(s, p)
end

Searching for a Pattern

The primitive match works only in anchored mode. If we want to find a pattern anywhere in a string, we must write a pattern that matches anywhere.

Because patterns are composable, we can write a function that, given any arbitrary pattern p, returns a new pattern that searches for p anywhere in a string. There are several ways to do the search. one way is like this:

function anywhere (p)
  return lpeg.P{
    [1] = p + 1 * lpeg.V(1)
  }
end

The grammar has a straight reading: it matches p or skips one character and tries again.

If we want to know where the pattern is in the string (instead of knowing only that it is there somewhere), we can add position captures in the pattern:

local I = lpeg.Cp()
function anywhere (p)
  return lpeg.P{
    [1] = I * p * I + 1 * lpeg.V(1)
  }
end

Another option for the search is like this:

local I = lpeg.Cp()
function anywhere (p)
  return (1 - lpeg.P(p))^0 * I * p * I
end

Again the pattern has a straight reading: it skips as many characters as possible while not matching p, and then matches p (plus appropriate captures).

If we want to look for a pattern only at word boundaries, we can use the following transformer:

local wordletter = lpeg.R("A", "Z") + lpeg.R("a", "z")

function atwordboundary (p)
  return lpeg.P{
    [1] = p + wordletter^0 * (1 - wordletter)^1 * lpeg.V(1)
  }
end

Balanced Parentheses

The following pattern matches only strings with balanced parentheses:

b = lpeg.P{
  [1] = "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")"
}

Reading the first (and only) rule of the given grammar, we have that a balanced string is an open parenthesis, followed by zero or more repetitions of either a non-parenthesis character or a balanced string (lpeg.V(1)), followed by a closing parenthesis.

Global Substitution

The next example does a job somewhat similar to string.gsub. It receives a pattern and a replacement value, and substitutes the replacement value for all occurrences of the pattern in a given string:

function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(s, patt)
end

As in string.gsub, the replacement value can be a string, a function, or a table.

Comma-Separated Values (CSV)

This example breaks a string into comma-separated values, returning all fields:

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
                    lpeg.C((1 - lpeg.S',\n"')^0)

local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)

function csv (s)
  return lpeg.match(s, record)
end

A field is either a quoted field (and then may contain any character except an individual quote, which may be written as two quotes that are replaced by one quote) or an unquoted field (and then cannot contain commas, newlines, or quotes). A record is a list of fields separated by commas, ending with a newline or the string end (-1).

Lua's Long Strings

This example matches long strings in Lua. We could not come up with a "pure" PEG grammar that matches that syntax, so we resort to Lua. First, we define a function that creates pairs of related patterns: the first one matches a given pattern, and the second one matches only the same string matched by the first:

function backwardref (patt)
  local b         -- string matched by first function
  local l = 0     -- its length
  patt = lpeg.P(patt)   -- ensure 'patt' is a pattern

  -- First function
  local pre = function (s, i)     
    local e = lpeg.match(s, patt, i)   -- matches against 'patt'
    if not e then return nil end
    b = string.sub(s, i, e - 1)
    l = e - i;
    return e;
  end

  -- Second function
  local pos = function (s, i)
    s = string.sub(s, i, i + l - 1)
    return s == b and i + l            
  end

  return lpeg.F(pre), lpeg.F(pos)
end

Now, we can use the previous function to ensure that both the open and close delimiters match the same number of equal signs:

local p1, p2 = backwardref(lpeg.P"="^0)

local open = '[' * p1 * '['
local close = ']' * p2 * ']'
comment = open  * (1 - close)^0 * close

Lua-function patterns are not very efficient. So, it may be worth the use of an and predicate to pre-analyze the subject before calling Lua:

local equals = lpeg.P"="^0
local p1, p2 = backwardref(equals)

local open = '[' * #(equals * '[') * p1 * '['
local close = ']' * #(equals * ']') * p2 * ']'
comment = open  * (1 - close)^0 * close

Arithmetic Expressions

This example is a complete parser and evaluator for simple arithmetic expressions. We write it in two styles. The first approach first builds a syntax tree and then traverses this tree to compute the expression value:

-- Lexical Elements
local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("0", "9")^1) * Space
local FactorOp = lpeg.C(lpeg.S("+-")) * Space
local TermOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space

-- Grammar
local V = lpeg.V
local Exp, Term, Factor = 1, 2, 3
G = lpeg.P{
  [Exp] = lpeg.Ct(V(Factor) * (FactorOp * V(Factor))^0);
  [Factor] = lpeg.Ct(V(Term) * (TermOp * V(Term))^0);
  [Term] = Number + Open * V(Exp) * Close;
}

G = Space * G * -1

-- Evaluator
function eval (x)
  if type(x) == "string" then
    return tonumber(x)
  else
    local op1 = eval(x[1])
    for i = 2, #x, 2 do
      local op = x[i]
      local op2 = eval(x[i + 1])
      if (op == "+") then op1 = op1 + op2
      elseif (op == "-") then op1 = op1 - op2
      elseif (op == "*") then op1 = op1 * op2
      elseif (op == "/") then op1 = op1 / op2
      end
    end
    return op1
  end
end

-- Parser/Evaluator
function evalExp (s)
  local t = lpeg.match(s, G)
  if not t then error("syntax error", 2) end
  return eval(t)
end

-- small example
print(evalExp"3 + 5*9 / (1+1) - 12")

The second style computes the expression value on the fly, without building the syntax tree. The following grammar takes this approach. (It assumes the same lexical elements as before.)

-- Auxiliary function
function eval (v1, op, v2)
  if (op == "+") then return v1 + v2
  elseif (op == "-") then return v1 - v2
  elseif (op == "*") then return v1 * v2
  elseif (op == "/") then return v1 / v2
  end
end

-- Grammar
local V = lpeg.V
local Exp, Term, Factor = 1, 2, 3
G = lpeg.P{
  [Exp] = lpeg.Ca(V(Factor) * (FactorOp * V(Factor) / eval)^0);
  [Factor] = lpeg.Ca(V(Term) * (TermOp * V(Term) / eval)^0);
  [Term] = Number / tonumber + Open * V(Exp) * Close;
}

Note the use of the accumulator capture. To compute the value of an expression, the accumulator starts with the value of the first factor, and then applies eval over the accumulator, the operator, and the new factor for each repetition.