Disclaimer: this is version 0.1! The purpose of this version is to present the main ideas behind the library.
LPeg is a new pattern-matching library for Lua, based on Parsing Expression Grammars (PEGs). I assume that readers are familiar with PEGs. If you are not, you can get a quick start reading Section 2 of Parsing Expression Grammars: A Recognition-Based Syntactic Foundation (the section has only one page) or the Wikipedia Entry for PEGs. The nice thing about PEGs is that it has a formal basis (instead of being an ad-hoc set of features), allows an efficient and simple implementation, and does most things we expect from a pattern-matching library (and more, as we can define entire grammars).
Following the Snobol tradition, LPeg defines patterns as first-class objects. That is, patterns are regular Lua values (represented by userdata). The library offers several functions to create and compose patterns. With the use of metamethods, several of these functions are provided as infix or prefix operators. The result is usually much more verbose than the typical encoding of patterns using the so called regular expressions (which usually are not regular expressions in the formal sense). On the other hand, first-class patterns allow much better documentation (as it is easy to comment the code, to use auxiliary variables to break complex definitions, etc.) and are extensible: we can define in Lua new functions to create and compose patterns.
For those not convinced by the previous arguments, there is also a library that implements patterns following a regular-expression style. (This library is less than 200 lines of Lua code, and of course uses LPeg to parse regular expressions.) This library is even more green than LPeg, and currently does not support any kind of capture. But it provides a good source for those looking for extended examples of LPeg definitions.
Most of the following operations build patterns.
All operations that expect a pattern as an argument
may receive also strings, tables, or numbers,
which are translated to patterns according to
the rules of function lpeg.P.
lpeg.match (subject, pattern)Main (and currently only) matching function. It attempts to match the given subject string to the given pattern. If the match succeeds, returns the index in the subject where the match finished, or the values of captured values (if the pattern captured any value).
Unlike typical pattern-matching functions,
match works only in anchored mode;
that is, it tries to match the pattern with a prefix of
the given subject string,
not with any substring of the subject.
So, if we want to find a pattern anywhere in a string,
we must either write a loop in Lua or write a pattern that
matches anywhere.
This second approach is easy and quite efficient;
see examples.
lpeg.P (value)Converts the given value into a proper pattern, according to the following rules:
If the argument is a pattern, it is returned unmodified.
If the argument is a string, it is translated to a pattern that matches literally the string.
If the argument is a table, it is interpreted as a grammar (see Grammars).
If the argument is a number, it is translated as follows. A non-negative number n gives a pattern that matches exactly n characters; a negative number -n gives a pattern that succeeds only if the input string does not have n characters. It is (as expected) equivalent to the unary minus operation (see below) applied over the absolute value of n.
lpeg.R (char1, char2)
Returns a pattern that matches any single character with
a code between the codes of char1 and char2
(both inclusive).
The characters are given as strings of length 1.
As an example, the pattern
lpeg.R("0", "9") matches any digit.
lpeg.S (string)Returns a pattern that matches any single character that appears in the given string.
As an example, the pattern
lpeg.S("0123456789abcdefABCDEF") matches any
hexadecimal digit.
Note that, if s is a character
(that is, a string of length 1),
then lpeg.P(s) is equivalent to lpeg.S(s)
which is equivalent to lpeg.R(s, s).
Note also that lpeg.S("") is a pattern that always fails.
#patt
Equivalent to &patt.
Returns a pattern that matches only if the input string
does match patt,
but without consuming any input,
independently of success or failure.
(Maybe the # should be used for optional pattern;
it seems to be more useful, and it has no symbol for it right now.
We must write patt + "".)
-patt
Equivalent to !patt.
Returns a pattern that matches only if the input string
does not match patt.
It does not consume any input,
independently of success or failure.
As an example, the pattern
-1 matches only the end of string.
patt1 + patt2
Equivalent to patt1 / patt2.
Returns a pattern that matches either
patt1 or patt2
(with no backtracking once one of them succeed).
If both patt1 and patt2 are
character sets,
this operation is equivalent to set union:
lower = lpeg.R("a", "z")
upper = lpeg.R("A", "Z")
letter = lower + upper
patt1 - patt2
Equivalent to !patt2 patt1.
Returns a pattern that asserts the input does not
match patt2 and then matches patt1.
If both patt1 and patt2 are
character sets,
this operation is equivalent to set difference.
Note that -patt is equivalent to "" - patt.
If patt is a character set,
1 - patt is its complement.
patt1 * patt2
Equivalent to patt1 patt2.
Returns a pattern that matches patt1 and
then matches patt2
starting where patt1 finished.
(We used the * operator both because it has
the right priority and because in formal languages it is
common the use of a dot to denote concatenation.)
patt^n
If n is nonnegative,
this pattern is
equivalent to pattn patt*.
It returns a pattern that matches at least n
occurrences of patt.
Otherwise, when n is negative,
this pattern is equivalent to (patt?)-n.
That is, it returns a pattern that matches at most -n
occurrences of patt.
In particular, patt^0 is equivalent to patt*,
patt^1 is equivalent to patt+,
and patt^-1 is equivalent to patt?.
With the use of Lua variables, it is possible to define patterns incrementally, with each new pattern using previously defined ones. However, this technique does not allow the definition of recursive patterns. For recursive patterns, we need real grammars.
Grammars are represented by tables, where each entry is a rule. Currently, LPeg allows only numbers as "non-terminals"; more specifically, grammars must be lists (arrays), with rules indexed by consecutive integer keys starting from 1.
When a table is converted to a pattern, the result is a pattern that matches the first rule of the table.
lpeg.V (n)Equivalent to Vn. Creates a pattern that represents the n-th nonterminal (or variable) in a grammar.
Because the grammar still does not exist when this function is evaluated, the result is an open reference to the respective rule.
A table is fixed when it is converted to a grammar.
Then every open reference created by lpeg.V(n)
is corrected to refer to the n-th element of the table.
Captures specify what a match operation should return. LPeg offers three kinds of captures: string, position, and table. A string capture captures the substring of the subject that matches a pattern. A position capture captures the position in the subject where the capture was made. A table capture collects all captures of a pattern in a table, which becomes the only capture of the pattern.
A capture pattern captures a value every time it succeeds.
A pattern inside a loop generates as many values as matched by the loop.
A capture generates a value only when it succeeds.
A pattern like
lpeg.C(lpeg.P'a'^-1)
captures the empty string when there is no 'a',
while a pattern like
lpeg.C(lpeg.P'a')^-1
does not capture any value when there is no 'a'.
Each capture (of any kind) may have an optional name. When a capture has a name, its name is returned by the match operation before its captured value.
lpeg.C (patt [, name])
Captures the substring of the subject that matches patt.
If patt has other captures,
their values are returned after this one.
The captured value is a string.
lpeg.I ([name])Matches the empty string and captures the position (or the Index) in the subject where the math occurs. The captured value is a number.
lpeg.T (patt [, name])
This capture creates a table and puts all captures made by
patt inside this table.
Anonymous captures are stored in successive integer keys,
starting at 1.
Named captures are stored with the name as the key.
The captured value is only this table.
The captures made by patt are not
returned independently (only as table elements).
The following code splits a string using a given pattern
sep as a separator:
function split (s, sep) local elem = lpeg.C((1 - lpeg.P(sep))^0) local p = elem * (sep * elem)^0 return lpeg.match(s, p) end
The pattern elem is a repetition of zero of more
arbitrary characters as long as there is not a match against
the separator. It also captures its result.
The pattern p matches a list of elements separated
by sep.
Because match works only in anchored mode,
to find a pattern anywhere in a string,
we must write a pattern that matches anywhere.
Because patterns are composable,
we can write a function that,
given any arbitrary pattern p,
returns a new pattern that searches for p
anywhere in a string.
There are several ways to do the search.
one way is like this:
function anywhere (p)
return lpeg.P{
[1] = p + 1 * lpeg.V(1)
}
end
The grammar has a straight reading:
it matches p or skips one character and tries again.
If we want to know where the pattern is in the string (instead of knowing only that it is there somewhere), we can add position captures in the pattern:
function anywhere (p)
return lpeg.P{
[1] = lpeg.I() * p * lpeg.I() + 1 * lpeg.V(1)
}
end
Another option for the search is like this:
function anywhere (p) return (1 - lpeg.P(p))^0 * lpeg.I() * p * lpeg.I() end
Again the pattern has a straight reading:
skips as many characters as possible while not matching p,
and then matches p.
If we want to look for a pattern only at word boundaries, we can use to following transformer:
local wordletter = lpeg.R("A", "Z") + lpeg.R("a", "z")
function atwordboundary (p)
return lpeg.P{
[1] = p + wordletter^0 * (1 - wordletter)^1 * lpeg.V(1)
}
end
The following pattern matches only strings with balanced parentheses:
b = lpeg.P{
[1] = "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")"
}
Reading the first (and only) rule of the given grammar,
we have that a balanced string is
an open parenthesis,
followed by zero or more repetitions of either
a non-parenthesis character or
a balanced string (lpeg.V(1)),
followed by a closing parenthesis.
This example breaks a string into comma-separated values, returning all fields:
local field = '"' * lpeg.C(((lpeg.P(1) - '"') + '""')^0) * '"' +
lpeg.C((1 - lpeg.S',\n"')^0)
local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)
function csv (s)
return lpeg.match(s, record)
end
A field is either a quoted field (and then may contain any character except an individual quote, which must be written as two quotes) or an unquoted field (and then cannot contain commas, newlines, or quotes). A record is a list of fields separated by commas, ending with a newline or the string end (-1).
This example is a complete parser and evaluator for simple arithmetic expressions:
-- Lexical Elements
local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.R("0", "9")^1) * Space
local FactorOp = lpeg.C(lpeg.S("+-")) * Space
local TermOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space
-- Grammar
local V = lpeg.V
local Exp, Term, Factor = 1, 2, 3
G = lpeg.P{
[Exp] = lpeg.T(V(Factor) * (FactorOp * V(Factor))^0);
[Factor] = lpeg.T(V(Term) * (TermOp * V(Term))^0);
[Term] = Number + Open * V(Exp) * Close;
}
G = Space * G * -1
-- Evaluator
function eval (x)
if type(x) == "string" then
return tonumber(x)
else
local op1 = eval(x[1])
for i = 2, #x, 2 do
local op = x[i]
local op2 = eval(x[i + 1])
if (op == "+") then op1 = op1 + op2
elseif (op == "-") then op1 = op1 - op2
elseif (op == "*") then op1 = op1 * op2
elseif (op == "/") then op1 = op1 / op2
end
end
return op1
end
end
-- Parser/Evaluator
function evalExp (s)
local t = lpeg.match(s, G)
if not t then error("syntax error", 2) end
return eval(t)
end
-- small example
print(evalExp"3 + 5*9 / (1+1) - 12")