FParsec-Pipes: making FParsec parsers even more concise and consistent.
FParsec is an F# library for writing parsers. This library extends FParsec with a new set of combinators, which I believe make it even easier to translate from a formal grammar like EBNF to F# parsing code and end up with a fast, highly readable parser.
Show me an example
Here's a parser for ISO-8601-formatted datetimes.
// Parses `count` digits and returns the result as an integer.
let digits (count : int) =
%% +.(qty.[count] * digit)
-|> (String >> Int32.Parse)
// Parses a yyyy-mm-dd date and returns the result as a tuple of 3 integers.
let date =
%% +.digits 4 -- '-' -- +.digits 2 -- '-' -- +.digits 2
-%> auto
// Parses a hh:mm:ss time and returns the result as a tuple of 3 integers.
let time =
%% +.digits 2 -- ':' -- +.digits 2 -- ':' -- +.digits 2
-%> auto
// Parses a DateTime in the format yyyy-mm-ddThh:mm:ss
let datetime : Parser<_, unit> =
%% +.date -- 'T' -- +.time
-|> fun (yyyy, mm, dd) (hh, mi, ss) ->
new DateTime(yyyy, mm, dd, hh, mi, ss)
What are the features?
Default parsers for values
The unary operator %
can be used to convert various values to the "obvious" corresponding parser.
This cuts down on clutter from explicitly using pstring
, pchar
, and others.
See the examples below:
// Expression: // Equivalent to:
%"bob" // pstring "bob"
%ci "jim" // pstringCI "jim" // case insensitive
%'b' // pchar 'b'
%ci 'b' // pcharCI 'b' // case insensitive
%['a'; 'b'] // choice [pchar 'a'; pchar 'b']
%["alice"; "bob"] // choice [pstring "alice"; pstring "bob"]
%[pint32; preturn (-1)] // choice [pint32; preturn (-1)]
The value you pass to %
can also be of type DefaultParserOf<'a>
, which you
can obtain by using p<'a>
. If 'a
is a supported type, the resulting
parser will be one to parse any 'a
.
This works for several primitive types, including all integer types.
// Expression: // Equivalent to:
%p<int> // pint32
%p<uint16> // puint16
%p<float> // pfloat
%p<Position> // getPosition
You can make p
work for your own types, too. Simply define:
static member DefaultParser : Parser<'a, 'u>
type Vector3D =
{
X : float
Y : float
Z : float
}
// Parses a vector in the format (x, y, z)
static member DefaultParser =
let comma = spaces >>. %',' >>. spaces
pipe3
(%'(' >>. spaces >>. %p<float>)
(comma >>. %p<float>)
(comma >>. %p<float> .>> spaces .>> %')')
(fun x y z -> { X = x; Y = y; Z = z })
%p<Vector3D> // equivalent to Vector3D.DefaultParser
Pipes
Notice that the above DefaultParser
example used the function pipe3
, and the combinators >>.
and .>>
.
Using vanilla FParsec, when you want to make a parser like "parserA
followed by parserB
followed by parserC
",
you must choose between >>.
, .>>
, .>>.
, pipe2
through pipe5
, or tuple2
through tuple5
, depending on which
of the chained parsers you want to consume the output of.
This means that two parsers written to consume the same grammar may look quite different, if their authors had different requirements about how much information to capture in their output AST. A "recognizer" that simply validates the source passed to it will have a very different structure from a parser that produces an AST.
FParsec-Pipes replaces all such sequencing combinators with "pipes", so named because they fill the purpose of the pipe2
family of operators.
The rules for combining parsers with these operators are simple:
- Start the pipe with
%%
followed by the first parser. - Add additional parsers to the pipe with the
--
operator. - "Capture" parser outputs in the pipe with the
+.
prefix operator. - Complete the pipe with
-|>
by supplying a function to combine all the "captured" outputs.
All the operators use %
to convert their operands to parsers, so you can use raw characters and strings
in the pipeline. Here's the rewritten version of the Vector3D
parser.
let vector3pipe : Parser<_, unit> =
%% '(' -- spaces
-- +.p<float> -- spaces -- ',' -- spaces
-- +.p<float> -- spaces -- ',' -- spaces
-- +.p<float> -- spaces -- ')'
-|> fun x y z -> { X = x; Y = y; Z = z }
Sometimes it's overkill to write a function to combine the captured outputs of a pipe. Maybe the pipe has no outputs, or only one, or you just need a tuple.
For pipes with no outputs, you can give any value to -|>
and that value will be returned from the parser on success.
This is like the >>%
operator in FParsec.
let unitexample : Parser<unit, unit> =
%% "this pipe" -- "has no" -- "captures" -|> ()
// Always produces the value 1337 on success
let constexample : Parser<int, unit> =
%% "some text" -|> 1337
For pipes with 1 to 5 outputs, you can automatically combine them into a tuple by using
-%> auto
as the terminator.
// Parses an int
let auto1example : Parser<_, unit> =
%% "this pipe" -- "has one" -- +.p<int> -- "capture"
-%> auto
// Parse a tuple (float, int)
let auto2example : Parser<_, unit> =
%% "this pipe" -- +.p<float> -- "has two" -- +.p<int> -- "captures"
-%> auto
Backtracking with pipes
Sometimes it is necessary to backtrack in a parser. For example, suppose that at a given place in a file format, it is valid to find either a date/time in ISO 8601 format, or a Unix timestamp as a 64-bit integer.
Here is one attempt at this:
type Timestamp =
| UnixTimestamp of int64
| CalendarTimestamp of DateTime
let unixTimestamp = %% +.p<int64> -|> UnixTimestamp
let calendarTimestamp =
let digits (count : int) = %% +.(qty.[count] * digit) -|> (String >> Int32.Parse)
%% +.digits 4 -- '-' -- +.digits 2 -- '-' -- +.digits 2 -- 'T'
-- +.digits 2 -- ':' -- +.digits 2 -- ':' -- +.digits 2
-|> fun yyyy mm dd h m s ->
new DateTime(yyyy, mm, dd, h, m, s) |> CalendarTimestamp
let timestamp : Parser<_, unit> =
%[calendarTimestamp; unixTimestamp]
The problem with this approach is that, given the Unix timestamp input 1458730894
,
the first part of calendarTimestamp
will succeed since it successfully parses 4 digits.
It will fail when it encounters 7
as the next character instead of the expected -
,
but since it's already consumed input and advanced the parser state, the alternative
unixTimestamp
parser will not be attempted.
The solution is to add backtracking to calendarTimestamp
. What we want is to ensure that,
until we've seen the first -
, the whole pipe can be backtracked to the beginning.
Vanilla FParsec provides the attempt
, >>?
, .>>?
, and .>>.?
combinators for this type of situation.
However, there's no need to abandon the pipe syntax -- it supports backtracking too, with the ?-
and -?
operators.
The ?-
operator wraps everything to its left in attempt
.
let calendarTimestampBacktrack1 : Parser<_, unit> =
let digits (count : int) = %% +.(qty.[count] * digit) -|> (String >> Int32.Parse)
%% +.digits 4 -- '-' ?- +.digits 2 -- '-' -- +.digits 2 -- 'T'
// ^^ Backtrack the whole pipe if anything to the left of this fails.
-- +.digits 2 -- ':' -- +.digits 2 -- ':' -- +.digits 2
-|> fun yyyy mm dd h m s ->
new DateTime(yyyy, mm, dd, h, m, s) |> CalendarTimestamp
The -?
also wraps everything to its left in attempt
, but also joins its left and right
together using .>>.?
, so that if the right side fails without changing the parser state,
the pipe will backtrack to the beginning.
This is a bit harder to understand, but can sometimes produce better error messages.
let calendarTimestampBacktrack2 : Parser<_, unit> =
let digits (count : int) = %% +.(qty.[count] * digit) -|> (String >> Int32.Parse)
%% +.digits 4 -? '-' -- +.digits 2 -- '-' -- +.digits 2 -- 'T'
// ^^ Backtrack the whole pipe if anything to the left of this fails,
// or if the parser to the right fails without changing the parser state.
-- +.digits 2 -- ':' -- +.digits 2 -- ':' -- +.digits 2
-|> fun yyyy mm dd h m s ->
new DateTime(yyyy, mm, dd, h, m, s) |> CalendarTimestamp
You can think of ?-
and -?
as marking the end of ambiguity.
?-
says "if we've gotten this far, we're definitely looking at something that is supposed to
match this parser, and if it doesn't, that's a syntax error."
-?
is similar, but changes the condition to "if we've gotten this far, and the next little bit doesn't immediately look wrong".
Repetition and separation
FParsec includes many combinators for defining parsers like "zero or more of parserA
" or
"one or more of parserA
separated by parserB
". These include many
, many1
, sepBy
, and sepEndBy
.
FParsec-Pipes uses F#'s slice notation to handle the concept of ranges.
You can slice into qty
to get a Range
object, and multiply a parser
(or a value that can be converted to a parser with %
) by that range.
The resulting parser parses a System.Collections.Generic.List
(not an F# list) of values.
'a' * qty.[3] // parse exactly 3 'a's -- similar to `parray 3 %'a'`
'a' * qty.[1..] // parse one or more 'a's -- similar to `many1 %'a'`
'a' * qty.[0..] // parse 0 or more 'a's -- similar to `many %'a'`
'a' * qty.[..5] // parse up to 5 'a's
'a' * qty.[3..6] // parse at least 3 and at most 6 'a's
qty.[3] * 'a' // the range can be on either side of the *
For separation, /
is also overloaded on ranges.
Another variant, /.
, allows (but does not require) a separator to appear after the last element.
Mathematically this is of course nonsense, but /
was chosen because
you can say that the elements are "divided" by their separators, in the
same way that a fence divides pieces of property.
'a' * (qty.[0..] / ',') // parse 0 or more 'a's separated by ','s
// similar to sepBy %'a' %','
qty.[0..] / ',' * 'a' // same as above, just written in a different order
// this may be preferable as it doesn't require parentheses
qty.[1..] /. ',' * 'a' // parse one or more 'a's separated by ','s, with an optional trailing ','
// similar to sepEndBy1 %'a' %','
let comma() = %% "," -- spaces -|> ()
let element() = %% 'a' -- spaces -|> ()
in qty.[2..10] / comma() * element() // parse 2 to 10 'a's separated by commas, with whitespace allowed
Full name: Intro.digits
val int : value:'T -> int (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.int
--------------------
type int = int32
Full name: Microsoft.FSharp.Core.int
--------------------
type int<'Measure> = int
Full name: Microsoft.FSharp.Core.int<_>
Full name: FParsec.Pipes.Many.qty
Full name: FParsec.CharParsers.digit
type String =
new : value:char -> string + 7 overloads
member Chars : int -> char
member Clone : unit -> obj
member CompareTo : value:obj -> int + 1 overload
member Contains : value:string -> bool
member CopyTo : sourceIndex:int * destination:char[] * destinationIndex:int * count:int -> unit
member EndsWith : value:string -> bool + 2 overloads
member Equals : obj:obj -> bool + 2 overloads
member GetEnumerator : unit -> CharEnumerator
member GetHashCode : unit -> int
...
Full name: System.String
--------------------
String(value: nativeptr<char>) : unit
String(value: nativeptr<sbyte>) : unit
String(value: char []) : unit
String(c: char, count: int) : unit
String(value: nativeptr<char>, startIndex: int, length: int) : unit
String(value: nativeptr<sbyte>, startIndex: int, length: int) : unit
String(value: char [], startIndex: int, length: int) : unit
String(value: nativeptr<sbyte>, startIndex: int, length: int, enc: Text.Encoding) : unit
struct
member CompareTo : value:obj -> int + 1 overload
member Equals : obj:obj -> bool + 1 overload
member GetHashCode : unit -> int
member GetTypeCode : unit -> TypeCode
member ToString : unit -> string + 3 overloads
static val MaxValue : int
static val MinValue : int
static member Parse : s:string -> int + 3 overloads
static member TryParse : s:string * result:int -> bool + 1 overload
end
Full name: System.Int32
Int32.Parse(s: string, provider: IFormatProvider) : int
Int32.Parse(s: string, style: Globalization.NumberStyles) : int
Int32.Parse(s: string, style: Globalization.NumberStyles, provider: IFormatProvider) : int
Full name: Intro.date
Full name: FParsec.Pipes.Pipes.auto
Full name: Intro.time
Full name: Intro.datetime
Full name: FParsec.Primitives.Parser<_,_>
Full name: Microsoft.FSharp.Core.unit
type DateTime =
struct
new : ticks:int64 -> DateTime + 10 overloads
member Add : value:TimeSpan -> DateTime
member AddDays : value:float -> DateTime
member AddHours : value:float -> DateTime
member AddMilliseconds : value:float -> DateTime
member AddMinutes : value:float -> DateTime
member AddMonths : months:int -> DateTime
member AddSeconds : value:float -> DateTime
member AddTicks : value:int64 -> DateTime
member AddYears : value:int -> DateTime
...
end
Full name: System.DateTime
--------------------
DateTime()
(+0 other overloads)
DateTime(ticks: int64) : unit
(+0 other overloads)
DateTime(ticks: int64, kind: DateTimeKind) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, calendar: Globalization.Calendar) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, kind: DateTimeKind) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, calendar: Globalization.Calendar) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, millisecond: int) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, millisecond: int, kind: DateTimeKind) : unit
(+0 other overloads)
Full name: FParsec.Pipes.DefaultParsers.ci
Full name: FParsec.CharParsers.pint32
Full name: FParsec.Primitives.preturn
Full name: FParsec.Pipes.DefaultParsers.p
val uint16 : value:'T -> uint16 (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.uint16
--------------------
type uint16 = UInt16
Full name: Microsoft.FSharp.Core.uint16
val float : value:'T -> float (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.float
--------------------
type float = Double
Full name: Microsoft.FSharp.Core.float
--------------------
type float<'Measure> = float
Full name: Microsoft.FSharp.Core.float<_>
type Position =
new : streamName:string * index:int64 * line:int64 * column:int64 -> Position
member Column : int64 with get, set
member CompareTo : other:Position -> int
member Equals : obj:obj -> bool + 1 overload
member GetHashCode : unit -> int
member Index : int64 with get, set
member Line : int64 with get, set
member StreamName : string with get, set
member ToString : unit -> string
static member Compare : left:Position * right:Position -> int
Full name: FParsec.Position
--------------------
Position(streamName: string, index: int64, line: int64, column: int64) : unit
{X: float;
Y: float;
Z: float;}
static member DefaultParser : Parser<Vector3D,unit>
Full name: Intro.Vector3D
static member Vector3D.DefaultParser : Parser<Vector3D,unit>
Full name: Intro.Vector3D.DefaultParser
--------------------
type DefaultParser =
| DefaultParser
static member ( %!!~~% ) : DefaultParser:DefaultParser * list:'a0 list -> (CharStream<'a2> -> Reply<'a1>) (requires member ( %!!~~% ))
static member ( %!!~~% ) : DefaultParser:DefaultParser * literal:string -> Parser<string,'a0>
static member ( %!!~~% ) : DefaultParser:DefaultParser * literal:char -> Parser<char,'a0>
static member ( %!!~~% ) : DefaultParser:DefaultParser * existing:Parser<'a,'u> -> Parser<'a,'u>
static member ( %!!~~% ) : DefaultParser:DefaultParser * cap:Fitting<'a0,'a1,'a2> -> Fitting<'a0,'a1,'a2> (requires 'a2 :> Appendable)
Full name: FParsec.Pipes.DefaultParsers.DefaultParser
Full name: FParsec.CharParsers.spaces
Full name: FParsec.Primitives.pipe3
Full name: Intro.vector3pipe
Full name: Intro.unitexample
Full name: Intro.constexample
Full name: Intro.auto1example
Full name: Intro.auto2example
| UnixTimestamp of int64
| CalendarTimestamp of DateTime
Full name: Intro.Timestamp
val int64 : value:'T -> int64 (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.int64
--------------------
type int64 = Int64
Full name: Microsoft.FSharp.Core.int64
--------------------
type int64<'Measure> = int64
Full name: Microsoft.FSharp.Core.int64<_>
Full name: Intro.unixTimestamp
Full name: Intro.calendarTimestamp
Full name: Intro.timestamp
Full name: Intro.calendarTimestampBacktrack1
Full name: Intro.calendarTimestampBacktrack2
Full name: Intro.comma
Full name: Intro.element