Chapter 9. Parsing records

Often, what you are trying to parse can be seen as a file of records. Consider this input, for example:

  |This is a line\
  |that continues
  |This is a new line
  |So is this
  |But this one \
  |continues \
  |for \
  |several lines.

Many readers will be familiar with this pattern. Each line that ends with a “\” is interpreted as continuing on the next line. So this file consists of four records. We can parse it with this grammar:

  |records = record+ .
  |record = simple-value | extended-value .
  |-simple-value = ~[#a]*, ~["\"], -#a .
  |-extended-value = ~[#a]*, "\", #a, -record .

to obtain:

   |$ coffeepot -g:lines.ixml -i:lines.txt -pp
   |<records>
   |   <record>This is a line\
   |that continues</record>
   |   <record>This is a new line</record>
   |   <record>So is this</record>
   |   <record>But this one \
   |continues \
   |for \
   |several lines.</record>
   |</records>
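
For comparison, the continuation rule that the grammar encodes can be sketched in a few lines of Python. This is only an illustration of the record structure, not part of CoffeePot; the function name `read_records` and the `SAMPLE` string are made up for the example.

```python
# Hypothetical illustration: group lines into records, where a line that
# ends with a backslash continues on the next line.
SAMPLE = (
    "This is a line\\\n"
    "that continues\n"
    "This is a new line\n"
    "So is this\n"
    "But this one \\\n"
    "continues \\\n"
    "for \\\n"
    "several lines.\n"
)


def read_records(text):
    """Return a list of records; continued lines stay joined by newlines."""
    records = []
    current = []
    for line in text.splitlines():
        current.append(line)
        if not line.endswith("\\"):
            # A line without a trailing backslash closes the record.
            records.append("\n".join(current))
            current = []
    if current:
        # Input ended in the middle of a continued record.
        records.append("\n".join(current))
    return records


if __name__ == "__main__":
    for number, record in enumerate(read_records(SAMPLE), start=1):
        print(number, repr(record))
```

As in the XML output, the backslashes and the line breaks inside a continued record are kept, and only the final newline of each record is dropped.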

Starting with version 1.99.10, CoffeePot offers an alternative approach. You can use the “--record-end” option to tell CoffeePot to break the file into records for you, and then parse each record with your grammar.

With this approach, you can use a simpler grammar:

  |record = ~[]*.

to obtain the same result:

   |$ coffeepot -g:line.ixml -i:lines.txt --record-end:"([^\\\\])\\n" -pp
   |<records>
   |<record>This is a line\
   |that continues</record>
   |<record>This is a new line</record>
   |<record>So is this</record>
   |<record>But this one \
   |continues \
   |for \
   |several lines.</record>
   |</records>

You can embed the regular expression in the grammar with a pragma so that you don’t have to deal with double-escaping the backslashes (once for the shell and once for the regular expression).
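
To see what the double escaping amounts to, the following Python sketch mimics the two stages. It assumes that `shlex` is a close enough model of how a POSIX shell treats backslashes inside double quotes, and it uses Python's `re` module as a stand-in for whatever regular expression engine CoffeePot uses; neither is part of CoffeePot.

```python
import re
import shlex

# The option exactly as typed at the shell prompt in the example above.
typed = r'--record-end:"([^\\\\])\\n"'

# Stage 1: the shell removes the quotes and turns each \\ into a single \.
# shlex is used here as an approximation of POSIX shell quoting.
(argument,) = shlex.split(typed)
pattern = argument.split(":", 1)[1]
print(pattern)  # the regular expression the program actually receives

# Stage 2: the regular expression engine reads [^\\] as "any character
# except a backslash" and \n as a newline.
regex = re.compile(pattern)

# A newline after an ordinary character ends a record...
print(regex.search("that continues\n") is not None)

# ...but a newline after a backslash does not.
print(regex.search("This is a line\\\n") is not None)
```

After the shell has done its work, the expression is `([^\\])\n`: a newline preceded by anything other than a backslash.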

There’s a corresponding “--record-start” option for the case where the records are more easily identified by their beginning than their end.

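In either case, the text matched by capture groups in the regular expression is preserved as part of the record and the rest of the match is discarded. In the “--record-end” example above, the group keeps the last character of each record while the newline that follows it is dropped.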
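
The splitting behavior can be approximated with the Python sketch below. It is not CoffeePot's implementation: the function names are invented, and the placement of captured text (appended to the record that ends, or prepended to the record that starts) and the handling of text before the first start marker are assumptions made for illustration.

```python
import re


def _kept(match):
    """Concatenate the text of all capture groups that took part in a match."""
    return "".join(group for group in match.groups() if group is not None)


def split_by_record_end(text, pattern):
    """Split text where pattern matches the END of a record (assumed semantics)."""
    records = []
    position = 0
    for match in re.finditer(pattern, text):
        # Keep the record body plus any captured text; drop the rest of the match.
        records.append(text[position:match.start()] + _kept(match))
        position = match.end()
    if position < len(text):
        records.append(text[position:])  # trailing text with no terminator
    return records


def split_by_record_start(text, pattern):
    """Split text where pattern matches the START of a record (assumed semantics)."""
    matches = list(re.finditer(pattern, text))
    if not matches:
        return [text] if text else []
    records = []
    if matches[0].start() > 0:
        # Assumption: text before the first start marker is its own record.
        records.append(text[:matches[0].start()])
    for index, match in enumerate(matches):
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(text)
        records.append(_kept(match) + text[match.end():end])
    return records


if __name__ == "__main__":
    lines = "This is a line\\\nthat continues\nThis is a new line\n"
    print(split_by_record_end(lines, r"([^\\])\n"))

    sections = "#one\nbody\n#two\nmore\n"
    print(split_by_record_start(sections, r"(#)"))
```

Each resulting record would then be parsed separately with the grammar, which is why the grammar only has to describe a single record.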

For some inputs, using record splitting in this way may result in dramatically improved performance and/or simpler grammars.