Appendix A. Property files

A worked example

Introduction…

A.1 A first attempt

Suppose that we wanted to read Java-style property files, like this one:

# This is a comment.
name1 : value1
name2 = value2

Where:

  1. Lines consist of name/value pairs, separated by “:” or “=”; whitespace around the separator is irrelevant.

  2. If the first non-whitespace character in a line is “!” or “#”, it is a comment.
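Before turning to a grammar, the two rules above can be sketched procedurally. This is only an illustration in Python, not how coffeepot works; the function name classify and the tuple shapes it returns are assumptions made for this sketch.

```python
def classify(line: str):
    """Classify one property-file line according to the two informal rules.

    Returns ("comment", text) or ("pair", name, value).
    Hypothetical helper for illustration only.
    """
    stripped = line.strip()
    # Rule 2: the first non-whitespace character decides whether it is a comment.
    if stripped[:1] in ("#", "!"):
        return ("comment", stripped[1:])
    # Rule 1: split at the first ":" or "=", ignoring surrounding whitespace.
    positions = [i for i in (stripped.find(":"), stripped.find("=")) if i >= 0]
    if not positions:
        raise ValueError(f"no separator in line: {line!r}")
    cut = min(positions)
    return ("pair", stripped[:cut].strip(), stripped[cut + 1:].strip())


for sample in ["# This is a comment.", "name1 : value1", "name2 = value2"]:
    print(classify(sample))
```

Note that the procedural version has to choose an order for the two tests; a grammar states both rules declaratively, which is where the ambiguity discussed below comes from.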

Our first attempt to parse this grammar might look something like prop1.ixml:

property-file: line+ .
-line: comment ; name-value .

comment: s, -["#!"], char*, NL .
name-value: char*, NL .

-NL: -#a ; -#d, -#a .
-char: ~[#a] .
-s: (-[Zs]; -#9; -#d)* .

This grammar says, roughly: a property file consists of one or more lines.

A line is either a comment or a name-value.

A comment is whitespace followed by a “#” or a “!”, followed by any characters up to the end of the line.

A name-value is any character up to the end of the line. (This is clearly a toy definition that we’ll be refining.)

An NL is either a linefeed or a carriage return followed by a linefeed. A char is anything except a linefeed. And s, whitespace, is any sequence of zero or more space separators, tabs, or carriage returns.

Judicious use of “-” characters before terminals and non-terminals keeps the output clean. If you run this through coffeepot, you’ll get:

$ coffeepot -v -g:examples/prop1.ixml -i:examples/example1.properties -pp
Loading ixml grammar: examples/prop1.ixml
Loading input from examples/example1.properties
There are 2 possible parses.
<property-file xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <name-value># This is a comment.</name-value>
   <name-value>name1 : value1</name-value>
   <name-value>name2 = value2</name-value>
</property-file>

That looks reasonable for our initial grammar, but you might wonder where the ambiguity arises. Let’s find out with the --describe-ambiguity option:

$ coffeepot -v -g:examples/prop1.ixml -i:examples/example1.properties --describe-ambiguity -pp
Loading ixml grammar: examples/prop1.ixml
Loading input from examples/example1.properties
There are 2 possible parses.
Ambiguity:
$2, 0, 51
        line, 0, 21 / $3ⁿ, 21, 51
        line, 0, 21 / $3ⁿ, 21, 51
<property-file xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <name-value># This is a comment.</name-value>
   <name-value>name1 : value1</name-value>
   <name-value>name2 = value2</name-value>
</property-file>

When you ask coffeepot to describe ambiguity, or when it fails to parse your document and attempts to report errors, it has little choice at the moment except to expose some of the inner workings of the parser. This is described more thoroughly in Chapter 3, How it works.

This output indicates that the nonterminal “$2”, covering the range of characters 0-51, has two different derivations. Sometimes it’s useful to look at the graph. You can get an SVG version of it with the --graph-svg option:

Figure A.1.1 Part of the parse forest

There you can see that the culprit is that a line can be either a comment or a name-value. Does that seem strange? Well, look back at our proto-grammar:

comment: s, -["#!"], char*, NL .
name-value: char*, NL .

It says that a comment has to begin with a “#” or “!”, so line 1 could be a comment, but all that name-value says at the moment is that it doesn’t include newlines. So it could also match the first line!
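The overlap can be checked outside the parser. The sketch below uses rough regular-expression equivalents of the two rules; it is an approximation (the Zs class is reduced to a plain space) and the names RULES and matching_rules are invented for this example.

```python
import re

# Rough regex equivalents of the two rules in prop1.ixml.
# Assumption: Unicode space separators (Zs) are approximated by " ".
RULES = {
    "comment": re.compile(r"[ \t\r]*[#!][^\n]*\n"),
    "name-value": re.compile(r"[^\n]*\n"),
}


def matching_rules(line: str):
    """Return the names of every rule that matches the whole line."""
    return [name for name, rx in RULES.items() if rx.fullmatch(line)]


print(matching_rules("# This is a comment.\n"))  # ['comment', 'name-value']
print(matching_rules("name1 : value1\n"))        # ['name-value']
```

Only the first line matches both rules, which is consistent with the parser reporting exactly two parses for the three-line sample.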

A.2 Refining name-value

It’s reasonably straightforward to improve on name-value in prop2.ixml:

name-value: s, name, s, -[":="], s, value, NL.
name: namestart, namefollower* .
value: ~[Zs; #9; #d], char* .

-namestart: ["_"; L] .
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn] .

Here we’re saying that a name-value is a name, followed by a “:” or “=” separator, followed by a value; a name is a name start character followed by zero or more name follower characters, and a value is something that isn’t whitespace followed by any characters.

This does a good job on our sample file:

$ coffeepot -v -g:examples/prop2.ixml -i:examples/example1.properties -pp
Loading ixml grammar: examples/prop2.ixml
Loading input from examples/example1.properties
<property-file>
   <comment> This is a comment.</comment>
   <name-value>
      <name>name1</name>
      <value>value1</value>
   </name-value>
   <name-value>
      <name>name2</name>
      <value>value2</value>
   </name-value>
</property-file>
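
To see why the refined rule removes the ambiguity, it can help to hand-code roughly the same rule and try it on a comment line. The following Python sketch is an assumption-laden approximation, not coffeepot’s algorithm: the helper names are invented here, and the Unicode classes L, Nd, Mn, and Zs are looked up with the standard unicodedata module.

```python
import unicodedata


def is_space(ch):
    # The grammar's "s": space separators (Zs), tab, or carriage return.
    return unicodedata.category(ch) == "Zs" or ch in "\t\r"


def is_namestart(ch):
    return ch == "_" or unicodedata.category(ch).startswith("L")


def is_namefollower(ch):
    return (is_namestart(ch) or ch in "-.·‿⁀"
            or unicodedata.category(ch) in ("Nd", "Mn"))


def match_name_value(line):
    """Return (name, value) if line (without its newline) fits the rule, else None."""
    n = len(line)

    def skip_space(i):
        while i < n and is_space(line[i]):
            i += 1
        return i

    i = skip_space(0)
    start = i
    if i >= n or not is_namestart(line[i]):
        return None
    i += 1
    while i < n and is_namefollower(line[i]):
        i += 1
    name = line[start:i]
    i = skip_space(i)
    if i >= n or line[i] not in ":=":
        return None
    i = skip_space(i + 1)
    if i >= n:
        return None  # a value needs at least one non-whitespace character
    return name, line[i:]


print(match_name_value("name1 : value1"))        # ('name1', 'value1')
print(match_name_value("# This is a comment."))  # None
```

Because “#” is not a name start character, the comment line can no longer be read as a name-value, so only one parse remains.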

A.3 More line options

The format for property files is actually a bit more complicated. They allow blank lines, continuation lines, and several flavors of escaped characters:

# This is a comment.
name1 : value1
name2 = value2
name3 = apple,\
        banana,\
        pear

name4 = a\tb
name5 = a\u2192b 
name6 = c:\\path\\to\\thing

In fact, the format as described by Java allows even more escaping, and allows names without values, which we’re not going to try to cover now. The Java description is a fine example of a messy, procedural description of a file format. Their parsing description is explicitly two-pass, though it’s unclear if that’s necessary or if the author was just describing what their code does.

Before looking at the solution, have a go at extending the grammar to support blank lines and continuations. Blank lines are easy; continuations are a little more complicated.

Here’s one solution, in prop3.ixml:

-line: blank ; comment ; name-value .

blank: s, NL .
comment: s, -["#!"], char*, NL .

name-value: s, name, s, -[":="], s, value .
value: simple-value ; extended-value .
-simple-value: atomic-value, NL .
-extended-value: atomic-value, -"\", NL, s, -value .
-atomic-value: ~[Zs; #9; #d], char* .

A blank line is another kind of line; it consists entirely of whitespace.

A name-value no longer includes the newline because we have to address continuations. (If it weren’t for continuations, this grammar could be written more simply as a series of lines separated by newlines: property-file: line++#a.)

A value is now either a simple value or an extended value.

A simple value is an atomic value followed by a newline.

An extended value is an atomic value that ends with a backslash followed by a newline. That’s followed by whitespace and another value. This recursive definition assures that if there are several continuations, we catch them all.

Finally, an atomic value is just a non-whitespace character followed by more characters. It will always be bounded by the nonterminal that refers to it.

Now we get:

$ coffeepot -v -g:examples/prop3.ixml -i:examples/example.properties --describe-ambiguity -pp
Loading ixml grammar: examples/prop3.ixml
Loading input from examples/example.properties
<property-file>
   <comment> This is a comment.</comment>
   <name-value>
      <name>name1</name>
      <value>value1</value>
   </name-value>
   <name-value>
      <name>name2</name>
      <value>value2</value>
   </name-value>
   <name-value>
      <name>name3</name>
      <value>apple,banana,pear</value>
   </name-value>
   <blank/>
   <name-value>
      <name>name4</name>
      <value>a\tb</value>
   </name-value>
   <name-value>
      <name>name5</name>
      <value>a\u2192b </value>
   </name-value>
   <name-value>
      <name>name6</name>
      <value>c:\\path\\to\\thing</value>
   </name-value>
</property-file>

Note that “apple”, “banana”, and “pear” have been correctly combined into a single value. The blank line is explicit, but we could suppress it by putting “-” before its name.
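
The recursive shape of the value rule can be mirrored in a few lines of Python. This is a sketch under simplifying assumptions rather than a faithful port: the function name read_value is invented here, lines are assumed to arrive already split and without their newlines, Zs is approximated by a plain space, and (as in prop3.ixml) any trailing backslash counts as a continuation.

```python
def read_value(first, rest):
    """Join a value with its continuation lines.

    first: the value text on the current line (after the separator).
    rest:  the following lines of the file, without newlines.
    Returns (value, remaining_lines).
    """
    # Extended value: text ending in a backslash, then whitespace and another value.
    if first.endswith("\\") and rest:
        continuation = rest[0].lstrip(" \t\r")
        tail, remaining = read_value(continuation, rest[1:])
        return first[:-1] + tail, remaining
    # Simple value: nothing more to join.
    return first, rest


value, remaining = read_value(
    "apple,\\",
    ["        banana,\\", "        pear", "", "name4 = a\\tb"],
)
print(value)      # apple,banana,pear
print(remaining)  # ['', 'name4 = a\\tb']
```

As in the grammar, the backslash, the newline, and the leading whitespace of the continuation line are dropped, and the recursion stops at the first line that does not end in a backslash.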

A.4 Character escapes

The last thing we’ll look at is character escapes. The property file format says that tab, carriage return, and newline can be escaped as “\t”, “\r”, and “\n”, respectively. This in turn requires an escape for the backslash itself: “\\”. In addition, Java-style Unicode references are allowed: “\uHHHH”, where “HHHH” is any four hexadecimal digits.

As before, you might want to think about this before you look at the solution.

The solution in prop4.ixml is:

-char: ~["\";#a] ; tab; cr ; nl ; bs ; uref .

tab: -"\t" .
cr: -"\r" .
nl: -"\n" .
-bs: "\", -"\" .

uref: -"\u", digit, digit, digit, digit .
-digit: ["0"-"9"; "a"-"f"; "A"-"F"] .

We augment char so that it’s either a non-backslash character, a backslash followed by one of “t”, “r”, “n”, or “\”, or a “\u” followed by four hexadecimal digits.

Here we encounter an interesting consequence of the design of Invisible XML version 1.0. Although for the “\\” case, we can suppress one backslash and output the other, there’s nothing we can do, for example, to replace “\t” with a literal tab character. Instead, we leave <tab/>, etc. in the output where they can be cleaned up later.

$ coffeepot -v -g:examples/prop4.ixml -i:examples/example.properties --describe-ambiguity -pp
Loading ixml grammar: examples/prop4.ixml
Loading input from examples/example.properties
<property-file>
   <comment> This is a comment.</comment>
   <name-value>
      <name>name1</name>
      <value>value1</value>
   </name-value>
   <name-value>
      <name>name2</name>
      <value>value2</value>
   </name-value>
   <name-value>
      <name>name3</name>
      <value>apple,banana,pear</value>
   </name-value>
   <blank/>
   <name-value>
      <name>name4</name>
      <value>a<tab/>b</value>
   </name-value>
   <name-value>
      <name>name5</name>
      <value>a<uref>2192</uref>b </value>
   </name-value>
   <name-value>
      <name>name6</name>
      <value>c:\path\to\thing</value>
   </name-value>
</property-file>
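
The clean-up step can be done in any XML-aware tool. As one possible sketch, the Python below uses the standard xml.etree.ElementTree module to replace the marker elements with the characters they stand for. The names REPLACEMENTS and flatten_value are invented for this example, and the embedded XML is a trimmed copy of the output above rather than a live call to coffeepot.

```python
import xml.etree.ElementTree as ET

# Marker elements left by the grammar and the characters they represent.
REPLACEMENTS = {"tab": "\t", "cr": "\r", "nl": "\n"}


def flatten_value(value):
    """Return the text of a <value> element with escape markers resolved."""
    parts = [value.text or ""]
    for child in value:
        if child.tag == "uref":
            # Four hexadecimal digits naming a code point.
            parts.append(chr(int(child.text, 16)))
        else:
            parts.append(REPLACEMENTS[child.tag])
        # Text that follows the marker element inside <value>.
        parts.append(child.tail or "")
    return "".join(parts)


doc = ET.fromstring(
    "<property-file>"
    "<name-value><name>name4</name><value>a<tab/>b</value></name-value>"
    "<name-value><name>name5</name>"
    "<value>a<uref>2192</uref>b </value></name-value>"
    "</property-file>"
)
props = {nv.findtext("name"): flatten_value(nv.find("value"))
         for nv in doc.iter("name-value")}
print(props)
```

An XSLT or XQuery post-processing step would serve equally well; the point is only that the markers carry enough information to recover the intended characters.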

A.5 Challenges for the reader

The example grammar in this chapter doesn’t cover all of the features of property files. If you’re looking for a challenge, consider these improvements:

  1. The property file format also specifies that unnecessarily escaped characters are allowed, but the escaping is ignored. An occurrence of \" is the same as ".

  2. The property file format allows “=” and “:” to occur in property names if they are escaped as \= and \:, respectively.

  3. In a property file, the “end of file” marks the end of a value. In the grammar presented in this chapter, a terminating newline is required. Can this be fixed?