Skip to content
forked from jawher/immanix

A Java library for processing (large) xml files using parser combinators-like approach

License

Notifications You must be signed in to change notification settings

sshikov/immanix

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Immanix

A Java library for processing (large) xml files using parser combinators-like approach. Please see below for a more detailed explanation on how to use Immanix.

Introduction

Introducing Immanix: a Java library to process XML using parser combinators

Using this library

To use this library in your projects, add the following to the dependencies section of your pom.xml (immanix is in central):

<dependency>
  <groupId>com.plecting</groupId>
  <artifactId>immanix</artifactId>
  <version>1.0.0</version>
</dependency>

Given an xml file that looks like the following:

<root>
    <entry>
        <value>[...]</value>
    </entry>
    <entry>
        <value>[...]</value>
        <extra>[...]</extra>
    </entry>
    <entry>
        <value>[...]</value>
    </entry>
    :
    :
</root>

Where the element can appear any number of times, and the element is optional (may be present or not), a parser able to process this xml can be assembled by combining simple and elemetary parsers.

We're going to start by matching the element, which in StAX language can be thought of as a sequence of 3 events: element start, characters and element end.

An element start event can be matched by the StartElementParser. but here, we actually need to match an element start with a precise name "value", so we're going to use NamedStartElementMatcher instead:

StaxMatcher<StartElement> s = new NamedStartElementMatcher("value");

We can also use a factory method to save ourselves a couple of keystrokes:

StaxMatcher<StartElement> s = start("value");

The produced object s is an elementary matcher (or parser) that'll read an event from the XML stream, and check if it's a start element event, and if the element has the supplied name. If this is the case, it succeeds and returns the start element event as a result, if not, it'll fail and no events will be consumed from the stream.

Afterwards, we need to match the textual content of the element. The CharsMatcher can be used to this end:

StaxMatcher<String> t = new CharsMatcher();

Or by using the factory method:

StaxMatcher<String> t = chars();

Notice that the produced matcher is parametrized with the String type, which indicates that if the matcher succeeds, a String is returned.

We need now to combine these 2 matchers (s and t) into a new matcher that'll match a start element named "value", then a textual content. the SuccMatcher can be used to this end:

StaxMatcher<Tuple2<StartElement, String>> st = new SuccMatcher<StartElement, String>(s, t);

There's a shorter syntax to achieve the same:

StaxMatcher<Tuple2<StartElement, String>> st = s.then(t);

The combined matcher returns a Tuple2 (a pair of objects) of a StartElement event and a String if it succeeds.

Usuallay, we won't be interested in the start element event, but only in the textual content of the element for the processing we intend to perform. In such cases, we can use another variant of the succession matcher which only keeps the result of one of its child matchers (left or right):

StaxMatcher<String> st = new RightSuccMatcher<StartElement, String>(s, t);

Or with the shorter syntax:

StaxMatcher<String> st = s.thenr(t);

Notice that the result matcher is parametrized with the String type only, as the StartElement event is discarded.

We then combine the st matcher with another matcher that accepts the "value" end element event, and as we won't need this latter in the processing we intend to perform later, we're going to use a LeftSuccMatcher:

StaxMatcher<String> ste = st.thenl(end("value");

Combining all these steps, we end up with this code:

StaxMatcher<StartElement> s = start("value");
StaxMatcher<String> t = chars();
StaxMatcher<String> st = s.thenr(t);
StaxMatcher<String> ste = st.thenl(end("value");

Or in one line:

StaxMatcher<String> value = start("value").thenr(chars()).thenl(end("value"));

Using the same process for the element, we end up with:

StaxMatcher<String> extra = start("extra").thenr(chars()).thenl(end("extra"));

Except that the element is optional. If it doesn't appear in the XML file, the parsing will fail. To handle such cases, a matcher can be marked as optional by simply calling the method optional on it:

extra = extra.optional();

Or in one line:

StaxMatcher<String> extra = start("extra").thenr(chars()).thenl(end("extra")).optional();

We need now to combine the value matcher with the extra matcher. As we need both of their results, we combine them using then:

StaxMatcher<Tuple2<String, String>>  valueExtra = value.then(extra);

Augmenting this matcher to handle the element can be achieved this way :

StaxMatcher<Tuple2<String, String>>  entry =  start("entry").thenr(valueExtra).thenl(end("entry"));

Once an entry is matched, we usually need to perform an operation on it, inserting it in a database for example. To inject code to be executed when a matcher succeeds, the Mapper matcher can be used:

Mapper<Void> execEntry = new Mapper<Tuple2<String, String>, T>(entry) {
    public Void process(Tuple2<String, String> data) {
        // insert data into the database
        return null;
    }
}

Mapper is actually a matcher that transforms the results of a delegate matcher into another type/value. In this case, and since. We sort of abused it here by using it to process the result and returning null.

As said earlier, the element can appear any number of times in the file. To handle this case, the entry matcher can be augmented to handle repetition by calling one of the repetition methods on it (zeroOrMore, zeroOrMoreButFinite, oneOrMore, oneOrMoreButFinite, atLeast, atLeastButFinite and atMost). In this case, and since we'll want to handle any arbitrary number of repetitions, including none, zeroOrMore is best suited.

StaxMatcher<Void> entries = execEntry.zeroOrMore();

Calling zeroOrMore on a matcher produces a new matcher that throws away its delegate matcher result (unlike zeroOrMoreButFinite which accumulates them in a list). This is done to ensure that your code won't choke on large enough XML files (think about the cost of accumulating millions of intermediary results in a list in the memory).

In this case, this behavior has been accounted for as the execEntry processes its result as its produced, and so it can be safely thrown away afterwards.

Finally, it remains to handle the element and the document start/end events:

StaxMatcher<Void> almighty = startDoc().then(start("root")).thenr(entries).thenl(end("root")).thenl(endDoc());

We ended up with a combined matcher that can match the whole XML file by calling its match method on a XMLEventReader:

XMLInputFactory inputFactory = createInputFactory();
XMLEventReader evr = inputFactory.createXMLEventReader(new FileInputStream("/some/file.xml"));

MatcherResult<Void> result = almighty.match(evr);
if(result.isFailure()){
    //handle errors
} else {
    //everything went well
}

Here's how the whole program looks like (or as a Gist if you prefer):

StaxMatcher<String> value = start("value").thenr(chars()).thenl(end("value"));

StaxMatcher<String> extra = start("extra").thenr(chars()).thenl(end("extra"));

StaxMatcher<Tuple2<String, String>>  entry =  start("entry").thenr(value).then(extra).thenl(end("entry"));

Mapper<Void> execEntry = new Mapper<Tuple2<String, String>, T>(entry) {
    public Void process(Tuple2<String, String> data) {
        // insert data into the database
        return null;
    }
}

StaxMatcher<Void> entries = execEntry.zeroOrMore();

StaxMatcher<Void> almighty = startDoc().then(start("root")).thenr(entries).thenl(end("root")).thenl(endDoc());

XMLInputFactory inputFactory = createInputFactory();
XMLEventReader evr = inputFactory.createXMLEventReader(new FileInputStream("/some/file.xml"));

MatcherResult<Void> result = almighty.match(evr);

if(result.isFailure()){
    //handle errors
} else {
    //everything went well
}

License

See LICENSE for details (hint: it's MIT).

Building

You need a Java 5 (or newer) environment and Maven 3 installed:

$ mvn --version

Apache Maven 2.2.1 (r801777; 2009-08-06 20:16:01+0100)
Java version: 1.6.0_20
Java home: /usr/lib/jvm/java-6-openjdk/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux" version: "2.6.35-24-generic" arch: "amd64" Family: "unix"

You should now be able to do a full build of immanix:

$ git clone git://github.com/jawher/immanix.git
$ cd immanix
$ mvn clean install

Troubleshooting

Please consider using Github issues tracker to submit bug reports or feature requests.

About

A Java library for processing (large) xml files using parser combinators-like approach

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Java 100.0%