It'd be great to be able to provide, in addition to an "interestingness" test, a chunk preprocessing strategy.
Here's what I'm interested in using it for:
What I observe when I'm using lithium is that at large chunk sizes, it's often the case that a chunk could have been removed (i.e., the file was interesting without it) except that a poorly placed chunk boundary led to a syntax error. Since removing large chunks early on drastically reduces runtimes, I'd really like to help those large chunk removals succeed. It seems to me that a small degree of knowledge about the syntax of the file that lithium is processing would go a long way.
What I plan to do as a first attempt is to process the input file with pygments, a Python library for syntax highlighting. Its highlighting definitions are mostly implemented with regular expressions, but each lexer also maintains a state stack to support grammars that involve nesting. Essentially, pygments provides simple parsers for a very large number of programming languages.
Most of the information that pygments produces is specific to a particular file format, but what interests me is the stack. I'd expect that if a chunk has exactly the same stack at its beginning and end (i.e., none of the original stack entries have been popped, and everything pushed in the interim has been popped off again), then it's much more likely to be syntactically correct, and hence removable. Given this information, we can move chunk boundaries around in a way that should help us remove more large chunks.
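To make this concrete, here's a rough, untested sketch of the kind of helper I have in mind. One caveat: pygments' ordinary RegexLexer doesn't expose its state stack while iterating, so this approximates the stack with a bracket-depth count over the token stream (the names `line_depths` and `chunk_is_balanced` are just placeholders):

```python
from pygments.lexers import get_lexer_for_filename
from pygments.token import Punctuation

OPEN, CLOSE = "([{", ")]}"

def line_depths(filename, text):
    """Nesting depth at the start of each line, as a proxy for the lexer stack."""
    lexer = get_lexer_for_filename(filename)
    depth, depths = 0, [0]
    for tokentype, value in lexer.get_tokens(text):
        # Most lexers emit brackets as Punctuation; some use Operator,
        # so this test may need to be widened per language.
        if tokentype in Punctuation:
            depth += sum(map(value.count, OPEN)) - sum(map(value.count, CLOSE))
        # Token values preserve the input text, so newlines in them
        # mark line boundaries.
        depths.extend([depth] * value.count("\n"))
    return depths

def chunk_is_balanced(depths, start, end):
    """True if removing lines [start, end) leaves bracket nesting intact."""
    return (depths[start] == depths[end]
            and min(depths[start:end + 1]) >= depths[start])
```

A chunk preprocessing hook could then nudge a chunk's boundaries to the nearest lines where the depths match, rather than letting the removal fail outright.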
The pygments stuff is speculative at this point, and it may be a bit much to include in upstream lithium (though I'd be happy to fold it in if there's interest). I think that offering a general way to preprocess the chunks selected for each pass in this fashion would probably be quite useful in all sorts of ways, though.
Does that sound like a feature you'd like to include?
This does sound interesting, although having it as an experiment / branch might be a better way to start off. Note that #11 is a primitive way to parse the syntax by looking for matching closing braces/square brackets.
With your method, we might even be able to remove try ... catch blocks.
Oh nice! Thanks for pointing out #11, that looks like a big win.
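To illustrate the try ... catch point with the (hypothetical) helpers from my sketch above: a chunk that spans a complete block has the same depth at both ends, while one that cuts through it doesn't, so the heuristic would steer lithium toward trying the whole block:

```python
js = """f();
try {
    g();
} catch (e) {
    h();
}
i();
"""
depths = line_depths("test.js", js)
print(chunk_is_balanced(depths, 1, 6))  # whole try/catch block -> True
print(chunk_is_balanced(depths, 1, 4))  # cut mid-block -> False
```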
I agree, this should definitely start out as an experiment. I'm planning to hack something together within the next month or two, so I'll come back then and report my initial results. I'm hoping that a big payoff is possible with a relatively small amount of work.