It'd be great to be able to provide, in addition to an "interestingness" test, a chunk preprocessing strategy.
Here's what I'm interested in using it for:
What I observe when I'm using lithium is that at large chunk sizes, it's often the case that a chunk could have been removed (i.e., the file was interesting without it) except that a poorly placed chunk boundary led to a syntax error. Since removing large chunks early on drastically reduces runtimes, I'd really like to help those large chunk removals succeed. It seems to me that a small degree of knowledge about the syntax of the file that lithium is processing would go a long way.
What I plan to do as a first attempt is to process the input file with pygments, a Python library for syntax highlighting. Its highlighting definitions are mostly implemented with regular expressions, but each lexer also maintains a state stack to support grammars that involve nesting. Essentially, pygments provides simple parsers for a very large number of programming languages.
Most of the information that pygments produces is specific to a particular file format, but what interests me is the stack. I'd expect that if a chunk has exactly the same stack at its beginning and end (i.e., none of the original stack entries have been popped, and everything pushed in the interim has been popped off again), then it's much more likely to be syntactically correct, and hence removable. Given this information, we can move chunk boundaries around in a way that should help us remove more large chunks.
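To make this concrete, here's a rough, untested sketch of the kind of helper I have in mind. One caveat: pygments' ordinary RegexLexer doesn't expose its state stack while iterating, so this approximates the stack with a bracket-depth count over the token stream (the names `line_depths` and `chunk_is_balanced` are just placeholders):

```python
from pygments.lexers import get_lexer_for_filename
from pygments.token import Punctuation

OPEN, CLOSE = "([{", ")]}"

def line_depths(filename, text):
    """Nesting depth at the start of each line, as a proxy for the lexer stack."""
    lexer = get_lexer_for_filename(filename)
    depth, depths = 0, [0]
    for tokentype, value in lexer.get_tokens(text):
        # Most lexers emit brackets as Punctuation; some use Operator,
        # so this test may need to be widened per language.
        if tokentype in Punctuation:
            depth += sum(map(value.count, OPEN)) - sum(map(value.count, CLOSE))
        # Token values preserve the input text, so newlines in them
        # mark line boundaries.
        depths.extend([depth] * value.count("\n"))
    return depths

def chunk_is_balanced(depths, start, end):
    """True if removing lines [start, end) leaves bracket nesting intact."""
    return (depths[start] == depths[end]
            and min(depths[start:end + 1]) >= depths[start])
```

A chunk preprocessing hook could then nudge a chunk's boundaries to the nearest lines where the depths match, rather than letting the removal fail outright.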
The pygments stuff is speculative at this point, and it may be a bit much to include in upstream lithium (though I'd be happy to fold it in if there's interest). I think that offering a general way to preprocess the chunks selected for each pass in this fashion would probably be quite useful in all sorts of ways, though.
Does that sound like a feature you'd like to include?
This does sound interesting, although having it as an experiment / branch might be a better way to start off. Note that #11 is a primitive way to parse the syntax by looking for matching closing braces/square brackets.
With your method, we might even be able to remove try ... catch blocks.
Oh nice! Thanks for pointing out #11, that looks like a big win.
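To illustrate the try ... catch point with the (hypothetical) helpers from my sketch above: a chunk that spans a complete block has the same depth at both ends, while one that cuts through it doesn't, so the heuristic would steer lithium toward trying the whole block:

```python
js = """f();
try {
    g();
} catch (e) {
    h();
}
i();
"""
depths = line_depths("test.js", js)
print(chunk_is_balanced(depths, 1, 6))  # whole try/catch block -> True
print(chunk_is_balanced(depths, 1, 4))  # cut mid-block -> False
```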
I agree, this should definitely start out as an experiment. I'm planning to hack something together within the next month or two, so I'll come back then and report my initial results. I'm hoping that a big payoff is possible with a relatively small amount of work.