Add language-specific node filtering #190

zenspider · 2017-06-14T22:56:15Z

99% of the work wound up in sexp_processor, which got released to beta yesterday. This has minor patches against Flay that will get pushed upstream and released.

This also has changes to parsing strategy. In my measurements, if a (especially foreign language) file can't parse in 10 seconds, it can't parse in 5 minutes (the default timeout for CommandLineRunner), basically chewing up 1/3rd of the overall default timeout for no benefit. By changing them to 10s, I timeout on only the most egregious files but now finish w/o a timeout error on all the projects.

The only thing I find really icky is the new_nodes array in lib/ccflay.rb. This is needed for the stable hash algorithm. It is probably missing node types and it would be nice if there was a comprehensive list, but the foreign parsers are messy and relatively undocumented as such. (I'd highly recommend adding a rewrite phase to cleane and speed things up even further).

…verything.

This code also adds a bunch of node names to Sexp::NODE_NAMES for JS and PHP. It is probably not exhaustive.

* Fixed bug setting line number for hash literals to line of opening brace. Which changes line numbers in the tests here.

By observation, either an external parser is going to finish quickly or it probably isn't going to finish at all (at least within the total run timeout thresholds). Currently the timeout is 300s (5 minutes) out of a total run timeout of 900s (15m). That means 3 large files will kill off an entire project run. I've found that even rather javascript large files all parse within 2 seconds and those that don't don't parse within 5 minutes. Dropping the thresholds to 10s seems safe and will allow more projects to finish analyzing to report SOMETHING. Besides, all the files I've seen time out were vendored versions of react/ember and the like. They're huge and they're NOT actually part of the project. Losing them quickly will not reduce the quality of analysis.

I should probably make these 1 column and put it at the end of the file for maintainability.

So now a .codeclimate.yml file can have an entry like: - ruby: filters: - "(defn, [m /^test_/], _, ___)" and that will filter out all methods whose names start with "test_". This requires releases of flay and sexp_processor to pass tests. Pushing now to get eyeballs on.

AFAIK, I have NO projects to test against from the testing group. So I don't know if that value is too small. I suspect it'll be on par with my data from javascript and PHP tho (ie, if it can't parse in 10s, it can't parse in 5m).

Includes a failing spec from: #188 made to pass by filtering "(hash ___)" (for now?).

…parser_base.

zenspider · 2017-06-14T22:57:55Z

As extra datapoints:

the big_files project used to time out at 900s, even with only source 3 files in it.
Profiling and changes to both codeclimate and codeclimate-duplication dropped runtime down to ~300s.
Filtering (everything) drops the time down to ~30s.

dblandin

This looks great!

Is the default parsing timeout value of 300 useful anymore? Would it make sense to change that default to 10 and override it to 30 within the ruby parser?

Would it make sense to define some default filters when none are defined with the codeclimate yml configuration?

dblandin · 2017-06-15T17:25:33Z

spec/cc/engine/analyzers/ruby/main_spec.rb

@@ -197,5 +235,18 @@ def self.from_remediation_amount(amount)
    def engine_conf
      EngineConfig.new({})
    end
+
+    def filtered_engine_conf *patterns
+      EngineConfig.new({


Is this indentation intentional?

Intentional, yes. Good? probably not. It's how emacs does it by default (more like lisp indentation).

dblandin · 2017-06-15T17:26:51Z

spec/cc/engine/analyzers/sexp_lines_spec.rb

-        expect([locations[0].begin_line, locations[0].end_line]).to eq([2, 7])
-        expect([locations[1].begin_line, locations[1].end_line]).to eq([11, 16])
+        expect([locations[0].begin_line, locations[0].end_line]).to eq([1, 7])
+        expect([locations[1].begin_line, locations[1].end_line]).to eq([10, 16])


I'm curious what caused these line locations to change.

Probably fixes to ruby_parser. I have a release coming out soon that makes more of these fixes so there might be more failed tests coming up.

dblandin · 2017-06-15T17:30:14Z

lib/cc/engine/analyzers/engine_config.rb

+        def filters_for(language)
+          fetch_language(language).fetch("filters", []).map { |s|
+            Sexp::Matcher.parse s
+          }


What does the s variable represent? Does it stand for string? What do you think of fleshing it out to "filter" if we're mapping over filters?

zenspider · 2017-06-15T19:56:49Z

@dblandin:

Is the default parsing timeout value of 300 useful anymore? Would it make sense to change that default to 10 and override it to 30 within the ruby parser?

I suspect it probably never made sense. 30 or 10 are more more sensible defaults based on the numbers I've gathered. I'd be fine dropping the numbers... but it might be better if data were gathered so that statistics could decide what the numbers should be.

Would it make sense to define some default filters when none are defined with the codeclimate yml configuration?

Absolutely. But I suspect at this time it would be premature to decide what those should be. @ABaldwinHunter and others would have more experienced/valid opinions on this matter than I would.

dblandin · 2017-06-22T16:38:21Z

This is available on the beta channel now:

engines:
  duplication:
    enabled: true
    channel: beta
    config:
      languages:
      - ruby

dblandin · 2017-06-22T20:10:20Z

@gettinToasty Would you mind sharing your .codeclimate.yml file?

gettinToasty · 2017-06-22T20:17:50Z

@dblandin I managed to fix that issue-- the new readme says the format should be

config:
    languages:
        javascript:
          filters:

but it should be:

config:
      languages:
      - javascript:
          filters:

i am having some trouble with the pattern matching though. my main problem is propTypes being marked as duplicates so i tried - "(object (lit _) (PropTypes _) ___)" as well as - "(object ___)" but didn't get any fixed results

dblandin · 2017-06-22T20:25:47Z

i am having some trouble with the pattern matching though. my main problem is propTypes being marked as duplicates so i tried - "(object (lit _) (PropTypes _) ___)" as well as - "(object ___)" but didn't get any fixed results

@zenspider should be able to help with the pattern syntax ^

@gettinToasty Could you provide an example case of detected duplication you're trying to filter out?

gettinToasty · 2017-06-22T20:29:23Z

@dblandin sure thing! here are some lightly modified code snippets:

EventBlock.propTypes = {
  heading: PropTypes.string.isRequired,
  sub_heading: PropTypes.string.isRequired,
  description: PropTypes.string.isRequired,
  event_description: PropTypes.string.isRequired,

VideoList.propTypes = {
  heading: PropTypes.string.isRequired,
  list: PropTypes.string.isRequired,
  subtext: PropTypes.string.isRequired,
  closing_line: PropTypes.string.isRequired,

which have a Mass 90 duplication issue

zenspider · 2017-06-22T20:34:55Z

Hrm... wrt the config, here is what I'm using:

engines:
  duplication:
    enabled: true
    config:
      debug: true
      languages:
        ruby:
          filters:
          - "([m /./] ___)"
        javascript:
          filters:
          - "([m /./] ___)"
        php:
          filters:
          - "([m /./] ___)"

(intentionally nuking EVERYTHING, to see full coverage and its corresponding performance differences)

zenspider · 2017-06-22T20:49:46Z

OK. WRT the code... this is a doozy. I am not fond of the AST that comes back from javascript, but after our munging, this is how it looks:

s(:Program,
 :module,
 s(:body,
  s(:ExpressionStatement,
   :expression,
   s(:AssignmentExpression,
    :"=",
    :left,
    s(:MemberExpression,
     :object,
     s(:Identifier, :EventBlock),
     :property,
     s(:Identifier, :propTypes)),
    s(:right,
     :ObjectExpression,
     s(:properties,
      s(:ObjectProperty,
       :key,
       s(:Identifier, :heading),
       :value,
       s(:MemberExpression,
        :object,
        s(:MemberExpression,
         :object,
         s(:Identifier, :PropTypes),
         :property,
         s(:Identifier, :string)),
        :property,
        s(:Identifier, :isRequired))),
...

so you probably want something more like:

(AssignmentExpression _ _ (MemberExpression object _ property _) ___)

OR:

(right ObjectExpression (properties ___))

gettinToasty · 2017-06-22T20:52:21Z

@zenspider would something like (AssignmentExpression _ ([m /propTypes$/])) work as well, considering that all proptype validations have that on the left?

zenspider · 2017-06-22T20:54:19Z

I'd also really like some feedback on how you (or anyone) thinks that you should be able to see the AST to figure that out. Right now I have a patch on CCD that I run inside docker that dumps it. It is incredibly verbose for javascript (as hinted, the AST is ginormous).

I could probably have this set up to output to the debug log if CODECLIMATE_DEBUG is set... but I'm not sure codeclimate employees want to see that normally. I'm totally open to suggestion.

gettinToasty · 2017-06-22T21:05:32Z

@zenspider as someone completely new to this syntax there's no way i would have been able to get anywhere close without seeing the AST haha, but that first expression you gave seems to do the trick really well

zenspider · 2017-06-23T00:03:35Z

Yeah... The JS parser in particular is a little crazy... I want to expose SOMETHING to make it easier, just not sure what it should be yet...

Maybe the filtering should just be against the structure of the tree and not the tree itself? @gordondiggs ?

gdiggs · 2017-06-27T16:19:56Z

@zenspider can you explain the different options there a bit?

zenspider · 2017-06-29T19:54:23Z

@gordondiggs I've been mulling over this for a week and I think it'd be a mistake...

So, the difference... given some code:

def x(y)
  y + 1
end

The raw parse tree looks like:

s(:defn, :x, s(:args, :y), s(:call, s(:lvar, :y), :+, s(:lit, 1)))

and the structural tree looks like:

s(:defn, s(:args), s(:call, s(:lvar), s(:lit)))

which is easier to read and perhaps easier to write filters against, but it also means you can't filter on, for example, method names.

I can totally imagine people filtering on something like: (defn [m /^test_/] ___)

zenspider · 2017-06-29T20:42:48Z

I found a tool: http://astexplorer.net It lets you visualize and explore the AST in a few languages, with the main focus being javascript. It might help (although right now we're massaging our AST for performance reasons which might make it a bit trickier).

zenspider · 2017-07-08T00:06:47Z

Just pushed PR #193 that allows you to dump the AST to better help with writing filters.

dblandin · 2017-08-31T20:28:59Z

I'm pretty sure that this has been merged upstream, so I'm going to go ahead and close this PR. Let me know if that's not the case!

gettinToasty · 2018-01-16T00:16:07Z

@dblandin i believe this functionality is only on the beta branch and not the main branch; i put up a new PR trying to use filters first with the beta branch which had what looked like some engine changes as well (new issues discovered) but when i removed channel: beta it took away all of the duplication fixes as a result of the filter

hiattp · 2019-01-11T17:33:33Z

@gettinToasty I know this ticket is old but we're trying to filter out PropTypes similarity warnings as well...happen to remember what filter you ended up using? #89 suggests that changing the mass_threshold may be the best bet, but @zenspider's filtering looks more robust if PropType definitions are the only issue

zenspider added 12 commits May 22, 2017 15:29

Use and modify defaults from flay instead of redundantly specifying e…

286ae28

…verything.

Update flay and remove code pushed upstream.

dd7866a

This code also adds a bunch of node names to Sexp::NODE_NAMES for JS and PHP. It is probably not exhaustive.

Fixed in latest ruby_parser:

97d98fd

* Fixed bug setting line number for hash literals to line of opening brace. Which changes line numbers in the tests here.

Minor refactoring: extract self.class::LANGUAGE to language method.

3df5317

Updated node additions list.

12cf1be

I should probably make these 1 column and put it at the end of the file for maintainability.

Added pattern-based filtering patch for Flay.

3f2ae1f

Includes a failing spec from: #188 made to pass by filtering "(hash ___)" (for now?).

Updated Gemfile to use the beta version of sexp_processor.

b8ba0d8

Added top-level doco for filtering

72f673e

oops. Need to catch the timeout. Should probably be refactored up to …

b43ec5d

…parser_base.

codeclimate-hermes assigned dblandin Jun 14, 2017

dblandin reviewed Jun 15, 2017

View reviewed changes

minor changes based on reviews

03dbb56

dblandin closed this Aug 31, 2017

rangoo94 mentioned this pull request Apr 3, 2018

Using beta channel with JavaScript (ES6+) throws errors #306

Closed

hiattp unassigned dblandin Jan 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add language-specific node filtering #190

Add language-specific node filtering #190

zenspider commented Jun 14, 2017

zenspider commented Jun 14, 2017 •

edited

Loading

dblandin left a comment

dblandin Jun 15, 2017

zenspider Jun 15, 2017

dblandin Jun 15, 2017

zenspider Jun 15, 2017

dblandin Jun 15, 2017

zenspider Jun 15, 2017

zenspider commented Jun 15, 2017

dblandin commented Jun 22, 2017

dblandin commented Jun 22, 2017

gettinToasty commented Jun 22, 2017 •

edited

Loading

dblandin commented Jun 22, 2017

gettinToasty commented Jun 22, 2017

zenspider commented Jun 22, 2017

zenspider commented Jun 22, 2017

gettinToasty commented Jun 22, 2017 •

edited

Loading

zenspider commented Jun 22, 2017

gettinToasty commented Jun 22, 2017

zenspider commented Jun 23, 2017

gdiggs commented Jun 27, 2017

zenspider commented Jun 29, 2017

zenspider commented Jun 29, 2017

zenspider commented Jul 8, 2017

dblandin commented Aug 31, 2017

gettinToasty commented Jan 16, 2018

hiattp commented Jan 11, 2019

Add language-specific node filtering #190

Add language-specific node filtering #190

Conversation

zenspider commented Jun 14, 2017

zenspider commented Jun 14, 2017 • edited Loading

dblandin left a comment

Choose a reason for hiding this comment

dblandin Jun 15, 2017

Choose a reason for hiding this comment

zenspider Jun 15, 2017

Choose a reason for hiding this comment

dblandin Jun 15, 2017

Choose a reason for hiding this comment

zenspider Jun 15, 2017

Choose a reason for hiding this comment

dblandin Jun 15, 2017

Choose a reason for hiding this comment

zenspider Jun 15, 2017

Choose a reason for hiding this comment

zenspider commented Jun 15, 2017

dblandin commented Jun 22, 2017

dblandin commented Jun 22, 2017

gettinToasty commented Jun 22, 2017 • edited Loading

dblandin commented Jun 22, 2017

gettinToasty commented Jun 22, 2017

zenspider commented Jun 22, 2017

zenspider commented Jun 22, 2017

gettinToasty commented Jun 22, 2017 • edited Loading

zenspider commented Jun 22, 2017

gettinToasty commented Jun 22, 2017

zenspider commented Jun 23, 2017

gdiggs commented Jun 27, 2017

zenspider commented Jun 29, 2017

zenspider commented Jun 29, 2017

zenspider commented Jul 8, 2017

dblandin commented Aug 31, 2017

gettinToasty commented Jan 16, 2018

hiattp commented Jan 11, 2019

zenspider commented Jun 14, 2017 •

edited

Loading

gettinToasty commented Jun 22, 2017 •

edited

Loading

gettinToasty commented Jun 22, 2017 •

edited

Loading