Skip to content

Add language-specific node filtering #190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from
Closed

Add language-specific node filtering #190

wants to merge 13 commits into from

Conversation

zenspider
Copy link
Contributor

99% of the work wound up in sexp_processor, which got released to beta yesterday. This has minor patches against Flay that will get pushed upstream and released.

This also has changes to parsing strategy. In my measurements, if a (especially foreign language) file can't parse in 10 seconds, it can't parse in 5 minutes (the default timeout for CommandLineRunner), basically chewing up 1/3rd of the overall default timeout for no benefit. By changing them to 10s, I timeout on only the most egregious files but now finish w/o a timeout error on all the projects.

The only thing I find really icky is the new_nodes array in lib/ccflay.rb. This is needed for the stable hash algorithm. It is probably missing node types and it would be nice if there was a comprehensive list, but the foreign parsers are messy and relatively undocumented as such. (I'd highly recommend adding a rewrite phase to cleane and speed things up even further).

zenspider added 12 commits May 22, 2017 15:29
This code also adds a bunch of node names to Sexp::NODE_NAMES for JS
and PHP. It is probably not exhaustive.
  * Fixed bug setting line number for hash literals to line of opening brace.

Which changes line numbers in the tests here.
By observation, either an external parser is going to finish quickly
or it probably isn't going to finish at all (at least within the total
run timeout thresholds). Currently the timeout is 300s (5 minutes) out
of a total run timeout of 900s (15m). That means 3 large files will
kill off an entire project run.

I've found that even rather javascript large files all parse within 2
seconds and those that don't don't parse within 5 minutes. Dropping
the thresholds to 10s seems safe and will allow more projects to
finish analyzing to report SOMETHING.

Besides, all the files I've seen time out were vendored versions of
react/ember and the like. They're huge and they're NOT actually part
of the project. Losing them quickly will not reduce the quality of
analysis.
I should probably make these 1 column and put it at the end of the
file for maintainability.
So now a .codeclimate.yml file can have an entry like:

  - ruby:
      filters:
        - "(defn, [m /^test_/], _, ___)"

and that will filter out all methods whose names start with "test_".

This requires releases of flay and sexp_processor to pass tests.

Pushing now to get eyeballs on.
AFAIK, I have NO projects to test against from the testing group. So I
don't know if that value is too small. I suspect it'll be on par with
my data from javascript and PHP tho (ie, if it can't parse in 10s, it
can't parse in 5m).
Includes a failing spec from:

#188

made to pass by filtering "(hash ___)" (for now?).
@zenspider
Copy link
Contributor Author

zenspider commented Jun 14, 2017

As extra datapoints:

  1. the big_files project used to time out at 900s, even with only source 3 files in it.
  2. Profiling and changes to both codeclimate and codeclimate-duplication dropped runtime down to ~300s.
  3. Filtering (everything) drops the time down to ~30s.

Copy link
Contributor

@dblandin dblandin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

Is the default parsing timeout value of 300 useful anymore? Would it make sense to change that default to 10 and override it to 30 within the ruby parser?

Would it make sense to define some default filters when none are defined with the codeclimate yml configuration?

@@ -197,5 +235,18 @@ def self.from_remediation_amount(amount)
def engine_conf
EngineConfig.new({})
end

def filtered_engine_conf *patterns
EngineConfig.new({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this indentation intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional, yes. Good? probably not. It's how emacs does it by default (more like lisp indentation).

expect([locations[0].begin_line, locations[0].end_line]).to eq([2, 7])
expect([locations[1].begin_line, locations[1].end_line]).to eq([11, 16])
expect([locations[0].begin_line, locations[0].end_line]).to eq([1, 7])
expect([locations[1].begin_line, locations[1].end_line]).to eq([10, 16])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what caused these line locations to change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably fixes to ruby_parser. I have a release coming out soon that makes more of these fixes so there might be more failed tests coming up.

def filters_for(language)
fetch_language(language).fetch("filters", []).map { |s|
Sexp::Matcher.parse s
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the s variable represent? Does it stand for string? What do you think of fleshing it out to "filter" if we're mapping over filters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@zenspider
Copy link
Contributor Author

@dblandin:

Is the default parsing timeout value of 300 useful anymore? Would it make sense to change that default to 10 and override it to 30 within the ruby parser?

I suspect it probably never made sense. 30 or 10 are more more sensible defaults based on the numbers I've gathered. I'd be fine dropping the numbers... but it might be better if data were gathered so that statistics could decide what the numbers should be.

Would it make sense to define some default filters when none are defined with the codeclimate yml configuration?

Absolutely. But I suspect at this time it would be premature to decide what those should be. @ABaldwinHunter and others would have more experienced/valid opinions on this matter than I would.

@dblandin
Copy link
Contributor

This is available on the beta channel now:

engines:
  duplication:
    enabled: true
    channel: beta
    config:
      languages:
      - ruby

@dblandin
Copy link
Contributor

@gettinToasty Would you mind sharing your .codeclimate.yml file?

@gettinToasty
Copy link

gettinToasty commented Jun 22, 2017

@dblandin I managed to fix that issue-- the new readme says the format should be

config:
    languages:
        javascript:
          filters:

but it should be:

config:
      languages:
      - javascript:
          filters:

i am having some trouble with the pattern matching though. my main problem is propTypes being marked as duplicates so i tried - "(object (lit _) (PropTypes _) ___)" as well as - "(object ___)" but didn't get any fixed results

@dblandin
Copy link
Contributor

i am having some trouble with the pattern matching though. my main problem is propTypes being marked as duplicates so i tried - "(object (lit _) (PropTypes _) ___)" as well as - "(object ___)" but didn't get any fixed results

@zenspider should be able to help with the pattern syntax ^

@gettinToasty Could you provide an example case of detected duplication you're trying to filter out?

@gettinToasty
Copy link

@dblandin sure thing! here are some lightly modified code snippets:

EventBlock.propTypes = {
  heading: PropTypes.string.isRequired,
  sub_heading: PropTypes.string.isRequired,
  description: PropTypes.string.isRequired,
  event_description: PropTypes.string.isRequired,
VideoList.propTypes = {
  heading: PropTypes.string.isRequired,
  list: PropTypes.string.isRequired,
  subtext: PropTypes.string.isRequired,
  closing_line: PropTypes.string.isRequired,

which have a Mass 90 duplication issue

@zenspider
Copy link
Contributor Author

Hrm... wrt the config, here is what I'm using:

engines:
  duplication:
    enabled: true
    config:
      debug: true
      languages:
        ruby:
          filters:
          - "([m /./] ___)"
        javascript:
          filters:
          - "([m /./] ___)"
        php:
          filters:
          - "([m /./] ___)"

(intentionally nuking EVERYTHING, to see full coverage and its corresponding performance differences)

@zenspider
Copy link
Contributor Author

OK. WRT the code... this is a doozy. I am not fond of the AST that comes back from javascript, but after our munging, this is how it looks:

s(:Program,
 :module,
 s(:body,
  s(:ExpressionStatement,
   :expression,
   s(:AssignmentExpression,
    :"=",
    :left,
    s(:MemberExpression,
     :object,
     s(:Identifier, :EventBlock),
     :property,
     s(:Identifier, :propTypes)),
    s(:right,
     :ObjectExpression,
     s(:properties,
      s(:ObjectProperty,
       :key,
       s(:Identifier, :heading),
       :value,
       s(:MemberExpression,
        :object,
        s(:MemberExpression,
         :object,
         s(:Identifier, :PropTypes),
         :property,
         s(:Identifier, :string)),
        :property,
        s(:Identifier, :isRequired))),
...

so you probably want something more like:

(AssignmentExpression _ _ (MemberExpression object _ property _) ___)

OR:

(right ObjectExpression (properties ___))

@gettinToasty
Copy link

gettinToasty commented Jun 22, 2017

@zenspider would something like (AssignmentExpression _ ([m /propTypes$/])) work as well, considering that all proptype validations have that on the left?

@zenspider
Copy link
Contributor Author

I'd also really like some feedback on how you (or anyone) thinks that you should be able to see the AST to figure that out. Right now I have a patch on CCD that I run inside docker that dumps it. It is incredibly verbose for javascript (as hinted, the AST is ginormous).

I could probably have this set up to output to the debug log if CODECLIMATE_DEBUG is set... but I'm not sure codeclimate employees want to see that normally. I'm totally open to suggestion.

@gettinToasty
Copy link

@zenspider as someone completely new to this syntax there's no way i would have been able to get anywhere close without seeing the AST haha, but that first expression you gave seems to do the trick really well

@zenspider
Copy link
Contributor Author

Yeah... The JS parser in particular is a little crazy... I want to expose SOMETHING to make it easier, just not sure what it should be yet...

Maybe the filtering should just be against the structure of the tree and not the tree itself? @gordondiggs ?

@gdiggs
Copy link
Contributor

gdiggs commented Jun 27, 2017

@zenspider can you explain the different options there a bit?

@zenspider
Copy link
Contributor Author

@gordondiggs I've been mulling over this for a week and I think it'd be a mistake...

So, the difference... given some code:

def x(y)
  y + 1
end

The raw parse tree looks like:

s(:defn, :x, s(:args, :y), s(:call, s(:lvar, :y), :+, s(:lit, 1)))

and the structural tree looks like:

s(:defn, s(:args), s(:call, s(:lvar), s(:lit)))

which is easier to read and perhaps easier to write filters against, but it also means you can't filter on, for example, method names.

I can totally imagine people filtering on something like: (defn [m /^test_/] ___)

@zenspider
Copy link
Contributor Author

I found a tool: http://astexplorer.net It lets you visualize and explore the AST in a few languages, with the main focus being javascript. It might help (although right now we're massaging our AST for performance reasons which might make it a bit trickier).

@zenspider
Copy link
Contributor Author

Just pushed PR #193 that allows you to dump the AST to better help with writing filters.

@dblandin
Copy link
Contributor

I'm pretty sure that this has been merged upstream, so I'm going to go ahead and close this PR. Let me know if that's not the case!

@dblandin dblandin closed this Aug 31, 2017
@gettinToasty
Copy link

@dblandin i believe this functionality is only on the beta branch and not the main branch; i put up a new PR trying to use filters first with the beta branch which had what looked like some engine changes as well (new issues discovered) but when i removed channel: beta it took away all of the duplication fixes as a result of the filter

@hiattp
Copy link

hiattp commented Jan 11, 2019

@gettinToasty I know this ticket is old but we're trying to filter out PropTypes similarity warnings as well...happen to remember what filter you ended up using? #89 suggests that changing the mass_threshold may be the best bet, but @zenspider's filtering looks more robust if PropType definitions are the only issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants