Support for groovy static analysis for groovy scripts #14844

Open: wants to merge 14 commits into base: master

Contributor

@abhishekbafna abhishekbafna commented Jan 20, 2025

Github Issue: #14995

Google Doc: https://docs.google.com/document/d/10-j1hevpwOWzaU8q0ndqv3toRBWTiPfOWy6xHojoiL4

This PR adds static analysis support for Groovy functions. Users can configure the allowed receivers, allowed imports, allowed static imports, and disallowed method names, and can toggle whether method definitions are allowed in Groovy scripts.

It also adds a Groovy config change listener that updates the configuration on the server nodes.

  • REST APIs to get the default/sample Groovy configuration, create or update the Groovy configuration, and get the current Groovy configuration.
  • The AST analysis covers queries and table config updates made through the controller.
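
To make the settings listed above concrete, here is a minimal sketch of how allow and deny lists of this kind might be consulted. The class `StaticAnalyzerSettingsSketch` and its methods are hypothetical and are not the PR's `GroovyStaticAnalyzerConfig`; the real check runs on the Groovy AST at parse time, which this plain-Java sketch does not attempt.

```java
import java.util.List;
import java.util.Set;

// Hypothetical holder for the five settings named in the description.
final class StaticAnalyzerSettingsSketch {
  final Set<String> allowedReceivers;
  final Set<String> allowedImports;
  final Set<String> allowedStaticImports;
  final Set<String> disallowedMethodNames;
  final boolean methodDefinitionAllowed;

  StaticAnalyzerSettingsSketch(List<String> allowedReceivers, List<String> allowedImports,
      List<String> allowedStaticImports, List<String> disallowedMethodNames,
      boolean methodDefinitionAllowed) {
    this.allowedReceivers = Set.copyOf(allowedReceivers);
    this.allowedImports = Set.copyOf(allowedImports);
    this.allowedStaticImports = Set.copyOf(allowedStaticImports);
    this.disallowedMethodNames = Set.copyOf(disallowedMethodNames);
    this.methodDefinitionAllowed = methodDefinitionAllowed;
  }

  // A call passes only if its receiver type is allow-listed
  // and its method name is not deny-listed.
  boolean isCallPermitted(String receiverClass, String methodName) {
    return allowedReceivers.contains(receiverClass)
        && !disallowedMethodNames.contains(methodName);
  }

  // An import passes only if the imported class is allow-listed.
  boolean isImportPermitted(String importedClass) {
    return allowedImports.contains(importedClass);
  }
}
```

In this sketch an empty allow list permits nothing, so a deployment would have to list every receiver it wants scripts to use.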

Testing

  • Unit tests
  • Local testing using quickstart setup

@codecov-commenter

codecov-commenter commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 71.12676% with 41 lines in your changes missing coverage. Please review.

Project coverage is 63.46%. Comparing base (59551e4) to head (79b1c7d).
Report is 1753 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../controller/api/resources/PinotClusterConfigs.java | 0.00% | 31 Missing ⚠️ |
| ...sthandler/BaseSingleStageBrokerRequestHandler.java | 70.58% | 4 Missing and 1 partial ⚠️ |
| ...egment/local/function/GroovyFunctionEvaluator.java | 91.66% | 3 Missing and 1 partial ⚠️ |
| ...va/org/apache/pinot/spi/utils/CommonConstants.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14844      +/-   ##
============================================
+ Coverage     61.75%   63.46%   +1.71%     
- Complexity      207     1483    +1276     
============================================
  Files          2436     2749     +313     
  Lines        133233   154575   +21342     
  Branches      20636    23823    +3187     
============================================
+ Hits          82274    98097   +15823     
- Misses        44911    49080    +4169     
- Partials       6048     7398    +1350     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration | 100.00% <ø> (+99.99%) ⬆️ |
| integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.41% <71.12%> (+1.70%) ⬆️ |
| java-21 | 63.35% <71.12%> (+1.73%) ⬆️ |
| skip-bytebuffers-false | 63.43% <71.12%> (+1.69%) ⬆️ |
| skip-bytebuffers-true | 63.32% <71.12%> (+35.59%) ⬆️ |
| temurin | 63.46% <71.12%> (+1.71%) ⬆️ |
| unittests | 63.45% <71.12%> (+1.71%) ⬆️ |
| unittests1 | 56.08% <81.81%> (+9.19%) ⬆️ |
| unittests2 | 34.02% <64.78%> (+6.29%) ⬆️ |

@soumitra-st
Contributor

I think the checks are disabled by default. Can you create a doc on various configurations supported by Pinot?

@abhishekbafna
Contributor Author

Github issue link: #14995

@abhishekbafna
Contributor Author

I think the checks are disabled by default. Can you create a doc on various configurations supported by Pinot?

@soumitra-st I have created a GitHub issue and linked the document. Please review it. Thank you.

Contributor

@soumitra-st soumitra-st left a comment

LGTM

@npawar
Contributor

npawar commented Feb 7, 2025

Github Issue: #14995

Google Doc: https://docs.google.com/document/d/10-j1hevpwOWzaU8q0ndqv3toRBWTiPfOWy6xHojoiL4

This PR adds static analysis support for Groovy functions. Users can configure the allowed receivers, allowed imports, allowed static imports, and disallowed method names, and can toggle whether method definitions are allowed in Groovy scripts.

  • REST APIs to get the default/sample Groovy configuration, create or update the Groovy configuration, and get the current Groovy configuration.
  • The AST analysis covers queries, ingestion, and table config updates made through the controller.

Testing

  • Unit tests
  • Local testing using quickstart setup

Please add the new config to the PR description, and call out a) where it needs to be set (server/minion/controller conf or cluster config) and b) whether a restart is required when it changes.

@npawar npawar added the release-notes and security labels Feb 7, 2025

@Jackie-Jiang
Contributor

Nice! Is this another attempt at #14197?

@Jackie-Jiang Jackie-Jiang added the documentation and Configuration labels Feb 7, 2025

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Since this is static analysis, we should apply the analysis in 2 places:

  1. When table config gets created/updated, we want to validate the groovy transforms configured
  2. When broker gets a query, we want to validate the groovy transform within it

Take a look at where we block groovy functions right now. We can integrate this logic at the same place.

@abhishekbafna
Contributor Author

Since this is static analysis, we should apply the analysis in 2 places:

  1. When table config gets created/updated, we want to validate the groovy transforms configured
  2. When broker gets a query, we want to validate the groovy transform within it

Take a look at where we block groovy functions right now. We can integrate this logic at the same place.

The analysis is applied at both of the suggested places.

  1. The Groovy expression is validated, and blocked if it contains insecure code, at table creation and update. This happens in the controller.
  2. For queries, the analysis happens on the server, and any failure is reported.

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

I'm not able to find the related change in controller and broker code. Can you point me to the class changed?

@@ -38,7 +38,8 @@ public enum MinionMeter implements AbstractMetrics.Meter {
   SEGMENT_BYTES_UPLOADED("bytes", false),
   RECORDS_PROCESSED_COUNT("rows", false),
   RECORDS_PURGED_COUNT("rows", false),
-  COMPACTED_RECORDS_COUNT("rows", false);
+  COMPACTED_RECORDS_COUNT("rows", false),
+  GROOVY_SECURITY_VIOLATIONS("exceptions", true);
Contributor

Let's clean up the unrelated changes. The metric should be emitted from controller for table config, and broker for query

Contributor Author

I have removed the metric part.

@abhishekbafna
Contributor Author

I'm not able to find the related change in controller and broker code. Can you point me to the class changed?

For both table creation/update and querying, the groovy static analysis is applied during the script parse step. This happens when the GroovyFunctionEvaluator object is initialised.
_script = createSafeShell(_binding, groovyStaticAnalyzerConfig).parse(scriptText);

For table create/update it happens in the controller, and for queries it happens on the server nodes. Attaching screenshots.

The GroovyStaticAnalyzerConfig is a static parameter that is initialized during service start using the configureGroovySecurity method.

When the Groovy security config is updated, the update goes through the controller's POST API, so the controller applies it by calling GroovyFunctionEvaluator.setConfig(groovyConfig).

For the server, updates are applied using the config change listener GroovyConfigChangeListener.
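
The update flow described above (a static config that a change listener swaps at runtime) can be sketched roughly as follows. Both class names and the `onChange` signature are hypothetical and are not the PR's `GroovyConfigChangeListener` or Pinot's listener interface; the key string is the one shown in the `CommonConstants` diff in this PR, and leaving the config unchanged when the key is absent is an assumption of this sketch.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the static analyzer config held by the evaluator.
final class AnalyzerConfigHolder {
  // AtomicReference so that readers always see a fully published value.
  private static final AtomicReference<String> CURRENT = new AtomicReference<>("{}");

  private AnalyzerConfigHolder() {
  }

  static void set(String configJson) {
    CURRENT.set(configJson);
  }

  static String get() {
    return CURRENT.get();
  }
}

// Hypothetical listener: invoked with the full cluster-config map on every change.
final class GroovyConfigListenerSketch {
  static final String KEY = "pinot.server.groovy.static.analyzer";

  void onChange(Map<String, String> clusterConfigs) {
    String updated = clusterConfigs.get(KEY);
    // Assumption: a missing key leaves the current config untouched.
    if (updated != null) {
      AnalyzerConfigHolder.set(updated);
    }
  }
}
```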

Query:
[Screenshot: QC Error Message]
[Screenshot: Query - Call Stack]

Controller:
[Screenshot: Controller Table API Call Stack]

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Currently we apply groovy check in 2 places:

  1. In TableConfigUtils (see the usage of _disableGroovy field)
  2. In BaseSingleStageBrokerRequestHandler (see the usage of _disableGroovy field)

Can we apply the static analysis at the same place? Applying it during ingestion/query execution might already be too late.

* @param groovyConfig GroovyStaticAnalyzerConfig instance to be used for static syntax analysis.
* @return GroovyShell instance with static syntax analysis.
*/
private GroovyShell createSafeShell(Binding binding, GroovyStaticAnalyzerConfig groovyConfig) {
Contributor

Do you anticipate performance overhead for the safe shell?

Contributor Author

No. There should not be any performance impact. The CompilerConfiguration is created only once and reused. It is updated every time the static analyzer config is updated.

@abhishekbafna
Contributor Author

Currently we apply groovy check in 2 places:

  1. In TableConfigUtils (see the usage of _disableGroovy field)
  2. In BaseSingleStageBrokerRequestHandler (see the usage of _disableGroovy field)

Can we apply the static analysis at the same place? Applying it during ingestion/query execution might already be too late.

The static analysis for both ingestion and query is now applied along with the disableGroovy checks.

For ingestion, it happens in TableConfigUtils#validateIngestionConfig when FunctionEvaluatorFactory.getExpressionEvaluator(filterFunction) is called for Groovy scripts.

For queries, I added a new method, groovySecureAnalysis, that runs when Groovy is enabled. Any change to the configuration requires a restart of the broker nodes.

[Screenshot 2025-02-20 at 4 01 16 PM]
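
As a rough sketch of the ordering described in this comment: the existing disable check runs first, and the static analysis runs only when Groovy is enabled. The class name, method signature, and the `Predicate` standing in for the analyzer are all hypothetical, not the broker's actual code.

```java
import java.util.function.Predicate;

// Hypothetical gate combining the disable flag with static analysis.
final class GroovyQueryGateSketch {
  private GroovyQueryGateSketch() {
  }

  // Returns normally if the script may run; throws otherwise.
  static void validateGroovy(String script, boolean disableGroovy,
      Predicate<String> staticAnalysis) {
    if (disableGroovy) {
      // Groovy is switched off entirely, so no analysis is needed.
      throw new IllegalStateException("Groovy functions are disabled for queries");
    }
    if (!staticAnalysis.test(script)) {
      throw new IllegalStateException("Groovy script failed static analysis");
    }
  }
}
```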

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

I don't see the change in org.apache.pinot.segment.local.utils.TableConfigUtils. That is needed for the ingestion groovy validation

-    if (handlerContext._disableGroovy) {
-      rejectGroovyQuery(serverPinotQuery);
-    }
+    rejectGroovyQuery(serverPinotQuery, handlerContext._disableGroovy);
Contributor

Suggest renaming it to validateGroovy

@@ -29,6 +29,8 @@


 public class CommonConstants {
+  public static final String GROOVY_STATIC_ANALYZER_CONFIG = "pinot.server.groovy.static.analyzer";
Contributor

This shouldn't be added as top level config. We need this config for both controller and broker, and they can potentially be different (one is for ingestion, and one is for query). Take a look at the subclass Controller and Broker within this class.
We can also consider adding a common one to be used when there is no controller/broker specific config

Contributor Author

Okay. It was added at the top level because it was a single configuration. I will create separate configurations for controller (ingestion) and broker (query).

For ingestion, under Controller subclass: pinot.ingestion.groovy.static.analyzer
For query, under Broker subclass: pinot.query.groovy.static.analyzer
Common configuration, as top level config in the CommonConstants class: pinot.groovy.static.analyzer

Contributor Author

I will use the service name only and not a use-case-specific name like pinot.broker.* and pinot.controller.*, to continue the existing pattern.

Contributor Author

With this, I think we can also drop the dedicated REST APIs for the Groovy config and use the POST /cluster/configs API to configure the Groovy static analyzer.

Contributor

Sounds good. Since we are not keeping the instance type prefix, we can put them in the top level, or add a subclass Groovy. Let's not put it in the beginning of the file, and also add some documentation for them.

Regarding the REST API, the ser/de of this config is not easy, so I'd suggest keeping the REST API, but add a type param which can be ingestion, query or all. Same for the GET request, where we perform the same fallback logic based on the type asked
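
The fallback suggested here could look roughly like the sketch below: a type-specific key is tried first, then the common key. The key strings are the ones proposed earlier in this thread and may not match what is finally merged; the class, the enum, and the choice to have `ALL` read only the common key are assumptions for illustration.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup with fallback from a type-specific key to the common key.
final class GroovyConfigLookupSketch {
  static final String INGESTION_KEY = "pinot.ingestion.groovy.static.analyzer";
  static final String QUERY_KEY = "pinot.query.groovy.static.analyzer";
  static final String COMMON_KEY = "pinot.groovy.static.analyzer";

  enum Type { INGESTION, QUERY, ALL }

  private GroovyConfigLookupSketch() {
  }

  static Optional<String> resolve(Map<String, String> clusterConfigs, Type type) {
    String specific = null;
    switch (type) {
      case INGESTION:
        specific = clusterConfigs.get(INGESTION_KEY);
        break;
      case QUERY:
        specific = clusterConfigs.get(QUERY_KEY);
        break;
      default:
        // Assumption: ALL has no key of its own and reads only the common key.
        break;
    }
    // Fall back to the common key when no type-specific value is set.
    return Optional.ofNullable(specific != null ? specific : clusterConfigs.get(COMMON_KEY));
  }
}
```

With this shape, a GET for `query` returns the query-specific value if one is set and the common value otherwise, which matches the behaviour asked for in the comment.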

@abhishekbafna
Contributor Author

I don't see the change in org.apache.pinot.segment.local.utils.TableConfigUtils. That is needed for the ingestion groovy validation

The ingestion (table create and update) path is already covered by TableConfigUtils#validateIngestionConfig: parsing and validation happen when the GroovyFunctionEvaluator object is initialized.

This happens when FunctionEvaluatorFactory.getExpressionEvaluator(filterFunction) is called in the validate ingestion config method. I also attached a screenshot earlier showing this.

[Screenshot 2025-02-12 at 10 33 51 AM: Controller Table API Call Stack]

@vrajat
Collaborator

vrajat commented Feb 21, 2025

The docs mention that security is required in Minions as well. However, I don't see the security apparatus set up in Minions. Is that required?

@Jackie-Jiang
Contributor

@vrajat Minion should be fine because it pulls the table config which is already validated on the controller

@abhishekbafna
Contributor Author

@vrajat Minion should be fine because it pulls the table config which is already validated on the controller

@Jackie-Jiang One thing to note: the controller config is applied on table creation and update. It does not apply to any existing ingestion config. For that, the config would need to be initialized in the minion and in other external jobs (Hadoop, Spark, etc.).

Labels
Configuration (Config changes: addition/deletion/change in behavior), documentation, release-notes (Referenced by PRs that need attention when compiling the next release notes), security
7 participants