Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[STORM-3683] Check if JVM options used for launching worker are valid. #3319

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

bipinprasad
Copy link
Contributor

What is the purpose of the change

Error in JVM options can cause the worker launch to fail. This change will detect the erroneous options to be detected early when topology is submitted.

How was the change tested

With new unit tests

@@ -1907,4 +1909,201 @@ private void readArchive(ZipFile zipFile) throws IOException {
}
}
}

/**
* Return path to the Java command "x", prefixing with $}{JAVA_HOME}/bin/ if JAVA_HOME system property is defined.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra } bracket in $}{JAVA_HOME}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


/**
* Return path to the Java command "x", prefixing with $}{JAVA_HOME}/bin/ if JAVA_HOME system property is defined.
* Otherwie return the supplied Java command unmodified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise spelling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* @param cmd Java command, e.g. "java", "jar" etc.
* @return command string to use.
*/
public static String getJavaCmd(String cmd) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this.

@@ -568,14 +570,7 @@ private int getMemOffHeap(WorkerResources resources) {
}

protected String javaCmd(String cmd) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any need to keep this method now that it is available in Utils?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I took this out, a bunch of test classes started failing. They were using a Mock class to override this class method to return just "java". And in the test, they check the returned array. In order to avoid changing a whole bunch of other classes, I left this signature unchanged. But this definitely looks ugly and the tests should be fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'm fine keeping this for now.

public static boolean validateWorkerLaunchOptions(Map<String, Object> supervisorConf, Map<String, Object> topoConf,
Map<String, Object> substitutions, boolean throwExceptionOnFailure)
throws InvalidTopologyException {
// from storm-server/.../BasicContainer.mkLaunchCommand
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we commonize this, or grab them from a BasicContainer method? Hard to maintain this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BasicContainer is in storm-server package which depends on the storm-client package for compile.
Should be commonized somehow - otherwise this is a hidden forward dependency. Will look into this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that storm-server:BasicContainer.java is now using storm-client:Utils.WellKnownRuntimeSubstitutionVars. Is this sufficient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like others to chime in here. I like the intent of this change, but the way we're replicating the code here so specifically I'm not so happy with. I understand the dependency issue causes problems.

Copy link
Contributor

@Ethanlm Ethanlm Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the replication either. It adds a lot of maintenance overhead. The suggestion I can think of is to move the whole check into server side at Nimbus.submitTopologyWithOpts, which is better in a few ways. But it has its own problem to be taken care of (see my comments above).

Some thoughts on this:
This feature (check GC option conflicts) is nice to have. But it is not hard to find out that workers fail because of GC option conflicts. So I think we don't have to implement this feature if there is no good/clean way to implement it.

@bipinprasad
Copy link
Contributor Author

close/reopen for rebuild

@bipinprasad bipinprasad reopened this Aug 12, 2020
@@ -528,6 +528,7 @@ private static void validateConfs(Map<String, Object> topoConf, StormTopology to
InvalidTopologyException, AuthorizationException {
ConfigValidation.validateTopoConf(topoConf);
Utils.validateTopologyBlobStoreMap(topoConf);
Utils.validateWorkerLaunchOptions(null, topoConf, null, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A validation here is not likely to prevent the issue since it doesn't use the supervisorConf here. But in BasicContainer, it will use supervisorConf, which might have some default values.

We can add the check inside Nimbus.submitTopologyWithOpts (server side), and if it fails the check, throws InvalidTopologyException. This can avoid code duplication and avoid exposing too many method on the client side.

But the problem is still it will use nimbusConf instead of supervisorConf, unless we can find a clean way to use the same source of truth.

public static boolean validateWorkerLaunchOptions(Map<String, Object> supervisorConf, Map<String, Object> topoConf,
Map<String, Object> substitutions, boolean throwExceptionOnFailure)
throws InvalidTopologyException {
// from storm-server/.../BasicContainer.mkLaunchCommand
Copy link
Contributor

@Ethanlm Ethanlm Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the replication either. It adds a lot of maintenance overhead. The suggestion I can think of is to move the whole check into server side at Nimbus.submitTopologyWithOpts, which is better in a few ways. But it has its own problem to be taken care of (see my comments above).

Some thoughts on this:
This feature (check GC option conflicts) is nice to have. But it is not hard to find out that workers fail because of GC option conflicts. So I think we don't have to implement this feature if there is no good/clean way to implement it.

@bipinprasad
Copy link
Contributor Author

I think a good reason to catching this early (i.e. at submission time) is that the cycle time for error detection is longer. Case in point is JVM options changes and the debug cycle yesterday just to discover the problem with the profiler options.

@Ethanlm
Copy link
Contributor

Ethanlm commented Sep 4, 2020

I think a good reason to catching this early (i.e. at submission time) is that the cycle time for error detection is longer.

It is not hard to find out that jvm didn't launch because of jvm option issue, GC option, profiler option or other options, since they are in the logs.

But if this feature is really desired, I would suggest to move to Nimbus code. Doing it in Nimbus.submitTopologyWithOpts is still at submission time and users will see the exception from console log. But doing it on StormSubmitter side won't be able to detect the issue because it doesn't use daemonConf.

Another advantage of doing it at Nimbus.submitTopologyWithOpts is we can avoid code duplication and we can restrict the scope of the code related to launching a worker. Because the code to launch the worker is purely server-side, it would be better to not expose it to client.

But after moving it to Nimbus code, we need to deal with nimbusConf vs supervisorConf issue (although it is a less serious problem)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants