Refactor query error handling to use QueryErrorCode and QueryErrorMessage for improved clarity and consistency #15037

Open
wants to merge 19 commits into
base: master

Conversation

gortiz
Contributor

@gortiz gortiz commented Feb 12, 2025

This PR derives from #14994. Here, I'm applying the bigger refactors required to clean up the error hierarchy. Although the number of lines changed is pretty extensive, the actual changes are:

  1. Create some new classes in org.apache.pinot.spi.exception:
    1. PinotQueryException, which should be used as our main runtime class.
    2. QueryException, which extends PinotQueryException. This class replaces most usages of ProcessingException, which was checked and includes tons of unnecessary Thrift code. It is important to note that this class differs from the older org.apache.pinot.common.exception.QueryException.
    3. QueryErrorCode, an enum that substitutes the integer error codes in the older QueryException.
    4. QueryErrorMessage, which contains a QueryErrorCode and two strings: one that can be shown to the user and one that could be used internally (mainly to be logged if needed).
  2. The older QueryException class (which did not actually extend Exception) has been deleted. Its responsibilities have been moved to the new classes QueryException and QueryErrorCode.
  3. Most uses of ProcessingException have been replaced with QueryException (when we actually need to throw it) or QueryErrorMessage when we need to keep the error code and messages in memory (usually to send them to other nodes).
  4. In this version, errors received by customers never include stack traces. Instead, some messages have been improved, and the idea is to continue improving them in subsequent PRs, trying to achieve something closer to what #14994 (Improve exceptions broker) was doing.
  5. Some error codes have changed. For example, the query testQueryException("SELECT COUNT(*) FROM mytable where ArrTime = 'potato'" fails with QUERY_EXECUTION in MSE (as you can see in test BaseClusterIntegrationTestSet
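To make the description above easier to follow, here is a minimal, self-contained sketch of how an error-code enum and an unchecked exception carrying that code could fit together. It is not the code from the PR: only the class names and the JSON_PARSING(100, "JsonParsingError") entry come from this page, while the other enum entries, their numeric ids, the field names, and the fromId helper are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins only; these are not the classes added by the PR.
enum QueryErrorCode {
  JSON_PARSING(100, "JsonParsingError"),        // id and name as shown in the PR diff
  QUERY_EXECUTION(200, "QueryExecutionError"),  // id and name assumed for this sketch
  UNKNOWN(1000, "UnknownError");                // id and name assumed for this sketch

  // Lookup table so a numeric code received from another node maps back to the enum.
  private static final Map<Integer, QueryErrorCode> BY_ID = new HashMap<>();

  static {
    for (QueryErrorCode code : values()) {
      BY_ID.put(code.id, code);
    }
  }

  private final int id;
  private final String defaultMessage;

  QueryErrorCode(int id, String defaultMessage) {
    this.id = id;
    this.defaultMessage = defaultMessage;
  }

  int getId() {
    return id;
  }

  String getDefaultMessage() {
    return defaultMessage;
  }

  // Hypothetical helper: unrecognized ids fall back to UNKNOWN instead of throwing.
  static QueryErrorCode fromId(int id) {
    return BY_ID.getOrDefault(id, UNKNOWN);
  }
}

// Unchecked base class, so callers are not forced to declare it.
class PinotQueryException extends RuntimeException {
  PinotQueryException(String message) {
    super(message);
  }
}

// Exception that carries a typed error code instead of a bare integer.
class QueryException extends PinotQueryException {
  private final QueryErrorCode errorCode;

  QueryException(QueryErrorCode errorCode, String message) {
    super(message);
    this.errorCode = errorCode;
  }

  QueryErrorCode getErrorCode() {
    return errorCode;
  }
}

class ErrorCodeDemo {
  public static void main(String[] args) {
    try {
      throw new QueryException(QueryErrorCode.QUERY_EXECUTION, "cannot compare ArrTime with 'potato'");
    } catch (QueryException e) {
      System.out.println(e.getErrorCode() + " (" + e.getErrorCode().getId() + "): " + e.getMessage());
    }
  }
}
```

Keeping the code as an enum means a misspelled or unknown code fails at compile time, and the numeric id is only needed at serialization boundaries.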

@gortiz gortiz force-pushed the cleanUpQueryExceptions branch from 2308828 to 9fd0886 Compare February 12, 2025 11:44
codecov-commenter commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 45.01217% with 226 lines in your changes missing coverage. Please review.

Project coverage is 63.40%. Comparing base (59551e4) to head (69c4363).
Report is 1753 commits behind head on master.

Files with missing lines Patch % Lines
...sthandler/BaseSingleStageBrokerRequestHandler.java 7.50% 37 Missing ⚠️
...requesthandler/MultiStageBrokerRequestHandler.java 0.00% 28 Missing ⚠️
...t/controller/api/resources/PinotQueryResource.java 25.00% 21 Missing ⚠️
.../pinot/query/service/dispatch/QueryDispatcher.java 17.64% 13 Missing and 1 partial ⚠️
...erator/streaming/BaseStreamingCombineOperator.java 23.07% 9 Missing and 1 partial ⚠️
...apache/pinot/query/service/server/QueryServer.java 35.71% 9 Missing ⚠️
.../apache/pinot/spi/exception/QueryErrorMessage.java 40.00% 7 Missing and 2 partials ⚠️
...org/apache/pinot/spi/exception/QueryErrorCode.java 85.96% 6 Missing and 2 partials ⚠️
...roker/requesthandler/BaseBrokerRequestHandler.java 12.50% 6 Missing and 1 partial ⚠️
.../core/operator/combine/GroupByCombineOperator.java 41.66% 6 Missing and 1 partial ⚠️
... and 30 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #15037      +/-   ##
============================================
+ Coverage     61.75%   63.40%   +1.65%     
- Complexity      207     1480    +1273     
============================================
  Files          2436     2750     +314     
  Lines        133233   154472   +21239     
  Branches      20636    23819    +3183     
============================================
+ Hits          82274    97942   +15668     
- Misses        44911    49132    +4221     
- Partials       6048     7398    +1350     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.37% <45.01%> (+1.66%) ⬆️
java-21 63.30% <45.01%> (+1.67%) ⬆️
skip-bytebuffers-false 63.39% <45.01%> (+1.64%) ⬆️
skip-bytebuffers-true 63.28% <45.01%> (+35.55%) ⬆️
temurin 63.40% <45.01%> (+1.65%) ⬆️
unittests 63.40% <45.01%> (+1.65%) ⬆️
unittests1 56.01% <56.41%> (+9.12%) ⬆️
unittests2 33.96% <21.65%> (+6.23%) ⬆️

@gortiz gortiz marked this pull request as ready for review February 13, 2025 13:14
@gortiz gortiz requested review from Jackie-Jiang, yashmayya and vrajat and removed request for Jackie-Jiang, yashmayya and vrajat February 13, 2025 13:14


public enum QueryErrorCode {
JSON_PARSING(100, "JsonParsingError"),
Collaborator

Style question: is this sorted by any criterion? If not, can the list be sorted by either error code or name? Error code is better.

Contributor Author

These enums are listed in the same order they were in the older QueryException class. Ideally, at least from my point of view, they should be listed in error code order.

I didn't change it for two reasons: to simplify the review and... mostly because I'm lazy 😆

 */
-public class QueryProcessingException {
+public class BrokerQueryErrorMessage {
Collaborator

Can you explain more about why both BrokerQueryErrorMessage and QueryErrorMessage are required? Is the main difference that QueryErrorMessage has a user/log message? Also, in QueryErrorMessage, are the log and user messages often different?

Contributor Author

Good question. The answer could be added to the Javadoc as well.

In the code currently in master, we use ProcessingException in two cases:

  1. We throw that exception when we detect some errors at runtime.
  2. We use them to store error messages in blocks we use to transfer information between operators and nodes.

The first case, throwing exceptions, is addressed by QueryException. The second case is handled by QueryErrorMessage, which is entirely internal. In contrast, BrokerQueryErrorMessage is not. It is essentially a new name for QueryProcessingException (which was not an actual Java exception) and is used as part of the BrokerResponseNative. This means that BrokerQueryErrorMessage belongs to the presentation layer, and its content, form, and semantics depend on what we communicate to the user. We must also be very meticulous about backward compatibility.

In QueryErrorMessage, I've added extra content such as the log message, and I plan to include more contextual information, such as the stage in which the error originated (which I found useful for the MSE send operator to know whether the error was raised in the current stage or a child stage).

Collaborator

Ack on splitting the function of ProcessingException. I suggest changing the name of either QueryErrorMessage (QEM) or BrokerQueryErrorMessage (BQEM). I assumed from the name that BQEM was a subset of QEM, and that QEM would be passed around internally and then finally converted to BQEM when prepping a response for the user. However, that doesn't seem to be the case. BQEM seems to be set from Exception objects or strings in maps. Maybe rename it to [User|Client]ErrorMessage?
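For readers trying to picture the flow the reviewer expected (an internal message converted to a user-facing one only at the response boundary), a minimal sketch might look like the following. As the comment above notes, the PR does not appear to work this way. The class names InternalErrorMessage and UserErrorMessage, their fields, and the from method are all hypothetical stand-ins, not Pinot classes.

```java
// Hypothetical stand-in for a message passed between operators and nodes.
final class InternalErrorMessage {
  private final int errorCode;
  private final String userMessage; // safe to show to the client
  private final String logMessage;  // internal detail, meant for logs only

  InternalErrorMessage(int errorCode, String userMessage, String logMessage) {
    this.errorCode = errorCode;
    this.userMessage = userMessage;
    this.logMessage = logMessage;
  }

  int getErrorCode() {
    return errorCode;
  }

  String getUserMessage() {
    return userMessage;
  }

  String getLogMessage() {
    return logMessage;
  }
}

// Hypothetical stand-in for the presentation-layer message in the response.
final class UserErrorMessage {
  private final int errorCode;
  private final String message;

  private UserErrorMessage(int errorCode, String message) {
    this.errorCode = errorCode;
    this.message = message;
  }

  // Conversion at the response boundary: the log message is deliberately dropped.
  static UserErrorMessage from(InternalErrorMessage internal) {
    return new UserErrorMessage(internal.getErrorCode(), internal.getUserMessage());
  }

  int getErrorCode() {
    return errorCode;
  }

  String getMessage() {
    return message;
  }
}

class MessageBoundaryDemo {
  public static void main(String[] args) {
    InternalErrorMessage internal = new InternalErrorMessage(
        200, "Query execution failed", "Stage 2 failed on a child operator");
    UserErrorMessage forClient = UserErrorMessage.from(internal);
    System.out.println(forClient.getErrorCode() + ": " + forClient.getMessage());
  }
}
```

With a single conversion point like this, the backward-compatibility constraints on the response format would stay confined to one class.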

@@ -22,7 +22,7 @@
* The {@code EarlyTerminationException} can be thrown from {Operator#nextBlock()} when the operator is early
* terminated (interrupted).
*/
-public class EarlyTerminationException extends RuntimeException {
+public class EarlyTerminationException extends PinotRuntimeException {
Collaborator

Should this be derived from QueryException? I checked the usages, and it is used in query processing only.

Contributor Author

You are probably right.

 responseObserver.onNext(Worker.ExplainResponse.newBuilder()
-    .putMetadata(CommonConstants.Explain.Response.ServerResponseStatus.STATUS_ERROR,
-        QueryException.getTruncatedStackTrace(e)).build());
+    .putMetadata(CommonConstants.Explain.Response.ServerResponseStatus.STATUS_ERROR, errorMsg)
Collaborator

Making sure I understand the pattern here. LOGGER.error prints the stack trace. However the stack trace is not added to the user message.

Contributor Author

Yes. That is the key idea of this refactor. In both MSE and SSE, errors (and by error I mean the abstract concept, not java.lang.Error) are detected and transmitted downstream (to the parent operators). But there are clearly two different states/phases/worlds/whatever in which the error may exist:

  1. As a java.lang.Throwable, which is eventually caught and transformed into
  2. An error message inside an error block. Depending on the block implementation, this abstract error can be stored in very different Java objects (including ProcessingExceptions, but also Map.Entry whose key is an error code and value is the String containing the error message).

Not all errors communicated to the user need to start as an actual java.lang.Throwable being thrown. Sometimes the error situation is detected and an error code is created. In master, this sometimes means we need to create an Exception instance that is never thrown, which is a bit strange.

Even stranger is that sometimes we allocate a static exception and just send that exception as the error message. That exception includes a completely fake stack trace (whose root is the static section of the class that contains the static Exception reference, which is super misleading).
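The point about static exceptions is standard Java behavior and can be reproduced in isolation: a Throwable captures its stack trace when it is constructed, not when it is thrown or reported. The sketch below uses hypothetical class names (StaticExceptionHolder, StaticTraceDemo) and is unrelated to the actual Pinot code.

```java
// Hypothetical holder that allocates one shared exception during class initialization.
class StaticExceptionHolder {
  static final RuntimeException SHARED = new RuntimeException("shared error");

  // Returns the shared instance; its trace still points at the static initializer.
  static RuntimeException reportSharedFailure() {
    return SHARED;
  }

  // Allocates at the point of failure, so the trace points at this method.
  static RuntimeException reportFreshFailure() {
    return new RuntimeException("fresh error");
  }
}

class StaticTraceDemo {
  // Name of the method at the top of the captured stack trace.
  static String topFrameMethod(Throwable t) {
    return t.getStackTrace()[0].getMethodName();
  }

  public static void main(String[] args) {
    // The shared instance reports the class initializer as its origin.
    System.out.println(topFrameMethod(StaticExceptionHolder.reportSharedFailure()));
    // The fresh instance reports the method that detected the problem.
    System.out.println(topFrameMethod(StaticExceptionHolder.reportFreshFailure()));
  }
}
```

The first line printed is `<clinit>` and the second is `reportFreshFailure`, which shows why a trace taken from a static exception says nothing about where the error was actually detected.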

In the code in master:

  1. A logical error can be logged 0, 1, or multiple times on the node (server, broker, etc.) that throws the exception.
  2. When the exception is wrapped into a block, we include a partial stack trace.
  3. When error blocks are found, we sometimes log their error messages. This means we end up logging the traces an extra time. Even worse, given that the blocks are usually sent to other nodes (in SSE to the broker, and in MSE to both the broker and other servers), we end up logging the stack trace of another process, which is super misleading.

Instead what we should do is:

  1. Only allocate exceptions when they are immediately going to be thrown
  2. Most of the time, these exceptions will be caught and converted into an error message. At that point we have the exact stack trace, and that is where we should log it (if needed). If we want to log it, we know what we want to log, so we can calculate the logMsg the QueryErrorMessage should contain.
  3. Once converted to a QueryErrorMessage, we don't need the stack trace as it should already have been logged. What we need is the message we want to present to the user (which should never include the stack trace) and the message we want to log in other modules (including other servers and brokers) when the block is received. This log message doesn't require the stack trace because it is not going to be useful in other nodes.
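The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: ErrorMessage is a hypothetical stand-in for QueryErrorMessage, the error code is a plain string, and java.util.logging is used only to keep the example self-contained.

```java
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for QueryErrorMessage: a code plus two stack-trace-free strings.
final class ErrorMessage {
  private final String errorCode;
  private final String userMessage;
  private final String logMessage;

  ErrorMessage(String errorCode, String userMessage, String logMessage) {
    this.errorCode = errorCode;
    this.userMessage = userMessage;
    this.logMessage = logMessage;
  }

  String getErrorCode() {
    return errorCode;
  }

  String getUserMessage() {
    return userMessage;
  }

  String getLogMessage() {
    return logMessage;
  }
}

final class ErrorConversion {
  private static final Logger LOGGER = Logger.getLogger(ErrorConversion.class.getName());

  private ErrorConversion() {
  }

  // Runs the task; on failure, logs the stack trace once and returns a trace-free message.
  static Optional<ErrorMessage> runAndConvert(String requestId, Runnable task) {
    try {
      task.run();
      return Optional.empty();
    } catch (RuntimeException e) {
      // The only place where the stack trace is available, so it is logged here.
      LOGGER.log(Level.SEVERE, "Request " + requestId + " failed", e);
      String userMessage = "Query failed: " + e.getMessage();
      String logMessage = "Request " + requestId + " failed with "
          + e.getClass().getSimpleName() + ": " + e.getMessage();
      return Optional.of(new ErrorMessage("QUERY_EXECUTION", userMessage, logMessage));
    }
  }
}

class ErrorConversionDemo {
  public static void main(String[] args) {
    Optional<ErrorMessage> error = ErrorConversion.runAndConvert("req-1", () -> {
      throw new IllegalStateException("cannot parse 'potato' as a number");
    });
    error.ifPresent(m -> System.out.println(m.getErrorCode() + ": " + m.getUserMessage()));
  }
}
```

Nodes that later receive the ErrorMessage can log getLogMessage() without reprinting a stack trace that belongs to another process.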

LOGGER.info("Caught BadQueryRequestException while processing requestId: {}, {}", requestId, e.getMessage());
instanceResponse.addException(QueryException.getException(QueryException.QUERY_VALIDATION_ERROR, e));
-} else if (e instanceof QueryCancelledException) {
+if (e instanceof QueryCancelledException) {
Collaborator

@vrajat vrajat Feb 17, 2025

I asked in Slack as well, and this is not specific to this PR: why is this pattern used instead of catch (QueryCancelledException e) { ... } catch (Exception e) { ... }?
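For comparison, the alternative the reviewer is asking about would look roughly like the sketch below, where each exception type gets its own catch clause instead of an instanceof chain inside a single catch. The two exception classes are local stand-ins with the same names as the Pinot classes, and the returned labels are assumptions made for illustration. Nothing on this page explains why the existing code uses instanceof.

```java
// Local stand-ins for the Pinot exception types mentioned in the diff.
class BadQueryRequestException extends RuntimeException {
  BadQueryRequestException(String message) {
    super(message);
  }
}

class QueryCancelledException extends RuntimeException {
  QueryCancelledException(String message) {
    super(message);
  }
}

final class CatchClauseDemo {
  private CatchClauseDemo() {
  }

  // Classifies a failure with one catch clause per type, most specific first.
  static String classify(Runnable task) {
    try {
      task.run();
      return "OK";
    } catch (BadQueryRequestException e) {
      return "QUERY_VALIDATION";   // label assumed for this sketch
    } catch (QueryCancelledException e) {
      return "QUERY_CANCELLATION"; // label assumed for this sketch
    } catch (Exception e) {
      return "QUERY_EXECUTION";    // fallback for everything else
    }
  }

  public static void main(String[] args) {
    System.out.println(classify(() -> {
      throw new QueryCancelledException("cancelled by user");
    }));
  }
}
```

The compiler rejects a catch clause that is unreachable because an earlier clause already covers its type, which is one safeguard an instanceof chain does not offer.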


3 participants