-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor query error handling to use QueryErrorCode and QueryErrorMessage for improved clarity and consistency #15037
base: master
Are you sure you want to change the base?
Conversation
…sage for improved clarity and consistency
2308828
to
9fd0886
Compare
…ge constructor for improved JSON serialization
… error codes instead of messages
…lingTest to focus on error codes
…stead of messages
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #15037 +/- ##
============================================
+ Coverage 61.75% 63.40% +1.65%
- Complexity 207 1480 +1273
============================================
Files 2436 2750 +314
Lines 133233 154472 +21239
Branches 20636 23819 +3183
============================================
+ Hits 82274 97942 +15668
- Misses 44911 49132 +4221
- Partials 6048 7398 +1350
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
|
||
|
||
public enum QueryErrorCode { | ||
JSON_PARSING(100, "JsonParsingError"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style question: Is this sorted on any criteria ? If not, can the list be sorted either by error code or name ? Error code is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These enums are listed in the same order they were in the older QueryException class. Ideally, at least from my point of view, they should be listed in error code order.
I didn't change it for two reasons: to simplify the review and... mostly because I'm lazy 😆
*/ | ||
public class QueryProcessingException { | ||
public class BrokerQueryErrorMessage { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain more why both BrokerQueryErrorMessage
and QueryErrorMessage
is required ? Is the main difference that QueryErrorMessage has a user/log message ? Also in
QueryErrorMessage` are log & user message different often ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. The answer could be added to the Javadoc as well.
In the code currently in master, we use ProcessingException in two cases:
- We throw that exception when we detect some errors at runtime.
- We use them to store error messages in blocks we use to transfer information between operators and nodes.
The first case, throw exceptions, is addressed by QueryException
. The second case is handled by QueryErrorMessage
, which is entirely internal. In contrast, BrokerQueryErrorMessage
is not. It is essentially a new name for QueryProcessingException (which was not an actual Java exception) and is utilized as part of the BrokerResponseNative
. This means that BrokerQueryErrorMessage
belongs to the presentation layer, and its content, form, and semantics depend on what we communicate to the user. We must also be very meticulous about backward compatibility.
In QueryErrorMessage
, I've added extra content such as the log message, and I plan to include more contextual information, like on which stage the error originated (which I found useful for the MSE send operator to know if the error was raised in the current or a child stage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack on splitting the function of ProcessingException
. I suggest changing the name of either QueryErrorMessage (QEM)
or BrokerQueryErrorMessage (BQEM)
. I assumed from the name that BQEM
was a subset of QEM
and that QEM
will be passed around internally and then finally converted to BQEM
when prepping a response for the user. However that doesnt seem to be the case. BQEM
seems to be set from Exception objects or strings in maps. Maybe rename to [User|Client]ErrorMessage
?
@@ -22,7 +22,7 @@ | |||
* The {@code EarlyTerminationException} can be thrown from {Operator#nextBlock()} when the operator is early | |||
* terminated (interrupted). | |||
*/ | |||
public class EarlyTerminationException extends RuntimeException { | |||
public class EarlyTerminationException extends PinotRuntimeException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be derived from QueryException
? I checked usages and it is used in query processing only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are probably right.
responseObserver.onNext(Worker.ExplainResponse.newBuilder() | ||
.putMetadata(CommonConstants.Explain.Response.ServerResponseStatus.STATUS_ERROR, | ||
QueryException.getTruncatedStackTrace(e)).build()); | ||
.putMetadata(CommonConstants.Explain.Response.ServerResponseStatus.STATUS_ERROR, errorMsg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making sure I understand the pattern here. LOGGER.error
prints the stack trace. However the stack trace is not added to the user message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That is the key idea of this refactor. In both MSE and SSE errors (and with error I mean the abstract concept, not java.lang.error) are detected and transmitted downstream (to the parent operators). But there are clearly two different states/phases/worlds/whatever where the error may be:
- As a
java.lang.Throwable
, which is eventually caught and transform into - An error message inside an error block. Depending on the block implementation, this abstract error can be stored in very different Java objects (including ProcessingExceptions, but also Map.Entry whose key is an error code and value is the String containing the error message).
Not all errors communicated to the user require starting as an actual java.lang.Throwables
being thrown. Sometimes the error situation is detected and an error code is created. In master this sometimes means we need to create an Exception instance that is never being thrown, which is a bit strange.
Even more strange is that sometimes we allocate an static exception and just send that exception as error message. That exception includes a completely fake stack trace (whose root is the static section of the class that contains the static Exception reference, which is super misleading).
In the code in master we:
- A logical error can be logged 0, 1 or multiple times on the node (server, broker, etc) that throws the exception.
- When the exception is wrapped into a block, we include a partial stack trace.
- When error blocks are found, sometimes we log their error messages. This means we end up logging the traces an extra time. Even worse, given the blocks are usually sent to other nodes (in SSE to the broker and in MSE to both broker and other servers) we end up logging the stacktrace of another process, which is super misleading.
Instead what we should do is:
- Only allocate exceptions when they are immediately going to be thrown
- Most of the times, these exceptions will be caught and converted into an error message. At that point, we have the exact stack trace, and there is where we should log it (if needed). If we want to log it, we know what we want to log, so we can calculate the
logMsg
the QueryErrorMessage should contain. - Once converted to a QueryErrorMessage, we don't need the stack trace as it should already have been logged. What we need is the message we want to present to the user (which should never include the stack trace) and the message we want to log in other modules (including other servers and brokers) when the block is received. This log message doesn't require the stack trace because it is not going to be useful in other nodes.
LOGGER.info("Caught BadQueryRequestException while processing requestId: {}, {}", requestId, e.getMessage()); | ||
instanceResponse.addException(QueryException.getException(QueryException.QUERY_VALIDATION_ERROR, e)); | ||
} else if (e instanceof QueryCancelledException) { | ||
if (e instanceof QueryCancelledException) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked in slack as well and not specific to this PR. Why is this pattern used instead catch(QueryCancelledException) { ... } catch (Exception e) { ... }
This PR derives from #14994. Here, I'm applying the bigger refactors required to clean up the error hierarchy. Although the number of lines changed is pretty extensive, the actual changes are:
org.apache.pinot.spi.exception
:org.apache.pinot.common.exception.QueryException
).testQueryException("SELECT COUNT(*) FROM mytable where ArrTime = 'potato'"
fails withQUERY_EXECUTION
in MSE (as you can see in testBaseClusterIntegrationTestSet