Refactor query error handling to use QueryErrorCode and QueryErrorMessage for improved clarity and consistency #15037

gortiz · 2025-02-12T11:37:19Z

This PR derives from #14994. Here, I'm applying the bigger refactors required to clean up the error hierarchy. Although the number of lines changed is pretty extensive, the actual changes are:

Create some new classes in org.apache.pinot.spi.exception:
1. PinotQueryException, which should be used as our main runtime class.
2. QueryException, which extends PinotQueryException. This class replaces most usages of ProcessingException, which is a checked exception and carries a lot of unnecessary Thrift code. Note that this class differs from the older org.apache.pinot.common.exception.QueryException.
3. QueryErrorCode, an enum that replaces the integer error codes in the older QueryException.
4. QueryErrorMessage, which contains a QueryErrorCode and two strings: one that can be shown to the user and one that could be used internally (mainly to be logged if needed).
The older QueryException class (which did not actually extend Exception) has been deleted. Its responsibilities have been moved to the new classes QueryException and QueryErrorCode.
Most uses of ProcessingException have been replaced with QueryException (when we actually need to throw it) or QueryErrorMessage when we need to keep the error code and messages in memory (usually to send them to other nodes).
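
To make the relationship between the four types easier to picture, here is a minimal sketch. It is not the code in this PR: only the names and the constant JSON_PARSING(100, "JsonParsingError") come from the description and the diff, while the constructors, fields, accessors, and the id given to QUERY_EXECUTION are assumptions made for illustration.

```java
// Simplified sketch of the types named in the description.
// Everything except the type names and JSON_PARSING(100, ...) is assumed.

/** Main unchecked base class for query errors. */
class PinotQueryException extends RuntimeException {
  PinotQueryException(String message) {
    super(message);
  }
}

/** Replaces the integer constants of the older QueryException. */
enum QueryErrorCode {
  JSON_PARSING(100, "JsonParsingError"),
  QUERY_EXECUTION(200, "QueryExecutionError"); // id and default message assumed

  private final int _id;
  private final String _defaultMessage;

  QueryErrorCode(int id, String defaultMessage) {
    _id = id;
    _defaultMessage = defaultMessage;
  }

  int getId() {
    return _id;
  }

  String getDefaultMessage() {
    return _defaultMessage;
  }
}

/** Thrown when an error is detected at runtime. */
class QueryException extends PinotQueryException {
  private final QueryErrorCode _errorCode;

  QueryException(QueryErrorCode errorCode, String message) {
    super(message);
    _errorCode = errorCode;
  }

  QueryErrorCode getErrorCode() {
    return _errorCode;
  }
}

/** Plain value kept in memory or sent to other nodes; it is never thrown. */
final class QueryErrorMessage {
  private final QueryErrorCode _errorCode;
  private final String _userMessage; // can be shown to the user
  private final String _logMessage;  // internal, mainly for logging

  QueryErrorMessage(QueryErrorCode errorCode, String userMessage, String logMessage) {
    _errorCode = errorCode;
    _userMessage = userMessage;
    _logMessage = logMessage;
  }

  QueryErrorCode getErrorCode() {
    return _errorCode;
  }

  String getUserMessage() {
    return _userMessage;
  }

  String getLogMessage() {
    return _logMessage;
  }
}
```

In this sketch QueryException is unchecked because it inherits from RuntimeException through PinotQueryException, which matches the stated goal of replacing the checked ProcessingException.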
In this version, errors received by customers never include stack traces. Instead, some messages have been improved, and the idea is to continue improving them in subsequent PRs, trying to achieve something closer to what #14994 (Improve exceptions broker) was doing.
Some error codes have changed. For example, the query testQueryException("SELECT COUNT(*) FROM mytable where ArrTime = 'potato'" fails with QUERY_EXECUTION in MSE (as you can see in test BaseClusterIntegrationTestSet

codecov-commenter · 2025-02-12T15:28:19Z

Codecov Report

Attention: Patch coverage is 45.01217% with 226 lines in your changes missing coverage. Please review.

Project coverage is 63.40%. Comparing base (59551e4) to head (69c4363).
Report is 1753 commits behind head on master.

Files with missing lines	Patch %	Lines
...sthandler/BaseSingleStageBrokerRequestHandler.java	7.50%	37 Missing ⚠️
...requesthandler/MultiStageBrokerRequestHandler.java	0.00%	28 Missing ⚠️
...t/controller/api/resources/PinotQueryResource.java	25.00%	21 Missing ⚠️
.../pinot/query/service/dispatch/QueryDispatcher.java	17.64%	13 Missing and 1 partial ⚠️
...erator/streaming/BaseStreamingCombineOperator.java	23.07%	9 Missing and 1 partial ⚠️
...apache/pinot/query/service/server/QueryServer.java	35.71%	9 Missing ⚠️
.../apache/pinot/spi/exception/QueryErrorMessage.java	40.00%	7 Missing and 2 partials ⚠️
...org/apache/pinot/spi/exception/QueryErrorCode.java	85.96%	6 Missing and 2 partials ⚠️
...roker/requesthandler/BaseBrokerRequestHandler.java	12.50%	6 Missing and 1 partial ⚠️
.../core/operator/combine/GroupByCombineOperator.java	41.66%	6 Missing and 1 partial ⚠️
... and 30 more

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #15037      +/-   ##
============================================
+ Coverage     61.75%   63.40%   +1.65%     
- Complexity      207     1480    +1273     
============================================
  Files          2436     2750     +314     
  Lines        133233   154472   +21239     
  Branches      20636    23819    +3183     
============================================
+ Hits          82274    97942   +15668     
- Misses        44911    49132    +4221     
- Partials       6048     7398    +1350

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.37% <45.01%> (+1.66%)`	⬆️
java-21	`63.30% <45.01%> (+1.67%)`	⬆️
skip-bytebuffers-false	`63.39% <45.01%> (+1.64%)`	⬆️
skip-bytebuffers-true	`63.28% <45.01%> (+35.55%)`	⬆️
temurin	`63.40% <45.01%> (+1.65%)`	⬆️
unittests	`63.40% <45.01%> (+1.65%)`	⬆️
unittests1	`56.01% <56.41%> (+9.12%)`	⬆️
unittests2	`33.96% <21.65%> (+6.23%)`	⬆️

vrajat · 2025-02-13T13:40:38Z

pinot-spi/src/main/java/org/apache/pinot/spi/exception/QueryErrorCode.java

+
+
+public enum QueryErrorCode {
+  JSON_PARSING(100, "JsonParsingError"),


Style question: Is this sorted by any criterion? If not, can the list be sorted by either error code or name? Sorting by error code would be better.

gortiz · 2025-02-14

These enums are listed in the same order they were in the older QueryException class. Ideally, at least from my point of view, they should be listed in error code order.

I didn't change it for two reasons: to simplify the review and... mostly because I'm lazy 😆
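
If the constants are reordered later, a small check along these lines could keep them in error code order. This is only a sketch: SampleErrorCode and ErrorCodeOrder are hypothetical names, and apart from JSON_PARSING(100) the constants and ids are invented to show an out-of-order declaration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical enum: only JSON_PARSING(100) mirrors the diff above.
enum SampleErrorCode {
  JSON_PARSING(100),
  INTERNAL(450),     // invented id
  SQL_PARSING(150);  // invented id, deliberately declared out of order

  private final int _id;

  SampleErrorCode(int id) {
    _id = id;
  }

  int getId() {
    return _id;
  }
}

class ErrorCodeOrder {
  /** Returns true when the constants are declared in strictly ascending id order. */
  static boolean isDeclaredInIdOrder() {
    SampleErrorCode[] codes = SampleErrorCode.values(); // values() keeps declaration order
    for (int i = 1; i < codes.length; i++) {
      if (codes[i - 1].getId() >= codes[i].getId()) {
        return false;
      }
    }
    return true;
  }

  /** Returns a copy of the constants sorted by id, i.e. the order the declaration should have. */
  static SampleErrorCode[] sortedById() {
    SampleErrorCode[] codes = SampleErrorCode.values(); // values() returns a fresh array
    Arrays.sort(codes, Comparator.comparingInt(SampleErrorCode::getId));
    return codes;
  }
}
```

A unit test asserting that the ordering check returns true would fail as soon as a new constant is added in the wrong place.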

vrajat · 2025-02-13T13:45:29Z

pinot-common/src/main/java/org/apache/pinot/common/response/broker/BrokerQueryErrorMessage.java

 */
-public class QueryProcessingException {
+public class BrokerQueryErrorMessage {


Can you explain in more detail why both BrokerQueryErrorMessage and QueryErrorMessage are required? Is the main difference that QueryErrorMessage has a user/log message? Also, in QueryErrorMessage, are the log and user messages often different?

gortiz · 2025-02-14

Good question. The answer could be added to the Javadoc as well.

In the code currently in master, we use ProcessingException in two cases:

1. We throw that exception when we detect some errors at runtime.
2. We use them to store error messages in the blocks we use to transfer information between operators and nodes.

The first case, throwing exceptions, is addressed by QueryException. The second case is handled by QueryErrorMessage, which is entirely internal. In contrast, BrokerQueryErrorMessage is not. It is essentially a new name for QueryProcessingException (which was not an actual Java exception) and is used as part of the BrokerResponseNative. This means that BrokerQueryErrorMessage belongs to the presentation layer, and its content, form, and semantics depend on what we communicate to the user. We must also be very careful about backward compatibility.

In QueryErrorMessage, I've added extra content such as the log message, and I plan to include more contextual information, like on which stage the error originated (which I found useful for the MSE send operator to know if the error was raised in the current or a child stage).

vrajat · 2025-02-17

Ack on splitting the function of ProcessingException. I suggest changing the name of either QueryErrorMessage (QEM) or BrokerQueryErrorMessage (BQEM). I assumed from the name that BQEM was a subset of QEM and that QEM would be passed around internally and then finally converted to BQEM when preparing a response for the user. However, that doesn't seem to be the case. BQEM seems to be set from Exception objects or strings in maps. Maybe rename it to [User|Client]ErrorMessage?

vrajat · 2025-02-13T13:46:53Z

pinot-spi/src/main/java/org/apache/pinot/spi/exception/EarlyTerminationException.java

@@ -22,7 +22,7 @@
 * The {@code EarlyTerminationException} can be thrown from {Operator#nextBlock()} when the operator is early
 * terminated (interrupted).
 */
-public class EarlyTerminationException extends RuntimeException {
+public class EarlyTerminationException extends PinotRuntimeException {


Should this be derived from QueryException? I checked the usages and it is used in query processing only.

gortiz · 2025-02-14

You are probably right.

vrajat · 2025-02-13T13:51:20Z

pinot-query-runtime/src/main/java/org/apache/pinot/query/service/server/QueryServer.java

      responseObserver.onNext(Worker.ExplainResponse.newBuilder()
-          .putMetadata(CommonConstants.Explain.Response.ServerResponseStatus.STATUS_ERROR,
-              QueryException.getTruncatedStackTrace(e)).build());
+          .putMetadata(CommonConstants.Explain.Response.ServerResponseStatus.STATUS_ERROR, errorMsg)


Making sure I understand the pattern here. LOGGER.error prints the stack trace. However the stack trace is not added to the user message.

gortiz · 2025-02-14

Yes. That is the key idea of this refactor. In both MSE and SSE, errors (and by error I mean the abstract concept, not java.lang.Error) are detected and transmitted downstream (to the parent operators). But there are clearly two different states/phases/worlds/whatever where the error may be:

1. As a java.lang.Throwable, which is eventually caught and transformed into
2. An error message inside an error block. Depending on the block implementation, this abstract error can be stored in very different Java objects (including ProcessingExceptions, but also a Map.Entry whose key is an error code and whose value is the String containing the error message).

Not all errors communicated to the user need to start as an actual java.lang.Throwable being thrown. Sometimes the error situation is detected and an error code is created. In master, this sometimes means we need to create an Exception instance that is never thrown, which is a bit strange.

Even stranger, sometimes we allocate a static exception and just send that exception as the error message. That exception includes a completely fake stack trace (whose root is the static section of the class that contains the static Exception reference), which is super misleading.
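
The fake stack trace is easy to reproduce outside Pinot. In the sketch below (the class and method names are hypothetical), an exception stored in a static field records the class initializer as its top frame, no matter where it is used later, while an exception allocated at the point of failure records the method that created it.

```java
// Illustration only; none of these names exist in Pinot.
class StaticExceptionDemo {
  // Allocated once, while the class is being initialized.
  static final RuntimeException SHARED = new RuntimeException("query failed");

  /** Method name at the top of the shared exception's stack trace. */
  static String topFrameOfShared() {
    return SHARED.getStackTrace()[0].getMethodName();
  }

  /** Method name at the top of an exception allocated where the failure happens. */
  static String topFrameOfFresh() {
    return new RuntimeException("query failed").getStackTrace()[0].getMethodName();
  }
}
```

On a standard JVM, `topFrameOfShared()` returns `<clinit>` (the static initializer) and `topFrameOfFresh()` returns `topFrameOfFresh`, so only the second trace says anything about where the error was detected.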

In the code in master:

1. A logical error can be logged 0, 1, or multiple times on the node (server, broker, etc.) that throws the exception.
2. When the exception is wrapped into a block, we include a partial stack trace.
3. When error blocks are found, we sometimes log their error messages. This means we end up logging the traces an extra time. Even worse, given that the blocks are usually sent to other nodes (in SSE to the broker and in MSE to both the broker and other servers), we end up logging the stack trace of another process, which is super misleading.

Instead, what we should do is:

1. Only allocate exceptions when they are going to be thrown immediately.
2. Most of the time, these exceptions will be caught and converted into an error message. At that point, we have the exact stack trace, and that is where we should log it (if needed). If we want to log it, we know what we want to log, so we can calculate the logMsg the QueryErrorMessage should contain.
3. Once converted to a QueryErrorMessage, we don't need the stack trace, as it should already have been logged. What we need is the message we want to present to the user (which should never include the stack trace) and the message we want to log in other modules (including other servers and brokers) when the block is received. This log message doesn't require the stack trace because it is not going to be useful on other nodes.
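
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: ErrorMessageSketch stands in for QueryErrorMessage, OperatorSketch and runAndConvert are hypothetical, the numeric id is assumed, and java.util.logging is used only to keep the example free of dependencies.

```java
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for QueryErrorMessage. */
final class ErrorMessageSketch {
  final int _errorCode;
  final String _userMessage; // shown to the client, never contains a stack trace
  final String _logMessage;  // logged by the nodes that receive the error block

  ErrorMessageSketch(int errorCode, String userMessage, String logMessage) {
    _errorCode = errorCode;
    _userMessage = userMessage;
    _logMessage = logMessage;
  }
}

class OperatorSketch {
  private static final Logger LOGGER = Logger.getLogger(OperatorSketch.class.getName());
  private static final int QUERY_EXECUTION = 200; // numeric id assumed for illustration

  /** Runs the work; on failure logs the stack trace once and returns a trace-free message. */
  static Optional<ErrorMessageSketch> runAndConvert(Runnable work, String nodeId) {
    try {
      work.run();
      return Optional.empty();
    } catch (RuntimeException e) {
      String logMessage = "Query failed on " + nodeId + ": " + e.getMessage();
      // The exact stack trace is available here, so this is the one place it is logged.
      LOGGER.log(Level.SEVERE, logMessage, e);
      // From here on only the code and the two messages travel with the error.
      return Optional.of(
          new ErrorMessageSketch(QUERY_EXECUTION, "Error while executing the query", logMessage));
    }
  }
}
```

A node that later receives the message can log `_logMessage` as is, without pretending to own a stack trace that was produced in another process.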

vrajat · 2025-02-17T08:03:48Z

pinot-core/src/main/java/org/apache/pinot/core/query/executor/ServerQueryExecutorV1Impl.java

-        LOGGER.info("Caught BadQueryRequestException while processing requestId: {}, {}", requestId, e.getMessage());
-        instanceResponse.addException(QueryException.getException(QueryException.QUERY_VALIDATION_ERROR, e));
-      } else if (e instanceof QueryCancelledException) {
+      if (e instanceof QueryCancelledException) {


I asked in Slack as well, and this is not specific to this PR: why is this pattern used instead of catch (QueryCancelledException e) { ... } catch (Exception e) { ... }?
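
For reference, the two shapes being compared look like this in isolation. The sketch uses hypothetical names (CatchStyleSketch and a local QueryCancelledSketchException) and does not reproduce the logic of ServerQueryExecutorV1Impl; it only shows that both forms select the same branch for the same exception.

```java
// Hypothetical names; this does not mirror the real executor code.
class QueryCancelledSketchException extends RuntimeException {
}

class CatchStyleSketch {
  /** Single catch block that branches on the runtime type. */
  static String withInstanceof(Runnable work) {
    try {
      work.run();
      return "OK";
    } catch (Exception e) {
      // Code placed here would run for every exception type before branching.
      if (e instanceof QueryCancelledSketchException) {
        return "CANCELLED";
      } else {
        return "ERROR";
      }
    }
  }

  /** Separate catch clauses, most specific type first. */
  static String withCatchClauses(Runnable work) {
    try {
      work.run();
      return "OK";
    } catch (QueryCancelledSketchException e) {
      return "CANCELLED";
    } catch (Exception e) {
      return "ERROR";
    }
  }
}
```

One practical difference is that the single catch block can share statements before or after the branch, whereas separate clauses would have to repeat them or move them to a helper. Whether that is the reason for the existing pattern is not stated in this thread.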

Refactor query error handling to use QueryErrorCode and QueryErrorMessage for improved clarity and consistency

9fd0886

gortiz force-pushed the cleanUpQueryExceptions branch from 2308828 to 9fd0886 on February 12, 2025 11:44

gortiz added 10 commits February 12, 2025 13:00

Add JsonCreator and JsonProperty annotations to BrokerQueryErrorMessage constructor for improved JSON serialization

d19c41d

Make SqlCompilationException a QueryException

744226a

Return the correct error code when catching QueryExceptions

36f89f1

Fix two test that was expecting an older error message

77231d5

Improve error message in BaseCombineOperator

5ac9c00

Fix a test that was expecting an older error message

8d8d85b

Remove exception name from some errors

e609c93

Update error code assertions in OfflineClusterIntegrationTest to test error codes instead of messages

57b6f27

Refactor exception assertions in OfflineClusterMemBasedServerQueryKillingTest to focus on error codes

144f8e1

Refactor QueryQuotaClusterIntegrationTest to assert on error codes instead of messages

b69f7a4

gortiz added 7 commits February 13, 2025 09:09

Keep QueryException error codes in BaseCombineOperator

cb22097

Adapt test to the new more precise errors in MSE

4ba96e0

Fix error handling in QueryDispatcher and its tests

d9074d8

Merge remote-tracking branch 'origin/master' into cleanUpQueryExceptions

50e1860

Remove unused import for TimeoutException in QueryRunnerTestBase

6868df7

fix imports

a51c854

Enhance error message in QueryServer to include exception details

e7b67ea

gortiz marked this pull request as ready for review February 13, 2025 13:14

gortiz requested review from Jackie-Jiang, yashmayya and vrajat and removed request for Jackie-Jiang, yashmayya and vrajat February 13, 2025 13:14

vrajat reviewed Feb 13, 2025

vrajat reviewed Feb 17, 2025

Merge remote-tracking branch 'origin/master' into cleanUpQueryExceptions

69c4363
