-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] Fix tablet scheduler exception #52925
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: sevev <[email protected]>
continue; | ||
} | ||
list.add(tablet); | ||
count--; | ||
} | ||
return list; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
Potential race condition or inconsistency issues due to missing finalizeTabletCtx
call if checkIfTabletExpired
returns true or throws an exception.
You can modify the code like this:
private synchronized List<TabletSchedCtx> getNextTabletCtxBatch() {
while (count > 0) {
Tablet tablet = getTablet();
if (tablet == null) {
// no more tablets
break;
}
try {
// ignore tablets that will expire and erase soon
if (checkIfTabletExpired(tablet)) {
finalizeTabletCtx(tablet, TabletSchedCtx.State.EXPIRED, "Tablet expired");
continue;
}
list.add(tablet);
count--;
} catch (Exception e) {
LOG.warn("got unexpected exception, discard this schedule. tablet: {}",
tablet.getTabletId(), e);
finalizeTabletCtx(tablet, TabletSchedCtx.State.UNEXPECTED, e.getMessage());
}
}
return list;
}
By ensuring that finalizeTabletCtx
is called whenever skipping a tablet, we avoid potential leaks in state management.
Quality Gate passedIssues Measures |
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]❌ fail : 4 / 8 (50.00%) file detail
|
Why I'm doing:
If
TabletScheduler
encountered an unhandled exception when scheduling pending tablets, it only removed the task but did not remove the task_ctx. Consequently, subsequent scheduling will ignore these tablets, and clone and other tasks will no longer be triggered, ultimately causing the alter task to stall.e.g
What I'm doing:
Catch the exception and remove the tablet_ctx.
Fixes https://github.com/StarRocks/StarRocksTest/issues/8829
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: