stats: reduce overhead of distinct estimation #140772

Closed
yuzefovich opened this issue Feb 8, 2025 · 1 comment · Fixed by #143087
Labels
A-sql-table-stats Table statistics (and their automatic refresh). C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team target-release-25.2.0

Comments

@yuzefovich
Member

yuzefovich commented Feb 8, 2025

We currently use the hyperloglog library to estimate the number of distinct elements. I just collected a 50s CPU profile (about the time it took for ANALYZE to complete) on a cluster that only had ANALYZE tpcc.customer running, and this distinct estimation is the most expensive part of the stats collection (this was on dbb0baa plus a revert of 2831511 and another commit to introduce a cluster setting for using nil or non-nil DatumAlloc in stats):

[Image: non-nil datum alloc]

We should investigate whether it's possible to reduce this overhead. We recently upgraded the hyperloglog library, so there is no quick fix like that :/
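For readers less familiar with the technique, here is a minimal sketch of the kind of per-value work a HyperLogLog-style estimator does. It is a toy written against the standard library only and is not how axiomhq/hyperloglog or the stats code is implemented; the names `sketch`, `mix64`, and `distinctEstimate` are hypothetical. The point is that every inserted value pays for a hash plus a register update, which adds up when it runs for each row and each column.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
	"strconv"
)

// sketch is a toy HyperLogLog with 2^p one-byte registers (illustration only).
type sketch struct {
	p    uint8 // precision; assumed to be in [4, 16]
	regs []uint8
}

func newSketch(p uint8) *sketch {
	return &sketch{p: p, regs: make([]uint8, int(1)<<p)}
}

// mix64 is the MurmurHash3 64-bit finalizer, used here to spread FNV output
// more evenly across the high bits that select a register.
func mix64(x uint64) uint64 {
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// insert hashes the value, picks a register from the top p bits, and keeps
// the largest "leading zeros + 1" seen in the remaining bits.
func (s *sketch) insert(b []byte) {
	h := fnv.New64a()
	h.Write(b)
	x := mix64(h.Sum64())
	idx := x >> (64 - s.p)
	// The sentinel bit caps the leading-zero count at 64-p.
	w := x<<s.p | uint64(1)<<(s.p-1)
	rho := uint8(bits.LeadingZeros64(w)) + 1
	if rho > s.regs[idx] {
		s.regs[idx] = rho
	}
}

// estimate applies the classic HyperLogLog formula, falling back to linear
// counting when the raw estimate is small and some registers are empty.
func (s *sketch) estimate() uint64 {
	m := float64(len(s.regs))
	sum, zeros := 0.0, 0
	for _, r := range s.regs {
		sum += math.Ldexp(1, -int(r))
		if r == 0 {
			zeros++
		}
	}
	e := 0.7213 / (1 + 1.079/m) * m * m / sum
	if e <= 2.5*m && zeros > 0 {
		e = m * math.Log(m/float64(zeros))
	}
	return uint64(e + 0.5)
}

// distinctEstimate inserts the decimal strings 0..n-1, each `copies` times,
// and returns the estimated number of distinct values.
func distinctEstimate(n, copies int) uint64 {
	s := newSketch(14)
	for c := 0; c < copies; c++ {
		for i := 0; i < n; i++ {
			s.insert([]byte(strconv.Itoa(i)))
		}
	}
	return s.estimate()
}

func main() {
	// Duplicates do not change the registers, so both lines print the same
	// estimate, which should be close to 100000.
	fmt.Println(distinctEstimate(100000, 1))
	fmt.Println(distinctEstimate(100000, 3))
}
```

With 2^14 registers the expected relative error is roughly 1.04/sqrt(16384), a bit under 1%, at a fixed memory cost of 16 KiB per sketch in this toy layout.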

There were some ideas floated around that we could avoid this expensive computation altogether for key columns if we were to scan the secondary indexes.
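One possible reading of that idea, sketched below under my own assumptions: if the values of a key column arrive in index order, equal values are adjacent, so an exact distinct count only needs a comparison with the previous value and no hashing at all. The function name `countDistinctSorted` and the sample values are hypothetical and are not taken from the stats code.

```go
package main

import "fmt"

// countDistinctSorted returns the exact number of distinct values in a slice
// that is already sorted, by counting the positions where the value changes.
func countDistinctSorted(sorted []string) int {
	n := 0
	for i, v := range sorted {
		if i == 0 || v != sorted[i-1] {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical values as they might come back from an ordered index scan.
	keys := []string{"ABLE", "ABLE", "BAR", "BAR", "BAR", "OUGHT", "PRES"}
	fmt.Println(countDistinctSorted(keys)) // prints 4
}
```

This only works for columns that lead an index, and it trades the estimation cost for an extra scan of that index, so whether it is a net win would need measuring.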

Related to #135988.

[Image: nil datum alloc]

Jira issue: CRDB-47355

@yuzefovich yuzefovich added A-sql-table-stats Table statistics (and their automatic refresh). C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team labels Feb 8, 2025
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Feb 8, 2025
@rytaft rytaft moved this from Triage to Active in SQL Queries Feb 11, 2025
@mgartner
Collaborator

axiomhq/hyperloglog#43 should be a fairly significant improvement that was introduced in v0.2.1 (we are currently on v0.2.0). So an upgrade might benefit us.

I also found some more minor improvements and submitted a PR here: axiomhq/hyperloglog#50

mgartner added a commit to mgartner/cockroach that referenced this issue Mar 18, 2025
The hyperloglog library has been upgraded from v0.2.0 to v0.2.5. See the
commits in this upgrade here:
axiomhq/hyperloglog@v0.2.0...v0.2.5

Fixes cockroachdb#140772

Release note: None
craig bot pushed a commit that referenced this issue Mar 20, 2025
142979: kvserver: update raft log stats with trunc state r=tbg a=pav-kv

This PR iterates on the raft log truncation code, and makes it more consolidated.

It also fixes one bug: the log size update is now done in the same `Replica.mu` critical section with the `RaftTruncatedState` update. This is achieved by moving the truncated files size computation from post-apply to pre-apply stage. The latter change bears no performance implications because the pre-apply stage already reads from FS when determining whether a truncation affects any sideloaded entries.

Epic: none
Release note: none

143087: go.mod: update hyperloglog to v0.2.5 r=mgartner a=mgartner

The hyperloglog library has been upgraded from v0.2.0 to v0.2.5. See the
commits in this upgrade here:
axiomhq/hyperloglog@v0.2.0...v0.2.5

Fixes #140772

Release note: None


Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
@craig craig bot closed this as completed in 2d300f0 Mar 20, 2025
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Mar 20, 2025