[CALCITE-6728] Introduce new methods to lookup tables and schemas inside schemas #4100

Merged
merged 1 commit into from
Mar 6, 2025

Conversation

kramerul
Contributor

@kramerul kramerul commented Dec 19, 2024

Motivation

For databases with a huge number of schemas and tables, preparing a query takes a long time, because currently all tables and schemas are loaded into memory.

Caching all these schemas and tables is not an option:

  1. It would require a lot of memory.
  2. Cache entries would have to be evicted very often, since it is likely that one of these tables changes every second.

Therefore, we tried to find a way to load only those tables and schemas that are required to prepare a query.

API Changes

This PR introduces a new mechanism to look up tables and schemas within a schema. For this purpose, a new interface is introduced:

public interface Lookup<T> {
  @Nullable T get(String name);
  @Nullable Named<T> getIgnoreCase(String name);
  Set<String> getNames(LikePattern pattern);
}
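
To illustrate the contract, here is a minimal in-memory sketch. It is not the implementation from this PR: `SimpleLookup` and `MapLookup` are hypothetical names, `Map.Entry<String, T>` stands in for `Named<T>`, and a plain `Predicate<String>` stands in for `LikePattern`, so that the example compiles without Calcite on the classpath.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for the Lookup<T> interface above.
interface SimpleLookup<T> {
  T get(String name);                               // exact match, or null
  Map.Entry<String, T> getIgnoreCase(String name);  // stands in for Named<T>
  Set<String> getNames(Predicate<String> filter);   // stands in for LikePattern
}

// Hypothetical in-memory implementation backed by two maps.
final class MapLookup<T> implements SimpleLookup<T> {
  private final Map<String, T> exact = new TreeMap<>();
  // Maps any spelling of a name to the spelling it was first registered with.
  private final Map<String, String> canonicalNames =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  MapLookup<T> put(String name, T value) {
    exact.put(name, value);
    canonicalNames.putIfAbsent(name, name);
    return this;
  }

  @Override public T get(String name) {
    return exact.get(name);
  }

  @Override public Map.Entry<String, T> getIgnoreCase(String name) {
    String canonical = canonicalNames.get(name);
    if (canonical == null) {
      return null;
    }
    // Return the stored spelling together with the value.
    return new AbstractMap.SimpleImmutableEntry<>(canonical, exact.get(canonical));
  }

  @Override public Set<String> getNames(Predicate<String> filter) {
    Set<String> names = new TreeSet<>();
    for (String name : exact.keySet()) {
      if (filter.test(name)) {
        names.add(name);
      }
    }
    return names;
  }
}
```

The point of returning the name together with the value from the case-insensitive method is that the caller learns the actual spelling of the object it found.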

The LikePattern class was extracted from CalciteMetaImpl. It holds a pattern that can be used to query tables and schemas inside a JDBC database using the LIKE operator. It also supports conversion to a Predicate1<String>, which can be used to implement filters in plain Java.
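
As a rough sketch of what such a conversion can look like (this is not the code from LikePattern, and the class name `LikeToPredicate` is hypothetical), a LIKE pattern can be translated to a regular expression where `%` matches any sequence of characters and `_` matches exactly one. Escape characters are deliberately not handled here.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical helper; assumes '%' and '_' are the only wildcards and
// that the pattern contains no escape character.
final class LikeToPredicate {
  private LikeToPredicate() {}

  static Predicate<String> toPredicate(String likePattern) {
    StringBuilder regex = new StringBuilder();
    for (int i = 0; i < likePattern.length(); i++) {
      char c = likePattern.charAt(i);
      if (c == '%') {
        regex.append(".*");          // any sequence, including empty
      } else if (c == '_') {
        regex.append('.');           // exactly one character
      } else {
        // Quote everything else so that regex metacharacters stay literal.
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    Pattern compiled = Pattern.compile(regex.toString(), Pattern.DOTALL);
    return name -> compiled.matcher(name).matches();
  }
}
```

With this sketch, `LikeToPredicate.toPredicate("EMP%")` accepts `EMPLOYEES` and rejects `DEPT`.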

Schema now uses this Lookup interface to find schemas and tables. It could also be extended to functions and types.

public interface Schema {
  default Lookup<Table> tables() {
    ...
  }
  default Lookup<? extends Schema> subSchemas() {
    ...
  }
  ...
}

Implementation

The case-insensitive search is now implemented directly in the specific Schema, using a matching implementation of the Lookup interface. Formerly, it was done in CalciteSchema.

JdbcSchema and JdbcCatalogSchema use a special implementation of Lookup: LoadingCacheLookup. This implementation uses a LoadingCache internally to speed up lookups. If only case-sensitive schema/table lookup is required, it is fast, since DatabaseMetaData#getTables can be used to query a single table. The result is cached inside the LoadingCache for one minute.
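
The caching idea can be sketched with the JDK alone. This is not LoadingCacheLookup itself: `ExpiringLookup` is a hypothetical name, a ConcurrentHashMap with timestamps replaces the LoadingCache, and the loader function stands in for the single-table metadata query. The clock is injected only so that expiry can be exercised without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch of a per-name loading cache with time-based expiry.
final class ExpiringLookup<T> {
  // A loaded value plus the time at which it was loaded.
  private static final class Cached<V> {
    final V value;
    final long loadedAtMillis;

    Cached(V value, long loadedAtMillis) {
      this.value = value;
      this.loadedAtMillis = loadedAtMillis;
    }
  }

  private final Map<String, Cached<T>> cache = new ConcurrentHashMap<>();
  private final Function<String, T> loader;  // assumed: loads one object by exact name
  private final long ttlMillis;
  private final LongSupplier clock;

  ExpiringLookup(Function<String, T> loader, long ttlMillis, LongSupplier clock) {
    this.loader = loader;
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  // Returns the cached value if it is younger than the TTL; otherwise reloads
  // only this one name. The loader must not modify this cache.
  T get(String name) {
    long now = clock.getAsLong();
    Cached<T> entry = cache.compute(name, (key, old) ->
        old != null && now - old.loadedAtMillis < ttlMillis
            ? old
            : new Cached<>(loader.apply(key), now));
    return entry.value;
  }
}
```

A one-minute expiry as described above would correspond to `new ExpiringLookup<>(loader, 60_000L, System::currentTimeMillis)`. Only names that are actually requested are ever loaded, which is where the memory and latency savings come from.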

Unfortunately, DatabaseMetaData#getTables doesn't support case-insensitive queries, so case-insensitive lookups still require loading all database tables.
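
The difference in cost between the two paths can be sketched as follows. The names are hypothetical and the two functions are placeholders for the real metadata queries: `loadOne` represents a query for a single exact name, `loadAllNames` represents the expensive scan over all names.

```java
import java.util.Collection;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: exact lookups load one object, case-insensitive
// lookups first have to scan all names to find the stored spelling.
final class CaseInsensitiveFallback<T> {
  private final Function<String, T> loadOne;
  private final Supplier<Collection<String>> loadAllNames;

  CaseInsensitiveFallback(Function<String, T> loadOne,
      Supplier<Collection<String>> loadAllNames) {
    this.loadOne = loadOne;
    this.loadAllNames = loadAllNames;
  }

  // Cheap path: a single query by exact name.
  T get(String name) {
    return loadOne.apply(name);
  }

  // Expensive path: scan all names, then load the matching one.
  T getIgnoreCase(String name) {
    for (String candidate : loadAllNames.get()) {
      if (candidate.equalsIgnoreCase(name)) {
        return loadOne.apply(candidate);
      }
    }
    return null;
  }
}
```

In this sketch the full scan happens on every case-insensitive call; a real implementation would presumably cache the name list as well.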

The performance gain for huge sets of tables/schemas in database schemas can only be achieved if caching is turned off in Calcite (that is, SimpleCalciteSchema is used instead of CachingCalciteSchema).

I tried to keep the behavior of CachingCalciteSchema exactly the same. This behavior includes that all tables/schemas are loaded into memory. CachedLookup is used to achieve this.

@kramerul kramerul marked this pull request as ready for review January 2, 2025 08:07

github-actions bot commented Feb 2, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 90 days if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@calcite.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 2, 2025
@kramerul
Contributor Author

kramerul commented Feb 3, 2025

We are still interested in getting this PR merged.

@mihaibudiu
Contributor

I will try to review this, although it's in an area where I don't know much about the codebase.

@github-actions github-actions bot removed the stale label Feb 4, 2025
@kramerul
Contributor Author

kramerul commented Feb 4, 2025

I know that this PR is quite large. I discussed with Julian Hyde whether it makes sense to open such a PR. For details, see https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-6728

Contributor

@mihaibudiu mihaibudiu left a comment

Has any benchmarking been done to prove the efficiency of this approach?
I am not an expert in this part of the code, but the PR looks pretty good to me.
I have only made "syntactic" comments.

@@ -7489,7 +7488,7 @@ private void checkGetTimestamp(Connection con) throws SQLException {
aSchema.setCacheEnabled(true);

// explicit should win implicit.
-assertThat(aSchema.getSubSchemaNames(), hasSize(1));
+assertThat(aSchema.subSchemas().getNames(LikePattern.any()), hasSize(1));
Contributor

why change all these tests, can't the original function still be used?

Contributor Author

I first started with a @Deprecated annotation on getSubSchemaNames(). Therefore, I needed to replace its usages in all tests.
Afterwards, I removed the @Deprecated annotation because it might cause trouble in some cases.

I can revert this change if you prefer.

@kramerul
Contributor Author

> Has any benchmarking been done to prove the efficiency of this approach? I am not an expert in this part of the code, but the PR looks pretty good to me. I have only made "syntactic" comments.

This PR only improves performance for huge databases. We are using a database with more than 500000 schemas containing up to 500000 tables.

In such an environment, it takes more than 10 seconds to load all table names from the database. Formerly, this was necessary during the preparation of each query. With the new approach, only the tables involved in the query are loaded from the database, which makes preparation several times faster. It also takes much less memory, because it's no longer necessary to hold a list of all tables in memory (snapshot).

@mihaibudiu
Contributor

Please ask for a new review when you are done

Contributor

@mihaibudiu mihaibudiu left a comment

This is pretty good, I think one more iteration and we can merge it.
Regarding the deprecation, I would trust @asolimando's expertise.

/**
* Returns a table with a given name, or null if not found.
*
* <p>Please use {@link Schema#tables()} and {@link Lookup#get(String)} instead.
Contributor

If these methods are not used anywhere anymore, then it may be good to mark them as deprecated, and document this in history.md.

introduces new methods to lookup tables and sub schemas inside schemas.
The methods used before (`Schema:getTable(String name)`, `Schema:getTableNames()`,
`Schema.getSubSchema(String name)` and `Schema.getSubSchemaNames(String name)`)
have been markes as deprecated.
Contributor

typo

@mihaibudiu
Contributor

I think this is ready for merging; please fix the typo when you squash the commits.

@mihaibudiu mihaibudiu added the LGTM-will-merge-soon Overall PR looks OK. Only minor things left. label Feb 28, 2025
…ide schemas

[CALCITE-6728] Changes due to the PR review

[CALCITE-6728] Changes due to the second PR review

[CALCITE-6728] Fix typo
@kramerul
Contributor Author

kramerul commented Mar 4, 2025

I fixed the typo and squashed all commits.

@mihaibudiu mihaibudiu merged commit efafa4f into apache:main Mar 6, 2025
18 of 19 checks passed