Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28655: Implement HMS Related Drop Stats Changes #5578

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

DanielZhu58
Copy link
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name, 4:string engine) throws
(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
4:InvalidInputException o4)
bool delete_table_column_statistics_req(1: DeleteTableColumnStatisticsRequest req) throws
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we merge the two methods? i.e, merge them to bool delete_column_statistics_req(DeleteColumnStatisticsRequest req)

* catalog name, database name, table name, partition name, column names, and engine name
* @throws TException thrift transport error
*/
public boolean deletePartitionMultiColumnStatistics(DeletePartitionColumnStatisticsRequest req) throws TException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to merge the deletePartitionMultiColumnStatistics and deleteTableMultiColumnStatistics, and keep only one method, i.e, a deleteMultiColumnStatistics(DeleteColumnStatisticsRequest req) for both dropping table statistics and partition statistics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes the user just want to drop the stats in a specific partition.
In some cases, a certain partition stats becomes really huge so we want to drop stats for it only.
The 2 methods are designed for different use cases.

* @throws InvalidObjectException error dropping the stats
* @throws InvalidInputException bad input, such as null table or database name.
*/
boolean deletePartitionColumnStatisticsInBatch(String catName, String dbName, String tableName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between deletePartitionMultiColumnStatistics and deletePartitionColumnStatisticsInBatch, can them merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. I think we can merge these 2 methods.

@@ -1395,9 +1395,50 @@ List<ColumnStatistics> getPartitionColumnStatistics(
* @throws InvalidInputException bad input, such as null table or database name.
*/
boolean deletePartitionColumnStatistics(String catName, String dbName, String tableName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make this method as default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

import java.util.Optional;
import java.util.Set;
import java.util.Stack;
import java.util.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: restore the import

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.


getMS().openTransaction();
try {
List<String> partVals = getPartValsFromName(getMS(), parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName, partName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this partName in thrift defines as optional, should we take care of the nullable partName. To make the interface more generic, can we define the partName as List<String>, so we can delete multiple partitions' statistics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we don't need to delete the stats for multiple partitions, because currently there is no syntax for the user to drop stats for multiple partitions in HQL.
And also in the use cases, it's not very common that stats of multiple adjacent partitions become huge. The user can drop stats of multiple partitions one by one manually.
I think this partName here is not optional.
In hive_metastore.thrift,
struct DeletePartitionColumnStatisticsRequest { 1: required string cat_name, 2: required string db_name, 3: required string tbl_name, 4: required string part_name, 5: optional list<string> col_names, 6: required string engine }

Table table = getMS().getTable(parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName);
// This API looks unused; if it were used we'd need to update stats state and write ID.
// We cannot just randomly nuke some txn stats.
if (TxnUtils.isTransactionalTable(table)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we drop the transactional statistics? can we make the partition stats as incorrect in partition params by StatsSetupConst.COLUMN_STATS_ACCURATE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comments are not written by me. We can mark this issue as to do, and discuss about it later.

new DeletePartitionColumnStatEvent(parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName,
partName, partVals, colName, engine, this));
}
if (!listeners.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this call be outside of the transaction? why do we notify the listener just on success and colNames is not null, will some events be missing?

throw new NoSuchObjectException("Partition " + partName
+ " for which stats deletion is requested doesn't exist");
}
query = pm.newQuery(MPartitionColumnStatistics.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we use this query?

query = pm.newQuery(MPartitionColumnStatistics.class);
if (colNames != null) {
for (String colName : colNames){
deletePartitionColumnStatistics(catName, dbName, tableName, partName, partVals, colName, engine);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we drop all the columns' statistics in a round?

@Aggarwal-Raghav
Copy link
Contributor

@DanielZhu58, the thrift code needs to be generated as there is change in hive_metastore.thrift. Please use -Pthriftif profile.

@Aggarwal-Raghav
Copy link
Contributor

@DanielZhu58, one question, to call the drop stats command it will be something like: ALTER TABLE .. drop col/partition stats... So antlr file changes (HiveParser.g) are required right?

@DanielZhu58
Copy link
Contributor Author

@DanielZhu58, one question, to call the drop stats command it will be something like: ALTER TABLE .. drop col/partition stats... So antlr file changes (HiveParser.g) are required right?

Hi @Aggarwal-Raghav , yes, the method changes in this patch are mainly focused on the HMS API changes. It's only the part of the big change.
There is another patch included the HiveParser.g changes coming soon, authored by @ramesh0201
#5176

@Aggarwal-Raghav
Copy link
Contributor

@DanielZhu58, one question, to call the drop stats command it will be something like: ALTER TABLE .. drop col/partition stats... So antlr file changes (HiveParser.g) are required right?

Hi @Aggarwal-Raghav , yes, the method changes in this patch are mainly focused on the HMS API changes. It's only the part of the big change. There is another patch included the HiveParser.g changes coming soon, authored by @ramesh0201 #5176

ok

MTable mTable = getMTable(catName, dbName, tableName);
MPartitionColumnStatistics mStatsObj;
List<MPartitionColumnStatistics> mStatsObjColl;
if (mTable == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check mTable == null redundant? as we validate the mPartition afterwards

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep this if condition.
Because if mTable is null, the exception will be thrown from here, so we can know the "Table" does not exist.
If the exception is thrown later with the mPartition, we can know the "Partition" does not exist.

query = pm.newQuery(MPartitionColumnStatistics.class);
String filter;
String parameters;
if (colNames != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: colNames != null && !colNames.isEmpty()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

query.setFilter(filter);
query.declareParameters(parameters);
if (colNames != null) {
query.setUnique(true);
Copy link
Member

@dengzhhu653 dengzhhu653 Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we might have multiple columns, there is no guarantee that the query will return only one row, can we just remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Yes, this should be removed.

query.setUnique(true);
for (String colName : colNames){
// trim the extra spaces, and change to lowercase
normalizeIdentifier(colName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to put the normalized column names into new list which will be passed to executeWithArray

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Created a new list normalizedColNames, to pass to executeWithArray.

normalizeIdentifier(colName);
}
if (engine != null) {
mStatsObj =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this query result should be a collection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

+ TableName.getQualified(catName, dbName, tableName) +
" partition=" + partName + " col=" + String.join(", ", colNames));
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we merge the if else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One condition is (colNames != null) and the other one condition is (colNames == null). In this 2 conditions, the JDO queries are different. So it's clearer to not merge them for now.

+ TableName.getQualified(catName, dbName, tableName) + " partition" + partName);
}
}
ret = commitTransaction();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after deleting the column stats, should we mark the COLUMN_STATS_ACCURATE to false in partition's parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I don't have an answer for it now.
We had the feature of dropping column stats for single column before, and the COLUMN_STATS_ACCURATE was not mark to false.
So let's discuss about that later.

query = pm.newQuery(MTableColumnStatistics.class);
String filter;
String parameters;
if (colNames != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: colNames != null and !colNames.isEmpty()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

query.setFilter(filter);
query.declareParameters(parameters);
if (colNames != null) {
query.setUnique(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the unique check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

query.setUnique(true);
for (String colName : colNames){
// trim the extra spaces, and change to lowercase
normalizeIdentifier(colName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the normalize into a new list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

throw new NoSuchObjectException("Column stats doesn't exist for db=" + dbName + " table="
+ tableName + " col=" + String.join(", ", colNames));
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge the if else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasons same as mentioned above.

}
pm.retrieveAll(mStatsObjColl);
if (mStatsObjColl != null) {
pm.deletePersistentAll(mStatsObjColl);
} else {
throw new NoSuchObjectException("Column stats doesn't exist for table="
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this exception should never throw?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this today. I am not sure if I understand this.

normalizeIdentifier(tableName),
normalizeIdentifier(catName),
engine);
(List<MPartitionColumnStatistics>) query.executeWithArray(partName.trim(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: code format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

* @throws InvalidInputException bad input, such as null table or database name.
*/
boolean deletePartitionMultiColumnStatistics(String catName, String dbName, String tableName,
String partName, List<String> partVals, List<String> colNames, String engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the partVals never be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used around ObjectStore.java Line 10266

if (colName != null) {
colNames.add(colName);
}
return deletePartitionMultiColumnStatistics(catName, dbName, tableName, partName, partVals, colNames, engine);
Copy link
Member

@dengzhhu653 dengzhhu653 Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The direct sql path is missing, we'd better pushing the colNames into the direct sql, example:
directSql.deletePartitionColumnStats(catName, dbName, tableName, partName, colNames, engine)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this today. I am not sure if I understand this.

default boolean deletePartitionColumnStatistics(String catName, String dbName, String tableName,
String partName, List<String> partVals, String colName, String engine)
throws NoSuchObjectException, MetaException, InvalidObjectException, InvalidInputException{
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return deletePartitionMultiColumnStatistics(catName, dbName, tableName,
partName, partVals, Arrays.asList(colName), engine);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

dbName, tblName, Lists.newArrayList(colName[1]), ENGINE);
assertTrue("stats are not empty: " + stats, stats.isEmpty());

client.updateTableColumnStatistics(colStats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//case 3: column names contains multiple column names, then delete the column stats in the column name list
status = client.deleteTableMultiColumnStatistics(dbName, tblName, Arrays.asList(colName), ENGINE);
assertTrue(status);
stats = client.getTableColumnStatistics(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get the stats in a round, client.getTableColumnStatistics(dbName, tblName, Lists.newArrayList(colName), ENGINE);

Lists.newArrayList(partName), Lists.newArrayList(colName[1]), ENGINE);
assertTrue("stats are not empty: " + stats2, stats2.isEmpty());
// compare stats obj to ensure what we get is what we wrote
assertNotNull(colStats2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restore the code format

// case 1: column names are null, all column stats should be deleted
status = client.deletePartitionMultiColumnStatistics(dbName, tblName, partName, null, ENGINE);
assertTrue(status);
stats2 = client.getPartitionColumnStatistics(dbName, tblName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: get the all column stats in a round

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants