-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28655: Implement HMS Related Drop Stats Changes #5578
base: master
Are you sure you want to change the base?
Conversation
bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name, 4:string engine) throws | ||
(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3, | ||
4:InvalidInputException o4) | ||
bool delete_table_column_statistics_req(1: DeleteTableColumnStatisticsRequest req) throws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we merge the two methods? i.e, merge them to bool delete_column_statistics_req(DeleteColumnStatisticsRequest req)
* catalog name, database name, table name, partition name, column names, and engine name | ||
* @throws TException thrift transport error | ||
*/ | ||
public boolean deletePartitionMultiColumnStatistics(DeletePartitionColumnStatisticsRequest req) throws TException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to merge the deletePartitionMultiColumnStatistics
and deleteTableMultiColumnStatistics
, and keep only one method, i.e, a deleteMultiColumnStatistics(DeleteColumnStatisticsRequest req)
for both dropping table statistics and partition statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes the user just want to drop the stats in a specific partition.
In some cases, a certain partition stats becomes really huge so we want to drop stats for it only.
The 2 methods are designed for different use cases.
* @throws InvalidObjectException error dropping the stats | ||
* @throws InvalidInputException bad input, such as null table or database name. | ||
*/ | ||
boolean deletePartitionColumnStatisticsInBatch(String catName, String dbName, String tableName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between deletePartitionMultiColumnStatistics
and deletePartitionColumnStatisticsInBatch
, can them merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. I think we can merge these 2 methods.
@@ -1395,9 +1395,50 @@ List<ColumnStatistics> getPartitionColumnStatistics( | |||
* @throws InvalidInputException bad input, such as null table or database name. | |||
*/ | |||
boolean deletePartitionColumnStatistics(String catName, String dbName, String tableName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can make this method as default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
import java.util.Optional; | ||
import java.util.Set; | ||
import java.util.Stack; | ||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: restore the import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
|
||
getMS().openTransaction(); | ||
try { | ||
List<String> partVals = getPartValsFromName(getMS(), parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName, partName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this partName
in thrift defines as optional
, should we take care of the nullable partName
. To make the interface more generic, can we define the partName
as List<String>
, so we can delete multiple partitions' statistics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we don't need to delete the stats for multiple partitions, because currently there is no syntax for the user to drop stats for multiple partitions in HQL.
And also in the use cases, it's not very common that stats of multiple adjacent partitions become huge. The user can drop stats of multiple partitions one by one manually.
I think this partName here is not optional.
In hive_metastore.thrift,
struct DeletePartitionColumnStatisticsRequest { 1: required string cat_name, 2: required string db_name, 3: required string tbl_name, 4: required string part_name, 5: optional list<string> col_names, 6: required string engine }
Table table = getMS().getTable(parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName); | ||
// This API looks unused; if it were used we'd need to update stats state and write ID. | ||
// We cannot just randomly nuke some txn stats. | ||
if (TxnUtils.isTransactionalTable(table)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we drop the transactional statistics? can we make the partition stats as incorrect in partition params by StatsSetupConst.COLUMN_STATS_ACCURATE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comments are not written by me. We can mark this issue as to do, and discuss about it later.
new DeletePartitionColumnStatEvent(parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName, | ||
partName, partVals, colName, engine, this)); | ||
} | ||
if (!listeners.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this call be outside of the transaction
? why do we notify the listener just on success and colNames
is not null, will some events be missing?
throw new NoSuchObjectException("Partition " + partName | ||
+ " for which stats deletion is requested doesn't exist"); | ||
} | ||
query = pm.newQuery(MPartitionColumnStatistics.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we use this query
?
query = pm.newQuery(MPartitionColumnStatistics.class); | ||
if (colNames != null) { | ||
for (String colName : colNames){ | ||
deletePartitionColumnStatistics(catName, dbName, tableName, partName, partVals, colName, engine); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we drop all the columns' statistics in a round?
@DanielZhu58, the thrift code needs to be generated as there is change in hive_metastore.thrift. Please use |
e6bfa1f
to
3cc3276
Compare
Quality Gate passedIssues Measures |
@DanielZhu58, one question, to call the drop stats command it will be something like: |
Hi @Aggarwal-Raghav , yes, the method changes in this patch are mainly focused on the HMS API changes. It's only the part of the big change. |
ok |
3cc3276
to
dcd1438
Compare
dcd1438
to
9ff54ca
Compare
MTable mTable = getMTable(catName, dbName, tableName); | ||
MPartitionColumnStatistics mStatsObj; | ||
List<MPartitionColumnStatistics> mStatsObjColl; | ||
if (mTable == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check mTable == null
redundant? as we validate the mPartition
afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can keep this if condition.
Because if mTable is null, the exception will be thrown from here, so we can know the "Table" does not exist.
If the exception is thrown later with the mPartition, we can know the "Partition" does not exist.
query = pm.newQuery(MPartitionColumnStatistics.class); | ||
String filter; | ||
String parameters; | ||
if (colNames != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: colNames != null && !colNames.isEmpty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
query.setFilter(filter); | ||
query.declareParameters(parameters); | ||
if (colNames != null) { | ||
query.setUnique(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we might have multiple columns, there is no guarantee that the query will return only one row, can we just remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged. Yes, this should be removed.
query.setUnique(true); | ||
for (String colName : colNames){ | ||
// trim the extra spaces, and change to lowercase | ||
normalizeIdentifier(colName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to put the normalized column names into new list which will be passed to executeWithArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged. Created a new list normalizedColNames
, to pass to executeWithArray
.
normalizeIdentifier(colName); | ||
} | ||
if (engine != null) { | ||
mStatsObj = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this query result should be a collection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
+ TableName.getQualified(catName, dbName, tableName) + | ||
" partition=" + partName + " col=" + String.join(", ", colNames)); | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we merge the if else
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One condition is (colNames != null) and the other one condition is (colNames == null). In this 2 conditions, the JDO queries are different. So it's clearer to not merge them for now.
+ TableName.getQualified(catName, dbName, tableName) + " partition" + partName); | ||
} | ||
} | ||
ret = commitTransaction(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after deleting the column stats, should we mark the COLUMN_STATS_ACCURATE
to false in partition's parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I don't have an answer for it now.
We had the feature of dropping column stats for single column before, and the COLUMN_STATS_ACCURATE
was not mark to false.
So let's discuss about that later.
query = pm.newQuery(MTableColumnStatistics.class); | ||
String filter; | ||
String parameters; | ||
if (colNames != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: colNames != null and !colNames.isEmpty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
query.setFilter(filter); | ||
query.declareParameters(parameters); | ||
if (colNames != null) { | ||
query.setUnique(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the unique check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
query.setUnique(true); | ||
for (String colName : colNames){ | ||
// trim the extra spaces, and change to lowercase | ||
normalizeIdentifier(colName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the normalize into a new list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
throw new NoSuchObjectException("Column stats doesn't exist for db=" + dbName + " table=" | ||
+ tableName + " col=" + String.join(", ", colNames)); | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge the if else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reasons same as mentioned above.
bb24404
to
68a5561
Compare
} | ||
pm.retrieveAll(mStatsObjColl); | ||
if (mStatsObjColl != null) { | ||
pm.deletePersistentAll(mStatsObjColl); | ||
} else { | ||
throw new NoSuchObjectException("Column stats doesn't exist for table=" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this exception should never throw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss this today. I am not sure if I understand this.
normalizeIdentifier(tableName), | ||
normalizeIdentifier(catName), | ||
engine); | ||
(List<MPartitionColumnStatistics>) query.executeWithArray(partName.trim(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: code format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
* @throws InvalidInputException bad input, such as null table or database name. | ||
*/ | ||
boolean deletePartitionMultiColumnStatistics(String catName, String dbName, String tableName, | ||
String partName, List<String> partVals, List<String> colNames, String engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the partVals
never be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used around ObjectStore.java Line 10266
if (colName != null) { | ||
colNames.add(colName); | ||
} | ||
return deletePartitionMultiColumnStatistics(catName, dbName, tableName, partName, partVals, colNames, engine); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The direct sql path is missing, we'd better pushing the colNames
into the direct sql, example:
directSql.deletePartitionColumnStats(catName, dbName, tableName, partName, colNames, engine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss this today. I am not sure if I understand this.
default boolean deletePartitionColumnStatistics(String catName, String dbName, String tableName, | ||
String partName, List<String> partVals, String colName, String engine) | ||
throws NoSuchObjectException, MetaException, InvalidObjectException, InvalidInputException{ | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return deletePartitionMultiColumnStatistics(catName, dbName, tableName,
partName, partVals, Arrays.asList(colName), engine);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged.
dbName, tblName, Lists.newArrayList(colName[1]), ENGINE); | ||
assertTrue("stats are not empty: " + stats, stats.isEmpty()); | ||
|
||
client.updateTableColumnStatistics(colStats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have tested the single column case at https://github.com/apache/hive/pull/5578/files#diff-120e8efba348e9a8c233de4135cdcc4aaeb6bc0fe2540d9d952554af2c05fdecR1867
//case 3: column names contains multiple column names, then delete the column stats in the column name list | ||
status = client.deleteTableMultiColumnStatistics(dbName, tblName, Arrays.asList(colName), ENGINE); | ||
assertTrue(status); | ||
stats = client.getTableColumnStatistics( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get the stats in a round, client.getTableColumnStatistics(dbName, tblName, Lists.newArrayList(colName), ENGINE);
Lists.newArrayList(partName), Lists.newArrayList(colName[1]), ENGINE); | ||
assertTrue("stats are not empty: " + stats2, stats2.isEmpty()); | ||
// compare stats obj to ensure what we get is what we wrote | ||
assertNotNull(colStats2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restore the code format
// case 1: column names are null, all column stats should be deleted | ||
status = client.deletePartitionMultiColumnStatistics(dbName, tblName, partName, null, ENGINE); | ||
assertTrue(status); | ||
stats2 = client.getPartitionColumnStatistics(dbName, tblName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: get the all column stats in a round
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
Is the change a dependency upgrade?
How was this patch tested?