Refactoring for an extensible Index API #443
Conversation
c90d93d to bfeddf1
No major comments other than naming :)
Could you rebase the change & update the PR description to capture the public API change? e.g.
before:
hs.createIndex(df, IndexConfig("indexName", Seq("indexedCol"), Seq("includedCol")))
after:
hs.createIndex(df, CoveringIndexConfig( ...
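For illustration only (not part of the original comment), the completed "after" call would presumably mirror the old signature, e.g.:
hs.createIndex(df, CoveringIndexConfig("indexName", Seq("indexedCol"), Seq("includedCol")))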
@@ -97,7 +97,8 @@ object RuleUtils {
      val deletedBytesRatio = 1 - commonBytes / entry.sourceFilesSizeInBytes.toFloat

      val deletedCnt = entry.sourceFileInfoSet.size - commonCnt
-     val isAppendAndDeleteCandidate = hybridScanDeleteEnabled && entry.hasLineageColumn &&
+     val isAppendAndDeleteCandidate = hybridScanDeleteEnabled &&
FYI, this change moved to CandidateIndexCollector.
This is a very big PR with lots of changes and mixed concerns. I tried to cover it as best as I could.
Could you split it into multiple PRs?
I suggest at least creating a separate PR for the Python code. There are also the changes related to additional stats that could be a separate PR.
WDYT?
- Introduce common interfaces for indexes with which Hyperspace can manage various types of indexes.
- Adjust IndexStatistics so that implementation-specific fields can be added. For instance, included columns are now one such field.
- Actions work with generic indexes, not just covering indexes, which are the only type we support at the moment.
- Existing rules only work with covering indexes. New rules will be added along with new index types.

Breaking changes:
- The serialization format of CoveringIndex is changed.
- IndexConfig is now a trait. To create a covering index, use CoveringIndexConfig.
- The format of IndexStatistics is changed, which means the format of the dataframe returned by Hyperspace.indexes is also changed.
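A minimal sketch of the configuration change described above; everything beyond the class names mentioned in the description is an assumption, not the actual Hyperspace definitions:

// IndexConfig is now a trait implemented by index-type-specific configs.
trait IndexConfig {
  def indexName: String
}

// Covering indexes get their own concrete config; the constructor shape is assumed
// to mirror the previous IndexConfig case class.
case class CoveringIndexConfig(
    indexName: String,
    indexedColumns: Seq[String],
    includedColumns: Seq[String])
  extends IndexConfig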
I tried to split it, but the changes for the Python code and the additional stats are too small (< 100 lines), so it didn't make this PR much smaller. Also, to extract the Python changes, the renaming of IndexConfig to CoveringIndexConfig would have to be done together. But that creates more changes than needed, because every occurrence of IndexConfig would have to be changed unless the IndexConfig trait, which only exists in this PR, were already there.
A PR with 100 lines is the perfect PR to review 😁. Anyway, I understand that it is hard to split and I'll try harder to understand the changes. In the meantime, could you at least mark the pieces of code that were just moved from one place to another with some comments? Thanks.
Actually, ~20 lines and ~60 lines respectively, and the main point was that it didn't reduce the size of this PR much. It's like going from 1100 lines of changes to 1000 lines. I'll add some comments to help reviewers.
And restore CoveringIndexConfig to IndexConfig, to keep user interface compatibility.
Decided to keep
protected def write(spark: SparkSession, df: DataFrame, indexConfig: IndexConfig): Unit = {
  val numBuckets = numBucketsForIndex(spark)

  val (indexDataFrame, resolvedIndexedColumns, _) =
    prepareIndexDataFrame(spark, df, indexConfig)

  // Run job
  val repartitionedIndexDataFrame = {
    // We are repartitioning with normalized columns (e.g., flattened nested column).
    indexDataFrame.repartition(numBuckets, resolvedIndexedColumns.map(_.toNormalizedColumn): _*)
  }

  // Save the index with the number of buckets specified.
  repartitionedIndexDataFrame.write
    .saveWithBuckets(
      repartitionedIndexDataFrame,
      indexDataPath.toString,
      numBuckets,
      resolvedIndexedColumns.map(_.normalizedName),
      SaveMode.Overwrite)
}
Moved to CoveringIndex.write
def createIndexData(
    ctx: IndexerContext,
    sourceData: DataFrame,
    indexedColumns: Seq[String],
    includedColumns: Seq[String],
    hasLineageColumn: Boolean): (DataFrame, Seq[ResolvedColumn], Seq[ResolvedColumn]) = {
  val spark = ctx.spark
  val (resolvedIndexedColumns, resolvedIncludedColumns) =
    resolveConfig(sourceData, indexedColumns, includedColumns)
  val projectColumns = (resolvedIndexedColumns ++ resolvedIncludedColumns).map(_.toColumn)

  val indexData =
    if (hasLineageColumn) {
      val relation = IndexUtils.getRelation(spark, sourceData.queryExecution.optimizedPlan)

      // Lineage is captured using two sets of columns:
      // 1. DATA_FILE_ID_COLUMN column contains source data file id for each index record.
      // 2. If source data is partitioned, all partitioning key(s) are added to index schema
      //    as columns if they are not already part of the schema.
      val partitionColumnNames = relation.partitionSchema.map(_.name)
      val resolvedColumnNames = (resolvedIndexedColumns ++ resolvedIncludedColumns).map(_.name)
      val missingPartitionColumns =
        partitionColumnNames
          .filter(ResolverUtils.resolve(spark, _, resolvedColumnNames).isEmpty)
          .map(col)

      // File id value in DATA_FILE_ID_COLUMN column (lineage column) is stored as a
      // Long data type value. Each source data file has a unique file id, assigned by
      // Hyperspace. We populate lineage column by joining these file ids with index records.
      // The normalized path of source data file for each record is the join key.
      // We normalize paths by removing extra preceding `/` characters in them,
      // similar to the way they are stored in Content in an IndexLogEntry instance.
      // Path normalization example:
      //  - Original raw path (output of input_file_name() udf, before normalization):
      //    + file:///C:/hyperspace/src/test/part-00003.snappy.parquet
      //  - Normalized path (used in join):
      //    + file:/C:/hyperspace/src/test/part-00003.snappy.parquet
      import spark.implicits._
      val dataPathColumn = "_data_path"
      val lineagePairs = relation.lineagePairs(ctx.fileIdTracker)
      val lineageDF = lineagePairs.toDF(dataPathColumn, IndexConstants.DATA_FILE_NAME_ID)

      sourceData
        .withColumn(dataPathColumn, input_file_name())
        .join(lineageDF.hint("broadcast"), dataPathColumn)
        .select(projectColumns ++ missingPartitionColumns :+ col(
          IndexConstants.DATA_FILE_NAME_ID): _*)
    } else {
      sourceData.select(projectColumns: _*)
    }

  (indexData, resolvedIndexedColumns, resolvedIncludedColumns)
}

private def resolveConfig(
    df: DataFrame,
    indexedColumns: Seq[String],
    includedColumns: Seq[String]): (Seq[ResolvedColumn], Seq[ResolvedColumn]) = {
  val spark = df.sparkSession
  val plan = df.queryExecution.analyzed
  val resolvedIndexedColumns = ResolverUtils.resolve(spark, indexedColumns, plan)
  val resolvedIncludedColumns = ResolverUtils.resolve(spark, includedColumns, plan)

  (resolvedIndexedColumns, resolvedIncludedColumns) match {
    case (Some(indexed), Some(included)) => (indexed, included)
    case _ =>
      val unresolvedColumns = (indexedColumns ++ includedColumns)
        .map(c => (c, ResolverUtils.resolve(spark, Seq(c), plan).map(_.map(_.name))))
        .collect { case (c, r) if r.isEmpty => c }
      throw HyperspaceException(
        s"Columns '${unresolvedColumns.mkString(",")}' could not be resolved " +
          s"from available source columns:\n${df.schema.treeString}")
  }
}
Moved from CreateActionBase
def hasLineageColumn: Boolean = {
  derivedDataset.properties.properties.getOrElse(
    IndexConstants.LINEAGE_PROPERTY, IndexConstants.INDEX_LINEAGE_ENABLED_DEFAULT).toBoolean
}

def hasParquetAsSourceFormat: Boolean = {
  relations.head.fileFormat.equals("parquet") ||
    derivedDataset.properties.properties.getOrElse(
      IndexConstants.HAS_PARQUET_AS_SOURCE_FORMAT_PROPERTY, "false").toBoolean
}
Moved to CoveringIndex
def bucketSpec: BucketSpec =
  BucketSpec(
    numBuckets = numBuckets,
    bucketColumnNames = indexedColumns,
    sortColumnNames = indexedColumns)
Moved to CoveringIndex
def schema: StructType =
  DataType.fromJson(derivedDataset.properties.schemaString).asInstanceOf[StructType]
Moved to CoveringIndex
    content.root.equals(that.content.root) &&
      source.equals(that.source) &&
      properties.equals(that.properties) &&
      state.equals(that.state)
  case _ => false
}

def numBuckets: Int = derivedDataset.properties.numBuckets
Moved to CoveringIndex
import org.apache.spark.sql.DataFrame

import com.microsoft.hyperspace.util.HyperspaceConf

/**
 * IndexConfig specifies the configuration of an index.
Could you revise the comment to indicate that IndexConfig is only for the covering index?
Can we move this IndexConfig (and the trait) under com.microsoft.hyperspace.index.configs (or com.microsoft.hyperspace.index.types? not sure which way is better)?
We can define the package object below for backward compatibility if we use a new package name:

package com.microsoft.hyperspace

package object index {
  val IndexConfig = configs.IndexConfig
  type IndexConfig = configs.IndexConfig
}
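For illustration (hypothetical user code, not from this PR), such an alias would keep existing call sites compiling unchanged:

// Resolves through the package object even though the class itself lives in `configs`.
import com.microsoft.hyperspace.index.IndexConfig

val config = IndexConfig("myIndex", Seq("indexedCol"), Seq("includedCol"))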
Could you revise the comment to indicate that IndexConfig is only for the covering index?
Done
Regarding the namespace, please see my comment below.
@@ -131,6 +131,7 @@ object RuleUtils {
      index: IndexLogEntry,
      plan: LogicalPlan,
      useBucketSpec: Boolean): LogicalPlan = {
+   val ci = index.derivedDataset.asInstanceOf[CoveringIndex]
How can we extend these transformPlanToUse* functions for other index types?
Can we rename RuleUtils to CoveringIndexApplicator or CoveringIndexApplier?
Or move these functions to CoveringIndex? Or any other ideas?
In this PR I'm trying to minimize the size of changes. In another PR, I'll move remaining covering index related code close to CoveringIndex.
import com.microsoft.hyperspace.util.ResolverUtils
import com.microsoft.hyperspace.util.ResolverUtils.ResolvedColumn

case class CoveringIndex(
Can we create a new package, com.microsoft.hyperspace.index.types?
I think horizontal packaging is better: putting CoveringIndex and IndexConfig (CoveringIndexConfig) in the same package because they are closely related.
In my working PR for data skipping indexes, I've created com.microsoft.hyperspace.index.dataskipping for classes such as DataSkippingIndex and DataSkippingIndexConfig.
This is similar to how source implementations are put in their own packages, e.g. com.microsoft.hyperspace.index.sources.delta has Delta Lake related implementations.
Sorry, there were comments I missed. Resolved some outdated comments regarding existing/moved code.
The property is about the source and independent of index implementations.
A separate PR would be more appropriate.
    override val indexedColumns: Seq[String],
    includedColumns: Seq[String],
    // See schemaJson for more information about the annotation.
    @JsonDeserialize(converter = classOf[StructTypeConverter.T]) schema: StructType,
Can we just use schemaString, as discussed in #456? Other than this, LGTM! + please rebase onto master.
+1 if @clee704 is also fine with it :)
Note that those macros are optional for this to work. Do you still prefer schemaString over just schema?
It seems schemaString has been used instead of just schema because of the Jackson serialization issue. Shouldn't we fix the issue in the right way, instead of working around it? In the entire code base, the string is constantly being converted to StructType with DataType.fromJson. With a few simple annotations, this is not necessary. Having a string in the field and converting it to a typed object every time it is accessed seems unusual and raises my eyebrows whenever I think about it. There should be other reasons than the Jackson issue to justify it.
I admit the practical gain might not be great, but it costs almost nothing. Also, however small, the gain is on the user's side, whereas the cost is on our side. Since the user base is orders of magnitude larger than us developing Hyperspace, shouldn't we prioritize them?
We already have many shim files, and they don't contribute much to the overall complexity of the project, as they are independent of each other and we can inspect them individually. It's like how linear complexity is cheap compared to the quadratic or exponential complexity found in other parts of the codebase.
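For readers following the annotation discussion, a minimal sketch of what such a Jackson converter could look like; the object and class names here, and the assumption that the schema is serialized as a JSON string, are mine, and the actual StructTypeConverter shim in this PR may differ:

import com.fasterxml.jackson.databind.util.StdConverter
import org.apache.spark.sql.types.{DataType, StructType}

object StructTypeJsonShim {
  // Converts the stored JSON string back into a typed StructType during deserialization.
  class FromJson extends StdConverter[String, StructType] {
    override def convert(value: String): StructType =
      DataType.fromJson(value).asInstanceOf[StructType]
  }

  // Converts a StructType into its JSON string representation during serialization.
  class ToJson extends StdConverter[StructType, String] {
    override def convert(value: StructType): String = value.json
  }
}

With converters like these attached via @JsonSerialize/@JsonDeserialize on the schema field, callers can hold a StructType directly instead of re-parsing a string with DataType.fromJson on every access.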
How about Relation.dataSchemaJson? Do we want to make it consistent? I would let @sezruby make the final decision.
+1 for Relation.dataSchemaJson? I'm okay w/ Shim + StructType if you prefer that.
Let's handle that in another PR as it is not related.
LGTM, thanks for the great work, @clee704!
It would be good to capture the list of remaining tasks for refactoring in the PR description, for reference.
TODO
What is the context for this pull request?
What changes were proposed in this pull request?
- Introduce common interfaces for indexes with which Hyperspace can manage various types of indexes.
- Adjust IndexStatistics so that implementation-specific fields can be added. For instance, included columns are now one such field.
- Actions work with generic indexes, not just covering indexes, which are the only type we support at the moment.
- Existing rules only work with covering indexes. New rules will be added along with new index types.
Does this PR introduce any user-facing change?
Yes (breaking changes):
- The serialization format of CoveringIndex is changed.
- IndexConfig is now a trait. To create a covering index, use CoveringIndexConfig.
- The format of IndexStatistics is changed, which means the format of the dataframe returned by Hyperspace.indexes is also changed.
How was this patch tested?
With existing unit tests