This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Refactoring for an extensible Index API #443

Merged: 23 commits into microsoft:master from the ds branch on Jun 9, 2021

Conversation

@clee704 commented May 17, 2021

What is the context for this pull request?

TODO

What changes were proposed in this pull request?

  • Introduce common interfaces for indexes, with which Hyperspace can
    manage various types of indexes (see the illustrative sketch after this
    list).
  • Adjust IndexStatistics so that implementation-specific fields can be
    added. For instance, included columns are now one such field.
  • Actions now work with generic indexes, not just covering indexes, which
    are the only type supported at the moment.
  • Existing rules still work only with covering indexes. New rules will be
    added along with new index types.
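
To make the first bullet concrete, here is an illustrative sketch of the kind of common interface such a refactoring could introduce; the trait name and members below are assumptions made for the example, not necessarily the exact API in this PR.

// Hypothetical sketch only: a common interface that every index type implements,
// so that actions and IndexStatistics can work with indexes generically.
trait Index {
  // A short identifier for the index type, e.g. "CoveringIndex".
  def kind: String
  // Columns the index is built on.
  def indexedColumns: Seq[String]
  // Implementation-specific fields surfaced through IndexStatistics
  // (for a covering index this could include the included columns).
  def statistics: Map[String, String]
}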

Does this PR introduce any user-facing change?

  • The serialization format of CoveringIndex, and thus of IndexLogEntry, has changed.
  • The format of IndexStatistics has changed, which means the format of the
    DataFrame returned by Hyperspace.indexes has also changed.
  • The user-facing interface for creating and managing indexes is unchanged.

How was this patch tested?

With existing unit tests

@clee704 changed the title from "[WIP] Data skipping index" to "[WIP] Data skipping indexes" on May 17, 2021
@clee704 force-pushed the ds branch 2 times, most recently from c90d93d to bfeddf1, on May 28, 2021 12:55
@clee704 changed the title from "[WIP] Data skipping indexes" to "Refactoring for an extensible Index API" on May 28, 2021
@clee704 requested a review from sezruby on May 28, 2021 13:41
@clee704 marked this pull request as ready for review on May 28, 2021 13:52
@sezruby (Collaborator) left a comment

No major comments other than naming :)

Could you rebase the change & update the PR description to capture the public API change? e.g.

before:
hs.createIndex(df, IndexConfig("indexName", Seq("indexedCol"), Seq("includedCol")))
after:
hs.createIndex(df, CoveringIndexConfig( ...
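
For context, a slightly fuller, hypothetical version of the same call; the CoveringIndexConfig constructor is assumed here to mirror IndexConfig's (indexName, indexedColumns, includedColumns) shape and may differ from the final API:

// Before: IndexConfig implicitly meant a covering index.
hs.createIndex(df, IndexConfig("myIndex", indexedColumns = Seq("colA"), includedColumns = Seq("colB")))

// After: the index type is explicit in the name of the config class.
hs.createIndex(df, CoveringIndexConfig("myIndex", indexedColumns = Seq("colA"), includedColumns = Seq("colB")))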

@@ -97,7 +97,8 @@ object RuleUtils {
     val deletedBytesRatio = 1 - commonBytes / entry.sourceFilesSizeInBytes.toFloat

     val deletedCnt = entry.sourceFileInfoSet.size - commonCnt
-    val isAppendAndDeleteCandidate = hybridScanDeleteEnabled && entry.hasLineageColumn &&
+    val isAppendAndDeleteCandidate = hybridScanDeleteEnabled &&
Collaborator

fyi change moved to CandidateIndexCollector

@andrei-ionescu (Contributor) left a comment

This is a very big PR with lots of changes and mixed concerns. I tried to cover it as best as I could.

Could you split it into multiple PRs?

I suggest at least creating a separate PR for the Python code. The changes related to the additional stats could also be a separate PR.

WDYT?

- Introduce common interfaces for indexes, with which Hyperspace can
  manage various types of indexes.
- Adjust IndexStatistics so that implementation-specific fields can be
  added. For instance, included columns are now one such field.
- Actions now work with generic indexes, not just covering indexes, which
  are the only type supported at the moment.
- Existing rules still work only with covering indexes. New rules will be
  added along with new index types.

Breaking changes:
- The serialization format of CoveringIndex has changed.
- IndexConfig is now a trait. To create a covering index, use
  CoveringIndexConfig.
- The format of IndexStatistics has changed. This means the format of the
  DataFrame returned by Hyperspace.indexes has also changed.
@clee704 (Author) commented Jun 3, 2021

> This is a very big PR with lots of changes and mixed concerns. I tried to cover it as best as I could.
>
> Could you split it into multiple PRs?
>
> I suggest at least creating a separate PR for the Python code. The changes related to the additional stats could also be a separate PR.
>
> WDYT?

I tried to split it, but the changes for the Python code and the additional stats are too small (< 100 lines), so splitting them out didn't make this PR much smaller. Also, to extract the Python changes, the renaming of IndexConfig to CoveringIndexConfig would have to be done at the same time, and that creates more changes than needed, because every occurrence of IndexConfig would have to be changed unless the IndexConfig trait exists, and that trait is only introduced in this PR.

@andrei-ionescu (Contributor) left a comment

@clee704

A PR with 100 lines is the perfect PR to review 😁.

Anyway, I understand that it is hard to split, and I'll try harder to understand the changes. In the meantime, could you at least mark the pieces of code that were just moved from one place to another with some comments? Thanks.

@clee704 (Author) commented Jun 4, 2021

> @clee704
>
> A PR with 100 lines is the perfect PR to review 😁.
>
> Anyway, I understand that it is hard to split, and I'll try harder to understand the changes. In the meantime, could you at least mark the pieces of code that were just moved from one place to another with some comments? Thanks.

Actually, it's ~20 lines and ~60 lines respectively, and the main point was that splitting didn't reduce the size of this PR much. It would be like turning 1000 lines of changes into 1100.

I'll add some comments to help reviewers.

Chungmin Lee added 2 commits June 4, 2021 16:29
And restore CoveringIndexConfig to IndexConfig, to keep the user interface compatibility
@clee704 (Author) commented Jun 4, 2021

I decided to keep IndexConfig and renamed the IndexConfig trait to IndexConfigTrait. The names might change in v1.0.
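
To illustrate the resulting shape, here is a rough sketch of how the kept class and the renamed trait might relate; the members shown are assumptions for the example, not the exact definitions in this PR:

// Hypothetical sketch: the trait is the extension point for new index types,
// while the existing IndexConfig keeps its name as the covering-index config.
trait IndexConfigTrait {
  def indexName: String
  def referencedColumns: Seq[String]
}

case class IndexConfig(
    indexName: String,
    indexedColumns: Seq[String],
    includedColumns: Seq[String] = Nil)
  extends IndexConfigTrait {
  override def referencedColumns: Seq[String] = indexedColumns ++ includedColumns
}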

Resolved (outdated) review comments on build.sbt, project/plugins.sbt, and python/run-tests.py.
Comment on lines -111 to -131
protected def write(spark: SparkSession, df: DataFrame, indexConfig: IndexConfig): Unit = {
  val numBuckets = numBucketsForIndex(spark)

  val (indexDataFrame, resolvedIndexedColumns, _) =
    prepareIndexDataFrame(spark, df, indexConfig)

  // Run job
  val repartitionedIndexDataFrame = {
    // We are repartitioning with normalized columns (e.g., flattened nested column).
    indexDataFrame.repartition(numBuckets, resolvedIndexedColumns.map(_.toNormalizedColumn): _*)
  }

  // Save the index with the number of buckets specified.
  repartitionedIndexDataFrame.write
    .saveWithBuckets(
      repartitionedIndexDataFrame,
      indexDataPath.toString,
      numBuckets,
      resolvedIndexedColumns.map(_.normalizedName),
      SaveMode.Overwrite)
}
Author

Moved to CoveringIndex.write

Comment on lines 208 to 281
def createIndexData(
    ctx: IndexerContext,
    sourceData: DataFrame,
    indexedColumns: Seq[String],
    includedColumns: Seq[String],
    hasLineageColumn: Boolean): (DataFrame, Seq[ResolvedColumn], Seq[ResolvedColumn]) = {
  val spark = ctx.spark
  val (resolvedIndexedColumns, resolvedIncludedColumns) =
    resolveConfig(sourceData, indexedColumns, includedColumns)
  val projectColumns = (resolvedIndexedColumns ++ resolvedIncludedColumns).map(_.toColumn)

  val indexData =
    if (hasLineageColumn) {
      val relation = IndexUtils.getRelation(spark, sourceData.queryExecution.optimizedPlan)

      // Lineage is captured using two sets of columns:
      // 1. DATA_FILE_ID_COLUMN column contains source data file id for each index record.
      // 2. If source data is partitioned, all partitioning key(s) are added to index schema
      //    as columns if they are not already part of the schema.
      val partitionColumnNames = relation.partitionSchema.map(_.name)
      val resolvedColumnNames = (resolvedIndexedColumns ++ resolvedIncludedColumns).map(_.name)
      val missingPartitionColumns =
        partitionColumnNames
          .filter(ResolverUtils.resolve(spark, _, resolvedColumnNames).isEmpty)
          .map(col)

      // File id value in DATA_FILE_ID_COLUMN column (lineage column) is stored as a
      // Long data type value. Each source data file has a unique file id, assigned by
      // Hyperspace. We populate lineage column by joining these file ids with index records.
      // The normalized path of source data file for each record is the join key.
      // We normalize paths by removing extra preceding `/` characters in them,
      // similar to the way they are stored in Content in an IndexLogEntry instance.
      // Path normalization example:
      //  - Original raw path (output of input_file_name() udf, before normalization):
      //    + file:///C:/hyperspace/src/test/part-00003.snappy.parquet
      //  - Normalized path (used in join):
      //    + file:/C:/hyperspace/src/test/part-00003.snappy.parquet
      import spark.implicits._
      val dataPathColumn = "_data_path"
      val lineagePairs = relation.lineagePairs(ctx.fileIdTracker)
      val lineageDF = lineagePairs.toDF(dataPathColumn, IndexConstants.DATA_FILE_NAME_ID)

      sourceData
        .withColumn(dataPathColumn, input_file_name())
        .join(lineageDF.hint("broadcast"), dataPathColumn)
        .select(projectColumns ++ missingPartitionColumns :+ col(
          IndexConstants.DATA_FILE_NAME_ID): _*)
    } else {
      sourceData.select(projectColumns: _*)
    }

  (indexData, resolvedIndexedColumns, resolvedIncludedColumns)
}

private def resolveConfig(
    df: DataFrame,
    indexedColumns: Seq[String],
    includedColumns: Seq[String]): (Seq[ResolvedColumn], Seq[ResolvedColumn]) = {
  val spark = df.sparkSession
  val plan = df.queryExecution.analyzed
  val resolvedIndexedColumns = ResolverUtils.resolve(spark, indexedColumns, plan)
  val resolvedIncludedColumns = ResolverUtils.resolve(spark, includedColumns, plan)

  (resolvedIndexedColumns, resolvedIncludedColumns) match {
    case (Some(indexed), Some(included)) => (indexed, included)
    case _ =>
      val unresolvedColumns = (indexedColumns ++ includedColumns)
        .map(c => (c, ResolverUtils.resolve(spark, Seq(c), plan).map(_.map(_.name))))
        .collect { case (c, r) if r.isEmpty => c }
      throw HyperspaceException(
        s"Columns '${unresolvedColumns.mkString(",")}' could not be resolved " +
          s"from available source columns:\n${df.schema.treeString}")
  }
}
Author

Moved from CreateActionBase
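
For orientation, a hypothetical call site for the moved method; the owning object and the argument values below are made up for illustration and may not match the PR exactly:

// Assuming createIndexData now lives on the CoveringIndex companion object.
val (indexData, indexedCols, includedCols) =
  CoveringIndex.createIndexData(
    ctx,
    sourceData = df,
    indexedColumns = Seq("colA"),
    includedColumns = Seq("colB"),
    hasLineageColumn = true)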

Comment on lines 550 to 549
def hasLineageColumn: Boolean = {
  derivedDataset.properties.properties.getOrElse(
    IndexConstants.LINEAGE_PROPERTY, IndexConstants.INDEX_LINEAGE_ENABLED_DEFAULT).toBoolean
}

def hasParquetAsSourceFormat: Boolean = {
  relations.head.fileFormat.equals("parquet") ||
    derivedDataset.properties.properties.getOrElse(
      IndexConstants.HAS_PARQUET_AS_SOURCE_FORMAT_PROPERTY, "false").toBoolean
}

Author

Moved to CoveringIndex

Comment on lines -518 to -512
def bucketSpec: BucketSpec =
  BucketSpec(
    numBuckets = numBuckets,
    bucketColumnNames = indexedColumns,
    sortColumnNames = indexedColumns)

Author

Moved to CoveringIndex

Comment on lines -447 to -438
def schema: StructType =
  DataType.fromJson(derivedDataset.properties.schemaString).asInstanceOf[StructType]

Author

Moved to CoveringIndex

        content.root.equals(that.content.root) &&
        source.equals(that.source) &&
        properties.equals(that.properties) &&
        state.equals(that.state)
    case _ => false
  }

  def numBuckets: Int = derivedDataset.properties.numBuckets
Author

Moved to CoveringIndex

import org.apache.spark.sql.DataFrame

import com.microsoft.hyperspace.util.HyperspaceConf

/**
* IndexConfig specifies the configuration of an index.
Collaborator

Could you revise the comment to indicate that IndexConfig is only for Covering index?

Collaborator

Can we move this IndexConfig (& trait) under com.microsoft.hyperspace.index.configs (or com.microsoft.hyperspace.index.types? Not sure which way is better.)
We can define the package object below for backward compatibility if we use a new package name:

package com.microsoft.hyperspace

package object index {
  val IndexConfig = configs.IndexConfig
  type IndexConfig = configs.IndexConfig
}

Author

> Could you revise the comment to indicate that IndexConfig is only for Covering index?

Done

Author

Regarding the namespace, please see my comment below.

@@ -131,6 +131,7 @@ object RuleUtils {
       index: IndexLogEntry,
       plan: LogicalPlan,
       useBucketSpec: Boolean): LogicalPlan = {
+    val ci = index.derivedDataset.asInstanceOf[CoveringIndex]
Collaborator

How can we extend these transformPlanToUse* functions for other index types?
Can we rename RuleUtils to CoveringIndexApplicator or CoveringIndexApplier, or move these functions to CoveringIndex? Or any other ideas?

@clee704 (Author) commented Jun 7, 2021

In this PR I'm trying to minimize the size of the changes. In another PR, I'll move the remaining covering-index-related code closer to CoveringIndex.

import com.microsoft.hyperspace.util.ResolverUtils
import com.microsoft.hyperspace.util.ResolverUtils.ResolvedColumn

case class CoveringIndex(
Collaborator

Can we create a new package? com.microsoft.hyperspace.index.types?

@clee704 (Author) commented Jun 7, 2021

I think horizontal packaging is better: putting CoveringIndex and IndexConfig (CoveringIndexConfig) in the same package because they are closely related.

In my working PR for data skipping indexes, I've created com.microsoft.hyperspace.index.dataskipping for classes such as DataSkippingIndex and DataSkippingIndexConfig.

This is similar to how source implementations are put in their own packages, e.g. com.microsoft.hyperspace.index.sources.delta has Delta Lake related implementations.
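
For illustration, the horizontal layout described here might look roughly like the following; only the dataskipping and sources.delta packages are mentioned above, so the covering-index package name is an assumption:

com.microsoft.hyperspace.index.covering      // CoveringIndex, CoveringIndexConfig (hypothetical package)
com.microsoft.hyperspace.index.dataskipping  // DataSkippingIndex, DataSkippingIndexConfig
com.microsoft.hyperspace.index.sources.delta // Delta Lake source support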

@clee704 (Author) commented Jun 7, 2021

Sorry, there were comments I had missed. I resolved some outdated comments regarding existing/moved code.

    override val indexedColumns: Seq[String],
    includedColumns: Seq[String],
    // See schemaJson for more information about the annotation.
    @JsonDeserialize(converter = classOf[StructTypeConverter.T]) schema: StructType,
Collaborator

Can we just use schemaString, as discussed in #456? Other than this, LGTM! + please rebase onto master.

Contributor

+1 if @clee704 is also fine with it :)

Author

Note that those macros are optional for this to work. Do you still prefer schemaString over just schema?

It seems schemaString has been used instead of just schema because of the Jackson serialization issue. Shouldn't we fix the issue in the right way, instead of working around it? In the entire code base, the string is constantly being converted to StructType with DataType.fromJson. With a few simple annotations, this is not necessary. Having a string in the field and converting it to a typed object every time it is accessed seems unusual and raises my eyebrows whenever I think about it. There would have to be reasons other than the Jackson issue to justify it.

I admit the practical gain might not be great, but it costs almost nothing. Also, however small, the gain is on the user's side, whereas the cost is on our side. Since the user base is orders of magnitude larger than us developing Hyperspace, shouldn't we prioritize them?

We already have many shim files, and they don't contribute much to the overall complexity of the project, as they are independent of each other and we can inspect them individually. It's like how linear complexity is cheap compared to the quadratic or exponential complexity found in other parts of the codebase.
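
As an aside, here is a minimal sketch of the annotation-based approach being argued for, assuming Jackson StdConverter-style converters; the converter and class names are made up for the example and are not necessarily the ones used in this PR:

import com.fasterxml.jackson.databind.annotation.{JsonDeserialize, JsonSerialize}
import com.fasterxml.jackson.databind.util.StdConverter
import org.apache.spark.sql.types.{DataType, StructType}

// Converters between StructType and its JSON representation.
class StructTypeToJson extends StdConverter[StructType, String] {
  override def convert(schema: StructType): String = schema.json
}

class JsonToStructType extends StdConverter[String, StructType] {
  override def convert(json: String): StructType =
    DataType.fromJson(json).asInstanceOf[StructType]
}

// The case class can then hold a StructType directly; the string form only
// exists in the serialized log entry, not in the in-memory object.
case class ExampleIndex(
    name: String,
    @JsonSerialize(converter = classOf[StructTypeToJson])
    @JsonDeserialize(converter = classOf[JsonToStructType])
    schema: StructType)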

Contributor

How about Relation.dataSchemaJson? Do we want to make it consistent? I would let @sezruby make the final decision.

Collaborator

+1 for Relation.dataSchemaJson? I'm okay w/ Shim + StructType if you prefer that.

Author

Let's handle that in another PR as it is not related.

@sezruby (Collaborator) left a comment

LGTM thanks for the great work, @clee704!

It would be good to capture the list of remaining refactoring tasks in the PR description, for reference.

@clee704 merged commit 9e1f702 into microsoft:master on Jun 9, 2021
@clee704 deleted the ds branch on Jun 9, 2021 08:38
@sezruby added this to the v0.5.0 milestone on Jun 10, 2021
@clee704 added the "enhancement" (New feature or request) label on Jun 15, 2021
Labels: breaking changes, enhancement (New feature or request)
4 participants