[PROPOSAL]: Changes to support for multiple index types #342

apoorvedave1 · 2021-01-28T01:52:27Z

Problem Statement

Hyperspace needs code changes to support multiple index types, such as a bloom filter index and a partition elimination index.

Background and Motivation

Why limit to covering indexes? Let's expand Hyperspace so that it is flexible enough to support more index types.

Proposed Solution

Make changes to the existing design to allow for flexibility in adding more index types.

Design

Changes to Action classes

Applying Rules: Updated

Have a single Hyperspace rule which gets added to the Spark optimizer. This rule is composed of internal pluggable rules.

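import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object HyperspaceRule extends Rule[LogicalPlan] {
  // Generate candidate plans, rank them, and return the best one.
  // Fall back to the original plan when no internal rule applies.
  def apply(plan: LogicalPlan): LogicalPlan = {
    new Ranker().rank(new Selection().select(plan)).headOption.getOrElse(plan)
  }
}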

Have multiple internal rules which work on specialized types of indexes, for example CIJoinRule, CIFilterRule, BFFilterRule, PEFilterRule, etc.
Generate candidate plans by applying all rules independently to the current plan.

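class Selection {
  // Each internal rule is assumed to return zero or more candidate plans.
  val rules: Seq[HyperspaceInternalRule] = JoinRule :: FilterRule :: Nil
  // Apply every rule independently to the same input plan and collect the results.
  def select(plan: LogicalPlan): Seq[LogicalPlan] = {
    rules.flatMap(rule => rule(plan))
  }
}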

Rank them in a global ranker based on heuristics (currently hardcoded rules) to get a cost-ordered list of plans. Pick the head plan and return it.

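class Ranker {
  // Orders candidate plans by cost, cheapest first. The heuristics are hardcoded for now,
  // so the body is left unimplemented in this outline.
  def rank(plans: Seq[LogicalPlan]): Seq[LogicalPlan] = ???
}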
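The three pieces above can be tried end to end with a minimal sketch that runs without Spark. Everything here is a stand-in for illustration: `Plan` replaces `LogicalPlan`, the `estimatedCost` field and the cost reductions are invented heuristics, and the internal rules are assumed to return an `Option` (a rewritten plan when they apply, `None` otherwise).

```scala
object SingleRuleSketch {
  // Stand-in for Spark's LogicalPlan so the sketch runs without Spark on the classpath.
  final case class Plan(description: String, estimatedCost: Long)

  // Hypothetical internal rule: Some(rewritten plan) when it applies, None otherwise.
  trait HyperspaceInternalRule {
    def apply(plan: Plan): Option[Plan]
  }

  // Toy rules: the match conditions and cost reductions are made up for illustration.
  object FilterRule extends HyperspaceInternalRule {
    def apply(plan: Plan): Option[Plan] =
      if (plan.description.contains("Filter"))
        Some(Plan(s"FilterIndexScan(${plan.description})", plan.estimatedCost / 2))
      else None
  }

  object JoinRule extends HyperspaceInternalRule {
    def apply(plan: Plan): Option[Plan] =
      if (plan.description.contains("Join"))
        Some(Plan(s"JoinIndexScan(${plan.description})", plan.estimatedCost / 4))
      else None
  }

  // Applies every rule independently to the same input plan.
  class Selection(rules: Seq[HyperspaceInternalRule]) {
    def select(plan: Plan): Seq[Plan] = rules.flatMap(rule => rule(plan))
  }

  // Global ranker: cheapest candidate first (assumed heuristic).
  class Ranker {
    def rank(plans: Seq[Plan]): Seq[Plan] = plans.sortBy(_.estimatedCost)
  }

  object HyperspaceRuleSketch {
    private val selection = new Selection(Seq(JoinRule, FilterRule))
    private val ranker = new Ranker

    // Returns the best candidate, or the original plan when no rule applies.
    def apply(plan: Plan): Plan =
      ranker.rank(selection.select(plan)).headOption.getOrElse(plan)
  }
}
```

Returning the original plan when no candidate exists keeps the rule a no-op for queries that no index can serve.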

PartitionEliminationIndex Design

Extending the new index config defined in this design doc: #341
we can define the PartitionElimination non-covering index as below:

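// Fields are assumed to mirror the covering IndexConfig minus the included columns.
case class PartitionEliminationIndexConfig(indexName: String, indexedColumns: Seq[String]) extends NonCoveringIndexConfig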

Using PartitionElimination Index

PartitionEliminationIndex is a reverse index from the values of the index columns to the data files which contain these values. It could be especially useful for point lookups and range queries.
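As a rough illustration of the reverse-index idea, the sketch below builds a value-to-files map with plain Scala collections and uses it for a point lookup and a range query. The names (`Row`, `buildIndex`, `pointLookup`, `rangeLookup`) and the single integer key are assumptions for the example, not part of the proposal.

```scala
object PartitionEliminationSketch {
  // One source record: the data file it lives in and its indexed column value.
  final case class Row(fileName: String, key: Int)

  // Reverse index: indexed value -> data files that contain that value.
  def buildIndex(rows: Seq[Row]): Map[Int, Set[String]] =
    rows.groupBy(_.key).map { case (key, group) => key -> group.map(_.fileName).toSet }

  // Point lookup: only the files that contain `key` need to be read.
  def pointLookup(index: Map[Int, Set[String]], key: Int): Set[String] =
    index.getOrElse(key, Set.empty[String])

  // Range query: union of the files for every indexed value in [lo, hi].
  def rangeLookup(index: Map[Int, Set[String]], lo: Int, hi: Int): Set[String] =
    index.collect { case (key, files) if key >= lo && key <= hi => files }.flatten.toSet
}
```

Any file that does not appear in the lookup result can be skipped entirely, which is where the savings for selective queries would come from.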

Implementation

Refactoring Tasks

Tasks:

trait: CreateIndex
Class: CreateBFIndex: def op()
Class: CreatePEIndex: def op()
Class: CreateCoveringIndex: def op()
Class: CreatePartitionEliminationIndex: def op()
Class: Ranker : Global ranker which is hardcoded as of now
Class: Selection
Class: HyperspaceRule

PartitionEliminationIndex specific tasks

PartitionEliminationFilterIndexRule

Get the source plan
Get the candidate indexes the same way the covering index rule does, but choose only those indexes whose type is PartitionEliminationIndex.
Run a Spark query on the index data, applying the query's predicates on the index columns.
Collect the list of data file paths which satisfy those predicates.
Return a new logical plan which reads data from these filtered source data files.
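The steps above can be sketched with stand-in types, without Spark. `IndexEntry`, `FilteredScan`, and `rewrite` are hypothetical names, the index data is modeled as in-memory (value, file) pairs, and the predicate is a plain Scala function rather than a Catalyst expression.

```scala
object PEFilterRuleSketch {
  sealed trait IndexKind
  case object Covering extends IndexKind
  case object PartitionElimination extends IndexKind

  // Hypothetical index entry: `rows` holds (indexed value, source data file) pairs.
  final case class IndexEntry(
      name: String,
      kind: IndexKind,
      column: String,
      rows: Seq[(Int, String)])

  // Stand-in for the source plan: a filter on one column over a list of data files.
  final case class FilteredScan(column: String, predicate: Int => Boolean, files: Seq[String])

  def rewrite(plan: FilteredScan, indexes: Seq[IndexEntry]): FilteredScan = {
    // Keep only partition elimination indexes on the filtered column.
    val candidate =
      indexes.find(i => i.kind == PartitionElimination && i.column == plan.column)
    candidate match {
      case Some(index) =>
        // "Query" the index data with the predicate and collect the matching file paths.
        val matching = index.rows.collect { case (v, file) if plan.predicate(v) => file }.toSet
        // New plan reads only the source files that can contain matching rows.
        plan.copy(files = plan.files.filter(matching.contains))
      case None =>
        plan // No usable index: leave the plan untouched.
    }
  }
}
```

The filter itself stays in the plan; the index only narrows the set of files, so the result is still correct if a retained file also holds non-matching rows.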

Creating PEIndex

Extend from Covering Index
Same as creating a covering index, except that the included columns are skipped and the file name column is added by default.

Refreshing PEIndex

Extend from Covering Index
Same as refreshing a covering index, except that the included columns are skipped and the file name column is added by default.

Optimizing PEIndex

Extend from Covering Index
Same as optimizing a covering index, except that the included columns are skipped and the file name column is added by default.

Order of PRs:

Refactoring

Updates for a single Hyperspace rule. (3d)
Refactor IndexConfig and IndexLogEntry for the existing Covering Index. Update the APIs to reflect this change (1w)
Refactor CreateIndex for Covering Index. (3d)
Refactor RefreshIndex for Covering Index. (3d)
Refactor OptimizeIndex for Covering Index. (3d)

BFIndex

Introduce new Index type to support:
a. CreateIndex
b. Supported Rules for this index type
c. RefreshIndex
d. OptimizeIndex
e. any other index maintenance operation required

PEIndex

Introduce new Index type to support:
a. CreateIndex (2d)
b. Supported Rules for this index type (2w)
c. RefreshIndex (2d)
d. OptimizeIndex (2d)
e. any other index maintenance operation required (-)

Performance Implications (if applicable)

None

Alternate Design Options

sezruby · 2021-01-28T03:00:59Z

Seems IndexTypeHandler is not necessary as we should create new rules for Non-covering index / bloom filter index.

apoorvedave1 · 2021-01-28T21:58:44Z

Thanks @sezruby, it is still a good choice. For example, FilterIndexRule can directly use both the bloom filter index and the non-covering index. I think it's generally a better abstraction to separate the index type logic from the rules going forward.

sezruby · 2021-01-28T23:43:42Z

Ok these are the points:

we need to keep Rules simple and clear.
each index type might have all different conditions and getCandidateIndexes and rank algorithms will be incompatible. Having index type condition for the functions might not be clear.
I also have some refactoring plan of rules for whyNot API. I'll refactor the rules as series of "Checks" to filter the candidate indexes. We might be able to reuse some "Check" between rules.
Hybrid Scan for bf or non covering indexes incurs some complexity

And lastly how can we apply both a covering index and a BF index using one Filter Rule?

rapoth · 2021-02-02T00:04:28Z

+1 to what @sezruby is saying. I'm in favor of separating out the rules per index type since the logic might be totally different. Ideally, we would have:

Covering index comes with a collection of rules e.g., FilterIndexRule, JoinIndexRule and later AggIndexRule
Non-covering indexes (like fine-grained partition elimination and bloom filter index) will come with their own set of rules e.g., FilterRule to begin with. Also, note that the join optimizations through fine-grained partition elimination (index intersection) and bloom filter indexes (bloom filter gets pushed to one side of the join) are totally different and I do not see any reusability.

The important thing we should consider is ensuring the duplication is minimum to the extent possible.

@apoorvedave1 @thugsatbay What are your thoughts on this?

apoorvedave1 · 2021-02-02T03:49:20Z

Ok these are the points:

we need to keep Rules simple and clear.

Yeah, so if we do this design, I think the rules will become simpler and clearer. I am not saying we should not write new rules. I am saying that if a rule can be reused, we don't need to duplicate it. One example is the filter rule. If the join rule requires different logic, we can write a new join rule, but the filter rule doesn't require duplication.

each index type might have all different conditions and getCandidateIndexes and rank algorithms will be incompatible. Having index type condition for the functions might not be clear.

We either implement this ranking logic, or we stick with creating different rules for each index type and hardcode their ordering.

I also have some refactoring plan of rules for whyNot API. I'll refactor the rules as series of "Checks" to filter the candidate indexes. We might be able to reuse some "Check" between rules.

Hybrid Scan for bf or non covering indexes incurs some complexity

The hybrid scan logic gets extracted out of the filter rule into the index type handler. CoveringIndexHandler knows how to handle hybrid scan for a covering index. The same goes for BF and NC.

And lastly how can we apply both a covering index and a BF index using one Filter Rule?

If you take a look at the design, I have added an IndexTypeHandler for exactly this question. Given multiple index types, a ranker decides which index to pick. If it is a covering index, the covering handler will update the plan. If it's a BF index, the BF handler will update the plan.

Here's the bottom line. Suppose we have a data source and two eligible indexes of different types, e.g. a covering index which has not been updated for a long time and a BF index which was updated most recently. How do we decide which index to choose?
Option 1: Use a ranker for the rule type. The filter index rule can use a ranker to compare the BF and covering indexes and pick one.
Option 2: Use two rules, FilterRuleForCovering and FilterRuleForBF. Now the question is, how do we decide (today) which rule to prioritize? Remember, either the covering index or the BF index could be outdated, so I think we should not hardcode a priority of one over the other.
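To make the difference between the two options concrete, here is a small sketch. The `Candidate` type, the `staleFiles` freshness measure, and the tie-breaking choice are all assumptions for illustration; the thread does not settle on which metrics a real ranker would use.

```scala
object IndexChoiceSketch {
  sealed trait IndexKind
  case object Covering extends IndexKind
  case object BloomFilter extends IndexKind

  // Hypothetical candidate: `staleFiles` counts source files changed since the last refresh.
  final case class Candidate(name: String, kind: IndexKind, staleFiles: Int)

  // Option 1: a single ranker compares candidates of every kind by freshness.
  // Ties fall back to preferring the covering index (an arbitrary choice here).
  def pickByRanker(candidates: Seq[Candidate]): Option[Candidate] =
    candidates
      .sortBy(c => (c.staleFiles, if (c.kind == Covering) 0 else 1))
      .headOption

  // Option 2: one rule per index kind, tried in a fixed, hardcoded order.
  // The first kind that has any candidate wins, regardless of staleness.
  def pickByRuleOrder(candidates: Seq[Candidate], order: Seq[IndexKind]): Option[Candidate] =
    order
      .flatMap(kind => candidates.filter(_.kind == kind).sortBy(_.staleFiles).headOption)
      .headOption
}
```

With a stale covering index and a fresh BF index, the ranker picks the BF index, while a hardcoded covering-first order still picks the stale covering index.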

thugsatbay · 2021-02-02T19:23:14Z

If we look at FilterIndexRule or JoinIndexRule, we first try to find whether there is a filter or join condition. Once that is done, we try to figure out whether we can use a covering index. And if there are multiple indexes, we rank them to choose the best.

The question is whether we are ever going to compare two different types of index. If not, figuring out which one we prefer based on type, staleness, and so on becomes complex. If yes, then we need to do the comparison inside the rule (which metrics to use to compare two different types of indexes is debatable).

What I believe is that changing the getCandidateIndex method/function/utility to find the best index should be the right approach.

For each Rule we would already know what type of index we can apply. Going through each eligible index and applying on the rule would give us a new item [Rule, Index] to look at. Once we have collected all these items we can figure out which would work best based on our internal heuristics (metrics to discuss). Once done we return the index back. This is where the handlers come into play as they will take the rule and create an item and help compare which one is the best through rankers.

For the future cost optimization problem (as discussed with @apoorvedave1) of which rule+index combination to choose first: each Rule should expose an internal API telling which index types it supports. Each rule should also expose an internal API giving its preference order for index selection.
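One possible shape for such an internal API is sketched below. The `Rule` trait, the `supportedKinds` member, and the string index kinds are hypothetical; the sketch only shows how [Rule, Index] items could be enumerated in each rule's preference order before a ranker compares them.

```scala
object RuleIndexPairingSketch {
  final case class Index(name: String, kind: String)

  trait Rule {
    def name: String
    // Index kinds this rule can apply, most preferred first.
    def supportedKinds: Seq[String]
  }

  object FilterRule extends Rule {
    val name = "FilterRule"
    val supportedKinds = Seq("covering", "bloomFilter")
  }

  object JoinRule extends Rule {
    val name = "JoinRule"
    val supportedKinds = Seq("covering")
  }

  // Enumerate every (rule, index) item where the rule supports the index kind,
  // ordered by rule and then by that rule's preference over index kinds.
  def candidates(rules: Seq[Rule], indexes: Seq[Index]): Seq[(Rule, Index)] =
    for {
      rule <- rules
      kind <- rule.supportedKinds
      index <- indexes if index.kind == kind
    } yield (rule, index)
}
```

A ranker would then score these items with whatever heuristics are agreed on; the enumeration order here only reflects each rule's stated preference.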

andrei-ionescu · 2021-02-05T23:05:32Z

This is a great proposal. I would suggest having an Epic for it that includes the work that needs to happen. Off the top of my head, it will require:

Change in the public API
Refactoring the core part of Hyperspace
Each type of index has its own particularities and needs to be understood:
1. How it is created
2. How it gets updated
3. How to manage it
4. Hybrid scans over each one
5. Joins between datasets with different indexes
Performance testing

I'm sure I missed a lot of other work needed for this.

rapoth · 2021-02-06T02:00:44Z

@andrei-ionescu Yep! We have an epic tracking this work #157 (we are using ZenHub so we can see all the linked issues).

The goal is to add a few index types so we can flesh out all the generality and change the design abstractions as needed. What I've requested @apoorvedave1 to do is to update this issue with a detailed breakdown of all the steps. He will most likely get to this early next week.

andrei-ionescu · 2021-04-07T09:04:08Z

@rapoth, @apoorvedave1: For file-skipping indexes, please have a look at XSkipper, built by IBM.

clee704 · 2021-06-22T12:30:58Z

Closing old issues. Further discussions can continue in #441.

apoorvedave1 added the proposal and untriaged labels Jan 28, 2021

thugsatbay mentioned this issue Feb 2, 2021

[PROPOSAL]: Design Doc Bloom Filter #341

Closed

12 tasks

apoorvedave1 assigned thugsatbay Feb 2, 2021

rapoth assigned sezruby and apoorvedave1 Feb 3, 2021

clee704 closed this as completed Jun 22, 2021
