Skip to content
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

[PROPOSAL]: Changes to support for multiple index types #342

Closed
apoorvedave1 opened this issue Jan 28, 2021 · 10 comments
Closed

[PROPOSAL]: Changes to support for multiple index types #342

apoorvedave1 opened this issue Jan 28, 2021 · 10 comments
Assignees
Labels
proposal This is the default tag for a newly created design proposal untriaged This is the default tag for a newly created issue

Comments

@apoorvedave1
Copy link
Contributor

apoorvedave1 commented Jan 28, 2021

Problem Statement

Code changes required to support multiple index types like bloom filter index and partition elimination index.

Background and Motivation

Why limit to covering indexes? Let's expand hyperspace to make it flexible for more index types.

Proposed Solution

Make changes to the existing design to allow for flexibility in adding more index types.

Design

Changes to Action classes

Actions Class diagrams

Applying Rules: Updated

  1. Have a single Hyperspace rule which gets added to spark optimizer. This is composed of internal pluggable rules
object HyperspaceRule extends Rule[LogicalPlan] = {
  def apply(plan: LogicalPlan) = {
    new Ranker(new Selection().select(plan)).head
  }
}
  1. Have multiple internal rules which work on specialized types of indexes. For e.g. CIJoinRule, CIFilterRule, BFFilterRule, PEFilterRule etc.

  2. Generate final plans by applying all rules independently to the current plan.

class Selection {
  val rules: Seq[HyperspaceInternalRule] = JoinRule :: FilterRule :: Nil
  def select(plan: LogicalPlan): Seq[LogicalPlan] = {
    rules.flatMap(r => r(plan))
  }
}
  1. Rank them based on hueristics (currently hardcoded rules) to get a cost-wise ordered list of plans. Pick the head plan and return
    (global ranker)
class Ranker {
  def rank(plans: Seq[LogicalPlan]): Seq[LogicalPlan]
}

PartitionEliminationIndex Design

Extending the new index config defined in this design doc: #341
we can define the PartitionElimination non-covering index as below:

case class PartitionEliminationIndexConfig extends NonCoveringIndexConfig

Using PartitionElimination Index

PartitionEliminationIndex is a reverse index from index columns and the data files which contain these values. These could be useful especially for point lookups and range queries.

Implementation

Refactoring Tasks

Tasks:

  1. trait: CreateIndex
  2. Class: CreateBFIndex: def op()
  3. Class: CreatePEIndex: def op()
  4. Class: CreateCoveringIndex: def op()
  5. Class: CreatePartitionEliminationIndex: def op()
  6. Class: Ranker : Global ranker which is hardcoded as of now
  7. Class: Selection
  8. Class: HyperspaceRule

PartitionEliminationIndex specific tasks

PartitionEliminationFilterIndexRule

  1. Get the source plan
  2. Get the index similar to covering index rule. Choose only those indexes whose type is PartitionEliminationIndex
  3. Run a spark query on index data with the query on the index columns.
  4. Collect a list of data file paths which satisfy the index.
  5. Return a new logical plan which reads data from these filtered source data files.

Creating PEIndex

  1. Extend from Covering Index
  2. Exactly as creating a covering index. Just skip the included columns and add the filename column by default.

Refreshing PEIndex

  1. Extend from Covering Index
  2. Exactly as refreshing a covering index. Just skip the included columns and add the file name column by default.

Optimizing PEIndex

  1. Extend from Covering Index
  2. Exactly as optimizing a covering index. Just skip the included columns and add the file name column by default.

Order of PRs:

Refactoring

  1. Updates for a single Hyperspace rule. (3d)
  2. Refactor IndexConfig and IndexLogEntry for existing Covering Index. Update apis to reflect this change (1w)
  3. Refactor CreateIndex for Covering Index. (3d)
  4. Refactor RefreshIndex for Covering Index. (3d)
  5. Refactor OptimizeIndex for Covering Index. (3d)

BFIndex

  1. Introduce new Index type to support:
    a. CreateIndex
    b. Supported Rules for this index type
  2. RefreshIndex
  3. OptimizeIndex
  4. any other index maintenance operation required

PEIndex

  1. Introduce new Index type to support:
    a. CreateIndex (2d)
    b. Supported Rules for this index type (2w)
  2. RefreshIndex (2d)
  3. OptimizeIndex (2d)
  4. any other index maintenance operation required (-)

Performance Implications (if applicable)

None

Alternate Design Options

@apoorvedave1 apoorvedave1 added proposal This is the default tag for a newly created design proposal untriaged This is the default tag for a newly created issue labels Jan 28, 2021
@sezruby
Copy link
Collaborator

sezruby commented Jan 28, 2021

Seems IndexTypeHandler is not necessary as we should create new rules for Non-covering index / bloom filter index.

@apoorvedave1
Copy link
Contributor Author

thanks @sezruby , it still is a good choice, for e.g. filterindexrule can directly use bloom filter and non-covering index. It's generally a better abstraction I think to separate out index type logic from rules from now on moving forward.

@sezruby
Copy link
Collaborator

sezruby commented Jan 28, 2021

Ok these are the points:

  • we need to keep Rules simple and clear.
  • each index type might have all different conditions and getCandidateIndexes and rank algorithms will be incompatible. Having index type condition for the functions might not be clear.
  • I also have some refactoring plan of rules for whyNot API. I'll refactor the rules as series of "Checks" to filter the candidate indexes. We might be able to reuse some "Check" between rules.
  • Hybrid Scan for bf or non covering indexes incurs some complexity

And lastly how can we apply both a covering index and a BF index using one Filter Rule?

@rapoth
Copy link
Contributor

rapoth commented Feb 2, 2021

+1 to what @sezruby is saying. I'm in favor of separating out the rules per index type since the logic might be totally different. Ideally, we would have:

  1. Covering index comes with a collection of rules e.g., FilterIndexRule, JoinIndexRule and later AggIndexRule
  2. Non-covering indexes (like fine-grained partition elimination and bloom filter index) will come with their own set of rules e.g., FilterRule to begin with. Also, note that the join optimizations through fine-grained partition elimination (index intersection) and bloom filter indexes (bloom filter gets pushed to one side of the join) are totally different and I do not see any reusability.

The important thing we should consider is ensuring the duplication is minimum to the extent possible.

@apoorvedave1 @thugsatbay What are your thoughts on this?

@apoorvedave1
Copy link
Contributor Author

apoorvedave1 commented Feb 2, 2021

Ok these are the points:

  • we need to keep Rules simple and clear.

yeah so if we do this design, i think rules will become simpler and clearer. I am not saying to not write new rules. I am saying if rules can be reused, we don't need to duplicate it if possible. One example is Filter rule. If join rule requires a different logic, we can write a new join rule. but filter rule doesn't require duplication.

  • each index type might have all different conditions and getCandidateIndexes and rank algorithms will be incompatible. Having index type condition for the functions might not be clear.

We either implement this ranking logic, or we stick with creating different rules for each index type and hardcode their ordering.

  • I also have some refactoring plan of rules for whyNot API. I'll refactor the rules as series of "Checks" to filter the candidate indexes. We might be able to reuse some "Check" between rules.
  • Hybrid Scan for bf or non covering indexes incurs some complexity

hybrid scan logic gets extracted out of the filter rule into the index type handler. CoveringIndexHandler knows how to handle hybrid scan for covering index. Same for BF, NC

And lastly how can we apply both a covering index and a BF index using one Filter Rule?

If you take a look at the design, what I have done is added an IndexTypeHandler for exactly this question. Given multiple index types, a ranker decides which index to pick. if it is a covering index, covering handler will update the plan. if it's a bf index, bf handler will update the plan.

Here's the bottom line. Given a data source and two types of indexes which are eligible. E.g. a covering index, which was not updated for long, and a bf index, which is most recently updated. How do we decide which index to choose from?
Option 1: Use a ranker for the rule type. Filter index rule can use a ranker to choose the bf or covering index and pick one.
Option 2: Use two rules. FilterRuleForCovering, FilterRuleForBF. Now the quesiton is, how do we decide (today) which rule to prioritize? Remember, the Covering index could be outdated or the BF index could be outdated so i think we should not hardcode it to prioritize one over the other.

@thugsatbay
Copy link
Contributor

thugsatbay commented Feb 2, 2021

If we look at FilterIndexRule or JoinIndexRule. We are trying to find if there is a filter or join condition. Once it is done, we are trying to figure can we use covering index. And if they are multiple index we rank them to chose the best.

The question is are we ever going to compare two different type of index. If no, then how we figure out which we prefer more based on type, staleness .. it becomes complex. If yes, then we need to do inside the rule (what metrics to use to compare for 2 different type of indexses is debatable) ?

What I believe is that changing the getCandidateIndex method/function/utility to find the best index should be the right approach.

For each Rule we would already know what type of index we can apply. Going through each eligible index and applying on the rule would give us a new item [Rule, Index] to look at. Once we have collected all these items we can figure out which would work best based on our internal heuristics (metrics to discuss). Once done we return the index back. This is where the handlers come into play as they will take the rule and create an item and help compare which one is the best through rankers.

For future cost optimization problem (as discussed with @apoorvedave1) which rule+index combination to chose first. Each Rule should expose an internal API telling which index it supports (). Also each rule should expose an internal API allowing to give preference order of selection of index.

@andrei-ionescu
Copy link
Contributor

This is a great proposal. I would suggest to have an Epic for it where to include the work that needs to happen. From the top of my head it will require:

  1. Change in the public API
  2. Refactoring the core part of Hyperspace
  3. Each type of index has its own particularities and needs to be understand
    1. How it is created
    2. How it gets updated
    3. How to manage it
    4. Hybrid scans over each one
    5. Joins between datasets with different indexes
  4. Performance testing

I'm sure I lost a lot of other work needed for this.

@rapoth
Copy link
Contributor

rapoth commented Feb 6, 2021

@andrei-ionescu Yep! We have an epic tracking this work #157 (we are using ZenHub so we can see all the linked issues).

The goal is to add a few index types so we can flesh up all the generality and change the design abstractions as needed. What I've requested @apoorvedave1 to do was to update this issue with a detailed break-down of all the steps. He will most likely get to this early next week.

@andrei-ionescu
Copy link
Contributor

@rapoth, @apoorvedave1: For file skipping indexing please have a look on the XSkipper built by IBM.

@clee704
Copy link

clee704 commented Jun 22, 2021

Closing old issues. Further discussions can continue in #441.

@clee704 clee704 closed this as completed Jun 22, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
proposal This is the default tag for a newly created design proposal untriaged This is the default tag for a newly created issue
Projects
None yet
Development

No branches or pull requests

6 participants