Bloom Filter non-covering index for HyperSpace

Discussion of #161, Introduction to Bloom Filter. This design doc proposes how we might go about implementing a Bloom Filter index in HyperSpace.

Describe the problem (Background)
Hyperspace currently only supports covering indexes over datasets. A covering index works well when the user knows, or has a pre-defined set of, queries to run on the data. However, when a user wants to run queries on columns that are not widely used but still wants to leverage our indexing system, maintaining a full-fledged covering index can be expensive. In another scenario, the user wants to leverage our index system, but the data is simply too big and maintaining a covering index is not worthwhile (storage expensive). Hence, we propose a Bloom filter: a non-covering index backed by a space-efficient probabilistic data structure that computes and stores a "contains" property, ultimately benefiting user queries by reducing the files and time scanned.
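For reference, Spark already ships a bloom filter implementation (org.apache.spark.util.sketch.BloomFilter) that exposes exactly this "might contain" behavior; a minimal sketch with made-up values:

import org.apache.spark.util.sketch.BloomFilter

// Size for ~1M expected items with a 3% target false-positive probability.
val bf = BloomFilter.create(1000000L, 0.03)
bf.putString("user-1")
bf.putString("user-2")

bf.mightContainString("user-1")    // always true for inserted values
bf.mightContainString("user-999")  // usually false; true only on a false positive
bf.expectedFpp()                   // current expected false-positive probability

A false positive only costs an unnecessary file scan; a false negative can never happen, so query correctness is preserved.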
Proposal
In this design document, we propose an addition to the Hyperspace indexing system: a potential first 'non-covering' index, along with a covering and non-covering index config API in Hyperspace that allows users to build indexes on their datasets.
This doc goes in parallel with #342.
Design
1) Creating covering / non-covering index configs
case class BloomIndexConfig private (
    indexName: String,
    indexedColumns: Seq[String],
    expectedNumItems: Long,
    fpp: Double,
    numBits: Long
) extends NonCoveringIndexConfig {

  // Only the expected number of items is supplied; fpp and numBits use defaults.
  def this(
      indexName: String,
      indexedColumns: Seq[String],
      expectedNumItems: Long
  )

  // Explicitly size the filter in bits.
  def this(
      indexName: String,
      indexedColumns: Seq[String],
      expectedNumItems: Long,
      numBits: Long
  )

  // Explicitly set the target false-positive probability.
  def this(
      indexName: String,
      indexedColumns: Seq[String],
      expectedNumItems: Long,
      fpp: Double
  )
}
Alternatively, we can substitute this with a three-builder design.
final case class IndexConfig(
    indexName: String,
    indexedColumns: Seq[String],
    includedColumns: Seq[String] = Seq()
) extends CoveringIndexConfig
By keeping IndexConfig the same, we preserve backward compatibility with older scripts.
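For clarity, a possible shape of the config hierarchy implied above; only CoveringIndexConfig and NonCoveringIndexConfig appear in this proposal, the common parent trait and its members are an assumption for illustration:

// Assumed common parent; names other than the two leaf traits are illustrative only.
trait HyperspaceIndexConfigBase {
  def indexName: String
  def indexedColumns: Seq[String]
}

trait CoveringIndexConfig extends HyperspaceIndexConfigBase
trait NonCoveringIndexConfig extends HyperspaceIndexConfigBase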
Additional Methods
// Returns the probability of this BloomFilter erroneously
// returning true for an element that was never actually
// put in this BloomFilter.
def expectedFpp(): Double
The auxiliary constructors give three different ways we could create the bloom filter.
If we go with the Builder configuration, the declaration of BloomIndexConfig will change, but it will look more or less the same.
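For illustration, a hypothetical end-to-end use of the proposed config. hs.createIndex is the existing Hyperspace entry point; whether it accepts the new config type is exactly what this proposal adds, and the table path and column names below are made up:

import org.apache.spark.sql.SparkSession
import com.microsoft.hyperspace.Hyperspace

val spark = SparkSession.builder().appName("bf-index-example").getOrCreate()
val hs = new Hyperspace(spark)
val df = spark.read.parquet("path/to/table")  // placeholder path

// Proposed usage: only expectedNumItems supplied, fpp/numBits left to defaults.
val bfConfig = new BloomIndexConfig("bfIndex", Seq("userId"), 1000000L)

// Today createIndex takes the covering IndexConfig; accepting BloomIndexConfig
// (or any NonCoveringIndexConfig) is part of this proposal.
hs.createIndex(df, bfConfig)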
2) Index Refresh & Optimize
How the Merge BloomFilter op works
Index Refresh
Only find files that have been deleted and invalidate the BF data.

Append Files
- Construct a df with new BF entries for appended files
- Write the df in "append" mode, wherever we end up storing BF information
- Merge the new BF entries into the global BF

Delete Files
- Keep the deleted file list in `excluded`
- The global BF is still valid; update the global BF for better performance?

File Level
- Get the deleted file paths and appended file paths
- Remove the BF entries of deleted file paths in the file-level index data
- The index data has only valid rows after removing invalidated entries
- Construct a df with new BF entries for appended files

Global Level
- Update the global BF using the updated file-level BF entries
- Merge in place or union (TODO: explore; see the sketch below)
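A rough sketch of the append path, assuming Spark's org.apache.spark.util.sketch.BloomFilter is the underlying structure. The helper name, storage location, and sizing values are placeholders, not decided API; df.stat.bloomFilter and mergeInPlace are existing Spark APIs:

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: build one BF per appended file and fold each into the global BF.
// Note: mergeInPlace requires all filters to be created with the same size/hash config.
def mergeAppendedFiles(
    spark: SparkSession,
    appendedFiles: Seq[String],
    indexedColumn: String,
    globalBF: BloomFilter): Seq[(String, BloomFilter)] = {
  val perFileBFs = appendedFiles.map { path =>
    // Build a per-file BF over the indexed column; sizing values are placeholders.
    val fileBF = spark.read.parquet(path).stat.bloomFilter(indexedColumn, 1000000L, 0.03)
    path -> fileBF
  }
  perFileBFs.foreach { case (_, fileBF) => globalBF.mergeInPlace(fileBF) }
  perFileBFs
}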
Index Optimize
This op should do nothing, as we need to maintain a BF for each file.
3) Creating the covering / non-covering index inside Hyperspace / refactoring to support non-covering indexes
There will be code changes inside CreateActionBase & IndexLogEntry that should span many small refactoring changes in the codebase. TODO: needs experimentation with the current JSON parser.
4) Storing BFBinaryData, BF metadata file
Either in a parquet file for each user data file, or we can store the data directly in the index file / config.
FileName | BloomFilterBinaryData | Updated
path/to/part-0001-xxxxx.parquet | 0xacdfacdfacdfacdfacdfacdf… | 1
path/to/part-0002-xxxxx.parquet | 0xabcdabcdabcdabcdabcda… | 0
path/to/part-0003-xxxxx.parquet | 0xabcdabasdadcdabcdabdda… | 1
The Updated column tells whether the BF data can still be used or is stale.
The global BF, covering all data in the indexed column, can be stored as a separate file or in the IndexLogEntry file?
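A minimal sketch of how a per-file BF row could be produced for the layout above, assuming Spark's BloomFilter for serialization; the case class and helper below are illustrative only, while writeTo/readFrom are existing Spark APIs:

import java.io.ByteArrayOutputStream
import org.apache.spark.util.sketch.BloomFilter

// Illustrative row matching the FileName | BloomFilterBinaryData | Updated layout above.
case class BloomFilterEntry(fileName: String, bloomFilterBinaryData: Array[Byte], updated: Int)

def toEntry(fileName: String, bf: BloomFilter, updated: Int): BloomFilterEntry = {
  // BloomFilter.writeTo / BloomFilter.readFrom handle (de)serialization.
  val out = new ByteArrayOutputStream()
  bf.writeTo(out)
  BloomFilterEntry(fileName, out.toByteArray, updated)
}

// e.g. spark.createDataFrame(entries).write.mode("append").parquet(bfMetadataPath)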
5) Rules
Detailed discussion in #342.
Proposed -
[TODO] Are we allowed to compare two different index types (HyperSpaceIndex#kindAbbr), or would we always prefer one over the other?
[TODO] Each IndexLogEntry should now have equals and compareTo methods, which may or may not be used in the ranker. These define how each index compares to other indexes based on certain heuristics or rules.
[TODO] Can the comparison methods/routine of an index change based on the rule?
All answers are provided through the construction of rankers for the index and the rules.
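Purely as an illustration of the ranker idea (none of this is decided; the ordering rule and names below are assumptions), a ranker could be expressed as an Ordering over candidate index entries:

// Hypothetical ranking rule: prefer covering indexes over Bloom-filter indexes,
// break ties by smaller index size. "CI" as the covering-index kindAbbr is assumed.
case class CandidateIndex(kindAbbr: String, sizeInBytes: Long)

val indexRanker: Ordering[CandidateIndex] = Ordering.by { idx: CandidateIndex =>
  val kindRank = if (idx.kindAbbr == "CI") 0 else 1
  (kindRank, idx.sizeInBytes)
}

// candidates.sorted(indexRanker).headOption picks the preferred index.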
Implementation
- Refactor the index config to introduce the non-covering index (1 week) - 357
- Support new index configs for the bloom filter (1.5 weeks) - 357
- Add a new case class for the non-covering index
- Modify IndexLogEntry to support design changes, including the non-covering index (1 week)
- Create the index using the config (2 weeks)
- Store the newly created index config info in a usable file format (1 week)
- Define a ranker for this index (3 weeks)
- Plug the index into the FilterIndexRule (2 weeks)
- Plug the index into the JoinIndexRule - separate design and time requirement

The Join rule is complex and can be divided as follows:
- When the BF moves to the right side, which files does it allow us to scan?
- Can the BF be used for each key in the join to decide, for each reducer, how the scan/shuffle takes place?
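To make the FilterIndexRule item concrete, a hedged sketch of how file pruning might look for an equality predicate, reusing the illustrative BloomFilterEntry from the storage sketch above; only BloomFilter.readFrom and mightContainString are existing Spark APIs, the rest is placeholder:

import java.io.ByteArrayInputStream
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical pruning step: given per-file BF entries and the literal from a
// `col = literal` filter, keep only files whose BF might contain the literal.
def pruneFiles(entries: Seq[BloomFilterEntry], literal: String): Seq[String] = {
  entries
    .filter { e =>
      val bf = BloomFilter.readFrom(new ByteArrayInputStream(e.bloomFilterBinaryData))
      bf.mightContainString(literal)  // false => the file cannot contain the value, skip it
    }
    .map(_.fileName)
}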
Impact on Performance (if applicable)
Mostly, we need to run a small Spark job on the Bloom Filter information to figure out which data files are worth keeping for scanning. TODO: detailed analysis required.
Open issues (if applicable)
Understanding how the plan will look for the join operation; working on it. TODO(thugsatbay)