diff --git a/docs/design/ 161-bloom-filter.md b/docs/design/ 161-bloom-filter.md
new file mode 100644
index 000000000..aaef936b5
--- /dev/null
+++ b/docs/design/ 161-bloom-filter.md
@@ -0,0 +1,205 @@
+# Proposal: Bloom Filter non-covering index for Hyperspace
+
+Discussion of [#161](https://github.com/microsoft/hyperspace/issues/161) Bloom Filter.
+
+## Abstract
+
+A design doc proposing how we might implement a Bloom filter index in [Hyperspace](https://github.com/microsoft/hyperspace).
+
+## Background
+
+Hyperspace currently supports only covering indexes over datasets. A covering index works well
+when the user knows, or has a pre-defined set of, the queries they want to run on the data. However, when a
+user wants to run queries on columns that are not widely used and still benefit from our
+indexing system, maintaining a full-fledged covering index can be expensive. Similarly, when the user's data
+is simply too large, maintaining a covering index may not be worthwhile because of its storage cost.
+Hence, we propose a Bloom filter index: a non-covering index backed by a space-efficient
+probabilistic data structure that is cheap to compute and store and that reduces scan time and the number of files scanned.
+
+## Proposal
+
+In this design document, we propose an addition to the Hyperspace indexing system: a potential first
+'non-covering' index.
+We also propose covering and non-covering index config APIs in Hyperspace that allow users to build indexes on their datasets.
+
+## Rationale
+
+[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.]
+
+TBD (examples of how it can be used are given in Background)
+
+## Compatibility
+
+[A discussion of the change with regard to the
+[compatibility guidelines](../../COMPATIBILITY.md).]
+
+TBD
+
+## Design
+
+Creating the covering and non-covering index configs.
+
+<table>
+<tr>
+<th></th>
+<th>Bloom Filter Config Design</th>
+<th>Covering Index Config Design Changes</th>
+</tr>
+<tr>
+<td>Initial Config</td>
+<td colspan="2">
+<pre>
+sealed trait IndexConfigBase {
+  def indexName: String
+  def indexedColumns: Seq[String]
+}
+
+trait CoveringIndexConfig extends IndexConfigBase {
+  def includedColumns: Seq[String]
+}
+
+trait NonCoveringIndexConfig extends IndexConfigBase {
+}
+</pre>
+</td>
+</tr>
+<tr>
+<td>Defining Config</td>
+<td>
+<pre>
+case class BloomIndexConfig private (
+  indexName: String,
+  indexedColumns: Seq[String],
+  expectedNumItems: Long,
+  fpp: Double,
+  numBits: Long
+) extends NonCoveringIndexConfig
+
+// Auxiliary constructors (signatures only; bodies are not specified here)
+def this(
+  indexName: String,
+  indexedColumns: Seq[String],
+  expectedNumItems: Long
+)
+
+def this(
+  indexName: String,
+  indexedColumns: Seq[String],
+  expectedNumItems: Long,
+  numBits: Long
+)
+
+def this(
+  indexName: String,
+  indexedColumns: Seq[String],
+  expectedNumItems: Long,
+  fpp: Double
+)
+</pre>
+Or we can replace these constructors with a three-builder design.
+</td>
+<td>
+<pre>
+final case class IndexConfig(
+  indexName: String,
+  indexedColumns: Seq[String],
+  includedColumns: Seq[String] = Seq()
+) extends CoveringIndexConfig
+</pre>
+Keeping IndexConfig unchanged preserves
+backward compatibility with older scripts.
+</td>
+</tr>
+<tr>
+<td>Additional Methods</td>
+<td>
+<pre>
+// Returns the probability that this BloomFilter
+// erroneously returns true for an element that
+// was never put into this BloomFilter
+def expectedFpp(): Double
+</pre>
+</td>
+<td>
+<pre>
+// TODO - proposed
+def addAllIndexedColumns(columnName: String*): IndexConfig
+def removeAllIndexedColumns(columnName: String*): IndexConfig
+def addAllIncludedColumns(columnName: String*): IndexConfig
+def removeAllIncludedColumns(columnName: String*): IndexConfig
+</pre>
+</td>
+</tr>
+</table>
+
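+The three `BloomIndexConfig` constructors above each leave one or two of `fpp` and `numBits` unspecified.
+A minimal sketch of how the missing values might be derived is shown below. It is not part of the proposal:
+the factory names (`withDefaults`, `withFpp`, `withNumBits`), the helper functions, and the default `fpp` of 0.03
+are assumptions made for illustration, and the sizing uses the textbook Bloom filter formulas
+`m = -n ln(p) / (ln 2)^2` and `k = (m / n) ln 2`.
+
+```scala
+sealed trait IndexConfigBase {
+  def indexName: String
+  def indexedColumns: Seq[String]
+}
+
+trait NonCoveringIndexConfig extends IndexConfigBase
+
+// Private constructor, as in the table above; instances come from the factories below.
+final case class BloomIndexConfig private (
+    indexName: String,
+    indexedColumns: Seq[String],
+    expectedNumItems: Long,
+    fpp: Double,
+    numBits: Long)
+    extends NonCoveringIndexConfig {
+
+  // Theoretical false-positive probability for the configured size.
+  def expectedFpp(): Double =
+    BloomIndexConfig.theoreticalFpp(expectedNumItems, numBits)
+}
+
+object BloomIndexConfig {
+  // Assumed default, chosen only for this sketch.
+  val DefaultFpp: Double = 0.03
+
+  // m = -n * ln(p) / (ln 2)^2, rounded up.
+  def optimalNumBits(n: Long, p: Double): Long =
+    math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong
+
+  // k = (m / n) * ln 2, at least one hash function.
+  def optimalNumHashes(n: Long, m: Long): Int =
+    math.max(1, math.round(m.toDouble / n * math.log(2)).toInt)
+
+  // p = (1 - e^(-k * n / m))^k
+  def theoreticalFpp(n: Long, m: Long): Double = {
+    val k = optimalNumHashes(n, m)
+    math.pow(1 - math.exp(-k * n.toDouble / m), k)
+  }
+
+  // Hypothetical named factories standing in for the three constructors.
+  def withFpp(name: String, cols: Seq[String], n: Long, fpp: Double): BloomIndexConfig =
+    new BloomIndexConfig(name, cols, n, fpp, optimalNumBits(n, fpp))
+
+  def withNumBits(name: String, cols: Seq[String], n: Long, numBits: Long): BloomIndexConfig =
+    new BloomIndexConfig(name, cols, n, theoreticalFpp(n, numBits), numBits)
+
+  def withDefaults(name: String, cols: Seq[String], n: Long): BloomIndexConfig =
+    withFpp(name, cols, n, DefaultFpp)
+}
+
+val byDefault = BloomIndexConfig.withDefaults("bfIndex", Seq("userId"), 1000L)
+val byBits = BloomIndexConfig.withNumBits("bfIndex", Seq("userId"), 1000L, 16384L)
+```
+
+Named factories are used here instead of overloaded constructors because the `(…, Long, Long)` and
+`(…, Long, Double)` overloads differ only in the numeric type of the last argument, which is easy to mix up at call sites.
+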
+<table>
+<tr>
+<th></th>
+<th>Covering Index</th>
+<th>Non Covering Index</th>
+</tr>
+<tr>
+<td>Base</td>
+<td colspan="2">
+<pre>
+sealed trait HyperSpaceIndex {
+  def kind: String
+  def kindAbbr: String
+}
+</pre>
+</td>
+</tr>
+<tr>
+<td>Definition</td>
+<td>
+<pre>
+case class CoveringIndex(
+  kind: String = "Covering",
+  kindAbbr: String = "CI",
+  properties: CoveringIndex.Properties
+) extends HyperSpaceIndex
+</pre>
+</td>
+<td>
+<pre>
+case class BloomIndex(
+  kind: String = "NonCovering",
+  kindAbbr: String = "BFNC",
+  properties: CoveringIndex.Properties
+) extends HyperSpaceIndex
+</pre>
+</td>
+</tr>
+</table>
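+
+To illustrate how a Bloom filter index can reduce the number of files scanned, the sketch below keeps one
+filter per data file and uses it to discard files that cannot contain a looked-up value. Everything in it is
+an assumption for illustration: `ToyBloomFilter` is a stand-in written for this example (not the Hyperspace or
+Spark implementation), and the file names and values are made up.
+
+```scala
+import scala.collection.mutable
+import scala.util.hashing.MurmurHash3
+
+// Toy Bloom filter: numBits bits, numHashes probe positions derived by double hashing.
+final class ToyBloomFilter(numBits: Int, numHashes: Int) {
+  private val bits = new mutable.BitSet(numBits)
+
+  private def positions(value: String): Seq[Int] = {
+    val h1 = MurmurHash3.stringHash(value, 0)
+    val h2 = MurmurHash3.stringHash(value, h1)
+    // floorMod keeps every position in [0, numBits) even when the hash is negative.
+    (1 to numHashes).map(i => java.lang.Math.floorMod(h1 + i * h2, numBits))
+  }
+
+  def put(value: String): Unit = positions(value).foreach(bits += _)
+
+  // False means "definitely absent"; true means "possibly present".
+  def mightContain(value: String): Boolean = positions(value).forall(bits.contains)
+}
+
+// Hypothetical data files and the values of the indexed column they hold.
+val files: Map[String, Seq[String]] = Map(
+  "part-0.parquet" -> Seq("alice", "bob"),
+  "part-1.parquet" -> Seq("carol", "dave"))
+
+// Index build: one filter per file.
+val filters: Map[String, ToyBloomFilter] = files.map { case (file, values) =>
+  val bf = new ToyBloomFilter(1024, 3)
+  values.foreach(bf.put)
+  file -> bf
+}
+
+// Query time: only files whose filter might contain the value need to be scanned.
+def candidateFiles(value: String): Set[String] =
+  filters.collect { case (file, bf) if bf.mightContain(value) => file }.toSet
+```
+
+A file that really holds the value is always returned, because a Bloom filter has no false negatives. A file
+that does not hold it is usually skipped, but may occasionally be returned as a false positive, so the
+scan still has to apply the original predicate.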