Add an implementation for mutable bloom filters #673

Open
wants to merge 2 commits into
base: develop

anish749
Contributor

Adds an implementation of mutable Bloom Filters to com.twitter.algebird.mutable.

This uses java.util.BitSet underneath, and the += and ++= operators mutate the Bloom Filter in place when adding elements.
This is significantly faster than the current BloomFilter implementation, and is useful when building large BloomFilters (> 1M elements) in one step, before we start reading from the filter. Since this is a mutable monoid, it doesn't have to copy the underlying bitset and can be more than 100 times faster. Scala BitSets copy with loops, which is slower than Java's BitSet, which uses System.arraycopy; that is why I started experimenting with Java BitSet.

This implementation is also suitable for query-heavy workloads. I've noticed improvements in query performance as well.
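
A minimal sketch of the in-place idea, not the class added in this PR. The name MutableBloomSketch, the String-only element type, the use of one seeded MurmurHash3 per hash function, and ++= taking a collection of items are all assumptions made for illustration.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical sketch; class and method names are illustrative only.
final class MutableBloomSketch(val width: Int, val numHashes: Int) {
  private val bits = new BitSet(width)

  // One seeded MurmurHash3 per hash function, mapped into [0, width).
  private def positions(item: String): Array[Int] =
    Array.tabulate(numHashes)(seed => Math.floorMod(MurmurHash3.stringHash(item, seed), width))

  // Sets bits in place and returns this instance; no copy of the bitset is made.
  def +=(item: String): this.type = {
    positions(item).foreach(p => bits.set(p))
    this
  }

  // Adds every item of a collection, again mutating in place.
  def ++=(items: Iterable[String]): this.type = {
    items.foreach(item => this += item)
    this
  }

  // May return a false positive, never a false negative.
  def maybeContains(item: String): Boolean =
    positions(item).forall(p => bits.get(p))
}
```

Because += returns the same instance, a fold over many elements touches a single bitset, whereas an immutable filter would allocate a new bitset for each addition.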

Hash
The hash functions used here are of the form h1 + i*h2, as described in this paper. This also speeds up creating and querying this filter compared to the immutable filter. It also makes the immutable and mutable filters incompatible, since their hash functions are different.
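
As a rough illustration of the h1 + i*h2 scheme, the sketch below derives all bit positions from two base hashes. The object name, the choice of MurmurHash3, and seeding the second hash with the first are assumptions; the PR's actual hasher may differ.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper, not the hasher from this PR.
object DoubleHashingSketch {
  // Returns numHashes bit positions in [0, width) using only two hash computations.
  def positions(item: String, numHashes: Int, width: Int): Array[Int] = {
    val h1 = MurmurHash3.stringHash(item, 0)
    // Second base hash; seeding it with h1 is an arbitrary choice for this sketch.
    val h2 = MurmurHash3.stringHash(item, h1)
    Array.tabulate(numHashes) { i =>
      // g_i(x) = h1(x) + i * h2(x); Int overflow wraps, floorMod keeps the index non-negative.
      Math.floorMod(h1 + i * h2, width)
    }
  }
}
```

The cost per element is two hash evaluations plus a few integer operations, regardless of how many hash functions the filter uses.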

Benchmarks
I've used a superset of numberOfElements and fpProb for benchmarking these implementations.
I've added a few other benchmarks via #672.
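
For reading the tables below, it helps to recall how the two parameters translate into filter size. The sketch uses the textbook sizing formulas (width m = -n ln p / (ln 2)^2, hash count k = (m / n) ln 2); whether algebird rounds or sizes exactly this way is an assumption, and the object and method names are hypothetical.

```scala
// Textbook Bloom filter sizing; algebird's exact rounding may differ.
object BloomSizingSketch {
  // Number of bits needed for numEntries elements at the target false positive rate.
  def optimalWidth(numEntries: Int, fpProb: Double): Int =
    math.ceil(-numEntries * math.log(fpProb) / (math.log(2) * math.log(2))).toInt

  // Number of hash functions that minimises the false positive rate for that width.
  def optimalNumHashes(numEntries: Int, width: Int): Int =
    math.max(1, math.round(width.toDouble / numEntries * math.log(2)).toInt)
}
```

A lower false positive rate means both a wider bitset and more hash functions per element, which is consistent with the 0.001 rows being slower than the 0.1 rows.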
Create
Immutable

[info] Benchmark                                               (falsePositiveRate)  (nbrOfElements)   Mode  Cnt    Score     Error  Units
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            10000  thrpt    3  171.513 ± 281.008  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            50000  thrpt    3   11.661 ± 175.592  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            10000  thrpt    3   36.594 ±  87.057  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            50000  thrpt    3   25.702 ± 259.042  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            10000  thrpt    3  409.528 ± 540.620  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            50000  thrpt    3   76.085 ± 264.660  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            10000  thrpt    3    3.059 ±   7.600  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            50000  thrpt    3    0.164 ±   0.177  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            10000  thrpt    3    7.157 ±   9.560  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            50000  thrpt    3    0.341 ±   0.235  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            10000  thrpt    3   26.621 ±  40.213  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            50000  thrpt    3    1.200 ±   1.451  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            10000  thrpt    3    3.738 ±   6.267  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            50000  thrpt    3    0.159 ±   0.150  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            10000  thrpt    3    7.476 ±  16.579  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            50000  thrpt    3    0.335 ±   0.721  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            10000  thrpt    3   24.621 ±  44.696  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            50000  thrpt    3    1.295 ±   0.839  ops/s

Mutable BloomFilter with new Hash (the one in this PR)

[info] Benchmark                                                       (falsePositiveRate)  (nbrOfElements)   Mode  Cnt    Score      Error  Units
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                          0.001            10000  thrpt    3  355.751 ± 1250.674  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                          0.001            50000  thrpt    3   51.248 ±   23.725  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                           0.01            10000  thrpt    3  403.380 ± 1581.138  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                           0.01            50000  thrpt    3   70.345 ±  210.749  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                            0.1            10000  thrpt    3  495.017 ± 1078.017  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                            0.1            50000  thrpt    3   96.735 ±  185.395  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                0.001            10000  thrpt    3  372.392 ±  773.982  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                0.001            50000  thrpt    3   71.716 ±  327.051  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                 0.01            10000  thrpt    3  417.301 ± 1412.174  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                 0.01            50000  thrpt    3   75.678 ±  261.029  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                  0.1            10000  thrpt    3  487.571 ± 1650.689  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                  0.1            50000  thrpt    3   94.129 ±  213.556  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                 0.001            10000  thrpt    3  294.363 ± 1242.557  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                 0.001            50000  thrpt    3   67.536 ±  224.817  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                  0.01            10000  thrpt    3  379.930 ± 1031.260  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                  0.01            50000  thrpt    3   75.034 ±  302.566  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                   0.1            10000  thrpt    3  467.775 ± 1789.243  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                   0.1            50000  thrpt    3   94.527 ±  269.796  ops/s

I also benchmarked the mutable filter with the old BFHash, to see whether it would be a good idea to make the two filters compatible:

[info] Benchmark                                                       (falsePositiveRate)  (nbrOfElements)   Mode  Cnt    Score      Error  Units
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                          0.001            10000  thrpt    3  110.083 ±  727.355  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                          0.001            50000  thrpt    3   28.249 ±   70.475  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                           0.01            10000  thrpt    3  226.370 ±  947.800  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                           0.01            50000  thrpt    3   44.636 ±  107.383  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                            0.1            10000  thrpt    3  431.400 ± 1126.885  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilter                            0.1            50000  thrpt    3   80.858 ±  235.058  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                0.001            10000  thrpt    3  163.738 ±  298.165  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                0.001            50000  thrpt    3   31.606 ±   78.850  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                 0.01            10000  thrpt    3  191.907 ±  523.795  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                 0.01            50000  thrpt    3   41.001 ±   65.581  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                  0.1            10000  thrpt    3  407.909 ± 1883.297  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterAggregator                  0.1            50000  thrpt    3   33.530 ±   76.722  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                 0.001            10000  thrpt    3   96.030 ±  474.962  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                 0.001            50000  thrpt    3   22.870 ±  127.996  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                  0.01            10000  thrpt    3  218.759 ±  734.978  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                  0.01            50000  thrpt    3   25.332 ±  287.861  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                   0.1            10000  thrpt    3  363.827 ±  403.306  ops/s
[info] MutableBloomFilterBenchmark.createMutableBloomFilterUsingFold                   0.1            50000  thrpt    3    7.420 ±   84.390  ops/s

I don't think it is a good idea, because we are adding this as a new Bloom Filter and we don't want to deserialise immutable Bloom filters into mutable filters. Given the performance improvements in these benchmarks, it makes sense to keep the new hash functions.

Query
Immutable

[info] Benchmark                                          (falsePositiveRate)  (nbrOfElements)   Mode  Cnt         Score          Error  Units
[info] BloomFilterQueryBenchmark.queryBloomFilter                       0.001              100  thrpt    3   2219781.232 ± 16557611.738  ops/s
[info] BloomFilterQueryBenchmark.queryBloomFilter                       0.001             1000  thrpt    3   4089699.123 ±  4145796.378  ops/s
[info] BloomFilterQueryBenchmark.queryBloomFilter                       0.001            10000  thrpt    3   4496559.024 ±  1933258.381  ops/s
[info] BloomFilterQueryBenchmark.queryBloomFilter                        0.01              100  thrpt    3   6408263.291 ±  1375159.950  ops/s
[info] BloomFilterQueryBenchmark.queryBloomFilter                        0.01             1000  thrpt    3   5937409.163 ±  4979813.237  ops/s
[info] BloomFilterQueryBenchmark.queryBloomFilter                        0.01            10000  thrpt    3   6202689.025 ±  3715862.721  ops/s

Mutable (faster because of the new hash functions)

[info] Benchmark                                          (falsePositiveRate)  (nbrOfElements)   Mode  Cnt         Score          Error  Units
[info] MutableBloomFilterQueryBenchmark.queryBloomFilter                0.001              100  thrpt    3  10159937.092 ±   879212.867  ops/s
[info] MutableBloomFilterQueryBenchmark.queryBloomFilter                0.001             1000  thrpt    3   9805759.107 ±  7481552.541  ops/s
[info] MutableBloomFilterQueryBenchmark.queryBloomFilter                0.001            10000  thrpt    3   9960957.309 ±  2251532.113  ops/s
[info] MutableBloomFilterQueryBenchmark.queryBloomFilter                 0.01              100  thrpt    3   9336716.113 ±  6779167.528  ops/s
[info] MutableBloomFilterQueryBenchmark.queryBloomFilter                 0.01             1000  thrpt    3  10691768.647 ±  6625311.112  ops/s
[info] MutableBloomFilterQueryBenchmark.queryBloomFilter                 0.01            10000  thrpt    3  10408856.514 ±  1731026.826  ops/s

I will try to add some documentation for this as well in a few days.

@anish749
Contributor Author

Builds are failing because of an old Ruby version in Travis. Here is a fix: #674.

@johnynek
Collaborator

Thanks for the PR.

I will take a look at it this weekend.

One thing I have been thinking about is copying this code: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/BitSet.scala into the repo since that bitset is competitive with mutable (faster in some cases) but still immutable.

Mutable data structures are really a last resort in this library. Even the priority queue case should be solved immutably, if you ask me; I just didn’t know about good immutable priority queues when I created the mutable package to add one.

@anish749
Contributor Author

This implementation of BitSet looks very clever. I like the idea and will go through it to understand the code as well. Thanks for sharing this.

@codecov-io

Codecov Report

Merging #673 into develop will increase coverage by 0.05%.
The diff coverage is 91.56%.


@@            Coverage Diff             @@
##           develop    #673      +/-   ##
==========================================
+ Coverage    89.45%   89.5%   +0.05%     
==========================================
  Files          113     114       +1     
  Lines         8944    9027      +83     
  Branches       490     496       +6     
==========================================
+ Hits          8001    8080      +79     
- Misses         943     947       +4
Impacted Files Coverage Δ
...ala/com/twitter/algebird/mutable/BloomFilter.scala 91.56% <91.56%> (ø)
...main/scala/com/twitter/algebird/monad/Reader.scala 50% <0%> (-5.56%) ⬇️
.../main/scala/com/twitter/algebird/Successible.scala 87.5% <0%> (-4.17%) ⬇️
.../main/scala/com/twitter/algebird/BloomFilter.scala 93.8% <0%> (-0.45%) ⬇️
.../main/scala/com/twitter/algebird/HyperLogLog.scala 92.99% <0%> (+0.77%) ⬆️
...c/main/scala/com/twitter/algebird/MapAlgebra.scala 75.67% <0%> (+0.9%) ⬆️
...src/main/scala/com/twitter/algebird/Interval.scala 84.34% <0%> (+2.6%) ⬆️


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 07b185f...acb35f7. Read the comment docs.

@anish749
Contributor Author

Hello @johnynek

I tried a snapshot version of cats-collections here. I might be missing a few things. It does improve on scala.immutable.BitSet. The javaewah.EWAHCompressedBitmap now becomes the bottleneck.

Cats Immutable BitSet with original BFHasher

[info] Benchmark                                               (falsePositiveRate)  (nbrOfElements)   Mode  Cnt    Score     Error  Units
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            10000  thrpt    3   39.430 ± 384.656  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            50000  thrpt    3    7.246 ±  22.062  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            10000  thrpt    3   72.440 ±  84.621  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            50000  thrpt    3    9.912 ±  23.070  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            10000  thrpt    3  132.935 ± 365.928  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            50000  thrpt    3   21.471 ±  19.366  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            10000  thrpt    3    4.423 ±   8.199  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            50000  thrpt    3    0.174 ±   0.006  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            10000  thrpt    3    7.462 ±   9.626  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            50000  thrpt    3    0.388 ±   0.254  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            10000  thrpt    3   28.085 ±   2.947  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            50000  thrpt    3    1.345 ±   0.025  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            10000  thrpt    3    4.292 ±   7.491  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            50000  thrpt    3    0.163 ±   0.032  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            10000  thrpt    3    8.162 ±  24.924  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            50000  thrpt    3    0.351 ±   0.056  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            10000  thrpt    3   28.473 ±   4.047  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            50000  thrpt    3    1.293 ±   0.217  ops/s

I also noticed that BFSparse[A] + (item: A) always results in a BFSparse, which doesn't use the immutable cats implementation. So I tried changing it so that BFSparse + (item: A) yields a BFInstance here.

This shows a significant improvement for fold / aggregate operations.
Cats BitSet backed BloomFilter with early conversion to BFInstance from BFSparse along with original Hasher

[info] Benchmark                                               (falsePositiveRate)  (nbrOfElements)   Mode  Cnt    Score     Error  Units
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            10000  thrpt    3   15.685 ± 156.889  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            50000  thrpt    3    7.558 ±  25.905  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            10000  thrpt    3   86.841 ±  25.936  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            50000  thrpt    3   12.258 ±   8.708  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            10000  thrpt    3  139.833 ± 537.118  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            50000  thrpt    3   22.686 ±  19.699  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            10000  thrpt    3   57.539 ±  56.254  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            50000  thrpt    3    8.313 ±   2.851  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            10000  thrpt    3   81.581 ±  62.966  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            50000  thrpt    3   10.145 ±  17.169  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            10000  thrpt    3  147.263 ± 433.301  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            50000  thrpt    3   22.500 ±  15.501  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            10000  thrpt    3   59.596 ±  46.199  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            50000  thrpt    3    8.582 ±   1.808  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            10000  thrpt    3   82.842 ±  61.627  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            50000  thrpt    3   11.530 ±  20.440  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            10000  thrpt    3  193.603 ±  79.113  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            50000  thrpt    3   22.170 ±   2.315  ops/s

The create benchmarks are lower because sumOption converts from BFSparse to BFInstance only after adding quite a few elements. With aggregate / fold, the conversion to BFInstance happens much earlier.
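
The early conversion can be pictured with a toy model. This is a sketch over raw bit indices, not algebird's BFSparse / BFInstance: the names Sparse and Dense are hypothetical, a plain Set stands in for the EWAH bitmap, and scala.collection.immutable.BitSet stands in for the cats bitset.

```scala
import scala.collection.immutable.BitSet

// Toy model of a two-representation filter; all names are illustrative.
object EarlyDenseSketch {
  sealed trait Filter {
    def +(bit: Int): Filter
    def contains(bit: Int): Boolean
  }

  // Stand-in for the sparse representation.
  final case class Sparse(bits: Set[Int]) extends Filter {
    // The first addition switches to the dense representation right away.
    def +(bit: Int): Filter = Dense(BitSet(bits.toSeq: _*) + bit)
    def contains(bit: Int): Boolean = bits(bit)
  }

  // Stand-in for the dense, bitset-backed representation.
  final case class Dense(bits: BitSet) extends Filter {
    def +(bit: Int): Filter = Dense(bits + bit)
    def contains(bit: Int): Boolean = bits(bit)
  }
}
```

With this shape, a fold that adds elements one at a time leaves the sparse representation after the first step, so every later addition goes through the dense bitset.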

@johnynek
Collaborator

Thanks for adding the benchmark with cats.

One thing I'm thinking: maybe we don't need to use EWAH actually. With the tree-based cats-collection immutable bitset we may also not need BFSparse either, since the bitset is already internally sparse.

@anish749
Contributor Author

Yes, I thought the same and wanted to remove EWAH, and I benchmarked that this morning when I was trying out cats. It provides a lot of improvement. A quick and dirty way to check that was this (original code was this), which meant that when adding elements via foldLeft / aggregate, it would quickly convert from BFSparse => BFInstance. The benchmarks I posted above are with and without this dirty change.

Here is the benchmark for only Cats with original Hasher.
Cats backed BF only

[info] Benchmark                                               (falsePositiveRate)  (nbrOfElements)   Mode  Cnt    Score     Error  Units
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            10000  thrpt    3   44.939 ± 227.744  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                          0.001            50000  thrpt    3    5.687 ±  19.081  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            10000  thrpt    3   78.949 ± 154.880  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                           0.01            50000  thrpt    3   11.778 ±   8.267  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            10000  thrpt    3  189.331 ±  52.266  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilter                            0.1            50000  thrpt    3   22.157 ±  20.995  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            10000  thrpt    3   53.450 ±  55.043  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                0.001            50000  thrpt    3    7.222 ±  15.144  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            10000  thrpt    3   72.325 ± 100.853  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                 0.01            50000  thrpt    3   10.576 ±  26.579  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            10000  thrpt    3  157.641 ± 351.298  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterAggregator                  0.1            50000  thrpt    3   18.624 ±  11.941  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            10000  thrpt    3   58.734 ±  57.416  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                 0.001            50000  thrpt    3    8.304 ±   2.001  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            10000  thrpt    3   78.388 ±  98.764  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                  0.01            50000  thrpt    3    5.545 ±  75.888  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            10000  thrpt    3  183.115 ± 128.414  ops/s
[info] BloomFilterCreateBenchmark.createBloomFilterUsingFold                   0.1            50000  thrpt    3   20.643 ±  52.748  ops/s

So moving from Scala Immutable + EWAH => Cats provides a lot of improvement, and is nearly the same as Cats + EWAH with an early conversion to BFInstance from BFSparse (which is expected, as the underlying BitSet changes).

@anish749
Contributor Author

Hi @johnynek, I had another idea about this:
have a BF backed by a mutable Java BitSet, and provide a method to convert it to an immutable BloomFilter backed by an immutable BitSet.

The BF Aggregator may internally use the mutable BloomFilter and, in the present step, convert it to an immutable BF and expose that. This way creation would become considerably faster, while reads would go through an immutable BF. The aggregator would look like BloomFilterAggregator[A, mutable.BF[A], BF[A]]. Since adding elements to a Bloom filter is idempotent, and the aggregator is used in a map -> reduce -> map way, I don't think using a mutable implementation will cause any problem if reduce tasks fail. Do you think this is something that we should try? I can send a PR comparing this.

This would take away a lot of complexity. In the current implementation we have mutability at the LongBitSet level, which is internal to the BF, and at the tree branches for the cats immutable bitset.
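
A rough sketch of the mutable-intermediate, immutable-result shape described above. MiniAggregator is a hypothetical stand-in that only mirrors the prepare / combine / present structure; it is not algebird's Aggregator, and FrozenFilter, the seeded MurmurHash3 hashing, and the String element type are likewise assumptions.

```scala
import java.util.{BitSet => JBitSet}
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object MutableIntermediateSketch {
  // Hypothetical stand-in for the aggregator shape: prepare -> combine -> present.
  trait MiniAggregator[A, B, C] {
    def prepare(a: A): B
    def combine(l: B, r: B): B
    def present(b: B): C
    def apply(items: Iterable[A]): Option[C] =
      items.iterator.map(a => prepare(a)).reduceOption((l, r) => combine(l, r)).map(b => present(b))
  }

  // Illustrative hashing only: one seeded MurmurHash3 per hash function.
  def positions(item: String, numHashes: Int, width: Int): Array[Int] =
    Array.tabulate(numHashes)(seed => Math.floorMod(MurmurHash3.stringHash(item, seed), width))

  // Immutable result exposed to callers.
  final case class FrozenFilter(bits: BitSet, width: Int, numHashes: Int) {
    def maybeContains(item: String): Boolean =
      positions(item, numHashes, width).forall(p => bits(p))
  }

  def bloomAggregator(width: Int, numHashes: Int): MiniAggregator[String, JBitSet, FrozenFilter] =
    new MiniAggregator[String, JBitSet, FrozenFilter] {
      def prepare(a: String): JBitSet = {
        val b = new JBitSet(width)
        positions(a, numHashes, width).foreach(p => b.set(p))
        b
      }
      // Mutates the left bitset in place; OR-ing the same bits twice changes nothing.
      def combine(l: JBitSet, r: JBitSet): JBitSet = { l.or(r); l }
      // Copies the words once into an immutable bitset at the very end.
      def present(b: JBitSet): FrozenFilter =
        FrozenFilter(BitSet.fromBitMask(b.toLongArray), width, numHashes)
    }
}
```

Calling `bloomAggregator(1 << 12, 3)(Seq("a", "b", "a"))` yields the same FrozenFilter as aggregating `Seq("a", "b")`, which is the idempotence the argument relies on. Allocating one bitset per element in prepare is wasteful; a real implementation would more likely fold elements into a single mutable filter.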

@anish749
Contributor Author

anish749 commented Mar 8, 2019

Hello @johnynek Were you able to take a look at this?

@nevillelyh
Collaborator

👍 on the mutable backed BF & immutable present idea. But of course I'm biased since we use BF a lot internally. @johnynek any thoughts?

@CLAassistant

CLAassistant commented Nov 16, 2019

CLA assistant check
All committers have signed the CLA.

@johnynek
Collaborator

johnynek commented May 7, 2020

This somehow fell off my radar, I'm sorry.

I assume you have an internal implementation that you are happy with.

  1. Can you sign the CLA? When it is not signed, the PR stays yellow, and it is easy to lose track of non-green PRs.
  2. I think having a mutable intermediate state (in Aggregator) is nicer than having a whole mutable monoid since, as you note, you generally encapsulate the usage immutable -> immutable (who cares if in between it is mutable).
  3. I'm still excited about having a fast, purely immutable Bloom filter if we can make the approach from cats work.
