Added Spark Dataset support #814

Tomczik76 · 2020-04-29T22:01:38Z

No description provided.

CLAassistant · 2020-04-29T22:01:47Z

All committers have signed the CLA.

johnynek · 2020-04-30T02:16:55Z

This is great! Thanks for the PR. I'd like to merge it. You need to accept the contributor agreement above.

Also, can you run scalafmtAll at the sbt prompt and commit the results?

johnynek · 2020-04-30T02:17:21Z

algebird-spark/src/main/scala/com/twitter/algebird/spark/implicits/package.scala

+
+package object implicits {
+
+import com.twitter.algebird.BF


Can you move the imports to the top of the file?

johnynek · 2020-04-30T02:17:57Z

algebird-spark/src/main/scala/com/twitter/algebird/spark/package.scala

  /**
   * spark exposes an Aggregator type, so this is here to avoid shadowing
   */
  type AlgebirdAggregator[A, B, C] = Aggregator[A, B, C]
  val AlgebirdAggregator = Aggregator

-  implicit class ToAlgebird[T](val rdd: RDD[T]) extends AnyVal {
+  implicit class ToAlgebirdRDD[T](val rdd: RDD[T]) extends AnyVal {


Can we not change the name of this? Renaming it will break source and binary compatibility for existing users.
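A minimal sketch of the compatibility-preserving alternative, assuming the goal is only to add a Dataset enrichment alongside the existing RDD one. FakeRDD, FakeDataset, sumAll and the name ToAlgebirdDataset are hypothetical stand-ins so the snippet compiles without Spark; they are not the PR's actual code.

```scala
object CompatSketch {
  // Hypothetical stand-ins for Spark's RDD and Dataset (assumptions, not real Spark types).
  final case class FakeRDD[T](items: Seq[T])
  final case class FakeDataset[T](items: Seq[T])

  // The existing enrichment keeps its published name, so code that refers to
  // ToAlgebird by name (source or compiled) keeps working.
  implicit class ToAlgebird[T](val rdd: FakeRDD[T]) extends AnyVal {
    def sumAll(implicit num: Numeric[T]): T = rdd.items.sum
  }

  // The new enrichment is added under a new, hypothetical name instead of
  // renaming the old class.
  implicit class ToAlgebirdDataset[T](val ds: FakeDataset[T]) extends AnyVal {
    def sumAll(implicit num: Numeric[T]): T = ds.items.sum
  }
}
```

With import CompatSketch._ in scope, both FakeRDD(Seq(1, 2, 3)).sumAll and FakeDataset(Seq(1.5, 2.5)).sumAll resolve, each through its own enrichment class.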

regadas · 2020-04-30T02:32:21Z

algebird-spark/src/test/scala/com/twitter/algebird/spark/AlgebirdDatasetTests.scala

+  }
+
+  after {
+    // try spark.stop()


Seems that we can remove this after block?

regadas · 2020-04-30T02:36:01Z

algebird-spark/src/main/scala/com/twitter/algebird/spark/implicits/package.scala

@@ -0,0 +1,29 @@
+package com.twitter.algebird.spark
+
+package object implicits {


I think I would actually create an EncoderInstances trait and have package object spark extend EncoderInstances.

Users will do import com.twitter.algebird.spark._ and get these lower-priority implicits. I can see that this was possibly meant to be in line with the Spark way of doing things, but I think this is more consistent with the rest of algebird.

Tomczik76 · 2020-05-01

I tried this and it didn't work with spark.implicits._, so I'm moving them into the package object spark instead.
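For reference, the EncoderInstances suggestion can be sketched as below. This only illustrates the general pattern of placing lower-priority implicits in a trait that an object mixes in; Enc is a hypothetical stand-in for Spark's Encoder, and sparkLike stands in for the package object. It does not reproduce the interaction with spark.implicits._ mentioned in the reply.

```scala
object EncoderPrioritySketch {
  // Hypothetical stand-in for org.apache.spark.sql.Encoder (assumption),
  // so the sketch needs no Spark dependency.
  trait Enc[T] { def name: String }

  // Lower-priority instances live in a trait...
  trait EncoderInstances {
    implicit def fallbackEnc[T]: Enc[T] = new Enc[T] { val name = "fallback" }
  }

  // ...which the object mixes in, so a single wildcard import brings them into scope.
  // Implicits defined directly in the object take precedence over inherited ones.
  object sparkLike extends EncoderInstances {
    implicit val intEnc: Enc[Int] = new Enc[Int] { val name = "int" }
  }

  import sparkLike._

  def encName[T](implicit e: Enc[T]): String = e.name

  val forInt: String = encName[Int]       // resolves to intEnc
  val forString: String = encName[String] // falls back to fallbackEnc
}
```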

regadas · 2020-04-30T02:39:14Z

algebird-spark/src/main/scala/com/twitter/algebird/spark/implicits/package.scala

+
+  import scala.reflect.ClassTag
+  implicit def kryoPriorityQueueEncoder[A](implicit ct: ClassTag[PriorityQueue[A]]): Encoder[PriorityQueue[A]] =
+    org.apache.spark.sql.Encoders.kryo[PriorityQueue[A]](ct)


Suggested change

-    org.apache.spark.sql.Encoders.kryo[PriorityQueue[A]](ct)
+    org.apache.spark.sql.Encoders.kryo[PriorityQueue[A]]

IIRC ClassTag is a context bound, so the above should be possible?
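A small Scala 2 sketch of why the explicit argument can be dropped, assuming the called method declares its ClassTag as a context bound. The describe method is a hypothetical stand-in for such a method and List replaces PriorityQueue, so the snippet compiles without Spark.

```scala
import scala.reflect.ClassTag

object ContextBoundSketch {
  // Hypothetical stand-in for a method declared with a ClassTag context bound.
  def describe[T: ClassTag]: String = implicitly[ClassTag[T]].runtimeClass.getName

  // The implicit ClassTag parameter in scope is picked up automatically.
  def viaImplicit[A](implicit ct: ClassTag[List[A]]): String =
    describe[List[A]]

  // Passing it explicitly also compiles in Scala 2, but is redundant.
  def viaExplicit[A](implicit ct: ClassTag[List[A]]): String =
    describe[List[A]](ct)
}
```

Both forms return the same result for the same type argument, which is the point of the suggested change.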

regadas · 2020-04-30T02:48:51Z

algebird-spark/src/test/scala/com/twitter/algebird/spark/AlgebirdDatasetTests.scala

+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.Dataset
+import com.twitter.algebird.BloomFilter


Seems unused?

regadas · 2020-04-30T02:57:46Z

algebird-spark/src/test/scala/com/twitter/algebird/spark/AlgebirdDatasetTests.scala

+   * above to at least check compilation
+   */
+  test("aggregate") {
+    val sparkImplicits = spark.implicits


I think I would move these to #L24 to make it more evident that these are indeed the SparkSession implicits.

johnynek · 2020-05-07T02:24:38Z

@Tomczik76 I'd love to include this. Any idea when you might have time to get back around to it?

Added Spark Dataset support

4d0716b
