Algebird's HyperLogLog support for Apache Spark.
Roberto Agostino Vitillo 574d9b9235 Add sbt plugins. 2016-04-09 15:14:17 +01:00
project Add sbt plugins. 2016-04-09 15:14:17 +01:00
src Configure distribution. 2016-04-09 15:07:08 +01:00
.gitignore First commit. 2016-04-09 08:49:56 +01:00
LICENSE First commit. 2016-04-09 08:49:56 +01:00
README.md Configure distribution. 2016-04-09 15:07:08 +01:00
build.sbt Bump version. 2016-04-09 15:10:44 +01:00

spark-hyperloglog

Algebird's HyperLogLog support for Apache Spark. This package can be used in concert with presto-hyperloglog to share HyperLogLog sets between Spark and Presto.

Example usage

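// Assumes a spark-shell session, where sc, sqlContext and
// sqlContext.implicits._ (needed for toDF) are already in scope.
import org.apache.spark.sql.functions.expr
import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.function._

// Register the aggregate and the two scalar functions so they can be called from SQL expressions.
val hllMerge = new HyperLogLogMerge
sqlContext.udf.register("hll_merge", hllMerge)
sqlContext.udf.register("hll_create", hllCreate _)
sqlContext.udf.register("hll_cardinality", hllCardinality _)

val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")

// Build one HLL per row, merge them all, and print the estimated distinct count.
// show() returns Unit, so the result is not assigned to a value.
frame
  .select(expr("hll_create(id, 12) as hll"))
  .groupBy()
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show()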

yields:

+-----+
|count|
+-----+
|    3|
+-----+
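
To give a rough idea of what the three registered functions do conceptually, here is a minimal sketch written directly against Algebird, without Spark. The object `HllSketch` and its method names are hypothetical and are not this package's code. It also assumes that sketches are serialized with Algebird's `HyperLogLog.toBytes` and `HyperLogLog.fromBytes`, which may differ from the format actually shared with presto-hyperloglog.

```scala
import com.twitter.algebird.{HyperLogLog, HyperLogLogMonoid}

// Hypothetical stand-ins for hll_create, hll_merge and hll_cardinality.
object HllSketch {

  // Build a sketch holding a single element; `bits` sets the register count (2^bits).
  def create(element: String, bits: Int): Array[Byte] = {
    val monoid = new HyperLogLogMonoid(bits)
    HyperLogLog.toBytes(monoid.create(element.getBytes("UTF-8")))
  }

  // Merge several serialized sketches that were all built with the same `bits`.
  def merge(sketches: Seq[Array[Byte]], bits: Int): Array[Byte] = {
    val monoid = new HyperLogLogMonoid(bits)
    val merged = monoid.sum(sketches.map(b => HyperLogLog.fromBytes(b)))
    HyperLogLog.toBytes(merged)
  }

  // Estimate the number of distinct elements in a serialized sketch.
  def cardinality(sketch: Array[Byte]): Long =
    HyperLogLog.fromBytes(sketch).approximateSize.estimate
}

object HllSketchDemo extends App {
  val bits = 12
  val sketches = List("a", "b", "c", "c").map(HllSketch.create(_, bits))
  // The result is an estimate of the distinct count, not an exact figure.
  println(HllSketch.cardinality(HllSketch.merge(sketches, bits)))
}
```

Because merging is associative and commutative, the partial sketches can be combined in any order, which is what makes the structure suitable for a distributed aggregate such as `hll_merge`.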