# spark-hyperloglog

Algebird's HyperLogLog support for Apache Spark. This package can be used in concert with presto-hyperloglog to share HyperLogLog sets between Spark and Presto.


## Installing

This project is published as `mozilla/spark-hyperloglog` on spark-packages.org, so it is available via:

```
spark-shell --packages mozilla:spark-hyperloglog:2.2.0
```
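If you build with sbt and the sbt-spark-package plugin (which this repo uses for publishing), the dependency can likely also be declared in `build.sbt`; the `spDependencies` key comes from that plugin, so treat this snippet as an assumption rather than a documented instruction:

```scala
// build.sbt -- assumes the sbt-spark-package plugin is enabled
spDependencies += "mozilla/spark-hyperloglog:2.2.0"
```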

## Example usage

```scala
import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._
import org.apache.spark.sql.functions.expr
import spark.implicits._

val hllMerge = new HyperLogLogMerge
spark.udf.register("hll_merge", hllMerge)
spark.udf.register("hll_create", hllCreate _)
spark.udf.register("hll_cardinality", hllCardinality _)

val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")
(frame
  .select(expr("hll_create(id, 12) as hll"))
  .groupBy()
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show())
```

yields:

```
+-----+
|count|
+-----+
|    3|
+-----+
```
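For intuition about what `hll_create`, `hll_merge`, and `hll_cardinality` do: a HyperLogLog sketch is an array of `2**p` registers (the `12` above is `p`), merging is a register-wise max, and cardinality is estimated from the register values. Below is a minimal, self-contained Python sketch of those semantics; all names are hypothetical and this is not the Algebird implementation:

```python
import hashlib
import math


def _hash64(value: str) -> int:
    """Deterministic 64-bit hash of a string (illustrative choice, not Algebird's)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class SimpleHLL:
    """Toy HyperLogLog: 2**p registers, each storing a max leading-zero rank."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = _hash64(value)
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1 # 1 + count of leading zeros
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "SimpleHLL") -> "SimpleHLL":
        # Register-wise max: this is what makes HLL sketches shareable
        # across systems (e.g. Spark and Presto).
        out = SimpleHLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate
```

Adding `["a", "b", "c", "c"]` and estimating the cardinality mirrors the Spark pipeline above: duplicates land on the same register, so the estimate comes out near 3.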

## Deployment

To publish a new version of the package, create a new release on GitHub with a tag starting with `v`, such as `v2.2.0`. The tag triggers a CircleCI build that publishes to Mozilla's Maven repo in S3.

The CircleCI build will also attempt to publish the new tag to spark-packages.org, but due to an outstanding bug in the sbt-spark-package plugin that publish will likely fail. You can retry locally until it succeeds: create a GitHub personal access token, export the environment variables `GITHUB_USERNAME` and `GITHUB_PERSONAL_ACCESS_TOKEN`, and then repeatedly run `sbt spPublish` until you get a non-404 response.
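The retry procedure can be sketched as a small shell loop. Placeholder values are used for the credentials, and the loop assumes `sbt spPublish` exits non-zero on the 404 failure, which may not hold; if it does not, watch the output and interrupt manually:

```shell
# Hypothetical retry loop; substitute your own GitHub credentials.
export GITHUB_USERNAME="<your-github-username>"
export GITHUB_PERSONAL_ACCESS_TOKEN="<your-token>"

# Retry until spPublish succeeds (assumes a failing publish exits non-zero).
until sbt spPublish; do
  echo "spPublish failed, retrying..."
  sleep 5
done
```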