# spark-hyperloglog

Algebird's HyperLogLog support for Apache Spark. This package can be used in concert
with [presto-hyperloglog](https://github.com/vitillo/presto-hyperloglog) to share
HyperLogLog sets between Spark and Presto.

[![codecov.io](https://codecov.io/github/mozilla/spark-hyperloglog/coverage.svg?branch=master)](https://codecov.io/github/mozilla/spark-hyperloglog?branch=master)
[![CircleCi](https://circleci.com/gh/mozilla/spark-hyperloglog.svg?style=shield&circle-token=5506f56072f0198ece2995a8539c174cc648c9e4)](https://circleci.com/gh/mozilla/spark-hyperloglog)

### Installing

This project is published as
[mozilla/spark-hyperloglog](https://spark-packages.org/package/mozilla/spark-hyperloglog)
on spark-packages.org, so it is available via:

    spark-shell --packages mozilla:spark-hyperloglog:2.2.0
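
If you build with sbt and the sbt-spark-package plugin, the same package should
also be resolvable as a build dependency; a minimal sketch, assuming the plugin
is already installed:

```scala
// build.sbt sketch: resolve the artifact from spark-packages.org
// through the sbt-spark-package plugin's spDependencies key.
spDependencies += "mozilla/spark-hyperloglog:2.2.0"
```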

### Example usage

```scala
import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._
import org.apache.spark.sql.functions.expr

// Assumes a spark-shell session, where `spark` (SparkSession) and
// `sc` (SparkContext) are predefined.
import spark.implicits._ // for .toDF on the RDD below

// Register the HyperLogLog UDFs so they can be used in SQL expressions.
val hllMerge = new HyperLogLogMerge
spark.udf.register("hll_merge", hllMerge)
spark.udf.register("hll_create", hllCreate _)
spark.udf.register("hll_cardinality", hllCardinality _)

val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")
(frame
  .select(expr("hll_create(id, 12) as hll")) // one sketch per row, precision 12
  .groupBy()
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show())
```

yields:

```bash
+-----+
|count|
+-----+
|    3|
+-----+
```
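
Because the functions are registered as Spark SQL UDFs, the same aggregation can
also be written as a plain SQL query. A minimal sketch continuing from the
snippet above (the `ids` view name is made up for illustration):

```scala
// Expose the DataFrame to SQL and compute the approximate distinct count there;
// hll_merge acts as an aggregate function, just like in the expr() form above.
frame.createOrReplaceTempView("ids")
spark.sql(
  "SELECT hll_cardinality(hll_merge(hll_create(id, 12))) AS count FROM ids"
).show()
```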

### Deployment

To publish a new version of the package, you need to
[create a new release on GitHub](https://github.com/mozilla/spark-hyperloglog/releases/new)
with a tag version starting with `v`, like `v2.2.0`. The tag will trigger a CircleCI build
that publishes to Mozilla's Maven repository in S3.
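
Creating the release in the GitHub UI creates the git tag; pushing the same tag
from the command line should trigger the identical build, assuming nothing else
about the release is needed. A sketch using the example version above:

```bash
# Tag the release commit and push the tag; CircleCI picks it up
# the same way it does for a tag created via a GitHub release.
git tag v2.2.0
git push origin v2.2.0
```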

The CircleCI build will also attempt to publish the new tag to spark-packages.org,
but due to
[an outstanding bug in the sbt-spark-package plugin](https://github.com/databricks/sbt-spark-package/issues/31)
that publish will likely fail. You can retry locally until it succeeds: create a
GitHub personal access token, export the environment variables `GITHUB_USERNAME`
and `GITHUB_PERSONAL_ACCESS_TOKEN`, and then repeatedly run `sbt spPublish` until
you get a non-404 response.
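
A minimal sketch of that retry loop (the credential values are placeholders;
create the token itself on GitHub first):

```bash
# Credentials the publish step needs, per the instructions above.
export GITHUB_USERNAME="your-github-username"
export GITHUB_PERSONAL_ACCESS_TOKEN="your-token"

# Re-run the publish until it exits successfully instead of failing with a 404.
until sbt spPublish; do
  echo "spPublish failed, retrying..."
  sleep 5
done
```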