* Updated setup.py to explicitly specify dependency versions and removed the `use_scm_version` flag (a sketch of the pinning pattern follows this list)
* Tweaked CI configuration to comment out any potential publishing steps
* Trying to fix the CI build
* Fixing a linting error
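As a rough illustration of the pinning pattern described above (the package names and version pins below are placeholders, not the actual pins from this change):

```python
# Illustrative setup.py sketch: dependencies pinned explicitly and the
# version declared directly instead of via setuptools_scm's use_scm_version.
from setuptools import setup, find_packages

setup(
    name="python_moztelemetry",
    version="0.0.0",            # explicit version, no use_scm_version
    packages=find_packages(),
    install_requires=[
        "requests==2.20.0",     # placeholder pin
        "boto3==1.9.0",         # placeholder pin
    ],
)
```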
Dataset.summaries uses a concurrent.futures.ProcessPoolExecutor to fetch multiple files from S3 at once.
ProcessPoolExecutor uses multiprocessing underneath, which defaults to using fork() on Unix.
Using fork() is dangerous and prone to deadlocks: https://codewithoutrules.com/2018/09/04/python-multiprocessing/
This is a possible source of observed deadlocks during calls to Dataset.records.
Using threads should not be a performance regression since the operation we're parallelizing over is network-bound,
not CPU-bound, so there should not be much contention for the GIL.
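As a minimal sketch of the idea (the names below are illustrative, not the actual implementation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_summary(key):
    # Stand-in for an S3 GET. While a thread is blocked on the network it
    # releases the GIL, so threads overlap the waits without the deadlock
    # hazards of fork()-based process pools.
    time.sleep(0.1)
    return "summary for %s" % key

keys = ["telemetry/%d" % i for i in range(20)]
with ThreadPoolExecutor(max_workers=10) as pool:
    summaries = list(pool.map(fetch_summary, keys))
```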
We should definitely be testing the docs as part of CI to make sure they build,
which is addressed here. But this change also explores what it could look
like to publish docs to GitHub Pages rather than ReadTheDocs, so that we can
avoid the additional developer friction of understanding that service and
maintaining user permissions there.
The gh-pages docs are live at:
http://mozilla.github.io/python_moztelemetry/
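For reference, a publish step along these lines can be a small script like the following (a sketch only; the tool choice, paths, and flags are assumptions, not necessarily what this PR does):

```python
# Hypothetical publish step: build the Sphinx docs, then push the rendered
# HTML to the gh-pages branch with ghp-import.
import subprocess

subprocess.check_call(["sphinx-build", "-b", "html", "docs", "docs/_build/html"])
# -n adds a .nojekyll file; -p pushes the updated gh-pages branch to origin.
subprocess.check_call(["ghp-import", "-n", "-p", "docs/_build/html"])
```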
A few ReadTheDocs features I notice are missing here:
- hosting multiple versions of the docs, though we don't appear to be using this
- download links for PDF, HTML, and Epub
- "Edit on GitHub" links; the gh-pages rendered version links to the content
on the gh-pages branch rather than on master
The above features are indeed nice to have. RTD is also a fairly
Python-specific tool, so if we do value hosting API docs for our projects,
the technique here is a bit more transferable.
This PR is mostly intended to provoke discussion.
I'm totally fine if we decide to close this.
* Update docstring to clarify SparkContext vs. SparkSession
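For context, in Spark 2.x a SparkSession wraps a SparkContext, and APIs that still expect the older object can pull it off the session (standard PySpark, shown here only to illustrate the distinction):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Spark 2.x entry point
sc = spark.sparkContext                     # the underlying SparkContext
```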
* Ignore new pycodestyle W504 rule
It's in pycodestyle's default ignore list since it's mutually
exclusive with the existing W503 rule.
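The mutual exclusivity is easy to see with a wrapped expression, which has to break either before or after the operator:

```python
first_value, second_value = 1, 2

# W504 flags a break *after* a binary operator:
total = (first_value +
         second_value)

# W503 flags a break *before* one; any wrapped expression must trip one of
# the two, which is why pycodestyle ignores both by default.
total = (first_value
         + second_value)
```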
`./bin/test tests/heka/` was running all tests rather than just those in the
target directory because `--cov` was eating the directory, treating it as the
argument naming where we should measure code coverage.
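A sketch of the ambiguity (the coverage target below is illustrative): pytest-cov's `--cov` option takes an optional value, so a bare `--cov` swallows the next argument:

```python
import pytest

# "tests/heka/" is consumed as the value of --cov, so no test path remains
# and pytest falls back to collecting the whole suite:
pytest.main(["--cov", "tests/heka/"])

# The unambiguous form names the coverage target explicitly, leaving the
# directory as a positional test path:
pytest.main(["--cov=moztelemetry", "tests/heka/"])
```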
@acmiyaguchi reported that with the source mounted into the container,
cache files were written to the local filesystem and then
couldn't be removed without sudo on an Ubuntu host.
This change should make sure all cache files are written inside the
container so they don't hit the local filesystem.
On CircleCI's Docker infrastructure, cpu_count returns 32
even though the container is limited to 2 virtual CPUs;
this spawned far more processes than the container could service,
leading to test timeouts.
We set max_concurrency low for tests so they can complete quickly
on CircleCI.
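A sketch of the pattern (the cap value and names are illustrative):

```python
import multiprocessing

# On CircleCI's Docker executors cpu_count() reports the host's cores
# (e.g. 32), not the 2 vCPUs the container actually gets, so derive the
# worker count from an explicit cap rather than trusting cpu_count() alone.
MAX_CONCURRENCY = 4

workers = min(multiprocessing.cpu_count(), MAX_CONCURRENCY)
```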
Documents from landfill will be decoded directly into their string
representation. The logic in `_parse_heka_record` is generally
unnecessary because fields are not extracted when dumped to landfill.
* Use parse_scalars.py instead of custom code
This additionally removes the REQUIRED_FIELDS and
OPTIONAL_FIELDS dictionaries: these checks would
be performed by the parse_scalars.py library with
`strict_type_checks=True`. However, for server-side
computation we usually disable this to remain
backward compatible with older registry formats.
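A rough sketch of the intended usage (the entry-point name and file path are assumptions about parse_scalars.py, not verified against its API):

```python
# Hypothetical usage: load scalar definitions with strict type checks
# disabled so that older registry formats still parse on the server side.
from parse_scalars import load_scalars  # assumed entry point

scalars = load_scalars("Scalars.yaml", strict_type_checks=False)
```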
* Make the updater script refresh all the dependencies