DataShaper
This project provides a collection of components for executing processing pipelines, particularly oriented to data wrangling. Detailed documentation is provided in subfolders, with an overview of high-level goals and concepts here. Most of the documentation within individual packages is tailored to developers needing to understand how the code is organized and executed. Higher-level concepts for the project as a whole, constructing workflows, etc. are in the root docs folder.
Motivation
There are four primary goals of the project:
- Create a shareable client/server schema for describing data processing steps. This is in the schema folder. TypeScript types and JSONSchema generation are in javascript/schema, and published schemas are copied out to schema along with test cases that are executed by the JavaScript and Python builds to ensure parity. Stable released versions of DataShaper schemas are hosted on github.io for permanent reference (described below).
- Maintain an implementation of a basic client-side wrangling engine (largely based on Arquero). This is in the javascript/workflow folder. This contains a reactive execution engine, along with individual verb implementations.
- Maintain a Python implementation using common wrangling libraries (e.g., pandas) for backend or data science deployments. This is in the python folder. The execution engine is less complete than in JavaScript, but has complete verb implementations and test suite parity. A fuller-featured generalized pipeline execution engine is forthcoming.
- Provide an application framework along with some reusable React components so wrangling operations can be incorporated into web applications easily. This is in the javascript/app-framework and javascript/react folders.
Individual documentation for the JavaScript and Python implementations can be found in their respective folders. Broad documentation about building pipelines and the available verbs is available in the docs folder.
We currently have seven primary JavaScript packages:
- app-framework - this provides web application infrastructure for creating data-driven apps with minimal boilerplate.
- react - this is a set of React components, one for each verb, that you can include in web apps to enable transformation pipeline building.
- schema - this is a set of core types and associated JSONSchema definitions for formalizing our data package and resource models (including the definitions for table parsing, Codebooks, and Workflows).
- tables - this is the primary set of functions for loading and parsing data tables, using Arquero under the hood.
- utilities - this is a set of helpers for working with files, etc., to ease building data wrangling applications.
- webapp - this is the deployable DataShaper webapp that includes all of the verb components and allows creation, execution, and saving of pipeline JSON files. We also rely on this to demonstrate example code, including a TestApp profile. If you're wondering how to build an app with DataShaper components, start here!
- workflow - this is the primary engine for pipeline execution. It includes low-level operational primitives to execute a wide variety of relational algebra transformations over Arquero tables.
Also note that each JavaScript package has a generated docs folder containing Markdown API documentation extracted from code comments using api-extractor.
The Python packages are much simpler, because there is no associated web application and component code.
- engine - contains the core verb implementations.
- workflow.py - this is the primary execution engine that loads and interprets pipelines, and iterates through the steps to produce outputs.
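To make the "load, interpret, iterate" idea concrete, here is a minimal sketch of a step-by-step executor. It is not the DataShaper engine: the table representation (a list of dicts), the `VERBS` lookup, the `run_workflow` function, and the `steps`/`verb`/`args` keys are all assumptions made for illustration. Consult the published workflow schema for the real shape.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]


# Hypothetical verb implementations; names and signatures are illustrative only.
def upper(table: Table, column: str, to: str) -> Table:
    """Write an upper-cased copy of `column` into a new column `to`."""
    return [{**row, to: str(row[column]).upper()} for row in table]


def select(table: Table, columns: List[str]) -> Table:
    """Keep only the listed columns."""
    return [{c: row[c] for c in columns} for row in table]


# Assumed lookup from verb name to implementation.
VERBS: Dict[str, Callable[..., Table]] = {
    "strings.upper": upper,
    "select": select,
}


def run_workflow(workflow: Dict[str, Any], table: Table) -> Table:
    """Apply each step in order, feeding every output into the next step."""
    for step in workflow["steps"]:
        verb_fn = VERBS[step["verb"]]
        table = verb_fn(table, **step.get("args", {}))
    return table


# Assumed workflow layout, loosely modelled on a list of steps.
workflow = {
    "steps": [
        {"verb": "strings.upper", "args": {"column": "name", "to": "name_upper"}},
        {"verb": "select", "args": {"columns": ["id", "name_upper"]}},
    ]
}

result = run_workflow(workflow, [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}])
print(result)
# [{'id': 1, 'name_upper': 'ADA'}, {'id': 2, 'name_upper': 'GRACE'}]
```

Keeping verbs as plain functions behind a name lookup is what lets a pipeline be stored as JSON and replayed later.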
Schema management
We generate JSONSchema for formal project artifacts including resource definitions and workflow specifications. This allows validation by any consumer and/or implementor. Schema versions are published on github.io for permanent reference. Each variant of a schema is hosted in perpetuity with semantic versioning. Aliases to the most recent (unversioned latest) and major revisions are also published. Here are direct links to the latest versions of our primary schemas:
- Bundle (types) (published schema)
- Codebook (types) (published schema)
- Data Package (types) (published schema)
- Data Table (types) (published schema)
- Table Bundle (types) (published schema)
- Workflow (types) (published schema)
Note that for the purposes of pipeline development, the workflow schema is primary. The rest are largely used for package management and table bundling in the web application.
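As an illustration of how semantic versions could map onto the unversioned and major-revision aliases mentioned above, the sketch below computes which published version each alias would point to. The alias names (`latest`, `v1`, `v2`) and the helper functions are hypothetical; the actual publishing scripts and URL layout may differ.

```python
from typing import Dict, List, Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into integers for numeric comparison."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def resolve_aliases(published: List[str]) -> Dict[str, str]:
    """Map each alias to the newest published version it should serve."""
    aliases: Dict[str, str] = {}
    # Walk versions from oldest to newest so later entries overwrite earlier ones.
    for version in sorted(published, key=parse):
        aliases["latest"] = version  # hypothetical unversioned alias
        aliases[f"v{parse(version)[0]}"] = version  # hypothetical major alias
    return aliases


aliases = resolve_aliases(["1.0.0", "1.2.0", "1.10.1", "2.0.0"])
print(aliases)
# {'latest': '2.0.0', 'v1': '1.10.1', 'v2': '2.0.0'}
```

Comparing parsed integers rather than raw strings matters here: as text, "1.10.1" would sort before "1.2.0".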
Creating new verbs
For new verbs within the DataShaper toolkit, you must first determine if JavaScript and Python parity is desired. For operations that should be configurable via a UX, a JavaScript implementation is necessary. However, if the verb is primarily useful for data science workflows and has potentially complicated parameters, a Python-only implementation may be fine. We have a preference for parity to reduce confusion and allow for cross-platform execution of any pipelines created with the tool, but also recognize the value of the Python-based execution engine for configuring data science and ETL workflows that will only ever be run server-side.
Core verbs
Core verbs are built into the toolkit, and should generally have JavaScript and Python parity. Creating these verbs involves the following steps:
- Schema definition - this is done by authoring TypeScript types in the javascript/schema folder, which are then generated as JSONSchema during a build step.
- Cross-platform tests - these are defined in schema/fixtures, primarily in the workflow folder. Each fixture includes a workflow.json and an expected output csv file. Executors run in both JavaScript and Python to confirm that outputs match the expected table.
- JavaScript implementation - verbs are implemented in javascript/workflow/verbs
- Verb UX - individual verb UX components are in javascript/react
Python implementation
- Verbs are implemented in python/verbs
- Create a verb file whose package structure follows the JSON schema. For example, if the verb is defined in the schema as:
"verb": {
    "const": "strings.upper",
    "type": "string"
}
then the verb must be located at datashaper.engine.verbs.strings.upper.
- Create a function that replicates the functionality of the JavaScript version, and use the @verb decorator to make it available to the Workflow engine. The name parameter of the decorator must match the package name defined in the schema. For example:
@verb(name="my_package.upper")
def upper(input: VerbInput, column: str, to: str):
    ...
Important note: verb names must be unique. Registering a verb under a name that already exists raises a ValueError. For example, creating a new "strings.upper" raises a ValueError; to provide a custom version of this verb, register it under a different name such as "my_package.upper", as in the example above.
Custom verbs
The Python implementation supports the use of custom verbs supplied by your application - this allows arbitrary processing pipelines to be built that contain custom logic and processing steps.
TODO: document custom verb format
Build and test
JavaScript
- You need node and yarn installed
- Operate from project root
- Run:
yarn
- Then:
yarn build
- Run the webapp locally:
yarn start
Python
- You need Python and poetry installed
- Operate from python/datashaper folder
- Run:
poetry install
- Then:
poetry run poe test
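For context on what the cross-platform fixtures check, here is a rough sketch of a fixture comparison: load a workflow.json, run it through an executor, and compare the result with the expected CSV. The file names `input.csv` and `expected.csv`, the `check_fixture` helper, and the stand-in executor are hypothetical. The real test harness and fixture layout live in the repository.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Table = List[Dict[str, str]]
Executor = Callable[[dict, Table], Table]


def read_csv(path: Path) -> Table:
    """Read a CSV file into a list of row dicts (all values stay strings)."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def check_fixture(fixture_dir: Path, execute: Executor) -> bool:
    """Run the fixture's workflow and compare the output with the expected table."""
    workflow = json.loads((fixture_dir / "workflow.json").read_text())
    actual = execute(workflow, read_csv(fixture_dir / "input.csv"))
    return actual == read_csv(fixture_dir / "expected.csv")


def toy_executor(workflow: dict, table: Table) -> Table:
    """Stand-in executor that upper-cases one column per step (illustrative only)."""
    for step in workflow["steps"]:
        args = step["args"]
        table = [{**row, args["to"]: row[args["column"]].upper()} for row in table]
    return table


# Build a throwaway fixture so the sketch runs without the repository checked out.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp)
    steps = [{"verb": "strings.upper", "args": {"column": "name", "to": "name_upper"}}]
    (fixture / "workflow.json").write_text(json.dumps({"steps": steps}))
    (fixture / "input.csv").write_text("id,name\n1,ada\n")
    (fixture / "expected.csv").write_text("id,name,name_upper\n1,ada,ADA\n")
    fixture_passes = check_fixture(fixture, toy_executor)

print(fixture_passes)
# True
```

Because the executor is passed in as a parameter, the same fixture directory can be checked against more than one implementation, which is the point of sharing fixtures across the JavaScript and Python builds.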
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.