Processing engine and React components for constructing configuration-based data transformation and processing pipelines.

Chris Trevino d101fcd167 Refactor: Streamline Verb Core Functions, Package Exports (#747) 2024-06-20 12:14:48 -07:00

DataShaper

This project provides a collection of components for executing processing pipelines, particularly oriented to data wrangling. Detailed documentation is provided in subfolders, with an overview of high-level goals and concepts here. Most of the documentation within individual packages is tailored to developers needing to understand how the code is organized and executed. Higher-level concepts for the project as a whole, constructing workflows, etc. are in the root docs folder.

Motivation

There are four primary goals of the project:

Create a shareable client/server schema for describing data processing steps. This is in the schema folder. TypeScript types and JSONSchema generation are in javascript/schema, and published schemas are copied out to schema along with test cases that are executed by the JavaScript and Python builds to ensure parity. Stable released versions of DataShaper schemas are hosted on github.io for permanent reference (described below).
Maintain an implementation of a basic client-side wrangling engine (largely based on Arquero). This is in the javascript/workflow folder. This contains a reactive execution engine, along with individual verb implementations.
Maintain a Python implementation using common wrangling libraries (e.g., pandas) for backend or data science deployments. This is in the python folder. The execution engine is less complete than in JavaScript, but has complete verb implementations and test suite parity. A fuller-featured generalized pipeline execution engine is forthcoming.
Provide an application framework along with some reusable React components so wrangling operations can be incorporated into web applications easily. This is in the javascript/app-framework and javascript/react folders.

Individual documentation for the JavaScript and Python implementations can be found in their respective folders. Broad documentation about building pipelines and the available verbs is available in the docs folder.

We currently have seven primary JavaScript packages:

app-framework - this provides web application infrastructure for creating data-driven apps with minimal boilerplate.
react - this is a set of React components, one for each verb, that you can include in web apps to enable transformation pipeline building.
schema - this is a set of core types and associated JSONSchema definitions for formalizing our data package and resource models (including the definitions for table parsing, Codebooks, and Workflows).
tables - this is the primary set of functions for loading and parsing data tables, using Arquero under the hood.
utilities - this is a set of helpers for working with files, etc., to ease building data wrangling applications.
webapp - this is the deployable DataShaper webapp that includes all of the verb components and allows creation, execution, and saving of pipeline JSON files. We also rely on this to demonstrate example code, including a TestApp profile. If you're wondering how to build an app with DataShaper components, start here!
workflow - this is the primary engine for pipeline execution. It includes low-level operational primitives to execute a wide variety of relational algebra transformations over Arquero tables.

Also note that each JavaScript package has a generated docs folder containing Markdown API documentation extracted from code comments using api-extractor.

The Python packages are much simpler, because there is no associated web application or component code:

engine - contains the core verb implementations.
workflow.py - this is the primary execution engine that loads and interprets pipelines, and iterates through the steps to produce outputs.

Schema management

We generate JSONSchema for formal project artifacts including resource definitions and workflow specifications. This allows validation by any consumer and/or implementor. Schema versions are published on github.io for permanent reference. Each variant of a schema is hosted in perpetuity with semantic versioning. Aliases to the most recent (unversioned latest) and major revisions are also published. Here are direct links to the latest versions of our primary schemas:

Bundle (types) (published schema)
Codebook (types) (published schema)
Data Package (types) (published schema)
Data Table (types) (published schema)
Table Bundle (types) (published schema)
Workflow (types) (published schema)

Note that for the purposes of pipeline development, the workflow schema is primary. The rest are largely used for package management and table bundling in the web application.

Creating new verbs

For new verbs within the DataShaper toolkit, you must first determine if JavaScript and Python parity is desired. For operations that should be configurable via a UX, a JavaScript implementation is necessary. However, if the verb is primarily useful for data science workflows and has potentially complicated parameters, a Python-only implementation may be fine. We have a preference for parity to reduce confusion and allow for cross-platform execution of any pipelines created with the tool, but also recognize the value of the Python-based execution engine for configuring data science and ETL workflows that will only ever be run server-side.

Core verbs

Core verbs are built into the toolkit, and should generally have JavaScript and Python parity. Creating these verbs involves the following steps:

Schema definition - this is done by authoring TypeScript types in the javascript/schema folder, which are then generated as JSONSchema during a build step.
Cross-platform tests - these are defined in schema/fixtures, primarily in the workflow folder. Each fixture includes a workflow.json and an expected output csv file. Executors run in both JavaScript and Python to confirm that outputs match the expected table.
JavaScript implementation - verbs are implemented in javascript/workflow/verbs
Verb UX - individual verb UX components are in javascript/react
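
The cross-platform test step above can be illustrated with a small standard-library sketch: run a workflow against an input table and compare the result with the expected CSV. The workflow layout, the toy "strings.upper" executor, and all helper names here are simplified assumptions for illustration only; real fixtures follow the published workflow schema and run through the DataShaper engines.

```python
import csv
import io

# Hypothetical, simplified stand-in for a fixture's workflow.json.
workflow = {
    "steps": [
        {"verb": "strings.upper", "args": {"column": "name", "to": "name_upper"}}
    ]
}

# Inline stand-ins for a fixture's input table and expected output csv file.
input_csv = "id,name\n1,ada\n2,grace\n"
expected_csv = "id,name,name_upper\n1,ada,ADA\n2,grace,GRACE\n"


def read_csv(text):
    """Parse CSV text into a list of row dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def upper(rows, column, to):
    """Toy verb: write the upper-cased value of `column` into a new column `to`."""
    return [{**row, to: row[column].upper()} for row in rows]


# Toy verb lookup table; not the real engine's registry.
VERBS = {"strings.upper": upper}


def run_workflow(workflow, rows):
    """Apply each step in order, feeding each step's output into the next."""
    for step in workflow["steps"]:
        rows = VERBS[step["verb"]](rows, **step["args"])
    return rows


actual = run_workflow(workflow, read_csv(input_csv))
expected = read_csv(expected_csv)
fixture_passes = actual == expected
```

Because the same workflow and expected table would be shared by both implementations, a mismatch in either language shows up as a failed comparison against the same file.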

Python implementation

Verbs are implemented in python/verbs
Create a verb file whose package path mirrors the verb name in the JSON schema. For example, if the schema defines the verb as:

"verb": {
    "const": "strings.upper",
    "type": "string"
}

The verb must then be located at datashaper.engine.verbs.strings.upper.
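
As a minimal sketch of this naming convention, the helper below derives the module location from a schema verb name. The function name verb_module_path is hypothetical and not part of DataShaper; only the datashaper.engine.verbs prefix comes from the example above.

```python
# Package prefix taken from the example location above.
VERB_PACKAGE = "datashaper.engine.verbs"


def verb_module_path(verb_name: str) -> str:
    """Return the dotted module path implied by a schema verb name.

    Hypothetical helper for illustration; it only concatenates strings.
    """
    return f"{VERB_PACKAGE}.{verb_name}"


location = verb_module_path("strings.upper")
```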

Create a function that replicates the functionality of the JavaScript version, and use the @verb decorator to make it available to the Workflow engine. The name parameter of the decorator must match the package name defined in the schema. For example:

@verb(name="my_package.upper")
def upper(input: VerbInput, column: str, to: str):
    ...

Important Note: Each verb name must be unique. Registering a verb under a name that already exists raises a ValueError. For example, creating a new "strings.upper" raises a ValueError; to provide a custom version of this verb, register it under a different name such as "my_package.upper", as in the example above.

Custom verbs

The Python implementation supports the use of custom verbs supplied by your application - this allows arbitrary processing pipelines to be built that contain custom logic and processing steps.

TODO: document custom verb format

Build and test

JavaScript

You need Node.js and Yarn installed
Operate from the project root
Run: yarn
Then: yarn build
Run the webapp locally: yarn start

Python

You need Python and Poetry installed
Operate from the python/datashaper folder
Run: poetry install
Then: poetry run poe test

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.