Processing engine and React components for constructing configuration-based data transformation and processing pipelines.

Перейти к файлу

Chris Trevino 48b16a0b98 dx improvement; use input/output in specifications		2022-04-08 15:42:04 -07:00
.devcontainer	clean up devcontainer.json	2022-01-03 14:08:40 -08:00
.github/workflows	node version update	2022-04-05 19:14:40 +00:00
.husky	mark precommit script as executable	2021-12-02 09:06:55 -08:00
.vscode	Merge branch 'main' into interface-refactor	2022-02-22 14:54:55 -08:00
.yarn	merge main	2022-04-08 11:55:12 -07:00
docs	Add ONEHOT component	2022-04-05 14:02:11 -07:00
javascript	dx improvement; use input/output in specifications	2022-04-08 15:42:04 -07:00
python	Python/table containers (#178 )	2022-03-29 14:59:59 -06:00
schema	dx improvement; use input/output in specifications	2022-04-08 15:42:04 -07:00
.eslintignore	generate docs	2022-03-10 16:06:40 -08:00
.eslintrc	replaced webpages help text with the guidance component	2022-03-29 14:24:02 -05:00
.gitattributes	Initial import	2021-12-01 19:10:08 -08:00
.gitignore	generate docs	2022-03-10 16:06:40 -08:00
.prettierignore	generate docs	2022-03-10 16:06:40 -08:00
.vsts-ci.yml	update docs	2022-02-18 15:05:41 -08:00
.yarnrc.yml	update yarn, essex-js-toolkit deps	2022-03-03 19:54:42 -08:00
CODE_OF_CONDUCT.md	CODE_OF_CONDUCT.md committed	2021-11-22 14:36:49 -08:00
LICENSE	LICENSE committed	2021-11-22 14:36:51 -08:00
README.md	Merge branch 'main' into interface-refactor	2022-02-22 14:54:55 -08:00
SECURITY.md	Initial import	2021-12-01 19:10:08 -08:00
SUPPORT.md	Initial import	2021-12-01 19:10:08 -08:00
jest.config.js	remove test-only lodash import, simplify config	2022-02-18 14:18:52 -08:00
jest.setup.mjs	refactored filecollection add method, removed init, modified class tests	2022-01-19 18:10:37 -05:00
package.json	get webapp building	2022-03-29 13:53:03 -07:00
tsconfig.json	Initial import	2021-12-01 19:10:08 -08:00
yarn.lock	merge main	2022-04-08 11:55:12 -07:00

README.md

data-wrangling-components

This project provides a collection of web components for doing lightweight data wrangling.

There are four goals of the project:

Create a shareable client/server schema for serialized wrangling instructions
Maintain an implementation of a basic client-side wrangling engine (largely based on Arquero)
Maintain a python implementation using common wrangling libraries (e.g., pandas) for backend or data science deployments
Provide some reusable React components so wrangling operations can be incorporated into webapps easily.

The first goal is nascent, and currently covered by TypeScript typings in the core javascript package. However, our intent is to eventually extract a JSONSchema specification that is more readily consumable by cross-platform services. In addition, our API largely mirrors Arquero's for now; we'll review for areas of parameter commonality and make some generalizations in the future.

Individual documentation for the JavaScript and Python implementations can be found in their respective folders. Broad documentation about building pipelines and the available verbs is available in the docs folder

We currently have four packages:

core - this is the primary engine for pipeline execution. It includes low-level operational primitives to execute a wide variety of relational algebra transformations over Arquero tables. The pipeline is essentially an implementation of async chain-of-command, executing verbs serially based on an input table context and set of step configurations.
react - this is a set of React components for each verb that you can include in web apps that enable tranformation pipeline building.
utilities - this is a set of helpers for working with files, etc., to ease building data wrangling applications.
webapp - this is an example/test webapp that includes all of the verb components and allows creation, execution, and saving of pipeline JSON files.

Building

You need node and yarn installed
Run: yarn
Then: yarn build
Run the webapp locally: yarn start

Usage

The webapp uses both the core engine and React components to build a small application that demonstrates how to use the wrangling components. At a basic level, you need a set of input tables, which you place in a TableStore (basically a chain execution context). You add wrangling steps to a Pipeline, then run it to generate an output table.

Tables in the store are referenced by key. Steps can create any number of output tables that are also written to the store. Future steps can therefore build upon previous/intermediate outputs however you'd like. See the every-operation.json example for a sample of every verb we currently support.

Example joining two tables:

    import { table } from 'arquero'
    import { createTableStore, createPipeline } from '@data-wrangling-components/core'

    // id   name
    // 1    bob
    // 2    joe
    // 3    jane
    const parents = table({
        id: [1, 2, 3],
        name: ['bob', 'joe', 'jane']
    })

    // id   kid
    // 1    billy
    // 1    jill
    // 2    kaden
    // 2    kyle
    // 3    moe
    const kids = table({
        id: [1, 1, 2, 2, 3],
        kid: ['billy', 'jill', 'kaden', 'kyle', 'moe]
    })

    const store = createTableStore()

    store.set({
        id: 'parents',
        table: parents
    })
    store.set({
        id: 'kids',
        table: kids
    })

	const pipeline = createPipeline(store)

    pipeline.add({
        verb: 'join',
        input: 'parents',
        output: 'output',
        args: {
            other: 'kids',
            on: ['id']
        }
    })

    // id   name    kid
    // 1    bob     billy
    // 1    bob     jill
    // 2    joe     kaden
    // 2    joe     kyle
    // 3    jane    moe
    const result = await pipeline.run()

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.