Commit Graph

299 Commits

Author SHA1 Message Date
Michael Sharp 2dd988cfce Merged PR 5201: Updated Featurizers SimpleSchema
Updated SimpleSchema to allow for better mapping of input->output types.
Added support for multiple template types.

Note: Tuple types aren't supported yet, so the code is generated for RobustScalar but we will need to add that code in manually in the custom section.
2019-10-04 15:28:51 +00:00
Ye Wang da44bfb1f9 Merged PR 5184: RobustScalarFeaturizer
RobustScalarFeaturizer chains the RobustScalarNormEstimator and RobustScalarTransformer.

    RobustScalarNormEstimator: takes the training data and gets its median and range
    RobustScalarTransformer: uses the median and range (converted to a "scale" using q_range) to modify inference data row by row

Estimator signature: Estimator(ptr, _with_centering = true, _with_scaling = true, _quantile_range = (25.0, 75.0))
2019-10-03 20:03:47 +00:00
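The estimator/transformer split described in PR 5184 above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `RobustParams`, `quantile`, `fit_robust`, and `transform_robust` are hypothetical names, and the linear-interpolation quantile is an assumed convention. Only the defaults (centering on, scaling on, quantile range 25.0 to 75.0) come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names throughout; only the default arguments mirror the commit message.
struct RobustParams {
    double median;  // subtracted when centering is enabled
    double scale;   // upper quantile minus lower quantile, divided out when scaling is enabled
};

// Quantile with linear interpolation between closest ranks (an assumed convention).
// `sorted` must be non-empty and ascending; `pct` is in [0, 100].
inline double quantile(const std::vector<double>& sorted, double pct) {
    assert(!sorted.empty());
    const double pos = (pct / 100.0) * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// "Estimator" step: learn the median and quantile range from training data.
inline RobustParams fit_robust(std::vector<double> data,
                               std::pair<double, double> q_range = {25.0, 75.0}) {
    std::sort(data.begin(), data.end());
    return {quantile(data, 50.0),
            quantile(data, q_range.second) - quantile(data, q_range.first)};
}

// "Transformer" step: apply the learned parameters to one inference value.
inline double transform_robust(double x, const RobustParams& p,
                               bool with_centering = true, bool with_scaling = true) {
    if (with_centering) x -= p.median;
    if (with_scaling && p.scale != 0.0) x /= p.scale;  // skip division for a zero range
    return x;
}
```

Training on `{1, 2, 3, 4, 5}` under these assumptions gives a median of 3 and a scale of 2, so an inference value of 5 maps to 1.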
Ye Wang 916bb7958b Merged PR 5176: Fix build error(artifact download) when triggered by another build(completion)
Resolved the issue by using two artifact download tasks with different conditions.
2019-09-30 18:22:06 +00:00
Michael Sharp e4649ef6a2 Merged PR 5170: removed ML.NET files as they are now in ML.NET itself
removed ML.NET files as they are now in ML.NET itself
2019-09-26 22:29:51 +00:00
Jin Yan f1a713d9df Merged PR 5151: new integration
add support for big endian serialization.
The goal is to serialize everything in little endian so that serialized data is portable across little- and big-endian architectures.
2019-09-26 17:43:13 +00:00
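One common way to get the portability described in PR 5151 above is to write integers byte by byte with shifts, so the output does not depend on the host's byte order. The sketch below illustrates that general technique only; `write_u32_le` and `read_u32_le` are hypothetical helpers, and the repository's actual serialization code may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` to `out` least-significant byte first, regardless of host byte order.
inline void write_u32_le(std::vector<std::uint8_t>& out, std::uint32_t value) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>((value >> shift) & 0xFFu));
    }
}

// Read four little-endian bytes starting at `offset` and rebuild the integer.
inline std::uint32_t read_u32_le(const std::vector<std::uint8_t>& in, std::size_t offset) {
    assert(offset + 4 <= in.size());
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        value |= static_cast<std::uint32_t>(in[offset + i]) << (8 * i);
    }
    return value;
}
```

Because the shifts operate on values rather than on memory layout, the same bytes are produced and consumed on both little- and big-endian machines.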
Ye Wang 969886d25d Merged PR 5144: Azure Upload
Azure upload for documentation and nuget packages
2019-09-26 17:01:50 +00:00
David Brownell 48b99117db Merged PR 5155: Updated linux runtime to linux-x64 in nuget packages
Updated linux runtime to linux-x64 in nuget packages
2019-09-24 19:48:05 +00:00
David Brownell 1242c0a9b8 Merged PR 5153: Updates for TimeSeriesImputerFeaturizer
Updates for TimeSeriesImputerFeaturizer:

- Validates input impute strategy
- Moves median/col error validation to the transformer
- Populates empty values in median scenarios when errors are suppressed
- Introduces new Shared-object layer tests
- Ensures chronological order for inputs during transform
- Enumerates tests that need to be written
2019-09-24 17:30:42 +00:00
David Brownell 733ca71a36 Merged PR 5150: Functionality to specify an optional data root dir
- Functionality to specify an optional data root dir
- Moved sg_pointerTable to a single compilation unit and renamed it to g_pointerTable
2019-09-23 21:42:22 +00:00
David Brownell 77c02d515d Merged PR 5149: Fixed bug when deserializing data used to recreate a transformer
Fixed bug when deserializing data used to recreate a transformer
2019-09-23 16:51:31 +00:00
David Brownell d4a1675353 Merged PR 5148: Signing binaries and nuget packages
Signing content
2019-09-21 01:56:22 +00:00
Anuj Shrotriya 81f98ee18c Merged PR 5139: TimeSeriesImputerFeaturizer- Initial implementation (WIP)
This featurizer is supposed to do the following:
1) Fill gaps (add rows) for grain cols in the time series.
2) Impute the specified cols per the impute strategy (for this iteration, the supported impute strategies are ffill, bfill, and median).

Implementation Details
This featurizer has been implemented as composition of three estimators:
      FrequencyEstimator: It is an annotation estimator.
      MedianEstimator: It is also an annotation estimator.
      ImputationEstimator: It is an inference only estimator. It reads in the frequency and median annotations and creates the transformer which does imputation.

Open Issues:
    PipelineExecutor to enable passing args to ctor.
    PipelineExecutor to enable invoking flush.
    flush implementation
    Archive implementation
    Honoring suppressError flag during imputation.
    More Unit tests.
2019-09-21 01:40:36 +00:00
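The three impute strategies named in PR 5139 above (ffill, bfill, median) can be illustrated on a single numeric column. This is a sketch under stated assumptions, not the featurizer's implementation: `ImputeStrategy`, `kMissing`, `median_of_present`, and `impute` are hypothetical names, missing values are represented as NaN, and gap filling (adding rows) is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

enum class ImputeStrategy { Forward, Backward, Median };  // ffill, bfill, median

// Assumed marker for a missing value.
const double kMissing = std::numeric_limits<double>::quiet_NaN();

// Median of the non-missing values; NaN if every value is missing.
inline double median_of_present(const std::vector<double>& col) {
    std::vector<double> present;
    for (double v : col) {
        if (!std::isnan(v)) present.push_back(v);
    }
    if (present.empty()) return kMissing;
    std::sort(present.begin(), present.end());
    const std::size_t mid = present.size() / 2;
    return present.size() % 2 ? present[mid] : (present[mid - 1] + present[mid]) / 2.0;
}

inline std::vector<double> impute(std::vector<double> col, ImputeStrategy strategy) {
    switch (strategy) {
    case ImputeStrategy::Forward:
        // Carry the last seen value forward; leading gaps stay missing.
        for (std::size_t i = 1; i < col.size(); ++i) {
            if (std::isnan(col[i])) col[i] = col[i - 1];
        }
        break;
    case ImputeStrategy::Backward:
        // Walk from the end, pulling the next value backward; trailing gaps stay missing.
        for (std::size_t i = col.size(); i-- > 1;) {
            if (std::isnan(col[i - 1])) col[i - 1] = col[i];
        }
        break;
    case ImputeStrategy::Median: {
        const double m = median_of_present(col);
        for (double& v : col) {
            if (std::isnan(v)) v = m;
        }
        break;
    }
    }
    return col;
}
```

For the column `{1, missing, missing, 4}`, forward fill yields `{1, 1, 1, 4}`, backward fill yields `{1, 4, 4, 4}`, and median yields `{1, 2.5, 2.5, 4}`.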
David Brownell f8482a016e Merged PR 5142: Fixed serialization issues, added tests
Fixed serialization issues, added tests
2019-09-20 20:38:13 +00:00
David Brownell 86639a8ae7 Merged PR 5141: Added functionality to enumerate holiday files
Added functionality to enumerate holiday files and code to invoke functionality from the shared wrapper.
2019-09-20 19:03:48 +00:00
David Brownell 89dcbba065 Merged PR 5134: Estimators with custom constructor args can now be initialized within the PipelineExecutionEstimator
Prior to this checkin, all Estimators within a Pipeline had to be initialized in the same way. After this checkin, functors can be provided to the Pipeline's constructor that allow for estimator-specific construction for estimators within the chain.
2019-09-19 21:32:09 +00:00
David Brownell dae05ad276 Merged PR 5133: Shared Object Interface for TimeSeriesImputer
Initial checkin
2019-09-19 17:59:58 +00:00
Anuj Shrotriya 4181401a98 Merged PR 5123: Interfaces and Dummy implementation for TimeSeriesImputer
TimeSeries Imputer Implementation E2E (with dummy functional logic)
2019-09-19 01:19:14 +00:00
David Brownell be04d491bd Merged PR 5127: Nuget Package Updates
Updates to ensure that data files are correctly copied when using the package as a dependency.
2019-09-18 22:30:22 +00:00
David Brownell d774ed6115 Merged PR 5115: Building nuget packages
Building nuget
2019-09-17 18:04:29 +00:00
David Brownell 69232f9353 Merged PR 5102: Refactor of DateTimeFeaturizer Changes to upload the data as part of the build
- Moved data location
- Updated cmake files to use a cmake module that includes copies of the data
- Added version information to Linux binaries
2019-09-13 23:46:48 +00:00
Ye Wang 23648f8774 Merged PR 5070: Add holiday name for DateTimefeaturizer
1: Add folder /3rdParty/holidays_by_country, which contains json files (holiday information) for each country and the code for generating those files
2: Add json.h in /3rdParty for json related process
3: Modify DateTimeFeaturizer(.h & .cpp) to handle passing in Country Name by constructor.
4: Add tests for this new feature

_**please note that the holiday name may look different for different compilers**_
2019-09-12 23:52:53 +00:00
Jin Yan 9d9a9df2f8 Merged PR 5083: integration with generated code
Integration with generated code completed.
EstimatorHandle, TransformerHandle, and ErrorInfoHandle go through the pointer table after they are created and before they are used.
Rebased on new Master branch.
2019-09-11 19:25:10 +00:00
David Brownell a6b244c7b0 Merged PR 5085: Removed unused code and updated build headers
Removed unused code and updated build headers
2019-09-11 17:39:56 +00:00
David Brownell db2adde992 Merged PR 5084: Disabling code coverage on Linux PR builds
Disabling code coverage on Linux PR builds
2019-09-11 00:07:03 +00:00
David Brownell 3efef14e12 Merged PR 5081: Added new build configurations, removing unused code
- Added new build configurations for MSVC and Linux builds
- Removed unused code
- Added doxygen documentation generator
- Added placeholder packaging code
2019-09-10 19:54:09 +00:00
Jin Yan b85a776f72 Merged PR 5066: Pointer Table for Security
Pointer Table implemented.
Files:
src\FeaturizerPrep\SharedLibrary\UnitTests\PointerTable_UnitTest.cpp
src\FeaturizerPrep\SharedLibrary\PointerTable.h

Testcase passed.
TODO:
Integrate with generated code.
2019-09-09 22:55:17 +00:00
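The commit message for PR 5066 above does not describe how the pointer table works. One plausible reading, given the "for Security" title, is that raw pointers are swapped for opaque handles before crossing the shared-library boundary, and each handle is validated before use. The class below sketches that idea only; its name, methods, and behavior are assumptions and may differ from the real `PointerTable.h` in every detail.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <unordered_map>

// Hypothetical sketch of a handle table; not the repository's implementation.
class PointerTable {
public:
    // Register a pointer and return an opaque, non-zero handle for it.
    std::size_t Add(void* ptr) {
        if (ptr == nullptr) throw std::invalid_argument("null pointer");
        const std::size_t handle = next_++;
        table_[handle] = ptr;
        return handle;
    }

    // Translate a handle back to its pointer; unknown handles are rejected.
    void* Get(std::size_t handle) const {
        const auto it = table_.find(handle);
        if (it == table_.end()) throw std::invalid_argument("invalid handle");
        return it->second;
    }

    // Forget a handle so that later lookups fail instead of dangling.
    void Remove(std::size_t handle) {
        if (table_.erase(handle) == 0) throw std::invalid_argument("invalid handle");
    }

private:
    std::size_t next_ = 1;  // 0 is reserved to mean "no handle"
    std::unordered_map<std::size_t, void*> table_;
};
```

With a table like this, a caller that passes a stale or fabricated handle gets an error rather than causing a dereference of an arbitrary address.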
David Brownell 44148ace13 Merged PR 5065: Refactor of Featurizer code and cmake build in preparation for new Featurizers
Note that this change is only a refactor and does not alter any existing functionality
2019-09-05 22:48:49 +00:00
David Brownell d53254e21e Merged PR 5060: New technique for conversion of floats and doubles to strings
New technique for conversion of floats and doubles to strings. This fixes an issue on Linux which we could only reproduce when testing via Nimbus.
2019-09-04 21:52:35 +00:00
David Brownell b88e88a3e6 Merged PR 5055: Integration tests for the Shared Library interface
2019-09-04 19:02:11 +00:00
David Brownell 6960d88541 Merged PR 5049: Fixed bug with optional strings in code generated for the DLL interface
Fixed bug with optional strings in code generated for the DLL interface
2019-08-30 18:39:26 +00:00
David Brownell 3054511061 Merged PR 5044: Scripts to create the C DLL interface, code generated by those scripts, and updates for Linux builds
Note that all code in folders named "GeneratedCode" has been generated by other tools in this repo.
2019-08-29 23:48:26 +00:00
Michael Sharp ff82df716a Merged PR 5026: Additional columns for DateTimeFeaturizer
Combined the functionality of both DateTimeFeaturizers by adding all the combined columns to this one.

Holiday information is still not in here as we are not sure how we are going to handle that yet.

I added 2 more libraries to the shared section. They are in the ORT repo and Dmitri said we are good to take a dependency on that.
2019-08-29 18:42:50 +00:00
Michael Sharp 0b6dea26ba Merged PR 5035: Changed DateTimeFeaturizer input to std::int64_t
Changed DT input to std::int64_t representing seconds since 1970
Removed existing ML.NET C++ wrapper code and tests as that will now be implemented with Dave's codegen.
2019-08-29 05:33:02 +00:00
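PR 5035 above changes the DateTimeFeaturizer input to a `std::int64_t` holding seconds since 1970. As a rough illustration of what that input encodes, the sketch below splits such a value into UTC calendar fields using the standard library. `DateParts` and `split_utc` are hypothetical names, and the featurizer itself may compute its columns differently.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical output struct; the real featurizer exposes its own set of columns.
struct DateParts {
    int year, month, day, hour, minute, second;
    int day_of_week;  // 0 = Sunday, following std::tm::tm_wday
};

// Split seconds since 1970-01-01 00:00:00 UTC into calendar fields.
inline DateParts split_utc(std::int64_t seconds_since_epoch) {
    const std::time_t t = static_cast<std::time_t>(seconds_since_epoch);
    const std::tm* utc = std::gmtime(&t);  // not thread-safe; acceptable for a sketch
    assert(utc != nullptr);
    return {utc->tm_year + 1900, utc->tm_mon + 1, utc->tm_mday,
            utc->tm_hour, utc->tm_min, utc->tm_sec, utc->tm_wday};
}
```

For example, the value 1567296000 corresponds to 2019-09-01 00:00:00 UTC, a Sunday.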
Anuj Shrotriya a2ab1c8a4e Merged PR 5017: CatImputer
CatImputer Description:
This featurizer imputes missing values in an input column with the most frequent one.

Design:
Underlying implementation of this featurizer is composed of two estimators:
1) HistogramEstimator: This estimator computes the histogram for the input column and creates a HistogramAnnotation. Note that this 'IS A' Annotation Estimator, i.e., it doesn't have a transformer.
2) HistogramConsumerEstimator: This class retrieves a HistogramAnnotation created by HistogramEstimator and computes the most frequent value from it. This value is then used to impute missing values.
Both of these estimators are chained in PipelineExecutionEstimator which is exposed as CatImputer.
2019-08-28 22:49:18 +00:00
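The two-step design in PR 5017 above (build a histogram, then impute with its most frequent value) can be sketched in a few functions. This is an illustration only: `Histogram`, `build_histogram`, `most_frequent`, and `impute_most_frequent` are hypothetical names, an empty string is assumed to mark a missing value, and the estimator and annotation machinery of the real CatImputer is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Histogram = std::map<std::string, std::size_t>;

// Step 1 (the histogram estimator's role): count occurrences of each non-missing value.
inline Histogram build_histogram(const std::vector<std::string>& column) {
    Histogram hist;
    for (const std::string& v : column) {
        if (!v.empty()) ++hist[v];
    }
    return hist;
}

// Step 2 (the consumer's role): pick the most frequent value.
// Ties go to the lexicographically smallest key because std::map iterates in key order.
inline std::string most_frequent(const Histogram& hist) {
    assert(!hist.empty());
    auto best = hist.begin();
    for (auto it = hist.begin(); it != hist.end(); ++it) {
        if (it->second > best->second) best = it;
    }
    return best->first;
}

// Transform: replace missing entries with the learned value.
inline std::vector<std::string> impute_most_frequent(std::vector<std::string> column,
                                                     const std::string& fill) {
    for (std::string& v : column) {
        if (v.empty()) v = fill;
    }
    return column;
}
```

For the column `{"a", "b", "", "b", ""}`, the histogram is `a: 1, b: 2`, so both missing entries are filled with `"b"`.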
David Brownell 67d0fd984c Merged PR 5029: Fix for InferenceOnlyEstimators that were prematurely converted to Transformers within a Pipeline
InferenceOnlyEstimators don't require training, but they can't be created prematurely within a pipeline if they rely on information generated by ancestor AnnotationEstimators. This fix delays the creation of Transformers associated with InferenceOnlyEstimators until everything that comes before them in a pipeline has completed training.
2019-08-28 00:19:15 +00:00
David Brownell b6f2cefc88 Merged PR 5025: Changes deferred from previous PR
2019-08-26 19:59:05 +00:00
David Brownell 445eb2abb8 Merged PR 5006: Added TrainingOnlyEstimatorImpl and renamed files for consistency
2019-08-26 16:43:42 +00:00
David Brownell 88a6395cd5 Merged PR 4996: Pipeline Execution Estimator
The PipelineExecutionEstimator allows the caller to chain multiple Estimators end-to-end to form a pipeline (or DAG). During training, data trickles down to the currently untrained Estimator within the chain. Once all training is complete, a Transformer is created that only invokes the Transformers associated with TransformerEstimators in the original chain.

This code attempts to provide compile-time warnings when Estimators are chained together in an incompatible way.
2019-08-21 16:44:13 +00:00
Michael Sharp 14835dbe44 Merged PR 4976: DateTimeTransformer implemented in ML.NET with associated C++ wrapper
This is the ML.NET framework implementation to use the shared C++ library DateTimeTransformer. It includes the ML.NET C# code, as well as the C++ wrapper code.

It does NOT yet implement saving of the model or exporting to ONNX.

The C# code works and has its associated unit tests, but since we have not set up the new project in the ML.NET solution, it cannot be built in the DataPipelines repo yet.

REVIEW: I am not sure how to make a CMake file for the C++ DLLs, so that still needs to be added.
2019-08-19 16:45:08 +00:00
David Brownell 706ebbc9f1 Merged PR 4982: Removed boost::optional, restored Linux tests, updated some exceptions
Removed boost::optional, restored Linux tests, updated some exceptions
2019-08-15 22:24:50 +00:00
David Brownell 0d6e278ec4 Merged PR 4972: Added serialization functionality to Transformers
Added serialization functionality to Transformers
2019-08-14 20:01:20 +00:00
David Brownell 2e3d623850 Merged PR 4950: Updates to Featurizers to support additional scenarios
Updates to Featurizers to support additional scenarios
2019-08-12 20:16:20 +00:00
Ye Wang 623496d798 Merged PR 4936: add to-do list for String Transformer
add to-do list in Traits.h
2019-08-12 19:24:19 +00:00
Ye Wang 26a96778df Merged PR 4930: add tuple transformer
add tuple transformer for StringTransformer
2019-08-07 22:55:02 +00:00
Michael Sharp 855c3f6d27 Merged PR 4915: adding new optional to replace boost
adding new optional to replace boost
2019-08-07 16:28:28 +00:00
Ye Wang 05a5474c5a Merged PR 4904: String Transformer
String transformer (wrapper for traits.h) for basic types (bool, string, integers, numbers, arrays, vectors, maps). Nested structs are included. Tests added for the string transformer and traits.
2019-08-06 21:22:42 +00:00
David Brownell e7b796b720 Merged PR 4902: Pass command line arguments to setup in the bootstrap process
Pass command line arguments to setup in the bootstrap process
2019-08-05 17:38:36 +00:00
Michael Sharp 90d7aef480 Merged PR 4846: DateTime Transformer
Created DateTimeTransformer in the Featurizers folder.
Copied Jamie's DateTime code to the Featurizers folder. //REVIEW, should I remove her code from its original location?
2019-07-31 22:28:58 +00:00
David Brownell c81b3ab3c0 Merged PR 4881: Including boost as a header only library for now (which disables cmake errors
Including boost as a header-only library for now (which disables the cmake errors that occur when compiled boost libraries can't be found)
2019-07-31 22:17:05 +00:00
Michael Sharp 455bf4f76c Merged PR 4877: Traits class added
Traits class added
2019-07-31 21:09:38 +00:00