Updated SimpleSchema to allow for better mapping of input->output types.
Added support for multiple template types.
Note: Tuple types aren't supported yet, so the code is generated for RobustScalar without them; we will need to add that code manually in the custom section.
RobustScalarFeaturizer chains the RobustScalarNormEstimator and RobustScalarTransformer.
RobustScalarNormEstimator: takes the training data and computes its median and range
RobustScalarTransformer: uses the median and range (transformed to a "scale" using q_range) to modify inference data row by row
Estimator signature: Estimator(ptr, _with_centering = true, _with_scaling = true, _quantile_range = (25.0, 75.0))
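The estimator/transformer split described above can be sketched roughly as follows. The names and helper functions here are hypothetical, not the library's actual API, and the percentile computation is a simple nearest-rank stand-in:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the "estimator" phase computes the median and
// quantile range from training data; the "transformer" phase scales
// inference values one at a time.
struct RobustScalarState {
    double median = 0.0;
    double range = 1.0;
};

// Simple nearest-rank percentile on a sorted copy (illustrative only).
double Percentile(std::vector<double> v, double p) {
    std::sort(v.begin(), v.end());
    std::size_t idx = static_cast<std::size_t>(p / 100.0 * (v.size() - 1));
    return v[idx];
}

RobustScalarState Fit(const std::vector<double> &training,
                      bool withCentering = true,
                      bool withScaling = true,
                      double qLow = 25.0,
                      double qHigh = 75.0) {
    RobustScalarState state;
    if (withCentering)
        state.median = Percentile(training, 50.0);
    if (withScaling)
        state.range = Percentile(training, qHigh) - Percentile(training, qLow);
    return state;
}

// Row-by-row transform: subtract the median, divide by the quantile range.
double Transform(const RobustScalarState &state, double value) {
    return (value - state.median) / state.range;
}
```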
Add support for big-endian serialization.
The goal is to serialize everything in little endian, providing portability across little- and big-endian architectures.
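One common way to get this portability is to always write the least-significant byte first, independent of the host's byte order. A minimal sketch with hypothetical helper names (not the library's serialization API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Endian-neutral serialization sketch: values are always written
// least-significant byte first, so the on-disk format is little endian
// regardless of the host architecture.
void WriteUInt32LE(std::vector<std::uint8_t> &buffer, std::uint32_t value) {
    for (int shift = 0; shift < 32; shift += 8)
        buffer.push_back(static_cast<std::uint8_t>((value >> shift) & 0xFF));
}

std::uint32_t ReadUInt32LE(const std::uint8_t *data) {
    std::uint32_t value = 0;
    for (int shift = 0; shift < 32; shift += 8)
        value |= static_cast<std::uint32_t>(*data++) << shift;
    return value;
}
```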
Updates for TimeSeriesImputerFeaturizer:
- Validates input impute strategy
- Moves median/col error validation to the transformer
- Populates empty values in median scenarios when errors are suppressed
- Introduces new Shared-object layer tests
- Ensures chronological order for inputs during transform
- Enumerates tests that need to be written
This featurizer is supposed to do the following:
1) Fill gaps (add rows) for grain cols in timeseries.
2) Impute specified cols per the impute strategy (for this iteration, the supported impute strategies are: ffill, bfill and median).
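As a rough illustration of the three strategies (the names and signature here are hypothetical, not the featurizer's API), a single-column impute might look like:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

enum class ImputeStrategy { Forward, Backward, Median };

// Illustrative sketch only: fill missing values in one column according
// to the chosen strategy. For Median, the caller supplies the median
// computed during training.
std::vector<double> Impute(std::vector<std::optional<double>> col,
                           ImputeStrategy strategy,
                           double median = 0.0) {
    std::vector<double> result(col.size());
    if (strategy == ImputeStrategy::Backward) {
        // bfill: propagate the next observed value backwards.
        std::optional<double> next;
        for (std::size_t i = col.size(); i-- > 0;) {
            if (col[i]) next = col[i];
            else col[i] = next;
        }
    }
    std::optional<double> prev;
    for (std::size_t i = 0; i < col.size(); ++i) {
        if (!col[i]) {
            if (strategy == ImputeStrategy::Forward && prev)
                col[i] = prev;       // ffill: reuse the last observed value
            else if (strategy == ImputeStrategy::Median)
                col[i] = median;     // median: use the training median
        }
        if (col[i]) prev = col[i];
        result[i] = col[i].value_or(median);
    }
    return result;
}
```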
Implementation Details
This featurizer has been implemented as a composition of three estimators:
FrequencyEstimator: An annotation estimator; it produces the frequency annotation.
MedianEstimator: Also an annotation estimator; it produces the median annotation.
ImputationEstimator: It is an inference only estimator. It reads in the frequency and median annotations and creates the transformer which does imputation.
Open Issues:
PipelineExecutor to enable passing args to ctor.
PipelineExecutor to enable invoking flush.
flush implementation
Archive implementation
Honoring suppressError flag during imputation.
More Unit tests.
Prior to this checkin, all Estimators within a Pipeline had to be initialized in the same way. After this checkin, functors can be provided to the Pipeline's constructor that allow for estimator-specific construction for estimators within the chain.
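A minimal sketch of the idea, with hypothetical type names: each estimator in the chain gets its own construction functor, and the pipeline invokes them instead of default-constructing every stage the same way:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an estimator in the chain.
struct Estimator {
    explicit Estimator(std::string name) : Name(std::move(name)) {}
    std::string Name;
};

using EstimatorFactory = std::function<std::unique_ptr<Estimator>()>;

// The pipeline receives one factory per estimator, allowing
// estimator-specific construction within the chain.
struct Pipeline {
    explicit Pipeline(std::vector<EstimatorFactory> factories) {
        for (auto &factory : factories)
            Stages.push_back(factory());
    }
    std::vector<std::unique_ptr<Estimator>> Stages;
};
```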
1: Add folder /3rdParty/holidays_by_country, which contains JSON files (holiday information) for each country, plus the code for generating those files
2: Add json.h in /3rdParty for JSON-related processing
3: Modify DateTimeFeaturizer (.h & .cpp) to accept a country name via its constructor.
4: Add tests for this new feature
_**please note that the holiday name may look different for different compilers**_
Integration with generated code completed.
EstimatorHandle, TransformerHandle, and ErrorInfoHandle pass through a pointer table after they are created and before they are used.
Rebased on new Master branch.
Combined the functionality of both DateTimeFeaturizers by adding all of their columns to this one.
Holiday information is still not included, as we are not yet sure how we will handle it.
I added 2 more libraries to the shared section. They are in the ORT repo, and Dmitri said we are good to take a dependency on them.
Changed the DateTime input to std::int64_t representing seconds since 1970.
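A sketch of what consuming the new input type could look like; the struct and function names are hypothetical, and `gmtime_r` is used for the UTC breakdown (POSIX; Windows would use `_gmtime64_s` instead):

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Calendar components expanded from an epoch-seconds input.
struct TimePoint {
    int year;
    int month;   // 1-12
    int day;     // 1-31
};

// The input contract: a std::int64_t holding seconds since the Unix
// epoch (1970-01-01 UTC), broken down into calendar parts.
TimePoint FromEpochSeconds(std::int64_t seconds) {
    std::time_t t = static_cast<std::time_t>(seconds);
    std::tm parts{};
    gmtime_r(&t, &parts);   // UTC breakdown (POSIX)
    return TimePoint{parts.tm_year + 1900, parts.tm_mon + 1, parts.tm_mday};
}
```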
Removed existing ML.NET C++ wrapper code and tests as that will now be implemented with Dave's codegen.
CatImputer Description:
This featurizer imputes missing values in an input column with the most frequent value.
Design:
The underlying implementation of this featurizer is composed of two estimators:
1) HistogramEstimator: This estimator computes the histogram for the input column and creates a HistogramAnnotation. Note that this 'IS A' Annotation Estimator, i.e., it doesn't have a transformer.
2) HistogramConsumerEstimator: This class retrieves a HistogramAnnotation created by HistogramEstimator and computes the most frequent value from it. This value is then used to impute missing values.
Both of these estimators are chained in a PipelineExecutionEstimator, which is exposed as CatImputer.
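A simplified sketch of the two stages (hypothetical names, using std::string categories; the real implementation is templated and annotation-based):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Histogram = std::map<std::string, std::size_t>;

// Stage 1 (HistogramEstimator idea): count occurrences of each
// observed value in the training column, skipping missing entries.
Histogram BuildHistogram(const std::vector<std::optional<std::string>> &col) {
    Histogram h;
    for (const auto &value : col)
        if (value) ++h[*value];
    return h;
}

// Stage 2 (HistogramConsumerEstimator idea): pick the most frequent
// value from the histogram ...
std::string MostFrequent(const Histogram &h) {
    std::string best;
    std::size_t bestCount = 0;
    for (const auto &entry : h)
        if (entry.second > bestCount) { best = entry.first; bestCount = entry.second; }
    return best;
}

// ... and use it to impute missing values at transform time.
std::string Impute(const std::optional<std::string> &value, const std::string &mode) {
    return value ? *value : mode;
}
```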
InferenceOnlyEstimators don't require training, but they can't be created prematurely within a pipeline if they rely on information generated by ancestor AnnotationEstimators. This fix delays the creation of Transformers associated with InferenceOnlyEstimators until everything that comes before them in the pipeline has completed training.
The PipelineExecutionEstimator allows the caller to chain multiple Estimators end-to-end to form a pipeline (or DAG). During training, data trickles down to the currently untrained Estimator within the chain. Once all training is complete, a Transformer is created that only invokes the Transformers associated with TransformerEstimators in the original chain.
This code attempts to provide compile-time warnings when Estimators are chained together in an incompatible way.
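One standard way to get such compile-time checks is a static_assert that each estimator's output type matches the next estimator's input type; the names below are illustrative, not the library's actual ones:

```cpp
#include <type_traits>

// Each estimator advertises its input and output types.
template <typename InputT, typename OutputT>
struct EstimatorBase {
    using InputType = InputT;
    using OutputType = OutputT;
};

// Instantiating Chain with a mismatched pair fails at compile time
// with a descriptive message rather than deep template errors.
template <typename First, typename Second>
struct Chain {
    static_assert(
        std::is_same<typename First::OutputType,
                     typename Second::InputType>::value,
        "Estimators are chained together in an incompatible way");
};
```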
This is the ML.NET framework implementation to use the shared C++ library DateTimeTransformer. It includes the ML.NET C# code, as well as the C++ wrapper code.
It does NOT yet implement saving of the model or exporting to ONNX.
The C# code works and has its associated unit tests, but since we have not set up the new project in the ML.NET solution, it cannot be built in the DataPipelines repo yet.
REVIEW: I am not sure how to make a CMake file for the C++ DLLs, so that still needs to be added.
Created DateTimeTransformer in the Featurizers folder.
Copied Jamie's DateTime code to the Featurizers folder. //REVIEW, should I remove her code from its original location?