Combined the functionality of both DateTimeFeaturizers by adding all the combined columns to this one.
Holiday information is still not in here as we are not sure how we are going to handle that yet.
I added 2 more libraries to the shared section. They are in the ORT repo and Dmitri said we are good to take a dependency on that.
Changed the DateTime input to a std::int64_t representing seconds since 1970.
Removed existing ML.NET C++ wrapper code and tests as that will now be implemented with Dave's codegen.
CatImputer Description:
This featurizer imputes missing values in an input column with the most frequent one.
Design:
Underlying implementation of this featurizer is composed of two estimators:
1) HistogramEstimator: This estimator computes the histogram for the input column and creates a HistogramAnnotation. Note that this 'IS A' Annotation Estimator, i.e. it doesn't have a transformer.
2) HistogramConsumerEstimator: This class retrieves a HistogramAnnotation created by HistogramEstimator and computes the most frequent value from it. This value is then used to impute missing values.
Both of these estimators are chained in PipelineExecutionEstimator which is exposed as CatImputer.
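The two steps above can be sketched roughly as follows. This is a minimal illustration, not the actual estimator classes: the names `Column`, `Histogram`, `BuildHistogram`, `MostFrequent` and `Impute` are hypothetical, missing values are modelled as `std::nullopt`, and the tie-breaking rule is an arbitrary choice made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical names throughout; std::nullopt stands in for a missing value.
using Column = std::vector<std::optional<std::string>>;
using Histogram = std::map<std::string, std::size_t>;

// Step 1 (the HistogramEstimator role): count every non-missing value.
inline Histogram BuildHistogram(Column const &column) {
    Histogram histogram;
    for (auto const &value : column) {
        if (value)
            ++histogram[*value];
    }
    return histogram;
}

// Step 2 (the HistogramConsumerEstimator role): pick the most frequent value.
// Ties go to the smallest key, an arbitrary choice made for this sketch.
inline std::string MostFrequent(Histogram const &histogram) {
    if (histogram.empty())
        throw std::invalid_argument("column has no non-missing values");
    auto best = histogram.begin();
    for (auto it = histogram.begin(); it != histogram.end(); ++it) {
        if (it->second > best->second)
            best = it;
    }
    assert(best->second > 0); // every histogram entry was counted at least once
    return best->first;
}

// Inference: replace each missing value with the value learned in training.
inline std::vector<std::string> Impute(Column const &column, std::string const &fill) {
    std::vector<std::string> result;
    result.reserve(column.size());
    for (auto const &value : column)
        result.push_back(value ? *value : fill);
    return result;
}
```

Calling `Impute(column, MostFrequent(BuildHistogram(column)))` mirrors the chained pipeline: the histogram is computed once during training, and only the single most frequent value is needed at inference time.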
InferenceOnlyEstimators don't require training, but they can't be created prematurely within a pipeline if they rely on information generated by ancestor AnnotationEstimators. This fix delays the creation of Transformers associated with InferenceOnlyEstimators until everything that comes before them in a pipeline has completed training.
The PipelineExecutionEstimator allows the caller to chain multiple Estimators end-to-end to form a pipeline (or DAG). During training, data trickles down to the currently untrained Estimator within the chain. Once all training is complete, a Transformer is created that only invokes the Transformers associated with TransformerEstimators in the original chain.
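As a rough illustration of the "trickle down" idea, the sketch below trains each stage on data that has already passed through every earlier stage. It is a simplified stand-in and not the PipelineExecutionEstimator itself: `Stage`, `MeanCenter`, `MaxAbsScale` and `Pipeline` are hypothetical names, each stage trains in a single pass over an in-memory column, and the distinction between AnnotationEstimators and TransformerEstimators is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical interface: a stage is trained on a column, then transforms values.
struct Stage {
    virtual ~Stage() = default;
    virtual void Fit(std::vector<double> const &data) = 0;
    virtual double Transform(double value) const = 0;
};

// Learns the mean of its input and subtracts it.
struct MeanCenter : Stage {
    double mean = 0.0;
    void Fit(std::vector<double> const &data) override {
        double sum = 0.0;
        for (double v : data) sum += v;
        mean = data.empty() ? 0.0 : sum / static_cast<double>(data.size());
    }
    double Transform(double value) const override { return value - mean; }
};

// Learns the largest absolute value of its input and divides by it.
struct MaxAbsScale : Stage {
    double maxAbs = 1.0;
    void Fit(std::vector<double> const &data) override {
        double m = 0.0;
        for (double v : data) m = std::max(m, std::fabs(v));
        maxAbs = m > 0.0 ? m : 1.0; // avoid dividing by zero
    }
    double Transform(double value) const override { return value / maxAbs; }
};

class Pipeline {
public:
    void Add(std::unique_ptr<Stage> stage) {
        stages_.push_back(std::move(stage));
        trained_ = false;
    }

    // Train stages in order; each later stage sees the output of earlier ones.
    void Fit(std::vector<double> data) {
        for (auto &stage : stages_) {
            stage->Fit(data);
            for (double &v : data) v = stage->Transform(v);
        }
        trained_ = true;
    }

    // After training, inference runs a value through every stage in order.
    double Transform(double value) const {
        assert(trained_); // Fit must be called before Transform
        for (auto const &stage : stages_) value = stage->Transform(value);
        return value;
    }

private:
    std::vector<std::unique_ptr<Stage>> stages_;
    bool trained_ = false;
};
```

In this sketch `MaxAbsScale` is fitted on mean-centered data rather than on the raw input, which is the property the chaining is meant to guarantee.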
This code attempts to provide compile-time warnings when Estimators are chained together in an incompatible way.
This is the ML.NET framework implementation to use the shared C++ library DateTimeTransformer. It includes the ML.NET C# code, as well as the C++ wrapper code.
It does NOT yet implement saving of the model or exporting to ONNX.
The C# code works and has its associated Unit Tests, but since we have not set up the new project in the ML.NET solution it cannot be built in the DataPipelines repo yet.
REVIEW: I am not sure how to make a CMake file for the C++ DLLs, so that still needs to be added.
Created DateTimeTransformer in the Featurizers folder.
Copied Jamie's DateTime code to the Featurizers folder. //REVIEW, should I remove her code from its original location?
Removed support for class. Renamed variables to reflect this change.
Improved IntegrationTests by reducing number of function calls, running time decreased by 50%.
Modified CheckPolicy and its UnitTest to reflect the last decision on supported types. Still need to add the processing of structs.
Changed the UnitTests of CppToJson to better reflect the types we want to accept. This is not needed, since it's strictly testing CppToJson. Also added tests that were discussed with David to make sure that the warnings are working as expected.
Modified IntegrationTests so that deserialize is now called with the flag always_include_optional, and removed temporary comments.
Improved performance on the UnitTests by reducing the number of function calls they were performing. Execution time went from 5 seconds to 1.6 seconds.
Added support for struct hierarchy. Now each struct has a list of the structs it depends on.
Changed 'obj_type_list' to 'struct_list'.
Changed 'struct_name' and 'function_name' to 'name' on the SimpleSchema.
Created a deserialization IntegrationTest to make sure that the output from CppToJson matches the SimpleSchema.
Removed function AddVar, and instead now the variables are set on the constructor.
Changed the way that Declaration/Definition lines were being added into the Class Function.
Changed name from isValid to Verify in a couple functions. It makes more sense now, since they are not only checking validity, but also throwing errors where needed.
Only changed Function to a dictionary at the last second (similar to obj_type).
Fixed spelling of a couple words.
Changed 'func_name' to 'name' to keep function and struct consistent.
Added unit testing for Debug plugin and MlNet plugin.
We had an error with file names previously due to no testing on the MlNet plugin. This PR adds testing to prevent issues like that in the future.
Added a mapping from some C++ to C# types. This list will have to be updated/modified after we confirm the types we support. It will also need to be modified when we support structs.
Previously, the C++ wrapper didn't include any of the `#include` directives from the C++ files. Now, it parses the includes list and includes them as well.
NOTE: The includes list does not say whether a file was included as `#include "file"` or `#include <file>`, and after talking with @<Teo Magnino Chaban>, determining that is not trivial. According to the [C standard](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=182), section 6.10.2, the only real difference is where the compiler initially looks for the file. For `#include "file"`, the compiler first looks in an implementation-defined location (usually the directory of the including file) and then, if it can't find the file, reprocesses that line as if it were `#include <file>`. Because of this, I have decided to emit all includes as `#include "file"`. The only potential problem I see with this is a file in the local directory with the exact same name as one in the system directory. Open to discussion on this point.
Build cpp files separately instead of #including them, as that masks some errors.
Add header files as well as cpp files to the cmake, for IDE purposes.
Adds a transform for "Regex Vectorization", creating a matrix of matches of a list of regular expressions on a list/column of strings.
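One possible reading of "a matrix of matches" is sketched below: one row per input string, one column per pattern, with `true` where the pattern matches anywhere in the string. The function name `RegexVectorize`, the boolean cell type, and the use of `std::regex_search` (partial match, default ECMAScript grammar) are all assumptions; the actual transform may count matches or require full matches instead.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: rows are inputs, columns are patterns.
inline std::vector<std::vector<bool>> RegexVectorize(
    std::vector<std::string> const &patterns,
    std::vector<std::string> const &inputs) {
    // Compile each pattern once; an invalid pattern throws std::regex_error.
    std::vector<std::regex> compiled;
    compiled.reserve(patterns.size());
    for (auto const &pattern : patterns)
        compiled.emplace_back(pattern);

    std::vector<std::vector<bool>> matrix;
    matrix.reserve(inputs.size());
    for (auto const &input : inputs) {
        std::vector<bool> row;
        row.reserve(compiled.size());
        for (auto const &re : compiled)
            row.push_back(std::regex_search(input, re)); // match anywhere in input
        matrix.push_back(std::move(row));
    }
    assert(matrix.size() == inputs.size()); // exactly one row per input string
    return matrix;
}
```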
There are a lot of unknowns on what this should *really* be doing and what the interface should be, but for now it is emulating the functionality of a python implementation.
Added required functionalities to make date_time work, but it's not complete since there are a few functionalities that need multiple file access, which is not currently implemented.
Adds a DateTime structure and a conversion to DateTime from std::chrono::system_clock::time_point.
Conversion available as a function and as conversion constructor and assignment operator in the DateTime struct.
Though the input is time_point, we have to convert and use the old C lib methods, because there is still nothing else.
C++2x is set to finally have useful time_point functions ... I suspect we will wind up wanting to take time_t directly from callers as well, since I would imagine that's a common format for them to have and it would be silly to convert from time_t for the function to just convert right back.
I have made some assumptions about input and expected output that may need updating - for most values I had to guess whether they should be 0-based or 1-based. All are documented in the struct declaration, and are trivial to change where needed.
Bigger and more important, I also assumed that we will not be operating on dates earlier than 1970. Earlier dates require a completely different implementation on Windows (msvcrt doesn't support negative time_t, so earlier dates require a win32 solution).
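For illustration, a conversion along these lines might look like the sketch below. It is not the struct from this change: `DateTimeParts`, its field names, and the 0-based/1-based choices in the comments are assumptions, and all fields are computed in UTC. It goes through `time_t` and the C library as described above, and rejects dates before 1970 outright.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>
#include <stdexcept>

// Hypothetical struct; the real DateTime fields and their bases may differ.
struct DateTimeParts {
    std::int32_t year;      // full year, e.g. 1970
    std::uint8_t month;     // 1-12
    std::uint8_t day;       // 1-31
    std::uint8_t hour;      // 0-23
    std::uint8_t minute;    // 0-59
    std::uint8_t second;    // 0-60 (the C library allows a leap second)
    std::uint8_t dayOfWeek; // 0-6, Sunday = 0
};

inline DateTimeParts ToDateTimeParts(std::chrono::system_clock::time_point tp) {
    std::time_t const t = std::chrono::system_clock::to_time_t(tp);
    if (t < 0)
        throw std::invalid_argument("dates before 1970 are not supported");

    // Use the reentrant C library variants to break time_t into UTC fields.
    std::tm tm{};
#if defined(_WIN32)
    if (gmtime_s(&tm, &t) != 0)
        throw std::runtime_error("gmtime_s failed");
#else
    if (gmtime_r(&t, &tm) == nullptr)
        throw std::runtime_error("gmtime_r failed");
#endif
    assert(tm.tm_year >= 70); // follows from t >= 0; tm_year counts from 1900

    DateTimeParts parts{};
    parts.year = static_cast<std::int32_t>(tm.tm_year) + 1900;
    parts.month = static_cast<std::uint8_t>(tm.tm_mon + 1); // tm_mon is 0-based
    parts.day = static_cast<std::uint8_t>(tm.tm_mday);
    parts.hour = static_cast<std::uint8_t>(tm.tm_hour);
    parts.minute = static_cast<std::uint8_t>(tm.tm_min);
    parts.second = static_cast<std::uint8_t>(tm.tm_sec);
    parts.dayOfWeek = static_cast<std::uint8_t>(tm.tm_wday);
    return parts;
}
```

A caller that already holds a `time_t` can wrap it with `std::chrono::system_clock::from_time_t` before calling this, at the cost of the round trip mentioned above.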
Related work items: #3538