619 строки
30 KiB
Markdown
619 строки
30 KiB
Markdown
# Release Notes for PROSE SDK versions 6.x
|
||
|
||
## [Release 6.20.1](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.20.1) -- 2019/06/24
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Extraction.Web:
|
||
- Remove dependence on href attributes in eWeb learning as this can cause overfitting.
|
||
- Transformation.Tree:
|
||
- Remove on the fly synthesis.
|
||
- Transformation.Text:
|
||
- Improved readability of translation of date time parsing programs.
|
||
|
||
## [Release 6.20.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.20.0) -- 2019/05/29
|
||
|
||
### New Features
|
||
|
||
- Data.Diff is a new package for computing differences in data distributions -- currently supports the 1D numerical case
|
||
and the simplest categorical case.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Compound.Split
|
||
- Added performance optimizations to learning.
|
||
- Detection
|
||
- Now depends on the ExcelDataReader NuGet package.
|
||
- Extraction.Web
|
||
- In predictive mode, include columns that are supersets of other columns when both granularities could make sense.
|
||
- When learning a new selector, prefer previously learned column selectors if they satisfy the global alignment even
|
||
if the newly learned selectors also satisfy the examples for those columns.
|
||
- Split.Text
|
||
- Added performance optimizations to learning.
|
||
|
||
## [Release 6.18.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.18.0) -- 2019/04/11
|
||
|
||
### New Features
|
||
|
||
- Extraction.Web
|
||
- Inner text excludes script/style tag contents.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Compound.Split
|
||
- Improved learning of quoted strings.
|
||
|
||
## [Release 6.17.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.17.0) -- 2019/03/15
|
||
|
||
### New Features
|
||
|
||
- Detection.DataType
|
||
- Implemented entity validators for e-mail addresses, IPv4 and IPv6 addresses, geographic coordinates (Lat/Lng), US
|
||
zip codes, and US phone numbers.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Compound.Split
|
||
- Allow learning even when the last record is trimmed by ignoring the last record during learn.
|
||
- Extraction.Web
|
||
- Added performance optimizations to learning.
|
||
- Matching.Text
|
||
- Fix pattern order in generated programs.
|
||
- Transformation.Text
|
||
- Improved readability of Python translation, including regular expressions and number format simplifications.
|
||
|
||
## [Release 6.14.6](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.14.6) -- 2019/02/15
|
||
|
||
### New Features
|
||
|
||
- Transformation.Tree
|
||
- Supports learning transformations “on the fly” by watching user edits.
|
||
- Supports negative examples.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Along with .NET 4.5 and .NET core binaries, the nuget packages now include binaries for .NET 4.6.
|
||
- Extraction.Web
|
||
- Multicolumn task learning now falls back to predictive programs and ensures consistency with previous program
|
||
constraints.
|
||
- Improved ranking of predictive tables by taking into account image nodes.
|
||
- Exclude tables from result that are duplicate or sub-tables of other tables.
|
||
- Ordering of table column and exclude duplication columns.
|
||
- Improved title detection.
|
||
- LearnTopK now supports an optional timeout.
|
||
- Matching.Text
|
||
- Miscellaneous bug fixes.
|
||
- Transformation.Text
|
||
- Python readable translation now supports programs involving datetime parsing with formats that do not map directly
|
||
to posix.
|
||
- Python readable translation better handles regex matching when tokens do not overlap and other simplification
|
||
rules.
|
||
- Python readable translation is less aggressive about inlining expressions making for much more readable code.
|
||
- Improved performance of the interpreter.
|
||
|
||
## [Release 6.11.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.11.0) -- 2019/01/16
|
||
|
||
### New Features
|
||
|
||
- Extraction.Web
|
||
- Added HTML table title recognition.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Extraction.Web
|
||
- Support providing predictively learned programs as a constraint to be passed into by-example learn calls to
|
||
improve performance and prevent needing to re-learn predicatively learned programs.
|
||
- Transformation.Text
|
||
- Improved Python translator to support ToSimpleTitleCase, EndsWith conditions, further Datetime programs, regex
|
||
pair matches and number transformations.
|
||
|
||
## [Release 6.10.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.10.0) -- 2018/12/17
|
||
|
||
### New Features
|
||
|
||
- Compound.Split
|
||
- Added support to read CSV/FW files with "junk rows"; that is, repeated headers and non-data, non-comment rows,
|
||
anywhere in the file. This is not enabled by default, but rather by adding the newly introduced
|
||
`AllowRowFiltering` constraint.
|
||
- Detection.DataType
|
||
- Add alternate, more efficient entry point for completing the task of performing datatype detection and conversion
|
||
on a specific data set rather than generating code for doing so.
|
||
- Make the Python code generator available in the C# SDK.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Compound.Split
|
||
- Improve handling of comments and empty records.
|
||
- Added support to learn a row-selector, which can now be potentially different from the first column selector.
|
||
- Extraction.Web:
|
||
- Learning now integrates top-down (by example) synthesis with bottom-up (predictive learning with zero examples)
|
||
synthesis, which improves the quality of the tables extracted from web pages
|
||
- Transformation.Text
|
||
- Support new Python translation parameter OptimizeFor.StrongReadability that gives more flexibility to the
|
||
translator to produce more readable code by removing error handling.
|
||
- Transformation.Tree:
|
||
- Node construction adds parent field during object creation, instead of after it.
|
||
- Refactored the grammar to merge predicates related to the current node, parent node, and children.
|
||
|
||
|
||
## [Release 6.9.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.9.0) -- 2018/11/16
|
||
|
||
### Breaking changes
|
||
|
||
- Compound.Split
|
||
- Hides the ReadInputLineCount constraint and instead users can specify no. of lines to read while adding inputs.
|
||
|
||
### New Features
|
||
|
||
- Transformation.Text
|
||
- Generate readable python code for Lookup() semantics operator.
|
||
- Extraction.Web
|
||
- Parametrized the SimplePrograms constraint by program input and output type, allowing it to be used for any kind
|
||
of program (field extraction, sequence extraction, etc).
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Detection.DataType
|
||
- Better handling of integer columns that look like categorical data, and handling of “NaN” values in string
|
||
columns.
|
||
- Various bugfixes, including improving datatype detection, including non-separated dates (“20120513”) and numbers
|
||
in parenthesis (“(45,345)”), and other improvements to parsing ambiguous date formats.
|
||
- Enhancements to generated Python code for datatype detection.
|
||
- Core Framework
|
||
- Improvement to how variable substitution is done for external grammars (relevant for DSL authors).
|
||
- Compound.Split
|
||
- Better handling of CRLF escape characters.
|
||
- Extraction.JSON
|
||
- Bug fixes for Python code generation to prevent unnecessary column aliases: col("color") instead of
|
||
col("color").alias("color").
|
||
- Transformation.Text
|
||
- Bug fixes to entity detection.
|
||
- Readability enhancements to Python translation.
|
||
- Matching.Text
|
||
- Generate better regex to match newlines.
|
||
- Minor bug fixes.
|
||
- Transformation.Tree
|
||
- Fix a bug where learning on large input could lead to StackOverflow.
|
||
|
||
|
||
## [Release 6.8.1](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.8.1) -- 2018/10/18
|
||
|
||
### Breaking Changes
|
||
|
||
- Compound.Split
|
||
- `ReadInputFromReaders(TextReader reader)` and `ReadInputLineCount(int limit)` are deprecated. They are replaced by
|
||
`ReadInput(TextReader reader, int linesToRead=200)`. This release also introduces `ReadInput(string s, int
|
||
linesToRead=200)`.
|
||
|
||
- Extraction.Json
|
||
- Removed the optional `JPath` parameter from the constraint `JoinSingleTopArray` – the new constraint `JoinArray`
|
||
should be used instead.
|
||
|
||
### New Features
|
||
|
||
- Common Framework
|
||
- New learning/running exception hierarchy to raise friendlier exceptions.
|
||
|
||
- Detection.DataType
|
||
- Support added to specify the list of types to detect.
|
||
|
||
- Extraction.Json
|
||
- Added new interactive constraints `Delete`, `SplitArray`, and `JoinArray`.
|
||
- Support added to detect JSON that have comments and unquoted property names so code generation can fail in
|
||
unsupported targets.
|
||
|
||
- Transformation.Text
|
||
- Introduce readable translation for number formatting and lookup programs.
|
||
|
||
- Transformation.Tree:
|
||
- `Node.Value` of type `string` has been replaced by `Node.Attributes` of type `Dictionary<string, string>` to
|
||
permit learning from distinct attributes.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Detection.DataType
|
||
- Various bug fixes and enhancement.
|
||
|
||
- Extraction.Json
|
||
- Improved flattening of embedded arrays in new-line delimited JSON (NDJSON) files.
|
||
|
||
- Split.Text
|
||
- Various performance improvements to program learning and execution.
|
||
|
||
- Transformation.Tree:
|
||
- Various bug fixes.
|
||
|
||
|
||
## [Release 6.7.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.7.0) -- 2018/09/20
|
||
|
||
### New Features
|
||
|
||
- Common framework
|
||
- Configurable equality for `FeatureCalculationContext`; Classes implementing `IFeature` can now force deep/shallow
|
||
equality computation over inputs leading to potential performance improvements during learning by setting
|
||
`IFeature.UseComputedInputsForFccEquality` (`false` is the old behavior).
|
||
|
||
- Compound.Split
|
||
- Support free-form fixed-width schema.
|
||
- Add `SkipLinesCount` constraint to manually specify number of header lines to skip instead of learning it.
|
||
|
||
- Matching.Text
|
||
- Support added for learning with `InSameCluster` and `InDifferentCluster` constraints which take a set of inputs to
|
||
be required to appear in the same or different clusters respectively.
|
||
- Support added for `UseLongConstantOptimization` constraint which aggressively generates constant tokens whenever
|
||
possible. (Note: This makes the learning incompelete.)
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Common framework
|
||
- Optimizations and style improvements in quality of generated Python code.
|
||
|
||
- Compound.Split
|
||
- Bug fixes in fixed-width file ingestion.
|
||
- Correctly handle fixed-width files where all lines start with whitespace.
|
||
- Fixes to Python translation.
|
||
|
||
- Extraction.Text
|
||
- Improved learning performance in the presence of negative constraints. This fixes the performance regression
|
||
introduced in 6.6.0.
|
||
|
||
- Extraction.Json
|
||
- Fix handling of JSON with a single array of values in pandas Python translation.
|
||
|
||
- Extraction.Web
|
||
- Improvements in predictive table detection to remove certain kinds of redundant columns.
|
||
- Improved row selector inference based on ancestor nodes rather than just the boundary nodes.
|
||
|
||
|
||
## [Release 6.6.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.6.0) -- 2018/08/20
|
||
|
||
### Breaking Changes
|
||
|
||
- Compound.Split
|
||
- The property `FieldStartPositions` has been removed and a new `IReadOnlyList<Record<int, int?>> FieldPositions`
|
||
property has been added in its place.
|
||
|
||
### New Features
|
||
|
||
- Common Framework
|
||
- Support for serialization and deserialization of spec objects (used to track subtasks during synthesis).
|
||
|
||
- Compound.Split
|
||
- Support for fixed width files with gaps between fields.
|
||
- Improvement in column name learning for some cases (e.g. single column and single row data sets).
|
||
- Support for generating PySpark code from learned program.
|
||
|
||
- Detection
|
||
- Support for rich datatype detection on CoreClr (in addition to .net framework where it was previously supported).
|
||
- Includes example values for each detected numeric and date type in the data.
|
||
|
||
- Extraction.Json
|
||
- Supports a new constraint which may be used to disable handling of invalid JSON. This constraint should be
|
||
specified if clients translate learned programs to Java and use that java with a Jackson library version less than
|
||
v2.8 since otherwise the generated Java code contains Jackson v2.8+ features. (NOTE: This version of Jackson is
|
||
particularly significant since Spark currently only allows Jackson v2.8.)
|
||
- Support for generating PySpark code from learned program.
|
||
|
||
- Extraction.Web
|
||
- Updated AngleSharp dependency to its latest release to address security concerns in that library.
|
||
- Support for explicit HTML tables when run in predictive mode.
|
||
|
||
- Matching.Text
|
||
- Includes examples of each pattern in comments next to the pattern in the generated Python code.
|
||
|
||
- Transformation.Text
|
||
- Support for serialization and deserialization for all types referenced in the grammar.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Common Framework
|
||
- Bug fixes in ProgramSet.TopK(k) method to return no less than k programs. Temporarily, this slightly degrades
|
||
Extraction.Text learn performance, which will be fixed soon.
|
||
|
||
- Extraction.Json
|
||
- More consistent handling of invalid/truncated Json files.
|
||
|
||
- Matching.Text
|
||
- Improvements in PySpark translation
|
||
|
||
- Transformation.Text
|
||
- Improvements in readability of generated Python code
|
||
|
||
- Transformation.Tree:
|
||
- Performance improvement in clustering of transformations by up to 10x depending on scenario.
|
||
- Performance improvement using incremental learning such that previously learned programs are updated with newer
|
||
examples provided on subsequent Learn() calls.
|
||
|
||
## [Release 6.5.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.5.0) -- 2018/07/23
|
||
|
||
### New Features
|
||
|
||
- Extraction.Web:
|
||
- Now supports predictive table extraction from webpages (that is, from 0 examples).
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Common framework:
|
||
- Added support for a common extension hook for custom entity detectors. This hook is currently used in
|
||
Extraction.Web and Transformation.Text but could theoretically be supported in any DSL.
|
||
- Compound.Split:
|
||
- Now supports fixed-width schema files which are themselves delimited.
|
||
- Detection:
|
||
- Fixed a bug where running detection on a set of empty strings produced a NullReferenceException.
|
||
- Fixed a bug where numbers of the form \d+\.\d+ were not recognized correctly.
|
||
- Extraction.Json:
|
||
- Added a parameter to enable/disable support for trailing commas in java translation. The library we use for json
|
||
parsing in java (Jackson) has support for trailing commas in newer versions, but some environments only allow the
|
||
older version of Jackson which does not have this support.
|
||
- Now has a new constraint to automatically flatten a json document. When this constraint is used, it’s possible to
|
||
translate the learned program to python which builds on the pandas library.
|
||
- Extraction.Web:
|
||
- Improved nth-child selection ranking in predictive extraction. Nth-child selection is important for detecting
|
||
table columns in some cases, but in others it can introduce a lot of noise. The system now uses nth-child
|
||
selection only as a fall-back if prominent tables cannot be detected otherwise.
|
||
- Text normalization (which currently ensures that text to node matching is whitespace insensitive) has been
|
||
extended to insensitivity of other special characters such as variations in quotes, dashes, ellipses, etc.
|
||
- Transformation.Text:
|
||
- Readability of python translations is improved in some cases by removing unnecessary wrapping functions/classes.
|
||
- Python translator now creates much more readable translations for programs including date time rounding.
|
||
|
||
## [Release 6.4.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.4.0) -- 2018/06/26
|
||
|
||
### Breaking Changes
|
||
|
||
- Common Framework
|
||
- When running on CoreClr, we now require runtime version 2.0.3 or newer. Previously we had workarounds for a
|
||
CoreClr bug that was fixed in 2.0.3, but those are now removed.
|
||
- Compound.Split
|
||
- Learning fixed width programs from a schema file is now specified by a constraint which contains a schema file
|
||
parameter.
|
||
- Extraction.Json
|
||
- New default behavior not to automatically join inner arrays. The old behavior may be requested by supplying the
|
||
new constraint ‘JoinInnerArrays’. (Previous constraint ‘NoJoinInnerArrays’ has been removed.)
|
||
- Inputs can now be used as an implicit flatten constraint, and the old explicit FlattenDocument constraint is
|
||
removed.
|
||
- Flattening top-level arrays is only applied when there is only one. If multiple top-level arrays are present,
|
||
they are all preserved.
|
||
- Constraints are renamed to make their meaning clearer:
|
||
- JoinInnerArrays -> JoinAllArrays
|
||
- NormalizeArrays -> JoinSingleTopArray
|
||
- SplitTopArrayToColumns -> SplitTopArrays
|
||
|
||
### New Features
|
||
|
||
- Common Framework
|
||
- VSA data structures now support Serialization and Deserialization
|
||
- Compound.Split
|
||
- Now handles small fixed-width files.
|
||
- Now supports selecting a subset of columns for the output.
|
||
- Detection:
|
||
- Data type detection now supports detecting bit types.
|
||
- Data type detection now takes a CultureInfo parameter and supports identifying a type from multiple string
|
||
values—not just one.
|
||
- Improved data type detection algorithms have been implemented for numbers, dates, Booleans and categorical values.
|
||
- Extraction.Json
|
||
- Now supports trailing commas.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Compound.Split
|
||
- Improved learning of header/skip.
|
||
- It is now possible to construct a program from a ProgramProperties structure (which can be extracted from a
|
||
learned program).
|
||
- It is also now possible to override some learned properties. This supports the scenario where a program is
|
||
learned from an input file, metadata is presented to the user, and then the user can override some of that
|
||
metadata before using a final program based on the modified properties.
|
||
- Extraction.Json
|
||
- New NormalizeArray constraint which directs the system to flatten only one array—either with a specified path or
|
||
the first, top array if no path is specified.
|
||
- Extraction.Web
|
||
- Now supports specialized selectors for HTML tables to ensure they are always handled when possible.
|
||
- Dashes are no longer escaped in CSS selectors making them more readable.
|
||
- Now supports boundary-based semantics for row-selectors in web table programs.
|
||
- Increased maximum allowed offset for satisfying examples to 5.
|
||
- Row selectors may now satisfy examples up to the maximum permitted offset if an exact match cannot be found.
|
||
- Previous program constraint is now used to ensure partial success—that is we always satisfy any previously
|
||
satisfied columns for which the examples have not changed (in preference to success on any new columns).
|
||
- When new examples are given in a session where a learn was previously performed, if the new program no longer
|
||
satisfies previously satisfied columns, we now consider those columns satisfied if the previous column examples
|
||
shift by less than some number of rows specified as a threshold in the public API. If they aren’t satisfied at
|
||
all or the shifting is greater than that number of rows, then the system falls back to the previously learned
|
||
program to ensure that the previously satisfied columns are still satisfied.
|
||
- An issue was fixed so that we now ensure correct alignment when the row selector has been previously computed.
|
||
- Matching.Text
|
||
- Fixed a bug in the python translation which didn’t properly quote regexes.
|
||
- Split.Text
|
||
- Improved detection of time expressions.
|
||
- Transformation.Text
|
||
- Python translation no longer exposes the Substring type unless optimizing for performance in order to simplify the
|
||
generated code.
|
||
- Python translation now special cases a common idiom in the translation.text DSL related to regex matching and
|
||
turns it into simpler code.
|
||
- Improved readability of python translation for some datetime programs.
|
||
- Generated programs that use regular expressions now contain simpler/easier to read expressions in a number of
|
||
cases.
|
||
- Transformation.Tree
|
||
- Performance improved.
|
||
- Fixed an issue in the Order By conversion.
|
||
- Fixed issues with the generation of join expressions.
|
||
- Table name analysis is now more robust with respect to qualified names, cases and quoted names.
|
||
- Program now exposes separate methods for finding the nodes that should be changed and for performing the
|
||
transformation instead of just doing the whole operation in a single method call.
|
||
|
||
## [Release 6.3.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.3.0) -- 2018/05/15
|
||
|
||
### Breaking Changes
|
||
|
||
- Matching.Text
|
||
- ForbidConstantTokens constraint has been removed.
|
||
- Transformation.Tree
|
||
- Utils.Parse has been renamed to Utils.ParseStatementIgnoreWhiteSpace
|
||
|
||
### New Features
|
||
|
||
- Compound.Split
|
||
- Key/Value extraction now supports fixed width extraction such as where the keys and values are in a table where
|
||
the values all start at the same column position.
|
||
- Now supports skipping empty, comment and footer records.
|
||
- Transformation.Text
|
||
- Now supports pluggable Entity Detectors specified in a constraint.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Common Framework
|
||
- PROSE now depends on a slightly older version of Newtonsoft.Json (v9.0.1) for .Net Framework consumers. For
|
||
.Net Core PROSE still requires Newtonsoft.Json v10.0.3.
|
||
- Code produced by the Java translation now makes fewer allocations in some scenarios.
|
||
- Our Python translation now produces more readable string literals.
|
||
- Compound.Split
|
||
- The Python produced for cases which pandas cannot handle is improved with function names and parameters that
|
||
align with pandas, documentation on public parts and simpler generated code.
|
||
- Extraction.Web
|
||
- When learning a new program, soft constraints are now used for any columns where the user has not changed the
|
||
examples instead of only if the user has added new columns but not changed examples for any previous columns.
|
||
- Significant improvement in learn times.
|
||
- Text comparisons are now insensitive to whitespace characters.
|
||
- Transformation.Text
|
||
- Python translation is now more readable in many cases.
|
||
- Transformation.Tree
|
||
- DSL rewritten to support additional scenarios.
|
||
- The Program class now has a field to expose a list of learned transformation rule programs. Each transformation
|
||
rule exposes an Edit program. In this way, users can pick which of the learned edits they want to apply and
|
||
apply them to specific nodes bypassing the filter checking.
|
||
- Performance improvements.
|
||
|
||
## [Release 6.2.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.2.0) -- 2018/04/25
|
||
|
||
### Breaking Changes
|
||
|
||
- Matching.Text
|
||
- The `Patterns` property has been removed. Patterns should be retrieved by calling the `LearnPatterns()`
|
||
method on the session object instead.
|
||
|
||
### New Features
|
||
|
||
- Changes that only affect those building their own DSLs
|
||
- The [Samples](https://github.com/microsoft/prose) contain a new DSL Authoring tutorial.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Compound.Split
|
||
- The system now attempts to learn an ingestion program that obeys standard CSV quoting first
|
||
and then falls back to the more flexible strategy only if the standard doesn’t work.
|
||
- Extraction.Web
|
||
- Fixed bugs with case insensitive comparisons and normalization.
|
||
- Matching.Text
|
||
- Fixed a bug in the handling of cancellation tokens. Now we check for cancellation much more frequently.
|
||
- Fixed handling of null strings.
|
||
- Improved readability and performance of python translations.
|
||
- Transformation.Text
|
||
- Significant inputs has improved performance and accuracy.
|
||
- Common Framework
|
||
- The interface for implementing a subprogram translator has been enhanced.
|
||
|
||
## [Release 6.1.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.1.0) -- 2018/04/16
|
||
|
||
### New Features
|
||
|
||
- Compound.Split
|
||
- Can now extract the schema of a fixed width file from a free-form text file description of the schema and use that
|
||
to learn an extraction program.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- Compound.Split
|
||
- Fixed width inference (for cases where the schema file is not available) is improved to prevent splitting inside
|
||
standard data types.
|
||
- Python translation now produces code that calls Pandas for supported CSV and fixed-width files.
|
||
|
||
- Extraction.Web
|
||
- The NormalizeHtml method is now significantly simpler and more robust.
|
||
- Now supports table constraints in which some cell examples are soft constraints (that is, they need not
|
||
necessarily be satisfied but they help to converge to correct programs). This can be useful in interactive
|
||
settings where if the user provides new examples after a previous learning round, then any output of the previous
|
||
learning round which the user has not changed can be treated as soft constraints.
|
||
|
||
- Transformation.Text
|
||
- Learning performance improvements.
|
||
- Improved readability of learned programs through constant folding.
|
||
- Fixes to conditional program learning such that the correct program is learned and returned in more cases instead
|
||
of the system indicating that it could not learn a program. Also conditional patterns are better clustered
|
||
together.
|
||
|
||
- Changes that only affect those building their own DSLs
|
||
- The PROSE grammar specification language has been changed to only allow binding single variables in let
|
||
expressions. None of our existing grammars used a let with multiple variables, and we decided to simplify the
|
||
grammar handling logic by enforcing this as a constraint.
|
||
- AST “holes” now have Ids and are equal if they have the same symbol and id.
|
||
- There is a new subprogram translator extension point interface which enables pattern based code generators to be
|
||
plugged into the overall translation system.
|
||
|
||
## [Release 6.0.0](https://www.nuget.org/packages/Microsoft.ProgramSynthesis/6.0.0) -- 2018/03/19
|
||
|
||
### Breaking Changes
|
||
|
||
- Our session base class no longer implements `IDisposable`, and its serialization format has changed because the
|
||
`AllowBackgroundComputations` member has been deleted.
|
||
|
||
- Session constraints and inputs are now maintained in smart collections rather than having custom Add/Remove methods.
|
||
- For most consuming code the fix is just to replace calls to `session.AddConstraints` with
|
||
`session.Constraints.Add` and `session.AddInputs` with `session.Inputs.Add`
|
||
- The Python and Java translators now all take an OptimizeFor parameter which does NOT have a default value. Callers
|
||
should specify a value of OptimizeFor.Performance or OptimizeFor.Readability (where readability means that the program
|
||
synthesized should be as easy as possible for humans to read and understand even if that means some reduction in the
|
||
speed of execution for that program).
|
||
|
||
- Matching.Text no longer allows specifying negative examples or positive examples that are marked as hard constraints.
|
||
Since the current implementation treats all examples as soft constraints, the API now makes that explicit. Similarly,
|
||
the `DisjunctionLimit` constraint is also always a soft constraint.
|
||
|
||
- Transformation.Text's `GetSignificantInputClustersAsync` method has been removed and replaced with
|
||
`GetSignificantInputsAsync`.
|
||
|
||
### New Features
|
||
|
||
- Compound.Split
|
||
- Now supports Multi-record splitting which enables splitting of files with key-value pairs and pivoting the results
|
||
into a table with the keys as columns.
|
||
|
||
### Bug Fixes / Enhancements
|
||
|
||
- The `LearnTopK` method on `Session` has a new optional parameter for requesting not only the top K programs but also a
|
||
random sample of additional programs learned.
|
||
|
||
- The dependency on the native Z3 library has been removed.
|
||
|
||
- Improvements to translation of programs.
|
||
- Avoid lifting constants if translator is not opted for Performance.
|
||
- Modify Python translator to not create classes for simple programs.
|
||
- Perform Constant propagation, Copy propagation and Common subexpression optimization on translated programs
|
||
(Java and Python).
|
||
- Other readability improvements like aliasing in Python and shortening Python operator names.
|
||
- Compound.Split now supports Python translations for simple delimiter and fixed width cases. In an upcoming
|
||
release all Compound.Split programs will support Python translation.
|
||
- Specific to Transformation.Text
|
||
- By default the generated program has a more natural function signature with an argument for each input column
|
||
that is named appropriately rather than a taking a single dictionary of the inputs.
|
||
- Programs produced use the more idiomatic + for concat and [:] for slice in appropriate places.
|
||
- Dead-code elimination.
|
||
- Function calls are not inlined but variables and literals are.
|
||
- An extra variable is no longer created for the return value of a function.
|
||
- Extraction.Web
|
||
- No longer escapes certain occurrences of hyphen in CSS selectors (when it occurs between two letters, a very
|
||
common case) making them more readable.
|
||
- Now supports entity-based extraction programs that include operators to extract data with respect to surrounding
|
||
entities in the DOM context (e.g. extracting dates from bill receipts across different formats and providers).
|
||
- A new previous program constraint which enables incremental table learning where a previously synthesized program
|
||
can be supplied when synthesizing a new program which adds additional row/column selection information. This can
|
||
bring significant perf benefits for synthesizing programs to extract large tables.
|
||
- Case insensitive text matching.
|
||
- Possibly null values are prevented in single column extractions.
|
||
- Improved predicate learning to most specific and smallest selectors.
|
||
- Split.Text (and by extension Compound.Split)
|
||
- Improvements to fixed width file detection including:
|
||
- Better number data type recognition to include numbers preceded by “+/-“.
|
||
- Reduced false positive left aligned column detection.
|
||
- Transformation.Text
|
||
- Fixed a bug in datetime rounding learning where sometimes too many dates were rounded.
|
||
|