30 KiB
Release Notes for PROSE SDK versions 6.x
Release 6.20.1 -- 2019/06/24
Bug Fixes / Enhancements
- Extraction.Web:
- Remove dependence on href attributes in eWeb learning as this can cause overfitting.
- Transformation.Tree:
- Remove on the fly synthesis.
- Transformation.Text:
- Improved readability of translation of date time parsing programs.
Release 6.20.0 -- 2019/05/29
New Features
- Data.Diff is a new package for computing differences in data distributions -- currently supports the 1D numerical case and the simplest categorical case.
Bug Fixes / Enhancements
- Compound.Split
- Added performance optimizations to learning.
- Detection
- Now depends on the ExcelDataReader NuGet package.
- Extraction.Web
- In predictive mode, include columns that are supersets of other columns when both granularities could make sense.
- When learning a new selector, prefer previously learned column selectors if they satisfy the global alignment even if the newly learned selectors also satisfy the examples for those columns.
- Split.Text
- Added performance optimizations to learning.
Release 6.18.0 -- 2019/04/11
New Features
- Extraction.Web
- Inner text excludes script/style tag contents.
Bug Fixes / Enhancements
- Compound.Split
- Improved learning of quoted strings.
Release 6.17.0 -- 2019/03/15
New Features
- Detection.DataType
- Implemented entity validators for e-mail addresses, IPv4 and IPv6 addresses, geographic coordinates (Lat/Lng), US zip codes, and US phone numbers.
Bug Fixes / Enhancements
- Compound.Split
- Allow learning even when the last record is trimmed by ignoring the last record during learn.
- Extraction.Web
- Added performance optimizations to learning.
- Matching.Text
- Fix pattern order in generated programs.
- Transformation.Text
- Improved readability of Python translation, including regular expressions and number format simplifications.
Release 6.14.6 -- 2019/02/15
New Features
- Transformation.Tree
- Supports learning transformations “on the fly” by watching user edits.
- Supports negative examples.
Bug Fixes / Enhancements
- Along with .NET 4.5 and .NET core binaries, the nuget packages now include binaries for .NET 4.6.
- Extraction.Web
- Multicolumn task learning now falls back to predictive programs and ensures consistency with previous program constraints.
- Improved ranking of predictive tables by taking into account image nodes.
- Exclude tables from result that are duplicate or sub-tables of other tables.
- Ordering of table column and exclude duplication columns.
- Improved title detection.
- LearnTopK now supports an optional timeout.
- Matching.Text
- Miscellaneous bug fixes.
- Transformation.Text
- Python readable translation now supports programs involving datetime parsing with formats that do not map directly to posix.
- Python readable translation better handles regex matching when tokens do not overlap and other simplification rules.
- Python readable translation is less aggressive about inlining expressions making for much more readable code.
- Improved performance of the interpreter.
Release 6.11.0 -- 2019/01/16
New Features
- Extraction.Web
- Added HTML table title recognition.
Bug Fixes / Enhancements
- Extraction.Web
- Support providing predictively learned programs as a constraint to be passed into by-example learn calls to improve performance and prevent needing to re-learn predicatively learned programs.
- Transformation.Text
- Improved Python translator to support ToSimpleTitleCase, EndsWith conditions, further Datetime programs, regex pair matches and number transformations.
Release 6.10.0 -- 2018/12/17
New Features
- Compound.Split
- Added support to read CSV/FW files with "junk rows"; that is, repeated headers and non-data, non-comment rows,
anywhere in the file. This is not enabled by default, but rather by adding the newly introduced
AllowRowFiltering
constraint.
- Added support to read CSV/FW files with "junk rows"; that is, repeated headers and non-data, non-comment rows,
anywhere in the file. This is not enabled by default, but rather by adding the newly introduced
- Detection.DataType
- Add alternate, more efficient entry point for completing the task of performing datatype detection and conversion on a specific data set rather than generating code for doing so.
- Make the Python code generator available in the C# SDK.
Bug Fixes / Enhancements
- Compound.Split
- Improve handling of comments and empty records.
- Added support to learn a row-selector, which can now be potentially different from the first column selector.
- Extraction.Web:
- Learning now integrates top-down (by example) synthesis with bottom-up (predictive learning with zero examples) synthesis, which improves the quality of the tables extracted from web pages
- Transformation.Text
- Support new Python translation parameter OptimizeFor.StrongReadability that gives more flexibility to the translator to produce more readable code by removing error handling.
- Transformation.Tree:
- Node construction adds parent field during object creation, instead of after it.
- Refactored the grammar to merge predicates related to the current node, parent node, and children.
Release 6.9.0 -- 2018/11/16
Breaking changes
- Compound.Split
- Hides the ReadInputLineCount constraint and instead users can specify no. of lines to read while adding inputs.
New Features
- Transformation.Text
- Generate readable python code for Lookup() semantics operator.
- Extraction.Web
- Parametrized the SimplePrograms constraint by program input and output type, allowing it to be used for any kind of program (field extraction, sequence extraction, etc).
Bug Fixes / Enhancements
- Detection.DataType
- Better handling of integer columns that look like categorical data, and handling of “NaN” values in string columns.
- Various bugfixes, including improving datatype detection, including non-separated dates (“20120513”) and numbers in parenthesis (“(45,345)”), and other improvements to parsing ambiguous date formats.
- Enhancements to generated Python code for datatype detection.
- Core Framework
- Improvement to how variable substitution is done for external grammars (relevant for DSL authors).
- Compound.Split
- Better handling of CRLF escape characters.
- Extraction.JSON
- Bug fixes for Python code generation to prevent unnecessary column aliases: col("color") instead of col("color").alias("color").
- Transformation.Text
- Bug fixes to entity detection.
- Readability enhancements to Python translation.
- Matching.Text
- Generate better regex to match newlines.
- Minor bug fixes.
- Transformation.Tree
- Fix a bug where learning on large input could lead to StackOverflow.
Release 6.8.1 -- 2018/10/18
Breaking Changes
-
Compound.Split
ReadInputFromReaders(TextReader reader)
andReadInputLineCount(int limit)
are deprecated. They are replaced byReadInput(TextReader reader, int linesToRead=200)
. This release also introducesReadInput(string s, int linesToRead=200)
.
-
Extraction.Json
- Removed the optional
JPath
parameter from the constraintJoinSingleTopArray
– the new constraintJoinArray
should be used instead.
- Removed the optional
New Features
-
Common Framework
- New learning/running exception hierarchy to raise friendlier exceptions.
-
Detection.DataType
- Support added to specify the list of types to detect.
-
Extraction.Json
- Added new interactive constraints
Delete
,SplitArray
, andJoinArray
. - Support added to detect JSON that have comments and unquoted property names so code generation can fail in unsupported targets.
- Added new interactive constraints
-
Transformation.Text
- Introduce readable translation for number formatting and lookup programs.
-
Transformation.Tree:
Node.Value
of typestring
has been replaced byNode.Attributes
of typeDictionary<string, string>
to permit learning from distinct attributes.
Bug Fixes / Enhancements
-
Detection.DataType
- Various bug fixes and enhancement.
-
Extraction.Json
- Improved flattening of embedded arrays in new-line delimited JSON (NDJSON) files.
-
Split.Text
- Various performance improvements to program learning and execution.
-
Transformation.Tree:
- Various bug fixes.
Release 6.7.0 -- 2018/09/20
New Features
-
Common framework
- Configurable equality for
FeatureCalculationContext
; Classes implementingIFeature
can now force deep/shallow equality computation over inputs leading to potential performance improvements during learning by settingIFeature.UseComputedInputsForFccEquality
(false
is the old behavior).
- Configurable equality for
-
Compound.Split
- Support free-form fixed-width schema.
- Add
SkipLinesCount
constraint to manually specify number of header lines to skip instead of learning it.
-
Matching.Text
- Support added for learning with
InSameCluster
andInDifferentCluster
constraints which take a set of inputs to be required to appear in the same or different clusters respectively. - Support added for
UseLongConstantOptimization
constraint which aggressively generates constant tokens whenever possible. (Note: This makes the learning incompelete.)
- Support added for learning with
Bug Fixes / Enhancements
-
Common framework
- Optimizations and style improvements in quality of generated Python code.
-
Compound.Split
- Bug fixes in fixed-width file ingestion.
- Correctly handle fixed-width files where all lines start with whitespace.
- Fixes to Python translation.
-
Extraction.Text
- Improved learning performance in the presence of negative constraints. This fixes the performance regression introduced in 6.6.0.
-
Extraction.Json
- Fix handling of JSON with a single array of values in pandas Python translation.
-
Extraction.Web
- Improvements in predictive table detection to remove certain kinds of redundant columns.
- Improved row selector inference based on ancestor nodes rather than just the boundary nodes.
Release 6.6.0 -- 2018/08/20
Breaking Changes
- Compound.Split
- The property
FieldStartPositions
has been removed and a newIReadOnlyList<Record<int, int?>> FieldPositions
property has been added in its place.
- The property
New Features
-
Common Framework
- Support for serialization and deserialization of spec objects (used to track subtasks during synthesis).
-
Compound.Split
- Support for fixed width files with gaps between fields.
- Improvement in column name learning for some cases (e.g. single column and single row data sets).
- Support for generating PySpark code from learned program.
-
Detection
- Support for rich datatype detection on CoreClr (in addition to .net framework where it was previously supported).
- Includes example values for each detected numeric and date type in the data.
-
Extraction.Json
- Supports a new constraint which may be used to disable handling of invalid JSON. This constraint should be specified if clients translate learned programs to Java and use that java with a Jackson library version less than v2.8 since otherwise the generated Java code contains Jackson v2.8+ features. (NOTE: This version of Jackson is particularly significant since Spark currently only allows Jackson v2.8.)
- Support for generating PySpark code from learned program.
-
Extraction.Web
- Updated AngleSharp dependency to its latest release to address security concerns in that library.
- Support for explicit HTML tables when run in predictive mode.
-
Matching.Text
- Includes examples of each pattern in comments next to the pattern in the generated Python code.
-
Transformation.Text
- Support for serialization and deserialization for all types referenced in the grammar.
Bug Fixes / Enhancements
-
Common Framework
- Bug fixes in ProgramSet.TopK(k) method to return no less than k programs. Temporarily, this slightly degrades Extraction.Text learn performance, which will be fixed soon.
-
Extraction.Json
- More consistent handling of invalid/truncated Json files.
-
Matching.Text
- Improvements in PySpark translation
-
Transformation.Text
- Improvements in readability of generated Python code
-
Transformation.Tree:
- Performance improvement in clustering of transformations by up to 10x depending on scenario.
- Performance improvement using incremental learning such that previously learned programs are updated with newer examples provided on subsequent Learn() calls.
Release 6.5.0 -- 2018/07/23
New Features
- Extraction.Web:
- Now supports predictive table extraction from webpages (that is, from 0 examples).
Bug Fixes / Enhancements
- Common framework:
- Added support for a common extension hook for custom entity detectors. This hook is currently used in Extraction.Web and Transformation.Text but could theoretically be supported in any DSL.
- Compound.Split:
- Now supports fixed-width schema files which are themselves delimited.
- Detection:
- Fixed a bug where running detection on a set of empty strings produced a NullReferenceException.
- Fixed a bug where numbers of the form \d+.\d+ were not recognized correctly.
- Extraction.Json:
- Added a parameter to enable/disable support for trailing commas in java translation. The library we use for json parsing in java (Jackson) has support for trailing commas in newer versions, but some environments only allow the older version of Jackson which does not have this support.
- Now has a new constraint to automatically flatten a json document. When this constraint is used, it’s possible to translate the learned program to python which builds on the pandas library.
- Extraction.Web:
- Improved nth-child selection ranking in predictive extraction. Nth-child selection is important for detecting table columns in some cases, but in others it can introduce a lot of noise. The system now uses nth-child selection only as a fall-back if prominent tables cannot be detected otherwise.
- Text normalization (which currently ensures that text to node matching is whitespace insensitive) has been extended to insensitivity of other special characters such as variations in quotes, dashes, ellipses, etc.
- Transformation.Text:
- Readability of python translations is improved in some cases by removing unnecessary wrapping functions/classes.
- Python translator now creates much more readable translations for programs including date time rounding.
Release 6.4.0 -- 2018/06/26
Breaking Changes
- Common Framework
- When running on CoreClr, we now require runtime version 2.0.3 or newer. Previously we had workarounds for a CoreClr bug that was fixed in 2.0.3, but those are now removed.
- Compound.Split
- Learning fixed width programs from a schema file is now specified by a constraint which contains a schema file parameter.
- Extraction.Json
- New default behavior not to automatically join inner arrays. The old behavior may be requested by supplying the new constraint ‘JoinInnerArrays’. (Previous constraint ‘NoJoinInnerArrays’ has been removed.)
- Inputs can now be used as an implicit flatten constraint, and the old explicit FlattenDocument constraint is removed.
- Flattening top-level arrays is only applied when there is only one. If multiple top-level arrays are present, they are all preserved.
- Constraints are renamed to make their meaning clearer:
- JoinInnerArrays -> JoinAllArrays
- NormalizeArrays -> JoinSingleTopArray
- SplitTopArrayToColumns -> SplitTopArrays
New Features
- Common Framework
- VSA data structures now support Serialization and Deserialization
- Compound.Split
- Now handles small fixed-width files.
- Now supports selecting a subset of columns for the output.
- Detection:
- Data type detection now supports detecting bit types.
- Data type detection now takes a CultureInfo parameter and supports identifying a type from multiple string values—not just one.
- Improved data type detection algorithms have been implemented for numbers, dates, Booleans and categorical values.
- Extraction.Json
- Now supports trailing commas.
Bug Fixes / Enhancements
- Compound.Split
- Improved learning of header/skip.
- It is now possible to construct a program from a ProgramProperties structure (which can be extracted from a learned program).
- It is also now possible to override some learned properties. This supports the scenario where a program is learned from an input file, metadata is presented to the user, and then the user can override some of that metadata before using a final program based on the modified properties.
- Extraction.Json
- New NormalizeArray constraint which directs the system to flatten only one array—either with a specified path or the first, top array if no path is specified.
- Extraction.Web
- Now supports specialized selectors for HTML tables to ensure they are always handled when possible.
- Dashes are no longer escaped in CSS selectors making them more readable.
- Now supports boundary-based semantics for row-selectors in web table programs.
- Increased maximum allowed offset for satisfying examples to 5.
- Row selectors may now satisfy examples up to the maximum permitted offset if an exact match cannot be found.
- Previous program constraint is now used to ensure partial success—that is we always satisfy any previously satisfied columns for which the examples have not changed (in preference to success on any new columns).
- When new examples are given in a session where a learn was previously performed, if the new program no longer satisfies previously satisfied columns, we now consider those columns satisfied if the previous column examples shift by less than some number of rows specified as a threshold in the public API. If they aren’t satisfied at all or the shifting is greater than that number of rows, then the system falls back to the previously learned program to ensure that the previously satisfied columns are still satisfied.
- An issue was fixed so that we now ensure correct alignment when the row selector has been previously computed.
- Matching.Text
- Fixed a bug in the python translation which didn’t properly quote regexes.
- Split.Text
- Improved detection of time expressions.
- Transformation.Text
- Python translation no longer exposes the Substring type unless optimizing for performance in order to simplify the generated code.
- Python translation now special cases a common idiom in the translation.text DSL related to regex matching and turns it into simpler code.
- Improved readability of python translation for some datetime programs.
- Generated programs that use regular expressions now contain simpler/easier to read expressions in a number of cases.
- Transformation.Tree
- Performance improved.
- Fixed an issue in the Order By conversion.
- Fixed issues with the generation of join expressions.
- Table name analysis is now more robust with respect to qualified names, cases and quoted names.
- Program now exposes separate methods for finding the nodes that should be changed and for performing the transformation instead of just doing the whole operation in a single method call.
Release 6.3.0 -- 2018/05/15
Breaking Changes
- Matching.Text
- ForbidConstantTokens constraint has been removed.
- Transformation.Tree
- Utils.Parse has been renamed to Utils.ParseStatementIgnoreWhiteSpace
New Features
- Compound.Split
- Key/Value extraction now supports fixed width extraction such as where the keys and values are in a table where the values all start at the same column position.
- Now supports skipping empty, comment and footer records.
- Transformation.Text
- Now supports pluggable Entity Detectors specified in a constraint.
Bug Fixes / Enhancements
- Common Framework
- PROSE now depends on a slightly older version of Newtonsoft.Json (v9.0.1) for .Net Framework consumers. For .Net Core PROSE still requires Newtonsoft.Json v10.0.3.
- Code produced by the Java translation now makes fewer allocations in some scenarios.
- Our Python translation now produces more readable string literals.
- Compound.Split
- The Python produced for cases which pandas cannot handle is improved with function names and parameters that align with pandas, documentation on public parts and simpler generated code.
- Extraction.Web
- When learning a new program, soft constraints are now used for any columns where the user has not changed the examples instead of only if the user has added new columns but not changed examples for any previous columns.
- Significant improvement in learn times.
- Text comparisons are now insensitive to whitespace characters.
- Transformation.Text
- Python translation is now more readable in many cases.
- Transformation.Tree
- DSL rewritten to support additional scenarios.
- The Program class now has a field to expose a list of learned transformation rule programs. Each transformation rule exposes an Edit program. In this way, users can pick which of the learned edits they want to apply and apply them to specific nodes bypassing the filter checking.
- Performance improvements.
Release 6.2.0 -- 2018/04/25
Breaking Changes
- Matching.Text
- The
Patterns
property has been removed. Patterns should be retrieved by calling theLearnPatterns()
method on the session object instead.
- The
New Features
- Changes that only affect those building their own DSLs
- The Samples contain a new DSL Authoring tutorial.
Bug Fixes / Enhancements
- Compound.Split
- The system now attempts to learn an ingestion program that obeys standard CSV quoting first and then falls back to the more flexible strategy only if the standard doesn’t work.
- Extraction.Web
- Fixed bugs with case insensitive comparisons and normalization.
- Matching.Text
- Fixed a bug in the handling of cancellation tokens. Now we check for cancellation much more frequently.
- Fixed handling of null strings.
- Improved readability and performance of python translations.
- Transformation.Text
- Significant inputs has improved performance and accuracy.
- Common Framework
- The interface for implementing a subprogram translator has been enhanced.
Release 6.1.0 -- 2018/04/16
New Features
- Compound.Split
- Can now extract the schema of a fixed width file from a free-form text file description of the schema and use that to learn an extraction program.
Bug Fixes / Enhancements
-
Compound.Split
- Fixed width inference (for cases where the schema file is not available) is improved to prevent splitting inside standard data types.
- Python translation now produces code that calls Pandas for supported CSV and fixed-width files.
-
Extraction.Web
- The NormalizeHtml method is now significantly simpler and more robust.
- Now supports table constraints in which some cell examples are soft constraints (that is, they need not necessarily be satisfied but they help to converge to correct programs). This can be useful in interactive settings where if the user provides new examples after a previous learning round, then any output of the previous learning round which the user has not changed can be treated as soft constraints.
-
Transformation.Text
- Learning performance improvements.
- Improved readability of learned programs through constant folding.
- Fixes to conditional program learning such that the correct program is learned and returned in more cases instead of the system indicating that it could not learn a program. Also conditional patterns are better clustered together.
-
Changes that only affect those building their own DSLs
- The PROSE grammar specification language has been changed to only allow binding single variables in let expressions. None of our existing grammars used a let with multiple variables, and we decided to simplify the grammar handling logic by enforcing this as a constraint.
- AST “holes” now have Ids and are equal if they have the same symbol and id.
- There is a new subprogram translator extension point interface which enables pattern based code generators to be plugged into the overall translation system.
Release 6.0.0 -- 2018/03/19
Breaking Changes
-
Our session base class no longer implements
IDisposable
, and its serialization format has changed because theAllowBackgroundComputations
member has been deleted. -
Session constraints and inputs are now maintained in smart collections rather than having custom Add/Remove methods.
- For most consuming code the fix is just to replace calls to
session.AddConstraints
withsession.Constraints.Add
andsession.AddInputs
withsession.Inputs.Add
- For most consuming code the fix is just to replace calls to
-
The Python and Java translators now all take an OptimizeFor parameter which does NOT have a default value. Callers should specify a value of OptimizeFor.Performance or OptimizeFor.Readability (where readability means that the program synthesized should be as easy as possible for humans to read and understand even if that means some reduction in the speed of execution for that program).
-
Matching.Text no longer allows specifying negative examples or positive examples that are marked as hard constraints. Since the current implementation treats all examples as soft constraints, the API now makes that explicit. Similarly, the
DisjunctionLimit
constraint is also always a soft constraint. -
Transformation.Text's
GetSignificantInputClustersAsync
method has been removed and replaced withGetSignificantInputsAsync
.
New Features
- Compound.Split
- Now supports Multi-record splitting which enables splitting of files with key-value pairs and pivoting the results into a table with the keys as columns.
Bug Fixes / Enhancements
-
The
LearnTopK
method onSession
has a new optional parameter for requesting not only the top K programs but also a random sample of additional programs learned. -
The dependency on the native Z3 library has been removed.
-
Improvements to translation of programs.
- Avoid lifting constants if translator is not opted for Performance.
- Modify Python translator to not create classes for simple programs.
- Perform Constant propagation, Copy propagation and Common subexpression optimization on translated programs (Java and Python).
- Other readability improvements like aliasing in Python and shortening Python operator names.
- Compound.Split now supports Python translations for simple delimiter and fixed width cases. In an upcoming release all Compound.Split programs will support Python translation.
- Specific to Transformation.Text
- By default the generated program has a more natural function signature with an argument for each input column that is named appropriately rather than a taking a single dictionary of the inputs.
- Programs produced use the more idiomatic + for concat and [:] for slice in appropriate places.
- Dead-code elimination.
- Function calls are not inlined but variables and literals are.
- An extra variable is no longer created for the return value of a function.
-
Extraction.Web
- No longer escapes certain occurrences of hyphen in CSS selectors (when it occurs between two letters, a very common case) making them more readable.
- Now supports entity-based extraction programs that include operators to extract data with respect to surrounding entities in the DOM context (e.g. extracting dates from bill receipts across different formats and providers).
- A new previous program constraint which enables incremental table learning where a previously synthesized program can be supplied when synthesizing a new program which adds additional row/column selection information. This can bring significant perf benefits for synthesizing programs to extract large tables.
- Case insensitive text matching.
- Possibly null values are prevented in single column extractions.
- Improved predicate learning to most specific and smallest selectors.
-
Split.Text (and by extension Compound.Split)
- Improvements to fixed width file detection including:
- Better number data type recognition to include numbers preceded by “+/-“.
- Reduced false positive left aligned column detection.
- Improvements to fixed width file detection including:
-
Transformation.Text
- Fixed a bug in datetime rounding learning where sometimes too many dates were rounded.