1. The implementation of `SequenceDistribution` (of which the `StringDistribution` specialization is the most important)
uses the curiously recurring template pattern (https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern)
by specifying the `TThis` template parameter. Where `SequenceDistribution` creates new instances,
it uses `new TThis()`, which goes through the `Activator.CreateInstance<>` method and is very slow.
To avoid this, a special helper, `Util.New<T>`, is introduced. It uses a runtime code generation trick
(https://stackoverflow.com/a/1280832) that is at least an order of magnitude faster than what the compiler does.
2. `CollectionElementMappingInfo.ElementMapping` is switched from a List of lists to a List of read-only arrays.
This brings two performance benefits: a) there is less indirection when reading the element mapping, and
b) the common mappings can now be cached and reused.
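
The idea behind item 1 can be sketched as follows. This is a minimal illustration, not the actual `Util.New<T>` code: the names `FastNew`, `SequenceLike` and `StringLike` are hypothetical, and the sketch assumes the helper compiles a `new T()` expression once per type and caches the resulting delegate, as in the linked answer.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical stand-in for Util.New<T>; the real helper may differ.
public static class FastNew<T> where T : new()
{
    // Compiled once per closed generic type T, then reused on every call.
    public static readonly Func<T> Create =
        Expression.Lambda<Func<T>>(Expression.New(typeof(T))).Compile();
}

// CRTP-style base class: TThis is the concrete derived type.
public abstract class SequenceLike<TThis>
    where TThis : SequenceLike<TThis>, new()
{
    public int Length { get; private set; }

    // Creates an instance of the derived type through the cached delegate
    // instead of `new TThis()` (which calls Activator.CreateInstance<TThis>()).
    public TThis Append()
    {
        TThis result = FastNew<TThis>.Create();
        result.Length = this.Length + 1;
        return result;
    }
}

// Illustrative concrete type, playing the role of StringDistribution.
public sealed class StringLike : SequenceLike<StringLike> { }
```

The delegate is built in a static field initializer, so the compilation cost is paid once per `T` and later calls are plain delegate invocations.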
* Inferred constant variables are not inlined
* Added Factor.ProbLessThan, ProbBetween, Quantile, Integral, Apply.
* ModelCompiler handles Delegate-valued variables.
* RatioGaussianOp handles ratio instead of ProductOp.
* ExpOp_Slow and LogOp_EP can compute derivatives
* Refactored ConstantFoldingTransform out of ModelAnalysisTransform
* CodeBuilder.MethodRefExpr takes arguments.
* Added VariableInformation.NeedsMarginalDividedByPrior and CodeRecognizer.NeedsMarginalDividedByPrior
* MessageTransform does not use a distribution for the forward message of a constant.
* DefaultFactorManager allows deterministic factors to have Full support
* DefaultFactorManager puts each factor on a separate line
* docs README mentions ShowFactorManager
BinaryNativeClassifierMapping and MulticlassNativeClassifierMapping serialize labels as strings. Incremented CustomSerializationVersion for each.
BinaryNativeClassifierMapping does not serialize labels when TLabel is bool.
IReader.ReadObject is generic.
* Added Variable.Max(int,int).
* Compiler warns about excess memory consumption in more cases when it should, and fewer cases when it shouldn't.
* TransformBrowser shows attributes by default.
* Updated FactorDocs
* BayesPointMachineClassifier.LoadBackwardCompatibleBinaryClassifier and SaveForwardCompatible use text or binary depending on the file extension
* Added IWriter and IReader, WrappedBinaryWriter, WrappedBinaryReader, WrappedTextWriter, WrappedTextReader
* Changed uses of BinaryWriter to IWriter
* Changed uses of BinaryReader to IReader
* LearnersTests uses same xunit version as other tests
Introduce a "generational" strategy for clearing intermediate data in the reused preallocated array that holds
`Automaton.Condensation.FindStronglyConnectedComponents` state.
To avoid allocations, `FindStronglyConnectedComponents` employs a preallocated array with data about states.
It may use this cache multiple times when processing a single automaton. It typically uses only a handful
of entries in this array, but each time it grabs the array, it has to clear it in full.
With the previous clearing strategy, `ComputeEpsilonClosure` was quadratic in the worst case,
since it calls `FindStronglyConnectedComponents` as many times as there are states in the automaton (N),
and each call clears the array, which is also O(N), turning the whole thing into O(N^2).
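
A rough sketch of what a generational clear can look like, assuming each entry is stamped with the generation in which it was written, so that bumping a counter invalidates every entry at once. The type `GenerationalScratch` and its members are hypothetical and are not taken from the automaton code.

```csharp
using System;

// Hypothetical scratch buffer; names are illustrative only.
public sealed class GenerationalScratch
{
    private readonly int[] values;
    private readonly int[] stamps;   // generation in which values[i] was last written
    private int generation = 1;      // 0 is reserved for "never written"

    public GenerationalScratch(int capacity)
    {
        values = new int[capacity];
        stamps = new int[capacity];
    }

    // Logical clear in O(1): entries stamped with an older generation become invisible.
    public void Clear()
    {
        if (generation == int.MaxValue)
        {
            // Rare wrap-around: fall back to a full O(N) reset of the stamps.
            Array.Clear(stamps, 0, stamps.Length);
            generation = 1;
        }
        else
        {
            generation++;
        }
    }

    public void Set(int index, int value)
    {
        values[index] = value;
        stamps[index] = generation;
    }

    // Returns false for entries not written since the last Clear().
    public bool TryGet(int index, out int value)
    {
        if (stamps[index] == generation)
        {
            value = values[index];
            return true;
        }
        value = 0;
        return false;
    }
}
```

With this layout, N calls that each touch a handful of entries cost O(N) in total rather than O(N^2), at the price of one extra `int` per entry.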
C# now has a built-in type to represent pairs: `ValueTuple<>`. `Pair<>` was eliminated in favour of it.
At the same time a new type, `IntPair`, was introduced. It is faster than `ValueTuple<int, int>` when `.Equals()` is called often.
It makes some common string operations that do many lookups by `IntPair` keys up to 10% faster.
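
The message does not show `IntPair` itself. One plausible shape, assuming the type is a small struct that compares its two fields directly through `IEquatable<IntPair>`, is sketched below. The field names and the hash function are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical layout; the real IntPair may differ.
public readonly struct IntPair : IEquatable<IntPair>
{
    public readonly int First;
    public readonly int Second;

    public IntPair(int first, int second)
    {
        First = first;
        Second = second;
    }

    // Direct field comparison, with no boxing and no generic comparer lookup.
    public bool Equals(IntPair other) =>
        First == other.First && Second == other.Second;

    public override bool Equals(object obj) =>
        obj is IntPair other && Equals(other);

    // Simple combining hash; the multiplier is an arbitrary illustrative choice.
    public override int GetHashCode() =>
        unchecked((First * 397) ^ Second);
}
```

Because the struct implements `IEquatable<IntPair>`, a `Dictionary<IntPair, TValue>` calls the strongly typed `Equals` on every key lookup.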
A lot of automata operations create large short-lived data structures.
Those are now cached in thread-local static fields. This saves a lot of allocations and consequently causes less GC pressure.
To support this, a new container, `GenerationalDictionary`, is introduced. It can be cleared in constant time and reuses its memory after `Clear`.
These changes reduce the memory allocated by `StringInferencePerformanceTests` from 4 GB to 2 GB. Another 1 GB is allocated by `FindStronglyConnectedComponents`; that will be fixed in a separate PR, since caching is harder to introduce there. Another 1 GB is test setup (creation of test data).
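
As an illustration of how a dictionary can be cleared in constant time and then cached per thread, here is a minimal sketch. It is not the `GenerationalDictionary` from this change: `GenerationalMap` and `ScratchMaps` are hypothetical names, and the sketch assumes an open-addressing table in which each slot carries a generation stamp.

```csharp
using System;

// Hypothetical sketch; supports insert and lookup only (no removal).
public sealed class GenerationalMap<TKey, TValue> where TKey : IEquatable<TKey>
{
    private struct Slot
    {
        public int Generation;   // slot is live only if this equals the map's generation
        public TKey Key;
        public TValue Value;
    }

    private Slot[] slots = new Slot[16];   // length is always a power of two
    private int generation = 1;            // 0 marks a never-used slot
    private int count;

    public int Count => count;

    // Constant-time clear: old entries become stale and their slots are reused.
    public void Clear()
    {
        if (generation == int.MaxValue)
        {
            Array.Clear(slots, 0, slots.Length);   // rare wrap-around reset
            generation = 1;
        }
        else
        {
            generation++;
        }
        count = 0;
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        int i = FindSlot(slots, key);
        if (slots[i].Generation == generation)
        {
            value = slots[i].Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        if ((count + 1) * 2 > slots.Length) Grow();   // keep load factor <= 0.5
        int i = FindSlot(slots, key);
        if (slots[i].Generation != generation) count++;
        slots[i] = new Slot { Generation = generation, Key = key, Value = value };
    }

    // Linear probing: returns the slot holding key, or the first non-live slot.
    private int FindSlot(Slot[] table, TKey key)
    {
        int mask = table.Length - 1;
        int i = key.GetHashCode() & mask;
        while (table[i].Generation == generation && !table[i].Key.Equals(key))
            i = (i + 1) & mask;
        return i;
    }

    private void Grow()
    {
        Slot[] old = slots;
        var bigger = new Slot[old.Length * 2];
        foreach (Slot s in old)
        {
            if (s.Generation != generation) continue;   // skip stale entries
            bigger[FindSlot(bigger, s.Key)] = s;
        }
        slots = bigger;
    }
}

// One cached map per thread; callers must not hold two rented maps at once.
public static class ScratchMaps
{
    [ThreadStatic] private static GenerationalMap<int, int> cached;

    public static GenerationalMap<int, int> Rent()
    {
        var map = cached ?? (cached = new GenerationalMap<int, int>());
        map.Clear();
        return map;
    }
}
```

Each `Rent()` on the same thread returns the same instance with its grown table intact, so repeated operations stop allocating once the table has reached its working size.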
Changed the BCC confusion matrix prior so that TrueLabels can be inferred when LabelCount==2.
Fixed serialization of BCC posteriors. BCC posteriors now save to the Results folder.
Fixed serialization example code.
There are two changes:
1. (major) `LogProbabilityOverride` was removed from `DiscreteChar` and `ImmutableDiscreteChar`.
This was unsound functionality that was not used and complicated the implementation of char distributions.
2. (minor) `ImmutableDiscreteChar.Multiply` may reuse an existing immutable discrete char in more cases out of the box.
This reduces the GC pressure a little bit.
* Subarray checks that indices are distinct when debugging
* Renamed ConcatOpTests to StringConcatOpTests
* Crowdsourcing explains why accuracy and precision are NaN.
* Added build instructions for Visual Studio Code.
- Fixed automaton deserialization ignoring LogValueOverride
- Fixed SequenceDistribution.EnumerateSupport and TryEnumerateSupport having different side effects
- Added TryDeterminize, SetLogValueOverride, and ProjectOnTransducer methods to SequenceDistribution
- Added a parameterless overload for Automaton.TryDeterminize that returns the output of TryDeterminize(out TThis) and discards information about deterministicity of said output
* Immutable distribution interfaces
* DiscreteChar made immutable
* Automata made constant
* Automaton.GetLogValue optimized for cases of deterministic and epsilon-free automata
* Fixed Automaton.[Try]EnumerateSupport so that it won't produce duplicates for non-determinizable automata
* Introduced IWeightFunction - interface for abstract weight functions used by SequenceDistribution
* Multi-representable weight function for sequence distribution, that automatically switches between point mass, dictionary, and automaton representations as appropriate
* Early stops for automaton support enumeration
* Improved automata graphviz format
* Language writer correctly processes nested generics
* Incremented version to 0.4
* Subarray and GetItems factors and operators take IReadOnlyList instead of IList.
* IMatchboxRecommenderMapping uses IReadOnlyList instead of IList.
* Moved Subarray and GetItems factors from Factor class to Collection class.
* Moved variable factors from Factor class to Clone class.
* Conversion.IsAssignableFrom handles covariance.
* Util.GetElementType and IsIList include IReadOnlyList.
* Code cleanup
* Refactored MessageTransform.ConvertMethodInvoke
* Removed Collection.Sort
# problem
The Azure DevOps VSTest runner does not handle the Compiler Options test well, because the test takes a very long time to run.
# solution
Run the different options in parallel and in a separate process, so that the test finishes reliably and in a reasonable time.
InferNet.Infer is generic.
ModelBuilder uses the correct overload of InferNet.Infer.
CodeBuilder.Method checks for correct number of arguments.
FileArray implements IReadOnlyList.
* Added WordStrings and StringFormatTests.
* FactorManager does not allow point mass conversion of the return value argument of EP evidence methods (previously handled by MessageTransform).
* Code cleanup. Renamed IdentityComparer to ReferenceEqualityComparer.