This commit converts the library back to a standard .NET 4.5 desktop
library, in preparation for a move to .NET Core as the PCL scheme is
being deprecated. To accomplish this I:
* Changed the Bio.Core project type
* Specified to use our own SortedList implementation when ambiguous
* Removed the Bio.Platform.Helpers* Projects after pushing necessary
files over.
No one should try to align sequences of size M and N where M x N >
Int32.MaxValue. However, it could happen and it leads to overflows and
strange message. This commit:
* Validates M x N < Int32.MaxValue
* Throws exception with message if not.
* Verifies this behavior with tests.
This change is about future proofing, it has no effect right now.
The aligner has to deal with an edge case where an insertion can follow
a deletion, or a deletion can follow an insertion. To make this
happen, when considering insertions or deletions, we have to consider
opening a gap from both the "Match" state as well as the "Gap Score"
and the highest of these will form the viterbi path.
For all sensible parameter settings, the Match state should be the
highest score, but for some pathological ones, the gap states might be.
This clarifies which states are the most likely, and makes it so that
in the future if someone switched the order of move preferences (e.g.
so that in an event of equal scoring, rather than vertical->left->diag
being the order of preference it is left->vertical->diag) the code
would still produce the highest score. Before, we only took the best
score for the vertical move, and the match score for the horizontal
move.
For these dynamic programming problems, when traversing through the
matrix one must store the entire traceback matrix,
but the scores are only needed for the current row and the row above it.
In the current implementation, we only use two rows for the "top
scoring" (and match) matrix, but fill out entire matrices for the
deletion and insert (or vertical and horizontal) matrices. This is
inefficient, as we then need 2 x M x N memory rather than 2 x 2 x N
memory. I switched this to only use the smaller amount. Additionally:
* Changed the minimum score on the edges from an arbitray low number to a constant near Int32.MinValue for clarity.
* Added more comments throughout.
* Rather than have three arrays, I used 1 array of a struct which may help memory locality for very large global alignments.
* Added more unit tests that directly compare against EMBOSS Needle alignment program, found and reported a bug in NEEDLE while doing that...
* Change all the vague category "Priority2" labels in NW tests to a newer version.
The aligner was not handling gaps at the end appropriately in the
affine case, as it would neglect to add a leading insert. This in turn
goofed up the MUMAligner.
The MUMmer aligner exposed a PairWiseAlgorithm to be used to stitching
together of different seeds. There were however a number of constraints
that this algorithm must obey but which were not controlled for here.
In particular, the algorithm must be a global aligner that returns only
one alignment per sequence pair, but this was not enforced. As we only
have the NeedleManWunsch aligner that fits the global bill here, I
enforced that this class must use this aligner for stitching together
seeds. This is a breaking API change, but if anyone was using the old
way their code would silently be giving bad results, so it is an
appropriate change.
This change also added comments for the aligner.
Finally, it also fixes an issue where the insertions were not
tallied correctly as it was converting to List<int> instead of
List<long>
Allow Bio.Tools to build X-Platform
Three changes
* Removed XML from build scripts seeking to restore nuget targets
* Changed packages.config files to work around case sensitive issue
(NetBio -> NETBio)
* Removed a few variables that were not used, causing Warning as Error
failures.
This collection of code allows users to call variants with QV values
from IPairwiseAlignedSequence objects that contain either regular
sequences or sequences with associated quality values. Examples of how
to use it are provided in the tests shown.
This adds a parser for the new PacBio CCS data BAM format. It wraps
the already existing BAM files and converts the SAMAlignedSequence type
into a PacBioCCSRead type for less memory storage and easier access.
Tests to verify the behavior are included.