10 Diffing Options
Egil Hansen edited this page 2023-03-05 08:48:29 +00:00

The library comes with a bunch of options (internally referred to as strategies), for the following three main steps in the diffing process:

  1. Filtering out irrelevant nodes and attributes
  2. Matching up nodes and attributes for comparison
  3. Comparing matched up nodes and attributes

To make it easier to configure the diffing engine, the library comes with a DiffBuilder class, which handles the relatively complex task of setting up the HtmlDifferenceEngine.

The following section documents the current built-in strategies that are available.

To learn how to create your own strategies, visit the Creating Custom Diffing Strategies page.

Default Options

In most cases, calling DiffBuilder.Compare(...).WithTest(...).Build() will give you a good set of default options for comparison, e.g.

var controlHtml = "<p>Hello World</p>";
var testHtml = "<p>World, I say hello</p>";
var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .Build();

If you want to be more explicit, the following is equivalent to the code above:

var controlHtml = "<p>Hello World</p>";
var testHtml = "<p>World, I say hello</p>";
var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions((IDiffingStrategyCollection options) => options.AddDefaultOptions())
    .Build();

Calling the AddDefaultOptions() method is equivalent to specifying the following options explicitly:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions((IDiffingStrategyCollection options) => options
        .IgnoreDiffAttributes()
        .IgnoreComments()
        .AddSearchingNodeMatcher()
        .AddCssSelectorMatcher()
        .AddAttributeNameMatcher()
        .AddElementComparer()                
        .AddIgnoreElementSupport()
        .AddStyleSheetComparer()
        .AddTextComparer(WhitespaceOption.Normalize, ignoreCase: false)
        .AddAttributeComparer()
        .AddClassAttributeComparer()
        .AddBooleanAttributeComparer(BooleanAttributeComparision.Strict)
        .AddStyleAttributeComparer()
    )
    .Build();

Read more about each of the strategies below, including some that are not part of the default setting.

Filter strategies

These are the built-in filter strategies.

Ignore comments

Enabling this strategy will ignore all comment nodes during comparison. Activate it by calling the IgnoreComments() method on an IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.IgnoreComments())
    .Build();

NOTE: Currently, the ignore comment strategy does NOT remove comments from CSS or JavaScript embedded in <style> or <script> tags.

Ignore elements

If the diff:ignore="true" attribute is used on a control element (the ="true" part is implicit/optional), all of its attributes and child nodes are ignored during comparison, including those of the test element that the control element is matched with.

In this example, the <h1> tag, its attributes, and its children are considered the same as the element it is matched with:

<header>
    <h1 class="heading-1" diff:ignore>Hello world</h1>
</header>

Activate this strategy by calling the AddIgnoreElementSupport() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddIgnoreElementSupport())
    .Build();

If the diff:ignoreChildren="true" attribute is used on a control element (the ="true" part is implicit/optional), all of its child nodes are ignored during comparison, including those of the test element that the control element is matched with.

In this example, the children of the <h1> tag are considered the same as the children of the element it is matched with:

<header>
    <h1 class="heading-1" diff:ignoreChildren>Hello world</h1>
</header>

Ignoring special "diff"-attributes

Any attributes that start with diff: are automatically filtered out before matching/comparing happens. E.g. diff:whitespace="..." does not show up as a missing diff when added to a control element.

To enable this option, use the IgnoreDiffAttributes() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.IgnoreDiffAttributes())
    .Build();

Matching strategies

These are the built-in matching strategies. We have two different types, one for nodes and one for attributes.

Node matching strategies (elements, text, comments, etc.)

These are the built-in node matching strategies. They cover elements, text nodes, comments, and other types that inherit from INode.

One-to-one node matcher

The one-to-one node-matching strategy simply matches two node lists with each other, based on the index of each node. So, if you have two equal length control and test node lists, controlNodes[0] will be matched with testNodes[0], controlNodes[1] with testNodes[1], and so on.

If either of the lists is shorter than the other, the remaining items will be reported as missing (for control nodes) or unexpected (for test nodes).

If a node has been marked as matched by a previously executed matcher, the One-to-one matcher will not use that node in its matching and skip over it.
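The index-based pairing described above can be illustrated with a minimal, self-contained C# sketch. This is not the library's implementation: nodes are reduced to plain strings, the MatchOneToOne helper and its tuple result are hypothetical names, and the skipping of nodes already matched by an earlier matcher is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the library): pairs nodes by index
// and collects whatever is left over in the longer list.
static (List<(string Control, string Test)> Matches, List<string> Missing, List<string> Unexpected)
    MatchOneToOne(IReadOnlyList<string> controlNodes, IReadOnlyList<string> testNodes)
{
    var matches = new List<(string Control, string Test)>();
    var shared = Math.Min(controlNodes.Count, testNodes.Count);

    for (var i = 0; i < shared; i++)
        matches.Add((controlNodes[i], testNodes[i]));

    // Leftover control nodes are "missing", leftover test nodes are "unexpected".
    var missing = new List<string>();
    for (var i = shared; i < controlNodes.Count; i++) missing.Add(controlNodes[i]);

    var unexpected = new List<string>();
    for (var i = shared; i < testNodes.Count; i++) unexpected.Add(testNodes[i]);

    return (matches, missing, unexpected);
}

var result = MatchOneToOne(new[] { "h1", "p", "p" }, new[] { "h1", "p" });
// prints "matched: 2, missing: 1, unexpected: 0"
Console.WriteLine($"matched: {result.Matches.Count}, missing: {result.Missing.Count}, unexpected: {result.Unexpected.Count}");
```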

To choose this matcher, use the AddOneToOneNodeMatcher() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddOneToOneNodeMatcher())
    .Build();

Forward-searching node matcher

The forward-searching node-matcher strategy will only match control nodes with test nodes if their NodeName values match. It does this by taking one control node at a time and searching forward from the previously matched test node until it finds a match. If it does not find one, the control node is marked as missing and the matcher continues with the next control node. Afterwards, any unmatched test nodes are marked as unexpected.

The following JavaScript code illustrates how the algorithm works:

function forwardSearchingMatcher(controlNodes, testNodes) {
    const matches = [];
    // Index of the most recently matched test node; -1 means none yet.
    let lastMatchedTestNode = -1;

    for (const controlNode of controlNodes) {
        // Only search the test nodes that come after the previous match.
        for (let index = lastMatchedTestNode + 1; index < testNodes.length; index++) {
            if (controlNode.NodeName === testNodes[index].NodeName) {
                matches.push([controlNode, testNodes[index]]);
                lastMatchedTestNode = index;
                break; // stop at the first match for this control node
            }
        }
    }
    return matches;
}

To choose this matcher, use the AddSearchingNodeMatcher() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddSearchingNodeMatcher())
    .Build();

CSS-selector element matcher

The CSS-selector matcher can be used to match any test element from the test node tree with a given control element. On the control element, add the diff:match="CSS selector" attribute. The specified CSS selector should match zero or one test element.

For example, if the test nodes look like this:

<header>
    <h1>hello world</h1>
</header>
<main>
...
</main>
<footer>
...
</footer>

The following control node will be compared against the <h1> in the <header> tag:

<h1 diff:match="header > h1">hello world</h1>

One use case of the CSS-selector element matcher is where you only want to test one part of a sub-tree, and ignore the rest. The example above will report the unmatched test nodes as unexpected, but those "diffs" can be ignored since that is expected. This approach can save you from specifying all the needed control nodes if only part of a subtree needs to be compared.

To choose this matcher, use the AddCssSelectorMatcher() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddCssSelectorMatcher())
    .Build();

Attribute matching strategies

These are the built-in attribute matching strategies.

Attribute name matcher

This matcher will match attributes on a control element with attributes on a test element using the attribute's name. If a control attribute is not matched, it is reported as missing, and if a test attribute is not matched, it is reported as unexpected.

To choose this matcher, use the AddAttributeNameMatcher() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddAttributeNameMatcher())
    .Build();

Comparing strategies

These are the built-in comparing strategies.

Element compare strategy

The basic element compare strategy will simply check if both nodes are elements and if the elements' names are the same.

To choose this comparer, use the AddElementComparer() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddElementComparer())
    .Build();

Element closing compare strategy

The element closing compare strategy will simply check if both nodes are elements and if the elements are closed the same way. If you add this comparer, <br /> and <br> are marked as different.

This comparer is not part of the default options. To choose this comparer, use the AddElementClosingComparer() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddElementClosingComparer())
    .Build();

Comment compare strategy

The basic comment compare strategy will simply check if both nodes are comments.

To choose this comparer, use the AddCommentComparer() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddCommentComparer())
    .Build();

Text (text nodes) strategies

The built-in text strategies offer a bunch of ways to control how text (text nodes) is handled during the diffing process.

NOTE: It is on the issues list to enable a more intelligent, e.g. whitespace-aware, comparison of JavaScript (text) inside <script>-tags and event-attributes.

Whitespace handling

Whitespace can be a source of false positives when comparing two HTML fragments. Thus, the whitespace handling strategy offers different ways to deal with it during a comparison.

  • Preserve (default): Does not change or filter out any whitespace in the text nodes of the control and test HTML.
  • RemoveWhitespaceNodes: Using this option filters out all text nodes that only consist of whitespace characters.
  • Normalize: Using this option will trim all text nodes and replace two or more whitespace characters with a single space character. This option implicitly includes the RemoveWhitespaceNodes option.
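To make the difference between the three options concrete, here is a minimal C# sketch that applies each of them to a list of text-node strings. It only mimics the behavior described in the list above and is not the library's code: the ApplyWhitespaceOption helper is hypothetical, and the options are passed as plain strings rather than as the library's WhitespaceOption enum.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the library) that mimics the three
// whitespace options on a list of text-node strings.
static List<string> ApplyWhitespaceOption(IEnumerable<string> textNodes, string option)
{
    switch (option)
    {
        case "Preserve":
            // Leave every text node untouched.
            return textNodes.ToList();
        case "RemoveWhitespaceNodes":
            // Drop text nodes that consist only of whitespace.
            return textNodes.Where(t => !string.IsNullOrWhiteSpace(t)).ToList();
        case "Normalize":
            // Drop whitespace-only nodes, trim the rest, and collapse
            // runs of two or more whitespace characters to one space.
            return textNodes
                .Where(t => !string.IsNullOrWhiteSpace(t))
                .Select(t => Regex.Replace(t.Trim(), @"\s{2,}", " "))
                .ToList();
        default:
            throw new ArgumentException($"Unknown option: {option}", nameof(option));
    }
}

var nodes = new[] { "\n  ", "  Hello   world \n", " " };
Console.WriteLine(ApplyWhitespaceOption(nodes, "Preserve").Count);              // 3
Console.WriteLine(ApplyWhitespaceOption(nodes, "RemoveWhitespaceNodes").Count); // 1
Console.WriteLine(ApplyWhitespaceOption(nodes, "Normalize")[0]);                // Hello world
```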

These options can be set either globally for the entire comparison, or inline on specific subtrees in the comparison.

To set a global default, call the method AddTextComparer(WhitespaceOption) on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddTextComparer(WhitespaceOption.Normalize))
    .Build();

To configure/override whitespace rules on a specific subtree in the comparison, use the diff:whitespace="WhitespaceOption" attribute inline on a control element. The element and all text nodes below it will then use that whitespace option, unless it is overridden on a child element. In the example below, all whitespace inside the <h1> element is preserved:

<header>
    <h1 diff:whitespace="preserve">Hello   <em> woooorld</em></h1>
</header>

Special case for <pre>, <script>, and <style> elements: The content of <pre>, <script>, and <style> elements will always be treated as if the Preserve option is set, even if the whitespace option is globally set to RemoveWhitespaceNodes or Normalize. To override this, add an inline diff:whitespace attribute to the tags, e.g.:

<pre diff:whitespace="RemoveWhitespaceNodes">...</pre>

This should ensure that the meaning of the content in those tags doesn't change by default. To deal correctly with whitespace in <style> tags, use the Style sheet text comparer.

Perform case-insensitive text comparison

To compare the text in two text nodes to each other using a case-insensitive comparison, call the AddTextComparer(ignoreCase: true) method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddTextComparer(ignoreCase: true))
    .Build();

To configure/override ignore case rules on a specific subtree in the comparison, use the diff:ignoreCase="true|false" attribute inline on a control element. The element and all text nodes below it will then use that ignore case setting, unless it is overridden on a child element. In the example below, case-insensitive comparison is enabled for all text inside the <h1> element:

<header>
    <h1 diff:ignoreCase="true">Hello   <em> woooorld</em></h1>
</header>

Note, as with all HTML5 boolean attributes, the ="true" or ="false" parts are optional.

Use regular expression when comparing text

By using the inline attribute diff:regex on the element containing the text node being compared, the comparer will consider the control text to be a regular expression, and will use that to test whether the test text node is as expected. This can be combined with the inline diff:ignoreCase attribute, to make the regular expression case-insensitive. E.g.:

<header>
    <h1 diff:regex diff:ignoreCase>Hello World \d{4}</h1>
</header>

The above control text would use a case-insensitive regular expression to match against a test text string (e.g. "HELLO WORLD 2020").
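The idea can be sketched in a few lines of C#. This is only an illustration under assumptions, not the library's comparer: the TextMatches helper is hypothetical, and it uses an unanchored Regex.IsMatch, which may differ from how the library applies the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the library): treats the control text
// as a regular expression and tests the test text against it.
static bool TextMatches(string controlPattern, string testText, bool ignoreCase)
{
    var options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
    return Regex.IsMatch(testText, controlPattern, options);
}

Console.WriteLine(TextMatches(@"Hello World \d{4}", "HELLO WORLD 2020", ignoreCase: true));  // True
Console.WriteLine(TextMatches(@"Hello World \d{4}", "HELLO WORLD 2020", ignoreCase: false)); // False
```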

Style sheet text comparer

Different whitespace rules apply to style sheets (style information) inside <style> tags than to HTML5. This comparer will parse the style information inside <style> tags and compare the result of the parsing, instead of doing a direct string comparison. This should remove false positives where, for example, insignificant whitespace makes two otherwise equal sets of style information result in a diff.

To add this comparer, use the AddStyleSheetComparer() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddStyleSheetComparer())
    .Build();

Attribute Compare options

The library supports various ways to perform attribute comparison.

Basic name and value comparison

The "name and value comparison" is the base comparison option, and that will test if both the names and the values of the control and test attributes are equal. E.g.:

  • attr="foo" is the same as attr="foo"
  • attr="foo" is NOT the same as attr="bar"
  • foo="attr" is NOT the same as bar="attr"

To choose this comparer, use the AddAttributeComparer() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddAttributeComparer())    
    .Build();

RegEx attribute value comparer

It is possible to specify a regular expression in the control attribute's value, and add the :regex postfix to the control attribute's name, to have the comparison performed using a Regex match test. E.g.

  • attr:regex="foo-\d{4}" is the same as attr="foo-2019"

Ignore case attribute value comparer

To get the comparer to perform a case-insensitive comparison of the values of the control and test attributes, add the :ignoreCase postfix to the control attribute's name. E.g.

  • attr:ignoreCase="FOO" is the same as attr="foo"

Combine ignore case and regex attribute value comparer

To perform a case-insensitive regular expression match, combine :ignoreCase and :regex as a postfix to the control attribute's name. The order in which you combine them does not matter. E.g.

  • attr:ignoreCase:regex="FOO-\d{4}" is the same as attr="foo-2019"
  • attr:regex:ignoreCase="FOO-\d{4}" is the same as attr="foo-2019"
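The postfix rules for attribute values can be sketched as a small C# helper. This is an illustration only and not the library's comparer: the AttributeValuesEqual helper is hypothetical, it assumes attribute names contain no colon other than the postfix separators, and it uses an unanchored Regex.IsMatch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the library): reads the :regex and
// :ignoreCase postfixes from the control attribute's name and compares
// the two attribute values accordingly.
static bool AttributeValuesEqual(string controlName, string controlValue, string testName, string testValue)
{
    var parts = controlName.Split(':');
    // Postfix order does not matter, so a set lookup is enough.
    var postfixes = new HashSet<string>(parts.Skip(1), StringComparer.OrdinalIgnoreCase);
    var useRegex = postfixes.Contains("regex");
    var ignoreCase = postfixes.Contains("ignoreCase");

    // The attribute names (without postfixes) must be the same.
    if (!string.Equals(parts[0], testName, StringComparison.OrdinalIgnoreCase))
        return false;

    if (useRegex)
        return Regex.IsMatch(testValue, controlValue, ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None);

    return string.Equals(controlValue, testValue,
        ignoreCase ? StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal);
}

Console.WriteLine(AttributeValuesEqual("attr:regex", @"foo-\d{4}", "attr", "foo-2019"));            // True
Console.WriteLine(AttributeValuesEqual("attr:ignoreCase", "FOO", "attr", "foo"));                   // True
Console.WriteLine(AttributeValuesEqual("attr:regex:ignoreCase", @"FOO-\d{4}", "attr", "foo-2019")); // True
Console.WriteLine(AttributeValuesEqual("attr", "FOO", "attr", "foo"));                              // False
```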

Class attribute comparer

The class attribute is special in HTML. It can contain a space-separated list of CSS classes, whose order does not matter. Therefore, the library will ignore the order in which the CSS classes are specified in the class attribute of the control and test elements, and instead just ensure that both have the same CSS classes added to them. E.g.

  • class="foo bar" is the same as class="bar foo"
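An order-insensitive class comparison can be sketched with a set comparison in C#. The ClassAttributesEqual helper below is hypothetical and not the library's implementation. As an assumption of this sketch, repeated class names are treated as a single class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the library): compares two class
// attribute values as sets of class names, ignoring their order.
static bool ClassAttributesEqual(string controlValue, string testValue)
{
    var separators = new[] { ' ', '\t', '\n', '\r', '\f' };
    var control = new HashSet<string>(controlValue.Split(separators, StringSplitOptions.RemoveEmptyEntries));
    var test = new HashSet<string>(testValue.Split(separators, StringSplitOptions.RemoveEmptyEntries));
    return control.SetEquals(test);
}

Console.WriteLine(ClassAttributesEqual("foo bar", "bar foo")); // True
Console.WriteLine(ClassAttributesEqual("foo bar", "foo"));     // False
```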

To enable the special handling of the class attribute, call the AddClassAttributeComparer() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddClassAttributeComparer())
    .Build();

Boolean attributes comparer

Other special types of attributes are the boolean attributes. To make comparing these more forgiving, the boolean attribute comparer will consider two boolean attributes equal, according to these rules:

  • In strict mode, a boolean attribute's value is considered truthy if the value is missing, empty, or is the name of the attribute.
  • In loose mode, a boolean attribute's value is considered truthy if the attribute is present on an element.

For example, in strict mode, the following are considered equal:

  • required is the same as required=""
  • required="" is the same as required="required"
  • required="required" is the same as required="required"
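The strict and loose rules can be expressed as a short C# sketch. The IsTruthy and BooleanAttributesEqual helpers are hypothetical and only mirror the rules listed above. A missing value is represented as an empty string, and comparing the value to the attribute name case-insensitively is an assumption of this sketch.

```csharp
using System;

// Hypothetical helpers (not part of the library). Both assume the
// attribute is present on the element; a missing value is passed as "".
static bool IsTruthy(string name, string value, bool strict)
{
    // Loose mode: the mere presence of the attribute is enough.
    if (!strict) return true;

    // Strict mode: the value must be missing/empty or equal the attribute's name.
    return string.IsNullOrEmpty(value) || string.Equals(value, name, StringComparison.OrdinalIgnoreCase);
}

static bool BooleanAttributesEqual(string name, string controlValue, string testValue, bool strict)
    => IsTruthy(name, controlValue, strict) == IsTruthy(name, testValue, strict);

Console.WriteLine(BooleanAttributesEqual("required", "", "required", strict: true)); // True
Console.WriteLine(BooleanAttributesEqual("required", "", "false", strict: true));    // False
Console.WriteLine(BooleanAttributesEqual("required", "", "false", strict: false));   // True
```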

To enable the special handling of boolean attributes, call the AddBooleanAttributeComparer(BooleanAttributeComparision.Strict) or AddBooleanAttributeComparer(BooleanAttributeComparision.Loose) method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddBooleanAttributeComparer(BooleanAttributeComparision.Strict))
    .Build();

Style attribute comparer

Different whitespace rules apply to style information inside style="..." attributes than to HTML5. This comparer will parse the style information inside style="..." attributes and compare the result of the parsing, instead of doing a direct string comparison. This should remove false positives where e.g. insignificant whitespace makes two otherwise equal sets of style information result in a diff.

To add this comparer, use the AddStyleAttributeComparer() method on the IDiffingStrategyCollection type, e.g.:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddStyleAttributeComparer())
    .Build();

When styles are parsed they are also normalized. This means that the following styles would be identical:

  • style="border: 1px solid red;"
  • style="border: solid 1px red;"

But if you have multiple styles the order matters and is therefore not changed. The following styles are different:

  • style="color: red; border: 0"
  • style="border: 0; color: red"

To add a style comparer where the order does not matter, register the style comparer with the optional parameter ignoreOrder set to true:

var diffs = DiffBuilder
    .Compare(controlHtml)
    .WithTest(testHtml)
    .WithOptions(options => options.AddStyleAttributeComparer(ignoreOrder: true))
    .Build();
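The effect of the ignoreOrder parameter can be illustrated with a simplified C# sketch. It is not the library's comparer: the ParseDeclarations and StylesEqual helpers are hypothetical, the parsing is a naive split on ";" and ":", and it does not normalize values, so it would not treat "1px solid red" and "solid 1px red" as equal.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of the library): splits a style value
// into trimmed "name:value" declarations, in source order.
static string[] ParseDeclarations(string style) =>
    style.Split(';', StringSplitOptions.RemoveEmptyEntries)
         .Select(d => d.Split(':', 2))
         .Where(p => p.Length == 2)
         .Select(p => $"{p[0].Trim().ToLowerInvariant()}:{p[1].Trim()}")
         .ToArray();

// Hypothetical helper: compares declarations in order, or as sorted
// lists when ignoreOrder is true.
static bool StylesEqual(string control, string test, bool ignoreOrder)
{
    var c = ParseDeclarations(control);
    var t = ParseDeclarations(test);
    if (ignoreOrder)
    {
        Array.Sort(c, StringComparer.Ordinal);
        Array.Sort(t, StringComparer.Ordinal);
    }
    return c.SequenceEqual(t);
}

Console.WriteLine(StylesEqual("color: red; border: 0", "border: 0; color: red", ignoreOrder: false)); // False
Console.WriteLine(StylesEqual("color: red; border: 0", "border: 0; color: red", ignoreOrder: true));  // True
```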

Ignore attributes during diffing

To ignore a specific attribute during comparison, add the :ignore postfix to the attribute on the control element. This will simply skip comparing the two attributes and not report any differences between them. E.g. to ignore the class attribute, do:

<header>
    <h1 class:ignore>Hello world</h1>
</header>

To ignore all attributes during comparison, add the diff:ignoreAttributes attribute on the control element. This will skip comparing all attributes and not report any differences between them. E.g. to ignore all attributes, do:

<header>
    <h1 diff:ignoreAttributes>Hello world</h1>
</header>