git/contrib/diff-highlight
Jeff King 4551fbba14 diff-highlight: detect --graph by indent
This patch fixes a corner case where diff-highlight may
scramble some diffs when combined with --graph.

Commit 7e4ffb4c17 (diff-highlight: add support for --graph
output, 2016-08-29) taught diff-highlight to skip past the
graph characters at the start of each line with this regex:

  ($COLOR?\|$COLOR?\s+)*

I.e., any series of pipes separated by and followed by
arbitrary whitespace.  We need to match more than just a
single space because the commit in question may be indented
to accommodate other parts of the graph drawing. E.g.:

 * commit 1234abcd
 | ...
 | diff --git ...

has only a single space, but for the last commit before a
fork:

 | | |
 | * | commit 1234abcd
 | |/  ...
 | |   diff --git

the diff lines have more spaces between the pipes and the
start of the diff.

However, when we soak up all of those spaces with the
$GRAPH regex, we may accidentally include the leading space
for a context line. That means we may consider the actual
contents of a context line as part of the diff syntax. In
other words, something like this:

   normal context line
  -old line
  +new line
   -this is a context line with a leading dash

would cause us to see that final context line as a removal
line, and we'd end up showing the hunk in the wrong order:

  normal context line
  -old line
   -this is a context line with a leading dash
  +new line

Instead, let's a be a little more clever about parsing the
graph. We'll look for the actual "*" line that marks the
start of a commit, and record the indentation we see there.
Then we can skip past that indentation when checking whether
the line is a hunk header, removal, addition, etc.

There is one tricky thing: the indentation in bytes may be
different for various lines of the graph due to coloring.
E.g., the "*" on a commit line is generally shown without
color, but on the actual diff lines, it will be replaced
with a colorized "|" character, adding several bytes. We
work around this here by counting "visible" bytes. This is
unfortunately a bit more expensive, making us about twice as
slow to handle --graph output. But since this is meant to be
used interactively anyway, it's tolerably fast (and the
non-graph case is unaffected).

One alternative would be to search for hunk header lines and
use their indentation (since they'd have the same colors as
the diff lines which follow). But that just opens up
different corner cases. If we see:

  | |    @@ 1,2 1,3 @@

we cannot know if this is a real diff that has been
indented due to the graph, or if it's a context line that
happens to look like a diff header. We can only be sure of
the indent on the "*" lines, since we know those don't
contain arbitrary data (technically the user could include a
bunch of extra indentation via --format, but that's rare
enough to disregard).

Reported-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-21 10:24:19 -07:00
..
t diff-highlight: detect --graph by indent 2018-03-21 10:24:19 -07:00
.gitignore diff-highlight: split code into module 2017-06-15 12:15:58 -07:00
DiffHighlight.pm diff-highlight: detect --graph by indent 2018-03-21 10:24:19 -07:00
Makefile diff-highlight: add clean target to Makefile 2017-09-06 12:56:26 +09:00
README diff-highlight: split code into module 2017-06-15 12:15:58 -07:00
diff-highlight.perl diff-highlight: split code into module 2017-06-15 12:15:58 -07:00

README

diff-highlight
==============

Line oriented diffs are great for reviewing code, because for most
hunks, you want to see the old and the new segments of code next to each
other. Sometimes, though, when an old line and a new line are very
similar, it's hard to immediately see the difference.

You can use "--color-words" to highlight only the changed portions of
lines. However, this can often be hard to read for code, as it loses
the line structure, and you end up with oddly formatted bits.

Instead, this script post-processes the line-oriented diff, finds pairs
of lines, and highlights the differing segments.  It's currently very
simple and stupid about doing these tasks. In particular:

  1. It will only highlight hunks in which the number of removed and
     added lines is the same, and it will pair lines within the hunk by
     position (so the first removed line is compared to the first added
     line, and so forth). This is simple and tends to work well in
     practice. More complex changes don't highlight well, so we tend to
     exclude them due to the "same number of removed and added lines"
     restriction. Or even if we do try to highlight them, they end up
     not highlighting because of our "don't highlight if the whole line
     would be highlighted" rule.

  2. It will find the common prefix and suffix of two lines, and
     consider everything in the middle to be "different". It could
     instead do a real diff of the characters between the two lines and
     find common subsequences. However, the point of the highlight is to
     call attention to a certain area. Even if some small subset of the
     highlighted area actually didn't change, that's OK. In practice it
     ends up being more readable to just have a single blob on the line
     showing the interesting bit.

The goal of the script is therefore not to be exact about highlighting
changes, but to call attention to areas of interest without being
visually distracting.  Non-diff lines and existing diff coloration is
preserved; the intent is that the output should look exactly the same as
the input, except for the occasional highlight.

Use
---

You can try out the diff-highlight program with:

---------------------------------------------
git log -p --color | /path/to/diff-highlight
---------------------------------------------

If you want to use it all the time, drop it in your $PATH and put the
following in your git configuration:

---------------------------------------------
[pager]
	log = diff-highlight | less
	show = diff-highlight | less
	diff = diff-highlight | less
---------------------------------------------


Color Config
------------

You can configure the highlight colors and attributes using git's
config. The colors for "old" and "new" lines can be specified
independently. There are two "modes" of configuration:

  1. You can specify a "highlight" color and a matching "reset" color.
     This will retain any existing colors in the diff, and apply the
     "highlight" and "reset" colors before and after the highlighted
     portion.

  2. You can specify a "normal" color and a "highlight" color. In this
     case, existing colors are dropped from that line. The non-highlighted
     bits of the line get the "normal" color, and the highlights get the
     "highlight" color.

If no "new" colors are specified, they default to the "old" colors. If
no "old" colors are specified, the default is to reverse the foreground
and background for highlighted portions.

Examples:

---------------------------------------------
# Underline highlighted portions
[color "diff-highlight"]
oldHighlight = ul
oldReset = noul
---------------------------------------------

---------------------------------------------
# Varying background intensities
[color "diff-highlight"]
oldNormal = "black #f8cbcb"
oldHighlight = "black #ffaaaa"
newNormal = "black #cbeecb"
newHighlight = "black #aaffaa"
---------------------------------------------


Using diff-highlight as a module
--------------------------------

If you want to pre- or post- process the highlighted lines as part of
another perl script, you can use the DiffHighlight module. You can
either "require" it or just cat the module together with your script (to
avoid run-time dependencies).

Your script may set up one or more of the following variables:

  - $DiffHighlight::line_cb - this should point to a function which is
    called whenever DiffHighlight has lines (which may contain
    highlights) to output. The default function prints each line to
    stdout. Note that the function may be called with multiple lines.

  - $DiffHighlight::flush_cb - this should point to a function which
    flushes the output (because DiffHighlight believes it has completed
    processing a logical chunk of input). The default function flushes
    stdout.

The script may then feed lines, one at a time, to DiffHighlight::handle_line().
When lines are done processing, they will be fed to $line_cb. Note that
DiffHighlight may queue up many input lines (to analyze a whole hunk)
before calling $line_cb. After providing all lines, call
DiffHighlight::flush() to flush any unprocessed lines.

If you just want to process stdin, DiffHighlight::highlight_stdin()
is a convenience helper which will loop and flush for you.


Bugs
----

Because diff-highlight relies on heuristics to guess which parts of
changes are important, there are some cases where the highlighting is
more distracting than useful. Fortunately, these cases are rare in
practice, and when they do occur, the worst case is simply a little
extra highlighting. This section documents some cases known to be
sub-optimal, in case somebody feels like working on improving the
heuristics.

1. Two changes on the same line get highlighted in a blob. For example,
   highlighting:

----------------------------------------------
-foo(buf, size);
+foo(obj->buf, obj->size);
----------------------------------------------

   yields (where the inside of "+{}" would be highlighted):

----------------------------------------------
-foo(buf, size);
+foo(+{obj->buf, obj->}size);
----------------------------------------------

   whereas a more semantically meaningful output would be:

----------------------------------------------
-foo(buf, size);
+foo(+{obj->}buf, +{obj->}size);
----------------------------------------------

   Note that doing this right would probably involve a set of
   content-specific boundary patterns, similar to word-diff. Otherwise
   you get junk like:

-----------------------------------------------------
-this line has some -{i}nt-{ere}sti-{ng} text on it
+this line has some +{fa}nt+{a}sti+{c} text on it
-----------------------------------------------------

   which is less readable than the current output.

2. The multi-line matching assumes that lines in the pre- and post-image
   match by position. This is often the case, but can be fooled when a
   line is removed from the top and a new one added at the bottom (or
   vice versa). Unless the lines in the middle are also changed, diffs
   will show this as two hunks, and it will not get highlighted at all
   (which is good). But if the lines in the middle are changed, the
   highlighting can be misleading. Here's a pathological case:

-----------------------------------------------------
-one
-two
-three
-four
+two 2
+three 3
+four 4
+five 5
-----------------------------------------------------

   which gets highlighted as:

-----------------------------------------------------
-one
-t-{wo}
-three
-f-{our}
+two 2
+t+{hree 3}
+four 4
+f+{ive 5}
-----------------------------------------------------

   because it matches "two" to "three 3", and so forth. It would be
   nicer as:

-----------------------------------------------------
-one
-two
-three
-four
+two +{2}
+three +{3}
+four +{4}
+five 5
-----------------------------------------------------

   which would probably involve pre-matching the lines into pairs
   according to some heuristic.