How do document diff algorithms work?

Well, generally speaking, diff‘ing is usually solved by the Longest common subsequence problem. Also see the “Algorithm” section of the Wikipedia article on Diff:

The operation of diff is based on
solving the longest common subsequence
problem.

In this problem, you have two
sequences of items:

   a b c d f g h j q z

   a b c d e f g i j k r x y z

and you want to find the longest
sequence of items that is present in
both original sequences in the same
order. That is, you want to find a new
sequence which can be obtained from
the first sequence by deleting some
items, and from the second sequence by
deleting other items. You also want
this sequence to be as long as
possible. In this case it is

   a b c d f g j z

From the longest common subsequence
it’s only a small step to get
diff-like output:

   e   h i   q   k r x y 
   +   - +   -   + + + +

That said, this all works fine with text based documents. Since Word Documents are effectively in a binary format, and include lots of formatting information and data, this will be far more complex. Ideally, you could look into automating Word itself as it has the ability to “diff” between documents, as detailed here:

Microsoft Word Tip: How to compare two documents for differences

Leave a Comment