Comparing Python Diff Libraries

2021-11-03

I’m comparing the performance here between the difflib that’s part of Python and diff_match_patch that’s from Google. I was interested in extracting the word(s) that have been changed between two strings. I started out by using difflib but found that it did weird things with some sequences. But I found those problems then were resolved when using diff-match-patch. I go through some examples here.

Let me first load the libraries and some functions:

import difflib
import diff_match_patch as dmp_module
dmp = dmp_module.diff_match_patch()

def viz_sent_pair(a, b):
    """"
    Will show differences between two pieces of text. Helps to visualize difflib output.
    Got it from: https://stackoverflow.com/questions/39001097/match-changes-by-words-not-by-characters
    """
    s = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
            tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))

Here’s an example original and changed sentence that I will diff:

orig = "bright cove is a source of a lot of video, we got guidance that said, hey, why don't you work with Bright cove and enrich all their video so that when we receive it, it's already been clipper enriched and that way it saves us the step of having to deal with it."
chng = "Brightcove is a source of a lot of video, we got guidance that said, hey, why don't you work with Brightcove and enrich all their video so that when we receive it, it's already been CLIPr enriched and that way it saves us the step of having to deal with it."

Now here is running that with difflib, viz_sent_pair(orig, chng):

insert    a[0:0] --> b[0:177]       '' --> "Brightcove is a source of a lot of video, we got guidance that said, hey, why don't you work with Brightcove and enrich all their video so that when we receive it, it's already "
equal     a[0:1] --> b[177:178]      'b' --> 'b'
replace   a[1:23] --> b[178:230] 'right cove is a source' --> 'een CLIPr enriched and that way it saves us the step'
equal     a[23:27] --> b[230:234]   ' of ' --> ' of '
delete    a[27:238] --> b[234:234] "a lot of video, we got guidance that said, hey, why don't you work with Bright cove and enrich all their video so that when we receive it, it's already been clipper enriched and that way it saves us the step of " --> ''
equal     a[238:261] --> b[234:257] 'having to deal with it.' --> 'having to deal with it.'

You can see that it’s not specific enough here. It should have a simpler replacement. “B” should go to ‘b’ and ‘clipper’ should go to ‘CLIP’ but it lumps a bunch of stuff together.

If you run with diff_match_path then you don’t see this problem.

dmp.diff_main(orig, chng)

[(-1, 'b'),
 (1, 'B'),
 (0, 'right'),
 (-1, ' '),
 (0,
  "cove is a source of a lot of video, we got guidance that said, hey, why don't you work with Bright"),
 (-1, ' '),
 (0,
  "cove and enrich all their video so that when we receive it, it's already been "),
 (-1, 'clippe'),
 (1, 'CLIP'),
 (0,
  'r enriched and that way it saves us the step of having to deal with it.')]

I wrote a function to group some of this output to be easier to read.

  # Ok so I want to now combine this data together to be just equals and replaces
def diff_main_wrapper(orig, chng, to_print=False):
    """Runs `diff_main` and then combines consecutive insertions and deletions together as replace. This makes it easier to visualize the results"""
    ret = dmp.diff_main(orig, chng)

    new_ret = []
    blank_pair = [None, '',''] # first element is label: equal or replace, second element is original str, and third element is changed str
    cur_pair = blank_pair[:]

    for op,txt in ret:
        # Combine inserts/deletes in a row and make into replace
        if op == -1 or op == 1:
            if cur_pair[0] == 'equal':
                new_ret.append(tuple(cur_pair))
                cur_pair = blank_pair[:]
            if cur_pair[0] is None:
                cur_pair[0] = 'replace'
            if op == -1:
                cur_pair[1] += txt
            else:
                cur_pair[2] += txt
        # Combine equals in a row (likely no equals in a row though)
        else:
            if cur_pair[0] == 'replace':
                new_ret.append(tuple(cur_pair))
                cur_pair = blank_pair[:]
            if cur_pair[0] is None:
                cur_pair[0] = 'equal'
            cur_pair[1] += txt
            cur_pair[2] += txt
    if cur_pair[0]:
        new_ret.append(tuple(cur_pair))

    if to_print:
       for l,o,c in new_ret:
           print(l, ": ", "{!r}".format(o), " => ", "{!r}".format(c))

    return new_ret

_ = diff_main_wrapper(orig, chng, True)

replace :  'b'  =>  'B'
equal :  'right'  =>  'right'
replace :  ' '  =>  ''
equal :  "cove is a source of a lot of video, we got guidance that said, hey, why don't you work with Bright"  =>  "cove is a source of a lot of video, we got guidance that said, hey, why don't you work with Bright"
replace :  ' '  =>  ''
equal :  "cove and enrich all their video so that when we receive it, it's already been "  =>  "cove and enrich all their video so that when we receive it, it's already been "
replace :  'clippe'  =>  'CLIP'
equal :  'r enriched and that way it saves us the step of having to deal with it.'  =>  'r enriched and that way it saves us the step of having to deal with it.'

This output looks much better!

 You're up and running! Aligning and Extracting Tokens from BERT 