What is the fastest way to check if files are identical?

Question

I’d opt for something like the approach taken by the cmp program: open two files (say file 1 and file 2), read a block from each, and compare the blocks byte by byte. If they match, read the next block from each and compare again, and so on. If you reach the end of both files without finding a difference, seek back to the beginning of file 1, close file 2, open file 3 in its place, and repeat until you’ve checked every file. I don’t think there’s any way to avoid reading every byte of every file when they are in fact all identical, but this approach is (or is close to) the fastest way to detect a difference when one exists, because it stops at the first mismatching block.
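As a rough illustration, the block-by-block comparison described above might look like the following in Python. This is only a sketch, not cmp's actual implementation: the names `same_content` and `all_identical` and the 64 KiB block size are assumptions made for the example, and it assumes regular files opened in binary mode.

```python
BLOCK_SIZE = 64 * 1024  # assumed block size; tune for your storage


def same_content(ref, other_path, block_size=BLOCK_SIZE):
    """Compare the open binary file `ref` with the file at `other_path`."""
    ref.seek(0)  # rewind file 1 before each new comparison
    with open(other_path, "rb") as other:
        while True:
            a = ref.read(block_size)
            b = other.read(block_size)
            if a != b:
                return False  # first differing block: stop reading
            if not a:
                return True  # both files hit EOF with no difference


def all_identical(paths, block_size=BLOCK_SIZE):
    """Return True if every file in `paths` has the same bytes as the first."""
    paths = list(paths)
    if len(paths) < 2:
        return True
    # Keep file 1 open and swap each remaining file in as "file 2".
    with open(paths[0], "rb") as ref:
        return all(same_content(ref, p, block_size) for p in paths[1:])
```

Because `all()` short-circuits, the loop stops at the first file that differs, and each comparison stops at the first differing block.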

OP Modification: Lifted up important comment from Mark Bessey

“another obvious optimization if the files are expected to be mostly identical, and if they’re relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once.”
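A minimal sketch of that in-memory variant, assuming the reference file comfortably fits in RAM. The function name `all_match_reference` and the block size are hypothetical choices for illustration, not something from the comment itself.

```python
def all_match_reference(paths, block_size=64 * 1024):
    """Return True if every file in `paths` has the same bytes as the first."""
    paths = list(paths)
    if len(paths) < 2:
        return True

    # Read the first file once and keep it entirely in memory.
    with open(paths[0], "rb") as f:
        reference = f.read()
    view = memoryview(reference)  # slicing a memoryview avoids copying

    for path in paths[1:]:
        offset = 0
        with open(path, "rb") as other:
            while True:
                block = other.read(block_size)
                if not block:
                    # EOF: the file matches only if it covered the whole reference.
                    if offset != len(reference):
                        return False
                    break
                # A short slice (file longer than the reference) compares unequal.
                if view[offset:offset + len(block)] != block:
                    return False
                offset += len(block)
    return True
```

Only one file is read from disk at a time here, which is the point of the suggestion: the drive never has to alternate between two files.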
