Data Compression Using Long Common Strings

29 March 1999

New Image

White [1967] proposed compressing text by "replacing [a] repeated string by a reference to [an] earlier occurrence". Ziv and Lempel [1977, 1978] implemented this idea by cleverly representing strings that occur in a relatively small sliding window. We extend the basic idea to represent long common strings that may appear far apart in the input text. On typical English text, our method provides little compression; there are few long common strings to be exploited. Some files, though, do contain repeated long strings. Baker [1995] documented significant repetition in the code of large software systems. In a mathematical subroutine library, we found many blocks of code repeated across functions of type float, double, complex and double complex; our method combined with a standard compression algorithm reduced the file to half the size given by the standard algorithm alone. Corpora of real documents, such as correspondence, news articles, or netnews often contain long duplications due to quoting or republication, or even plagiarism.