CPD

Randall Maas 10/3/2010 7:25:27 AM

This time I am going to describe a copy-paste detector I wrote several years ago. This tool builds on the duplicate file finder I described earlier. The concept was introduced to me by Tom Copeland's article "Detecting Duplicate Code with PMD's CPD", published on the O'Reilly website in 2003. That tool is part of the "PMD" toolset. Those tools proved invaluable at the time, because I was working with others on a set of large Java-based programs.

As things are wont to do, I went on to other projects that used the C, C++, and C# languages. However, PMD (and its CPD) did not support these languages. I looked at how hard it would be to port PMD's CPD, but I abandoned that idea when I saw that CPD was part of the whole PMD toolset, and porting it was going to be a lot of work. Instead, I decided to write my own.

Basics of how copy-paste detection works

The copy-paste detector conceivably could keep track of every possible substring in every file (of every length, from one element all the way up to the end of the file).
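
Rough arithmetic shows the scale: a file with N tokens has N(N+1)/2 possible substrings, so a thousand-line file with, say, eight tokens per line (about 8,000 tokens) has on the order of 32 million substrings to track, and that is just one file.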

This is not practical for two reasons. (I think which reason you see as more obvious says a lot about you):

It turns out that you can deal with the combinatorial explosion for small projects (maybe a couple of dozen files of up to a thousand lines each). Honestly, that surprised me. But then again, small projects probably do not have enough redundancy to need this application.

The parts of the program

One of the implicit goals was that the executable work stand-alone: that I would not have to include an external DLL.

File Reader

I outlined the beginnings of how it recursively scans the file system when I described my "duplicate file finder". But I will go into a bit more detail about the file reader this time. The lowest layer is designed as a wrapper around the OS file system interfaces. There is a Win32 implementation, and another for UNIX. Primarily this is a .h file that #defines or declares procedures with a common naming and calling convention for the OS wrapper. The file IO wrappers for Win32 look like:

typedef HANDLE FileHandle_t;
#define File_open(name)            CreateFileA (name, \
                                                GENERIC_READ,           /* Open for reading  */ \
                                                FILE_SHARE_READ,        /* Share for reading */ \
                                                NULL,                   /* No security       */ \
                                                OPEN_EXISTING,          /* Existing file only*/ \
                                                FILE_ATTRIBUTE_NORMAL,  /* Normal file       */ \
                                                NULL)                   /* No template file  */

#define File_close(fd)             CloseHandle(fd)
#define File_isGood(fd)            ((fd) != INVALID_HANDLE_VALUE)
#define File_read(fd,buf,bufSize,numRead)  ReadFile(fd, buf, (DWORD) (bufSize), numRead, NULL)
#define File_seek(fd,ofs)        SetFilePointer(fd,ofs,0,FILE_BEGIN)
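
For comparison, a minimal sketch of what the UNIX side might look like, using the POSIX open/read/close/lseek calls (the exact flags, and treating numRead as pointing at an ssize_t, are my assumptions, not the actual code):

/* Assumes <fcntl.h> and <unistd.h> */
typedef int FileHandle_t;
#define File_open(name)            open((name), O_RDONLY)     /* Open for reading */
#define File_close(fd)             close(fd)
#define File_isGood(fd)            ((fd) >= 0)                /* -1 means failure */
#define File_read(fd,buf,bufSize,numRead) \
                                   ((*(numRead) = read((fd), (buf), (bufSize))) >= 0)
#define File_seek(fd,ofs)          lseek((fd), (ofs), SEEK_SET)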

We are dealing with text files, which are small, so we could read the whole file into memory for processing. It would be exceptional for a text file to be 30MB (most editors I have seen would slow down and take a very long time to load one), but even then it would be something that would work on a desktop computer. I have done this in some projects.

However, in this project I already had a stock of code for reading and parsing text files. (The spoils of earlier projects writing interpreters and compilers, primarily for Prolog.) For the IO portion, the code is simple: it is a structure that holds a buffer and the number of bytes in the buffer. When the upper layers need more data in order to process properly, the following happens (a sketch follows the list):

  1. The remaining bytes (if any) in the buffer are moved to the front of the buffer. (These remaining bytes are the first part of a token or other structure that needs to be fully read in order to be processed.)
  2. A File_read() is called to fill the rest of the buffer.
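
A minimal sketch of that refill step might look like the following, assuming the Win32 wrappers above (plus <windows.h> and <string.h>); the Buffer_t structure and field names are my own, not the actual code:

typedef struct {
    FileHandle_t fd;         /* The open file                          */
    char         buf[4096];  /* The IO buffer                          */
    unsigned     numUsed;    /* Bytes already consumed by the lexer    */
    unsigned     numInBuf;   /* Valid bytes currently in the buffer    */
} Buffer_t;

/* Keep the unconsumed tail, then top the buffer up from the file */
static void Buffer_refill(Buffer_t* B)
{
    unsigned numLeft = B->numInBuf - B->numUsed;
    DWORD    numRead = 0;

    /* 1. Move the remaining bytes (the start of a partially read token) to the front */
    memmove(B->buf, B->buf + B->numUsed, numLeft);
    B->numUsed  = 0;
    B->numInBuf = numLeft;

    /* 2. Fill the rest of the buffer with a File_read() */
    if (File_read(B->fd, B->buf + numLeft, sizeof(B->buf) - numLeft, &numRead))
        B->numInBuf += (unsigned) numRead;
}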

Lexer

The next layer is the "lexer" or "tokenizer". This is the stage that simplifies everything by stripping out comments and white space.

The lexer isn't too sophisticated. First, it maps each character to the kinds of token it can be part of:

This mapping is stored in a table, one entry per ASCII character. The one used is broadly generic to C, C++, Java, C#, and most programming languages.
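
As a rough illustration (the class names and the handful of entries shown are my guesses, not the original table), such a table might look like:

/* What kind of token can this character be part of? */
typedef enum {
    CC_WHITESPACE,   /* space, tab, newline: skipped                */
    CC_ALPHA,        /* letters and '_': identifiers and keywords   */
    CC_DIGIT,        /* 0-9: numbers, and the tail of identifiers   */
    CC_QUOTE,        /* " and ': starts a string or char literal    */
    CC_PUNCT         /* operators, braces, and other punctuation    */
} CharClass_t;

/* One entry per ASCII character (C99 designated initializers; a real
   table fills in all 128 entries) */
static const CharClass_t CharClass[128] = {
    [' ']  = CC_WHITESPACE, ['\t'] = CC_WHITESPACE, ['\n'] = CC_WHITESPACE,
    ['a']  = CC_ALPHA,      ['_']  = CC_ALPHA,      /* ... b..z, A..Z ... */
    ['0']  = CC_DIGIT,                              /* ... 1..9 ...       */
    ['"']  = CC_QUOTE,      ['\''] = CC_QUOTE,
    ['+']  = CC_PUNCT,      ['{']  = CC_PUNCT,      /* and so on          */
};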

The lexer then groups strings of items together with the following rules:

For more accurate _parsing_, the mapping tables have to be done on an almost per-language basis, with finer distinctions about what may start a token, what may compose it, and how comments are delimited. Many languages would share such tables and rules, but one lexer won't do for all.

The API for the lexer is, basically, a procedure called 'ScanToken'. It returns the token string, the file name it came from, the line and column in the file at which the token starts, and the type of token (number, string, punctuation, or alphanumeric).
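
In rough terms it looks something like the following; the exact types and names are my guesses:

typedef struct Lexer Lexer_t;   /* Opaque lexer state (assumed) */

typedef enum { TOK_NUMBER, TOK_STRING, TOK_PUNCTUATION, TOK_ALPHANUMERIC } TokenKind_t;

typedef struct {
    char         Text[256];   /* The token string                        */
    TokenKind_t  Kind;        /* Number, string, punctuation, or alpha   */
    const char*  FileName;    /* The file the token came from            */
    int          Line;        /* Line where the token starts             */
    int          Column;      /* Column where the token starts           */
} Token_t;

/* Fills in *Token and returns nonzero, or returns 0 at the end of the input */
int ScanToken(Lexer_t* Lexer, Token_t* Token);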

Treating streams of tokens as strings

At this point we can produce a stream of tokens, each a little string. In essence, all the tokens for a file are joined back together, with a space between them, to make a single cleaned-up string for each file. Then it's only a matter of finding the longest common substring between files.

(Note: I don't quite just join the tokens together. Along the way the program has to track where each point in this cleaned-up file falls in the original file.)
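
One way to do that bookkeeping (the structure below is illustrative, not the actual code) is to keep, alongside each cleaned-up file, an array with one record per token; a match reported at some offset in the cleaned-up string can then be mapped back to a line and column by searching this array:

/* Where a token in the cleaned-up string came from */
typedef struct {
    unsigned     CleanOfs;   /* Offset of the token in the cleaned-up string */
    const char*  FileName;   /* The original file                            */
    int          Line;       /* Line in the original file                    */
    int          Column;     /* Column in the original file                  */
} TokenOrigin_t;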

I chose to use a Karp-Rabin-style algorithm to find the longest substrings. The short version of the algorithm is:

KarpRabin-esque(string* CleanedupFile, int Length)
{
     for (LengthIdx = 32, TargetLength = 1 << LengthIdx; TargetLength >= MinLength; TargetLength >>= 1, LengthIdx--)
     for (Ofs = 0; Ofs + TargetLength <= Length; Ofs++)
     {
         Substring = string copy(CleanedupFile + Ofs, TargetLength)
         make a SHA1 or MD5 hash of Substring;
         Store the hash, the file name, the CleanedupFile pointer, Ofs, and the location of the substring in the original file into HashTable[LengthIdx];
     }
}

When it stores this information in the hash table (described in "duplicate file finder"), it checks for a collision with an entry that has the same SHA1 hash. If it finds a collision, it adds the pair to the list of things to report. (There is some checking here that it isn't already reporting about that area of the files.) The strings always start and end on a token boundary.
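
Roughly, each hash-table entry holds what the pseudocode above stores; the field names here are illustrative, and uint8_t comes from <stdint.h>:

typedef struct {
    uint8_t      Hash[20];       /* SHA1 of the substring                      */
    const char*  FileName;       /* Which file the substring came from         */
    const char*  CleanedupFile;  /* Pointer to that file's cleaned-up text     */
    unsigned     Ofs;            /* Offset of the substring within it          */
    int          Line, Column;   /* Where the substring starts in the original */
} SubstringEntry_t;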

The CPD algorithm cheats in two ways.

  1. A true hash-based match would go on to compare the actual substrings.
  2. The Karp-Rabin substrings are hashed at power-of-two lengths, so the substring matching should use that length only as a starting point: it should compare past the end of the TargetLength that the string was hashed over, to find as much of the duplicate as possible (a sketch follows).
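
Extending a match past the hashed length would be straightforward; a sketch of what that second step could look like (not what CPD actually does):

/* The two regions are known to match for the first TargetLength characters;
   see how much further the match runs. */
static unsigned ExtendMatch(const char* A, unsigned ARemaining,
                            const char* B, unsigned BRemaining,
                            unsigned TargetLength)
{
    unsigned Len = TargetLength;
    while (Len < ARemaining && Len < BRemaining && A[Len] == B[Len])
        Len++;
    return Len;
}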

The SHA1 hash has so low a probability of a false hit that I felt a deeper string comparison was overkill. Similarly, I wanted to find where the duplicate code was, and wasn't going to spend a lot of time worrying whether CPD found the very last character of it. I would be opening the two portions of code in a text editor anyway and making judgments based on them.

When CPD runs low on memory, it drops the hash table of the shortest substrings and increases the minimum substring length it stores. (The minimum substring length has a default value to prevent reporting the use of a single variable over and over; it can also be set with a command line option.)

To allow for special cases, like copied-and-pasted code whose variables were then renamed, the program allows treating all strings as the same token ("$string"), or all alphanumeric tokens as the same token ("$var").
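
Using the Token_t sketched earlier, that substitution amounts to something like this (the flag names are made up):

/* Optionally fold literals and identifiers so renamed copies still match */
static const char* NormalizeToken(const Token_t* Token, int FoldStrings, int FoldVars)
{
    if (FoldStrings && Token->Kind == TOK_STRING)       return "$string";
    if (FoldVars    && Token->Kind == TOK_ALPHANUMERIC) return "$var";
    return Token->Text;
}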

Stock algorithms

This tool was also one of the motivations for writing the custom memory allocator that I described earlier. I said the combinatorial explosion is tractable on a modern computer (gigabytes of VM), helped by the compaction from the Karp-Rabin algorithm, and so forth. But I still wanted to be able to redistribute my CPD without extra C runtime libraries. So I needed a memory allocator that could handle heavy memory usage. You can read more about the design in my custom memory allocator write-up.

The future

Although I described the basics of the hash table I am using in my description of my "duplicate file finder", in the future I could discuss the particular implementation in more detail.