By Ilana Shay of and Nik Baer and Bob Zeidman of Zeidman Consulting
There are several different methods of comparing source code from different programs to find
copying [1]. Perhaps the most common method is comparing source code statements, comments,
strings, identifiers, and instruction sequences.
However, there are anecdotes about the use of whitespace patterns in code. These virtually
invisible patterns of spaces, tabs, and newlines have been used in litigation to imply copying,
but no formal study has been performed that shows that these patterns can actually identify
copied code. This paper presents a detailed study of whitespace patterns and the uniqueness of
these patterns in different programs. We decided to investigate whitespace file patterns and
determine whether comparing whitespace patterns in different files is a reliable method to
measure code similarity and thus detect copying.
When writing code, the programmer is focused on the visual elements: statements, comments,
variable names, and strings. During the writing process the programmer also uses non-printing
characters to separate the program's visual elements. The non-printing characters can be spaces,
tabs, or newlines. The sequence of these non-printing characters is the whitespace pattern.
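To make the definition concrete, the following minimal Python sketch extracts the whitespace
pattern of a source file by discarding every printing character. The function names
whitespace_pattern and readable are hypothetical, and folding "\r\n" line endings into a single
newline is an assumption of this sketch rather than something stated in the text.

```python
def whitespace_pattern(source: str) -> str:
    """Return only the spaces, tabs, and newlines of `source`, in order."""
    # Assumption of this sketch: treat "\r\n" and lone "\r" as one newline,
    # so the same file saved on different platforms yields the same pattern.
    normalized = source.replace("\r\n", "\n").replace("\r", "\n")
    return "".join(ch for ch in normalized if ch in " \t\n")


def readable(pattern: str) -> str:
    """Render a whitespace pattern with visible letters: S, T, N."""
    return pattern.translate(str.maketrans({" ": "S", "\t": "T", "\n": "N"}))


sample = "int main()\n{\n\treturn 0;\n}\n"
print(readable(whitespace_pattern(sample)))  # prints SNNTSNN
```

Here S, T, and N stand for space, tab, and newline. All statements, names, and strings are
dropped, so only the layout of the file remains.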
We will score file pairs based upon a percentage of similarity of their whitespace pattern...
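As an illustration only, a percentage score of this kind could be sketched as below. The scoring
formula is not given in this section, so the sketch substitutes the similarity ratio from
Python's standard difflib module as a stand-in; the function names whitespace_pattern and
whitespace_similarity are hypothetical, and the line-ending normalization is an assumption.

```python
from difflib import SequenceMatcher


def whitespace_pattern(source: str) -> str:
    """Return only the spaces, tabs, and newlines of `source`, in order."""
    # Assumption: fold "\r\n" and lone "\r" into a single newline.
    normalized = source.replace("\r\n", "\n").replace("\r", "\n")
    return "".join(ch for ch in normalized if ch in " \t\n")


def whitespace_similarity(source_a: str, source_b: str) -> float:
    """Percentage (0 to 100) similarity of two files' whitespace patterns.

    Stand-in metric: difflib's ratio, 2*M/T, where M is the number of
    matched characters and T is the combined length of both patterns.
    """
    pattern_a = whitespace_pattern(source_a)
    pattern_b = whitespace_pattern(source_b)
    # autojunk=False: the alphabet has only three symbols, so difflib's
    # "popular element" heuristic would otherwise discard them in long files.
    matcher = SequenceMatcher(None, pattern_a, pattern_b, autojunk=False)
    return 100.0 * matcher.ratio()


file_a = "int x = 1;\nint y = 2;\n"
file_b = "foo a = 9;\nbar b = 8;\n"
print(whitespace_similarity(file_a, file_b))  # prints 100.0
```

The two sample files share no identifiers or values yet score 100 under this stand-in metric,
because their spacing is identical. That is the kind of case a study of the uniqueness of
whitespace patterns has to account for.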