By Bob Zeidman and Nikolaus Baer
Bob Zeidman is the president of Zeidman Consulting, a contract R&D firm. He is an
experienced expert witness and the developer of CodeSuite®, which is now supported by SAFE
Corporation. He can be reached via email at Bob@ZeidmanConsulting.com.
Nikolaus Baer is a research engineer at Zeidman Consulting. He has utilized the CodeSuite®
software as an expert witness in litigation involving trade secret theft. He can be reached via email at
Nik@ZeidmanConsulting.com.
Introduction
Through our experience working on numerous high-tech intellectual property
disputes, especially software copyright infringement and trade secret theft cases, we have developed
an exact methodology for determining certain types of trade secret theft with regard to software. It has
been our observation that experts often rely heavily on the discovery of similarities in source code.
These experts often use computer code analysis applications to locate portions of code that look similar
and then draw conclusions based upon their analysis of these portions of code. Unfortunately, this
method of analysis can vary greatly and can result in the simplistic and incorrect conclusion that mere
similarities in themselves establish trade secret theft.
The CodeMatch® program that we use is one of several tools available to
experts in the CodeSuite® suite of software analysis tools. It applies a series of algorithms to find
correlations between two sets of software source code files based upon several elements of the code
and also provides a ranking to help better determine the parts of the code on which to focus (a detailed
explanation of the CodeMatch algorithms is given on the website of Software Analysis and Forensic
Engineering Corporation at www.SAFE-corp.biz). However, the discovery of these sections of interest is only the first step of analysis, as
there are many reasons for software correlation, and it is the job of experts to eliminate other possible
explanations before drawing any conclusions about theft. The steps for determining source code trade
secret theft are:
- Determine trade secrets. This step requires the owner of the
software to determine the intellectual property that comprises their trade secrets and then to point out
those sections of code that implement them. The expert must confirm that these sections of code do
represent real trade secrets.
- Determine source code correlation. This step requires the
determination of similarity between the two sets of code, one containing the trade secrets and the other
accused of containing those same trade secrets. Source code correlation is a quantitative measure of
software similarity that is defined in the next section.
- Determine reasons for correlation. There are six reasons that source
code can be correlated. Some of these reasons automatically rule out trade secret theft; others
require further analysis.
- Determine functionality correlation. Next, the correlation must be
examined to determine whether actual functionality of the software is correlated.
- Determine trade secret theft. If functional correlation is present, the
final step involves verifying that the correlated functions are truly trade secrets and producing a
definitive decision about whether that correlation is due to trade secret theft. A public domain literature
search is often part of this process, because the information a company treats as confidential is not
always so.
Determining Trade Secrets
Under the Uniform Trade Secrets Act and the Restatement formulations that still
provide the law in a few states, a trade secret is information valuable to competitors that is not
generally known to others in the industry and that a company has taken precautions to keep secret
from its competitors. General or public knowledge cannot be claimed as a trade secret. With regard to
computer source code, open source code, third party code, and generally known algorithms cannot be
considered trade secrets. Determining trade secrets in computer source code requires gauging first,
whether the source code in question is known outside the company and second, how much effort the
company puts into protecting the source code.
Two or more parties can claim the same trade secret if each one developed it
independently and kept it a secret. For example, researchers at two companies may build on existing
video compression techniques to develop a new, more efficient technique. As happens often in the
fields of engineering and science, these two techniques use the same algorithms because they are
extensions of existing algorithms and both solve the same problem. The two researchers had the same
“eureka” moment. Rather than patenting the algorithms, which would require the
companies to make them public and would cost time and legal fees to get through the patent
office, the companies decide to keep the algorithms secret. Therefore, similarity of information does not
alone signify trade secret theft. Along similar lines, employees who change jobs are free to transfer their
general skills, knowledge, and experience, and that may translate into different companies using
software that shares basic and general similarities. This is why looking only for source code similarities
is a faulty method for determining trade secret theft.
Two lines of source code can be similar or even identical, but if the functionality isn't
significant then it is not a trade secret because the code contains nothing that is not obvious or easily
understood by one of ordinary skill in the art of computer programming. The definition of a trade secret
means that finding a correlation between the source code files of two different programs does not
necessarily mean that illicit behavior has occurred. Finding the correlations is the next step in
determining trade secret theft, but further analysis must usually be performed before conclusions can
be drawn.
Determining Source Code Correlation
Source code correlation is a quantitative, repeatable, deterministic measure of the
similarity of software source code. Source code correlation takes the most important elements of a
program and finds the amount of similarity of each of these elements independently. The overall
correlation is then a combination of the individual correlations for each element. In theory, these
elements could be any elements that are determined to be important. In practice, these elements
are:
- Statements. Statements are lines of code that have functionality. If
a computer program is a recipe for the computer to follow, statements are the steps of the recipe like,
“Mix the flour and water in a large bowl.”
- Comments. Comments are descriptions for the reader of the source
code but have no functionality. In the recipe metaphor, a comment is a description like, “This
step will add flavor to your cake.”
- Identifiers. Identifiers are names of objects in the code such as data
structures and routines. Again, using the recipe metaphor, identifiers would be the names of the
ingredients, like “flour,” “water,” and “sugar.”
- Instruction sequences. Instructions are basic operations for the
computer to perform. The sequence of instructions is the basic steps of the program in the order that
they are performed. In a recipe, the instruction sequence might be “add, add, add, mix, pour,
bake, cool.”
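As a rough illustration of measuring each element independently, the following Python sketch splits two small C-like snippets into the four elements above and scores each one separately. It is not the CodeMatch algorithm: the keyword list, the split_elements and correlate helpers, the Jaccard set overlap, and the use of difflib.SequenceMatcher for instruction sequences are all assumptions chosen for brevity.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keyword list for a tiny C-like language (an assumption, not from CodeMatch).
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "void"}


def split_elements(source):
    """Split source text into statements, comments, identifiers, and an instruction sequence."""
    statements, comments, identifiers, instructions = set(), set(), set(), []
    for raw_line in source.splitlines():
        # Naive split: everything after "//" is treated as a comment.
        code, _, comment = raw_line.partition("//")
        if comment.strip():
            comments.add(comment.strip())
        code = " ".join(code.split())  # normalize whitespace
        if not code:
            continue
        statements.add(code)
        for word in re.findall(r"[A-Za-z_]\w*", code):
            if word in KEYWORDS:
                instructions.append(word)  # order matters for sequences
            else:
                identifiers.add(word)
    return statements, comments, identifiers, instructions


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlate(source_a, source_b):
    """Return one similarity score per element, each computed independently."""
    stm_a, com_a, ids_a, seq_a = split_elements(source_a)
    stm_b, com_b, ids_b, seq_b = split_elements(source_b)
    return {
        "statements": jaccard(stm_a, stm_b),
        "comments": jaccard(com_a, com_b),
        "identifiers": jaccard(ids_a, ids_b),
        "instruction_sequence": SequenceMatcher(None, seq_a, seq_b).ratio(),
    }


# Two made-up snippets: the second is the first with its variables renamed.
original = "int total = 0; // running sum\nfor (i = 0; i < n; i++)\n  total = total + x[i];\nreturn total;"
renamed = "int sum = 0; // running sum\nfor (j = 0; j < n; j++)\n  sum = sum + x[j];\nreturn sum;"
scores = correlate(original, renamed)
```

In this example the renamed copy shares no statement verbatim with the original, yet its comment and its instruction sequence match exactly. Scoring the elements separately keeps that signal visible instead of averaging it away.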
The most widely used tool for determining source code correlation in litigation is
CodeMatch, which is part of a suite of software analysis tools called CodeSuite, available from Software
Analysis and Forensic Engineering Corporation. CodeMatch allows the user to select which elements of
source code should be used to determine correlation. For determining copyright infringement, all
elements are important. For determining trade secret infringement, statement correlation and
instruction sequence correlation are particularly important because these represent actual functionality
whereas comment correlation and identifier correlation represent nonfunctional code, though they often
point to important sections of code that might otherwise be missed.
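The element selection described above can be sketched as a small helper that combines only the chosen per-element scores. The averaging rule, the overall_correlation and review_hints names, and the 0.5 threshold are illustrative assumptions; this is not CodeMatch's actual combination rule.

```python
ALL_ELEMENTS = ("statements", "comments", "identifiers", "instruction_sequence")
FUNCTIONAL_ELEMENTS = ("statements", "instruction_sequence")


def overall_correlation(scores, selected):
    """Average the per-element scores for the selected elements (an assumed rule)."""
    if not selected:
        raise ValueError("select at least one element")
    return sum(scores[name] for name in selected) / len(selected)


def review_hints(scores, threshold=0.5):
    """Nonfunctional elements whose score is high enough to warrant a closer look."""
    nonfunctional = [n for n in ALL_ELEMENTS if n not in FUNCTIONAL_ELEMENTS]
    return [n for n in nonfunctional if scores[n] >= threshold]


# Hypothetical per-element scores for one pair of files.
scores = {
    "statements": 0.25,
    "comments": 1.0,
    "identifiers": 0.75,
    "instruction_sequence": 0.75,
}
copyright_view = overall_correlation(scores, ALL_ELEMENTS)
trade_secret_view = overall_correlation(scores, FUNCTIONAL_ELEMENTS)
hints = review_hints(scores)
```

Here the functional-only view is lower than the all-elements view, while the high comment and identifier scores are still reported as hints about where to look.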
Determining Reasons for Correlation
High correlation between two sets of software source code can be attributed to
several factors. These factors are:
- Common Algorithms. An algorithm is a procedure or a set of
instructions for accomplishing some task. In a given programming language there may be an easy or
well-understood way of writing a particular algorithm that most programmers use. For example, there
might be a standard way to sort a list of names alphabetically. Perhaps this algorithm is taught in most
programming classes at universities or is found in a popular programming textbook. These
commonly used algorithms will show up in many different programs, resulting in a high degree of
correlation between the programs even though there was no direct contact between the
programmers.
- Common Identifier Names. Certain identifier names are commonly
taught in schools or commonly used by programmers in certain industries. For example, the identifier
“result” is often used to hold the result of an operation. These identifiers will be found in
many unrelated programs and will result in these programs being correlated.
- Third-Party Source Code. It is possible that widely available open
source code is used in both programs. Also, libraries of source code can be purchased from third-party
vendors. If two different programs use the same third-party code, the programs will be correlated.
- Code Generation Tools. Automatic code generation tools, such as
Microsoft Visual Basic or Adobe Dreamweaver, generate software source code that looks very similar
with similar and often identical elements. The structure of the code generated by these tools tends to fit
into specific templates with identifiable patterns. Two different programs that were developed using the
same code generation tool will be correlated.
- Common Author. It is possible that one programmer, or
“author,” will create two programs that have correlation simply because that programmer
tends to write code in a certain way. This is the programmer’s style of coding. Thus two
programs written by the same programmer can be correlated due to the style being similar even
though there was no copying and the functionality of each program is different from that of the
other.
- Copied Code (Authorized or Plagiarized). Code was copied from one
program to another, causing the programs to be correlated. The copying may have taken place for only
certain sections of the code and may include small or significant changes to the code. When each of the
previous reasons for correlation has been eliminated, the reason that remains is copying. If the copying
was not authorized by the original owner, then it constitutes plagiarism.
Once correlation is found, it is important for an expert to determine which of the
above reasons account for the correlation. At that point, the analysis continues into a new phase to
determine functional correlation.
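Part of this elimination can be mechanized before the expert reviews what is left. The sketch below drops matched identifiers that appear on a list of common names and flags files whose contents match a known third-party file. The list contents, the function names, the sample data, and the use of SHA-256 fingerprints are hypothetical choices for illustration, and exact hashing only catches third-party files that were not modified.

```python
import hashlib

# Hypothetical list of identifiers common enough to explain a match on their own.
COMMON_IDENTIFIERS = {"result", "i", "j", "count", "index", "temp"}


def fingerprint(text):
    """SHA-256 digest of a file's text, used to recognize known third-party files."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def residual_identifiers(matched):
    """Matched identifiers that are NOT explained by common naming habits."""
    return sorted(set(matched) - COMMON_IDENTIFIERS)


def is_third_party(text, known_fingerprints):
    """True when the file text is identical to a known third-party file."""
    return fingerprint(text) in known_fingerprints


# Made-up data: one known library file and three matched identifiers.
library_file = "int crc(int x) { return x; }\n"
known = {fingerprint(library_file)}
leftover = residual_identifiers(["result", "i", "frameDeltaQuant"])
```

Only the distinctive identifier survives the filter, so it is the one that still needs an explanation. The remaining reasons, such as common authorship or code generation tools, still call for the expert's judgment.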
Determining Functionality Correlation
The next phase in determining trade secret theft rests upon the definition of a
trade secret and the functionality of the code in question. Correlation due to how the code functions is
called functional correlation. For each of the six reasons for correlation, we can look further to
determine whether there is also a functional correlation. Some reasons for correlation can be used to
rule out trade secret theft while other reasons require still further analysis.
A. Common Algorithms
When two programs are correlated because they both use algorithms that
are commonly used in the industry or by programmers in general, that is a form of functional correlation.
However, because these algorithms are commonly used and understood, this type of correlation should
not be taken as proof of trade secret theft.
B. Common Identifier Names
Identifiers have no inherent functionality and therefore cannot be a factor in
functional correlation. In addition, the use of common identifiers that are widely used throughout the
industry or by programmers in general cannot be a form of functional correlation. In cases where
software is correlated only because of common identifiers, trade secret theft can usually be ruled
out.
C. Third-Party Source Code
When the correlation of two programs is due to the use of third party code, such as
open source code or purchased libraries of source code, there is no trade secret theft involved (unless
of course the third party code contains trade secrets). There may still be a concern, however, if a
programmer has stolen a third party’s confidential code that was licensed to his former
employer. In that case, the third party may have its own trade secret concerns with the
programmer.
D. Code Generation Tools
The structure of the code generated by automatic code generation tools tends to fit
into specific templates with identifiable patterns. If the correlation between two different programs arises
solely because they were developed using the same code generation tool, the correlation is
not due to trade secret theft. Note, however, that the code must be further examined to determine
whether the automatically generated code performs the same function in the same way, which would mean
that the code is functionally correlated. If the programmer applied enough expertise in guiding the code
generation tool to generate the code, the result could still constitute a trade secret.
E. Common Author
Two programs can be correlated because they were both written by the same
programmer or programmers. If the correlation between two different programs arises solely because they
were developed by the same people, the correlation is not due to trade secret theft. Note,
however, that the code must be further examined to determine whether the code also performs the
same function in the same way, which would mean that the code is functionally correlated. Common
authorship can be a hint that the functionality will also correlate, but it does not necessarily mean that is
the case.
F. Copying
When all other reasons for correlation have been eliminated, what remains is
copying. Obviously, correlation due to copying means that the two programs are also functionally
correlated.
Once functional correlation is found, the next step is for an expert to determine
whether that functional correlation is due to trade secret theft.
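The treatment of the six reasons in sections A through F can be summarized as a lookup table. This is a simplified sketch: the Reason, NextStep, and triage names are hypothetical, and the table flattens caveats from the text, such as third-party code that itself contains trade secrets.

```python
from enum import Enum, IntEnum


class Reason(Enum):
    COMMON_ALGORITHMS = "A"
    COMMON_IDENTIFIERS = "B"
    THIRD_PARTY_CODE = "C"
    CODE_GENERATION_TOOLS = "D"
    COMMON_AUTHOR = "E"
    COPIED_CODE = "F"


class NextStep(IntEnum):
    # Ordered from least to most demanding follow-up.
    LIKELY_RULED_OUT = 0
    EXAMINE_FUNCTIONALITY = 1
    ASSESS_TRADE_SECRET = 2


TRIAGE = {
    Reason.COMMON_ALGORITHMS: NextStep.LIKELY_RULED_OUT,       # commonly known, not proof of theft
    Reason.COMMON_IDENTIFIERS: NextStep.LIKELY_RULED_OUT,      # identifiers carry no functionality
    Reason.THIRD_PARTY_CODE: NextStep.LIKELY_RULED_OUT,        # unless that code holds trade secrets
    Reason.CODE_GENERATION_TOOLS: NextStep.EXAMINE_FUNCTIONALITY,
    Reason.COMMON_AUTHOR: NextStep.EXAMINE_FUNCTIONALITY,
    Reason.COPIED_CODE: NextStep.ASSESS_TRADE_SECRET,          # copying implies functional correlation
}


def triage(reasons):
    """Return the most demanding follow-up among the reasons found for a correlation."""
    reasons = list(reasons)
    if not reasons:
        raise ValueError("at least one reason for correlation is required")
    return max(TRIAGE[reason] for reason in reasons)
```

For example, triage([Reason.COMMON_IDENTIFIERS, Reason.COMMON_AUTHOR]) returns NextStep.EXAMINE_FUNCTIONALITY, because common authorship still requires a look at what the code does.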
Determining Trade Secret Theft
The question at this point is whether the functions that the code is performing are
actually unique to the company claiming the trade secrets or whether they are within the general
knowledge of software engineers. Anything that is general or public knowledge cannot be claimed as a
trade secret. This also means that if the functionality isn't significant then it might not be a trade
secret.
Finally, a trade secret does not mean that there can only be one owner of the
proprietary information. Unlike with a patent, two or more parties can claim the same trade secret on
the same code if each one developed it independently and kept it a secret. Two parties could even have
identical trade secrets. Therefore, functional correlation does not alone signify a trade secret, and a
proper expert examination should also include an analysis of how development occurred and what
means the owner used to keep the code secret. The development process and the effort to protect code
can be very important in determining trade secrets and trade secret theft.
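The questions raised in this final step can be written down as a checklist. The sketch below is only a way of organizing the expert's findings; the Finding fields and the supports_theft_claim name are hypothetical, and each field records a judgment the expert must still reach independently.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    functionally_correlated: bool  # outcome of the functionality analysis
    generally_known: bool          # e.g. found in a public domain literature search
    kept_secret_by_owner: bool     # the owner took precautions to protect the code
    independently_developed: bool  # the accused party developed it on its own


def supports_theft_claim(finding):
    """True only when every condition discussed in this section points toward theft."""
    return (
        finding.functionally_correlated
        and not finding.generally_known
        and finding.kept_secret_by_owner
        and not finding.independently_developed
    )


# Made-up findings: the same correlation, with and without independent development.
copied = Finding(True, False, True, False)
parallel_work = Finding(True, False, True, True)
```

The two sample findings differ only in whether the code was independently developed, which is enough to change the outcome even though the functional correlation is identical.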
Conclusion
Because trade secrets can be claimed at many levels of abstraction, trade secret
theft may occur in uncorrelated code because the architecture or high-level design of the code, rather
than the code itself, contains trade secrets. That aspect of trade secret theft is beyond the scope of this
article, and it should not be understood that source code correlation is the only way to uncover trade
secret theft. However, when dealing with trade secrets in the source code, source code correlation is a
very valuable tool.
Most experts use some form of software analysis tool to examine code for trade
secret theft, but a lack of a common methodology makes it difficult to draw and discuss succinct
conclusions. By viewing the discovery of correlation as an initial step in a method for determining trade
secret theft, experts will be able to more quickly and accurately analyze cases and draw more objective
conclusions. Experts can move to the essence of a trade secret case by running correlated sections of
source code through the various tests described in this article.
The proper use of this methodology will make trade secret theft cases more
straightforward and objective. Without the use of correlation tools and a standard for interpreting the
results, cases can end up disorganized and subjective. Instead of being able to succinctly narrow a
mass of raw source code down to the essence of the trade secret arguments, experts and lawyers can
get lost in argument over which metrics to use in order to examine the code; the case can deteriorate
into confusing interpretation and spin. Using the proper methodology helps direct the findings in a
manner that all parties can agree upon. The relevant code will quickly become apparent. When experts
on both sides of litigation use the same methodology, the case is simplified and streamlined because it
is based upon the merits of their arguments applied against a set of standards. Correlation tools and
standards will enable experts and lawyers to better utilize their skills and knowledge and better support
their clients.