Benchmarking Static Code Analysis Tools on Production Software
13 April 2017
Software quality plays a central role in software development. Ensuring good software quality is especially important in the early phases of the development process, as defects uncovered later are more expensive to fix. Various quality assurance techniques are used to detect and reduce the number of defects. Dynamic techniques, such as testing, are based on running the program with a set of input values expected to produce a certain output, whereas static analysis techniques approximate program inputs and provide guarantees about the program's behaviour. With automatic static code analysis, defects can be discovered early, since executing the program is not necessary.

A comparison was conducted between four competing static code analysis tools in an organization under study within a large telecom company. The underlying motivation for the comparison was to eliminate overlap by reducing the number of static analysis tools in use. Tool effectiveness and usefulness to the organization were identified as the relevant aspects to consider in the comparison. Using the DESMET methodology for evaluating software engineering methods and tools, benchmarking was identified as a suitable approach for comparing the analyzers. Benchmarks were derived from the organization's product's C/C++ source code in two distinct phases: analysing benchmarks based on different versions of the source code, and analysing previously reported and fixed defects. The four static code analysis tools, referred to as tools A, B, C and D due to constraints in license terms, were configured and executed against the benchmarks with their default checkers, and a randomly selected portion of their results was classified into true and false findings.

The results showed Tool C to be the most precise of the compared tools, with most of its findings corresponding to real defects and defect precisions of 0.578 and 0.620 in the first two benchmarks. Tool A had the second-highest defect precisions of 0.292 and 0.473, while Tool B and Tool D both scored defect precisions below 0.200. Tool C and Tool A also produced the most complete results, finding the highest numbers of real defects across all benchmarks: 80 for Tool C and 71 for Tool A. Tool B, while discovering 12 real defects, suffered from excessive false positives. Tool D discovered 37 low-severity defects, and most of its true positive findings concerned coding style.

In conclusion, Tool C was found to be the most effective, as it provided the best combination of precision and completeness. Tool A and Tool C were found to be worth keeping in use in the organization because of their good precision rates and number of defect findings, whereas Tool B and Tool D were considered unnecessary because of the high number of false and insignificant findings they produced. The study also noted the importance of configuring each tool individually, since default configurations do not give an accurate picture of a tool's strengths and weaknesses, and pointed to the need for more focused research into the tools' capabilities.
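To give a feel for what "discovering defects without executing the program" means in practice, the C snippet below contains two issues of the kind such analyzers typically report: a possible null pointer dereference and a file handle leak. The specific defect types and the code itself are illustrative assumptions, not examples taken from the study's benchmarks.

```c
#include <stdio.h>

/* Illustrative only: defects a static analyzer can flag without running the code. */
int read_first_byte(const char *path)
{
    FILE *f = fopen(path, "rb");
    int c = fgetc(f);        /* possible NULL dereference: fopen may have failed */

    if (c == EOF) {
        return -1;           /* resource leak: f is never closed on this path */
    }

    fclose(f);
    return c;
}

int main(int argc, char **argv)
{
    if (argc > 1) {
        printf("%d\n", read_first_byte(argv[1]));
    }
    return 0;
}
```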
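The defect precision figures quoted above follow the usual definition, precision = true findings / (true findings + false findings), computed over the manually classified sample of each tool's results. The sketch below is a minimal illustration of that calculation; the counts are hypothetical and chosen only so the ratio matches the 0.578 figure mentioned above, since the study's actual sample sizes are not given here.

```c
#include <stdio.h>

/* Minimal sketch of the defect precision calculation over a classified sample.
 * The counts below are hypothetical, not the study's data. */
int main(void)
{
    const int true_findings  = 37;   /* findings confirmed as real defects (hypothetical) */
    const int false_findings = 27;   /* findings judged to be false positives (hypothetical) */

    double precision = (double)true_findings /
                       (double)(true_findings + false_findings);

    printf("defect precision = %.3f\n", precision);  /* 37 / 64 = 0.578 for this sample */
    return 0;
}
```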