Analyzing the Accuracy and Time Costs of Web Application Security Scanners
This paper is intended as a follow-on study to my October 2007 study, “Analyzing the Effectiveness and Coverage of Web Application Security Scanners.” This paper focuses on the accuracy of, and the time needed to run, review, and supplement the results of, the web application scanners (Acunetix, Appscan by IBM, BurpSuitePro, Hailstorm by Cenzic, WebInspect by HP, NTOSpider by NT OBJECTives) as well as the Qualys managed scanning service.
In this study, both ‘Point and Shoot’ (PaS) as well as ‘Trained’ scans were performed for each scanner. In the ‘Trained’ scans, each tool was made aware of all the pages that it was supposed to test, mitigating the limitations of the crawler in the results. This was designed to address a criticism by some security professionals that PaS, the technique used in the 2007 study, is not an appropriate technique to scan web applications and that only manually trained scanning is appropriate.
The study centered around testing the effectiveness of seven web application scanners in the following four areas:
1. Number of verified vulnerability findings using Point and Shoot (PaS)
2. Number of verified vulnerability findings after the tool was ‘Trained’ to find all the links on the site
3. Accuracy of reported vulnerabilities
4. Amount of human time to ensure quality results
Given the large number of vulnerabilities missed by the tools even when fully trained (49%), it is clear that accuracy should still be the primary focus of security teams looking to acquire a web application vulnerability assessment tool.
The results of this study are largely consistent with those of the October 2007 study. NTOSpider found over twice as many vulnerabilities as the average competitor, with a 94% accuracy rating; Hailstorm had the second best rating at 62%, but only after additional training. Appscan had the second best ‘Point and Shoot’ rating at 55%, and the rest averaged 39%. It should be noted that training is time consuming and not really practical for sites beyond 50-100 links. As such, scanners with a large delta between trained and untrained results (Acunetix, BurpSuitePro and Hailstorm) may require additional effort in large scans. One of the most surprising results was the findings for market share leader WebInspect, which consistently landed at the bottom of the pack in its ability to crawl the sites and find vulnerabilities; it missed approximately half of the vulnerabilities on its own test site.
To aid prospective customers reviewing scanners, most vendors provide and host websites which are intentionally vulnerable in various ways. Web application security scanner vendors have seen a large number of vulnerabilities from varying web applications through their research and through their work with their clients. Vendors will often add newly discovered vulnerabilities to their test websites as they look to augment the capabilities of their scanner. As a result, these test applications represent the sum total of thousands of hours of research and thousands of real world scans, and are a fairly good representation of the types of vulnerabilities that exist in the wild. I became curious as to how well the scanners actually audit these test applications, and how other vendors’ scanners would fare against them.
I decided to find out by running each vendor’s scanner against each of the vendors’ test sites and comparing the results. The assumption was that each vendor would do the best against its own test site, and the question would be which vendor would get 2nd place the most often. Part of the purpose of doing it this way is that it is reproducible by anyone with a copy of one of the scanners, and the collected data is being made freely available for anyone to review and re-create. Additionally, the amount of time required to make good use of the scanners was of interest, so each scanner was run in ‘Point and Shoot’ mode and then again after being ‘Trained’ to know all the links and how to populate all the forms. These test sites are fairly small, most being in the 10-20 link range, with one or two in the 75-100 page range. For larger sites the training time could be extrapolated based on the observations in this study (assuming that the full inventory of pages and forms is known to the auditor).
Summary of Results
The full results of the testing are analyzed in further detail later in the report; I will start with an overview of the data and initial conclusions. There are a number of ways to look at the data. Rather than focus on code coverage as in the first report, this time the focus is on comparing the results of the scanners against each other at the heart of what these scanners are supposed to do: finding vulnerabilities. A review of the list of “possible” vulnerabilities that were found or missed offers up some rather interesting conclusions.
Based on all of the criticism of my first report, one would expect to see big differences between ‘Point and Shoot’ and ‘Trained’ scans, but it turns out that normal training yields only moderate improvements. The one exception was Cenzic Hailstorm, which did improve dramatically once a solid effort was made to understand and apply all the configuration variables required for training. The findings from the first report, which showed NT OBJECTives’ NTOSpider with the most findings, followed by IBM Appscan and then HP WebInspect, remain consistent. In fact, WebInspect came in dead last even against the newcomers to this analysis on the software side (Acunetix, BurpSuitePro, Cenzic), and only managed to do a little better than the new Qualys offering.
The False Positive rates were much less significant this time due to the methodology chosen, which focused only on the big impact vulnerabilities (listed in the Methodology section). (In reality most of the scanners had many additional False Positives outside the included categories, which were thus not counted.) NTOSpider remained among the best at avoiding False Positives, along with WebInspect and Acunetix. It is interesting to note that the scanners that missed the most vulnerabilities also tended to report the highest numbers of False Positives: not only do they miss more vulnerabilities, they waste more time when the user is forced to weed out false positives. When looking at scan times, the tests show the fastest scan times in this order: BurpSuitePro, Qualys, WebInspect, NTOSpider, Appscan, Hailstorm, and then Acunetix as the slowest. However, the amount of time the scanner takes to do the scan is largely less relevant, because the most limited resource of a security professional is human time.
When considering the amount of human time needed to run a scan, we generally think of the time to configure and ‘Train’ the scanner, which varied substantially among the scanners. The first scan run for each scanner was ‘Point and Shoot’; based on the results and observed coverage, ‘Training’ was then undertaken in accordance with what appeared to be needed.
Every effort was made during the training process to configure each scanner in every possible way to ensure that it could get its best results. This took time reading the documentation and consulting with a number of experts in each tool, and sometimes with the vendors themselves. The ‘Training’ time does not include this extra time spent learning the tools, but only the time actually spent interacting with the scanner, and should reflect the results you would get from a professional who is proficient in the use of the specific scanner.
The final step was to take the amount of time needed to train the tool to get its best possible results, and then take into account the False Positive and False Negative numbers. A False Positive wastes time during the vetting of the results; in addition, a generally high rate of False Positives erodes trust in the results, which invites additional scrutiny of them. False Negatives also cost time: a security professional must rely less on the tool and spend more time on manual assessment of the application, which ultimately reduces the worth of the automated tool.
By applying a simple formula (described in the Methodology section) to the training time and the False Positive and False Negative counts, we can calculate an Overall Human Time/Cost: a more realistic look at the human time that would be required to perform an audit with 99% confidence that due diligence was provided.
In order to cover as many bases as possible it was decided to run each scanner in two ways:
1. Point and Shoot (PaS): Nothing more than running the default scan options, and providing credentials if the scanner supported them and the site used any.
2. Trained: Any configurations, macros, scripts or other training determined to be required to get the best possible results. As needed, help was requested from the vendors or from acquaintances with expertise in each scanner, to make sure that each was given every possible opportunity to get its best results. This required days of GoToMeetings/WebExes with the “experts” helping tweak the scanners as well as they could. The only scanner that was not trained was Qualys, because it is a managed service offering and training is not a normally available option.
In this review, the number of scanners involved was increased to seven (listed in alphabetical order):
• Acunetix Web Security Scanner (v6.5.20091130) from Acunetix
• Appscan (v184.108.40.206.891) from IBM
• BurpSuitePro (v1.3) from Portswigger.com
• Hailstorm (v6.0 build 4510) from Cenzic
• NTOSpider (v5.0.019) from NT OBJECTives
• Qualys Web Application Scanning from Qualys
• WebInspect (v8.0.753.0) from HP
(Note: WhiteHat Security declined to participate)
Each scanner differs in the types of server attacks it can perform, such as port scanning, application detection, and ‘known vuln checking’. For the purposes of this report, the types of vulnerabilities counted were those that are useful against custom applications, and which most users care about. These are:
• Authentication Bypass or Brute forcing
• SQL Injection / Blind SQL Injection
• Cross Site Scripting / Persistent Cross Site Scripting
• Command Injection
• XPath Injection
• SOAP/AJAX Attacks
• CSRF / HTTP Response Splitting
• Arbitrary File Upload attacks
• Remote File Include (PHP Code Injection)
• Application Errors (only those with debugging data useful for attacking)
The vulnerabilities were recorded in a simple format to cross-reference which scanners found which vulnerabilities and to track and compare the results. I created a complete listing of the vulnerabilities discovered by each tool, then manually verified their authenticity to compile the list of overall “possible” vulnerabilities. This gives a fairly high degree of confidence that the False Positive and False Negative counts are complete.
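The cross-referencing described above amounts to simple set arithmetic: the union of every scanner's verified findings forms the "possible" list, and each scanner's False Negatives are whatever it missed from that list. A minimal sketch, using invented scanner names and findings purely for illustration:

```python
# Hypothetical example data: manually verified findings per scanner,
# keyed as "vulnclass:/page". These names are made up for illustration.
verified_findings = {
    "ScannerA": {"sqli:/login", "xss:/search", "cmdinj:/ping"},
    "ScannerB": {"sqli:/login", "xss:/comment"},
}

# The overall "possible" vulnerability list is the union of all
# verified findings across every scanner.
possible = set().union(*verified_findings.values())

# Each scanner's False Negatives: possible vulnerabilities it missed.
false_negatives = {
    name: possible - found for name, found in verified_findings.items()
}

print(len(possible))                       # 4 distinct verified vulnerabilities
print(sorted(false_negatives["ScannerB"]))
```

The same approach extends to False Positives by subtracting the verified list from each scanner's raw reported findings.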
In order to determine the amount of human time needed to generate quality results, the following formula was used:
Training time + (# False Positives * 15min) + (# False Negatives * 15min)
False Positive: Each finding takes time to confirm, and that time is a loss in the case of a False Positive.
False Negative: Higher rates of False Negatives reduce the auditor's confidence in the tool, demanding that additional manual assessment be undertaken.
* The False Negative multiplier could easily be considered too low, but this number was used to keep things simple.
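The formula above can be sketched as a short function. The 15-minute multipliers come from the report itself; the example scanner numbers below are invented:

```python
def human_time_minutes(training_min, false_positives, false_negatives,
                       cost_per_finding_min=15):
    """Training time + (#FP * 15 min) + (#FN * 15 min), all in minutes."""
    return training_min + (false_positives + false_negatives) * cost_per_finding_min

# Hypothetical scanner: 2 hours of training, 4 False Positives,
# 10 False Negatives.
total = human_time_minutes(120, 4, 10)
print(total / 60)  # 5.5 hours of human time
```

Raising `cost_per_finding_min` for False Negatives, as the note above suggests may be warranted, only widens the gap between accurate and inaccurate scanners.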
For the purpose of clarity, it should be pointed out that the Qualys testing was done in a different manner from the other tools. Having access to the other tools, it was possible to run them in both trained and untrained modes. With Qualys, scans of the sites were ordered and the reports received. In theory, Qualys could have “gamed” the results (by hand testing the sites and inputting data, as opposed to using the tool). There is no reason to believe that was the case, based on: 1) the reputation for honesty of my contact at Qualys, and 2) their results, which were tied for last (their web expert is certainly capable of doing much better had he decided to game the results). Having said that, this paragraph needs to be included for full disclosure.
After the extensive debates over the last report, it was clear that more detailed records of the findings would be included in this report. The full spreadsheet is included as Appendix 1. This section will cover in more detail the results and the experiences/opinions gained during the running and review of the scans.
When looking at the details, it is best to start with the grand total summary.
From this we get a good overview of what was generally experienced across all of the scanning that was performed during this analysis.
Entering this project, it was assumed that each scanner would do best against its own website, and that the task would be finding out which scanner consistently came in second best, and would therefore be the top scanner overall. These assumptions were not radically off: most vendors did do very well against their own test sites, but they did not always win, and some simply missed legitimate vulnerabilities on their own test sites that other scanners could find.
Perhaps the most interesting result is the performance of WebInspect against its own website. These sites are made available to customers to show the power of the scanner during pre-sale evaluations. The companies are obviously aware of most, if not all, of the vulnerabilities on the site, because the vendor created them to be vulnerable. Presumably, the software companies have an incentive to make sure that their scanner performs very well against their own test site, which should also be a part of the vendor's QA process. Yet WebInspect missed half of the vulnerabilities on its own test site, even though every effort was made to train it on each of the pages I knew to be vulnerable. Without belaboring a point made elsewhere, web application scanners are highly complex pieces of software that are easy to break in development without a strong engineering team with good continuity. WebInspect's false negatives against its own site are a significant cause for concern. Acunetix also missed 31% of the vulnerabilities on its two test sites. S|A
Note: This was part 1 of 2. The second half will be published tomorrow, along with the complete version as a PDF.