Abstract. Eight sites participated in the second DARPA off-line intrusion detection evaluation in 1999. A test bed generated live background traffic ...
The forensic information provided in identification list files was generally accurate for attacks that were correctly detected. Table 6 shows results for four high-performance systems that provided all optional identification information. The first column in this table shows the system type. The second column shows the total number of attacks detected by each system (at the highest false alarm rate), followed by a slash and the number of in-spec attacks that the system should have detected as specified in its system description. The first two expert systems each detected roughly 80 attacks. They were combined systems that could have detected a maximum of roughly 170 in-spec attacks using both host-based and network-based input data. The third system used network sniffing data alone and thus had fewer in-spec attacks (102), and the fourth system used only Solaris file-system information and thus had only 27 in-spec attacks. The remaining columns show the accuracy of the identification information provided for detected attacks. The third column shows the percentage of detected attacks where the attack category label was correct. The fourth column shows the percentage of detected attacks where the names of old attacks were correct. Participants were provided a list of names for old attacks before the evaluation was run, and these names were used to label attacks. Entries in this column apply only to old attacks that were detected.
This table shows that the additional identification information provided was generally accurate for attacks that were correctly detected. For example, for the first expert system, the attack category and name are correct roughly 90% of the time, and the victim ports and source IP addresses are correctly identified for more than 70% of the detected attacks. The upper three systems in Table 6 all used network sniffing data and provided good identification performance. The last system, which performed forensic analysis, was host-based. Its good performance suggests that much of the required identification information can be obtained from a host-based analysis that doesn't rely on audit data.
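The per-column accuracy figures described above amount to a simple tally over detected attacks. The sketch below illustrates that tally; the record layout and field names are illustrative assumptions, not the evaluation's actual data format:

```python
# Hedged sketch: tallying identification accuracy over detected attacks,
# as in Table 6. The dictionary layout and field names are assumptions.

def field_accuracy(detections, field):
    """Percent of detected attacks whose reported `field` matches the truth."""
    if not detections:
        return 0.0
    correct = sum(1 for d in detections
                  if d["reported"][field] == d["truth"][field])
    return 100.0 * correct / len(detections)

# Hypothetical example: one correct and one incorrect identification.
detections = [
    {"reported": {"category": "DoS", "name": "neptune"},
     "truth":    {"category": "DoS", "name": "neptune"}},
    {"reported": {"category": "R2L", "name": "dict"},
     "truth":    {"category": "DoS", "name": "smurf"}},
]
print(field_accuracy(detections, "category"))  # 50.0
```

The same function applied to the "name" field, restricted to old attacks, would reproduce the fourth column.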
All systems in Table 6 also provided attack start times as optional identification information. These times were computed by participating systems using off-line data with no constraints on look-ahead, and thus they do not necessarily represent times that could be provided by real-time system implementations. Start-time accuracy was generally good for R2L and DoS attacks: latencies were less than 15 seconds for more than 80% of these attacks. Start-time accuracy was not as good, and differed across systems, for probe and U2R attacks. Start times were provided for probe attacks by the first three systems in Table 6. The third system (DMine) correctly identified the start of all probes to within 15 seconds, while the first two expert systems had start-time latencies that were often delayed by many minutes for slower probes that spanned long time intervals.
The first two expert systems and the last system in Table 6 provided start times for U2R attacks. These attacks were unique because many of them included multiple separate telnet interactions separated by long time intervals, and others were performed as part of long single telnet sessions containing many normal user commands. In attacks that included multiple telnet sessions, initial sessions were run at user privilege level to prepare for the attack. The actual attack, which provided root-level privilege on UNIX machines, was run only in subsequent sessions. Results for the first two expert systems in Table 6 and for the last forensic analysis system differ dramatically for these U2R attacks. The first two systems detected the time instant where the attacker became root, while the forensic analysis system traced the beginning of the attack either to the beginning of the first session where attack setup actions occurred or to the beginning of the telnet session where the attack occurred.
Start times for 6 of the 8 U2R attacks detected by the forensic analysis system were within 15 seconds of the true start times, while start times for more than 90% of the U2R attacks detected by the first two expert systems were delayed by more than a minute from the true attack times. These results suggest that the forensic analysis system accurately correlates information across multiple network sessions to arrive at accurate start times, while the two expert systems use the time of the root-privilege elevation as the start time.
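The start-time criterion used above can be sketched as a small scoring function. The 15-second threshold comes from the text; the data layout and function names are assumptions for illustration:

```python
# Hedged sketch: scoring attack start-time latency against ground truth.
# The 15-second threshold reflects the evaluation text; the timestamp
# representation (seconds) is an illustrative assumption.

def start_time_latency(true_start: float, reported_start: float) -> float:
    """Latency in seconds between the true and reported attack start."""
    return abs(reported_start - true_start)

def fraction_within(latencies, threshold_s=15.0):
    """Fraction of detected attacks whose start-time latency is within threshold."""
    if not latencies:
        return 0.0
    return sum(1 for lat in latencies if lat <= threshold_s) / len(latencies)

# Hypothetical (true, reported) start times in seconds.
pairs = [(1000.0, 1004.0), (2000.0, 2010.0), (3000.0, 3120.0)]
lats = [start_time_latency(t, r) for t, r in pairs]
print(fraction_within(lats))  # two of the three attacks are within 15 s
```

Under this metric, a system that reports the root-privilege elevation time for a multi-session U2R attack incurs a large latency even though its detection is correct.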
The DARPA 1999 intrusion detection evaluation successfully evaluated 18 intrusion detection systems from 8 sites using more than 200 instances of 58 attack types embedded in three weeks of training data and two weeks of test data. Attacks were primarily launched against UNIX and Windows NT hosts. Best detection was provided by network-based systems for old probe and old denial-of-service attacks, and by host-based systems for Solaris user-to-root attacks launched either remotely or from the local console. A number of sites developed systems that detect known old attacks by searching for signatures in network sniffer data or Solaris BSM audit data using expert systems or rules. These systems detect old attacks well when they match known signatures, but miss many new UNIX attacks, Windows NT attacks, and stealthy attacks. Promising capabilities were provided by Solaris host-based systems which detected console-based and remote stealthy U2R attacks, by anomaly detection systems which could detect some U2R and DoS attacks without requiring signatures, and by a host-based system that could detect Solaris U2R and R2L attacks without using audit information, by performing a forensic analysis of the Solaris file system.
Results of the 1999 evaluation should be interpreted within the context of the test bed, background traffic, attacks, and scoring procedures used. The evaluation used a reasonable, but not exhaustive, set of attacks with a limited set of actions performed as part of each attack. It also used a simple network topology, a non-restrictive security policy, a limited number of victim machines and intrusion detection systems, stationary and low-volume background traffic, lenient scoring, and extensive instrumentation to provide inputs to intrusion detection systems. One finding that should not be misinterpreted is that most systems had false alarm rates that were low, well below 10 false alarms per day. As noted above, these low rates may be caused by the use of relatively low-volume background traffic with a time-varying but relatively fixed proportion of different traffic types. We currently plan to verify false alarm rates using live network traffic and a small number of high-performing systems. Live-traffic measurements will also be made to update traffic statistics and traffic generators used in the test bed. Results obtained with the DARPA research systems used in the evaluation also may not generalize to more recent research systems or to commercial systems. Performance with the 58 attack types used in the evaluation also may not be representative of performance with more recent attacks or with other attacks against different host machines, firewalls, routers, or parts of the network infrastructure. Further evaluations are required to explore performance with commercial and other research intrusion detection systems, with more complex network topologies, with a wider range of attacks, and with varying mixtures and amounts of background traffic.
Comprehensive evaluations of DARPA research systems have now been performed in 1998 and 1999. These evaluations take time and effort on the part of the evaluators and the participants. They have provided benchmark measurements that need not be repeated until system developers are able to implement many desired improvements. The current planned short-term focus in 2000 is to provide assistance to intrusion detection system developers to advance their systems, not to evaluate performance. System development can be expedited by providing descriptions and labeled examples of many new attacks, by developing threat and attack models, and by carefully evaluating COTS systems to determine where to focus research efforts.
A number of approaches to improve capabilities of existing systems are suggested by 1999 results. First, techniques should be developed to process Windows NT audit data to detect attacks by extending existing approaches from UNIX to Windows NT.
Second, host-based systems shouldn't rely exclusively on C2-level audit data such as Solaris BSM data or NT audit data. Instead, they should also examine information in the file system and in commonly used system logs. Systems that use file-system information could be used on hosts such as Linux, where there currently is no C2-level auditing, and on any critical host where auditing is not turned on for fear of performance degradation. Third, systems should analyze a wider range of protocols and TCP services. For some protocols, information contained in packet headers alone is insufficient; the content of network transmissions must be extracted to determine the purpose of important network interactions. Fourth, approaches that can detect new attacks, including anomaly detection, should be extended to more hosts and network traffic types. Fifth, systems should provide more forensic information to analysts, extending the optional attack identification information provided by many systems in 1999. This forensic analysis could simplify the tasks of verifying each alert, determining attacker actions, and responding to an attack. It could also provide a valuable lasting record of attack-related events. Finally, other types of input features should be explored. These could be provided by new system auditing software, by firewall or router audit logs, by SNMP queries, by software wrappers, and by application-specific auditing.
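One simple way a host-based system can use file-system information without C2-level audit data is to baseline and compare cryptographic hashes of critical files, in the style of integrity checkers such as Tripwire. This is a minimal illustrative sketch, not a description of the evaluated forensic system's actual method:

```python
# Hedged sketch of file-system-based change detection (Tripwire-style).
# Illustrates the general idea only; the paths monitored and the policy
# for responding to changes are deployment decisions.
import hashlib
from pathlib import Path

def snapshot(paths):
    """Map each file path to the SHA-256 hash of its contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
            for p in paths}

def changed_files(baseline, current):
    """Paths that were added, removed, or whose contents changed."""
    added_or_removed = set(baseline) ^ set(current)
    modified = {p for p in set(baseline) & set(current)
                if baseline[p] != current[p]}
    return sorted(added_or_removed | modified)
```

A baseline snapshot taken when the host is known-good can later be compared against a fresh snapshot; any reported path (e.g. a replaced setuid binary or an edited password file) is a candidate forensic lead, without requiring auditing to have been enabled.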
Acknowledgements This work was sponsored by the Department of Defense Advanced Research Projects Agency under Air Force Contract F19628-95-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Air Force.
We would like to thank Sami Saydjari for supporting this effort. Many involved participants made this evaluation possible including Dick Kemmerer, Giovanni Vigna, Mabri Tyson, Phil Porras, Anup Ghosh, R. C. Sekar, and NingNing Wu. We would also like to thank Terry Champion and Steve Durst from AFRL for many lively discussions and for providing Linux kernel modifications that make one host simulate many IP addresses. Finally, we would like to thank others who contributed including Marc Zissman, Rob Cunningham, Seth Webster, Kris Kendall, Raj Basu, Jesse Rabek, and Simson Garfinkel.