The 1999 DARPA Off-Line Intrusion Detection Evaluation
Richard Lippmann, Joshua W. Haines, David J. Fried, Jonathan Korba, Kumar Das
MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02173-9108
Email: firstname.lastname@example.org or email@example.com
Abstract. Eight sites participated in the second DARPA off-line intrusion detection evaluation in 1999. A test bed generated live background traffic similar
to that on a government site containing hundreds of users on thousands of
hosts. More than 200 instances of 58 attack types were launched against victim UNIX and Windows NT hosts in three weeks of training data and two weeks of test data. False alarm rates were low (less than 10 per day). Best detection was provided by network-based systems for old probe and old denial-of-service (DoS) attacks and by host-based systems for Solaris user-to-root (U2R) attacks. Best overall performance would have been provided by a combined system that used both host- and network-based intrusion detection. Detection accuracy was poor for previously unseen new, stealthy, and Windows NT attacks. Ten of the 58 attack types were completely missed by all systems.
Systems missed attacks because protocols and TCP services were not analyzed at all or to the depth required, because signatures for old attacks did not generalize to new attacks, and because auditing was not available on all hosts.
Promising capabilities were demonstrated by host-based systems, by anomaly detection systems, and by a system that performs forensic analysis on file system data.
Keywords: intrusion detection, evaluation, attack, audit, test bed

1 Introduction

The potential damage that can be inflicted by attacks launched over the internet keeps increasing due to a growing reliance on the internet and more extensive connectivity. Intrusion detection systems have become an essential component of computer security to detect attacks that occur despite the best preventative measures.
Comprehensive discussions of alternate approaches to intrusion detection are available in [1,2,16]. Some approaches detect attacks in real time and can be used to monitor and possibly stop an attack in progress. Others provide after-the-fact forensic information about attacks and can help repair damage, understand the attack mechanism, and reduce the possibility of future attacks of the same type. More advanced intrusion detection systems detect never-before-seen, new, attacks, while the more typical systems detect previously seen, known attacks.
Study          IDs   Attacks/Victims   False Alarms   Stealth   Comments
Puketza 1994    2         4/1          Yes/Unknown      No      Automated attacks and simple telnet traffic
Table 1. Characteristics of past intrusion detection evaluations.
The widespread deployment and high cost of both commercial and government-developed intrusion detection systems has led to an interest in evaluating these systems. Technical evaluations that focus on algorithm performance are essential for ongoing research. They can contribute to rapid research progress by focusing efforts on difficult technical areas, they can produce common shared corpora or data bases which can be used to benchmark performance levels, and they make it easier for new researchers to enter a field and explore alternate approaches. System evaluations that focus on additional practical issues including cost, ease of use, and traffic handling capacity are also useful for determining capabilities of complete deployable systems.
Without careful evaluations, installing an intrusion detection system could be detrimental because it might lead to a relaxation of vigilance based on unproven assumptions concerning system performance. It might also lead to inefficient use of trained personnel if systems produce many difficult-to-analyze false alarms. A careful assessment of intrusion detection systems is essential to understand capabilities and limitations and construct an effective security posture that makes use of detection and prevention mechanisms.
It is difficult and costly to perform reliable, systematic evaluations of intrusion detection systems. As a result, few such evaluations have been performed. Table 1 summarizes characteristics of important past evaluations that have compared multiple intrusion detection systems. It includes early studies which describe a methodology that can be used for technical evaluations [4,18,19], the most recent and extensive system evaluation of commercial products that we are aware of, and the real-time and off-line [12,14] components of the 1998 DARPA intrusion detection evaluation. The first column in Table 1 provides the first author and date of the study, the second column indicates the number of intrusion detection systems evaluated, and the third column provides the number of attack types used and also the number of unique victim machines attacked. The fourth column indicates whether the study analyzed the number of false alarms produced for normal background traffic and also the duration of background traffic used to measure false alarm rates. The next column indicates whether stealthy versions of attacks were used in an attempt to evade intrusion detection systems, and the final column provides additional comments on the study.
Results are not shown in Table 1 because many studies were informal and didn't provide detailed information and because metrics differ widely across studies. The primary performance metric in all studies is the attack detection rate for each attack type used. This metric depends on details of the attacks and on the specific version of the intrusion detection system that was tested. It is also insufficient when used alone.
It must be combined with false alarm rates for normal traffic to assess the human workload required to operate intrusion detection systems and dismiss false alarms.
False alarm rates above hundreds per day make a system excessively expensive to deploy, even with high detection accuracy. Unless a system provides forensic information which makes alerts or putative detections easy to analyze, security analysts will not trust alerts and may spend many hours each day dismissing false alarms. Low false alarm rates combined with high detection rates, however, mean that alerts can be trusted and that the human labor required to confirm detections is minimized. Only recent DARPA evaluations have measured false alarm rates with a large quantity of rich background traffic. Other important metrics used by some studies include cost of commercial systems, ease of software installation and use, traffic handling capacity, and run-time memory and CPU requirements.
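The per-day false alarm metric discussed above is simply an alert count averaged over the evaluation window. The sketch below illustrates the calculation; the function name and the two-day window are hypothetical, not taken from the evaluation software.

```python
from datetime import datetime

def false_alarms_per_day(false_alarm_times, start, end):
    """Average number of false alarms per day over [start, end).

    An analyst workload budget, such as the 10-per-day operating
    point discussed above, can then be checked against this rate.
    """
    days = (end - start).total_seconds() / 86400.0
    return len(false_alarm_times) / days

# Hypothetical two-day window containing 15 false alarms.
start = datetime(1999, 3, 29)
end = datetime(1999, 3, 31)
alarms = [start] * 15  # placeholder timestamps; only the count matters here
rate = false_alarms_per_day(alarms, start, end)
assert rate == 7.5  # within a 10-per-day budget
```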
As can be seen from Table 1, evaluations have become more complex and extensive over the years. Initial evaluations included few systems and few attack types, did not include stealthy attacks, and included little normal background traffic to evaluate false alarm rates. The 1998 off-line DARPA evaluation included 10 systems, 38 attack types, weeks of rich background traffic, and stealthy attacks, and also led to a corpus or data base of attacks and background traffic that is being widely used for evaluation and development of intrusion detection systems. The first two evaluations in Table 1 describe initial research programs designed to develop a methodology for intrusion detection evaluation [4,18,19]. Both studies incorporated scripting software to provide repeatability by automating generation of attacks and background traffic. Few attack types were used in these studies, and background traffic consisted of a small number of automated telnet or FTP sessions. Both studies demonstrated the importance of repeatability for intrusion detection system development. Initial low detection and high false alarm rates were improved by cyclical testing and development with repeatable attacks and background traffic. The second study also noted that generating realistic normal background traffic was complex and time-consuming in heterogeneous computing environments.
Many product comparisons of commercial intrusion detection systems have been published in the past few years. The third entry in Table 1 is a recent comprehensive product evaluation. It includes three host-based and seven network-based commercial intrusion detection systems which were evaluated using more than 12 attack types and four victim machines. This study also included stealthy probe or scan attacks and stealthy packet modifications designed to elude intrusion detection systems. This study did not provide detailed per-attack detection results, but mentions that no system detected all attacks and that stealthy attacks successfully eluded many systems. Most of the systems evaluated rely on attack “signatures” to detect old or known attacks. New signatures can often be added by hand or downloaded from a remote site. This evaluation focused on practical system characteristics such as ease of use and cost, and did not measure false alarm rates for normal background traffic. It did, however, use network load-generating software to demonstrate that some network-based intrusion detection systems fail to detect attacks at high network loads.
The last two rows in Table 1 are for real-time and off-line DARPA 1998 evaluations. As can be seen from the table, the off-line evaluation is the most complex performed to date. It was an initial attempt at a comprehensive evaluation which included background traffic to measure false alarm rates, many attacks, and more than eight different intrusion detection systems. This exploratory evaluation was limited. It included only intrusion detection systems developed under DARPA sponsorship, only attacks against UNIX hosts, and background traffic designed to be similar to traffic on one Air Force base. Six research groups participated in this statistically-blind evaluation to provide unbiased measurement of current performance levels. The off-line evaluation, performed by MIT Lincoln Laboratory, included weeks of training and test traffic, more than 300 instances of 38 attack types, and resulted in an archival 1998 intrusion detection corpus or database [12,14]. This corpus can be processed simultaneously at many sites to evaluate and develop research systems and it continues to be used for algorithm development and as a baseline for future evaluations. The real-time evaluation, performed by the Air Force Research Laboratory (AFRL), evaluated a smaller number of systems which have real-time implementations using a more complex network, fewer attacks, and four hours of traffic. Results of the 1998 evaluation helped determine the strengths and weaknesses of alternative technical approaches and had a strong influence on DARPA intrusion detection research goals. Further off-line and real-time evaluations which build on the initial 1998 effort were performed in 1999. This paper reports on the results of the off-line 1999 evaluation. Results and lessons learned from the 1998 off-line evaluation are first summarized, the 1999 off-line evaluation is described, 1999 results are presented, and suggestions are provided for future evaluations.
Further details on the 1999 off-line evaluation are available in [3,10,13,14].
2 Summary of the 1998 Off-Line Evaluation
The DARPA 1998 Intrusion Detection Evaluation was an initial attempt to perform a comprehensive technical evaluation of intrusion detection technology. As noted above, this evaluation had limited goals. It was designed to evaluate only DARPA funded intrusion detection technology, and not complete deployable intrusion detection systems or commercial systems. It was also designed to measure false alarm rates using background traffic similar to that on one Air Force base and to measure detection rates of remotely-initiated attacks against UNIX hosts. Figure 1 shows the current version of an isolated test bed network which was first developed for the 1998 off-line evaluation. Scripting techniques which extend the approaches used in [4,18] are used to generate live background traffic which is similar to traffic that
flows between the inside of one Air Force base and the outside internet. This approach was selected for the evaluation because hosts can be attacked without degrading operational Air Force systems and because corpora containing background traffic and attacks can be widely distributed without security or privacy concerns. A rich variety of background traffic is generated in the test bed which looks as if it were initiated by hundreds of users on thousands of hosts. The left side of Figure 1 represents the inside of the fictional Eyrie Air Force base created for the evaluations and the right side represents the outside internet. The 1998 evaluation did not include the Windows NT victim machine or the inside sniffer shown on the left of Figure 1, but instead focused exclusively on UNIX and router attacks. Automated attacks were launched against three inside UNIX victim machines (SunOS, Solaris, Linux) and the router from outside hosts. More than 300 instances of 38 different attacks were embedded in seven weeks of training data and two weeks of test data. Machines labeled “sniffer” in Figure 1 run a program named tcpdump  to capture all packets transmitted over the attached network segment.
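The tcpdump sniffer output referenced above is stored in the standard libpcap capture format, which begins with a 24-byte global header that is easy to parse directly. The sketch below checks a synthetic header built in memory; the field layout is the documented little-endian pcap format, while the helper name and the synthetic bytes are illustrative.

```python
import struct

# libpcap global header: magic, version major/minor, timezone offset,
# timestamp accuracy, snapshot length, link-layer type (24 bytes total).
PCAP_MAGIC = 0xA1B2C3D4

def parse_pcap_header(buf):
    magic, major, minor, tz, sigfigs, snaplen, linktype = struct.unpack(
        "<IHHiIII", buf[:24])
    assert magic == PCAP_MAGIC, "not a little-endian pcap file"
    return {"version": (major, minor), "snaplen": snaplen, "linktype": linktype}

# Synthetic header standing in for a real capture file: pcap 2.4,
# 65535-byte snapshots, link type 1 (Ethernet).
header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
info = parse_pcap_header(header)
assert info["version"] == (2, 4) and info["linktype"] == 1
```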
Six research sites participated in the blind 1998 evaluation and results were analyzed to determine the attack detection rate as a function of the false alarm rate.
Performance was evaluated for old attacks included in the training data and new attacks which occurred only in the test data. Detection performance for the best systems was above 60% correct at false alarm rates at or below 10 false alarms per day for both old and new probe attacks and attacks where a local user illegally becomes root (U2R). Detection rates were mixed for denial of service (DoS) attacks and remote-to-local (R2L) attacks where a remote user illegally accesses a local host.
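Reporting detection rate as a function of false alarm rate amounts to a threshold sweep over scored alerts. The sketch below illustrates the idea under stated assumptions: the function name, the (score, is_attack) alert representation, and the toy data are all hypothetical, not the evaluation's actual scoring software.

```python
def detection_rate_at_budget(scored_alerts, n_attacks, days, budget_per_day=10):
    """Fraction of attacks detected at the lowest alert threshold that
    keeps false alarms within budget_per_day * days.

    scored_alerts: (score, is_attack) pairs; higher score = more suspicious.
    """
    max_fa = budget_per_day * days
    hits = fa = 0
    # Admit alerts from most to least suspicious until the false alarm
    # budget is exhausted, counting how many true attacks are included.
    for score, is_attack in sorted(scored_alerts, key=lambda a: -a[0]):
        if is_attack:
            hits += 1
        else:
            fa += 1
            if fa > max_fa:
                break
    return hits / n_attacks

# Toy run: 2 attacks and 3 false alarms scored over one day.
alerts = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, False)]
rate = detection_rate_at_budget(alerts, n_attacks=2, days=1)
assert rate == 1.0  # both attacks found within the 10-per-day budget
```

Sweeping budget_per_day over a range of values traces out the detection-versus-false-alarm curves used to compare systems in the evaluation.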