Abstract. Eight sites participated in the second DARPA off-line intrusion detection evaluation in 1999. A test bed generated live background traffic ...
Eight research groups participated in the evaluation using a variety of approaches to intrusion detection. Papers by these groups describing high-performing systems are provided in [7,8,15,20,21,24,25,26]. One requirement for participation in the evaluation was the submission of a detailed system description that was used for scoring and analysis. These descriptions specified the types of attacks the system was designed to detect, the data sources used, the features extracted, and whether optional attack identification information was provided as an output. Most systems used network sniffer data to detect Probe and DoS attacks against all systems [8,15,21,25] or Solaris BSM host audit data to detect Solaris R2L and U2R attacks [7,15,25]. Two systems produced a combined output from both network sniffer data and host audit data [15,25]. A few systems used network sniffer data to detect R2L and U2R attacks against the UNIX victims [15,25]. One system used NT audit data to detect U2R and R2L attacks against the Windows NT victim, and two systems used BSM audit data to detect Data attacks against the Solaris victim [15,25]. A final system used information from a nightly file system scan to detect R2L, U2R, and Data attacks against the Solaris victim. The software program that performs this scan was the only custom auditing tool used in the evaluation.
Three weeks of training data, composed of two weeks of background traffic with no attacks and one week of background traffic with a few attacks, were provided to participants from mid May to mid July 1999 to support system tuning and training.
Locations of attacks in the training data were clearly labeled. Two weeks of unlabeled test data were provided from late September to the middle of October. Participants downloaded this data from a web site, processed it through their intrusion detection systems, and generated putative hits or alerts at the output of their intrusion detection systems. Lists of alerts were due back by early October. In addition, participants could optionally return more extensive identification lists for each attack.
A simplified approach was used in 1999 to label attacks and score alerts and new scoring procedures were added to analyze the optional identification lists. In 1998, every network TCP/IP connection, UDP packet, and ICMP packet was labeled, and participants determined which connections and packets corresponded to attacks.
Although this approach pre-specifies all potential attack packets and thus simplifies scoring and analysis, it can make submitting alerts difficult because aligning alerts with the network connections and packets that generate alerts is often complex. In addition, this approach cannot be used with inside attacks that generate no network traffic. In 1999, a new simplified approach was adopted. Each alert only had to indicate the date, time, victim IP address, and score for each putative attack detection. An alert could also optionally indicate the attack category. This was used to assign false alarms to attack categories. Putative detections returned by participants were counted as true “hits” or true detections if the time of any alert occurred during the time of any attack segment and the alert was for the correct victim IP address.
Alerts that occurred outside all attack segments were counted as false alarms. Attack segments correspond to the duration of all network packets and connections generated by an attack and to time intervals when attack processes are running on a victim host. To account for small timing inconsistencies across hosts, an extra 60 seconds of leeway was typically allowed for alerts before the start and after the end of each attack segment. The analysis of each system included only attacks which that system was designed to detect, as specified in the system description. Systems were not penalized for missing attacks they were not designed to detect, and false alarms that occurred during segments of out-of-spec attacks were ignored.
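The scoring rule above can be expressed compactly. The following sketch is illustrative only; the record types and field names are hypothetical, not taken from the evaluation software:

```python
from dataclasses import dataclass

# Hypothetical record types; field names are illustrative.
@dataclass
class Alert:
    time: float        # seconds since epoch
    victim_ip: str
    score: float

@dataclass
class AttackSegment:
    start: float       # seconds since epoch
    end: float
    victim_ip: str

LEEWAY = 60.0  # extra seconds allowed before and after each attack segment

def is_hit(alert: Alert, segments: list) -> bool:
    """An alert counts as a hit if it falls within any attack segment
    (plus leeway) and names the correct victim IP address."""
    return any(
        seg.start - LEEWAY <= alert.time <= seg.end + LEEWAY
        and alert.victim_ip == seg.victim_ip
        for seg in segments
    )

def score_alerts(alerts, segments):
    """Partition alerts into hits and false alarms per the 1999 rules."""
    hits = [a for a in alerts if is_hit(a, segments)]
    false_alarms = [a for a in alerts if not is_hit(a, segments)]
    return hits, false_alarms
```

Note that the leeway means an alert 60 seconds before a segment starts still counts as a detection of that attack.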
The score produced by a system was required to be a number that increases as the certainty of an attack at the specified time increases. All participants returned numbers ranging between zero and one, and many participants produced binary outputs (0's and 1's only). If alerts occurred in multiple attack segments of one attack, then the score assigned to that attack for further analysis was the highest score in all the alerts. Some participants returned optional identification information for attacks.
This included the attack category, the name for old attacks selected from a list of provided names, and the attack source and destination IP addresses, start time, duration, and the ports/services used. This information was analyzed separately from the alert lists used for detection scoring. Results in this paper focus on detection results derived from the required alert lists.
Attack labels were needed to designate attack segments in the training data and also to score lists of alerts returned by participants. Attack labels were provided using list files similar to those used in 1998, except a separate list file was provided for each attack specifying all segments of that attack. Entries in these list files include the date, start time, duration, a unique attack identifier, the attack name, source and destination ports and IP addresses, the protocol, and details concerning the attack.
Details include indications that the attack is clear or stealthy, old or new, inside or outside, the victim machine type, and whether traces of the attack occur in each of the different data types that were collected.
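The exact on-disk syntax of the list files is not reproduced here; as a hedged sketch, assuming one tab-separated entry per line with the fields in the order described above, parsing them might look like:

```python
import csv
from io import StringIO

# Assumed field order, following the description in the text;
# the actual list-file layout may differ.
FIELDS = ["date", "start_time", "duration", "attack_id", "name",
          "src_ports", "dst_ports", "src_ip", "dst_ip", "protocol", "details"]

def parse_list_file(text: str) -> list:
    """Parse a (hypothetical) tab-separated attack list file into dicts,
    skipping blank lines and '#' comments."""
    entries = []
    for row in csv.reader(StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        entries.append(dict(zip(FIELDS, row)))
    return entries
```

A tab delimiter is assumed so that the free-form details field (e.g. "clear,old,outside") can contain commas without ambiguity.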
Table 4. Poorly detected attacks where the best system for each attack detects half or fewer of the attack instances.
8 Results

An initial analysis was performed to determine how well all systems taken together detect attacks, regardless of false alarm rates. The best system was first selected for each attack as the system which detects the most instances of that attack. The detection rate of these best systems provides a rough upper bound on composite system performance. Thirty-seven of the 58 attack types were detected well by this composite system, but many stealthy and new attacks were always or frequently missed. Poorly detected attacks, for which half or more of the attack instances were missed by the best system, are listed in Table 4. This table lists the attack name, the attack category, details concerning whether the attack is old, new, or stealthy, the total number of instances of this attack, and the number of instances detected by the system which detected this attack best. Table 4 contains 21 attack types and is dominated by new attacks and attacks designed to be stealthy to 1998 network-based intrusion detection systems. All instances of 10 of the attack types in Table 4 were missed by all systems. These results suggest that the new systems developed for the 1999 evaluation still do not detect new attacks well and that stealthy Probes and U2R attacks can avoid detection by network-based systems.
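The composite upper bound described above amounts to a per-attack maximum over systems. A minimal sketch, with an assumed input format of (system, attack, instances-detected) tuples:

```python
def best_system_per_attack(detections):
    """detections: iterable of (system, attack_name, instances_detected).
    For each attack, keep the system detecting the most instances; the
    resulting detection counts give a rough upper bound on composite
    performance across all systems."""
    best = {}
    for system, attack, n in detections:
        if attack not in best or n > best[attack][1]:
            best[attack] = (system, n)
    return best
```

Attacks whose best count is half or fewer of their total instances would then be the "poorly detected" set reported in Table 4.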
Further analyses evaluated system performance at false alarm rates in a specified range. The detection rate of each system at different false alarm rates can be determined by lowering a threshold from 1.0 to 0.0, counting the detections with scores above the threshold as hits, and counting the number of alerts above the threshold that do not detect attacks as false alarms. This results in one or more operating points for each system which trade off false alarm rate against detection rate. It was found that almost all systems, except some anomaly detection systems, achieved their maximum detection accuracy at or below 10 false alarms per day on the 1999 corpus.
These low false alarm rates were presumably due to the low overall traffic volume, the relative stationarity of the traffic, and the ability to tune systems to reduce false alarms on three weeks of training data. In the remaining presentation, the detection rate reported for each system is the highest detection rate achieved at or below 10 false alarms per day on the two weeks of test data.
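The threshold sweep described above can be sketched as follows. For brevity this illustrative version counts alerts rather than distinct attacks as detections; the actual evaluation credited at most one detection per attack:

```python
def operating_points(alerts):
    """alerts: list of (score, is_hit) pairs, one per alert.
    Sweeping a threshold from the highest score downward yields one
    (threshold, false_alarms, detections) operating point per distinct
    score, trading off false alarm rate against detection rate."""
    points = []
    thresholds = sorted({s for s, _ in alerts}, reverse=True)
    for t in thresholds:
        hits = sum(1 for s, h in alerts if s >= t and h)
        fas = sum(1 for s, h in alerts if s >= t and not h)
        points.append((t, fas, hits))
    return points
```

Systems that returned only binary scores (0's and 1's) produce a single non-trivial operating point under this sweep, which is why many systems in the evaluation were characterized by one detection rate at one false alarm rate.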
Table 5. Percent of attack instances detected by systems with a detection rate above 40% in each cell, at false alarm rates below 10 false alarms per day.
Table 5 presents overall detection results by attack category and victim type; it does not separately analyze old, new, and stealthy attacks. The upper number in a cell, surrounded by dashes, is the number of attack instances in that cell, and the other entries provide the percent of correct detections for all systems with detection rates above 40% in that cell.
A cell contains only the number of instances if no system detected more than 40% of the instances. Only one entry is filled in the bottom row because only Probe attacks were launched against all the victim machines, and the SunOS/Data cell is empty because there were no Data attacks against the SunOS victim. High-performance systems listed in Table 5 include rule-based expert systems that use network sniffer data and/or Solaris BSM audit data (Expert-1 through Expert-3 [15,25,21]), a data mining system that uses network sniffer data (Dmine), a pattern classification approach that uses network sniffer data (Pclassify), an anomaly detection system which uses recurrent neural networks to analyze system call sequences in Solaris BSM audit data (Anomaly), and a reasoning system which performs a nightly forensic analysis of the Solaris file system (Forensics).
No one approach or system provides the best performance across all categories. Systems that use network sniffer data perform best on Probe and denial-of-service attacks, and systems that use BSM audit data perform best on U2R and Data attacks against the Solaris victim. Detection rates for U2R and Data attacks are generally poor on the SunOS and Linux victims, where extensive audit data is not available.
Detection rates for R2L, U2R, and Data attacks are poor for Windows NT, which was included in the evaluation for the first time this year.
Attacks were detected best when they produced a consistent “signature,” or sequence of events, in tcpdump data or in audit data that differed from the sequences produced by normal traffic. A detailed analysis by participants demonstrated that attacks were missed for a variety of reasons. Systems which relied on rules or signatures missed new attacks because signatures did not exist for these attacks and because existing signatures did not generalize to variants of old attacks or to new and stealthy attacks. For example, “ncftp” and “lsdomain” attacks were visible in tcpdump data but were missed because no rules existed to detect these attacks.
Stealthy probes were missed because hard thresholds in rules were set to issue an alert only for more rapid probes, even though slow probes often provided as much information to attackers. Stealthy U2R attacks were missed by network-based systems because rules generated for clear versions of these attacks did not generalize to stealthy versions and because attacker actions were not easily visible in sniffer data.
Many of the Windows NT attacks were missed due to lack of experience with Windows NT audit data and attacks. A detailed analysis of the Windows NT attacks indicated that all but two of these attacks (ppmacro, framespoof) can be detected from the 1999 NT audit data using attack-specific signatures which generate far fewer than 10 false alarms per day.
Systems also missed attacks because particular protocols or services were not monitored. For example, some systems missed the “arppoison” attack because the ARP protocol was not monitored, some missed the “snmpget” attack because the SNMP service was not analyzed, and some missed the “lsdomain” attack because the DNS service was not analyzed. Finally, some systems missed attacks because a protocol or TCP service was not analyzed to the required depth. For example, the “lsdomain” attack requires a system to monitor traffic to the DNS server and also to detect when an “ls” command is successfully run on that server. The “selfping” attack likewise will not be detected by a network-based intrusion detection system unless telnet sessions are extracted and analyzed to detect when a “ping” command is issued with specific arguments.
Some inside attacks were launched from the console of victim machines and did not generate network traffic. These were detected well only on the Solaris victim, by systems that use BSM audit data. Other inside machine-to-machine attacks were detected from inside sniffer data just as well as attacks initiated from outside machines. One anomaly detection system provided good results. It analyzed system-call sequences extracted from BSM audit data and provided a high detection rate, similar to that of the best signature-based systems, for Solaris U2R attacks, as shown in the upper right of Table 5.