«Workshop held 9-10 January 2008 in Arlington, VA Prepared for US Strategic Command Global Innovation and Strategy Center (USSTRATCOM/GISC) Prepared ...»
Automated data analysis techniques can be broadly classified as either subject-based or pattern-based. For subject-based queries, analysis begins with knowledge of the subject, such as a suspicious individual, place, or phone number identified by authoritative intelligence sources.
Subject-based queries seek additional information that will provide a broader understanding of the subject, such as activities an individual has engaged in and links to people, places, and things with which they are familiar (DeRosa, 2004). Link and social network analysis are subject-based query techniques that have been widely used both in the public and private sectors, and they have significant potential to be a useful weapon in the counter-terrorism arsenal (DeRosa, 2004).
Several examples of the application of link/social network analysis to the counter-terrorism domain are described below, including a hyperlink interconnectivity analysis of Jihadi communities on the Internet, and the social network mapping of the Al Qaeda cell responsible for the 9/11 attacks.
Pattern-based queries seek to identify pre-defined patterns of behavior within datasets, and the models can come from data mining techniques or other intelligence sources (DeRosa, 2004).
Perhaps the best-known example of pattern-based searching in the private sector is its use by credit card companies to detect fraud. The banks create a model of fraudulent activity by searching databases that are known to contain a combination of valid and invalid transactions.
An example of such a pattern would be for the thief to make a small purchase with the stolen card to confirm that it works, immediately before making a substantial purchase (Jonas, 2003).
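As a rough illustration only, the test-purchase pattern described above could be written as a simple rule over a transaction stream. The `Transaction` fields, the function name, and every threshold below are hypothetical choices made for this sketch; they are not taken from Jonas (2003) or from any bank's actual model.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    card_id: str
    timestamp: float  # seconds since an arbitrary epoch (assumption)
    amount: float


def flag_test_purchase_pattern(transactions, small_max=5.0,
                               large_min=500.0, window_seconds=3600.0):
    """Return (small, large) pairs where a small purchase on a card is
    followed by a large purchase on the same card within the window.
    All three thresholds are illustrative assumptions."""
    flagged = []
    last_small = {}  # card_id -> most recent small transaction seen
    for tx in sorted(transactions, key=lambda t: t.timestamp):
        if tx.amount <= small_max:
            last_small[tx.card_id] = tx
        elif tx.amount >= large_min:
            probe = last_small.get(tx.card_id)
            if probe is not None and tx.timestamp - probe.timestamp <= window_seconds:
                flagged.append((probe, tx))
    return flagged
```

A production model would be learned from labeled transactions rather than hand-written, but the rule form shows why this kind of search needs a well-defined, recurring pattern to look for.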
The bank monitors all credit card transactions for instances of fraudulent patterns, and issues an alarm if one is detected. For a case such as this, the bank is looking for a broad pattern in unrelated financial transaction data. This methodology does not generalize to the domain of terrorist activity, however. There is no broad pattern of activity associated with terrorist organizations; they tend to be loosely connected semi-autonomous cells, and it is the “relational” data describing the connections between people, places, and things that are of importance (DeRosa, 2004). Several examples of the application of the pattern-based methodology to the counter-terrorism domain are described below, including content-based monitoring of Web browsing behavior, automatic classification of Web content, and authorship identification of anonymous documents.
Deterring VNSA in Cyberspace
Data Analysis: Link/Social Network Analysis
The researchers at the University of Arizona, who are responsible for the Dark Web Collection, have been analyzing the data set from several different perspectives, including link/social network analysis. Using the Dark Web collection methodology, they studied Jihadi communities on the Internet in an effort to better understand how these groups interact and communicate in cyberspace. The data collection process began with three Jihad seed Web sites: www.qudsway.com of the Palestinian Islamic Jihad, www.hizbollah.com of Hizbollah, and www.ezzedine.net of the Izzedine-Al-Qassam, the military wing of Hamas (Reid et al, 2005).
The Google back-link search service was used to find all the sites linked to the initial three, and after performing manual keyword lexicon searches to expand the set and filtering to remove outliers, a testbed of 39 sites remained. Web spidering collected approximately 300,000 documents from these sites and those linked to them. In order to identify hidden communities, a similarity measure was computed between all website pairs based on the number of hyperlinks shared between the sites. The hyperlinks were weighted proportionally according to how deeply they are embedded within a site, with the most weight given to links available from homepages.
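The depth-weighted similarity measure can be sketched as follows. The source does not give the exact weighting formula, so the `1 / (1 + depth)` weight, the function names, and the input format of `(source_site, target_site, depth)` triples are assumptions made purely for illustration.

```python
from collections import defaultdict


def link_weight(depth):
    """Hypothetical weighting: a link on the homepage (depth 0) counts
    most, and the weight shrinks as the link sits deeper in the site."""
    return 1.0 / (1 + depth)


def site_similarity(links):
    """Accumulate a symmetric similarity score for every pair of sites.

    links: iterable of (source_site, target_site, depth) triples, where
    depth is how many levels below the homepage the link was found.
    Returns a dict mapping a sorted (site_a, site_b) pair to its score.
    """
    similarity = defaultdict(float)
    for source, target, depth in links:
        if source == target:
            continue  # internal links say nothing about site-to-site ties
        pair = tuple(sorted((source, target)))
        similarity[pair] += link_weight(depth)
    return dict(similarity)
```

For example, a homepage link from one site to another plus a depth-1 link in the opposite direction would give that pair a score of 1.5 under this assumed weighting.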
A multidimensional scaling (MDS) algorithm (Duda et al, 2001) was used to generate a two-dimensional graph of the link structure, from which clusters representing highly linked communities were extracted. Six main clusters were identified, and the results conform to what has already been established regarding the relationship among these groups (see Figure 3).
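To make the MDS step concrete, the sketch below implements classical (Torgerson) MDS in plain Python. The source does not say which MDS variant was used or how link similarities were converted to distances, so this is one possible instance of the technique, not a reconstruction of the study's method. It assumes a symmetric matrix of roughly Euclidean distances and finds the leading eigenvectors by power iteration.

```python
import math
import random


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Embed n items in `dims` dimensions from an n x n symmetric
    distance matrix so that embedded distances approximate `dist`."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (column means equal
    # row means because the matrix is assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        # Power iteration converges to the dominant eigenvector of B.
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if eigenvalue <= 1e-12:
            break  # no remaining positive variance to embed
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords
```

Clusters would then be read off the resulting two-dimensional coordinates, either visually or with a separate clustering step that is not shown here.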
Figure 3: 2D graph of the link structure of hyperlinked Jihadi communities on the Web (Reid et al, 2005).
For example, the figure indicates a strong link between the Hizbollah cluster and Palestinian organizations, which is not surprising since Hizbollah is a known sympathizer with the Palestinian cause. At the top-left portion of the graph is the Hizb-ut-Tahrir political party cluster. While the party is not officially recognized as a terrorist group, the results indicate that it has links to the Hizbollah cluster. The demonstrated link between the Al Qaeda and Hamas clusters was also expected. While this particular study did not produce any results that conflict with the current sociological and ideological understanding of the relationships between the Jihadi groups studied, it is possible that this type of link/social network investigation will provide analysts with the ability to identify relationships between organizations and individuals that they might not have otherwise seen, and to allow them to monitor the development of these relationships over time.
Another example of the use of link/social network analysis for terrorism-related applications is the work of Valdis Krebs on mapping the structure of the covert network of Al Qaeda members responsible for the Sept. 11th attacks. Krebs used publicly available data from news sources on the Internet to visually represent the social relationships among the 19 individuals identified as hijackers, as well as their relationships with co-conspirators who provided knowledge, money, and skills to the effort, but did not board the planes (Krebs, 2002). Because Krebs relied on publicly released news reports as his data source, his analysis was necessarily incomplete, since the relationship information released to the media was limited or intentionally incorrect.
To counteract this, he applied the work of social network theorists such as Malcolm Sparrow who study the structure of covert networks under the conditions of missing information, fuzzy node inclusion criteria, and consistently dynamic datasets. Krebs decided to map the strength of the relationships between the key players in terms of how much time they spent together, with the strongest ties belonging to individuals who attended the same school or training programs; the resulting map can be seen in Figure 4.
Figure 4: Link analysis of the 19 hijackers and co-conspirators responsible for 9/11. Analysis was done by Valdis E. Krebs using open source news data (Krebs, 2002).
Additionally, attributes of network centrality were computed for each individual, including Degree, Closeness, and Betweenness. The Degree attribute indicates the node’s level of activity in the network; Closeness is a measure of the node’s ability to access others and monitor ongoing events; and Betweenness describes the node’s ability to control the flow of communication in the network (Krebs, 2002). Krebs’ analysis of the network structure revealed Mohamed Atta to be the most likely ringleader of the group, since he obtained the highest score of all participants for each of the centrality attributes described above. This result has been confirmed many times over by intelligence experts, and by bin Laden himself, who verified Atta’s leadership role in a videotape (U.S. Department of Defense, 2001).
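The three centrality attributes have standard graph-theoretic definitions, sketched below for a small undirected, connected graph. The function name and the input format are assumptions of this sketch, and any graph passed to it here is a toy, not Krebs' data. Betweenness is computed with Brandes' accumulation scheme; Closeness is the number of other nodes divided by the sum of shortest-path distances to them, which is one common normalization and may differ from the one Krebs used.

```python
from collections import deque


def centrality(graph):
    """graph: dict mapping each node to the set of its neighbours
    (undirected). Returns (degree, closeness, betweenness) dicts."""
    nodes = list(graph)
    degree = {v: len(graph[v]) for v in nodes}
    closeness = {}
    betweenness = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        total = sum(dist.values())
        closeness[s] = (len(dist) - 1) / total if total else 0.0
        # Back-propagate path dependencies (Brandes' algorithm).
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    for v in nodes:
        betweenness[v] /= 2.0  # each undirected pair was counted twice
    return degree, closeness, betweenness
```

On a five-node chain, for instance, the middle node has the highest Closeness and Betweenness even though its Degree is no higher than its neighbours', which is why the three measures are reported separately.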
While not directly illustrative of how link/social network analysis has been used to model the emergent behavior of VNSAs in cyberspace, the Krebs example demonstrates both the potential and limitations of using link/social network analysis for such an application. With the benefit of hindsight in this case, it is natural to ask the question: Given the amount of information that was available prior to the 9/11 attacks, could we have predicted and prevented them had we simply known how and where to look? According to the National Commission on Terrorist Attacks upon the United States, the government may have been able to prevent the tragedies had it pursued leads that were available at the time (Jonas and Harper, 2006). This finding does not imply, however, that techniques such as link and social network analysis can be used to reveal the structure of any and all covert VNSA organizations. Krebs asserts that uncovering covert criminal networks is an extremely difficult task, since their behavior is so unlike that of a normal social network. In the case of the 9/11 hijackers, the strong ties between nodes that were “formed years ago in school and training camps…remain[ed] mostly dormant and therefore hidden to outsiders,” unlike normal social networks (Krebs, 2002). The lack of transparent connections among group members, coupled with the self-imposed isolation of network members from the outside world, makes social network analysis a blunt instrument when it comes to its predictive and preventive capabilities. Krebs cautions that “we must be careful of ‘guilt by association’. Being linked to a terrorist does not prove guilt - but it does invite investigation,” making social network analysis more aptly applied to “the prosecution, not the prevention of criminal activities” (Krebs, 2002).
Data Analysis: Web Monitoring
Law enforcement agencies have been interested in monitoring the Internet activity of suspicious individuals ever since the emergence of the Internet as a standard communication medium in the 1990s. Programs such as Carnivore provided the FBI with the ability to monitor specific types of electronic communication described explicitly by a court order, such as e-mails and browsing records. The architecture of the Carnivore system consisted of a Windows-based computer installed at an ISP with a one-way tap into the Ethernet segment to which it was attached. The computer filtered the packet traffic and stored those packets that conformed to filter specifications defined by the court order. Restrictions on packet collection ranged from permission to access the full contents of communication to only address information, such as To and From e-mail addresses and IP addresses involved in FTP and HTTP sessions (Smith et al, 2000). The data was mined for information at a later time.
The FBI discontinued the Carnivore program (since renamed DCS1000) several years ago; however, the Carnivore methodology is representative of most electronic communication surveillance protocols in use today. That is, a mass of data conforming to certain content parameters is collected and analyzed later using various data mining and analysis techniques.
Researchers at Ben-Gurion University of the Negev, Israel, propose a new methodology and objective for monitoring these electronic communications based on real-time surveillance of users’ browsing behavior. They have developed a content-based model for classifying and identifying browsing activity called the Advanced Terror Detection System (ATDS), which they have applied specifically to the identification of behavior that conforms to a “typical terrorist signature”.
The underlying assumption of the ATDS is that the content of a user’s Web browsing behavior can be used to create a signature of interest that can be compared to a pre-classified set of signatures, such as one that might describe “typical terrorist” or normal Internet usage (Shapira, 2005). The method begins with a learning phase, during which the system is provided with a set of Web pages representing the browsing behavior of a “normal” set of users. Each document browsed is converted to a vector of weighted terms, where the weight of a term reflects the relative frequency with which it appears on the page and its position; for example, a term found in the page title is assigned a higher weight, since its contribution to the document’s content is assumed to be higher. A cluster generator receives the vectors and performs cluster analysis on the data, identifying discrete areas of interest based on the frequency of the weighted terms across the set of vectors derived from the users’ browsing sessions. These discrete areas of interest are represented by the centroids of the clusters, and together they make up the set of normal users’ browsing interests (Elovici et al, 2005).
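A minimal sketch of the vectorization step follows. The tokenizer, the `TITLE_BOOST` value, and the function names are assumptions for illustration; the source states only that title terms receive a higher weight, not how much higher. The clustering algorithm itself is not shown; `centroid` simply averages the vectors of a cluster that has already been formed.

```python
import re
from collections import Counter

TITLE_BOOST = 2.0  # hypothetical extra weight per title occurrence


def tokenize(text):
    """Very simple tokenizer (assumption): lowercase alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


def page_vector(title, body):
    """Map a page to a sparse vector of term -> relative weight.
    Body occurrences count 1.0 each; title occurrences add TITLE_BOOST."""
    weights = Counter()
    for term in tokenize(body):
        weights[term] += 1.0
    for term in tokenize(title):
        weights[term] += TITLE_BOOST
    total = sum(weights.values())
    if total == 0:
        return {}
    return {term: w / total for term, w in weights.items()}


def centroid(vectors):
    """Mean of the sparse vectors in one (already formed) cluster."""
    acc = Counter()
    for vec in vectors:
        for term, w in vec.items():
            acc[term] += w
    return {term: w / len(vectors) for term, w in acc.items()}
```

With these assumed weights, a page titled "Soccer" whose body reads "soccer scores today" gives "soccer" a weight of 0.6 and the other two terms 0.2 each.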
The monitoring process consists of an on-line packet sniffer, which captures the data sent and accessed by a group of users at a network level, much like the Carnivore system. The packets are sent to a filter which excludes pages without any textual content from further analysis. Each text item is vectorized and compared to the centroids of the normal user signature using the cosine method of computing the distance between vectors. If the distance between a monitored page vector and every centroid, including the nearest one, is higher than the dissimilarity threshold, the user has demonstrated an interest that is not reflected in the set of normal user interests, possibly signifying abnormal browsing behavior. Whether or not an alarm is raised depends on parameter choices such as the sensitivity of the dissimilarity threshold, and the number of “normal” pages required to classify the overall browsing behavior as normal (Elovici et al, 2005).
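The comparison step can be sketched as below. It reads the rule as "a page is dissimilar when even its nearest centroid is farther away than the threshold", which is an interpretation of the description above rather than a detail confirmed by the source. The function names and the default threshold of 0.7 are hypothetical, and vectors are represented as sparse term-to-weight dicts.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_dissimilar(page_vec, centroids, threshold=0.7):
    """True when the cosine distance (1 - similarity) from the page to
    its nearest centroid exceeds the threshold. The threshold value is
    an illustrative assumption."""
    nearest = min(
        (1.0 - cosine_similarity(page_vec, c) for c in centroids),
        default=1.0,  # with no centroids, nothing counts as normal
    )
    return nearest > threshold
```

Lowering the threshold makes the check more sensitive, flagging pages that are only loosely related to the learned interests; raising it flags only pages that share almost no vocabulary with any centroid.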
Researchers evaluated the performance of the ATDS by monitoring 38 computer stations in a teaching lab for one month, from which they collected 13,300 English pages corresponding to what would be considered “normal” browsing behavior. They also collected 582 terror-related pages for the simulation of an abnormal sequence of accesses, and chose a random 582 pages from the normal set, which they used to simulate the normal browsing behavior. The system was evaluated to determine the optimal alarm thresholds and queue size of pages to monitor. Queue sizes of 2, 8, 16, and 32 pages were tested for alarm thresholds of 50% and 100% of the queue having dissimilar interests to the normal profile. The system reached almost ideal performance for a 32-page queue and 100% alarm threshold (Elovici et al, 2005).
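The interaction between queue size and alarm threshold can be illustrated with a sliding window over per-page dissimilarity decisions. The class name, the choice to stay silent until the queue is full, and the use of "greater than or equal to" for the threshold are assumptions of this sketch; the source reports only the queue sizes and the 50% and 100% thresholds that were tested.

```python
from collections import deque


class BrowsingMonitor:
    """Raise an alarm when a large enough share of the most recent
    pages were judged dissimilar to the normal profile."""

    def __init__(self, queue_size=32, alarm_fraction=1.0):
        self.queue = deque(maxlen=queue_size)  # oldest entry drops off
        self.alarm_fraction = alarm_fraction

    def observe(self, page_is_dissimilar):
        """Record one page decision and return True if the alarm fires."""
        self.queue.append(bool(page_is_dissimilar))
        if len(self.queue) < self.queue.maxlen:
            return False  # assumption: wait until the queue is full
        share = sum(self.queue) / len(self.queue)
        return share >= self.alarm_fraction
```

A long queue with a 100% threshold fires only after a sustained run of dissimilar pages, which suggests why that setting would produce few false alarms on ordinary browsing while still catching a simulated sequence made up entirely of abnormal pages.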