Workshop held 9-10 January 2008 in Arlington, VA
Prepared for US Strategic Command Global Innovation and Strategy Center (USSTRATCOM/GISC)
Prepared ...
The researchers assert that the intention of the ATDS is to aid law enforcement officials in tracking down suspected terrorists based on the content of their Web browsing. They envision that the system would run in real time and would be able to monitor a large group of users simultaneously without being detected. Ideally, officials would also have access to individuals’ identification information based on their IP addresses, which would require the cooperation of the ISPs or a court order (Shapira, 2005). The scalability of this technique is potentially an issue, since each Web page has to be vectorized and compared in near real time; this may not be feasible for monitoring large groups (Shapira, 2005). Additionally, researchers have yet to evaluate the methodology with a testbed of Web pages that mixes normal and abnormal page accesses, which would more accurately reflect the usage patterns of a real individual (Elovici et al., 2005). Currently, no new data have been published that address these issues.
While the capabilities of the ATDS have not yet been fully investigated, the underlying concept broadens what Web monitoring can mean and introduces a framework suggesting that real-time monitoring of Web content may be achievable. Much future work is needed, however, to establish this type of methodology as a technique that law enforcement officials can feasibly use to identify suspicious individuals with a minimal number of false positives.
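To make the vector-space comparison underlying such content-based monitoring concrete, the sketch below is an illustrative simplification, not the ATDS implementation: the tokenizer, the centroid, and the sample pages are invented for the example. Each page is reduced to a term-frequency vector and scored against a centroid of flagged browsing interests using cosine similarity.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Reduce a page's text to a term-frequency vector (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical centroid of flagged browsing interests, plus two page accesses.
centroid = vectorize("attack bomb target cell attack weapons")
suspect  = vectorize("planning an attack on the target with weapons")
benign   = vectorize("recipes for dinner and holiday travel tips")

sim_suspect = cosine(suspect, centroid)
sim_benign  = cosine(benign, centroid)
```

A page whose score against the centroid exceeds a tuned threshold would be flagged; the scalability concern noted above stems from having to repeat this comparison for every page access in near real time.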
Data Analysis: Automatic Content Classification
The investigation of emergent behavior of VNSAs in cyberspace is a problem that is confounded not only by the difficult task of obtaining a clean, relevant dataset, but also by the sheer volume of data that is available. Raw data requires pruning to isolate terror-related content, and manual sorting techniques are simply not efficient enough to handle data sets of the magnitude of the 2TB Dark Web collection. Automatic classification of Web content is currently being investigated as a means of addressing this problem. Classification techniques are typically applied to a “function that has a discrete set of possible values,” and the algorithms have the goal of automatically identifying which of these values describe a previously unclassified piece of data (Last, 2005). The discrete values can be any number of properties, as long as there is a way to quantify the presence or absence of the property within the data in question, making classification algorithms applicable to a broad range of problems. The type of data to be classified determines the appropriate method of classification. For example, the most general classification models are based on a decision-tree methodology to represent the conditional dependence between inputs. The branches of a decision tree “can be interpreted as if-then rules of varying complexity,” and traversing the branches of the tree applies the rules of the model to each new piece of data. Examples of decision-tree based algorithms are C4.5 and ID3. Other pattern recognition approaches utilize Probabilistic Neural Networks (PNN), Bayesian learning methods, Support Vector Machines (SVM), Principal Component Analysis (PCA), Gaussian Mixture Models (GMM), K-nearest neighbor (KNN) algorithms, and linear and quadratic discriminants (Duda et al., 2001; Bishop, 2006). An in-depth discussion of these techniques is beyond the scope of this report.
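The “if-then rule” reading of a decision tree can be sketched in a few lines. This is a hedged illustration of tree traversal, not C4.5 or ID3 themselves, and the features and labels are hypothetical.

```python
def classify(node, features):
    """Walk the tree until a leaf (a string label) is reached.

    Each internal node tests one Boolean feature; each branch taken
    corresponds to applying one if-then rule.
    """
    while isinstance(node, dict):
        node = node["yes"] if features[node["test"]] else node["no"]
    return node

# Hypothetical tree, equivalent to: "if the page contains flagged keywords,
# then (if it links to a known site, label terror-related, else review);
# otherwise label benign."
tree = {"test": "contains_keywords",
        "yes": {"test": "links_known_site",
                "yes": "terror-related",
                "no": "review"},
        "no": "benign"}

label = classify(tree, {"contains_keywords": True, "links_known_site": True})
```

Algorithms such as C4.5 and ID3 differ from this sketch in that they also *learn* the tree from training data, typically by choosing the feature test at each node that best splits the remaining examples.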
In the counter-terrorism domain, the automatic classification of Web content as terror-related or not is perhaps the most straightforward type of classification problem, since it has only two possible outcomes. The application of classification algorithms to Web content is not so straightforward, however, and a significant component of the analysis process resides in manipulating the data into a form that is friendly to the preferred algorithms. Researchers at Ben-Gurion University of the Negev, Israel, have tackled this problem using a graph-based classification technique with the goal of automatically recognizing terror-related websites in English and Arabic (Markov and Last, 2005). Their technique translates the textual HTML content of a Web page into a graph where each node is a unique keyword and the connections between the nodes describe their position relative to each other and location in the document (title, link, text). Figure 5 depicts a CNN news page and its corresponding graphical representation. This representation of the document is an alternative to the vector representation utilized by the content-based Web monitoring methodology described above, and the technique was chosen because it captures the inherent structural information of the original document, such as order, proximity, and location of terms (Markov and Last, 2005).
The classification process begins with a training set of pre-classified Web documents and their corresponding graph models. Sub-graphs representing the key concepts of the document are extracted from the larger graph using the Smart and Naïve extraction algorithms (Markov and Last, 2005). The sub-graphs are analogous to the centroids of Internet browsing interests described in the content-based Web monitoring methodology. Previously unclassified documents are processed similarly and their sub-graphs are compared with the training data. Algorithms that can compute the similarity between classified and unclassified content directly from their graphical representations, such as the K-nearest neighbor (KNN) algorithm, are computationally intensive and therefore not suitable for real-time classification of large amounts of Web content. Accordingly, the researchers converted the unseen graphs to vectors of Boolean features where a “1” represents the presence of a sub-graph that matches the training data. Many different classification models can be applied to data in this type of vector format, such as neurofuzzy networks, artificial neural networks, Naïve Bayes, SVM, decision-tree (C4.5, ID3), and PNN classifiers.
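The graph-to-Boolean-vector step can be sketched as follows. In this hedged illustration the sub-graphs are reduced to plain edge sets and the examples are hypothetical; the published method extracts richer sub-graphs with the Smart and Naïve algorithms.

```python
def to_boolean_vector(doc_edges, training_subgraphs):
    """One Boolean feature per training sub-graph: 1 if every edge of the
    sub-graph appears in the document's graph, else 0."""
    return [1 if sub <= doc_edges else 0 for sub in training_subgraphs]

# Hypothetical sub-graphs extracted from the pre-classified training set.
training_subgraphs = [
    {("holy", "war")},        # feature 1
    {("press", "release")},   # feature 2
]

# Edge set of an unseen document's graph.
doc_edges = {("holy", "war"), ("war", "declared")}

vec = to_boolean_vector(doc_edges, training_subgraphs)
```

Once every document is a fixed-length Boolean vector like `vec`, any of the standard vector-space classifiers listed above can be trained and applied without further graph matching.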
The researchers tested this methodology on 648 manually collected Arabic Web documents, 200 of which were pre-classified as “terrorist-related” and 448 as “non-terrorist.” They used the ID3 decision tree classifier algorithm and tested the technique to determine the optimal number of nodes per graph, classification rate threshold, and sub-graph extraction algorithm. The results can be seen in Figure 6. The most accurate classification results were obtained using the Smart sub-graph extraction technique on 100 node graphs. Nine documents were classified incorrectly;
five non-terrorist sites were classified as “terrorist”, and four “terrorist” sites were missed (Markov and Last, 2005). While the testbed of this study was relatively small, the results indicate the potential for this technology to be successfully applied to much larger datasets.
Figure 6: Results for Naïve and Smart sub-graph extraction techniques of Arabic Web documents. The most accurate classification results were obtained using the Smart sub-graph extraction technique on 100 node graphs (Markov and Last, 2005).
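The error counts reported above imply straightforward performance figures. These are derived here from the published numbers (200 terror-related documents with 4 missed, 448 non-terror documents with 5 false alarms), not additional results.

```python
# Confusion-matrix entries implied by the reported counts.
tp, fn = 200 - 4, 4        # terror-related pages correctly flagged / missed
fp, tn = 5, 448 - 5        # non-terror pages falsely flagged / correctly passed

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of all 648 documents classified correctly
precision = tp / (tp + fp)                    # fraction of flagged pages that were truly terror-related
recall    = tp / (tp + fn)                    # fraction of terror-related pages that were caught
```

On these figures, accuracy is roughly 98.6%, precision roughly 97.5%, and recall 98%, which is why nine errors out of 648 documents is an encouraging result despite the modest testbed size.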
Data Analysis: Authorship Identification

Communication channels such as forum postings, chat room dialog, and email offer a fast, inexpensive, and largely anonymous way to reach millions, making them an ideal communication method for VNSAs who wish to namelessly disseminate extremist propaganda.
The application of authorship analysis techniques to this type of data can offer insights into the character and identity of the creator of an anonymous textual document. Characterization techniques “attempt to formulate an author profile by making inferences about gender, education, and cultural backgrounds on the basis of writing style,” while identification is a classification task that has the goal of assigning authorship to an anonymous document based on a stylistic comparison with previously classified documents (Abbasi and Chen, 2005).
The linguistic discipline of stylometry is the basis for most authorship analyses (Abbasi and Chen, 2005). The stylometric methodology applies statistical analysis techniques to a textual document, with the goal of extracting features that are indicative of the author’s unique writing style. This feature set can then be compared to documents with confirmed authorship that have been evaluated in a similar manner. There are four major categories of stylistic features that are the focus of such an analysis: lexical, syntactic, structural, and content-specific. A lexical feature breakdown contains information such as word frequency, number of words per sentence, total number of characters, and characters per sentence.
Certain generalizations about the author’s writing style can be made from a lexical analysis. For example, the inclusion of a large number of relatively long words can indicate that the author has a large vocabulary and a more complex writing style. Syntactic features refer to the order and pattern of words used to construct a sentence, which can be established through punctuation and the use of “function words” such as while and upon (Abbasi and Chen, 2005). An example of a syntactic signature would be an author’s consistent choice to use the word thus instead of hence in the same context. A document’s structural features, such as the layout of the text, structure of greetings, number of paragraphs, and average paragraph length, together with its use of content-specific words, are also of interest in a stylometric analysis. For example, in a forum where the topic of discussion is computers, an author’s use of the content-specific word RAM as opposed to memory is a distinguishing writing style characteristic.
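A toy lexical-feature extractor of the kind described above might look like this. It is an illustrative sketch only; the cited studies use far richer, language-specific feature sets, and this tokenizer is a stated simplification for English text.

```python
import re
from collections import Counter

def lexical_features(text):
    """Extract a few of the lexical features named above from raw text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "total_chars": len(text),                                  # total number of characters
        "word_freq": Counter(w.lower() for w in words),            # word frequency
        "avg_words_per_sentence": len(words) / len(sentences),     # words per sentence
        "avg_word_length": sum(map(len, words)) / len(words),      # proxy for vocabulary complexity
    }

f = lexical_features("Thus we proceed. Thus it ends!")
```

A feature dictionary like `f`, computed over a set of documents of confirmed authorship, is what an anonymous document's features would be compared against.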
As part of the Dark Web project, researchers at the University of Arizona have applied authorship identification techniques to English and Arabic Web forum postings collected using the spidering methodology described previously. The testbed for the study consisted of 20 Web forum messages for each of 20 authors, for a total of 400 messages per language. The English forum texts were downloaded from sites associated with the White Knights of the Ku Klux Klan, and the Arabic messages were collected from strongly anti-American forums associated with the Palestinian Al-Aqsa Martyrs Brigade. The researchers had to adapt traditional authorship identification techniques, which were developed for use on literary texts, to the character of Web forum texts. The latter tend to be shorter and more informal, and contain a substantial number of misspellings and abbreviations. The large number of potential authors further limited the efficacy of traditional techniques for this application.
Extracting features from the Arabic text posed additional challenges due to the language’s morphological characteristics. In particular, the diacritics that mark phonetic values in Arabic words are rarely used in online communication, which confounds feature extraction algorithms based on a methodology designed for English documents. In addition, Arabic words are shorter, which limits the usefulness of the text’s lexical information for establishing a unique writing style. For example, longer words in English documents indicate a more complex writing style, but this generalization does not translate to Arabic documents.
The researchers resolved these issues by implementing separate feature extraction methodologies developed specifically for Arabic and English text. In addition, the problems posed by the short, noisy nature of forum text were offset by the availability of data that is unique to Web content, such as the presence of hyperlinks and embedded images, font size and color choice, greeting structure, and in some cases contact information. This information expanded the breadth of the structural features category and further informed the classification techniques that were used to identify authorship (Abbasi and Chen, 2005).
After automatic feature extraction, classification algorithms were applied to the data in order to identify authorship based on comparison with pre-classified feature sets. The researchers experimented with two different machine learning classification algorithms: C4.5 and Support Vector Machines (SVM). The C4.5 technique is a decision-tree based algorithm chosen because of the ease with which decision trees can be visualized. SVM was chosen because it is a computational learning method that can handle noisy data. The study produced results that pleasantly surprised the researchers, especially in light of the results obtained in previous authorship attribution studies by Zheng et al. (2005), Peng et al. (2003), and Stamatatos et al. (2001) (Abbasi and Chen, 2005).
The SVM classification technique produced the best results for both languages, achieving 97.00% accuracy for English and 94.83% for Arabic when all four feature categories were incorporated into the analysis. Using this multilingual methodology, the group plans to investigate the scalability of the technique for application to a much larger group of potential authors. In addition, they plan to perform a more comprehensive analysis of the English and Arabic feature sets across texts to see if some of the attributes, such as the use of persuasive or violent language, are indicative of a stylistic signature of the group as a whole.
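The overall pipeline can be made concrete with a toy stand-in for the final classification step. The study itself used SVM and C4.5; the nearest-profile rule and the three-feature author profiles below are invented purely for illustration of how stylometric vectors support attribution.

```python
from math import dist

# Hypothetical per-author stylometric profiles, each an average feature
# vector: (avg word length, words per sentence, rate of the word "thus").
profiles = {
    "author_A": (4.1, 16.0, 0.30),
    "author_B": (5.6, 24.0, 0.02),
}

def attribute(features):
    """Attribute an anonymous message to the author whose averaged
    stylometric profile is nearest in Euclidean distance."""
    return min(profiles, key=lambda a: dist(profiles[a], features))

guess = attribute((4.0, 15.0, 0.28))
```

In the actual study the feature vectors are far longer (spanning the lexical, syntactic, structural, and content-specific categories), and an SVM draws the decision boundaries instead of a simple distance rule; scaling that setup beyond 20 candidate authors is exactly the open question the group plans to investigate.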
Data Analysis: Qualitative Content Analysis