«Workshop held 9-10 January 2008 in Arlington, VA Prepared for US Strategic Command Global Innovation and Strategy Center (USSTRATCOM/GISC) Prepared ...»

Deterring VNSA in Cyberspace

The researchers assert that the intention of the ATDS is to aid law enforcement officials in tracking down suspected terrorists based on the content of their Web browsing. They envision that the system would run in real time and would be able to monitor a large group of users simultaneously without being detected. Ideally, officials would also have access to individuals’ identification information based on their IP addresses, which would require the cooperation of the ISPs or a court order (Shapira, 2005). The scalability of this technique is potentially an issue, since each Web page has to be vectorized and compared in near real time; this may not be feasible for monitoring large groups (Shapira, 2005). Additionally, researchers have yet to evaluate the methodology with a testbed of Web pages that mixes normal and abnormal page accesses, which would more accurately reflect the usage patterns of a real individual (Elovici et al., 2005). Currently, no new data have been published that address these issues.
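The "vectorize and compare" step that drives the scalability concern can be illustrated with a minimal sketch. It assumes a plain term-frequency vector and a cosine similarity test against a set of reference vectors; the function names, the reference vectors, and the 0.5 threshold are hypothetical and are not taken from the ATDS publications.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn page text into a sparse term-frequency vector (term -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_flagged(page_text, reference_vectors, threshold=0.5):
    """Flag a page that is close enough to any reference vector.

    The reference vectors and the 0.5 threshold are illustrative assumptions.
    """
    page = vectorize(page_text)
    return any(cosine_similarity(page, ref) >= threshold for ref in reference_vectors)
```

In this sketch every page access costs one vectorization plus one similarity computation per reference vector, so the total work grows with the number of monitored users multiplied by their page accesses, which is one way to see why near-real-time operation may become difficult for large groups.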

Although the capabilities of the ATDS have not yet been fully investigated, the underlying concept offers a new view of what Web monitoring can mean and introduces a framework suggesting that real-time monitoring of Web content may be achievable. Much future work is needed, however, to establish this type of methodology as a technique that law enforcement officials can feasibly use to identify suspicious individuals with a minimal number of false positives.

Data Analysis: Automatic Content Classification

The investigation of emergent behavior of VNSAs in cyberspace is complicated not only by the difficult task of obtaining a clean, relevant dataset, but also by the sheer volume of data that are available. Raw data require pruning to isolate terror-related content, and manual sorting techniques are simply not efficient enough to handle data sets of the magnitude of the 2TB Dark Web collection. Automatic classification of Web content is currently being investigated as a means of addressing this problem. Classification techniques are typically applied to a “function that has a discrete set of possible values,” and the algorithms have the goal of automatically identifying which of these values describes a previously unclassified piece of data (Last, 2005). The discrete values can represent any number of properties, as long as there is a way to quantify the presence or absence of the property within the data in question, making classification algorithms applicable to a broad range of problems. The type of data to be classified determines the appropriate method of classification. For example, the most general classification models are based on a decision-tree methodology to represent the conditional dependence between inputs. The branches of a decision tree “can be interpreted as if-then rules of varying complexity,” and traversing the branches of the tree applies the rules of the model to each new piece of data. Examples of decision-tree based algorithms are C4.5 and ID3. Other pattern recognition approaches utilize Probabilistic Neural Networks (PNN), Bayesian learning methods, Support Vector Machines (SVM), Principal Component Analysis (PCA), Gaussian Mixture Models (GMM), K-Nearest Neighbor (KNN) algorithms, and linear and quadratic discriminants (Duda et al., 2001; Bishop, 2006). An in-depth discussion of these techniques is beyond the scope of this report.
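The reading of tree branches as if-then rules can be made concrete with a short sketch. The tree below is hand-built for illustration only: the feature names and class labels are hypothetical, and algorithms such as C4.5 and ID3 would learn a structure of this kind from training data rather than have it written by hand.

```python
# Hypothetical hand-built tree. Each internal node tests one Boolean feature;
# each leaf is a class label string.
TREE = {
    "feature": "has_keyword",
    "branches": {
        True: {
            "feature": "links_to_known_site",
            "branches": {True: "relevant", False: "irrelevant"},
        },
        False: "irrelevant",
    },
}


def classify(tree, sample):
    """Follow one branch per test until a leaf (a class label) is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node
```

Each root-to-leaf path corresponds to one rule, for example "if has_keyword and links_to_known_site then relevant", and classifying a new item amounts to following the single path whose tests it satisfies.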

In the counter-terrorism domain, the automatic classification of Web content as terror-related or not is perhaps the most straightforward type of classification problem, since it has only two possible outcomes. The application of classification algorithms to Web content is not so straightforward, however, and a significant component of the analysis process resides in manipulating the data into a form that is friendly to the preferred algorithms. Researchers at Ben-Gurion University of the Negev, Israel, have tackled this problem using a graph-based classification technique with the goal of automatically recognizing terror-related websites in English and Arabic (Markov and Last, 2005). Their technique translates the textual HTML content of a Web page into a graph where each node is a unique keyword and the connections between the nodes describe their position relative to each other and location in the document (title, link, text). Figure 5 depicts a CNN news page and its corresponding graphical representation. This representation of the document is an alternative to the vector representation utilized by the content-based Web monitoring methodology described above, and the technique was chosen because it captures the inherent structural information of the original document, such as order, proximity, and location of terms (Markov and Last, 2005).
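A rough sketch of such a graph representation follows. It assumes that an edge links two words that appear next to each other and that the edge is labelled with the document section in which the pair occurs; the exact graph model used by Markov and Last may differ, and the function name and data layout here are hypothetical. Keyword selection (for example, removing common words) is omitted.

```python
import re


def build_graph(sections):
    """Build a directed, section-labelled word graph.

    sections maps a section name ("title", "link", "text") to its raw text.
    Returns (nodes, edges): nodes is the set of unique words, and edges is a
    set of (first_word, second_word, section) triples for adjacent words.
    """
    nodes, edges = set(), set()
    for section, text in sections.items():
        words = re.findall(r"\w+", text.lower())
        nodes.update(words)
        for first, second in zip(words, words[1:]):
            if first != second:  # skip self-loops from repeated words
                edges.add((first, second, section))
    return nodes, edges
```

Unlike a bag-of-words vector, this structure records that "storm" precedes "hits" and whether the pair occurred in the title or in the body text, which is the order, proximity, and location information the graph model is meant to preserve.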

[Figure 5: A CNN news page and its corresponding graph representation.]

The classification process begins with a training set of pre-classified Web documents and their corresponding graph models. Sub-graphs representing the key concepts of the document are extracted from the larger graph using the Smart and Naïve extraction algorithms (Markov and Last, 2005). The sub-graphs are analogous to the centroids of Internet browsing interests described in the content-based Web monitoring methodology. Previously unclassified documents are processed similarly and their sub-graphs are compared with the training data. Algorithms that can compute the similarity between classified and unclassified content directly from their graphical representations, such as the K-Nearest Neighbor (KNN) algorithm, are computationally intensive and therefore not suitable for real-time classification of large amounts of Web content. Accordingly, the researchers converted the unseen graphs to vectors of Boolean features where a “1” represents the presence of a sub-graph that matches the training data. Many different classification models can be applied to data in this type of vector format, such as neurofuzzy networks, artificial neural networks, Naïve Bayes, SVM, decision tree (C4.5, ID3), and PNN classifiers.
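The conversion from graphs to Boolean feature vectors can be sketched as follows. The sketch assumes that a document graph and each training sub-graph are stored as sets of labelled edges, and that a sub-graph "matches" when all of its edges occur in the document graph; because every node is a unique keyword, this reduces to a set-inclusion test. The function name and the exact matching rule are assumptions, not the matcher used in the study.

```python
def to_boolean_vector(document_edges, training_subgraphs):
    """Return one 0/1 feature per training sub-graph.

    document_edges is a set of (word, word, section) edges. Each entry of
    training_subgraphs is a set (or frozenset) of such edges. A feature is 1
    when every edge of the sub-graph is present in the document graph.
    """
    return [1 if subgraph <= document_edges else 0 for subgraph in training_subgraphs]


# Hypothetical example data.
document = {
    ("storm", "hits", "title"),
    ("hits", "coast", "title"),
    ("the", "storm", "text"),
}
subgraphs = [
    frozenset({("storm", "hits", "title")}),
    frozenset({("storm", "hits", "title"), ("hits", "coast", "title")}),
    frozenset({("flood", "hits", "text")}),
]
features = to_boolean_vector(document, subgraphs)
```

Once every document is reduced to a fixed-length vector of this kind, any standard vector-based classifier can be trained on it without further reference to the graphs.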

The researchers tested this methodology on 648 manually collected Arabic Web documents, 200 of which were pre-classified as “terrorist-related” and 448 as “non-terrorist.” They used the ID3 decision tree classifier algorithm and tested the technique to determine the optimal number of nodes per graph, classification rate threshold, and sub-graph extraction algorithm. The results can be seen in Figure 6. The most accurate classification results were obtained using the Smart sub-graph extraction technique on 100-node graphs. Nine documents were classified incorrectly; five non-terrorist sites were classified as “terrorist”, and four “terrorist” sites were missed (Markov and Last, 2005). While the testbed of this study was relatively small, the results indicate the potential for this technology to be successfully applied to much larger datasets.
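The error counts quoted above can be turned into standard summary rates. The following is a worked calculation, not a figure reported in the study, and it assumes that each of the 648 documents was counted exactly once in the evaluation; the function name is hypothetical.

```python
def confusion_metrics(positives, negatives, false_positives, false_negatives):
    """Derive summary rates from class sizes and the two error counts."""
    true_positives = positives - false_negatives
    true_negatives = negatives - false_positives
    total = positives + negatives
    return {
        "accuracy": (true_positives + true_negatives) / total,
        "precision": true_positives / (true_positives + false_positives),
        "recall": true_positives / positives,
        "false_positive_rate": false_positives / negatives,
    }


# 200 "terrorist-related" and 448 "non-terrorist" documents,
# 5 false alarms and 4 misses, as quoted in the text.
metrics = confusion_metrics(
    positives=200, negatives=448, false_positives=5, false_negatives=4
)
# accuracy  = 639 / 648, about 0.986
# precision = 196 / 201, about 0.975
# recall    = 196 / 200 = 0.98
```

Under that assumption, roughly 1.4% of documents were misclassified, and about 1.1% of the non-terrorist documents (5 of 448) raised a false alarm.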

Figure 6: Results for Naïve and Smart sub-graph extraction techniques on Arabic Web documents. The most accurate classification results were obtained using the Smart sub-graph extraction technique on 100-node graphs (Markov and Last, 2005).

Data Analysis: Authorship Identification

Communication channels such as forum postings, chat room dialog, and email offer a fast, inexpensive, and largely anonymous way to reach millions, making them an ideal communication method for VNSAs who wish to namelessly disseminate extremist propaganda.

The application of authorship analysis techniques to this type of data can offer insights into the character and identity of the creator of an anonymous textual document. Characterization techniques “attempt to formulate an author profile by making inferences about gender, education, and cultural backgrounds on the basis of writing style,” while identification is a classification task that has the goal of assigning authorship to an anonymous document based on a stylistic comparison with previously classified documents (Abbasi and Chen, 2005).

The linguistic discipline of stylometry is the basis for most authorship analyses (Abbasi and Chen, 2005). The stylometric methodology applies statistical analysis techniques to a textual document, with the goal of extracting features that are indicative of the author’s unique writing style. This feature set can then be compared to documents with confirmed authorship that have been evaluated in a similar manner. There are four major categories of stylistic features that are the focus of such an analysis: lexical, syntactic, structural, and content-specific. A lexical feature breakdown contains information such as word frequency, number of words per sentence, total number of characters, and characters per sentence.

Certain generalizations about the author’s writing style can be made from a lexical analysis. For example, the inclusion of a large number of relatively long words can indicate that the author has a large vocabulary and a more complex writing style. Syntactic features refer to the order and pattern of words used to construct a sentence, which can be established through punctuation and the use of “function words” such as “while” and “upon” (Abbasi and Chen, 2005). An example of a syntactic signature would be an author’s consistent choice to use the word “thus” instead of “hence” in the same context. A document’s structural features, such as the layout of the text, structure of greetings, number of paragraphs, and average paragraph length, and the use of content-specific words are also of interest in a stylometric analysis. For example, in a forum where the topic of discussion is computers, an author’s use of the content-specific word “RAM” as opposed to “memory” is a distinguishing writing style characteristic.
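A few of the lexical and syntactic features named above can be computed with a short sketch. The sentence splitter, the word pattern, and the four-word function-word list are simplifying assumptions chosen for illustration; they are not the feature set used by Abbasi and Chen, and the function name is hypothetical.

```python
import re
from collections import Counter

# Illustrative subset only; real function-word lists are much longer.
FUNCTION_WORDS = ("while", "upon", "thus", "hence")


def stylometric_features(text):
    """Compute simple lexical counts and function-word usage for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    sentence_count = max(len(sentences), 1)  # avoid division by zero
    return {
        "total_characters": len(text),
        "total_words": len(words),
        "words_per_sentence": len(words) / sentence_count,
        "characters_per_sentence": len(text) / sentence_count,
        "word_frequency": counts,
        "function_word_counts": {w: counts[w] for w in FUNCTION_WORDS},
    }
```

Feature dictionaries computed this way for texts of known authorship could then be compared with the features of an anonymous text, which is the comparison step that the identification task relies on.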

As part of the Dark Web project, researchers at the University of Arizona have applied authorship identification techniques to English and Arabic Web forum postings collected using the spidering methodology described previously. The testbed for the study consisted of 20 Web forum messages for each of 20 authors, for a total of 400 messages per language. The English forum texts were downloaded from sites associated with the White Knights of the Ku Klux Klan, and the Arabic messages were collected from strongly anti-American forums associated with the Palestinian Al-Aqsa Martyrs Brigade. The researchers had to adapt traditional authorship identification techniques, which were developed for use on literary texts, to the character of Web forum texts. The latter tend to be shorter and more informal, and contain a substantial number of misspellings and abbreviations. The large number of potential authors further limited the efficacy of traditional techniques for this application.

Extracting features from the Arabic text posed additional challenges due to the language’s morphological characteristics. In particular, the diacritics that mark phonetic values in Arabic words are rarely used in online communication, which confounds feature extraction algorithms based on a methodology designed for English documents. In addition, Arabic words are shorter, which limits the usefulness of the text’s lexical information for establishing a unique writing style. For example, longer words in English documents indicate a more complex writing style, but this generalization does not translate to Arabic documents.

The researchers resolved these issues by implementing separate feature extraction methodologies developed specifically for Arabic and English text. In addition, the problems posed by the short, noisy nature of forum text were offset by the availability of data that is unique to Web content, such as the presence of hyperlinks and embedded images, font size and color choice, greeting structure, and in some cases contact information. This information expanded the breadth of the structural features category and further informed the classification techniques that were used to identify authorship (Abbasi and Chen, 2005).

After automatic feature extraction, classification algorithms were applied to the data in order to identify authorship based on comparison with pre-classified feature sets. The researchers experimented with two different machine learning classification algorithms: C4.5 and Support Vector Machines (SVM). The C4.5 technique is a decision-tree based algorithm chosen because of the ease with which decision trees can be visualized. SVM was chosen because it is a computational learning method that can handle noisy data. The study produced results that pleasantly surprised the researchers, especially in light of the results obtained by Zheng et al. (2005), Peng et al. (2003), and Stamatatos et al. (2001) in previous authorship attribution studies (Abbasi and Chen, 2005).

The SVM classification technique produced the best results for both languages, achieving 97.00% accuracy for English and 94.83% for Arabic when all four feature categories were incorporated into the analysis. Using this multilingual methodology, the group plans to investigate the scalability of the technique for application to a much larger group of potential authors. In addition, they plan to perform a more comprehensive analysis of the English and Arabic feature sets across texts to see if some of the attributes, such as the use of persuasive or violent language, are indicative of a stylistic signature of the group as a whole.

Data Analysis: Qualitative Content Analysis
