Lab 1 – LASI Description 1

Running head: LAB 1 – LASI DESCRIPTION

Lab 1 – LASI Product Description

Brittany Johnson

CS411

Janet Brunelle

March 18, 2013

Version 2

Table of Contents

1 INTRODUCTION

2 PRODUCT DESCRIPTION

2.1 Key Product Features and Capabilities

2.2 Major Components (Hardware/Software)

3 IDENTIFICATION OF CASE STUDY

4 PRODUCT PROTOTYPE DESCRIPTION

4.1 Prototype Architecture (Hardware/Software)

4.2 Prototype Features and Capabilities

GLOSSARY

REFERENCES

List of Figures

Figure 1. Top Results Output

Figure 2. Word Relationships Output

Figure 3. Word Count and Weighting Output

Figure 4. AID Process: Assessment

Figure 5. Prototype Hardware and Software Component Diagram

List of Tables

Table 1. Feature comparison between prototype and real world product

Lab 1 – LASI Product Description

1 INTRODUCTION

LASI stands for Linguistic Analysis for Subject Identification. It is a stand-alone theme

finding application conceived by the Old Dominion University CS410 Red Group. It is designed

to be a decision support tool for large, multi-document linguistic analysis and allow for more

accurate and consistent results. Linguistic Analysis, with respect to the current project, is the

contextual study of written works and how the words combine to form an overall meaning.

Themes are subject-object-verb relationships that LASI is attempting to generate from the

input set and are important because they help the reader to comprehend and summarize what has

been read. It is even more difficult to come to a conclusion when the number of documents

increases because the theme across all of the documents may not be the theme of each of the

individual documents. The complexity of a topic and the reader’s familiarity with it play an

important role in a reader’s comprehension.

This comprehension, along with the ability to summarize the material, is important in

being able to communicate the content of a document. Thus, it is often difficult for people to

identify a common theme over a large set of documents in a timely, consistent, and objective

manner. LASI will help the reader come to an informed conclusion by providing a

weighted list of potential themes.

2 PRODUCT DESCRIPTION

LASI will be an open-source, stand-alone piece of software designed to run on a

consumer grade laptop. LASI will be able to detect themes across many documents and can

provide both individual and cross document analysis to determine a single theme. LASI’s ability

to analyze multiple documents to find a common theme makes it a great decision support tool for

teachers, students, research analysts and those that would need to read through large sets of

documents on a frequent basis.

Teachers, for example, would be able to use LASI for an initial analysis of student papers

to check whether each paper is consistent with its assigned topic. Both students and research

analysts could use LASI to quickly assess the usefulness of scientific and literary publications for

the topic that they are researching.

2.1 Key Product Features and Capabilities

Through the use of Optical Character Recognition (OCR) and a parser that is integrated

into LASI, the user has the ability to create a project with multiple file types including DOC,

DOCX, PPT, PPTX, TXT and PDF. By finding the commonalities between the documents using

their parts of speech and statistical analysis, a common theme can be revealed.

Once a project is created, the files can be viewed in plaintext form in the LASI user

interface. Documents can be added once the project has been created, as well as after the

existing documents have been analyzed. If the project has already been compiled, the new documents

will be analyzed and then added to the overall results. While the project is being created, there is also the option

for the user to add their own dictionary of company-specific jargon as well as assumptions about

the content. This will help LASI to tailor its analysis to the content and increase the statistical

likelihood of determining a theme.

Figure 1. Top Results Output

Once the documents have been analyzed, the results can be viewed in three different

format types: Top Results, Word Relationships, and Word Count and Weighting. The top results

will be represented graphically based on the user’s preferred chart type. The types of charts

available include tornado charts, bar graphs, and pie charts. Figure 1 shows a tornado chart

of the 10 most likely themes across all of the documents, listed in descending

order based on the word weight. Each of the documents may also be viewed individually, where

the data will be represented similarly.
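
As an illustration only, the ranking behind the Top Results view could be sketched as follows. The function name `top_themes`, the sample themes, and their weights are hypothetical and do not come from the LASI code base, which is written in C#. Python is used here purely for brevity.

```python
from typing import Dict, List, Tuple


def top_themes(weights: Dict[str, float], limit: int = 10) -> List[Tuple[str, float]]:
    """Return up to `limit` (theme, weight) pairs, heaviest first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical weights, for illustration only.
sample = {"budget": 4.5, "schedule": 7.25, "staffing": 4.5, "risk": 9.0}
for theme, weight in top_themes(sample, limit=3):
    print(f"{theme}: {weight}")
```

The same ranked list could feed any of the chart types, since a tornado chart, bar graph, or pie chart only differs in how the (theme, weight) pairs are drawn.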

Figure 2. Word Relationships Output

The word relationships, as shown in Figure 2, will also be displayed for each document.

Each word is colorized based on its part-of-speech. This will allow the user to see the

relationships between all of the words in a document. The links between the words are an

important visual aid for helping the user to understand the importance of individual words.

Figure 3. Word Count and Weighting Output

Results will also be displayed based on the individual word count and weight. The weight

that will be displayed is based on the weighting algorithm. In Figure 3, this is shown as a simple

table that can be sorted by word, frequency, and weight. This will show how each document

affected the total results and the importance of individual words. Once the project has finished

being analyzed, the results can be printed or exported as PDF, JPG, or PNG files.
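
A minimal sketch of how such a table could be assembled is given below. It assumes plain-text input and whitespace tokenization, and the names `count_words` and `build_rows` are hypothetical. The weight column is omitted because the weighting algorithm is described separately; only the per-word frequency and each document's contribution to it are shown.

```python
from collections import Counter
from typing import Dict, List


def count_words(text: str) -> Counter:
    """Count lower-cased tokens after stripping simple surrounding punctuation."""
    tokens = (token.strip(".,;:!?\"'()").lower() for token in text.split())
    return Counter(token for token in tokens if token)


def build_rows(documents: Dict[str, str]) -> List[dict]:
    """Build one row per word: total frequency plus each document's share."""
    per_document = {name: count_words(text) for name, text in documents.items()}
    totals: Counter = Counter()
    for counts in per_document.values():
        totals.update(counts)
    return [
        {
            "word": word,
            "frequency": total,
            "by_document": {name: counts[word] for name, counts in per_document.items()},
        }
        for word, total in totals.items()
    ]


# Hypothetical two-document project.
documents = {
    "a.txt": "The system fails. The system recovers.",
    "b.txt": "A system audit.",
}
rows = build_rows(documents)
by_frequency = sorted(rows, key=lambda row: (-row["frequency"], row["word"]))
by_word = sorted(rows, key=lambda row: row["word"])
```

Sorting the same rows by a different key corresponds to the user clicking a different column heading in the table.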

2.2 Major Components (Hardware/Software)

LASI requires a few hardware specifications for the product to run at an optimal level. It

is preferable that it is run on a high-end, business-grade computer with at least 8 GB of

DDR3 SDRAM and a quad-core CPU. It also requires that the user provide secondary

storage space for documentation.

The first software component of LASI is the graphical user interface. This application can

be run locally on the user’s machine. This is a Windows Presentation Foundation (WPF) project

using XAML to define the structure of the views and C# to provide the interactivity.

The second software component is the file system. It manages converting files and

invoking the tagger. After the text file is tagged, it is then passed to a tagged file parser which

converts the text into word and phrase types which represent the elements of the document at run

time. B2XTranslator is third-party, open-source software that is used to convert file

types. When documents are added to a project in the GUI, it takes each DOCX file and converts it to an

XML file. Once the document is in XML, it can be converted again into a form usable by the

parts-of-speech tagger.

The third software component is the parts-of-speech tagger, SharpNLP, an open-source C# natural

language processing tool. SharpNLP utilizes the Penn Treebank parts-of-speech tags to define

the parts of speech. SharpNLP will assign each word a type and place groups of words into phrase types

before writing them back to a file. Once a document has been tagged, each word is assigned a word

type which corresponds to its part of speech given by the tagger. Phrase types are groups of

words that have been put together. Each of these phrase types contains a list of words and the

attributes that the syntactic phrase types represent.
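
The step from tagger output to word types can be sketched as below. The tag names are standard Penn Treebank tags, but the `word/TAG` token format, the coarse word-type labels, and the function names are assumptions made for this example. They are not SharpNLP's API or LASI's class names.

```python
from typing import List, Tuple

# Penn Treebank tag prefixes mapped to coarse word types.
# The labels on the right are hypothetical, chosen for this sketch.
_PREFIX_TO_TYPE = [
    ("NN", "noun"),        # NN, NNS, NNP, NNPS
    ("VB", "verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),   # JJ, JJR, JJS
    ("RB", "adverb"),      # RB, RBR, RBS
    ("PRP", "pronoun"),    # PRP, PRP$
    ("DT", "determiner"),  # DT
]


def word_type(tag: str) -> str:
    """Map a Penn Treebank tag to a coarse word type, or 'other' if unmapped."""
    for prefix, kind in _PREFIX_TO_TYPE:
        if tag.startswith(prefix):
            return kind
    return "other"


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Parse whitespace-separated 'word/TAG' tokens into (word, word type) pairs."""
    pairs = []
    for token in line.split():
        word, separator, tag = token.rpartition("/")
        if not separator:  # token carries no tag at all
            word, tag = token, ""
        pairs.append((word, word_type(tag)))
    return pairs


print(parse_tagged("The/DT analysts/NNS quickly/RB read/VBD it/PRP"))
```

Matching on tag prefixes keeps the table short, at the cost of ignoring distinctions such as singular versus plural nouns that a fuller implementation might retain.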

The fourth software component is the LASI algorithm. The LASI algorithm is written in

C#. The LASI algorithm ties word and phrase types together based on their syntactic

relationship via a state-machine-derived logic flow. The document can be traversed in three

ways: Word-wise, Reference-wise, and Tree-wise. When moving through the document in a

Word-wise manner, the document is broken up by individual words. Moving through the

document in a Reference-wise manner allows the document to be viewed based on the

words and phrases that reference each other. A Tree-wise manner follows a specific word to its

referenced words and so on.
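
The three traversal modes can be illustrated with a small sketch. The `Word` class, its `references` field, and the three function names are hypothetical stand-ins for whatever structures LASI uses at run time. The example assumes each word simply holds a list of the words it refers to.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set, Tuple


@dataclass(eq=False)  # compare words by identity, not by content
class Word:
    text: str
    references: List["Word"] = field(default_factory=list)


def word_wise(document: List[Word]) -> Iterator[Word]:
    """Visit every word in document order."""
    yield from document


def reference_wise(document: List[Word]) -> Iterator[Tuple[Word, Word]]:
    """Visit every (word, referenced word) pair in document order."""
    for word in document:
        for target in word.references:
            yield word, target


def tree_wise(start: Word) -> Iterator[Word]:
    """Follow one word to its referenced words, then to theirs, depth first."""
    seen: Set[int] = set()
    stack = [start]
    while stack:
        word = stack.pop()
        if id(word) in seen:  # guard against reference cycles
            continue
        seen.add(id(word))
        yield word
        stack.extend(reversed(word.references))


# Hypothetical document: "The analyst reads the report on the budget. She ..."
analyst, report, budget = Word("analyst"), Word("report"), Word("budget")
report.references.append(budget)
reads = Word("reads", [analyst, report])
she = Word("she", [analyst])
document = [analyst, reads, report, budget, she]

print([word.text for word in tree_wise(reads)])
```

Word-wise traversal needs only the document order, while the other two depend entirely on the bindings that have been established between words.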

The algorithm focuses on the direct and indirect binding of words and phrases. Direct

binding includes the binding of nouns to verbs, adverbs to verbs, adjectives to nouns, and

determiners to nouns. Indirect binding will include the binding of pronouns to nouns.
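
To make the idea of direct binding concrete, the sketch below binds words in a single left-to-right pass. It is a deliberately simplified heuristic and not the LASI algorithm. It assumes modifiers precede the word they modify, treats the first unbound noun before a verb as its subject and the first noun after it as its object, and omits indirect (pronoun) binding. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]        # (word, coarse word type)
Binding = Tuple[str, str, str]  # (dependent word, relation, head word)


def bind_directly(sentence: List[Tagged]) -> List[Binding]:
    """Bind words in one left-to-right pass using a few simple rules."""
    bindings: List[Binding] = []
    waiting_for_noun: List[Tagged] = []  # determiners and adjectives seen before a noun
    waiting_for_verb: List[str] = []     # adverbs seen before a verb
    subject: Optional[str] = None        # latest noun not yet bound to a verb
    open_verb: Optional[str] = None      # latest verb still lacking an object

    for word, kind in sentence:
        if kind in ("determiner", "adjective"):
            waiting_for_noun.append((word, kind))
        elif kind == "adverb":
            waiting_for_verb.append(word)
        elif kind == "noun":
            for modifier, modifier_kind in waiting_for_noun:
                bindings.append((modifier, modifier_kind, word))
            waiting_for_noun.clear()
            if open_verb is not None:
                bindings.append((word, "object", open_verb))
                open_verb = None
            else:
                subject = word
        elif kind == "verb":
            for adverb in waiting_for_verb:
                bindings.append((adverb, "adverb", word))
            waiting_for_verb.clear()
            if subject is not None:
                bindings.append((subject, "subject", word))
                subject = None
            open_verb = word
    return bindings


sentence = [
    ("The", "determiner"), ("new", "adjective"), ("analyst", "noun"),
    ("quickly", "adverb"), ("reads", "verb"),
    ("the", "determiner"), ("report", "noun"),
]
for binding in bind_directly(sentence):
    print(binding)
```

The local variables play the role of states: what the pass does with the next word depends on which modifiers, subject, or verb are still waiting to be bound.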

Once the word and phrase binding is finished, the algorithm will begin weighting the words based on their

frequency as well as how they are used. The weighting metrics for each word will be based

on a raw frequency as well as a relative frequency. Each word will have a raw frequency that is

based on a simple word count, the number of times that the word was used in a particular

manner, and a frequency count for synonyms of that word. The relative frequency will be based

on the subject, verb, and object relationships between words as well as where a word is located in a

document. The more bindings that are made, the more accurate the results become.
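
One possible way to combine these metrics is sketched below. The text above names the inputs but not the formula. Summing the three counts into a raw frequency, using the relative frequency as a multiplier, and the two bonus coefficients are therefore all assumptions made for illustration, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    count: int             # simple word count
    usage_count: int       # times the word was used in a particular manner
    synonym_count: int     # occurrences of synonyms of the word
    svo_bindings: int      # subject, verb, or object relationships involving the word
    first_position: float  # 0.0 = start of the document, 1.0 = end


def raw_frequency(stats: WordStats) -> int:
    """Assumed combination: add the three counts together."""
    return stats.count + stats.usage_count + stats.synonym_count


def relative_frequency(stats: WordStats,
                       svo_bonus: float = 0.5,
                       position_bonus: float = 0.25) -> float:
    """Assumed multiplier: reward bindings and early placement in the document."""
    return (1.0
            + svo_bonus * stats.svo_bindings
            + position_bonus * (1.0 - stats.first_position))


def weight(stats: WordStats) -> float:
    return raw_frequency(stats) * relative_frequency(stats)


# Hypothetical word: used 4 times, twice as a subject, one synonym,
# two bindings, first appearing at the very start of the document.
stats = WordStats(count=4, usage_count=2, synonym_count=1,
                  svo_bindings=2, first_position=0.0)
print(weight(stats))
```

Under this sketch, a word that takes part in more bindings receives a larger multiplier, which matches the observation that additional bindings sharpen the results.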

3 IDENTIFICATION OF CASE STUDY

Dr. Patrick Hester and Dr. Tom Meyers work for the National Center for System of

Systems Engineering (NCSOSE) consulting with organizations and businesses that need an

outside view on issues or future plans of improvement. When consulting with their client, they

use the Assessment Improvement Design Methodology (AID) to assist the client in both

realizing and achieving their goals. The focus is on evaluating current performance with respect

to client intent, enhancing performance based on evaluations of current operations, and

procedure versus alternatives. Using this, they create a new method for improvement that is

aligned with their client’s intent.

Figure 4. AID Process: Assessment

In following this process, both Dr. Hester and Dr. Meyers must familiarize themselves

with their client’s domain. Essentially, they must become experts in the inner workings of their

client’s organization and the field of work. The level of difficulty for this task is dependent on

whether their client provides usable documentation. LASI will assist in the process of defining

what the potential problem is and whether it coincides with what the client believes is the issue.

In Figure 4, LASI would fit into the Document Analysis portion of the Assessment phase. The

results that LASI produces can be used to verify Dr. Hester and Dr. Meyers’s assessment of the

situation and serve as visual proof of their reasoning for the client.

4 PRODUCT PROTOTYPE DESCRIPTION

Due to time constraints, the LASI prototype has much less functionality than the

real-world product. LASI will still function in the same way, but in a less complex and less

process-intensive manner. A prototype needs to be developed in order to narrow the scope but still have a

product that can demonstrate its capabilities.

4.1 Prototype Architecture (Hardware/Software)

The hardware and software components for the prototype will remain largely unchanged

from the real-world solution that was discussed in Sections 2 and 2.2. Figure 5 shows the

hardware and software components of the LASI prototype. The hardware required to run the

LASI prototype is a laptop or desktop with at least 8GB of DDR3 RAM and a Quad-Core CPU.

For development purposes we will be using a Virtual Machine for a testing and code writing

environment. The software needed for the prototype includes our part-of-speech tagger and a

document converter that converts DOC and DOCX files to TXT files. Other software includes the

LASI algorithms and the LASI GUI.

Figure 5. Prototype Hardware and Software Component Diagram

The third-party software for the LASI prototype is the SharpNLP Part-of-Speech Tagger

and the B2XTranslator. The SharpNLP POS Tagger tags words and phrases with the respective

parts-of-speech for use by the LASI algorithm. The B2XTranslator converts DOC to DOCX

files. The files can then be converted to TXT files.

In the LASI prototype, word and phrase binding works the same as it would in the Real-

World solution. Words and word phrases are interrelated based on the tagged part-of-speech and

how they relate to one another within phrases, paragraphs, and the document. The weighting

algorithm will assign each word a weight based on its part-of-speech, frequency count and the

number of times and ways it is referenced.

4.2 Prototype Features and Capabilities

As shown in Table 1, there are a few key differences between the real-world product and our

prototype. The types of documents that the LASI prototype accepts have been limited to just DOC

and DOCX. Scanned text recognition has been removed from the prototype since there is not

enough time to get the OCR software fully functioning. The prototype will limit the number of

documents that can be added to one project to three to five, and there is a size limitation of 10

pages on each of those documents to ensure that the algorithm can function in a timely manner.

Rather than focusing on every part of speech, in the LASI prototype we will focus on noun-verb

binding. A few of the more complex features also did not make it into the

prototype, such as user-defined dictionaries, synonym identification, and content assumption. Despite

the removal of these non-essential features, the prototype will still function very similarly to how

the real world product would have functioned.
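
The prototype's input limits could be enforced with a check along the following lines. This is a sketch under stated assumptions: the upper bound of five documents is taken from the "three to five" range above, page counts are assumed to be known in advance, and the constant and function names are hypothetical.

```python
from pathlib import PurePath
from typing import List, Tuple

ALLOWED_EXTENSIONS = {".doc", ".docx"}  # prototype accepts DOC and DOCX only
MAX_DOCUMENTS = 5                       # assumed upper bound of "three to five"
MAX_PAGES = 10                          # per-document page limit


def validate_project(documents: List[Tuple[str, int]]) -> List[str]:
    """Check (file name, page count) pairs; an empty result means the project is acceptable."""
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append(
            f"{len(documents)} documents exceeds the limit of {MAX_DOCUMENTS}"
        )
    for name, pages in documents:
        extension = PurePath(name).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type '{extension}'")
        if pages > MAX_PAGES:
            problems.append(f"{name}: {pages} pages exceeds the {MAX_PAGES}-page limit")
    return problems


print(validate_project([("slides.pptx", 4), ("report.doc", 12)]))
```

Returning every problem at once, rather than stopping at the first, would let the user interface report all rejected files in a single message.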

GLOSSARY

A.I.D. : Assessment Improvement Design

A.I.D. Process: A process that provides quantitative and qualitative basis to identify problems

and determine the feasibility of solutions.

Analysis: Detailed examination of the elements or structure of something, typically as a basis for

interpretation.

Document: A document herein refers to a formally written, expository paper which expounds,

via a declarative approach, on a relatively quantifiable issue, goal, or area of research.

Head word: A locally distinct word within a phrase which, by its syntactic associations,

determines the category of the phrase itself.

LASI: Linguistic Analysis for Subject Identification

Lexer: Part of the parsing tool that isolates each word, its part of speech, and location in a

sentence into machine readable tokens. These are stored as elements in an XML file.

Linguistic Analysis: The scientific analysis of a language.

Optical Character Recognition: Conversion of scanned images to text.

Parser: Takes in DOC and DOCX files and converts them to TXT files.

Part of Speech Tagger: Software utility that associates words with the parts-of-speech in a

sentence.

Phrase: A group of words standing together as a conceptual unit, typically forming a new

component.

Semantic Analysis: Relating the syntactical structure of words to their language independent

meanings.

SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts of

speech.

Strategic Document: Document produced by a client that defines their goals, visions,

and missions.

Subject Identification: Finds the main actor in a sentence. However, in a broader sense, the

word subject is synonymous with the themes of one or more documents. Subject

identification is the process of determining subjects, or themes of a document or

documents.

Syntactic Analysis: A form of Linguistic analysis that focuses on grammar in sentences and

identifies themes based on structure and formatting. Unlike Semantic Analysis, it

identifies key words based on their location in the sentence, rather than their overall

meaning throughout the document.

Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set

Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or

location) of a selected element in a document

Tagged Set: A group of words, whose part of speech and location in a sentence have been

identified by the parser

Tagged Word Object: A word that has an associated part-of-speech.

Tornado chart: A horizontal bar graph like visualization, representing the relative frequency or

significance of elements, sorted in descending order by magnitude

Word Binding: The process of binding part-of-speech to a word.

WordNet: Compiler and provider of our thesaurus.

Word Weight: A numeric value, associated with each syntactically and lexically unique word

in a written work, indicating its significance.

REFERENCES

SharpNLP. (n.d.). Retrieved from http://sharpnlp.codeplex.com/

Office binary to open xml. (n.d.). Retrieved from http://b2xtranslator.sourceforge.net/