Project group knowAAN
Final presentation

Computer Science Education Group
University of Paderborn

October 20th 2011

Overview

- Introduction
- System components & Work flow
- Demonstration
- Development process
- Summary & Outlook
- Time for further, more detailed questions

Overview: First part

- Goals
- Extraction & Storage (of data)
- Exploration (of data)
- System components & Work flow
- Analysis & Visualization (of data)

Goals

- Explore research networks
- Based on: Artifacts (scientific publications) and metadata
- Combination and analysis of data
- Computation of similarities of full texts
- Support for conference management system Ginkgo
- Data visualization
- Recommendations

(Source: PG knowAAN project description)
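
The goal "Computation of similarities of full texts" can be illustrated with a small sketch. The slides do not state which similarity measure the project uses, so the cosine similarity over term-frequency vectors shown here is an assumption chosen for illustration, and the function names `term_frequencies` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercase ASCII word tokens in a text (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Only terms that occur in both texts contribute to the dot product.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two texts with the same word counts score close to 1, and texts without any shared word score 0. A real system would probably add stop-word removal and term weighting on top of this.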

Imagine you are interested in a conference. You downloaded the papers of 2 or 3 years.

Now you have nearly 100 publications. How do you explore them?

100 publications. Do you know any tools?

Extraction & Storage


First step: Extract data and store it.

Exploration

Second step: Explore data.

Exploring a conference

Which extracted data is available for a publication?

→ Database schema

publication
    id              GUID
    lucuid          VARCHAR(512)
    title           VARCHAR(512)
    booktitle       VARCHAR(512)
    normtitle       VARCHAR(512)
    date            VARCHAR(512)
    editor          VARCHAR(512)
    journal         VARCHAR(512)
    note            VARCHAR(512)
    pages           VARCHAR(512)
    publisher       VARCHAR(512)
    tech            VARCHAR(512)
    volume          VARCHAR(512)
    number          VARCHAR(512)
    rawstring       VARCHAR(4096)
    xmlfile         VARCHAR(512)
    pdffile         VARCHAR(512)
    topicfile       VARCHAR(512)
    created         BIGINT
    modified        BIGINT

author
    id              GUID
    text            VARCHAR(512)
    normtext        VARCHAR(512)
    firstname       VARCHAR(512)
    lastname        VARCHAR(512)
    created         BIGINT
    modified        BIGINT

pub_aut
    publication_id  GUID
    author_id       GUID

affiliation
    id              GUID
    text            VARCHAR(512)
    location_id     GUID

address
    id              GUID
    text            VARCHAR(512)
    location_id     GUID

pub_aff
    publication_id  GUID
    affiliation_id  GUID

pub_add
    publication_id  GUID
    address_id      GUID

citation
    publication1_id GUID
    publication2_id GUID

discipline
    id              GUID
    text            VARCHAR(512)
    parent_id       GUID

location
    id              GUID
    latitude        DOUBLE
    longitude       DOUBLE
    text            VARCHAR(512)

keyword
    id              GUID
    text            VARCHAR(512)

pub_key
    publication_id  GUID
    keyword_id      GUID
    score           DOUBLE
    source          VARCHAR(512)

pub_evt
    publication_id  GUID
    event_id        GUID

pub_dis
    publication_id  GUID
    discipline_id   GUID

pub_con
    publication_id  GUID
    concept_id      GUID
    score           DOUBLE
    source          VARCHAR(512)

concept
    id              GUID
    text            VARCHAR(512)

event
    id              GUID
    text            VARCHAR(512)
    filepath        VARCHAR(512)
    predecessor_id  GUID
    successor_id    GUID

eventseries
    id              GUID
    text            VARCHAR(512)
    filepath        VARCHAR(512)

evt_evs
    event_id        GUID
    eventseries_id  GUID

aut_add
    author_id       GUID
    address_id      GUID

aut_aff
    author_id       GUID
    affiliation_id  GUID

pub_cat
    publication_id  GUID
    category_id     GUID
    score           DOUBLE
    source          VARCHAR(512)

category
    id              GUID
    text            VARCHAR(512)

Listed in the schema without columns: bib_coupling, co_author, co_citation, keyword_count, discipline_count, category_count, concept_count, evt_pub_aut_count
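
To make the role of the link tables more concrete, the following sketch rebuilds `publication`, `author` and `pub_aut` in an in-memory SQLite database and looks up the co-authors of one author. Several things here are assumptions made for illustration: SQLite stands in for whatever database the project used, GUIDs are stored as text, only a few columns are created, the sample rows are made up, and the helper names `build_demo_db` and `co_authors` are hypothetical. The query is not taken from the project and may differ from how its `co_author` entry is computed.

```python
import sqlite3
import uuid


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with three tables from the schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE publication (id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE author (id TEXT PRIMARY KEY, text TEXT);
        CREATE TABLE pub_aut (
            publication_id TEXT REFERENCES publication(id),
            author_id TEXT REFERENCES author(id)
        );
        """
    )
    # Made-up sample data; GUIDs are stored as text (an assumption).
    authors = {name: str(uuid.uuid4()) for name in ("Author A", "Author B", "Author C")}
    papers = {"Paper 1": ["Author A", "Author B"], "Paper 2": ["Author A", "Author C"]}
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(guid, name) for name, guid in authors.items()],
    )
    for title, names in papers.items():
        pub_id = str(uuid.uuid4())
        conn.execute("INSERT INTO publication VALUES (?, ?)", (pub_id, title))
        conn.executemany(
            "INSERT INTO pub_aut VALUES (?, ?)",
            [(pub_id, authors[name]) for name in names],
        )
    return conn


def co_authors(conn: sqlite3.Connection, author_text: str) -> list[str]:
    """Return the names of all authors who share a publication with author_text."""
    rows = conn.execute(
        """
        SELECT DISTINCT a2.text
        FROM author a1
        JOIN pub_aut pa1 ON pa1.author_id = a1.id
        JOIN pub_aut pa2 ON pa2.publication_id = pa1.publication_id
        JOIN author a2 ON a2.id = pa2.author_id
        WHERE a1.text = ? AND a2.id <> a1.id
        ORDER BY a2.text
        """,
        (author_text,),
    ).fetchall()
    return [row[0] for row in rows]
```

The self-join on `pub_aut` is the key step: it pairs every authorship row with all other authorship rows of the same publication. The same pattern would apply to the other link tables such as `pub_key` or `pub_cat`.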

System components & Work flow


How is our system structured?

→ Some examples.


Components

Component diagram, components:

- FileStorage
- Backend
- xmlBuilder
- TopicExtraction
- TF-Component
- TrendDetection
- Roundtrip
- Recommendation
- PDFToText
- Clustering
- DB
- Parscit
- DataBase
- SolrWebServices
- DocBrowser
- FrontendReferenceExtraction
- ParscitTrainer

Interfaces between components: JDBC, Model, WebServices, FileSystem

Sequence diagram

Participants: Languagedetection, DB, Solr, NounExtraction, Lemmatizer, Parscit, PDFToText, RoundTripExecutor, RoundTrip, DocumentBrowser

Messages (call → return value):

a/1)  .addPDF
a/2)  .writeToFS → Path
a/3)  .createThread
      .submitThread
b/1)  .run
b/2)  .getText → Text
b/3)  .ParseFullText → ParscitXML
b/4)  .extractBodyAndAstract → BodyAndAbstract
b/5)  .getLanguage → LanguageString
b/6)  .lemmatize → LemmatizedText
b/7)  .extractNouns → NounsList
b/8)  .lemmatizeNounslist → LemmatizedNouns
b/9)  .ReduceToTopNouns → TopNouns
b/10) .writeToFiles → Paths
b/11) .addTexts → Solrid
b/12) .addPublication
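
Steps b/7 to b/9 of the sequence (extractNouns, lemmatizeNounslist, ReduceToTopNouns) can be sketched as a small chain of functions. This is only an illustration of the data flow: the slides do not say how nouns are extracted, lemmatized or ranked, so the noun extractor and lemmatizer are passed in as caller-supplied functions, ranking by plain frequency is an assumption, and the names `reduce_to_top_nouns` and `process_text` are hypothetical.

```python
from collections import Counter
from typing import Callable


def reduce_to_top_nouns(nouns: list[str], k: int = 10) -> list[str]:
    """Keep the k most frequent nouns; equal counts keep first-seen order."""
    return [noun for noun, _count in Counter(nouns).most_common(k)]


def process_text(
    text: str,
    extract_nouns: Callable[[str], list[str]],
    lemmatize: Callable[[str], str],
    k: int = 10,
) -> list[str]:
    """Chain the three steps: extract nouns, lemmatize them, keep the top k."""
    nouns = extract_nouns(text)                       # b/7 -> NounsList
    lemmatized = [lemmatize(noun) for noun in nouns]  # b/8 -> LemmatizedNouns
    return reduce_to_top_nouns(lemmatized, k)         # b/9 -> TopNouns
```

With toy stand-ins such as `str.split` for noun extraction and `lambda w: w.lower().rstrip("s")` for lemmatization, the input `"Graphs graph cluster Clusters graph"` with `k=2` yields `["graph", "cluster"]`. Passing the two steps in as functions keeps the sketch independent of any particular NLP library.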


Work flow

Analysis & Visualization


Third step: Analyze and visualize data.


Analysis of authors


Analysis of scientific publications

Demonstration

Now: Demo.

Image: http://www.flickr.com/photos/plaisanter/5525977163/

Development process

Technologies

Jersey

Methods of agile software development

FDD, XP, Scrum

- Weekly meetings
- Sit together (as much as possible)
- Automated building system
- Continuous integration
- Issue tracking

Summary and Outlook

Summary and future work

Summary

- Integrated processing of scientific papers
- Aggregated visualization of authors, publications and events
- Various analyses computed over the data
- Cleaning functionality for automatically processed data

Future work

- Parallelized clustering
- Additional graphical visualization
- Improve extraction of metadata from PDF files

Thank you for your attention

Questions?
