Project group knowAAN
Final presentation
Computer Science Education Group
University of Paderborn
October 20th 2011
Overview
- Introduction
- System components & Work flow
- Demonstration
- Development process
- Summary & Outlook
- Time for further detailed questions
PG knowAAN 2
Overview: First part
- Goals
- Extraction & Storage (of data)
- Exploration (of data)
- System components & Work flow
- Analysis & Visualization (of data)
Goals
- Explore research networks
- Based on: Artifacts (scientific publications) and metadata
- Combination and analysis of data
- Computation of similarities of full texts
- Support for conference management system Ginkgo
- Data visualization
- Recommendations
(Source: PG knowAAN project description)
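The goal "computation of similarities of full texts" can be illustrated with a small sketch. The slides do not say which similarity measure the project uses, so the cosine similarity over raw term frequencies shown here is only an assumption, and the function names `term_frequencies` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lower-case the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if one is empty)."""
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two texts that share no terms score 0.0, and a text compared with itself scores 1.0 up to floating-point rounding. A real system would probably weight terms (for example by inverse document frequency) before comparing them.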
Imagine you are interested in a conference. You have downloaded the papers from 2 or 3 years.
Now you have nearly 100 publications. How do you explore them?
100 publications. Do you know any tools for that?
Extraction & Storage
First step: Extract data and store it.
Exploration
Second step: Explore data.
Exploring a conference
Which extracted data is available for a publication?
→ Database schema
Tables (column name and type):

publication: id GUID, lucuid VARCHAR(512), title VARCHAR(512), booktitle VARCHAR(512), normtitle VARCHAR(512), date VARCHAR(512), editor VARCHAR(512), journal VARCHAR(512), note VARCHAR(512), pages VARCHAR(512), publisher VARCHAR(512), tech VARCHAR(512), volume VARCHAR(512), number VARCHAR(512), rawstring VARCHAR(4096), xmlfile VARCHAR(512), pdffile VARCHAR(512), topicfile VARCHAR(512), created BIGINT, modified BIGINT
author: id GUID, text VARCHAR(512), normtext VARCHAR(512), firstname VARCHAR(512), lastname VARCHAR(512), created BIGINT, modified BIGINT
pub_aut: publication_id GUID, author_id GUID
affiliation: id GUID, text VARCHAR(512), location_id GUID
address: id GUID, text VARCHAR(512), location_id GUID
pub_aff: publication_id GUID, affiliation_id GUID
pub_add: publication_id GUID, address_id GUID
citation: publication1_id GUID, publication2_id GUID
discipline: id GUID, text VARCHAR(512), parent_id GUID
location: id GUID, latitude DOUBLE, longitude DOUBLE, text VARCHAR(512)
keyword: id GUID, text VARCHAR(512)
pub_key: publication_id GUID, keyword_id GUID, score DOUBLE, source VARCHAR(512)
pub_evt: publication_id GUID, event_id GUID
pub_dis: publication_id GUID, discipline_id GUID
pub_con: publication_id GUID, concept_id GUID, score DOUBLE, source VARCHAR(512)
concept: id GUID, text VARCHAR(512)
event: id GUID, text VARCHAR(512), filepath VARCHAR(512), predecessor_id GUID, successor_id GUID
eventseries: id GUID, text VARCHAR(512), filepath VARCHAR(512)
evt_evs: event_id GUID, eventseries_id GUID
aut_add: author_id GUID, address_id GUID
aut_aff: author_id GUID, affiliation_id GUID
pub_cat: publication_id GUID, category_id GUID, score DOUBLE, source VARCHAR(512)
category: id GUID, text VARCHAR(512)

Further entries (no columns shown in the diagram):
bib_coupling, co_author, co_citation, keyword_count, discipline_count, category_count, concept_count, evt_pub_aut_count
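As an illustration of how entries such as `co_author` and `bib_coupling` could be derived from the link tables `pub_aut` and `citation`, here is a minimal sketch using SQLite. It is not the project's actual implementation: the direction of `citation` (publication1 cites publication2), the TEXT column types, the demo rows and the names `build_demo_db`, `CO_AUTHOR_SQL` and `BIB_COUPLING_SQL` are all assumptions.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two link tables from the schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE pub_aut (publication_id TEXT, author_id TEXT);
        CREATE TABLE citation (publication1_id TEXT, publication2_id TEXT);
    """)
    # Invented demo data: a1 and a2 wrote p1 and p2 together.
    con.executemany("INSERT INTO pub_aut VALUES (?, ?)",
                    [("p1", "a1"), ("p1", "a2"),
                     ("p2", "a1"), ("p2", "a2"),
                     ("p3", "a3")])
    # Assumption: publication1 cites publication2.
    con.executemany("INSERT INTO citation VALUES (?, ?)",
                    [("p1", "p3"), ("p2", "p3"),
                     ("p1", "p4"), ("p2", "p4")])
    return con


# Author pairs with the number of publications they share.
CO_AUTHOR_SQL = """
SELECT x.author_id, y.author_id, COUNT(*) AS shared_publications
FROM pub_aut AS x
JOIN pub_aut AS y
  ON x.publication_id = y.publication_id AND x.author_id < y.author_id
GROUP BY x.author_id, y.author_id
"""

# Bibliographic coupling: publication pairs with the number of references they share.
BIB_COUPLING_SQL = """
SELECT x.publication1_id, y.publication1_id, COUNT(*) AS shared_references
FROM citation AS x
JOIN citation AS y
  ON x.publication2_id = y.publication2_id AND x.publication1_id < y.publication1_id
GROUP BY x.publication1_id, y.publication1_id
"""
```

The `<` comparison in each join keeps every pair once and excludes self-pairs. Co-citation would follow the same pattern with the roles of the two `citation` columns swapped.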
System components & Work flow
How is our system structured?
→ Some examples.
Components
- FileStorage
- Backend
- xmlBuilder
- TopicExtraction
- TF-Component
- TrendDetection
- Roundtrip
- Recommendation
- PDFToText
- Clustering
- DB
- Parscit
- DataBase
- SolrWebServices
- DocBrowser
- FrontendReferenceExtraction
- ParscitTrainer
Connector labels in the diagram: JDBC, Model, WebServices, FileSystem
Sequence diagram (call → returned value)
Participants: Languagedetection, DB, Solr, NounExtraction, Lemmatizer, Parscit, PDFToText, RoundTripExecutor, RoundTrip, DocumentBrowser
a/1) .addPDF
a/2) .writeToFS → Path
a/3) .createThread, .submitThread
b/1) .run
b/2) .getText → Text
b/3) .ParseFullText → ParscitXML
b/4) .extractBodyAndAstract → BodyAndAbstract
b/5) .getLanguage → LanguageString
b/6) .lemmatize → LemmatizedText
b/7) .extractNouns → NounsList
b/8) .lemmatizeNounslist → LemmatizedNouns
b/9) .ReduceToTopNouns → TopNouns
b/10) .writeToFiles → Paths
b/11) .addTexts → Solrid
b/12) .addPublication
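The two phases of the sequence (a: accept the file and hand it to a thread, b: run the processing steps in order) can be sketched as follows. This is only a structural illustration under strong assumptions: a plain-text file stands in for the PDF, the three step functions are hypothetical placeholders for only some of the steps b/2 to b/12, and none of this is the project's real code.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def get_text(state):
    """Placeholder for b/2: read the stored file (a text file stands in for a PDF)."""
    state["text"] = Path(state["path"]).read_text(encoding="utf-8")
    return state


def extract_nouns(state):
    """Placeholder for b/7: treat capitalised words as nouns."""
    state["nouns"] = [w for w in state["text"].split() if w[:1].isupper()]
    return state


def add_publication(state):
    """Placeholder for b/12: mark the publication as stored."""
    state["stored"] = True
    return state


# Phase b steps, executed one after another on a shared state dict.
STEPS = [get_text, extract_nouns, add_publication]


def run(path):
    """Phase b: run every step in order and return the final state."""
    state = {"path": str(path)}
    for step in STEPS:
        state = step(state)
    return state


def add_pdf(executor, storage_dir, name, content):
    """Phase a: write the upload to the file system, then submit the background job."""
    path = Path(storage_dir) / name
    path.write_text(content, encoding="utf-8")
    return executor.submit(run, path)


def demo():
    """Upload one document and wait for the background processing to finish."""
    with tempfile.TemporaryDirectory() as d, ThreadPoolExecutor(max_workers=2) as ex:
        future = add_pdf(ex, d, "paper.txt", "Graph Clustering of research Networks")
        return future.result()
```

Returning a future from `add_pdf` keeps the upload call short while the longer extraction steps run in the background, which mirrors the split between the a and b steps in the diagram.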
Work flow
Analysis & Visualization
Third step: Analyze and visualize data.
Analysis of authors
Analysis of scientific publications
Demonstration
Now: Demo.
Image: http://www.flickr.com/photos/plaisanter/5525977163/
Development process
Technologies
Jersey
Methods of agile software development
FDD, XP, Scrum
- Weekly meetings
- Sit together (as much as possible)
- Automated build system
- Continuous integration
- Issue tracking
Summary and Outlook
Summary and future work
Summary
- Integrated processing of scientific papers
- Aggregated visualization of authors, publications and events
- Computation of various analyses over the data
- Cleaning functionality for automatically processed data
Future work
- Parallelized clustering
- Additional graphical visualization
- Improved extraction of metadata from PDF files
Thank you for your attention
Questions?