

TellMe Final Poster - Hofstra University


Introduction

● Aggregate news articles from various sites and identify the overarching topics for each set of articles

● Extract unbiased statements from articles that correlate with these topics, and identify which statements are most likely to be biased.

● Use these extracted statements to form short yet useful summaries for readers.

● Run as much of the processing as possible in parallel to increase the efficiency of the system.

Goals

Web Crawler
● Uses the jsoup Java library to find articles relating to the user's search, and returns the corresponding URLs.
● Parses the document body to ensure it is relevant to the search.

Topic Modeller
● Uses the University of Massachusetts machine learning library's (MALLET) ParallelTopicModel.
● Originally, we implemented the Apache Spark LDA topic modeller, but it was too slow to produce real-time results.

Opinionator

Automatic Summarization
● Extraction-based summarization instead of abstraction-based; use existing sentences to produce a new summary.
● Abstraction-based summarization was too difficult to implement given our time constraints, so we went with extraction.
● Tokenize the paragraphs into sentences, then choose sentences based on the output from the Opinionator.

Distributed System
● Originally had sixty four-core executors.
● Transitioned to twelve ten-core executors without dynamic allocation due to algorithm constraints.
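The executor layout described above maps onto standard Spark submission flags. This is a sketch of the kind of invocation the text implies; the master URL and application JAR name are hypothetical placeholders, not taken from the poster:

```shell
# Twelve 10-core executors with dynamic allocation disabled
# (executor counts/cores from the poster; other values are placeholders).
spark-submit \
  --master spark://master:7077 \
  --num-executors 12 \
  --executor-cores 10 \
  --conf spark.dynamicAllocation.enabled=false \
  tellme-processing.jar
```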

Methods

Results

● One search query across the six news sources we use generates around 7-8 verified statements and 5-6 unverified statements.
● All statements were relevant to the original search, with 1-2 anomalies, i.e. statements that are not necessarily biased but are only found in one source.
● Apache Spark was tuned to run in half the time of the default settings.

Conclusion
● Text analysis algorithms, like those seen in NLP libraries, are complicated and thorough, but become even more convoluted with the more complicated sentences seen in news articles.
● Apache Spark is a distributed batch-processing system which we were using for real-time processing. Apache Storm would have been a better choice in terms of performance.
● Extraction-based summarization and comparison is easier to implement, but doesn't provide results as accurate as breaking down sentences into abstract components and comparing them.
● Proper utilization of a pipeline and breaking down processes into smaller components allows for faster performance when run in parallel.

Future Work
● Personalized notifications for subjects you're interested in
● Support for Google Home and Amazon Echo devices
● The ability to "follow" a story for continuous updates as they're received
● Abstraction-based summarization instead of extraction
● Use of ontologies for semantic analysis and clustering of sentences
● Use of Elasticsearch to store URLs, with the Web Crawler running in the background
● Use of Apache Storm instead of Apache Spark

Problem

There are too many news sources talking about similar topics, each with its own biases, and the only way for readers to know what is actually happening is to cross-reference other sources relating to the topic.

Solution

TellMe is a website that analyzes articles on major news sites to determine the bias of each news source on a given topic.

TellMe uses six major international news organizations to crawl and gather articles pertaining to the given topic.

Faculty Advisor: Dr. Bo Tang | Spring 2017
Christopher Davie, Amy Topka, Zach Vampola

TellMe

System Design

● For our system design, we have an Apache web server handling the web application. All requests this application receives are forwarded to our Apache Spark distributed system. This system controls all of our actual data processing, and returns the results to the web application to be displayed to the user.

Implementation

1. Web Crawler - The web crawler uses Java's jsoup library to continuously search through each source and all its recent articles. The crawler evaluates each webpage, and will return the URL of each article it finds that is relevant to the original term entered by the user.
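The poster does not specify how the crawler decides relevance. A minimal sketch of one plausible filter, operating on article text already extracted as a plain string (e.g. via jsoup's `Document.body().text()`), and assuming relevance means every query term appears in the body:

```java
import java.util.Arrays;

// Hypothetical sketch of the crawler's relevance check; the real
// project's criteria are not given on the poster.
public class RelevanceFilter {

    // An article is kept when every term of the user's search query
    // appears somewhere in the extracted document body.
    static boolean isRelevant(String body, String query) {
        String lower = body.toLowerCase();
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .allMatch(lower::contains);
    }

    public static void main(String[] args) {
        System.out.println(isRelevant("The senate passed the budget bill today.", "senate budget")); // true
        System.out.println(isRelevant("Local sports roundup.", "senate budget"));                    // false
    }
}
```

A real crawler would likely also weight term frequency and placement (headline vs. body), but the boolean gate above captures the keep-or-discard decision the text describes.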

2. Topic Modeller - For our topic modeller we chose to use a Latent Dirichlet Allocation (LDA) modeller due to its popularity and reliable results. We used the topic modeller to take the corpus of documents and extract all the important topics for each set of articles that we gather.
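The project uses MALLET's ParallelTopicModel for the actual LDA. As a self-contained illustration of the input/output shape of "extracting important topics from a corpus", here is a much simpler stand-in that ranks terms by document frequency; this is explicitly not LDA and not the project's code:

```java
import java.util.*;
import java.util.stream.*;

// Toy stand-in for topic extraction: ranks terms by how many documents
// they appear in. The real system uses MALLET's ParallelTopicModel (LDA);
// this sketch only illustrates the corpus-in, top-terms-out shape.
public class TopicSketch {

    static List<String> topTerms(List<String> documents, int k) {
        Map<String, Long> docFreq = documents.stream()
            .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+"))
                                  .filter(w -> w.length() > 3)   // crude short-word filter
                                  .distinct())                    // count each word once per document
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return docFreq.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "The election results were announced by officials",
            "Officials confirmed the election outcome last night",
            "Voters awaited election results across the country");
        System.out.println(topTerms(docs, 1)); // [election]
    }
}
```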

3. Opinionator - The Opinionator uses document clustering techniques to create clusters of sentences. Clusters with more diverse sources are considered to be verified, while less diverse clusters are considered to be unverified.
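The poster does not give the Opinionator's actual diversity threshold. One simple reading of "more diverse sources means verified" can be sketched as follows, with the threshold of three distinct sources being an assumption for illustration:

```java
import java.util.*;

// Hedged sketch of the Opinionator's verified/unverified split.
// A cluster is a list of (sentence, source) pairs; the poster does not
// give the real diversity test, so the distinct-source threshold here
// is an assumption.
public class ClusterVerifier {

    record Member(String sentence, String source) {}

    // A cluster counts as verified when enough distinct news sources
    // contributed a sentence to it.
    static boolean isVerified(List<Member> cluster, int minDistinctSources) {
        Set<String> sources = new HashSet<>();
        for (Member m : cluster) sources.add(m.source());
        return sources.size() >= minDistinctSources;
    }

    public static void main(String[] args) {
        List<Member> diverse = List.of(
            new Member("Talks resumed on Monday.", "BBC"),
            new Member("Negotiations restarted Monday.", "Reuters"),
            new Member("The two sides met again Monday.", "AP"));
        List<Member> narrow = List.of(
            new Member("Sources say a deal is near.", "BBC"),
            new Member("A deal may be imminent.", "BBC"));
        System.out.println(isVerified(diverse, 3)); // true
        System.out.println(isVerified(narrow, 3));  // false
    }
}
```

Counting distinct sources rather than cluster size matches the poster's framing: a large cluster drawn from one outlet is still "unverified".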

4. Automatic Summarization - Initially we planned on using abstraction-based summarization, but we ran into problems with creating new sentences. Because of this, we switched to extraction-based summarization, where the sentences are extracted directly from the articles and the most representative sentences are used.
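The poster does not say how "most representative" is scored. A common, minimal approach (a generic illustration, not the project's actual scoring function) is to tokenize into sentences and rank each sentence by the corpus frequency of its words:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of extraction-based summarization: score each sentence by the
// frequency of its words across the whole text, then keep the top-scoring
// sentences verbatim. The real system chooses sentences based on the
// Opinionator's clusters; this generic frequency scoring is an assumption.
public class ExtractiveSummarizer {

    static List<String> summarize(String text, int maxSentences) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        // Word frequencies across the whole text act as a cheap salience signal.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (w.length() > 3) freq.merge(w, 1, Integer::sum);
        // Return the top-scoring sentences.
        return Arrays.stream(sentences)
            .sorted(Comparator.comparingInt((String s) -> score(s, freq)).reversed())
            .limit(maxSentences)
            .collect(Collectors.toList());
    }

    static int score(String sentence, Map<String, Integer> freq) {
        int total = 0;
        for (String w : sentence.toLowerCase().split("\\W+"))
            total += freq.getOrDefault(w, 0);
        return total;
    }

    public static void main(String[] args) {
        String text = "The council approved the budget. "
                    + "The budget vote followed weeks of debate. "
                    + "Weather was mild.";
        System.out.println(summarize(text, 1));
    }
}
```

Because sentences are reused verbatim, the summary is always grammatical, which is exactly the property that made extraction easier than abstraction under the project's time constraints.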

5. Distributed System

After gathering these articles, TellMe uses various machine learning algorithms to extract words that likely relate to the same or similar topics in all the articles. These topics are then used to cluster sentences from different articles together, and determine what all the sentences relate to.

[System diagram: Master Server → Workers → Executors]

Acknowledgements

● Thank you to Dr. Tang, Dr. Krish, Dr. Currie, Dr. Lindo, and Dr. Arabshian for their advice and support on this project.
● Thank you to the School of Engineering and Applied Science for giving us access to the resources to implement this project.
● Special thanks to Alex Rosenberg for providing us with technical support and keying us into the Research & Innovation Lab.
● Thank you to our friends and families who supported us while undertaking and working long hours on this project.