

TellMe Final Poster - Hofstra University


Introduction

● Aggregate news articles from various sites and identify the overarching topics for each set of articles

● Extract unbiased statements from articles that correlate with these topics, and identify which statements are most likely to be biased.

● Use these extracted statements to form short yet useful summaries for readers.

● Run as much of the processing as possible in parallel to increase the efficiency of the system.

Goals

Web Crawler
● Uses the jsoup Java library to find articles relating to the user's search, and returns the corresponding URLs.
● Parses the document body to ensure it is relevant to the search.

Topic Modeller
● Uses the University of Massachusetts machine learning library's (MALLET) ParallelTopicModel.
● Originally, we implemented the Apache Spark LDA topic modeller, but it was too slow to produce real-time results.

Opinionator

Automatic Summarization
● Extraction-based summarization instead of abstraction-based; use existing sentences to produce a new summary.
● Abstraction-based summarization was too difficult to implement given our time constraints, so we went with extraction.
● Tokenize the paragraphs into sentences, then choose sentences based on the output from the Opinionator.

Distributed System
● Originally had sixty four-core executors.
● Transitioned to twelve ten-core executors without dynamic allocation due to algorithm constraints.
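The executor layout described above maps onto standard Spark submission flags. This is a sketch of the kind of invocation the text implies; the master URL and application JAR name are hypothetical placeholders, not taken from the poster:

```shell
# Twelve 10-core executors with dynamic allocation disabled
# (executor counts/cores from the poster; other values are placeholders).
spark-submit \
  --master spark://master:7077 \
  --num-executors 12 \
  --executor-cores 10 \
  --conf spark.dynamicAllocation.enabled=false \
  tellme-processing.jar
```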

Methods

Results

● One search query across the six news sources we use generates around 7-8 verified statements and 5-6 unverified statements.
● All statements were relevant to the original search, with 1-2 anomalies, i.e. statements that are not necessarily biased but are only found in one source.
● Apache Spark was tuned to run in half the time of the default settings.

Conclusion
● Text analysis algorithms, like those seen in NLP libraries, are complicated and thorough, but become even more convoluted with the more complicated sentences seen in news articles.
● Apache Spark is a distributed batch-processing system which we were using for real-time processing. Apache Storm would have been a better choice in terms of performance.
● Extraction-based summarization and comparison is easier to implement, but doesn't provide results as accurate as breaking down sentences into abstract components and comparing them.
● Proper utilization of a pipeline and breaking down processes into smaller components allows for faster performance when run in parallel.

Future Work
● Personalized notifications for subjects you're interested in
● Support for Google Home and Amazon Echo devices
● The ability to "follow" a story for continuous updates as they're received
● Abstraction-based summarization instead of extraction
● Use of ontologies for semantic analysis and clustering of sentences
● Use of Elasticsearch to store URLs, with the Web Crawler running in the background
● Use of Apache Storm instead of Apache Spark

Problem

There are too many news sources talking about similar topics, each with its own biases, and the only way for readers to know what is actually happening is to cross-reference other sources relating to the topic.

Solution

TellMe is a website that analyzes articles on major news sites to determine the bias of each news source on a given topic.

TellMe uses six major international news organizations to crawl and gather articles pertaining to the given topic.

Faculty Advisor: Dr. Bo Tang | Spring 2017
Christopher Davie, Amy Topka, Zach Vampola

TellMe

System Design

● For our system design, we have an Apache web server handling the web application. All requests this application receives are forwarded to our Apache Spark distributed system. This system controls all of our actual data processing, and returns the results to the web application to be displayed to the user.

Implementation

1. Web Crawler - The web crawler uses Java's jsoup library to continuously search through each source and all its recent articles. The crawler evaluates each webpage, and will return the URL of each article it finds that is relevant to the original term entered by the user.
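The poster does not specify how the crawler decides relevance. A minimal sketch of one plausible filter, operating on article text already extracted as a plain string (e.g. via jsoup's `Document.body().text()`), and assuming relevance means every query term appears in the body:

```java
import java.util.Arrays;

// Hypothetical sketch of the crawler's relevance check; the real
// project's criteria are not given on the poster.
public class RelevanceFilter {

    // An article is kept when every term of the user's search query
    // appears somewhere in the extracted document body.
    static boolean isRelevant(String body, String query) {
        String lower = body.toLowerCase();
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .allMatch(lower::contains);
    }

    public static void main(String[] args) {
        System.out.println(isRelevant("The senate passed the budget bill today.", "senate budget")); // true
        System.out.println(isRelevant("Local sports roundup.", "senate budget"));                    // false
    }
}
```

A real crawler would likely also weight term frequency and placement (headline vs. body), but the boolean gate above captures the keep-or-discard decision the text describes.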

2. Topic Modeller - For our topic modeller we chose to use a Latent Dirichlet Allocation (LDA) modeller due to its popularity and reliable results. We used the topic modeller to take the corpus of documents and extract all the important topics for each set of articles that we gather.
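The project uses MALLET's ParallelTopicModel for the actual LDA. As a self-contained illustration of the input/output shape of "extracting important topics from a corpus", here is a much simpler stand-in that ranks terms by document frequency; this is explicitly not LDA and not the project's code:

```java
import java.util.*;
import java.util.stream.*;

// Toy stand-in for topic extraction: ranks terms by how many documents
// they appear in. The real system uses MALLET's ParallelTopicModel (LDA);
// this sketch only illustrates the corpus-in, top-terms-out shape.
public class TopicSketch {

    static List<String> topTerms(List<String> documents, int k) {
        Map<String, Long> docFreq = documents.stream()
            .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+"))
                                  .filter(w -> w.length() > 3)   // crude short-word filter
                                  .distinct())                    // count each word once per document
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return docFreq.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "The election results were announced by officials",
            "Officials confirmed the election outcome last night",
            "Voters awaited election results across the country");
        System.out.println(topTerms(docs, 1)); // [election]
    }
}
```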

3. Opinionator - The Opinionator uses document clustering techniques to create clusters of sentences. Clusters with more diverse sources are considered to be verified, while less diverse clusters are considered to be unverified.
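The poster does not give the Opinionator's actual diversity threshold. One simple reading of "more diverse sources means verified" can be sketched as follows, with the threshold of three distinct sources being an assumption for illustration:

```java
import java.util.*;

// Hedged sketch of the Opinionator's verified/unverified split.
// A cluster is a list of (sentence, source) pairs; the poster does not
// give the real diversity test, so the distinct-source threshold here
// is an assumption.
public class ClusterVerifier {

    record Member(String sentence, String source) {}

    // A cluster counts as verified when enough distinct news sources
    // contributed a sentence to it.
    static boolean isVerified(List<Member> cluster, int minDistinctSources) {
        Set<String> sources = new HashSet<>();
        for (Member m : cluster) sources.add(m.source());
        return sources.size() >= minDistinctSources;
    }

    public static void main(String[] args) {
        List<Member> diverse = List.of(
            new Member("Talks resumed on Monday.", "BBC"),
            new Member("Negotiations restarted Monday.", "Reuters"),
            new Member("The two sides met again Monday.", "AP"));
        List<Member> narrow = List.of(
            new Member("Sources say a deal is near.", "BBC"),
            new Member("A deal may be imminent.", "BBC"));
        System.out.println(isVerified(diverse, 3)); // true
        System.out.println(isVerified(narrow, 3));  // false
    }
}
```

Counting distinct sources rather than cluster size matches the poster's framing: a large cluster drawn from one outlet is still "unverified".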

4. Automatic Summarization - Initially we planned on using abstraction-based summarization, but we ran into problems with creating new sentences. Because of this, we switched to extraction-based summarization, where the sentences are extracted directly from the articles and the most representative sentences are used.
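The poster does not say how "most representative" is scored. A common, minimal approach (a generic illustration, not the project's actual scoring function) is to tokenize into sentences and rank each sentence by the corpus frequency of its words:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of extraction-based summarization: score each sentence by the
// frequency of its words across the whole text, then keep the top-scoring
// sentences verbatim. The real system chooses sentences based on the
// Opinionator's clusters; this generic frequency scoring is an assumption.
public class ExtractiveSummarizer {

    static List<String> summarize(String text, int maxSentences) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        // Word frequencies across the whole text act as a cheap salience signal.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (w.length() > 3) freq.merge(w, 1, Integer::sum);
        // Return the top-scoring sentences.
        return Arrays.stream(sentences)
            .sorted(Comparator.comparingInt((String s) -> score(s, freq)).reversed())
            .limit(maxSentences)
            .collect(Collectors.toList());
    }

    static int score(String sentence, Map<String, Integer> freq) {
        int total = 0;
        for (String w : sentence.toLowerCase().split("\\W+"))
            total += freq.getOrDefault(w, 0);
        return total;
    }

    public static void main(String[] args) {
        String text = "The council approved the budget. "
                    + "The budget vote followed weeks of debate. "
                    + "Weather was mild.";
        System.out.println(summarize(text, 1));
    }
}
```

Because sentences are reused verbatim, the summary is always grammatical, which is exactly the property that made extraction easier than abstraction under the project's time constraints.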

5. Distributed System

After gathering these articles, TellMe uses various machine learning algorithms to extract words that likely relate to the same or similar topics in all the articles. These topics are then used to cluster sentences from different articles together, and determine what all the sentences relate to.

[System diagram: Master Server → Workers → Executors]

Acknowledgements

● Thank you to Dr. Tang, Dr. Krish, Dr. Currie, Dr. Lindo, and Dr. Arabshian for their advice and support on this project.
● Thank you to the School of Engineering and Applied Science for giving us access to the resources to implement this project.
● Special thanks to Alex Rosenberg for providing us with technical support and keying us into the Research & Innovation Lab.
● Thank you to our friends and families who supported us while undertaking and working long hours on this project.