TellMe Final Poster - Hofstra University
Introduction

Goals
● Aggregate news articles from various sites and identify the overarching topics for each set of articles.
● Extract unbiased statements from articles that correlate with these topics, and identify which statements are most likely to be biased.
● Use these extracted statements to form short but useful summaries for readers.
● Run as much of the processing as possible in parallel to increase the efficiency of the system.
Methods

Web Crawler
● Uses the jsoup Java library to find articles relating to the user's search and return the corresponding URLs.
● Parses each document body to ensure it is relevant to the search.

Topic Modeller
● Uses the ParallelTopicModel from the University of Massachusetts machine learning library, MALLET.
● Originally, we implemented the Apache Spark LDA topic modeller, but it was too slow to produce real-time results.

Opinionator

Automatic Summarization
● Extraction-based summarization instead of abstraction-based: use existing sentences to produce the summary.
● Abstraction-based summarization was too difficult to implement given our time constraints, so we went with extraction.
● Tokenize the paragraphs into sentences, then choose sentences based on the output from the Opinionator.

Distributed System
● Originally had sixty 4-core executors.
● Transitioned to twelve 10-core executors without dynamic allocation due to algorithm constraints.
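The executor layout described above would be set at job-submission time roughly as follows. This is a sketch: the main class and jar names are placeholders, and only the executor counts and the dynamic-allocation flag come from the poster.

```shell
# Sketch of a spark-submit invocation for the layout described above:
# twelve executors with ten cores each, dynamic allocation disabled.
# The class and jar names are placeholders, not the project's real ones.
spark-submit \
  --class edu.hofstra.tellme.Main \
  --num-executors 12 \
  --executor-cores 10 \
  --conf spark.dynamicAllocation.enabled=false \
  tellme.jar
```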
Results
● One search query across the six news sources we use generates around 7-8 verified statements and 5-6 unverified statements.
● All statements were relevant to the original search, with 1-2 anomalies, i.e., statements that are not necessarily biased but are found in only one source.
● Apache Spark was tuned to run in half the time of the default settings.
Conclusion
● Text analysis algorithms, like those in NLP libraries, are thorough but complicated, and become even more convoluted with the complex sentences found in news articles.
● Apache Spark is a distributed batch-processing system, which we were using for real-time processing. Apache Storm would have been a better choice in terms of performance.
● Extraction-based summarization and comparison is easier to implement, but does not provide results as accurate as breaking sentences down into abstract components and comparing them.
● Proper use of a pipeline, breaking processes down into smaller components, allows for faster performance when run in parallel.
Future Work
● Personalized notifications for subjects you're interested in
● Support for Google Home and Amazon Echo
● The ability to "follow" a story for continuous updates as they're received
● Abstraction-based summarization instead of extraction
● Use of ontologies for semantic analysis and clustering of sentences
● Use of Elasticsearch to store URLs, with the Web Crawler running in the background
● Use of Apache Storm instead of Apache Spark
Problem
There are too many news sources covering similar topics, each with its own biases, and the only way for readers to know what is actually happening is to cross-reference other sources on the topic.
Solution
TellMe is a website that analyzes articles on major news sites to determine the bias of each news source on a given topic.

TellMe crawls six major international news organizations to gather articles pertaining to the given topic.
Faculty Advisor: Dr. Bo Tang
Spring 2017
Christopher Davie, Amy Topka, Zach Vampola
TellMe
System Design
● For our system design, we have an Apache web server handling the web application. All requests this application receives are forwarded to our Apache Spark distributed system, which controls all of our actual data processing and returns the results to the web application to be displayed to the user.
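The end-to-end flow — query in, crawled articles, extracted topics, final summary — can be sketched as a pipeline of composed stages. The stage bodies here are stand-in stubs, not TellMe's actual classes; only the stage order comes from the poster.

```java
import java.util.List;
import java.util.function.Function;

// Illustrative sketch of the processing pipeline: each stage feeds the next,
// mirroring query -> crawl -> topics -> summary. The stubs below are
// hypothetical placeholders, not TellMe's real components.
public class PipelineSketch {
    // Compose crawl -> model -> summarize; returns the final summary string.
    static String run(String query) {
        Function<String, List<String>> crawl =
            q -> List.of("article about " + q);                  // Web Crawler stub
        Function<List<String>, List<String>> model =
            articles -> List.of("topic:" + articles.size());      // Topic Modeller stub
        Function<List<String>, String> summarize =
            topics -> "summary of " + String.join(", ", topics);  // Summarizer stub
        return crawl.andThen(model).andThen(summarize).apply(query);
    }

    public static void main(String[] args) {
        System.out.println(run("trade"));
    }
}
```

Composing stages with `Function.andThen` keeps each component independently testable, which is what lets the real system parallelize the stages.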
Implementation
1. Web Crawler - The web crawler uses the jsoup Java library to continuously search through each source and all its recent articles. The crawler evaluates each webpage and returns the URL of each article it finds that is relevant to the original term entered by the user.
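The crawler's relevance check on each page body can be sketched as a simple keyword test. This is a simplification under the assumption that relevance means the body text contains every search term; the actual jsoup-based matching may be more sophisticated.

```java
import java.util.Arrays;

// Sketch of the crawler's relevance filter: a page is kept only if its body
// text contains every term of the user's search (case-insensitive).
// An assumed simplification of whatever matching the real crawler performs.
public class RelevanceFilter {
    static boolean isRelevant(String bodyText, String query) {
        String body = bodyText.toLowerCase();
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .allMatch(body::contains);
    }

    public static void main(String[] args) {
        String body = "The central bank raised interest rates again this quarter.";
        System.out.println(isRelevant(body, "interest rates")); // true
        System.out.println(isRelevant(body, "election"));       // false
    }
}
```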
2. Topic Modeller - For our topic modeller we chose a Latent Dirichlet Allocation (LDA) modeller due to its popularity and reliable results. We used the topic modeller to take the corpus of documents and extract the important topics for each set of articles we gather.
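As a rough illustration of the input an LDA modeller consumes, here is a minimal tokenize-and-count step: each document is reduced to a bag of word counts. This is generic preprocessing, not MALLET's own pipe classes, and stop-word removal and stemming are omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the preprocessing an LDA topic modeller needs: each document is
// reduced to a bag of word counts before topic inference runs over the corpus.
public class BagOfWords {
    static Map<String, Integer> counts(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = counts("Rates rise as rates climb");
        System.out.println(c.get("rates")); // 2
    }
}
```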
3. Opinionator - The Opinionator uses document clustering techniques to create clusters of sentences. Clusters with more diverse sources are considered to be verified, while less diverse clusters are considered to be unverified.
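The verified/unverified split described above can be sketched as a count of distinct sources per cluster. The threshold of three sources below is an assumption for illustration; the poster does not state the actual cutoff.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the Opinionator's verification rule: a cluster of sentences is
// "verified" when it draws on enough distinct news sources.
// The minSources threshold is illustrative, not the project's real cutoff.
public class ClusterVerifier {
    record Sentence(String text, String source) {}

    static boolean isVerified(List<Sentence> cluster, int minSources) {
        Set<String> sources = cluster.stream()
                                     .map(Sentence::source)
                                     .collect(Collectors.toSet());
        return sources.size() >= minSources;
    }

    public static void main(String[] args) {
        List<Sentence> cluster = List.of(
            new Sentence("Rates rose.", "A"),
            new Sentence("Rates went up.", "B"),
            new Sentence("Rates increased.", "C"));
        System.out.println(isVerified(cluster, 3)); // true
    }
}
```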
4. Automatic Summarization - Initially we planned on using abstraction-based summarization, but we ran into problems with creating new sentences. Because of this, we switched to extraction-based summarization, where sentences are extracted directly from the articles and the most representative ones are used.
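Extraction-based summarization of this kind can be sketched as frequency scoring: score each sentence by how frequent its words are across the whole set, then keep the top-scoring sentences. This is a generic extractive approach, not necessarily the exact scoring TellMe used.

```java
import java.util.*;

// Sketch of extractive summarization: pick the sentence whose words are most
// frequent across all sentences. A generic approach, not TellMe's exact one.
public class ExtractiveSummarizer {
    static String topSentence(List<String> sentences) {
        // Word frequencies over all sentences.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);

        // Score each sentence by the sum of its word frequencies.
        return sentences.stream()
            .max(Comparator.comparingInt((String s) ->
                Arrays.stream(s.toLowerCase().split("\\W+"))
                      .filter(w -> !w.isEmpty())
                      .mapToInt(w -> freq.getOrDefault(w, 0))
                      .sum()))
            .orElse("");
    }

    public static void main(String[] args) {
        List<String> sents = List.of(
            "Rates rose sharply",
            "Rates rose again today",
            "The weather was mild");
        System.out.println(topSentence(sents));
    }
}
```

A real summarizer would also penalize near-duplicate picks and cap summary length; this sketch only shows the core scoring idea.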
5. Distributed System - All processing runs on our Apache Spark cluster, where a master server distributes work across workers, each running executors.

After gathering these articles, TellMe uses various machine learning algorithms to extract words that likely relate to the same or similar topics across all the articles. These topics are then used to cluster sentences from different articles together and determine what all the sentences relate to.
References & Acknowledgements
McKeown, Kathleen R., Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Judith L. Klavans, Ani Nenkova, Carl Sable, Barry Schiffman, and Sergey Sigelman. "Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster." Human Language Technology (2002): 280-85. Morgan Kaufmann, 2002. <http://newsblaster.cs.columbia.edu/papers/hlt-blaster.pdf>.
● Thank you to Dr. Tang, Dr. Krish, Dr. Currie, Dr. Lindo, and Dr. Arabshian for their advice and support on this project.
● Thank you to the School of Engineering and Applied Science for giving us access to the resources to implement this project.
● Special thanks to Alex Rosenberg for providing technical support and keying us into the Research & Innovation Lab.
● Thank you to our friends and families who supported us as we undertook this project and worked long hours on it.