streamit: dynamic visualization and interactive exploration of text streams jamal alsakrankent state...

Post on 19-Dec-2015

TRANSCRIPT

  • Slide 1
  • STREAMIT: Dynamic Visualization and Interactive Exploration of Text Streams. Jamal Alsakran (Kent State University, Ohio); Yang Chen (University of North Carolina - Charlotte); Ye Zhao, presenter (Kent State University, Ohio); Jing Yang (University of North Carolina - Charlotte); Dongning Luo (University of North Carolina - Charlotte)
  • Slide 2
  • Text Stream. Textual data explosion: emails, news, messages, broadcasts, arriving daily, hourly, minutely. Urgent need for efficient processing and analysis; visualization is an effective approach. Text stream: a text collection that constantly evolves as new documents continuously arrive; keywords/topics are not known in advance.
  • Slide 3
  • Challenges to Visual Exploration. Temporal evolution: existing topics, emerging topics, their relations, clusters and outliers. No collection pre-scanning or presumed a priori knowledge: live processing is required, in contrast to a traditional text database. Flexible user interaction for changing and adjusting the information-seeking focus/preference. Processing large volumes of text in real time.
  • Slide 4
  • STREAMIT System. Dynamic force-directed simulation: naturally handles continuously inserted documents. Continual evolvement: continuous depiction and analysis of growing document collections; automatic grouping and separating; no time window used; no abrupt change. Dynamic processing: keyword vectors dynamically updated; no prerecorded scan.
  • Slide 5
  • STREAMIT System (continued). Interactive exploration: live adjustment of visualization parameters. Dynamic keyword importance: presents the significance of a keyword at a certain time; reflects changing user demand and interest. Scalable optimization: fast computing; GPU acceleration. Animation and interaction: easy user control and interaction tools.
  • Slide 6
  • Related Work. Multidimensional scaling (MDS) & projection: IN-SPIRE 99, InfoSky 02, Hipp 08, Exemplar-based 09. Temporal data trends: ThemeRiver 02, LensRiver 07, T-scroll 07, Meme-tracking 09, Themail 06, Topic-based 09. Text streams: TextPool 05, Moving time window (Wong 03), Eventriver 10, Text pipe 05. Force-based placement: Graph drawing 91, Chalmers 96, Morrison 02, etc.
  • Slide 7
  • System Overview
  • Slide 8
  • Potential and Similarity. A potential energy is defined between each pair of document particles (the equation, including the symbol of its control parameter, is not preserved in this transcript); l_i and l_j are the locations of particles i and j; l_ij is the ideal distance between them. The ideal distance is computed from document similarity (cosine similarity): a larger similarity leads to a smaller ideal distance, which moves documents closer together to form clusters.
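The relation between similarity, ideal distance, and pair potential can be illustrated with a minimal Python sketch. Only the use of cosine similarity comes from the slide. The linear mapping in `ideal_distance`, the spring-like form in `pair_potential`, and the names `strength` and `max_distance` are assumptions made for illustration, not the authors' formulas.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def ideal_distance(similarity, max_distance=1.0):
    """ASSUMED mapping: larger similarity gives a smaller ideal distance l_ij."""
    return max_distance * (1.0 - similarity)

def pair_potential(l_i, l_j, l_ij, strength=1.0):
    """ASSUMED spring-like potential: zero when the particles sit at distance l_ij."""
    d = math.dist(l_i, l_j)
    return strength * (d - l_ij) ** 2

doc_a = {"obama": 1.0, "war": 1.0}
doc_b = {"obama": 1.0, "economy": 1.0}
sim = cosine_similarity(doc_a, doc_b)
print(sim, ideal_distance(sim), pair_potential((0.0, 0.0), (1.0, 0.0), ideal_distance(sim)))
```

Any monotonically decreasing mapping from similarity to distance would preserve the clustering behavior described on the slide; the linear one is simply the easiest to read.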
  • Slide 9
  • Force-directed Model. A global potential function is defined over all document particles. Forces are computed from its minimization; they attract or repulse document particles.
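A minimal sketch of how forces can follow from minimizing a global potential, assuming the global potential is the sum of spring-like pair potentials `strength * (d - l_ij)**2`. That form, the plain gradient-descent update, and the names `strength` and `step_size` are assumptions for illustration; the slide does not give the actual equations.

```python
import math

def forces(positions, ideal, strength=1.0):
    """Force on each 2D particle: the negative gradient of the assumed global potential."""
    n = len(positions)
    result = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            d = math.hypot(dx, dy)
            if d == 0.0:
                continue  # coincident particles: direction undefined, skip the pair
            # Positive magnitude pulls i toward j (too far apart);
            # negative magnitude pushes them apart (too close).
            magnitude = 2.0 * strength * (d - ideal[i][j])
            fx, fy = magnitude * dx / d, magnitude * dy / d
            result[i][0] += fx
            result[i][1] += fy
            result[j][0] -= fx
            result[j][1] -= fy
    return result

def relax(positions, ideal, step_size=0.1, steps=50):
    """Move particles along the forces for a fixed number of steps."""
    for _ in range(steps):
        f = forces(positions, ideal)
        positions = [(x + step_size * fx, y + step_size * fy)
                     for (x, y), (fx, fy) in zip(positions, f)]
    return positions

# Two documents start 2.0 apart but have an ideal distance of 1.0.
ideal = [[0.0, 1.0], [1.0, 0.0]]
final = relax([(0.0, 0.0), (2.0, 0.0)], ideal)
print(math.dist(final[0], final[1]))
```

The all-pairs loop is the N-body pattern: its cost grows quadratically with the number of documents, which is why this kind of computation is a natural candidate for parallel hardware.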
  • Slide 10
  • Dynamic Keyword Importance. Cosine similarity can be improved by introducing keyword importance. The importance I_k can be freely modified by users at any time, according to their interest/preference or to knowledge discovered from a prior period; it is a powerful tool for users to manipulate the layout and analyze the data. Importance can also be assigned by an automatic scheme, e.g. for keyword k based on O_k: occurrence; te_k: the last time it appears; ts_k: the first time it appears; n_k: the number of documents that contain the keyword.
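One plausible way to fold importance into cosine similarity is to weight each keyword's contribution by I_k, as in the sketch below. The exact weighting used by STREAMIT is not given in the transcript, so this form and the default importance of 1.0 are assumptions. The automatic importance scheme is not sketched because its formula is not preserved.

```python
import math

def weighted_cosine(a, b, importance):
    """Cosine similarity with each keyword k scaled by its importance I_k.

    a, b: sparse keyword vectors (dict: keyword -> weight).
    importance: dict keyword -> I_k; keywords not listed default to 1.0 (assumed).
    """
    def imp(k):
        return importance.get(k, 1.0)

    dot = sum(imp(k) * w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(imp(k) * w * w for k, w in a.items()))
    norm_b = math.sqrt(sum(imp(k) * w * w for k, w in b.items()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = {"war": 1.0, "economy": 1.0}
doc_b = {"war": 1.0, "sports": 1.0}
print(weighted_cosine(doc_a, doc_b, {}))            # all keywords equally important
print(weighted_cosine(doc_a, doc_b, {"war": 4.0}))  # user raises the importance of "war"
```

Raising the importance of a shared keyword increases the similarity of the documents that contain it, which shortens their ideal distance and pulls them into a tighter group in the layout.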
  • Slide 11
  • Visualization Interface
  • Slide 12
  • Visualization Tools. Main window: major layout. Animation control panel: play, pause, stop; drag by mouse. Keyword table: dynamic update; change importance. Document table: text information.
  • Slide 13
  • Labeling. Use text document titles. Reduce cluttering: recent semantic titles; user-controlled clutter levels; group title label. Use color and opacity to display a clear layout.
  • Slide 14
  • User Interaction. Adjusting keyword importance. Grouping and tracking documents: halo for topics of interest. Browsing and tracking keywords. Selection: manual, example-based, keyword-based. Integrated shoebox for details.
  • Slide 15
  • Case Study: New York Times News. Total article number: 230. Time period: between Jul. 19 and Sep. 18, 2010. Topic: articles about Barack Obama. Articles are continuously injected; new keywords are added to the keyword table and their frequencies are updated on the fly. Keyword importance is automatically assigned.
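The on-the-fly keyword table can be pictured with a small sketch that updates per-keyword statistics as each article arrives, with no pre-scan of the collection. The class name `KeywordTable`, its fields, and the integer timestamps are hypothetical; the tracked quantities mirror the O_k, ts_k, te_k and n_k listed on the keyword-importance slide.

```python
from collections import Counter

class KeywordTable:
    """Hypothetical streaming keyword table; nothing is known before documents arrive."""

    def __init__(self):
        self.occurrences = Counter()  # O_k: total occurrences of keyword k
        self.doc_counts = Counter()   # n_k: number of documents containing k
        self.first_seen = {}          # ts_k: first time k appears
        self.last_seen = {}           # te_k: last time k appears

    def ingest(self, keywords, timestamp):
        """Update the table with the keyword list of one newly arrived document."""
        for k in keywords:
            self.occurrences[k] += 1
            self.first_seen.setdefault(k, timestamp)
            self.last_seen[k] = timestamp
        for k in set(keywords):
            self.doc_counts[k] += 1

table = KeywordTable()
table.ingest(["Terrorism", "International Relations", "Terrorism"], timestamp=1)
table.ingest(["Politics and Government", "Terrorism"], timestamp=2)
print(table.occurrences.most_common(1))
```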
  • Slide 16
  • Case Study: New York Times News. 136 news articles; high-frequency keywords: Politics and Government, International Relations, Terrorism. Increase the importance of International Relations; highlight the group with Afghanistan War in a pink halo. (2) Terrorism in an orange halo. (3) All documents are shown; Terrorism becomes larger, and one item (an outlier) lies between Afghanistan War and Terrorism.
  • Slide 17
  • Case Study: US NSF Award Abstracts. 1000 National Science Foundation (NSF) IIS award abstracts, funded between Mar. 2000 and Aug. 2003. Each document is characterized by a set of keywords. The size of a document circle represents the funding amount.
  • Slide 18
  • Case Study: US NSF Award Abstracts. Aug. 1, 2000: 95 projects. Sep. 1, 2000: 172 projects; many large projects started; highlight Management in red and Database in green; increase their importance. Mar. 15, 2002: 672 projects; many large projects started; highlight Sensor with halo; (2) is an outlier far away from the other projects with halo; it is about just-in-time information retrieval on wearable computers.
  • Slide 19
  • Case Study: Video on NSF Dataset
  • Slide 20
  • Slide 21
  • Performance Optimization. The initial positions of document particles affect the number of computational steps and the cost. Similarity grid: new documents are roughly inserted in the proximity of similar documents; each grid cell has a special keyword vector consisting of the average keyword weights of the documents inside the cell. Data set of 7100 documents.
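The similarity-grid idea can be sketched as follows: each occupied cell keeps the average keyword vector of its documents, and a new document is seeded in the cell whose average is most similar to it. The class `SimilarityGrid`, its methods, and the unit-square layout are hypothetical, and the sketch only registers documents in cells without re-binning them as particles move.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SimilarityGrid:
    """Hypothetical rows x cols grid over the unit square."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.sums = {}    # (row, col) -> summed keyword weights of documents in the cell
        self.counts = {}  # (row, col) -> number of documents in the cell

    def add(self, doc, cell):
        """Register a document's keyword vector in a cell."""
        sums = self.sums.setdefault(cell, {})
        for k, w in doc.items():
            sums[k] = sums.get(k, 0.0) + w
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def cell_vector(self, cell):
        """Average keyword weights of the documents inside the cell."""
        n = self.counts[cell]
        return {k: w / n for k, w in self.sums[cell].items()}

    def best_cell(self, doc):
        """Occupied cell most similar to the new document, or None if the grid is empty."""
        best, best_sim = None, -1.0
        for cell in self.counts:
            sim = cosine(doc, self.cell_vector(cell))
            if sim > best_sim:
                best, best_sim = cell, sim
        return best

    def cell_center(self, cell):
        """Center of a cell, usable as the initial position of a new particle."""
        row, col = cell
        return ((col + 0.5) / self.cols, (row + 0.5) / self.rows)

grid = SimilarityGrid(50, 50)
grid.add({"war": 1.0}, (0, 0))
grid.add({"sports": 1.0}, (49, 49))
cell = grid.best_cell({"war": 2.0, "peace": 1.0})
print(cell, grid.cell_center(cell))
```

Seeding a particle near similar documents means the force simulation starts closer to equilibrium, so fewer steps are needed after each insertion.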
  • Slide 22
  • Performance Optimization. GPU acceleration: CUDA implementation of the N-body problem; good performance achieved. Hardware: NVidia Quadro NVS 295 GPU with 2GB texture memory; Intel Core2 1.8GHz CPU with 2GB RAM.
  • Slide 23
  • GPU Performance. Experiments with a 50 by 50 grid achieve good average speed. More importantly, the maximum simulation time after a document insertion on the GPU was less than a second, which is fast enough for human perception and analysis.
  • Slide 24
  • Discussion. The system can handle live text streams with a document arrival interval of around 1 second on a consumer PC and graphics card; e.g., New York Times news averages 3 documents per hour, with a maximum of 8 documents per hour at peak time. A very large number of documents inside the system will undoubtedly introduce visual clutter and hinder analysts from absorbing the information: there is a natural perception limit and a device limit, so clutter reduction and simplification algorithms are needed. To further increase the power: advanced hardware; hierarchical or multiple-resolution simulation.
  • Slide 25
  • Conclusion. STREAMIT: an efficient visual exploration system for live text streams. Dynamic physical system; keyword manipulation with importance; visual tools. Acknowledgment: National Science Foundation IIS-0915528, IIS-0916131 and NSFDACS10P1309.
  • Slide 26