e-health 2009, istanbul 23-25 september 20091 the medical information system - medisysmedisys...
Post on 16-Jan-2016
215 Views
Preview:
TRANSCRIPT
e-health 2009, Istanbul 23-25 September 2009 1
The Medical Information System - MedISys
eHealth 2009Second International ICST Conference
onElectronic Healthcare for the 21st centurySeptember 23-25, 2009 - Istanbul, Turkey
Erik van der Goot & the OPTIMA team (OPensource Text Information Mining and Analysis )
European Commission – Joint Research Centre (JRC)Institute for the Protection and Security of the Citizen (IPSC)
erik.van-der-goot@jrc.ec.europa.eu
JRC Ispra on 14 December 2007 – Unit Meeting 2
Joint Research Centre (JRC)
The European Commission’s
Research-Based Policy Support Organisation
IPSC - Institute for the Protection and Security of the CitizenIspra - Italy
http://ipsc.jrc.ec.europa.eu/
http://www.jrc.ec.europa.eu/
JRC - who
e-health 2009, Istanbul 23-25 September 2009 3
JRC - where
e-health 2009, Istanbul 23-25 September 2009 4
MedISys - Overview
Objective:Provide open source data collection and analysis for surveillance and epidemiology Replace manual scanning of multiple newspapers and web portals Support national and international Public Health (PH) organisations to monitor issues of
Public Health concern (e.g. CBRN)
Functionality:– Gather, filter, classify, extract and aggregate health-related information– Monitor trends, detect breaking news– Visualise analysis results– Alert users– Allows customised views– In combination with RNS tool, allows manual moderation.
e-health 2009, Istanbul 23-25 September 2009 5
Background - History
Based on JRC’s Europe Media Monitor (EMM) technology (EMM live since 2002; http://emm.newsbrief.eu).
On request / initiative of the EC’s Directorate General for Health and Consumer Protection (DG SANCO).
Password-protected service for Public Health bodies since 2005.
Public service since early 2007 (http://medusa.jrc.it/, restricted functionality).
e-health 2009, Istanbul 23-25 September 2009 6
Background - Media Monitoring
• EU Commission Media Monitoring (until 2001/2002) – Traditional cut and paste for printed press only– Monitoring of incoming news wires (e.g. Reuters, AFP)– Simple keyword based filtering of wires– Manual selection of printed press items– Human classification of items
• Potential problems– Not ‘real-time’ for mainstream media: printed press typically once a day– Limited coverage: not all media is printed– Inaccurate and incomplete classification: subjective and limited number of categories– Labour intensive and expensive: limited number of articles per reviewer per day,
requires topical knowledge and requires language knowledge
e-health 2009, Istanbul 23-25 September 2009 7
EMM History
• New Challenges (as seen in 2002)– Enlargement (+10 countries): more media, more languages– More use of electronic publishing (media)– Electronic distribution of results (web+mobile)– Automatic alerting functions
• New approach: EMM - a one stop shop for Media Monitoring– Facilitate (not replace) human Media Monitoring activities– Extend monitoring beyond the traditional news wires (Internet). – Improve coverage, number of languages, analysis. – Apply automatic categorization and analysis to all sources– Provide new services like automatic e-mail, sms, mobile editions etc. – Provide editorial system to manage the information and produce newsletters
etc.
Important: EMM is not Yet Another Internet Search Engine
e-health 2009, Istanbul 23-25 September 2009 8
EMM System Features
• Automatic language recognitionBased on continuously updated language specific frequency tables
• Automated information/entity extraction400.000 persons and organizations based on continuously updated list of entities, many language specific synonyms.
• GeotaggingBased on homegrown harmonised multilingual geo-data set, about 600.000 place name variants in most languages covered by EMM, mostly national capitals, regional capitals and provincial capitals.
• Improved Categorization EngineBoolean combinations, proximity, wildcardsSupport for Arabic and similar (automatic noun-prefix processing) Support for Chinese and similar (no whitespace)
• Tonality/SentimentSimple bag of words approach, range from very negative to very positive, corrected for long term source bias, interesting for following reporting trends per category
e-health 2009, Istanbul 23-25 September 2009 9
… more features
• Duplicate detection
• Metadata categorizationAllows selection of articles based on any previously assigned meta-data.
• Automated information linkingIncremental topic based clustering and storytracking, geolocation. 10 minute interval incremental clustering on last 4 hours worth of news. (Top Stories on front page)
• Automatic detection of breaking newsCluster growth rate Flux of articles per category
• IndexingIndex full text and most metadata.
• Statistics/Trend analysisQuantitative analysis of reporting. Maintain simple count statistics.
e-health 2009, Istanbul 23-25 September 2009 10
…and more features
• Event extractionLanguage independent event grammars used to parse clusters using language dependent resources to fill the grammar slots.Currently for 5 languages (en, fr, it, pt, ru), violent events, humanitarian events
e-health 2009, Istanbul 23-25 September 2009 11
Development time line
2002
2004
2006
EMM/RNS
MediSys
2009
Continuous developmentNew featuresNewsExplorer
Domain specific application
EMM System redesign
RNS redesign
Redesign based on EMM
First version 2005
e-health 2009, Istanbul 23-25 September 2009 12
MediSys System Overview
EMM Open Source Monitoring Engine
MediSys Newsbrief NewsDesk Service (a.k.a. RNS)Editorial Interface
e-health 2009, Istanbul 23-25 September 2009 13
Problems to solve
Find relevant information– Millions of new articles/blogs/items/tweets published on Internet each day
Deliver the information to the right user– Allow for many (possibly overlapping) categories to meet specific needs
Timely– Right now if possible
In short: Deliver targeted information timely to the right user
e-health 2009, Istanbul 23-25 September 2009 14
Approach
Wide coverageMany sources
Local, Regional, National and International coverageMany languages
Multilinguality & cross-lingual information access
Fast coverageHigh frequency monitoring of sites, some sites every 5 minutes
Overcome the information overflow• Categorization, aggregation, duplicate identification, clustering• Customisability of MedISys NewsBrief• Search functions• RNS tool for manual moderation and targeted dissemination
e-health 2009, Istanbul 23-25 September 2009 15
Input data
~ 2200 Sources (world-wide, but primary focus on Europe)• ~ 4,000 HTML web pages+RSS feeds• ~ 100 specialist medical sites• ~ 20 commercial newswires• Specialist pay-for sources (LexisMed)• 24/7, near continuous monitoring
~80,000 new articles/items per day
Converts dirty html with adverts, menus, html tags, ‘related stories’, etc. into clean and standardised Unicode-encoded RSS format
Use RSS when available
Perform full content analysis
e-health 2009, Istanbul 23-25 September 2009 16
MediSys Screenshots
e-health 2009, Istanbul 23-25 September 2009 17
MedISys – Current subscribers and users include …
Supranational organisationsDirectorate General Health and Consumer Protection (SANCO)European Centre for Disease Control, Stockholm (ECDC)European Food Safety Authority (EFSA)World Health Organisation (WHO)
National Public Health organisationsSwiss Federal Office of Public HealthIcelandic Ministry of HealthSpanish Ministry of Sanitation & Ministry of Health and Consumer ProtectionInstitut de Veille Sanitaire (France)Global Public Health Intelligence Network (Canada)Danish Emergency Management AgencyItalian Ministry of Health and Ministry of DefenceDutch Institute of Public Health & Food and Consumer Product Safety Authority
The (general?) public
Currently ~ 1000 visitors, ~ 37000 hits per day on public system
e-health 2009, Istanbul 23-25 September 2009 18
Locations mentioned in MedISys medical articles across languages
Italian - German
Importance of multilingual information gathering
English - French
Spanish - Portuguese
e-health 2009, Istanbul 23-25 September 2009 19
Data processing layer:
• Detect ‘known entities’ across languages using large multilingual set of name variants (updated daily)
• Geo-locate the articles using large multilingual geo-database
• Apply content based categorization using multilingual category definitions
Multilingual and cross lingual analysis (1)
Barack Obama (Eu,yo)Barak Obama (az,wo)Барак Обама (ba,uk)
أوباما (ar) باراكاوباما (ar,fa) باراك
Барак Хуссейн Обама (ru)Baraque Obama (pt)
バラク・オバマ (ja) บารั�ค โอบามา (th)
(hy)ԲարաքՕբամաާމ� ޮއ�ާބ� ްކ� ަރ� (dv) ާބ�
(yi) באראק אבאמא(he) ברק אובאמה贝拉克 · 奥巴马 (zh)
ާމ� ޮއ�ާބ� ްކ� ަރ� (dv) ާބ�اوبام Influenza-A-Virus(ur) بارک
influenzavirus tipo Aswine-origin influenzasjevernoameričk gripe
pandemia influenzalemexicaanse griepмексиканск гриппсевероамериканск гриппpandemija svinjskesjevernoameričke gripe
grippe nouvellegripă porcinăsvinjski gripsikainfluenssasvininfluensaSchweineinfluenzaPorzine InfluenzaSchweinegrippeinfluenza porcinaprasečí chřipka
e-health 2009, Istanbul 23-25 September 2009 20
Multilingual and cross lingual analysis (2)
Data presentation layer:
• ‘Convenience’ links to external Machine Translation programs, where available.• Display of other MedISys categories, of persons and organisations found in text.
• Display on-line English translation of Chinese and Arabic
e-health 2009, Istanbul 23-25 September 2009 21
Aggregation of multilingual information
Documents from all languages get classified according to the same countries and categories.
An increase of the number of media reports on any country-category combination is detected,
independently of the reporting language.
Graphs and alerts may show events not yet reported in your own language.
e-health 2009, Istanbul 23-25 September 2009 22
Detection using statistics
Detect abnormal flux of reporting for a particular country/category combination
e-health 2009, Istanbul 23-25 September 2009 23
Recent case
e-health 2009, Istanbul 23-25 September 2009 24
News Clusters mostly about CategorySat. 02-05-2009, Influenza A
e-health 2009, Istanbul 23-25 September 2009 25
Categorized and Clustered NewsSat. 02-05-2009, Influenza A
e-health 2009, Istanbul 23-25 September 2009 26
PULS Event detection
Results from Helsinki University
e-health 2009, Istanbul 23-25 September 2009 27
Category definitions – Example: haemorrhagic fever
• Terms (single or multi-word)• Cumulative weights with threshold
• Case forcing• Upper case characters in pattern only match
uppercase in text (useful for acronyms etc.)
• Wild cards• Single letters (_)• Zero, one or more letters (%)• Adjacent words (+)
• Boolean combinations of term lists• And, or, not• Using proximity operator (within X words)
e-health 2009, Istanbul 23-25 September 2009 28
Customisability of MedISys
Add more news sources or new categories, e.g. Events: Cricket World Cup, Rugby World Cup, UEFA Euro 2008New diseasesOther classes, e.g. deliberate release of chemicals
(on request of recognised users/partners)
Output formats: web pages, email alerts, or RSS feed to integrate into your environment.
Email alerts: daily vs. breaking news onlyfor daily notification: specify hourfor breaking news: level-dependentUser-selected languages only
e-health 2009, Istanbul 23-25 September 2009 29
Customisability: Filter by language/news source/category
e-health 2009, Istanbul 23-25 September 2009 30
Rapid News Service - RNS (restricted to subscribed users)
Allows MedISys users to further customise their view of the news
Selection of specific languages and feedsAllows human moderation
Manual selection of news itemsDrag and drop compilation of newsletters
Allows moderators to forward news items to user groupsAllows user management Via SMS alerts, emails or newsletters
Shows overview of relative activity of each category over time
e-health 2009, Istanbul 23-25 September 2009 31
RNS moderation: Editing interface for newsletter
Manual selection of news items, drag and drop compilation of newsletters.
e-health 2009, Istanbul 23-25 September 2009 32
RNS moderation: Alert overview page
Time line shows overview of relative activity of each category over time.
e-health 2009, Istanbul 23-25 September 2009 33
MedISys - Summary
High coverage: helps monitor a large number of multilingual media reports.Includes tools to help beat the information overflow:
via clustering, duplicate detection; categorization; information aggregation; visualisation; mapping further means are being implemented: e.g. multiligual medical event extraction
Special features of MedISys:Fully automatic (moderation possible)Real time (10-minute updates), 24/7High multilinguality (43 languages)Multilingual information aggregation
Part of EMM family of applications, active team: much new functionality to come.
top related