

 

Recognizing the Electronic Medical Record Data from Unstructured Medical Data Using Visual Text Mining Techniques

Abstract: Computer systems and communication technologies have made a strong and influential presence in the different fields of medicine. The cornerstone of a functional medical information system is the Electronic Health Record (EHR) management system. EHR implementation and adoption face different barriers that slow down deployment in different organizations. This research focuses on resolving the most common barriers, which are data entry, unstructured clinical data, and modification of the physician workflow. It proposes a solution that uses text mining and natural language processing techniques. The solution was tested and verified in four real-world clinical organizations, where it showed a correctness and precision of 91.88%.

Keywords: Electronic Health Record, Text Mining, Unstructured Medical Data, Medical Data Entry, Health Information Technology.

I. INTRODUCTION

The paper-based medical record is woefully inadequate for meeting the needs of modern medicine. It arose in the 19th century as a highly personalized "lab notebook" that clinicians could use to record their observations and plans so that they could be reminded of pertinent details when they next saw that same patient. There were no bureaucratic requirements, no assumptions that the record would be used to support communication among varied providers of care, and remarkably few data or test results to fill up the record's pages. The record that met the needs of clinicians a century ago has struggled mightily to adjust over the decades to accommodate new requirements as health care and medicine have changed, which led to the emergence of Health Information Technology (HIT) [1].

HIT allows comprehensive management of medical knowledge and its secure exchange among health care consumers and providers. Broad use of HIT will:

1.  Help to eliminate the manual tasks of extracting data from charts or filling out specialized datasheets.

2.  Help to derive data directly from the electronic record, making research-data collection a by-product of routine clinical record keeping.

3.  Help to move from a paper-based health care system to secure electronic medical records, which will save lives and reduce health care costs.

4.  Help in early detection of infectious disease through advanced data collection, fusion, and processing techniques, which would be at the forefront in spotting the emergence of new diseases and crucial to tracking the spread of known diseases [2].

II. ELECTRONIC HEALTH RECORD, DEFINITION AND MODELS

An EHR is defined as a longitudinal electronic record of a patient's health information generated by one or more encounters in any care delivery setting. This information includes, but is not limited to, patient demographics, progress notes, examination details such as symptoms and findings, medications, vital signs, past medical history, immunizations, laboratory data, and radiology reports. The EHR automates and streamlines the clinician's workflow. It can generate a complete record of a clinical patient encounter and can support other care-related activities, directly or indirectly, through interfaces including evidence-based decision support, quality management, and outcomes reporting. The EHR is thus a repository of patient data in digital form that is stored and exchanged securely and is accessible by multiple authorized users [2][3][4].

There are many EHR architectural models in use around the world. The two most popular EHR models are:

1.  Central Repository Model

The center of this EHR model is the repository, which is fed by the existing applications in different care locations such as hospitals, clinics, and family physician practices. The feed from these applications is message based and follows pre-agreed standards. The messaging needs to be based on well-defined standards, for example the HL7 Reference Information Model (RIM), for which XML could be used as the recommended Implementation Technology Specification (ITS) [5].

Prof. Hussain Bushinak
Faculty of Medicine
Ain Shams University
Cairo, Egypt

Dr. Sayed AbdelGaber
Faculty of Computers and Information
Helwan University
Cairo, Egypt

Mr. Fahad Kamal AlSharif
College of Computer Science
Modern Academy
Cairo, Egypt

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 6, June 2011
25 http://sites.google.com/site/ijcsis/ ISSN 1947-5500

Figure 1. EHR Central Repository Model

The event-driven messages that need to be sent and stored in the repository are essentially event-based summaries, as shown in figure (2). The event-based summaries stored in the repository can be queried and retrieved by different clinicians who are treating the patients in different scenarios and clinical settings. Retrieval and access of data from the repository is subject to establishing that the clinicians legitimately access the data only for treating patients who are in their care. Retrieval is done through messaging, which can be synchronous or asynchronous depending on the urgency, complexity, and importance of the data being retrieved [5].

Figure 2. EHR Message Events 

2.  Managed Services Model 

The managed services model is based on hosting applications for different care providers and care settings in a data center run by a consortium, which may consist of a group of infrastructure providers, system integrators, and application providers. The hosted applications can be used to provide an effective EHR either by building a common repository using a shared database or by providing a common user interface to all hosted applications and extracting data from these systems through a portal whose authentication and authorization mechanism can also be controlled at the data center level, as shown in figure 3 [5].

Figure 3. Shared Services Model

III. BARRIERS OF THE ELECTRONIC HEALTH RECORD IMPLEMENTATION

Implementation of EHR faces different barriers, and these barriers vary from one environment to another. The focus here is on the general barriers that exist in most EHR implementation attempts. These barriers are:

1.  Financial Barriers

Financial barriers are divided into the following points:

• High costs: These costs are divided into two main parts, initial cost and ongoing cost [6].

• Under-developed business case: This barrier arises because EHR returns on investment are uncertain, financial benefits are achieved only in the long run, and the main objective and benefit of EHR is to provide a high-quality medical service for citizens [6].

2.  Technological Barriers

Technological barriers are divided into four points [7]:

• Inadequate technical support

• Inadequate data exchange

• Security and privacy

• Lack of standards

3.  Physicians' Attitudinal and Behavioral Barriers in Data Entry

Many health information system projects fail due to attitudes, behaviors, barriers in data entry, and a lack of systematic consideration of human-centered computing issues such as usability, workflow, organizational change, and process reengineering. Two major factors lead to sluggish performance of an EHR system: the complexity of the Graphical User Interface (GUI) and the system response time. These factors force clinicians to see fewer patients and have longer workdays, largely because of the extra time needed to use the system [8].

In 2004, Lisa Pizziferri and others concluded that the benefits of using an EHR system can be achieved and accepted by physicians only if the physicians do not need to sacrifice their time with patients or other activities during clinic sessions. Physicians recognize the quality improvements achieved by EHRs, but their time should be saved by decreasing the time required for data entry in EHR systems [9].

4.  Organizational Change Barriers

This category contains several points:

• Design of, and alignment with, workflow and office integration: 54.2 percent of the 5000 respondents to the American Academy of Family Physicians survey reported that they were worried about slower workflow and lower productivity (American Academy of Family Physicians 2004) [10].

• Migration from paper-based systems

• Staff training

5.  The Format of Clinical Data Stored in EHR Systems

Generally speaking, clinical data is stored in two main forms: structured data and unstructured data.

• Structured data: Structured data has a relational data model and is composed of atomic data types. It is managed by technology that allows querying and reporting against predetermined data types and understood relationships, such as patient demographics and laboratory tests [11].

• Unstructured data: Unstructured data consists of any data stored in an unstructured format at an atomic level. That is, in unstructured content there is no conceptual definition and no data type definition; in textual documents, a word is simply a word [11].

Unstructured data consists of two basic categories:

• Bitmap objects: Inherently non-language based, such as X-rays and other radiology images, video, or audio files.

• Textual objects: Based on a written or printed language, such as clinical reports, nursing notes, and examination sheets [11].

Using unstructured data for storing clinical data has the following limitations:

• The data is not consumable at a semantic level without a compatible interface or application.

• No technology can gain insight into the context of the information unless it can actually read it.

6.  Barriers of Using Unstructured Data in the Electronic Health Record

Aggregation of information across all the records in a large repository could bring benefits for clinical research. When physicians work with structured data, they can receive alerts about drugs that interact badly with each other, which helps them improve the treatment process and avoid medication errors; this cannot be done with unstructured data [12].

IV. SURVEYING THE SOLUTIONS OF EHR DATA ENTRY BARRIERS

In October 2010, Ergin Soysal, Ilyas Cicekli, and Nazife Baykal designed and developed an ontology-based information extraction system for radiological reports [15].

The main goal of this technique is to extract the information available in free-text Turkish radiology reports and convert it into a structured information model using manually created extraction rules and a domain ontology. The technique extracts data from radiological reports, which are free text written by physicians, and inserts it as structured data into the EHR [13].

However, this technique has the following drawbacks:

• It concentrates mainly on abdominal radiology reports.

• It does not use a large, trusted repository of medical expressions, which may reduce the quality of the information extraction process. Consequently, wrong clinical information may be recorded.

In September 2010, Adam Wright, Elizabeth S. Chen, and Francine L. Maloney developed a technique for identifying associations between medications, laboratory results, and problems. They developed a knowledge base of medication-problem and laboratory result-problem associations in an automated fashion. It was based on two data mining techniques: frequent item set mining and association rule mining. The technique successfully identified a large number of clinically accurate associations. A high proportion of high-scoring associations were adjudged clinically accurate when evaluated against the gold standard (89.2% for medications with the best-performing statistic, chi square, and 55.6% for laboratory results using interest) [14]. However, this technique has the following drawbacks:

• The researchers assumed that patients' data was structured.

• Building the knowledge base concentrated only on patients' problems, medications, and laboratory results, which means that other data, such as the patient's history, diagnosis, and procedures, are not taken into account.

• Data entry is done through a traditional GUI, so this solution did not enhance the physician workflow.

In September 2010, a system for handling misspellings in drug information system queries was developed by Christian Senger, Jens Kaltschmidt, Simon P.W. Schmitt, Markus G. Pruszydlo, and Walter E. Haefeli. This system attempted to solve the problem of drug data entry in a Drug Information System (DIS). The researchers evaluated correctly spelled and misspelled drug names from all queries of the University Hospital of Heidelberg. The results indicated that DIS search engines should be equipped with error-tolerant search capabilities. Auto-completion lists might expedite searches but might fail regularly because of the high frequency of typographic errors already in the initial letters. The system improved DIS data entry by using spelling correction tools to make the drug information understandable and available, but it concentrated only on the DIS and did not cover examination, history, and procedure data [16].

In August 2010, a technique for automated identification of diseases and diagnoses in clinical records was developed by Yong-gang Cao, James J. Cimino, John Ely, and Hong Yu. It presents an approach for prototyping a diagnosis classifier based on a popular computational linguistics platform [18]. This technique has the following limitations:

• It focuses only on extracting disease keywords and ignores other important parts such as operations, symptoms, and findings.

• It does not use spelling correction.

• There is no clear structured data model to store the data extracted from the clinical report.

• It does not use a large, trusted data source for medical expressions such as the Unified Medical Language System (UMLS).

In July 2010, another technique for automatically extracting information needs from complex clinical questions was developed by Yong-gang Cao, James J. Cimino, John Ely, and Hong Yu. They built a fully automated system, AskHERMES, to help clinicians extract and articulate multimedia information from the literature to answer their ad hoc clinical questions. This system automatically retrieves, extracts, and integrates information from the literature and other information resources and attempts to formulate this information as answers to ad hoc medical questions posed by clinicians, all within a time frame that meets their demands [17]. The technique succeeds in clinical question answering and in identifying the category of the question, but with respect to EHR system adoption it has the following limitations:

• It extracts clinical information to identify the question category, not to store this information in the EHR repository.

• It works only on question answering, not on the data entry process.

• It does not enhance the physician workflow during the examination process.

Although the previous techniques attempted to solve the EHR data entry barrier, they have the following limitations:

• They concentrate on specific parts of the data, such as diseases, and leave out the rest.

• The medical expression repository used does not contain all the expressions or the semantic relations between them.

• Some of these techniques store the EHR data as free text (unstructured data form).

• The physician workflow is modified, which in turn leads to more physical and mental effort and reduces the physician's productivity.

V.  BRIDGING THE UNSTRUCTURED DATA TO STRUCTURED EHR

The suggested idea is to convert unstructured free-text clinical data to structured EHR data without modifying the workflow of physicians or adding any physical or mental effort for them. Figure (4) shows the steps of the suggested technique.


 

Figure 4 Objective Technique Steps

Step 1: Optical Character Recognition (OCR)

The physician writes his/her diagnosis as usual on a pen pad, paper, or tablet PC. If the clinical report is written on paper, it needs to be scanned. The clinical report is stored as an image of free handwritten text that can be processed. This image is passed through an OCR tool to convert it to machine-encoded text. The details of this step are shown in figure (5).

Step 2: Spelling Corrector

The machine-encoded text may include spelling errors, which may yield wrong information during the extraction process. So, all incorrectly spelled words are corrected before moving to the next step. This step requires a medical dictionary that contains most medical expressions in their different forms, such as verbs, adjectives, and nouns. Figure (6) shows the details of this step.

Figure 6 Spell Check input and output
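The spelling correction step can be illustrated with a minimal sketch. It assumes a tiny in-memory word list in place of the medical dictionary and uses approximate string matching from the Python standard library; the paper does not state which matching method the actual system uses, and the names MEDICAL_DICTIONARY, correct_word, and correct_text are hypothetical.

```python
import difflib

# Hypothetical mini-dictionary (an assumption for illustration);
# a real deployment would load a full medical dictionary.
MEDICAL_DICTIONARY = ["prostate", "bladder", "enlarged", "tender", "base", "diagnosis"]


def correct_word(word, dictionary=MEDICAL_DICTIONARY, cutoff=0.8):
    """Return the closest dictionary entry, or the word unchanged if none is close enough."""
    lowered = word.lower()
    if lowered in dictionary:
        return lowered
    # get_close_matches ranks candidates by a similarity ratio between 0 and 1.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Correct every whitespace-separated word of an OCR result."""
    return " ".join(correct_word(w) for w in text.split())
```

In this sketch an unknown word is passed through unchanged, so it could be flagged for review by the physician instead of being replaced silently.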

Step 3: Text Mining with Natural Language Processing Techniques

In this step, the resulting text is cleaned and partitioned into statements. Using text mining and NLP, unwanted words are removed and all medical data is classified and coded in the form of multiple statements. This step consists of [19]:

• Text preprocessing,

• Part-of-speech tagging,

• Statement segmentation,

• Noun phrase extraction.

Each of these components is described below.

1.  Text preprocessing: Also called tokenization or text normalization, it includes the following steps [19]:

• Discarding unwanted material (e.g., unwanted brackets and tags).

• Detecting word boundaries: white space and punctuation.

• Stemming (lemmatization): This is optional. English words like 'look' can be inflected with morphological suffixes to produce 'looks, looking, looked'. They share the same stem 'look'. Often (but not always) it is beneficial to map all inflected forms to the stem. This is a complex process since there can be many exceptional cases (e.g., department vs. depart, be vs. were). The most commonly used stemmer is the Porter Stemmer, but there are many others.

• Stop word removal: the most frequent words often do not carry much meaning.

• Capitalization, case folding: it is often convenient to lowercase every character.

2.  Part-of-speech tagging: A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token) [19].

3.  Statement segmentation: This part divides the clinical text into several statements [19].

Figure 5 OCR and Handwriting input and output


4.  Noun phrase extraction: In this part, all noun phrases are extracted, and each complex noun phrase is decomposed into smaller noun phrases.

Figure 7 Text mining and NLP tasks
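Two of the tasks above, statement segmentation and text preprocessing, can be sketched with regular expressions alone. This is a simplified illustration, not the system's implementation: the stop-word list is a small assumed sample, stemming is left out because it is optional, and part-of-speech tagging and noun phrase extraction are omitted because they need a trained NLP component. The sketch segments before preprocessing so that sentence punctuation is still available; that ordering is an assumption of the example.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"there", "is", "the", "of", "with", "a", "an", "and"}


def segment_statements(text):
    """Split clinical text into statements at sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def preprocess(statement):
    """Tokenize one statement: drop tags and brackets, fold case, remove stop words."""
    # Throw away unwanted material such as markup tags and brackets.
    statement = re.sub(r"<[^>]*>", " ", statement)
    statement = re.sub(r"[\[\](){}]", " ", statement)
    # Case folding.
    statement = statement.lower()
    # Word boundaries: keep runs of letters and digits, allowing inner hyphens.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", statement)
    # Stop-word removal.
    return [t for t in tokens if t not in STOP_WORDS]
```

Applied to the sentence "There is enlarged prostate with tender base of the bladder .", preprocess keeps only the content words enlarged, prostate, tender, base, and bladder.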

Step 4: Unified Medical Language System (UMLS) Coding

To identify the clinical information, a large repository of clinical expressions is needed against which matching expressions can be extracted. The UMLS is used for this purpose. The UMLS is a compendium of many controlled vocabularies in the biomedical sciences, created in 1986. It provides a mapping structure among these vocabularies and allows translation among the various terminology systems. It may be viewed as a comprehensive thesaurus and ontology of biomedical concepts [20].

The UMLS consists of the following components [20]:

• Metathesaurus: the core database of the UMLS, a collection of concepts and terms from the various controlled vocabularies and their relationships.

• Semantic Network: a set of categories and relationships that are used to classify and relate the entries in the Metathesaurus.

• Specialist Lexicon: a database of lexicographic information for use in natural language processing.

• A number of supporting software tools.

Morphologically analyzed words are compared with the UMLS entries to find the best-matching expression according to its morphological position. Each noun phrase that matches a clinical expression entry in the UMLS is stored as a pair containing the noun phrase and its UMLS clinical code.

Figure 8 UMLS expressions coding

The pseudocode of the UMLS coding algorithm is:

For each statement S in Statements    // statements in the physician sheet
Begin
    For each noun phrase N in S
    Begin
        If N exists in UMLS then
            Extract N and C    // where C is the UMLS code of N
            Put N with C as pair <N, C>
        End if
    End
End
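The pseudocode can be rendered as a short runnable sketch. It assumes that the UMLS is represented by a plain lookup table from normalized phrase to code; the table UMLS_LOOKUP, its placeholder codes, and the function code_statements are hypothetical and do not reflect the real UMLS interface or real concept identifiers.

```python
# Hypothetical lookup table standing in for the UMLS Metathesaurus.
# The codes are placeholders, not real UMLS concept identifiers.
UMLS_LOOKUP = {
    "enlarged prostate": "CODE-0001",
    "tender base of the bladder": "CODE-0002",
}


def code_statements(statements, lookup=UMLS_LOOKUP):
    """Pair each noun phrase found in the lookup table with its code.

    statements is a list of statements, each given as a list of noun phrases.
    Returns a list of (noun_phrase, code) pairs in reading order.
    """
    pairs = []
    for noun_phrases in statements:           # For each statement S
        for phrase in noun_phrases:           # For each noun phrase N in S
            code = lookup.get(phrase.strip().lower())
            if code is not None:              # If N exists in UMLS
                pairs.append((phrase, code))  # Put N with C as pair <N, C>
    return pairs
```

Phrases with no entry in the table are skipped, which mirrors the "If N exists in UMLS" test in the pseudocode.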

Step 5: Classify EHR Components

The suggested technique is applied to the physician's examination sheet. The examination sheet contains the following classes:

• History

• Examination

• Diagnosis

• Procedure

Each part is treated as a class, and all coded clinical data produced by the previous steps is classified into one of these classes.

The first step in the classification process is building a collective set of features, typically called a dictionary. The UMLS clinical expressions in the dictionary form the basis for a spreadsheet of numeric data corresponding to the previously defined classes.

TABLE (1): CLASSES DICTIONARY 


 

Each row defines a class and each column represents a UMLS code. A cell in the spreadsheet represents a measurement of the feature corresponding to the column for the class corresponding to the row. The dictionary of words covers all the possibilities, and the number of dictionary entries corresponds to the number of columns. All cell values range between zero and one depending on whether the word was encountered in the class or not. The form of the classes dictionary is shown in table (1).

The second step is measuring the similarity between the extracted expressions and the defined classes, then classifying each expression into the most similar class. The cosine algorithm was selected to calculate the similarity between the extracted clinical phrases and the predefined classes. The steps of the cosine similarity algorithm are:

• Compute the similarity of the new clinical phrase to all classes in the dictionary.

• Select the class that is most similar to the new clinical phrase.

• The class that occurs most frequently is the similar one.

For cosine similarity, only positive words shared by the compared phrases are considered. Frequency of word occurrence is also valued. The clinical phrase is compared with each class by the following equation [21]:

\[ \mathrm{Norm}(P) = \sqrt{\sum_{j} w_{P}(j)^{2}} \]

\[ \mathrm{Cosine}(P_1, P_2) = \frac{\sum_{j} w_{P_1}(j)\, w_{P_2}(j)}{\mathrm{Norm}(P_1)\,\mathrm{Norm}(P_2)} \]

where w_{P_i}(j) is the weight of word j in the phrase or class P_i.

The cosine similarity of two classes ranges from 0 to 1, because the angle between two term frequency vectors cannot be greater than 90°. Consequently, the closer the cosine value is to 1, the more similar the clinical phrase is to the compared class.
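The classification step can be sketched as follows, assuming that each class and each clinical phrase is represented as a mapping from a UMLS code to a non-negative weight. The function names and the sparse-dictionary representation are choices made for this example, not details given in the paper.

```python
import math


def norm(weights):
    """Euclidean norm of a sparse weight vector given as {code: weight}."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine(p1, p2):
    """Cosine similarity of two sparse weight vectors; 0.0 if either is empty."""
    shared = set(p1) & set(p2)  # only features present in both contribute
    dot = sum(p1[j] * p2[j] for j in shared)
    denom = norm(p1) * norm(p2)
    return dot / denom if denom else 0.0


def classify(phrase_weights, class_dictionary):
    """Return the most similar class name and the score for every class."""
    scores = {name: cosine(phrase_weights, weights)
              for name, weights in class_dictionary.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

With non-negative weights the score stays between 0 and 1, matching the range stated above. If all scores tie, max returns the first class in dictionary order, so a real system would need an explicit tie-breaking rule.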

Step 6: Storing Data in the EHR Repository

The classified clinical phrase is stored in its class inside the EHR database together with its matched UMLS code. For example, suppose a physician wrote the following:

There is enlarged prostate with tender base of the bladder.

This statement contains two findings. It is compared with each class, and the cosine scores for the statement against each defined class are calculated according to the previous equations. The winning class is the one with the highest score. The data is stored in the winning class with its UMLS codes as pairs inside the EHR repository:

<enlarged prostate, Finding>

<tender base of the bladder, Finding>

The EHR is thus put in a structured form suitable for analysis and data mining operations, or as a resource for decision support systems.
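A minimal sketch of the storage step is shown below. It uses SQLite from the Python standard library purely as a stand-in for the EHR database. The table name ehr_entry, its columns, the helper functions, and the codes in the test data are all assumptions made for illustration and are not the schema of the system described here.

```python
import sqlite3


def open_repository(path=":memory:"):
    """Open (or create) a small repository with one table of classified phrases."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ehr_entry ("
        "patient_id TEXT NOT NULL, class_name TEXT NOT NULL, "
        "phrase TEXT NOT NULL, umls_code TEXT NOT NULL)"
    )
    return conn


def store_phrase(conn, patient_id, class_name, phrase, umls_code):
    """Store one classified phrase together with its matched code."""
    conn.execute(
        "INSERT INTO ehr_entry VALUES (?, ?, ?, ?)",
        (patient_id, class_name, phrase, umls_code),
    )
    conn.commit()


def phrases_for_class(conn, patient_id, class_name):
    """Return (phrase, code) pairs of one patient for one class, in insertion order."""
    rows = conn.execute(
        "SELECT phrase, umls_code FROM ehr_entry "
        "WHERE patient_id = ? AND class_name = ? ORDER BY rowid",
        (patient_id, class_name),
    )
    return rows.fetchall()
```

Because each phrase is stored as its own row with a class and a code, later queries can filter by class or code instead of searching free text, which is the benefit of the structured form described above.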

VI.  THE EXPERIMENTAL STUDY 

The aim of the experiment is to demonstrate that the suggested technique works in real-world cases. The experiment rests on the following assumptions:

  The physician has little experience in using computers.

  The physician’s handwriting is readable.

  The medical abbreviations used are standard.

  The experiment is applied during the examination session.

The equipment required to run the experiment is:

  An electronic pen pad.

  A laptop or personal computer.

  Windows Vista or later.

  SQL Server 2008.

  Microsoft Office 2007 or later (for applying OCR to the pen pad input).

  .NET Framework 4.

  UMLS database system.

  Medical dictionary (for spelling correction).

The implementation of the experimental study goes through the following steps:

Step 1: At the nurse’s office, the patient’s demographic data is recorded using the following screen.

Figure 9: Computing similarity scores for New Clinical Phrase

(IJCSIS) International Journal of Computer Science and Information Security,

Vol. 9, No. 6, June 2011

31 http://sites.google.com/site/ijcsis/

ISSN 1947-5500


Step 2: The physician uses the pen pad to write the diagnosis.

The physician is free to erase, add, or modify any part of the diagnosis. This step lets the physician work as usual without any additional effort. The data is recorded directly on the computer, so the physician can easily retrieve it later, either in its original form or as structured data.

Step 3: After finishing the handwriting, the physician presses the OCR button to convert the diagnosis from image form to machine-coded text, as shown in the following figure:

Step 4: After the OCR is done, the system checks and corrects spelling errors in the examination data against the installed medical dictionary, through an interactive session with the physician.
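The dictionary lookup behind this step might look like the following sketch. It is only an illustration under stated assumptions: the paper does not say how candidate corrections are generated, so the standard-library `difflib.get_close_matches` is used here as a stand-in, and the word list, the `suggest` helper, and the cutoff value are hypothetical.

```python
import difflib

# Hypothetical excerpt of the installed medical dictionary.
MEDICAL_TERMS = [
    "enuresis", "nocturnal", "prostate",
    "bladder", "abdomen", "ultrasonography",
]


def suggest(word: str, vocabulary: list, n: int = 3, cutoff: float = 0.75) -> list:
    """Return up to n close dictionary terms, or [] if the word is already known."""
    lowered = word.lower()
    if lowered in vocabulary:
        return []  # correctly spelled, nothing to propose
    # Candidates are ranked by similarity ratio, best match first.
    return difflib.get_close_matches(lowered, vocabulary, n=n, cutoff=cutoff)


suggestions = suggest("enurisis", MEDICAL_TERMS)
print(suggestions)
```

In an interactive session, the suggestions would be shown to the physician, who accepts one of them or keeps the original word.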

Step 5: After the spelling correction is done, the physician presses the “insert into EHR” button to convert the diagnosis data from unstructured to structured form. The conversion is done through the following steps:

  Text preprocessing: All brackets, other unwanted characters, and word boundaries are removed.

Figure 10: EHR demographics form

Figure 11: Pen pad to Computer Form

Figure 12: Applying OCR on the diagnosis sheet 

Figure 13: Applying spell check on the examination text


  Part-of-speech tagging: A part of speech is assigned to each word.

  Statement segmentation: The examination text is split into multiple statements.

  Phrase tagging: Each phrase is tagged with the suitable code to identify all phrases contained in the diagnosis sheet.

The output of this step is the examination text with the part of speech of each word; this output has the following format:

(TOP (S (NP (DT A) (ADJP (NP (CD 15) (NNS years)) (JJ old)) (JJ female) (NN patient)) (VP (VBZ complains) (PP (IN from) (NP (JJ nocturnal) (NN enuresis))) (PP (IN since) (NP (NN birth)))) (. .)))

(TOP (S (NP (NP (JJ Plain) (NN X-ray)) (PP (IN of) (NP (DT the) (NN abdomen)))) (VP (VBD was) (ADJP (JJ free))) (. .)))

(TOP (S (NP (JJ Abdominal) (NN ultra) (NN sonography)) (VP (VBD was) (ADJP (JJ free))) (. .)))

(TOP (S (NP (PRP he)) (VP (VBZ has) (NP (NP (NNP Enuresis)) (SBAR (S (NP (DT The) (NN patient)) (VP (MD should) (VP (VB receive))))) (: :) (NP (NP (NNP R1) (NNP Uipam) (NN tablet)) (NP (NP (CD one) (NN tablet)) (NP (RB twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS months))))))) (. .)))

(TOP (S (PP (IN R2) (NP (NNP Dipripam) (CD 20) (NN mg) (NN capsule))) (NP (NP (CD one) (NN tablet)) (NP (RB twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS months)))) (. .)))

(TOP (S (NP (DT R3) (NNP Depavit) (NNP B12) (NN ampule)) (. .)))

  Noun Phrase Extraction:

All noun phrases are extracted and compounded. Noun phrases are divided into smaller noun phrases, such as the following:

o  A 15 years old female patient

o  15 years

o  Nocturnal enuresis since birth

o  Birth

o  Plain X-ray of the abdomen

o  Plain X-ray

o  The abdomen

o  Abdominal ultra sonography

o  Enuresis

o  The patient

o  R1 Uipam tablet

o  One tablet twice daily for three months

o  One tablet

o  Twice daily

o  Three months

o  Dipripam 20 mg capsule

o  One tablet twice daily for three months

o  One tablet

o  Twice daily

o  Three months

o  R3 Depavit B12 ampule
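The way nested noun phrases fall out of a bracketed parse can be illustrated with a short Python sketch. It assumes the parser output is a Penn-Treebank-style bracket string like the ones shown in this section; the helper names (`tokenize`, `parse`, `leaves`, `noun_phrases`) are hypothetical, and the paper's own extraction rules may differ, for example in how phrases are compounded.

```python
from typing import List, Tuple

Node = Tuple[str, list]  # (label, children); a child is a Node or a leaf word


def tokenize(text: str) -> List[str]:
    """Split a bracketed parse string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0) -> Tuple[Node, int]:
    """Parse the node opening at tokens[pos] == '('; return (node, next position)."""
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # leaf word
            pos += 1
    return (label, children), pos + 1


def leaves(node: Node) -> List[str]:
    """Collect the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def noun_phrases(node: Node) -> List[str]:
    """Return every NP under the node, outer phrases before the ones they contain."""
    found = [" ".join(leaves(node))] if node[0] == "NP" else []
    for child in node[1]:
        if isinstance(child, tuple):
            found.extend(noun_phrases(child))
    return found


SENTENCE = ("(TOP (S (NP (NP (JJ Plain) (NN X-ray)) (PP (IN of) (NP (DT the) "
            "(NN abdomen)))) (VP (VBD was) (ADJP (JJ free))) (. .)))")
tree, _ = parse(tokenize(SENTENCE))
PHRASES = noun_phrases(tree)
print(PHRASES)
```

For this sentence the sketch yields the full phrase followed by its two sub-phrases, which corresponds to the "Plain X-ray of the abdomen", "Plain X-ray", and "The abdomen" entries in the list above.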

Step 7: All noun phrases are coded with UMLS codes. The output of this step is shown in table (2).

TABLE (2): NOUN PHRASES WITH THEIR UMLS CODES.

Each statement gets a score according to the UMLS codes and the class dictionary given in table (1). Table (3) shows the statements and their scores.

TABLE (3):  STATEMENTS’ SCORE.

Step 8: According to the scores shown in table (3), the statements are classified into their classes. The predefined classes are:

  History

  Examination

  Diagnosis

  Procedure

Figure 14: Output of Text mining technique


The classifier uses the cosine similarity algorithm to classify each statement according to the class dictionary. Table (4) shows the score of each statement relative to the nearest class.

TABLE (4):  COS SIMILARITY SCORES FOR EACH CLASS.

Step 9: After the winning class for each statement is determined, each noun phrase is saved with its UMLS code inside the EHR, under the winning class, as a paired tag. Table (5) shows this format.

Step 10: The extracted information is compared with the physician’s manual results to determine the precision of the suggested technique.
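One way to carry out this comparison is sketched below. The paper does not give its precision formula, so the sketch assumes the standard definition: the fraction of extracted <phrase, class> pairs that also appear in the physician's manual result. The `precision` helper and the sample pairs are hypothetical.

```python
def precision(extracted: set, manual: set) -> float:
    """Fraction of extracted (phrase, class) pairs confirmed by the manual result."""
    if not extracted:
        return 0.0  # nothing extracted, so nothing can be correct
    return len(extracted & manual) / len(extracted)


# Hypothetical output of the technique for one diagnosis sheet.
extracted_pairs = {
    ("enlarged prostate", "Finding"),
    ("tender base of the bladder", "Finding"),
    ("the patient", "Diagnosis"),
}

# Hypothetical manual classification by the physician.
manual_pairs = {
    ("enlarged prostate", "Finding"),
    ("tender base of the bladder", "Finding"),
}

sheet_precision = precision(extracted_pairs, manual_pairs)
print(f"{sheet_precision:.2%}")
```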

VII.  RESULTS DISCUSSION 

The experimental study was conducted in four medical departments. In each department, 10 diagnosis sheets were tested. The tested departments are:

  Surgical Oncology

  Surgery Urology

  Cardiology

  General Surgery

Table (6) shows the overall precision percentage in each tested department.

TABLE (6):  RESULTS OF THE EXPERIMENTAL STUDY.

Department            Overall Precision

Surgical Oncology     92.96%

Surgery Urology       91.55%

Cardiology            92.33%

General Surgery       88.61%

Overall precision     91.36%

Some factors affect the results, such as the quality of the physician’s handwriting. The effect of this factor is clear in the result of the fourth experiment (General Surgery), which has the lowest precision (88.61%). A high-precision OCR tool can minimize the effect of this factor, but it may be expensive. The results indicate that the suggested technique achieved high precision (91.36% overall) in a real-world experiment, which suggests that the technique can be applied in practice in the future.
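The overall figure in table (6) is consistent with an unweighted mean of the four department results. The paper does not state how the overall value was computed, so the averaging below is an assumption; it is reasonable here because each department contributed the same number of diagnosis sheets (10).

```latex
\[
\text{Overall precision}
  = \frac{92.96 + 91.55 + 92.33 + 88.61}{4}
  = \frac{365.45}{4}
  = 91.3625\% \approx 91.36\%
\]
```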

VIII.  CONCLUSION 

The suggested technique succeeded in working as a bridge between unstructured and structured medical data. The medical data is stored in its correct position inside the EHR system without any additional physical or mental effort by the physician, which in turn satisfies the main objective of this research.

REFERENCES 

[1]  Institute of Medicine, “Review of the Adoption and Implementation of Health IT Standards by the DHHS Office of the National Coordinator for Health Information Technology”, http://www.iom.edu/Activities/Workforce/HealthITStandards.aspx

[2]  Richard Dick, Elaine B. Steen, and Don Detmer, “The Computer Based Patient Record: An Essential Technology for Health Care”, National Academy Press, 1997.

[3]  See HIMSS web page for the consensus definition of an electronic health record. http://www.himss.org/ASP/topics_ehr.asp.

[4]  J.H. van Bemmel and M.A. Musen, “Handbook of Medical Informatics”, Springer, 1997.

TABLE (5): DATA INSERTED INSIDE THE EHR


[5]  K. Ananda Mohan, “National Electronic Health Record Models”, Tata Consultancy Services (TCS), 2004.

[6]  R. H. Miller and Ida Sim, “Physicians’ Use of Electronic Medical Records: Barriers and Solutions”, Health Affairs, 2004.

[7]  Waegemann, “EHR vs. CPR vs. EMR”, Healthcare Informatics, 2003.

[8]  Himali Saitwal, Xuan Feng, Muhammad Walji, Vimla Patel, Jiajie Zhang, “Assessing performance of an Electronic Health Record (EHR) using Cognitive Task Analysis”, Elsevierhealth, 2010.

[9]  Lisa Pizziferri, Anne F. Kittler, Lynn A. Volk, Melissa M. Honour, Sameer Gupta, Samuel Wang, Tiffany Wang, Margaret Lippincott, Qi Li and David W. Bates, “Primary care physician time utilization before and after implementation of an electronic health record: A time-motion study”, Elsevierhealth, 2004.

[10] American Academy of Family Physicians, “Family Practice Management Monitor”, AAFP pushes for affordable EMR system, 2004.

[11] Oleh Hrycko, “Electronic Discovery in Canada: Best Practices and Guidelines”, CCH, 2007.

[12] Angus Roberts, Robert Gaizauskas, Mark Hepple, George Demetriou, Yikun Guo, Ian Roberts, Andrea Setzer, “Building a semantically annotated corpus of clinical texts”, Elsevierhealth, 2009.

[13] Hanna M. Seidling, Marilyn D. Paterno, Walter E. Haefeli, David W. Bates, “Coded entry versus free-text and alert overrides: What you get depends on how you ask”, Elsevierhealth, 2010.

[14] Adam Wright, Elizabeth S. Chen and Francine L. Maloney, “An automated technique for identifying associations between medications, laboratory results and problems”, Elsevierhealth, 2010.

[15] Ergin Soysal, Ilyas Cicekli, Nazife Baykal, “An ontology based information extraction system for radiological reports”, Elsevierhealth, 2010.

[16] Christian Senger, Jens Kaltschmidt, Simon P.W. Schmitt, Markus G. Pruszydlo, Walter E. Haefeli, “Misspellings in drug information system queries: Characteristics of drug name spelling errors and strategies for their prevention”, Elsevierhealth, 2010.

[17] Yong-gang Cao, James J. Cimino, John Ely, Hong Yu, “Automatically extracting information needs from complex clinical questions”, Elsevierhealth, 2010.

[18] Dina Demner-Fushman, James G. Mork, Sonya E. Shooshan, Alan R. Aronson, “UMLS content views appropriate for NLP processing of the biomedical literature vs. clinical text”, Elsevierhealth, 2009.

[19] Malgorzata Marciniak, Agnieszka Mykowiecka, “Aspects of Natural Language Processing”, Springer, 2009.

[20] Catherine R. Selden, Betsy L. Humphreys, “Unified Medical Language System: Current Bibliographies in Medicine”, National Institutes of Health, 1990.

[21] Jiawei Han, Micheline Kamber, “Data mining: concepts and techniques”, Diana Cerra, 2006.
