ijcsmc, vol. 5, issue. 5, may 2016, pg.771 medical records ... · 3) we fetch the data from search...

Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782

© 2016, IJCSMC All Rights Reserved 771

Available Online at www.ijcsmc.com

International Journal of Computer Science and Mobile Computing

A Monthly Journal of Computer Science and Information Technology

ISSN 2320–088X IMPACT FACTOR: 5.258

IJCSMC, Vol. 5, Issue. 5, May 2016, pg.771 – 782

Medical Records Clustering Based on

the Text Fetched from Records

Ms. Ramya.S.Bhat1, Mrs. A.Rafega Beham

2

¹M.Tech Scholor, Department of Information Science and Engineering New Horizon College Of Engineering, Bangalore, India

²Asst.Professor, Department of Information Science and Engineering

New Horizon College Of Engineering, Bangalore, India 1 [email protected], 2 [email protected]

Abstract- This paper describes how the rich available data from patient’s medical records can be clustered

and hidden information can be retrieved out of it. We first collect the 49 patient’s medical records, use

annotators to extract the text based on symptom occurred and medical drug name. The fetched text are

clustered and stored in a file. When a combination of medical terms taken from medical documents is given

as a query through the search engine shows the clustered documents. We use MetaMap and Medex as

annotators for extracting the symptom names and the pharmaceutical names. For clustering the fetched data

we are using the multi view NMF, which is a clustering technique.

Keywords:- multi view NMF, Metamap, Medex, document clustering.

I. INTRODUCTION

Data mining is a process of extracting the hidden data from the large set of data to extract the feasible

pattern. There are many types of data mining. One of the techniques among them is the clustering technique.

Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar

to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common

technique for statistical data analysis, used in many fields, including machine learning, pattern recognition,

image analysis, information retrieval, bioinformatics, data compression, document clustering and computer

graphics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be

achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to

efficiently find them. Popular notions of clusters include groups with small distances among the cluster

members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore



be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter

settings (including values such as the distance function to use, a density threshold or the number of expected

clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an

automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that

involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the

result achieves the desired properties.

One application of clustering technique, ie the document clustering is explained in this paper.

Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that

describe the contents within the cluster. Document clustering is generally considered to be a centralized process.

Examples of document clustering include web document clustering for search users.

The application of document clustering can be categorized to two types, online and offline. Online

applications are usually constrained by efficiency problems when compared to offline applications.

II. RELATED WORK

First, we pre-process clinical notes to identify words and sentences from clinical notes. During the pre-

processing, we use section annotator to identify different sections for each clinical note. The section annotator

depends on the section header information from clinical notes. We also use negation annotator to remove

negation symptom and medication names. Pre-negation is negation words like avoid, deny, cannot, without, and

so on. Post-negation is negation words like free, was ruled out, and so on. After pre-process, we use symptom

annotator based on the Meta Map to extract symptom names from clinical notes.

Meanwhile, we use medication annotator based on MedEx System to extract medication names from

clinical notes. We use MetaMap to extract symptom names from clinical notes. MetaMap is a program that

maps biomedical texts to concepts in the UMLS Meta-thesaurus. Since Meta map returns all types of concepts,

we only keep these concepts related to symptom names, such as concept labeled as “sosy,” which represents

“sign and symptom.” We use MedEx system to extract medication names from clinical notes.

Advantages

1. By using symptom names and medication names, the clustering performance can be improved.

2. It also indicates that multi-view NMF can achieve better results than NMF.

Unsupervised learning algorithms such as principal components analysis and vector quantization can

be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints

utilized, the resulting factors can be shown to have very different representational properties. Principal

components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation

that uses cancellations to generate variability. On the other hand, vector quantization uses a hard winner- take-

all constraint that results in clustering the data into mutually exclusive prototypes.

It has been shown that non negativity is a useful constraint for matrix factorization that can learn a

parts representation of the data .The nonnegative basis vectors that are learned are used in distributed, yet still



sparse combinations to generate expressiveness in the reconstructions .In this submission, we analyze in detail

two numerical algorithms for learning the optimal nonnegative factors from data.

We formally consider algorithms for solving the following problem:

Non-negative matrix factorization (NMF) Given a non-negative matrix V, find non-negative matrix

factors Wand H such that: V~WH (1)

NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set

of of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n x m matrix V where

m is the number of examples in the data set. This matrix is then approximately factorized into an n x r matrix

Wand an r x m matrix H. Usually r is chosen to be smaller than nor m, so that Wand H are smaller than the

original matrix V. This results in a compressed version of the original data matrix.

What is the significance of the approximation in Eq. (1)? It can be rewritten column by column as

v ~ Wh, where v and h are the corresponding columns of V and H. In other words, each data vector v is

approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W can

be regarded as containing a basis that is optimized for the linear approximation of the data in V. Since relatively

few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis

vectors discover structure that is latent in the data.

Some of the applications of Non matrix factorization are:-

Text Mining for Email Surveillance

Electronic Mail Sub collections

Term Weighting

Document clustering.

Our contributions in this paper are:

1) We present a system for extracting symptom/medication names from clinical notes.

2) We apply multi-view NMF to evaluate the effects of using medication/symptom names to improve the

clinical documents clustering results.

3) We fetch the data from search engine by sending the query.

Fig 1.Overview of Designed System



The execution of the methodology is shown below:

In this paper 49 set of patient’s records were taken to extract the symptoms occurred and medical drug

name. The extraction of the data is shown below:

Fig 2. Extraction of medical drug name and symptom(record 0-5)

Fig 3.Extraction of medical drug name and symptom(record 6-11)



Clustering of the documents

Fig 10.Clustering of documents(cluster 0-31)

It shows that the first record is in the first cluster, second record in the third cluster and so on.

Fig 11. Clustering of documents(cluster 32-48)

After the documents are clustered the query is sent through a search engine so as to retrieve the

clustered data on the basis of symptom occurred and medical drug name from the patient’s medical record.



Fig 12. Home page to enter credentials

Login credentials are entered.

Fig 13. Credentials Entered



Fig 14. Search Terms sent as query

A search terms are the medical terms observed from the medical records which are sent as an query.In

the above the search terms given are diabetes mellitus and glaucoma.

Fig 15.Clusterd data on the basis of medication and symptom for a given query.



Fig 16. Clustered data on the basis of medication and symptom for a given query obesity and insomnia.

The above screen shot depicts the clustered data for the search terms obesity and insomnia.

III. CONCLUSION

This paper explains about the document clustering done by the technique Multi-view NMF based on

the symptom occurred and medical drug name taken from the patient’s medical records. The medical records

which are of rich text are taken into account in order to find the hidden information from them. In this approach

we first extract the nouns by removing the negation annotators and fetch the nouns. The text fetched is further

for clustering using the NMF clustering technique. The clustered data is stored which can be used further to find

the symptoms occurred and the medical drug name with similarity when a query is passed through the search

engine which is useful in the healthcare sector especially at medical transcription sector.

REFERENCES

[1] K. Roberts and S. M. Harabagiu, “A flexible framework for deriving assertions from electronic medical records,” J. Amer. Med.

Informat. Assoc., vol. 18, no. 5, pp. 568–573, 2011.

[2] M.-Y. Kim et al., “Patient information extraction in noisy tele health texts,” in Proc. IEEE Int. Conf. Bioinformat. Biomed. (BIBM’13).

[3] F.S.Roqueetal.,“Using electronic patient records to discover disease correlations and stratify patient cohorts,” PLoS Comput. Biol., vol.

7, no. 8, p. E1002141, 2011.

[4] G. Hripcsak et al., “Mining complex clinical data for patient safety research: a framework for event discovery,” J. Biomed. Informat.,

vol. 36, no. 1, pp. 120–130, 2003.

[5] S. V. Pakhomov, A. Ruggieri, and C. G. Chute, “Maximum entropy modelling for mining patient medication status from free text,”

inProc. AMIA Symp. Amer. Med. Informat. Assoc., 2002.

[6] A. Henriksson, “Semantic spaces of clinical text: Leveraging distributional semantics for natural language processing of electronic health

records,” Lic. degree, Dept. Comput. Syst. Sci., Stockholm Univ., Stockholm, Sweden, 2013.

[7] S. Kushinka, “Clinical documentation: EHR deployment techniques,” in California HealthCare Found., 2010 [Online]. Available:

http://www.chcf.org/~/media/MEDIA%20LIBRARY%20Files/PDF/ C/PDF%20ClinicalDocumentationEHRDeploymentTechniques.pdf



[8] K. Chaudhuri, S. Kakade, K. Livescu, and K. Srid- haran. Multi-view clustering via canonical correlation analysis. In ICML, pages 129–

136, 2009.

[9] C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic

indexing. Computational Statistics Data Analysis, 52:3913 – 3927, 2008.

[10] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In KDD, pages 126–135, 2006.

[11] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In SIGIR, pages 601–602, 2005.

[12] G. Greco, A. Guzzo, and L. Pontieri. Coclustering multiple heterogeneous domains: Linear combinations and agreements. TKDE,

22:1649–1663, 2010.

[13] D. Greene and P. Cunningham. A matrix factorization approach for integrating multiple data views. In ECML PKDD, pages 423–438,

2009.

[14] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50–57, 1999.

[15] A. Kumar and H. Daum´e III. A co-training approach for multi-view spectral clustering. In ICML, pages 393– 400, 2011.

ijcsmc, vol. 5, issue. 5, may 2016, pg.771 medical records ... · 3) we fetch the data from search...

Documents