ijcsmc, vol. 5, issue. 5, may 2016, pg.771 medical records ... · 3) we fetch the data from search...
TRANSCRIPT
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 771
Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X IMPACT FACTOR: 5.258
IJCSMC, Vol. 5, Issue. 5, May 2016, pg.771 – 782
Medical Records Clustering Based on
the Text Fetched from Records
Ms. Ramya.S.Bhat1, Mrs. A.Rafega Beham
2
¹M.Tech Scholor, Department of Information Science and Engineering New Horizon College Of Engineering, Bangalore, India
²Asst.Professor, Department of Information Science and Engineering
New Horizon College Of Engineering, Bangalore, India 1 [email protected], 2 [email protected]
Abstract- This paper describes how the rich available data from patient’s medical records can be clustered
and hidden information can be retrieved out of it. We first collect the 49 patient’s medical records, use
annotators to extract the text based on symptom occurred and medical drug name. The fetched text are
clustered and stored in a file. When a combination of medical terms taken from medical documents is given
as a query through the search engine shows the clustered documents. We use MetaMap and Medex as
annotators for extracting the symptom names and the pharmaceutical names. For clustering the fetched data
we are using the multi view NMF, which is a clustering technique.
Keywords:- multi view NMF, Metamap, Medex, document clustering.
I. INTRODUCTION
Data mining is a process of extracting the hidden data from the large set of data to extract the feasible
pattern. There are many types of data mining. One of the techniques among them is the clustering technique.
Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar
to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including machine learning, pattern recognition,
image analysis, information retrieval, bioinformatics, data compression, document clustering and computer
graphics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to
efficiently find them. Popular notions of clusters include groups with small distances among the cluster
members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 772
be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter
settings (including values such as the distance function to use, a density threshold or the number of expected
clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an
automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that
involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the
result achieves the desired properties.
One application of clustering technique, ie the document clustering is explained in this paper.
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that
describe the contents within the cluster. Document clustering is generally considered to be a centralized process.
Examples of document clustering include web document clustering for search users.
The application of document clustering can be categorized to two types, online and offline. Online
applications are usually constrained by efficiency problems when compared to offline applications.
II. RELATED WORK
First, we pre-process clinical notes to identify words and sentences from clinical notes. During the pre-
processing, we use section annotator to identify different sections for each clinical note. The section annotator
depends on the section header information from clinical notes. We also use negation annotator to remove
negation symptom and medication names. Pre-negation is negation words like avoid, deny, cannot, without, and
so on. Post-negation is negation words like free, was ruled out, and so on. After pre-process, we use symptom
annotator based on the Meta Map to extract symptom names from clinical notes.
Meanwhile, we use medication annotator based on MedEx System to extract medication names from
clinical notes. We use MetaMap to extract symptom names from clinical notes. MetaMap is a program that
maps biomedical texts to concepts in the UMLS Meta-thesaurus. Since Meta map returns all types of concepts,
we only keep these concepts related to symptom names, such as concept labeled as “sosy,” which represents
“sign and symptom.” We use MedEx system to extract medication names from clinical notes.
Advantages
1. By using symptom names and medication names, the clustering performance can be improved.
2. It also indicates that multi-view NMF can achieve better results than NMF.
Unsupervised learning algorithms such as principal components analysis and vector quantization can
be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints
utilized, the resulting factors can be shown to have very different representational properties. Principal
components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation
that uses cancellations to generate variability. On the other hand, vector quantization uses a hard winner- take-
all constraint that results in clustering the data into mutually exclusive prototypes.
It has been shown that non negativity is a useful constraint for matrix factorization that can learn a
parts representation of the data .The nonnegative basis vectors that are learned are used in distributed, yet still
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 773
sparse combinations to generate expressiveness in the reconstructions .In this submission, we analyze in detail
two numerical algorithms for learning the optimal nonnegative factors from data.
We formally consider algorithms for solving the following problem:
Non-negative matrix factorization (NMF) Given a non-negative matrix V, find non-negative matrix
factors Wand H such that: V~WH (1)
NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set
of of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n x m matrix V where
m is the number of examples in the data set. This matrix is then approximately factorized into an n x r matrix
Wand an r x m matrix H. Usually r is chosen to be smaller than nor m, so that Wand H are smaller than the
original matrix V. This results in a compressed version of the original data matrix.
What is the significance of the approximation in Eq. (1)? It can be rewritten column by column as
v ~ Wh, where v and h are the corresponding columns of V and H. In other words, each data vector v is
approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W can
be regarded as containing a basis that is optimized for the linear approximation of the data in V. Since relatively
few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis
vectors discover structure that is latent in the data.
Some of the applications of Non matrix factorization are:-
Text Mining for Email Surveillance
Electronic Mail Sub collections
Term Weighting
Document clustering.
Our contributions in this paper are:
1) We present a system for extracting symptom/medication names from clinical notes.
2) We apply multi-view NMF to evaluate the effects of using medication/symptom names to improve the
clinical documents clustering results.
3) We fetch the data from search engine by sending the query.
Fig 1.Overview of Designed System
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 774
The execution of the methodology is shown below:
In this paper 49 set of patient’s records were taken to extract the symptoms occurred and medical drug
name. The extraction of the data is shown below:
Fig 2. Extraction of medical drug name and symptom(record 0-5)
Fig 3.Extraction of medical drug name and symptom(record 6-11)
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 775
Fig 4. Extraction of medical drug name and symptom(record 12-17)
Fig 5. Extraction of medical drug name and symptom(record 18-23)
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 776
Fig 6. Extraction of medical drug name and symptom(record 24-29)
Fig 7. Extraction of medical drug name and symptom(record 30-35)
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 777
Fig 8. Extraction of medical drug name and symptom(record 36-41)
Fig 9. Extraction of medical drug name and symptom(record 42-48)
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 778
Clustering of the documents
Fig 10.Clustering of documents(cluster 0-31)
It shows that the first record is in the first cluster, second record in the third cluster and so on.
Fig 11. Clustering of documents(cluster 32-48)
After the documents are clustered the query is sent through a search engine so as to retrieve the
clustered data on the basis of symptom occurred and medical drug name from the patient’s medical record.
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 779
Fig 12. Home page to enter credentials
Login credentials are entered.
Fig 13. Credentials Entered
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 780
Fig 14. Search Terms sent as query
A search terms are the medical terms observed from the medical records which are sent as an query.In
the above the search terms given are diabetes mellitus and glaucoma.
Fig 15.Clusterd data on the basis of medication and symptom for a given query.
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 781
Fig 16. Clustered data on the basis of medication and symptom for a given query obesity and insomnia.
The above screen shot depicts the clustered data for the search terms obesity and insomnia.
III. CONCLUSION
This paper explains about the document clustering done by the technique Multi-view NMF based on
the symptom occurred and medical drug name taken from the patient’s medical records. The medical records
which are of rich text are taken into account in order to find the hidden information from them. In this approach
we first extract the nouns by removing the negation annotators and fetch the nouns. The text fetched is further
for clustering using the NMF clustering technique. The clustered data is stored which can be used further to find
the symptoms occurred and the medical drug name with similarity when a query is passed through the search
engine which is useful in the healthcare sector especially at medical transcription sector.
REFERENCES
[1] K. Roberts and S. M. Harabagiu, “A flexible framework for deriving assertions from electronic medical records,” J. Amer. Med.
Informat. Assoc., vol. 18, no. 5, pp. 568–573, 2011.
[2] M.-Y. Kim et al., “Patient information extraction in noisy tele health texts,” in Proc. IEEE Int. Conf. Bioinformat. Biomed. (BIBM’13).
[3] F.S.Roqueetal.,“Using electronic patient records to discover disease correlations and stratify patient cohorts,” PLoS Comput. Biol., vol.
7, no. 8, p. E1002141, 2011.
[4] G. Hripcsak et al., “Mining complex clinical data for patient safety research: a framework for event discovery,” J. Biomed. Informat.,
vol. 36, no. 1, pp. 120–130, 2003.
[5] S. V. Pakhomov, A. Ruggieri, and C. G. Chute, “Maximum entropy modelling for mining patient medication status from free text,”
inProc. AMIA Symp. Amer. Med. Informat. Assoc., 2002.
[6] A. Henriksson, “Semantic spaces of clinical text: Leveraging distributional semantics for natural language processing of electronic health
records,” Lic. degree, Dept. Comput. Syst. Sci., Stockholm Univ., Stockholm, Sweden, 2013.
[7] S. Kushinka, “Clinical documentation: EHR deployment techniques,” in California HealthCare Found., 2010 [Online]. Available:
http://www.chcf.org/~/media/MEDIA%20LIBRARY%20Files/PDF/ C/PDF%20ClinicalDocumentationEHRDeploymentTechniques.pdf
Ramya.S.Bhat et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.5, May- 2016, pg. 771-782
© 2016, IJCSMC All Rights Reserved 782
[8] K. Chaudhuri, S. Kakade, K. Livescu, and K. Srid- haran. Multi-view clustering via canonical correlation analysis. In ICML, pages 129–
136, 2009.
[9] C. Ding, T. Li, and W. Peng. On the equivalence be- tween non-negative matrix factorization and probabilis- tic latent semantic
indexing. Computational Statistics Data Analysis, 52:3913 – 3927, 2008.
[10] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In KDD, pages 126–135, 2006.
[11] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In SIGIR, pages 601–602, 2005.
[12] G. Greco, A. Guzzo, and L. Pontieri. Coclustering multiple heterogeneous domains: Linear combinations and agreements. TKDE,
22:1649–1663, 2010.
[13] D. Greene and P. Cunningham. A matrix factorization approach for integrating multiple data views. In ECML PKDD, pages 423–438,
2009.
[14] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50–57, 1999.
[15] A. Kumar and H. Daum´e III. A co-training approach for multi-view spectral clustering. In ICML, pages 393– 400, 2011.