![Page 1: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/1.jpg)
Use of SAS-Based Natural Language Processing to Identify Incident and Recurrent Malignancies
Justin A. Strauss, MA
Research Associate III
Kaiser Permanente Southern California
May 1, 2012 • 2012 HMORN Conference • Seattle, Washington
![Page 2: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/2.jpg)
Co-Authors & Funding
• Chun R. Chao, PhD
• Marilyn L. Kwan, PhD
• Syed A. Ahmed, MD
• Joanne E. Schottinger, MD
• Virginia P. Quinn, PhD
![Page 3: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/3.jpg)
Acknowledgements & Funding
• Mayra Martinez, Michelle McGuire, Melissa Preciado, Nirupa Ghai, and Jeff Slezak (KPSC); Lawrence Kushi (KPNC); Debra Ritzwoller (KPCO); Joan Warren (NCI); Jianyu Rao and Jiaoti Huang (UCLA)
• Funding was provided by KPSC Community Benefit and the Cancer Research Network
![Page 4: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/4.jpg)
Malignancy Identification
• Malignancy identification is important for clinical and epidemiologic cancer research.
• Limited quality and availability of incident and recurrent malignancy data within health plans.
• Delayed availability of incident malignancy data from cancer registries.
• Few registries track cancer recurrences.
• Manual chart abstraction is slow and expensive.
• Previous research has shown electronic diagnosis codes (e.g., ICD-9) to be unreliable.
![Page 5: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/5.jpg)
Natural Language Processing
• Natural language processing (NLP) can be used to identify and extract information from electronic clinical text, including incident and recurrent malignancy data.
• Increasing opportunity for NLP with adoption of electronic clinical systems in patient care delivery.
• Despite its potential value in clinical and research settings, NLP usage has been relatively sparse. Contributing factors may include:
• Technical complexity
• Systems integration requirements
• Habitual use of existing methods
![Page 6: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/6.jpg)
SCENT Overview
• A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to identify incident and recurrent malignancies using text from pathology reports.
• SCENT is currently being implemented in two research studies at Kaiser Permanente Southern California (KPSC):
• Intervention to improve medication adherence among breast cancer patients.
• Differences in the prognosis of prostate cancer patients according to their genetic factors.
• Use of SAS programming minimizes implementation barriers and increases availability for multisite research.
![Page 7: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/7.jpg)
Description of Methods
• SCENT identifies non-negated clinical concepts within pathology report text.
• Built using SAS Base (does not require Text Miner add-on).
• Makes extensive use of SAS hash objects and regular expressions.
• Includes components for preprocessing, matching, negation and uncertainty detection, extracting diagnostic information (e.g., staging and Gleason score), and classifying report malignancy status.
• Flexibility to assign codes using a variety of coding systems.
• Validation used a subset of SNOMED 3.x (~1000 concepts).
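As a rough illustration of the regular-expression-based extraction of diagnostic information mentioned above, the sketch below pulls a Gleason score out of report text. It is written in Python for illustration only (SCENT itself is written in SAS), and the pattern `GLEASON_RE` and the function `extract_gleason` are hypothetical names and rules, not taken from SCENT.

```python
import re

# Hypothetical pattern (an assumption, not SCENT's actual rule): matches
# phrases such as "Gleason score 3+4=7" or "Gleason 4 + 3".
GLEASON_RE = re.compile(
    r"gleason(?:'s)?\s*(?:score|grade|sum)?\s*:?\s*"
    r"(?P<primary>[1-5])\s*\+\s*(?P<secondary>[1-5])",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return primary and secondary patterns and their sum, or None if absent."""
    match = GLEASON_RE.search(text)
    if match is None:
        return None
    primary = int(match.group("primary"))
    secondary = int(match.group("secondary"))
    return {"primary": primary, "secondary": secondary, "total": primary + secondary}


print(extract_gleason("PROSTATIC ADENOCARCINOMA, GLEASON SCORE 3+4=7."))
```

A production rule set would likely need to handle more phrasings (for example, a total score reported without its two components), which this sketch does not attempt.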
![Page 8: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/8.jpg)
SCENT Process Diagram
(Flow diagram; the labels and example strings recoverable from the slide are listed below.)
• Clinical Concepts (Excel): Type (morphology, topology, or procedural); Code (SNOMED 3.x); Class (malignant, basaloid, benign, or N/A); Description (concept description).
• Concept Dictionary (SAS): Tokenize Words, Clean, Enhance, Regular Expressions.
• Concept example: Code M-85033; Description: intraductal papillary adenocarcinoma with invasion; tokens [intraductal][papillary][adenocarcinoma][with][invasion]; regular expressions [((intra)?duct(al)?)][papillar(y|ies)][adenocarcinoma[ls]?]
• Pathology Text (Research Database): Text (raw text segment from report); Line (sequential text segment identifier).
• Report processing: Preprocessed Text, Loop Concepts, Examine Segments, Tokenize Words, Match Tokens, Check Negation, Code Matches, Extract Data.
• Segment examples: [moderately-differentiated ductal adenocarcinoma with papillary][features.][the tumor involves 0.6 cm of one core.] and [moderately-differentiated ductal adenocarcinoma with papillary features.][the tumor involves 0.6 cm of one core.]
• Tokenized segment: [moderately] [differentiated] [ductal] [adenocarcinoma] with [papillary] [features]
• Matched concept tokens: [((intra)?duct(al)?)] and [adenocarcinoma[ls]?][papillar(y|ies)]
• Negation patterns: free (of|from); not? (support[a-z]*|identified); non(?!small|hodgkins)
• Coded match: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features
• Extract Data: Disease Extent, Diagnostic Certainty, Tumor Staging, Gleason Score.
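The token-matching, negation, and coding steps named in the diagram can be sketched roughly as follows. The concept patterns, the code m85033, and the negation patterns are copied from the slide; everything else is an assumption made for illustration: the function names, the order-independent matching rule, the word-boundary anchors added to the negation patterns, and the choice to treat a whole segment as negated when any negation cue is present. Python is used here only for illustration, since SCENT itself is SAS-based.

```python
import re

# Concept token patterns and code taken from the slide's example concept.
CONCEPT_CODE = "m85033"
CONCEPT_PATTERNS = [r"(intra)?duct(al)?", r"papillar(y|ies)", r"adenocarcinoma[ls]?"]

# Negation patterns from the slide; the \b anchors are an added assumption.
NEGATION_RE = re.compile(
    r"\bfree (of|from)\b"
    r"|\bnot? (support[a-z]*|identified)\b"
    r"|\bnon(?!small|hodgkins)"
)


def tokenize(segment):
    """Lowercase the segment and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", segment.lower())


def find_span(tokens, patterns):
    """Return (first, last) token index covering one hit per pattern, or None."""
    hits = []
    for pattern in patterns:
        idx = next(
            (i for i, tok in enumerate(tokens) if re.fullmatch(pattern, tok)), None
        )
        if idx is None:
            return None  # every concept token must be present in the segment
        hits.append(idx)
    return min(hits), max(hits)


def code_segment(segment):
    """Wrap a matched, non-negated concept in an <nlp> tag."""
    tokens = tokenize(segment)
    text = " ".join(tokens)
    if NEGATION_RE.search(text):
        return text  # assumption: a cue anywhere negates the whole segment
    span = find_span(tokens, CONCEPT_PATTERNS)
    if span is None:
        return text
    start, end = span
    tagged = (
        f"<nlp snm={CONCEPT_CODE} type=m class=3>"
        + " ".join(tokens[start : end + 1])
        + f"</nlp snm={CONCEPT_CODE}>"
    )
    return " ".join(tokens[:start] + [tagged] + tokens[end + 1 :])


print(code_segment("moderately-differentiated ductal adenocarcinoma with papillary features."))
```

Under these assumptions the example segment from the slide is coded with the same tag span the slide shows, while a segment such as "ductal papillary adenocarcinoma is not identified." is returned untagged.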
![Page 9: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/9.jpg)
Sample Report Coding
Preprocessed Text:
LEFT BREAST CORE BIOPSY TWO O CLOCK.<BR>
INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.<BR>
NO CALCIFICATION IS IDENTIFIED.<BR>
NO VASCULAR INVASION IS IDENTIFIED.<BR>
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
Coded Text:
<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK.<BR>
INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2.<BR>
NO CALCIFICATION IS IDENTIFIED.<BR>
NO VASCULAR INVASION IS IDENTIFIED.<BR>
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
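Once text is coded in this tag format, downstream steps need to read the codes back out. The sketch below shows one possible way to do that in Python. The tag layout follows the sample coded text above, while the regular expression `TAG_RE` and the function `extract_concepts` are hypothetical and are not part of SCENT.

```python
import re

# Matches tags shaped like the sample: <NLP SNM=code TYPE=t [CLASS=c]>text</NLP SNM=code>
TAG_RE = re.compile(
    r"<NLP SNM=(?P<code>\w+) TYPE=(?P<type>\w+)(?: CLASS=(?P<cls>\w+))?>"
    r"(?P<text>.*?)</NLP SNM=(?P=code)>",
    re.IGNORECASE | re.DOTALL,
)


def extract_concepts(coded_text):
    """Return (code, type, class, text) for each tagged concept; class may be None."""
    return [
        (m["code"], m["type"], m["cls"], m["text"])
        for m in TAG_RE.finditer(coded_text)
    ]


SAMPLE = (
    "<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE "
    "<NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK.<BR>\n"
    "INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> "
    "NOTTINGHAM GRADE 2.<BR>"
)

for concept in extract_concepts(SAMPLE):
    print(concept)
```

The backreference to the opening code in the closing tag keeps each match confined to a single concept, even when several tags appear on one line.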
![Page 10: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/10.jpg)
Validation Study
• To validate SCENT, trained chart abstractors reviewed electronic pathology reports.
• Random samples of breast (n=400) and prostate (n=400) cancer patients.
• Patients diagnosed at KPSC between 2000 and 2007.
• Reports included from six months post-diagnosis through the end of 2008.
• In total, 206 breast and 186 prostate cancer patients contributed 490 and 425 eligible reports, respectively.
• SCENT classifications were compared with those of abstractors.
![Page 11: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/11.jpg)
Classification Concordance
Columns are abstractor classifications; rows are SCENT classifications. Cells show % (N).

| SCENT Classifications | Benign | Cancer Recurrence | Other Primary Cancer | Suspicious | Kappa |
|---|---|---|---|---|---|
| Breast Cancer (Total) | (436) | (32) | (18) | (4) | 0.96 |
| Benign | 99.8 (435) | - | - | 25.0 (1) | |
| Cancer Recurrence | - | 100.0 (32) | - | - | |
| Other Primary Cancer | 0.2 (1) | - | 100.0 (18) | 50.0 (2) | |
| Suspicious | - | - | - | 25.0 (1) | |
| Prostate Cancer (Total) | (356) | (29) | (36) | (4) | 0.95 |
| Benign | 99.4 (354) | - | 5.6 (2) | - | |
| Cancer Recurrence | - | 96.6 (28) | 2.8 (1) | - | |
| Other Primary Cancer | 0.6 (2) | 3.4 (1) | 91.7 (33) | - | |
| Suspicious | - | - | - | 100.0 (4) | |

Note: incident contralateral breast malignancies were considered to be recurrences.
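The kappa values can be checked against the concordance counts. The sketch below assumes the slide reports unweighted Cohen's kappa over the four categories; the slide does not name the exact statistic, so this is an assumption, although it does reproduce both reported values to two decimals. The matrices are transcribed from the counts above, with rows as SCENT classifications and columns as abstractor classifications (Benign, Cancer Recurrence, Other Primary Cancer, Suspicious).

```python
def cohen_kappa(matrix):
    """Unweighted Cohen's kappa for a square agreement matrix of counts."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: product of the marginal proportions, summed over categories.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)


# Rows: SCENT; columns: abstractor. Dashes in the table are entered as zeros.
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]

print(f"Breast kappa:   {cohen_kappa(BREAST):.2f}")
print(f"Prostate kappa: {cohen_kappa(PROSTATE):.2f}")
```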
![Page 12: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/12.jpg)
SCENT Performance Metrics

| | Sensitivity* | Specificity* | PPV* | NPV* |
|---|---|---|---|---|
| Breast Cancer | 1.00 (0.93-1.00) | 0.99 (0.98-1.00) | 0.94 (0.85-0.98) | 1.00 (0.99-1.00) |
| Prostate Cancer | 0.97 (0.89-0.99) | 0.99 (0.98-1.00) | 0.97 (0.89-0.99) | 0.99 (0.98-1.00) |

* Shown with Wilson's 95% confidence interval.
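These metrics can be reproduced from the concordance counts with a short calculation. The sketch below assumes that "positive" means a report classified as Cancer Recurrence or Other Primary Cancer and "negative" means Benign or Suspicious. The slides do not state this collapsing rule, so it is an inference, though it does reproduce the reported point estimates. The resulting 2x2 counts are derived from the concordance table under that assumption, and the function names are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def performance(tp, fp, fn, tn):
    """Return {metric: (estimate, lower, upper)} from 2x2 counts."""
    counts = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, *wilson_interval(k, n)) for name, (k, n) in counts.items()}


# Counts derived from the concordance table under the assumption stated above.
BREAST = performance(tp=50, fp=3, fn=0, tn=437)
PROSTATE = performance(tp=63, fp=2, fn=2, tn=358)

for label, result in (("Breast", BREAST), ("Prostate", PROSTATE)):
    for name, (est, lo, hi) in result.items():
        print(f"{label} {name}: {est:.2f} ({lo:.2f}-{hi:.2f})")
```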
![Page 13: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/13.jpg)
Conclusions
• Favorable results suggest SCENT can identify and extract information about primary and recurrent malignancies from pathology reports.
• Rapid cancer case identification.
• Improved measurement accuracy of common study endpoint.
• SCENT has the potential to expedite chart reviews by narrowing the search and highlighting relevant concepts.
• Generalized utility for extracting standardized disease scores and other clinical information.
• SCENT is proof of concept for SAS-based NLP that can be easily shared between institutions to support research.
![Page 14: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/14.jpg)
Limitations & Next Steps
• SCENT has a number of limitations, including:
• Unable to disambiguate and contextualize identified clinical concepts without part-of-speech (POS) tagging.
• More susceptible to changes in text structure and increased linguistic variability than statistical NLP approaches.
• General-purpose NLP (e.g., cTAKES) is likely to perform better outside of pathology.
• Next steps include:
• Release SCENT source code and requisite support files.
• Optimize current functionality and assess feasibility of adding methods (e.g., POS tagging, n-grams, statistical classifiers).
• Attempt to identify non-pathologically diagnosed malignancies using radiology reports and clinical progress notes.
• Quantify cost savings associated with SCENT-assisted chart reviews.
![Page 15: Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS](https://reader033.vdocument.in/reader033/viewer/2022052904/557dd15ad8b42ae4688b4e33/html5/thumbnails/15.jpg)
Questions?