machine learning - breast cancer diagnosis

Machine Learning for Breast Cancer Diagnosis
A Proof of Concept
P. K. SHARMA
Email: from_pramod @yahoo.com

Introduction

Machine learning is a branch of data science that incorporates a large set of statistical techniques.

These techniques enable data scientists to create a model which can learn from past data and detect patterns from massive, noisy and complex data sets.

Researchers use machine learning for cancer prediction and prognosis. Machine learning allows inferences or decisions that otherwise cannot be made using conventional statistical methodologies. With a robustly validated machine learning model, the chances of a correct diagnosis improve. It especially helps in interpreting results for borderline cases.

Breast Cancer: An overview

Breast cancer is the most common cancer in women worldwide and the principal cause of death from cancer among women globally. Early detection is the most effective way to reduce breast cancer deaths. Early diagnosis requires an accurate and reliable procedure to distinguish benign breast tumors from malignant ones.

Breast Cancer Types - three types of breast tumors: benign breast tumors, in-situ cancers, and invasive cancers.

The majority of breast tumors detected by mammography are benign. They are non-cancerous growths and cannot spread outside of the breast to other organs. In some cases, it is difficult to distinguish certain benign masses from malignant lesions with mammography.

If the malignant cells have not gone through the basal membrane but are completely contained in the lobule or the ducts, the cancer is called in-situ or noninvasive. If the cancer has broken through the basal membrane and spread into the surrounding tissue, it is called invasive.

This analysis assists in differentiating between benign and malignant tumors.

Data Source

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

The data used for this POC is from the University of Wisconsin.

Citation: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

References:

o O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

o William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.

o O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

o K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).


Data Files

Data File Name                  Description File Name                                            # of records   # of attributes
breast-cancer-wisconsin.data    breast-cancer-wisconsin.names                                    699            11
unformatted-data                Data file with comments based on breast-cancer-wisconsin.data    699            11
wdbc.data                       wdbc.names                                                       569            32
wpbc.data                       wpbc.names                                                       198            34

This case study analyzes breast-cancer-wisconsin.data and wdbc.data.

Data Sets

The data is in CSV format without any column headers. Columns are interpreted from the associated “names” files.
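Reading a headerless CSV of this kind can be sketched as follows. This is a minimal illustration, not the code used in the POC: the helper name load_headerless_csv and the snake_case column spellings are assumptions, with the names themselves transcribed from the attribute list in breast-cancer-wisconsin.names.

```python
import pandas as pd

# Column names for breast-cancer-wisconsin.data, transcribed from the
# attribute list in the "names" file (the snake_case spelling is our choice).
BCW_COLUMNS = [
    "sample_code_number", "clump_thickness", "uniformity_of_cell_size",
    "uniformity_of_cell_shape", "marginal_adhesion",
    "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "class",
]


def load_headerless_csv(source, columns):
    """Read a CSV that has no header row and attach the given column names.

    `source` may be a file path or any file-like object (hypothetical helper).
    """
    return pd.read_csv(source, header=None, names=columns)
```

Calling load_headerless_csv("breast-cancer-wisconsin.data", BCW_COLUMNS) on a local copy of the file would be expected to give a frame with 11 named columns.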

Flow of Data

Biopsy Procedure Measurements Reports Evaluation Diagnosis

Analysis of measurements

Preparation of ML Models

Predictions and validation

Analysis

Classifier Params.: min_samples_leaf, n_estimators, min_samples_split, max_features
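A search over these four parameters could look roughly like the sketch below. The candidate values are illustrative assumptions, not the grid used in this POC, and the data comes from scikit-learn's bundled copy of the Wisconsin Diagnostic set (load_breast_cancer) so that the example runs without a local wdbc.data file.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# scikit-learn's bundled copy of the Wisconsin Diagnostic data (569 rows, 30 features).
X, y = load_breast_cancer(return_X_y=True)

# Candidate values are assumptions chosen to keep the search small.
param_grid = {
    "n_estimators": [25, 50],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "max_features": ["sqrt", "log2"],
}

# Every parameter combination is scored with 5-fold stratified cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

After fitting, search.best_estimator_ holds a classifier refit on all supplied rows with the best combination found.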

Data Preparation: address missing data; split into training, testing and validation data
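The training, testing and validation split can be sketched as below. The 20% hold-out ratio and the random seed are assumptions; the 5 folds follow the results slides, and a 20% hold-out of the 569-row file happens to give the 114 test records reported there.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split

# Bundled copy of the Wisconsin Diagnostic data, used so the sketch is self-contained.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% for testing (an assumed ratio), keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Split the training part into 5 stratified folds; each fold serves once as validation data.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in folds.split(X_train, y_train)]
print(len(X_train), len(X_test), fold_sizes)
```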

Lab Setup

Components

Environment: Python, Linux

Libraries: scikit-learn, Pandas, SciPy, NumPy, IPython, Matplotlib, seaborn

Classes, functions and modules: RandomForestClassifier, StratifiedKFold, train_test_split, GridSearchCV, learning_curve, pyplot, interp

Input Files: wdbc.data, breast-cancer-wisconsin.data

Outputs: Trained Classifier, Predictions, Data Visualization

Data Description : wdbc.data

1. ID number
2. Diagnosis (M = malignant, B = benign)
3-32. Ten real-valued features are computed for each cell nucleus:
   a) radius (mean of distances from center to points on the perimeter)
   b) texture (standard deviation of gray-scale values)
   c) perimeter
   d) area
   e) smoothness (local variation in radius lengths)
   f) compactness (perimeter^2 / area - 1.0)
   g) concavity (severity of concave portions of the contour)
   h) concave points (number of concave portions of the contour)
   i) symmetry
   j) fractal dimension ("coastline approximation" - 1)

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

They describe characteristics of the cell nuclei present in the image.

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.
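The field layout described above (ID, diagnosis, then ten features for each of three statistics) can be turned into column names with a short helper. The helper name and the snake_case naming scheme are assumptions made for illustration.

```python
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]


def wdbc_column_names():
    """Return the 32 column names: id, diagnosis, then 10 features per statistic."""
    names = ["id", "diagnosis"]
    for stat in STATISTICS:
        names.extend(f"{feature}_{stat}" for feature in BASE_FEATURES)
    return names


columns = wdbc_column_names()
# Fields are numbered from 1 in the names file, so field n is columns[n - 1].
print(columns[2], columns[12], columns[22])  # radius_mean radius_se radius_worst
```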

wdbc.data

Mean Radius, Mean Perimeter and Mean appear to be helpful in classification.

The higher the value of each parameter, the greater the chance that the tumor is malignant.

wdbc.data

Mean Concavity, Mean Concave Points, and Mean Compactness appear to be helpful in classification.

The higher the value of each parameter, the greater the chance that the tumor is malignant.

wdbc.data

Mean Smoothness, Mean Texture, Mean Fractal Dimension, Mean Symmetry and Mean Compactness do not appear to influence classification.

Both types of cases are spread across the range of values.
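The visual impressions above can be cross-checked numerically by comparing per-class means. This is only a rough sketch: it uses scikit-learn's bundled copy of the data and its feature names, the selection of features is an assumption, and a ratio of means ignores the spread within each class.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
frame = data.frame.copy()
# In scikit-learn's copy, target 0 is malignant and 1 is benign.
frame["diagnosis"] = frame["target"].map({0: "malignant", 1: "benign"})

features = [
    "mean radius", "mean perimeter", "mean concavity",
    "mean concave points", "mean smoothness", "mean symmetry",
]
# One row per feature, one column per class, holding the class mean.
summary = frame.groupby("diagnosis")[features].mean().T
# A ratio far from 1 hints that the feature separates the two classes.
summary["ratio"] = summary["malignant"] / summary["benign"]
print(summary.round(3))
```

A seaborn pair plot or box plot of the same columns, colored by diagnosis, would show the distributions behind these means.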

Data Description : breast-cancer-wisconsin.data

Missing attribute values: 16

There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, now denoted by "?".

#     Attribute                      Domain
1.    Sample code number             id number
2.    Clump Thickness                1 - 10
3.    Uniformity of Cell Size        1 - 10
4.    Uniformity of Cell Shape       1 - 10
5.    Marginal Adhesion              1 - 10
6.    Single Epithelial Cell Size    1 - 10
7.    Bare Nuclei                    1 - 10
8.    Bland Chromatin                1 - 10
9.    Normal Nucleoli                1 - 10
10.   Mitoses                        1 - 10
11.   Class                          (2 for benign, 4 for malignant)
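The "?" markers need handling before a classifier can be trained. The slides do not say how the missing values were addressed, so the sketch below shows one common option (dropping incomplete rows) purely as an assumption; imputing a value would be an alternative. The helper name is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_missing_rows(df, marker="?"):
    """Turn the missing-value marker into NaN, coerce columns to numbers,
    and drop every row that still has a missing value."""
    cleaned = df.replace(marker, np.nan)
    cleaned = cleaned.apply(pd.to_numeric)
    return cleaned.dropna().reset_index(drop=True)
```

If each of the 16 affected instances is dropped, 683 of the 699 records would remain.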

breast-cancer-wisconsin.data

The features distinguish between benign and malignant cases fairly well.

breast-cancer-wisconsin.data

The feature seems to distinguish between benign and malignant cases fairly well.

Results: WDBC.DATA

Analysis: wdbc.data

Training data is divided into 5 folds. Test data has 114 records.

Accuracy Score: 0.9561

Confusion Matrix:

                  Predicted Benign   Predicted Malignant
True Benign       69                 2
True Malignant    3                  40

Classification Report:

               Precision   Recall   f1-score   Support
0              0.96        0.97     0.97       71
1              0.95        0.93     0.94       43
avg / total    0.96        0.96     0.96       114

Three cases, although malignant, are predicted as benign
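The reported scores follow directly from the confusion matrix. A minimal sketch of the arithmetic, treating malignant as the positive class (the function name is an assumption):

```python
def binary_report(tn, fp, fn, tp):
    """Derive accuracy plus precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# wdbc.data test set with malignant as the positive class:
# 69 benign correctly identified, 2 benign flagged as malignant,
# 3 malignant predicted as benign, 40 malignant correctly identified.
report = binary_report(tn=69, fp=2, fn=3, tp=40)
print({name: round(value, 4) for name, value in report.items()})
```

The recall of 40 / 43 is the figure to watch in a diagnostic setting, since it counts the malignant cases that the model misses.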

• High accuracy.
• Supports the diagnosis.

Model performs equally well on both test and training sets
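A learning curve is one way to compare performance on training and held-out data. The sketch below assumes scikit-learn's learning_curve with a random forest of 50 trees on the bundled copy of the data; the training-set fractions and estimator settings are illustrative choices, not those of the POC.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Scores are computed for 5 training-set sizes, each with 5-fold cross-validation.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="accuracy",
)

# Average over the folds to get one training and one validation score per size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):4d} training rows: train={tr:.3f}  validation={va:.3f}")
```

A small, stable gap between the two averaged curves is what supports a claim that the model generalizes beyond its training data.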

Two dimensional plot shows excellent separation of Benign and Malignant cases

Plotting three cases… Factors influencing predictions.

Plotting three cases: Factors having no influence on predictions…

Plotting two features at a time

Cases in which only two of the features are available were also analyzed.

The classifier was trained on two features at a time and the decision boundary was plotted. The model could predict the cases with reasonable accuracy.
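Training on a pair of features and tracing the decision boundary can be sketched as follows. The feature pair, the 20% hold-out and the grid resolution are assumptions made for illustration, and the data again comes from scikit-learn's bundled copy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
names = list(data.feature_names)
# The feature pair is an illustrative choice, not necessarily one used in the POC.
pair = [names.index("mean radius"), names.index("mean concave points")]
X, y = data.data[:, pair], data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("two-feature test accuracy:", round(clf.score(X_test, y_test), 4))

# Evaluate the classifier on a grid covering both features; the class
# predicted at each grid point traces out the decision boundary.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
    np.linspace(X[:, 1].min(), X[:, 1].max(), 200),
)
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

Passing xx, yy and zz to pyplot.contourf, followed by a scatter plot of the test points, would draw the boundary.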

Results: BREAST-CANCER-WISCONSIN.DATA

Analysis: breast-cancer-wisconsin.data

Training data is divided into 5 folds. Test data has 140 records.

Accuracy Score: 0.9643

Confusion Matrix:

                  Predicted Benign   Predicted Malignant
True Benign       92                 3
True Malignant    2                  43

Classification Report:

               Precision   Recall   f1-score   Support
0              0.98        0.97     0.97       95
1              0.93        0.96     0.95       45
avg / total    0.96        0.96     0.96       140

Two cases, although malignant, are predicted as benign

Model performs equally well on both training and test data

• High accuracy.
• Supports the diagnosis.

Two dimensional plot shows excellent separation of Benign and Malignant cases

Plotting two cases…

Plotting three cases… Factors influencing predictions.

Plotting two features at a time

The classifier was trained on two features at a time and the decision boundary was plotted.

As expected, the classifier needs more than just two parameters to give accurate predictions.
