-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathBreast_Cancer_Code.Rmd
790 lines (607 loc) · 36.3 KB
/
Breast_Cancer_Code.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
---
title: "Breast Cancer Analysis"
author: "Utsav Patel"
date: "28 April 2019"
output:
html_document:
df_print: paged
pdf_document: default
word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Abstract
About 1 in 8 U.S. women (about 12%) will develop invasive breast cancer over the course of her lifetime. There are two types of tumors: Benign and Malignant. Malignant tumor is cancerous and is dangerous if it is not treated in the early stages. It is also very expensive in the United States for the cancer diagnosis and treatment. Hence, in the search to find a cheaper diagnosis technique, we propose a few models like logistic regression, linear classification on PCA(Principal Component Analysis) and Random Forest, which can help diagnose breast cancer as benign or malignant.
## Introduction
Breast cancer is a type of cancer that develops from breast
tissue and is often associated by a lump in the breast, change in
breast shape, development of red and patchy skin, or fluid
emanating from the nipple. The causes for breast cancer have
not been fully understood till date. There are some genetic
factors, and some environmental factors associated with its
development. Breast cancer is preliminarily detected by a
mammogram exam and confirmed by a biopsy.
There is no single measurement that can be used to
determine whether a given sample is benign or malignant.
In 2019, an estimated 268,600 new cases of invasive breast cancer are expected to be diagnosed in women in the U.S., along with 62,930 new cases of non-invasive (in situ) breast cancer. About 2,670 new cases of invasive breast cancer are expected to be diagnosed in men in 2019. A lifetime risk of breast cancer for man is about 1 in 883. Breast cancer incidence rates in the U.S. began decreasing in the year 2000, after increasing for the previous two decades. They dropped by 7% from 2002 to 2003 alone. One theory is that this decrease was partially due to the reduced use of hormone replacement therapy (HRT) by women after the results of a large study called the Women's Health Initiative were published in 2002. These results suggested a connection between HRT and increased breast cancer risk.
There can be cancer because of 2 types of tumor: Benign and Malignant.
- Benign tumor are non-malignant/non-cancerous tumor. A benign tumor is usually localized, and does not spread to other parts of the body. Most benign tumor respond well to treatment. However, if left untreated, some benign tumor can grow large and lead to serious disease because of their size. Benign tumor can also mimic malignant tumor, and so for this reason are sometimes treated.
- Malignant tumor are cancerous growths. They are much dangerous than the benign tumor. They usually grow very rapidly. They are often resistant to treatment, may spread to other parts of the body and they sometimes recur after removal.
## Dataset Characteristics
We have 9 attributes which can help us detect whether the tumor is benign or malignant. Let's see what are those.
- **Clump Thickness**:
This is used to assess if cells are
mono-layered or multi-layered. Benign cells tend to be
grouped in mono-layers, while cancerous cells are often
grouped in multi-layer.
- **Uniformity of Cell Size**:
It is used to evaluate the
consistency in the size of cells in the sample. Cancer
cells tend to vary in size. That is why this parameter is
very valuable in determining whether the cells are
cancerous or not.
- **Uniformity of Cell Shape**:
It is used to estimate the equality of cell shapes and identifies marginal variances
because cancer cells tend to vary in shape.
- **Marginal Adhesion**:
Normal cells tend to stick together.
Cancer cells tend to loose this ability. So loss of
adhesion is a sign of malignancy.
- **Single Epithelial Cell Size**:
It is related to the uniformity. Epithelial cells that are significantly enlarged
may be a malignant cell.
- **Bare Nuclei**:
This is a term used for nuclei that is not
surrounded by cytoplasm. Those are typically seen in
benign tumor.
- **Bland Chromatin**:
Describes a uniform texture of the
nucleus seen in benign cell. In cancer cell, the
chromatin tends to be coarser.
- **Normal Nucleoli**:
Nucleoli are small structures seen in
the nucleus. In normal cell the nucleolus is usually very
small if visible at all. In cancer cell the nucleoli become
much more prominent, and sometimes there are more of
them.
- **Mitoses**:
It is an estimate of the number of mitosis that
has taken place. Larger the value, greater is the chance of
malignancy
```{r results='hide', message=FALSE, include=FALSE}
###Loading Libraries
library(ggplot2)
library(corrplot)
library(broom)
library(ElemStatLearn)
library(cowplot)
library(leaps)
library(tidyverse)
library(tidyr)
library(dplyr)
library(randomForest)
library(devtools)
library(factoextra)
library(randomForest)
library(mlbench)
library(caret)
#library(ggbiplot)
```
```{r results='hide', message=FALSE, include=FALSE}
###Loading the data
column_names = c("Sample_code_number", "Clump_thickness","Uniformity_of_cell_size", "Uniformity_of_cell_shape", "Marginal_adhesion", "Single_Epithelial_Cell_size", "Bare_Nuclei", "Bland_Chromatin","Normal_Nucleoli", "Mitoses", "Is_Malignant")
bc = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", sep = ",",col.names = column_names)
```
```{r results='hide', message=FALSE, include=FALSE}
###Preprocessing the data
unique(bc$Bare_Nuclei)
bc$Is_Malignant = ((bc$Is_Malignant)/2) -1
bc = bc[,2:11]
#head(df)
bc = bc[bc$Bare_Nuclei != "?",]
bc$Bare_Nuclei = as.numeric(bc$Bare_Nuclei)
bc$Bare_Nuclei = bc$Bare_Nuclei -1
bc$Is_Malignant = as.numeric(bc$Is_Malignant)
```
###Correlation heatmap of our data
```{r message=FALSE}
cor_df = cor(bc)
corrplot(cor_df, type = "upper")
```
The above heatmap shows the correlation between variables. Darker the blue color higher positive correlation, darker the red color, higher negative correaltion. As we can see that all the variables are postively corelated with each other and with target variable. Size of the dot is proportional to absolute value of correlation.
```{r message=FALSE, results='hide', include=FALSE}
cor(bc$Clump_thickness, bc$Is_Malignant, method = c("pearson"))
cor(bc$Uniformity_of_cell_size, bc$Is_Malignant, method = c("pearson"))
cor(bc$Uniformity_of_cell_shape, bc$Is_Malignant, method = c("pearson"))
cor(bc$Marginal_adhesion, bc$Is_Malignant, method = c("pearson"))
cor(bc$Single_Epithelial_Cell_size, bc$Is_Malignant, method = c("pearson"))
cor(bc$Bare_Nuclei, bc$Is_Malignant, method = c("pearson"))
cor(bc$Bland_Chromatin, bc$Is_Malignant, method = c("pearson"))
cor(bc$Normal_Nucleoli, bc$Is_Malignant, method = c("pearson"))
cor(bc$Mitoses, bc$Is_Malignant, method = c("pearson"))
```
###Correlation of each variable with respect to target variable:
To see how each variable is correlated with target variable, we have provided a clearer picture of the correlation values in a sorted manner.
```{r message=FALSE}
Var <- c("Uniformity of Cell Shape", "Uniformity of Cell Size", "Bland Chromatin", "Normal Nucleoli","Clump Thickness","Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Mitoses")
correlation <- c(0.8218909, 0.8208014, 0.7582276, 0.7186772, 0.7147899, 0.7062941, 0.6909582, 0.5087019, 0.4234479)
x <- data.frame(name = as.factor(Var), val = correlation)
x$name <- factor(x$name, levels = x$name[order(x$val)])
ggplot(x, aes(x = name, y = val)) +
theme_bw() +
geom_bar(stat = "identity") +
coord_flip() +
xlab("") + ylab("Correlation values")
```
###Boxplot of our data
```{r message=FALSE}
#predictors.scaled = scale(df[, 1:9])
#cancer = df$Is_Malignant
#db.lm = lm(cancer ~ predictors.scaled)
#db.lm.tidy = tidy(db.lm, conf.int = TRUE)
#db.lm.tidy[2:10, 1] =names(df)[1:9]
#data.frame(db.lm.tidy[, 1],round(db.lm.tidy[,-1], 2))
#head(predictors.scaled)
#ggplot(db.lm.tidy[-1, ],aes(x = estimate, y = term, xmin = conf.low,xmax = conf.high, color = term)) +
# geom_point() +
# geom_errorbarh() +
# geom_vline(xintercept = 0)
ggplot(stack(bc[,1:9]), aes(x = ind, y = values)) +
geom_boxplot() +
#theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust=1)) +
labs(title = "Boxplots of columns") +
labs(x = "", y = "Values") +
scale_y_continuous(breaks = seq(1, 10, by = 1)) +
coord_flip()
```
Just to get a range of the variable values, boxplot has been plotted. For example, clump thickness ranges from 2 to 6 and has a median close to 4. In case of uniformity of cell size, uniformity of cell shape value starts from 1 and that itself is median that means at least 50% of the data has value for these variable 1. In case of mitoses, it is clear that data is congested at value 1 and there is not much variation except few outliers. So, how mitoses affect breast cancer is difficult to analyze from this dataset.
###Logistic Regression with Backward elimination technique
We have used Logistic Regression with backward elimination technique to reduce the number of variables. The process we followed was to check what impacts the residual change and eliminated the variables accordingly. We removed one variable at a time. Each time we removed the variable which had highest p value till we found all the variable with p value less than 0.01 as we had set significance level as 0.01. For example, when we did not eliminate any variable the model summary showed coefficient of p value for Uniformity_of_cell_size was highest with a value of 0.773024. So we eliminated Uniformity_of_cell_size. Similarly, one by one we removed 3 more variables. Finally, we were down to 4 predictors from 9. Clump thickness, Uniformity of cell shape, Bland Chromatin and Marginal Adhesion are our final 4 variables for glm model. Let us analyze their individual impact on target variable.
```{r, include=FALSE}
cancer.logistic =glm(Is_Malignant~., family = binomial, data = bc)
summary(cancer.logistic)
```
```{r, include=FALSE}
cancer.logistic =glm(Is_Malignant ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion +
Single_Epithelial_Cell_size + Bare_Nuclei + Bland_Chromatin + Normal_Nucleoli +
Mitoses, family = binomial, data = bc)
summary(cancer.logistic)
```
```{r, include=FALSE}
cancer.logistic =glm(Is_Malignant ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion+
Bare_Nuclei + Bland_Chromatin + Normal_Nucleoli +
Mitoses, family = binomial, data = bc)
summary(cancer.logistic)
```
```{r, include=FALSE}
cancer.logistic =glm(Is_Malignant ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion +
Bare_Nuclei + Bland_Chromatin + Normal_Nucleoli,
family = binomial, data = bc)
summary(cancer.logistic)
```
```{r, include=FALSE}
cancer.logistic =glm(Is_Malignant ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion+
Bland_Chromatin + Normal_Nucleoli,
family = binomial, data = bc)
summary(cancer.logistic)
```
```{r}
# cancer.logistic =glm(Is_Malignant ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion+
# Bland_Chromatin,
# family = binomial, data = bc)
# summary(cancer.logistic)
```
```{r, include=FALSE}
cancer.logistic =glm(Is_Malignant ~ Clump_thickness + Uniformity_of_cell_shape + Bland_Chromatin ,
family = binomial, data = bc)
summary(cancer.logistic)
```
###Density Plot
```{r message=FALSE}
color.palette = c("Blue", "Red", "#999999", "#E69F00", "#56B4E9", "#0072B2")
gg9 = ggplot(bc, aes(x = Clump_thickness, fill = factor(bc$Is_Malignant))) +
geom_density(alpha = 0.25)
gg9 = gg9 +
scale_fill_manual(values = color.palette) +
labs(fill = "Malignant")
gg10 = ggplot(bc, aes(x = Uniformity_of_cell_shape, fill = factor(bc$Is_Malignant))) +
geom_density(alpha = 0.25)
gg10 = gg10 +
scale_fill_manual(values = color.palette) +
labs(fill = "Malignant")
gg11 = ggplot(bc, aes(x = Bland_Chromatin, fill = factor(bc$Is_Malignant))) +
geom_density(alpha = 0.25)
gg11 = gg11 +
scale_fill_manual(values = color.palette) +
labs(fill = "Malignant")
gg12 = ggplot(bc, aes(x = Marginal_adhesion, fill = factor(bc$Is_Malignant))) +
geom_density(alpha = 0.25)
gg12 = gg12 +
scale_fill_manual(values = color.palette) +
labs(fill = "Malignant")
```
```{r message=FALSE}
theme_set(theme_cowplot(font_size=12))
plot_grid(gg9, gg10, gg11, gg12, ncol = 2)
```
**Clump thickness:** Values for benign cells tends to be on lower end and for malignant cells values tends to higher in general
**Bland chromatin, Uniformity of cell shape, Marginal adhesion:** Values are highly dense at the lower end of the range and does not vary much for benign cells where in case of malignant cells, values have higher variance and are roughly spread across the entire range.
###How probability of cancer vary with each variable:
```{r message=FALSE}
clump = aggregate(Is_Malignant ~ Clump_thickness, mean, data = bc)
adhesion = aggregate(Is_Malignant ~ Marginal_adhesion, mean, data = bc)
chromatin = aggregate(Is_Malignant ~ Bland_Chromatin, mean, data = bc)
cellshape = aggregate(Is_Malignant ~ Uniformity_of_cell_shape, mean, data = bc)
clump$Type = rep('Clump_thickness', nrow(clump))
colnames(clump) = c('Severity', 'Is_Malignant', 'Type')
adhesion$Type = rep('Marginal_adhesion', nrow(adhesion))
colnames(adhesion) = c('Severity', 'Is_Malignant', 'Type')
chromatin$Type = rep('Bland_Chromatin', nrow(chromatin))
colnames(chromatin) = c('Severity', 'Is_Malignant', 'Type')
cellshape$Type = rep('Uniformity_of_cell_shape', nrow(cellshape))
colnames(cellshape) = c('Severity', 'Is_Malignant', 'Type')
binded = data.frame(rbind(clump, adhesion, chromatin, cellshape))
ggplot(binded, aes(y = Is_Malignant, x = Severity)) +
geom_point() +
geom_line() +
ggtitle('Aggregate Malignant cases for each attribute') +
facet_wrap( ~ factor(Type), ncol = 2)
```
The graph represents how the probability of cell being malignant varies as the value for each of the variable increases. In all the 4 variables, as the values increases probability of cell being malignant increases. This corroborate the insight provided by density plot above where we saw that benignant cells value tend to be on the lower end.
###Grid based classification using 2 variables
```{r message=FALSE}
ggplot(bc, aes(x= Marginal_adhesion, y = Clump_thickness)) +
geom_point(aes(color = factor(Is_Malignant), size = 1.5)) +
#geom_vline(xintercept = 7) +
#geom_hline(yintercept = 8) +
ggtitle("Clump_Thickness vs Marginal Adhesion") +
#geom_smooth(span = 2) +
labs(fill = "Malignant") +
guides(size = FALSE)
ggplot(bc, aes(x= Uniformity_of_cell_shape, y = Bland_Chromatin)) +
geom_point(aes(color = factor(Is_Malignant), size = 1.5)) +
#geom_vline(xintercept = 7) +
#geom_hline(yintercept = 8) +
ggtitle("Uniformity of Cell Shape vs Bland Chromatin") +
#geom_smooth() +
labs(fill = "Malignant") +
guides(size = FALSE)
#plot_grid(plot1,plot2, ncol = 2)
```
To see how the class distribution varies with 2 variables we have plotted these graphs. We see that as the value for any variable increases from somewhere around 6.5 to 7.5, it is certain that cells are malignant. For values lower than 3 it is safe as they are benignant cells but for values between 3 and 6.5 it is uncertain.
```{r message=FALSE, include=FALSE}
bc1 = bc
cancer.leaps = regsubsets(Is_Malignant ~ ., data = bc1)
#summary(cancer.leaps)$which
#summary(cancer.leaps)$cp
tidy(lm(Is_Malignant ~ Clump_thickness + Marginal_adhesion + Uniformity_of_cell_shape + Bland_Chromatin, data = bc1))
clump_cellshape.glm = glm(Is_Malignant ~ Clump_thickness + Uniformity_of_cell_shape,
family = "binomial", data = bc1)
#summary(partyIm.glm)
clump_cellshape.int.glm = glm(Is_Malignant ~ Clump_thickness * Uniformity_of_cell_shape,
family = "binomial", data = bc1)
#summary(partyIm.int.glm)
grid = expand.grid(Clump_thickness = 1:10, Uniformity_of_cell_shape = 1:10)
cancer.no = predict(clump_cellshape.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
cancer.int = predict(clump_cellshape.int.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
grid2 = data.frame(grid, cancer.no, cancer.int)
grid2 = gather(grid2, key = model, value = cancer, c("cancer.no", "cancer.int"))
grid2$model = recode_factor(grid2$model, "cancer.no" = "No interaction", "cancer.int" = "With interaction")
gg1 = ggplot(grid2, aes(x = Clump_thickness, y = cancer, group = model, color = model)) +
geom_line() +
facet_grid(~Uniformity_of_cell_shape) +
xlab("Clump Thickness") +
ylab("Chance of cancer") +
ggtitle("Clump Thickness\n and Uniformity of cell shape") +
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
```
```{r message=FALSE}
cellshape_adhesion.glm = glm(Is_Malignant ~ Uniformity_of_cell_shape + Marginal_adhesion,
family = "binomial", data = bc1)
# summary(partyIm.glm)
cellshape_adhesion.int.glm = glm(Is_Malignant ~ Uniformity_of_cell_shape * Marginal_adhesion,
family = "binomial", data = bc1)
# summary(partyIm.int.glm)
grid = expand.grid(Uniformity_of_cell_shape = 1:10, Marginal_adhesion = 1:10)
cancer.no = predict(cellshape_adhesion.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
cancer.int = predict(cellshape_adhesion.int.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
grid2 = data.frame(grid, cancer.no, cancer.int)
grid2 = gather(grid2, key = model, value = cancer, c("cancer.no", "cancer.int"))
grid2$model = recode_factor(grid2$model, "cancer.no" = "No interaction",
"cancer.int" = "With interaction")
gg2 = ggplot(grid2, aes(x = Uniformity_of_cell_shape, y = cancer, group = model, color = model)) +
geom_line() +
facet_grid(~Marginal_adhesion) +
xlab("Uniformity of cell shape") +
ylab("Chance of cancer") +
ggtitle("Uniformity of cell shape\n and Marginal Adhesion")+
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
```
```{r message=FALSE}
cellshape_chromatin.glm = glm(Is_Malignant ~ Uniformity_of_cell_shape + Bland_Chromatin,
family = "binomial", data = bc1)
# summary(partyIm.glm)
cellshape_chromatin.int.glm = glm(Is_Malignant ~ Uniformity_of_cell_shape * Bland_Chromatin,
family = "binomial", data = bc1)
# summary(partyIm.int.glm)
grid = expand.grid(Uniformity_of_cell_shape = 1:10, Bland_Chromatin = 1:10)
cancer.no = predict(cellshape_chromatin.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
cancer.int = predict(cellshape_chromatin.int.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
grid2 = data.frame(grid, cancer.no, cancer.int)
grid2 = gather(grid2, key = model, value = cancer, c("cancer.no", "cancer.int"))
grid2$model = recode_factor(grid2$model, "cancer.no" = "No interaction",
"cancer.int" = "With interaction")
gg3 = ggplot(grid2, aes(x = Uniformity_of_cell_shape, y = cancer, group = model, color = model)) +
geom_line() + facet_grid(~Bland_Chromatin) + xlab("Uniformity of cell shape") +
ylab("Chance of cancer") +
ggtitle("Uniformity of cell shape\n and Bland Chromatin")+
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
```
```{r message=FALSE}
clump_adhesion.glm = glm(Is_Malignant ~ Clump_thickness + Marginal_adhesion, family = "binomial", data = bc1)
# summary(partyIm.glm)
clump_adhesion.int.glm = glm(Is_Malignant ~ Clump_thickness * Marginal_adhesion,
family = "binomial", data = bc1)
# summary(partyIm.int.glm)
grid = expand.grid(Clump_thickness = 1:10, Marginal_adhesion = 1:10)
cancer.no = predict(clump_adhesion.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
cancer.int = predict(clump_adhesion.int.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
grid2 = data.frame(grid, cancer.no, cancer.int)
grid2 = gather(grid2, key = model, value = cancer, c("cancer.no", "cancer.int"))
grid2$model = recode_factor(grid2$model, "cancer.no" = "No interaction",
"cancer.int" = "With interaction")
gg4 = ggplot(grid2, aes(x = Clump_thickness, y = cancer, group = model, color = model)) +
geom_line() + facet_grid(~Marginal_adhesion) + xlab("Clump Thickness") + ylab("Chance of cancer") +
ggtitle("Clump Thickness\n and Marginal Adhesion")+
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
```
```{r message=FALSE, }
clump_chromatin.glm = glm(Is_Malignant ~ Clump_thickness + Bland_Chromatin, family = "binomial", data = bc1)
# summary(partyIm.glm)
clump_chromatin.int.glm = glm(Is_Malignant ~ Clump_thickness * Bland_Chromatin, family = "binomial", data = bc1)
# summary(partyIm.int.glm)
grid = expand.grid(Clump_thickness = 1:10, Bland_Chromatin = 1:10)
cancer.no = predict(clump_chromatin.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
cancer.int = predict(clump_chromatin.int.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
grid2 = data.frame(grid, cancer.no, cancer.int)
grid2 = gather(grid2, key = model, value = cancer, c("cancer.no", "cancer.int"))
grid2$model = recode_factor(grid2$model, "cancer.no" = "No interaction",
"cancer.int" = "With interaction")
gg5 = ggplot(grid2, aes(x = Clump_thickness, y = cancer, group = model, color = model)) +
geom_line() + facet_grid(~Bland_Chromatin) + xlab("Clump Thickness") + ylab("Chance of cancer") +
ggtitle("Clump Thickness\n and Bland_Chromatin")+
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
```
```{r message=FALSE}
chromatin_adhesion.glm = glm(Is_Malignant ~ Marginal_adhesion + Bland_Chromatin, family = "binomial", data = bc1)
# summary(partyIm.glm)
chromatin_adhesion.int.glm = glm(Is_Malignant ~ Marginal_adhesion * Bland_Chromatin, family = "binomial", data = bc1)
# summary(partyIm.int.glm)
grid = expand.grid(Marginal_adhesion = 1:10, Bland_Chromatin = 1:10)
cancer.no = predict(chromatin_adhesion.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
cancer.int = predict(chromatin_adhesion.int.glm, newdata = grid, type = "response", se.fit = TRUE)$fit
grid2 = data.frame(grid, cancer.no, cancer.int)
grid2 = gather(grid2, key = model, value = cancer, c("cancer.no", "cancer.int"))
grid2$model = recode_factor(grid2$model, "cancer.no" = "No interaction",
"cancer.int" = "With interaction")
gg6 = ggplot(grid2, aes(x = Bland_Chromatin, y = cancer, group = model, color = model)) +
geom_line() + facet_grid(~Marginal_adhesion) + xlab("Bland Chromatin") +
ylab("Chance of cancer") + ggtitle("Bland_Chromatin\n and Marginal adhesion")+
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
```
```{r message=FALSE}
#require(cowplot)
#theme_set(theme_cowplot(font_size=12))
#plot_grid(gg1, gg2, gg3, gg4, gg5, gg6, ncol = 2)
cancer.model = glm(Is_Malignant ~ Clump_thickness + Marginal_adhesion +
Uniformity_of_cell_shape + Bland_Chromatin +
Clump_thickness:Marginal_adhesion + Clump_thickness:Uniformity_of_cell_shape +
Clump_thickness:Bland_Chromatin , family = "binomial", data = bc1)
```
```{r message=FALSE, include=FALSE}
cancer.model.df = bc
cancer.model.df$.fitted = fitted.values(cancer.model)
cancer.model.df$.resid = residuals(cancer.model, type = "response")
# ggplot(cancer.model.df, aes(x = .fitted, y = .resid)) +
# geom_point() +
# geom_smooth(method = "lm", method.args = list(degree = 1)) +
# xlab("Fitted values") +
# ylab("Residuals")
summary(cancer.model)
cancer.grid = expand.grid(Clump_thickness = 1:6, Marginal_adhesion = 1:6,
Uniformity_of_cell_shape = 1:6, Bland_Chromatin = 1:6)
cellshape_names = c("1" = "UCS: 1", "2" = "UCS: 2", "3" = "UCS: 3", "4" = "UCS: 4",
"5" = "UCS: 5", "6" = "UCS: 6" )
adhesion_names = c("1" = "MA: 1", "2" = "MA: 2", "3" = "MA: 3", "4" = "MA: 4", "5" = "MA: 5", "6" = "MA: 6" )
clump_names = c("1" = "CT: 1", "2" = "CT: 2", "3" = "CT: 3", "4" = "CT: 4", "5" = "CT: 5", "6" = "CT: 6" )
chromatin_names = c("1" = "BC: 1", "2" = "BC: 2", "3" = "BC: 3", "4" = "BC: 4", "5" = "BC: 5", "6" = "BC: 6" )
cancer.prediction = predict(cancer.model, newdata = cancer.grid, type = "response")
cancer.grid1 = data.frame(cancer.grid, cancer.prob = as.vector(cancer.prediction))
ggplot(cancer.grid1, aes(x = Clump_thickness, y = cancer.prob*100,
group = Bland_Chromatin, color = Bland_Chromatin)) +
geom_line() + facet_grid(Marginal_adhesion ~ Uniformity_of_cell_shape,
labeller = labeller(Uniformity_of_cell_shape = as_labeller(cellshape_names),
Marginal_adhesion = as_labeller(adhesion_names)))+
xlab("Clump Thickness") + ylab("Probability of Cancer") + labs(color = "Bland Chromatin")
ggplot(cancer.grid1, aes(x = Clump_thickness, y = cancer.prob*100,
group = Marginal_adhesion, color = Marginal_adhesion)) +
geom_line() + facet_grid(Uniformity_of_cell_shape ~ Bland_Chromatin,
labeller = labeller(Uniformity_of_cell_shape = as_labeller(cellshape_names),
Bland_Chromatin = as_labeller(chromatin_names)))+
xlab("Clump Thickness") + ylab("Probability of Cancer") + labs(color = "Marginal Adhesion")
```
### Fitting the glm model:
To increase the complexity of the model and eventually accuracy of the prediction, we have decided to include the combinations of two variable as well. We have 4 variables so to choose 2 from 4 we had total 6 options and out of these 6 we have picked 3 combination based on their interaction plots and later through hit and trial, it was verified to perform the best.
```{r message=FALSE}
cancer.model = glm(Is_Malignant ~ Clump_thickness + Marginal_adhesion +
Uniformity_of_cell_shape + Bland_Chromatin +
Clump_thickness:Marginal_adhesion +
Clump_thickness:Uniformity_of_cell_shape +
Clump_thickness:Bland_Chromatin , family = "binomial", data = bc)
cancer.model.df = bc
cancer.model.df$.fitted = fitted.values(cancer.model)
cancer.model.df$.resid = residuals(cancer.model, type = "response")
# ggplot(cancer.model.df, aes(x = .fitted, y = .resid)) + geom_point() +
# geom_smooth(method = "lm", method.args = list(degree = 1)) +
# xlab("Fitted values") + ylab("Residuals")
summary(cancer.model)
```
###Accuracy on test set:
```{r message=FALSE}
train <- bc1[1:550,]
test <- bc1[551:683,]
fitted.results <- predict(cancer.model,newdata=subset(test,select=c(1,3,4,7)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$Is_Malignant)
#print(paste('Accuracy',1-misClasificError))
paste('Accuracy =', (1-misClasificError)*100,'%')
```
# Dimensionality Reduction: PCA
- If we have a large amount of data, we would like to avoid the curse of dimensionality and would like to reduce the time and space required.
- We tried to transform the dimensions of the data which captures the maximum variance using PCA and observed how much variance we can capture using PCA.
- This would let us easily visualize the data.
- It would also help us remove the multi-collinearity, which would help us ignore redundant features.
```{r message=FALSE}
pca_res <- prcomp(bc[,1:9], center = TRUE, scale = TRUE)
plot(pca_res, type="l", main='Variance Captured using different variables after PCA')
```
- We can see that, after applying PCA, the maximum variance is captured by the first variable. The second and the third variable also contributes to the variance a little.
So now we can check how the data looks using the most important 2 and 3 dimensions.
```{r message=FALSE, warning=FALSE}
### PCA using 2 PCA dimensions
pca_df <- as.data.frame(pca_res$x)
ggplot(pca_df, aes(x=PC1, y=PC2, col=bc$Is_Malignant)) +
geom_point(alpha=0.5) +
#xlab(names(df)) +
labs(fill = "Malignant")+
ggtitle("Data using 2 dimensions after PCA")+
labs(color='Class: Malignant')
```
- We can see that the data seems quite seperable using just the 2 dimensions.
We will just check how it looks using 3 dimensions.
```{r message=FALSE, warning=FALSE}
### PCA using 3 PCA dimensions
library(plotly)
legendtitle <- list(y=1.05,x=1.18,text="Class: Malignant",showarrow=F)
plot_ly(x=pca_df$PC1, y=pca_df$PC2, z=pca_df$PC3, type="scatter3d", mode="markers",marker = list(size = 5), color=bc$Is_Malignant) %>%
layout(
title = "Data using 3 dimensions after PCA",
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
),
annotations=legendtitle
)
```
- Again, we see that the data looks pretty seperable using 3 dimensions as well.
- We can observe that using just the 2 dimensions, we get very seperated results. Hence, we proceed with applying a linear model to classify the malign and benign tumors on the 2 PCA reduced dimensions.
#### Decision boundary
Let us draw a decision boundary on the data from the 2 dimensional PCA.
```{r , warning=FALSE}
mdl <- glm( bc$Is_Malignant ~ . , data = pca_df , family=binomial)
slope <- coef(mdl)[2]/(-coef(mdl)[3])
intercept <- coef(mdl)[1]/(-coef(mdl)[3])
library(lattice)
xyplot( PC2 ~ PC1 , data = pca_df, groups = bc$Is_Malignant,
panel=function(...){
panel.xyplot(...)
panel.abline(intercept , slope)
panel.grid(...)
}, main="Data using 2 dimensions after PCA with the decison boundary")
```
- The decision boundary which is drawn separated the data pretty well enough.
Now, we calculate the accuracy of the data separated by the decision boundary.
```{r}
predictions <- predict(mdl,newdata=pca_df,type='response')
predictions <- ifelse(predictions > 0.5,1,0)
misClasificError <- mean(predictions != bc$Is_Malignant)
print(paste('Accuracy',1-misClasificError))
```
####Results
- We observe that the accuracy of around 97.07% is a good prediction using data with such low dimensions.
- But, here the problem is we cannot interpret the dimensions which we get from PCA. Hence, we need something that can give us some high accuracy with interpretable features.
- Therefore, we next tried random forest, whose result features would be interpretable and more practical.
# Random Forest
- As discussed above PCA gives us good results but is less interpretable. Hence, we will use a random forest model which is developed by aggregating trees.
- The advantage of random forest is that its results are very interpretable and it also avoids overfitting.
- We can perform feature selection based on their importance.
- As an extension, there are many other health test which can be done which can be useful for breast cancer diagnosis, due to which we can have more number of features that can be handled using random forest.
**We experiment the model with different depths and check which one performs the best.**
```{r message=FALSE, include=FALSE}
#install.packages("randomForest")
model1 <- randomForest(x=bc[,1:9],y=as.factor(bc$Is_Malignant), importance = TRUE,depth=1)
model1
```
```{r message=FALSE, include=FALSE}
model2 <- randomForest(x=bc[,1:9],y=as.factor(bc$Is_Malignant), importance = TRUE,depth=2)
model2
```
```{r message=FALSE, include=FALSE}
model3 <- randomForest(x=bc[,1:9],y=as.factor(bc$Is_Malignant), importance = TRUE,depth=3)
model3
```
```{r message=FALSE, include=FALSE}
model4 <- randomForest(x=bc[,1:9],y=as.factor(bc$Is_Malignant), importance = TRUE,depth=4)
model4
```
```{r message=FALSE, include=FALSE}
model5 <- randomForest(x=bc[,1:9],y=as.factor(bc$Is_Malignant), importance = TRUE,depth=5)
model5
```
```{r message=FALSE, include=FALSE}
model6 <- randomForest(x=bc[,1:9],y=as.factor(bc$Is_Malignant), importance = TRUE,depth=6)
model6
```
Accuracy for models with depths: 1,2,3,4,5,6 =
```{r}
error<-c(2.93,3.22,2.64,3.07,3.07,2.64)
acclist<-100-error
acclist
```
- Now lets visualise the accuracy values.
```{r}
accuracydf = data.frame(Depth = c(1,2,3,4,5,6), acclist = acclist)
ggplot(accuracydf, aes(x = Depth, y = acclist, width = 0.75)) +
geom_line(stat = "identity",size=1.4) +
xlab("Depth of Random Forest") +
ylab("Accuracy of Model") +
geom_point(color='blue', size = 6) +
ggtitle("Change in accuracy with change in depth")
```
For depth = 1, it gives the accuracy of 97.07 and after increasing the depth it does not help much with the accuracy. At max we reached to 97.37% of accuracy at depth =3 and depth =6 and this change is not sufficient enough to go for higher depth and unnecessarily overfit our model.
```{r}
varImportance <- varImp(model1, scale = FALSE)
#varImportance
varImportanceScores <- data.frame(varImportance$`0`)
#varImportanceScores
varImportanceScores <- data.frame(Features = row.names(varImportance), Importance = varImportanceScores$varImportance..0.)
#varImportanceScores
```
# Gini importance of features in Random Forest:
Gini Importance or Mean Decrease in Impurity (MDI) calculates each feature importance as the sum over the number of splits (across all trees) that include the feature, proportionally to the number of samples it splits. We have plotted relative feature importance of the model that we have created.
```{r}
library(randomForest)
set.seed(7)
varImpPlot(model1, type = 2, main = "Feature importance of attributes")
```
- We can observe that the uniformity of cell size and cell shape has the higher importance compared to remaining features. Further, lets create a tree from which we can interpret our results very easily by just looking at the symptoms of a patient.
```{r fig.width=10, fig.height=7}
library(rpart)
fit=rpart(factor(Is_Malignant)~., bc)
plot(fit, main = "Dendogram showing the random forest model")
text(fit,pretty = TRUE, fancy = FALSE, fwidth = 0.8, fheight = 0.8)
```
- As we can see, now the results are very interpretable as we can just measure each of the attributes and we can decide with good confidence, whether a person has a malignant tumor or a benign one.
- For example, if a patient has the uniformity of cell size less than 2.5, uniformity of cell shape greater than 2.5 and Bland chromatin greater than 3.5, then the person is most likely to have a malignant tumor.
###Accuracy:
We tried 3 different models: GLM, GLM with PCA and Random Forest. Accuracy of those models are 99.4% , 97% and 97.07% respectively. Random Forest was chosen to be the final model because it gave us pretty high accuracy just using DEPTH=1 and estimators=100.
###Conclusion:
We have done detailed analysis on which features are useful in predicting malignancy of breast cancer. We also gave a simple dendogram chart prepared from Random Forest that can help anyone to predict whether a tumor is benign or malignant from 9 given features. From our derived conclusions we can conclude that there are mainly 1 or 2 features of tumor cell which are most important in order to predict malignancy of tumor. Also, it is much important to identify cancerous nature of tumor cells in early stages as nearly 86% of patients could be cured if tumor is treated early. As a future work, we can test on more practical data to get conclusions on robustness of our model and improve our findings.
### Acknowledgement:
We thank Prof. Brad Luen and Seiji Sloan for guiding us throughout the project. The inspiration to try Random Forest Model and make simple dendogram like chart came from suggestions by Prof. Brad Luen during the presentations and we really thank him to guide us throughout the semester.
### References
- https://www.researchgate.net/profile/Akash_Nag2/publication/325868350_Identifying_Patients_at_Risk_of_Breast_Cancer_through_Decision_Trees/links/5b2a29da4585150c63400a5f/Identifying-Patients-at-Risk-of-Breast-Cancer-through-Decision-Trees.pdf?origin=publication_detail
- https://www.breastcancer.org/symptoms/understand_bc/statistics
- http://pathology.jhu.edu/pc/BasicTypes1.php