Breast cancer is common: it is estimated that 1 in 8 women will develop breast cancer at some point in her life. A mammogram is an x-ray picture of the breast. Mammograms are used to check for breast cancer in women who have no signs or symptoms of the disease (screening mammograms). If a breast abnormality is found, further mammograms may be necessary to determine whether the abnormality is benign or malignant (diagnostic mammograms).

After mammography, the mammograms are examined by radiologists, who look for abnormalities in the breast tissue. Sometimes a mass, something more substantial and well-defined than a lesion, may be found. Most breast masses are benign, but certain characteristics of a mass may suggest breast cancer. A radiologist writes a report to describe and assess the nature of any abnormality found. BI-RADS (Breast Imaging Reporting and Data System) is a standard system for describing mammogram findings and results. It sorts the results into categories numbered 0 through 6 (see this page and this page for an explanation of the assessment codes). In addition to the BI-RADS assessment codes, several other BI-RADS attributes are also reported: shape (round/oval/lobular/irregular), density (the amount of fat cells present and the density of suspicious cells) and margin (the characteristics of the mass's edge: circumscribed/microlobulated/obscured/ill-defined/spiculated). A more detailed explanation can be found on this page.

Although mammograms are currently the best tool for detecting breast cancer, mammogram results are not always accurate. About 70% of positive results turn out to be false positives when biopsies are performed, causing major mental and physical discomfort for the patients. Several computer-aided diagnosis (CAD) systems based on the BI-RADS attributes have been proposed in the hope of improving the accuracy.

In this exercise, you are going to construct a logistic regression model to predict whether a mammographic mass is benign or malignant based on the BI-RADS attributes (i.e. shape, density, margin) and the patient’s age. That is, you are going to ignore the BI-RADS assessment code and rely only on the BI-RADS attributes and the patient’s age to predict whether a mass is benign or malignant. You will use a version of the mammographic mass data set from the UCI machine learning repository. The data file can be downloaded here and the data description is on this page. The data were collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006. The data set has been cleaned. It contains the patients’ ages, the BI-RADS assessment codes, and the three BI-RADS attributes (shape, margin, density). The last column, named severity, is a binary 0/1 variable indicating whether the mass is benign (0) or malignant (1). Take some time to explore the data.

a. (1 point)

Use read.table() with an appropriate parameter to load the data to R.

Note: The data file is not in csv form. In Week 9, some students did not know how to read a non-csv data file properly with R. Many real-world data sets are not in csv form, so it is very important to know how to read a non-csv file properly. As emphasized in Section 6.2 of Peng’s textbook, it is worth reading the help file in ?read.table in its entirety.

mammo <- read.table("mammographic_masses_cleaned.txt", header=TRUE)


Before building a model, it is useful to make some plots to visualize how the different variables are related to the probability of breast cancer.

There are 72 different values of age in the data set, as can be seen from the output of table() applied to the age column. We can calculate the fraction of patients with a malignant mass (severity=1) for each value of age using the command f_age <- with(data_frame_name, tapply(severity, age, mean)) (replace data_frame_name with the name of the data frame you used when loading the data set). The 72 different values of age are stored in the names of f_age, which are characters. We can extract the ages by converting the characters to numbers with the command age <- as.numeric(names(f_age)).
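To see what these two commands do, here is a minimal self-contained sketch on a made-up toy data frame (the ages and severities below are hypothetical, not from the mammographic data):

```r
# toy stand-in for the real data frame (hypothetical values)
toy <- data.frame(age      = c(40, 40, 55, 55, 55),
                  severity = c(0,  1,  1,  1,  0))

# fraction of malignant masses (severity = 1) at each age
f_age <- with(toy, tapply(severity, age, mean))
f_age
#        40        55 
# 0.5000000 0.6666667

# the names of f_age are character strings; convert them to numbers
age <- as.numeric(names(f_age))
age
# [1] 40 55
```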

b. (2 points)

Make a scatter plot showing the fraction of patients with malignant mass versus age in the data set. (1 pt)

Calculate f_age and extract the unique ages in the data set:

f_age <- with(mammo, tapply(severity, age, mean))
age <- as.numeric(names(f_age))

Plot f_age vs age:

plot(f_age ~ age, pch=16, las=1, xlab="Age", ylab="Breast Cancer Fraction")


Aside: You should know from this week’s notes and from this week’s exercises that the values of f_age are the same as the predicted probabilities of breast cancer from a logistic regression model predicting severity from age, treating age as a factor variable, as demonstrated below:

logit.age <- glm(severity ~ as.factor(age), data=mammo, family=binomial)
plot(logit.age$fitted.values ~ mammo$age, pch=16, las=1, xlab="Age", 
     ylab="P(Breast Cancer)")

How does the chance of getting a breast cancer change with a patient’s age? (1 pt)

It is clear from the graph that the chance of getting breast cancer generally increases with age.

c. (4 points)

Calculate the percentage of malignant masses for each of the 4 categories in shape. (2 pts)

(p_shape <- with(mammo, tapply(severity, shape, mean)*100))
irregular   lobular      oval     round 
 78.57143  51.85185  17.22222  16.84211 

Then make a barplot showing the percentages for the 4 categories. (2 pts)

barplot(p_shape, las=1, xlab="Shape", ylab="% of Malignant Masses")


Aside: If you want, you can add the percentage information on the top of each bar. The following is a customized barplot function I wrote a while ago. I won’t explain how it works here. Try to figure it out yourself if you are interested.

my_barplot <- function(x, ...) {
  # round percentages to 2 sig. fig. and convert them to character labels
  perc_char <- paste0(as.character(signif(x, 2)), "%")
  # set the upper y-limit so the labels fit above the bars
  yup <- signif(min(100, max(x) + 10), 1)
  mp <- barplot(x, ylim = c(0, yup), las = 1, ...)
  # add the percentage label above each bar
  text(mp, x + 3, labels = perc_char)
}

It was based on a trick I saw on Stack Overflow (a good place to ‘steal’ tricks). Let’s see how it looks when applied to this data set:

my_barplot(p_shape, xlab="Shape", ylab="% of Malignant Masses")

Again, the values of p_shape are exactly 100 times the fitted probabilities from a logistic regression model predicting severity from shape.

d. (2 points)

Do the same calculations for the margin variable. That is, calculate the percentages of malignant masses for each category in margin and then make a barplot of percentages for the categories.

(p_margin <- with(mammo, tapply(severity, margin, mean)*100))
 circumscribed    ill-defined microlobulated       obscured     spiculated 
      11.87500       69.68504       65.21739       62.85714       83.46457 
barplot(p_margin, las=1, xlab="Margin", ylab="% of Malignant Masses")

OR use my_barplot() to add % on top of each bar:

my_barplot(p_margin, xlab="Margin", ylab="% of Malignant Masses")

e. (2 points)

Do the same calculations for the density variable. That is, calculate the percentages of malignant masses for each category in density and then make a barplot of percentages for the categories.

(p_density <- with(mammo, tapply(severity, density, mean)*100))
fat-containing           high            iso            low 
      50.00000       45.45455       32.14286       49.73475 
barplot(p_density, las=1, xlab="Density", ylab="% of Malignant Masses")

OR use my_barplot() to add % on top of each bar:

my_barplot(p_density, xlab="Density", ylab="% of Malignant Masses")

f. (4 points)

Build a logistic regression model predicting severity from age, shape, margin and density. Treat age as a continuous variable and the other 3 BI-RADS attributes as factor variables. (2 pts)

logit.model <- glm(severity ~ age+shape+margin+density, data=mammo, family=binomial)
summary(logit.model)

Call:
glm(formula = severity ~ age + shape + margin + density, family = binomial, 
    data = mammo)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5300  -0.5633  -0.2185   0.6631   2.5564  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -4.565858   0.950479  -4.804 1.56e-06 ***
age                   0.054876   0.007827   7.011 2.36e-12 ***
shapelobular         -0.715062   0.298668  -2.394  0.01666 *  
shapeoval            -1.636775   0.290107  -5.642 1.68e-08 ***
shaperound           -1.376055   0.333461  -4.127 3.68e-05 ***
marginill-defined     1.495475   0.302762   4.939 7.83e-07 ***
marginmicrolobulated  1.637733   0.559682   2.926  0.00343 ** 
marginobscured        1.155811   0.352738   3.277  0.00105 ** 
marginspiculated      2.007016   0.374927   5.353 8.65e-08 ***
densityhigh           1.759788   1.062651   1.656  0.09771 .  
densityiso            0.798370   0.865664   0.922  0.35639    
densitylow            1.108105   0.792189   1.399  0.16188    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1148.48  on 828  degrees of freedom
Residual deviance:  723.34  on 817  degrees of freedom
AIC: 747.34

Number of Fisher Scoring iterations: 5

The model can also be constructed using the command logit.model <- glm(severity ~ .-BIRADS, data=mammo, family=binomial). The syntax severity ~ .-BIRADS means to fit severity using all other variables in the data frame except BIRADS.

How many parameters (intercept and slopes) are returned by the model? Which slopes are not significant at the 5% level? (2 pts)

From the summary output, we see that there are 12 parameters (1 intercept and 11 slopes). The p-values for densityhigh, densityiso and densitylow are greater than 0.05, so these slopes are not significant at the 5% level. This means that, holding all the other variables fixed, the probability of a mass being malignant does not differ significantly between the high, iso and low density categories and the baseline category (fat-containing).
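The count of 11 slopes follows from how R encodes factor variables: a factor with k levels contributes k−1 dummy-variable slopes, with the first level absorbed into the intercept as the baseline. Here age contributes 1 slope, shape (4 levels) contributes 3, margin (5 levels) contributes 4 and density (4 levels) contributes 3, for 11 slopes plus the intercept. A minimal self-contained sketch with a toy shape factor:

```r
# a toy 4-level factor like shape (values are for illustration only)
shape <- factor(c("irregular", "lobular", "oval", "round"))

# model.matrix shows the columns R creates: the baseline level (irregular,
# first in alphabetical order) is absorbed into the intercept
colnames(model.matrix(~ shape))
# [1] "(Intercept)"  "shapelobular" "shapeoval"    "shaperound"
```

This matches the coefficient names in the summary output above (shapelobular, shapeoval, shaperound, with irregular as the baseline).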

g. (7 points)

Suppose we decide that a mass is malignant if the predicted probability from the logistic regression model above is greater than 0.5.

Calculate the percentage of wrong predictions of this model (3 pts)

The predicted probability of each case is stored in the vector logit.model$fitted.values. We create a variable pred to store the prediction of this model for each case by setting pred=1 (malignant) if the predicted probability is greater than 0.5 and pred=0 (benign) if the probability is equal to or less than 0.5. This can be done by the following one-line command:

pred <- as.integer(logit.model$fitted.values > 0.5)

The percentage of wrong predictions, the misclassification rate, can be calculated by

(p_misclassify <- mean(pred != mammo$severity))
[1] 0.1845597

So the percentage of wrong predictions is 18.5%.

Calculate the probability of Type I and Type II error. (4 pts)

The probability of a Type I error is the fraction of benign masses misclassified as malignant, which can be calculated by the following command.

(p1 <- mean(pred[mammo$severity==0]))
[1] 0.2177986

You can also get the same answer by a somewhat obscure command mean(pred[!mammo$severity]).

The probability of a Type II error is the fraction of malignant masses misclassified as benign, which can be calculated by the following command.

(p2 <- mean(pred[mammo$severity==1]==0))
[1] 0.1492537

or a somewhat obscure command mean(!pred[mammo$severity==1]).

So the probabilities of Type I and Type II errors are 0.218 and 0.149, respectively.

You can also construct a confusion matrix to calculate the errors:

(cm <- with(mammo, table(pred,severity)))
    severity
pred   0   1
   0 334  60
   1  93 342
n <- nrow(mammo)

The prediction error is the fraction of wrong predictions = (60 + 93)/(total number of cases) = (60 + 93)/829 = 0.1845597.

P(Type I error) = 93/(334 + 93) = 0.2177986.

P(Type II error) = 60/(60 + 342) = 0.1492537.
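These rates can also be computed programmatically from the confusion matrix instead of by hand. The snippet below rebuilds cm from the counts shown above so that it is self-contained:

```r
# confusion matrix counts from the table above (rows: pred, columns: severity)
cm <- matrix(c(334, 93, 60, 342), nrow = 2,
             dimnames = list(pred = c("0", "1"), severity = c("0", "1")))

n <- sum(cm)                                          # 829 cases in total
(p_misclassify <- (cm["1", "0"] + cm["0", "1"]) / n)  # 0.1845597
(p1 <- cm["1", "0"] / sum(cm[, "0"]))                 # Type I:  93/427
(p2 <- cm["0", "1"] / sum(cm[, "1"]))                 # Type II: 60/402
```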

h. (3 points)

Suppose doctors decide that it is more important to make sure that breast cancers are detected, so they want to reduce the Type II error. This can be achieved by setting a lower threshold value for the classification of malignant masses. Instead of 0.5, suppose now we classify a mass as malignant if the predicted probability from the logistic regression model above is greater than 0.3. Calculate the probability of Type I and Type II error in this case.

The new prediction and probabilities can be calculated using these commands:

pred_new <- as.integer(logit.model$fitted.values > 0.3)
(p1_new <- mean(pred_new[mammo$severity==0]))
[1] 0.3161593
(p2_new <- mean(pred_new[mammo$severity==1]==0))
[1] 0.09701493

The probability of Type I error increases to 0.316 but the probability of Type II error decreases to 0.097.




You should see from your calculation that, for a fixed detection system, we can decrease the Type II error rate only at the expense of increasing the Type I error rate. This is the tradeoff between Type I and Type II errors described in Chapter 1 of your Stat 200 notes. A better approach is therefore to construct a better detection system that reduces both types of errors. In our example, that means building a model better than the logistic regression model considered in this exercise. However, a sophisticated model usually involves many parameters and is prone to overfitting: it gives accurate predictions on the data from which it was constructed but does not perform well on new data. A commonly used strategy to guard against overfitting is to split the data randomly into two sets, called the training set and the test set (or the held-out data). Models are constructed using the training set, and their accuracy is then evaluated by applying them to predict responses for the data in the test set.
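As a minimal sketch of this train/test strategy (run on simulated toy data rather than the mammographic data set; the 70/30 split proportion and the toy model severity ~ age are my own choices for illustration):

```r
set.seed(1)  # for reproducibility

# simulate a toy data set standing in for the real one (hypothetical values)
n <- 200
toy <- data.frame(age = sample(30:80, n, replace = TRUE))
toy$severity <- rbinom(n, 1, plogis(0.06 * (toy$age - 55)))

# split randomly: 70% training set, 30% test set (held-out data)
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# fit the model on the training set only
fit <- glm(severity ~ age, data = train, family = binomial)

# evaluate the misclassification rate on the held-out test set
pred_test <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
mean(pred_test != test$severity)
```

The misclassification rate computed on the test set is an honest estimate of how the model performs on new data, unlike the in-sample rate computed earlier in this exercise.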