MGT 3: Quantitative Methods in Business
Extra Credit 1—Naïve Bayes (20 points)
The Breast Cancer Dataset
This dataset contains information about 277 women who were treated for breast cancer, specifically for the
removal of a malignant tumor from one breast. The dataset is well known and is regularly
cited in the machine learning literature to explore or demonstrate approaches to classification modeling.
Analysts use the information to predict which of the women experienced a recurrence of the tumor within
five years of the initial tumor’s removal.
For this exercise, the data has been split into two sets: a training set containing ~80% of the original data (221
rows), and a testing set containing the remaining ~20% (56 rows). You will build your model on the training set
and generate predictions on the testing set.
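The handout does not say how the split was produced, and the course files are already split, so the following is only a minimal sketch of one common approach in base R. The data frame patients and the seed are placeholders introduced for illustration. With 277 rows, an 80% cut rounds down to 221 training rows and leaves 56 for testing.

```r
# Stand-in data frame with 277 rows (hypothetical; the real data would be read from a file).
patients <- data.frame(id = seq_len(277))

set.seed(42)  # arbitrary seed so the random split is reproducible

# Number of training rows: 80% of 277, rounded down.
n_train <- floor(0.8 * nrow(patients))

# Randomly choose which row numbers go into the training set.
train_idx <- sample(seq_len(nrow(patients)), size = n_train)

train <- patients[train_idx, , drop = FALSE]   # rows selected for training
test  <- patients[-train_idx, , drop = FALSE]  # all remaining rows

nrow(train)  # 221
nrow(test)   # 56
```

Because the rows are drawn at random, a different seed would place different patients in each set, but the set sizes stay the same.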
This dataset was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Nine
instances that contained missing data were removed from the original dataset. Thanks go to M. Zwitter and
M. Soklic for providing the data.
The names of the variables contained in the dataset are listed in the table below, along with a brief
description of each:
Variable      Description
recur         yes/no: whether or not the patient suffered a recurrence of the tumor within five years of the tumor’s removal (yes = recurred)
age           the age of the patient, bucketed by decade (e.g., 20-29, 30-39)
post_meno     yes/no: whether or not the patient is post-menopausal (yes = post-menopausal)
tumor_size    the maximum diameter of the tumor (in millimeters)
inv_nodes     number of axillary (armpit) lymph nodes with visible metastatic breast cancer at the time of diagnosis
node_caps     yes/no: whether or not the cancer metastasized to a lymph node (yes = metastasized)
deg_malig     “degree malignant”: the severity of the malignancy of the tumor, ranked on a 3-point scale (1 = least severe, 3 = most severe)
breast        indicates whether the tumor occurred on the left or right breast
quadrant      the location of the tumor within the breast, categorized by quadrant (e.g., upper-left, lower-right)
radiation     yes/no: whether or not the patient received radiation therapy (yes = received radiation)
© Ryan Wagner, 2020. Do not copy or distribute without permission.
Installing R Packages
R comes with built-in functionality (called base R), and the ability to easily load additional commands that
greatly enhance the software’s capabilities. These extra commands are found in packages. The commands
needed to run and analyze a Naïve Bayesian classifier are found in two packages: e1071 and caret.
Fortunately, installing and loading these packages is typically very fast, and requires only simple commands.
To install these packages, run the following commands:
install.packages("e1071")
install.packages("caret")
Note the use of quotes around each package name in the above commands. You only need to run these
commands once per computer. Once a package has been successfully installed, it stays on your
computer unless you manually uninstall it.
Installing a package typically takes 30 to 60 seconds, though the process may take longer if your
internet connection is slow. You will know the package has finished installing when the > symbol reappears
in the console.
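Since installation is only needed once, a script can check which packages are already present before installing anything. The sketch below is one way to do that. The helper name missing_packages is hypothetical (it is not part of R or of either package), and the interactive() guard is an added assumption so that nothing is downloaded when the script runs unattended.

```r
# Return the subset of `pkgs` that is not yet installed on this machine.
# (Hypothetical helper written for this illustration.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("e1071", "caret"))

# Install only what is missing, and only when working at the console.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Running this a second time installs nothing, because both packages are then found on the computer.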
Loading R Packages
Every time you start R, only the commands contained within base R are automatically loaded. If you wish to
use a command contained in a package, you must first load that package. Loading must be repeated
in every new R session in which you wish to use the package, but the command only
needs to be run once at the start of the session. Installing a package does not load it: after installing a
package for the first time (see above), you will still need to load it using the commands below.
To load the e1071 and caret packages, run the following commands:
library(e1071)
library(caret)
Note that unlike install.packages(), the library() command does not require quotes around the package
name.
A best practice is to save all relevant library() calls (those required to execute the work contained in a
given script) at the beginning of your script, so that if you revisit the script at a later date, you already know
which packages are needed, and are reminded to load them.
Training the Model
When predictions are generated via Naïve Bayes, an expanded version of Bayes’ Theorem is used to calculate
a posterior probability distribution of the target variable for each row in the dataset. The terms in the theorem
(prior probability, evidence, and likelihood) reflect patterns that are believed to govern the system of
variables being analyzed. Model training refers to the step in which these patterns are estimated from the
training data and stored, so that they can be used to generate predictions on future cases.
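To make the terms concrete, the following sketch works through the calculation by hand in base R. It assumes the usual Naïve Bayes form, in which the posterior for each class is proportional to the prior multiplied by one likelihood per predictor, and the evidence is the sum that rescales those products to add up to 1. The ten-row toy data frame is invented for illustration and is not drawn from the breast cancer dataset; it only borrows three variable names from the table above. No smoothing is applied.

```r
# Invented toy data: 4 patients with a recurrence, 6 without.
toy <- data.frame(
  recur     = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  radiation = c("yes", "yes", "yes", "no",  "yes", "no", "no", "no", "no", "no"),
  node_caps = c("yes", "yes", "no",  "no",  "yes", "no", "no", "no", "no", "no")
)

classes <- c("no", "yes")

# Prior: share of each class in the training data (no = 0.6, yes = 0.4).
prior <- vapply(classes, function(k) mean(toy$recur == k), numeric(1))

# Likelihood: share of rows within each class whose `column` equals `value`.
likelihood <- function(column, value) {
  vapply(classes, function(k) mean(toy[[column]][toy$recur == k] == value), numeric(1))
}

# New patient with radiation = yes and node_caps = yes.
score <- prior * likelihood("radiation", "yes") * likelihood("node_caps", "yes")

# Evidence: the sum of the class scores, used to rescale them into probabilities.
posterior <- score / sum(score)
posterior  # no = 0.1, yes = 0.9
```

Here the class "yes" scores 0.4 × 3/4 × 2/4 = 0.15 and the class "no" scores 0.6 × 1/6 × 1/6 ≈ 0.0167, so the posterior probability of recurrence is 0.15 / (0.15 + 0.0167) = 0.9.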
To train a Naïve Bayes model, use the naiveBayes() command. This command takes the following parameters:
formula: the target variable (the variable whose outcome we would like to predict) and the predictors (the
variables we believe contain useful information that helps us predict the outcome of the target variable).
Formulas take the form y~a+b+c, where y is the target variable, and a,b,c are the predictors. The list of
predictors may include any number of variables, each separated by the + symbol.
For example: in the iris dataset, if you wished to predict the species of an iris using information about its
petal length and width, you would write the formula as:
Species ~ Petal.Length + Petal.Width
A few notes on the formula:
• The use of spacing between variable names is optional; you may prefer to insert spaces to make your
code easier to read.
• Variable names are written without $ notation, and no quotes are used around them.
Shortcut: If you wish to use all remaining variables (besides the target variable) as predictors, you can simply
write a period after the ~ symbol in your formula. For example, the following formula assigns Species as the
target variable, and uses all four remaining variables in the iris dataset as predictors:
Species ~ .
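The notes above can be checked directly in base R, without loading any packages. The sketch below uses the built-in iris dataset; the object names f_compact, f_spaced, and predictors_all are chosen here for illustration.

```r
# Spacing inside a formula is optional: both spellings describe the same model.
f_compact <- Species~Petal.Length+Petal.Width
f_spaced  <- Species ~ Petal.Length + Petal.Width
identical(deparse(f_compact), deparse(f_spaced))

# Variables named in the formula: the target first, then the predictors.
all.vars(f_spaced)

# Expanding the period shortcut against iris lists every column except Species.
predictors_all <- attr(terms(Species ~ ., data = iris), "term.labels")
predictors_all
```

The period only has a meaning once a dataset is supplied, which is why terms() is given data = iris here.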
data: the dataset containing the target variable and predictors, to be used to train your model. For example,
to train your model on the iris dataset, you would write:
data = iris
Assign the output of the naiveBayes() command to an object with a name of your choice.
Continuing the above example, the complete naiveBayes() call would appear as:
nb
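A minimal sketch of such a call, assuming the shortcut formula Species ~ . and data = iris from the examples above, and assuming the e1071 package is installed. The object name nb follows the handout; the predict() step and the name preds are added for illustration and are not part of the original example.

```r
library(e1071)  # provides naiveBayes()

# Train on the built-in iris data: Species is the target,
# and the period stands for all remaining columns as predictors.
nb <- naiveBayes(Species ~ ., data = iris)

# Class counts observed in the training data.
nb$apriori

# Predicted species for the first six rows (illustration only;
# in practice predictions are generated on a separate testing set).
preds <- predict(nb, newdata = head(iris))
preds
```

For the assignment itself, the same pattern would apply with recur as the target variable and the training set as the data argument.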