MGT 3 UCSD Mathematics Worksheet

University of California San Diego


MGT 3: Quantitative Methods in Business

Extra Credit 1: Naïve Bayes (20 points)

The Breast Cancer Dataset

This dataset contains information about 277 women who were treated for breast cancer; specifically, for the removal of a malignant tumor from one of their breasts. This dataset is well-known and has been regularly cited in machine learning literature to explore or demonstrate various approaches to classification modeling. Analysts use the information to predict which of the women experienced a recurrence of her tumor within five years of the initial tumor’s removal.

For this exercise, the data has been split into two sets: a training set containing ~80% of the original data (221 rows), and a testing set containing the remaining ~20% (56 rows). You will build your model on the training set and generate predictions on the testing set.

This dataset was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. 9 instances which contained missing data were removed from the original dataset. Thanks go to M. Zwitter and M. Soklic for providing the data.
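For illustration only, a split with these row counts could be produced in base R roughly as follows. The worksheet does not say how the 221/56 split was actually made, so the random-sampling approach, the seed, and the placeholder data frame bc_all are all assumptions of this sketch.

```r
# Placeholder standing in for the full 277-row dataset
# (hypothetical; it holds only row ids, not the real data)
bc_all <- data.frame(row_id = 1:277)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Draw 221 row indices at random for training; the remaining rows form the test set
train_idx <- sample(nrow(bc_all), size = 221)
train_set <- bc_all[train_idx, , drop = FALSE]
test_set  <- bc_all[-train_idx, , drop = FALSE]

# Check the sizes and the training proportion quoted above
nrow(train_set)                            # 221
nrow(test_set)                             # 56
round(nrow(train_set) / nrow(bc_all), 3)   # 0.798, i.e. roughly 80%
```

The two sets share no rows, which is what allows the testing set to act as unseen data when predictions are generated.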
The names of the variables contained in the dataset are listed in the table below, along with a brief description of each:

Variable      Description
recur         yes/no: whether or not the patient suffered a recurrence of the tumor within five years of the tumor’s removal (yes = recurred)
age           the age of the patient, bucketed by decade (e.g., 20-29, 30-39)
post_meno     yes/no: whether or not the patient is post-menopausal (yes = post-menopausal)
tumor_size    the maximum diameter of the tumor (in millimeters)
inv_nodes     number of axillary (armpit) lymph nodes with visible metastatic breast cancer at the time of diagnosis
node_caps     yes/no: whether or not the cancer metastasized to a lymph node (yes = metastasized)
deg_malig     “degree malignant”: the severity of the malignancy of the tumor, ranked on a 3-point scale (1 = least severe, 3 = most severe)
breast        indicates whether the tumor occurred on the left or right breast
quadrant      the location of the tumor within the breast, categorized by quadrant (e.g., upper-left, lower-right)
radiation     yes/no: whether or not the patient received radiation therapy (yes = received radiation)

© Ryan Wagner, 2020. Do not copy or distribute without permission.

Installing R Packages

R comes with built-in functionality (called base R), and the ability to easily load additional commands that greatly enhance the software’s capabilities. These extra commands are found in packages. The commands needed to run and analyze a Naïve Bayesian classifier are found in two packages: e1071 and caret. Fortunately, installing and loading these packages is typically very fast, and requires only simple commands.

To install these packages, run the following commands:

    install.packages("e1071")
    install.packages("caret")

Note the use of quotes around each package name in the above commands. You only ever need to run these commands one time. Once a package has been successfully installed, it lives permanently on your computer unless you manually uninstall it.
Installing a package typically takes between 30 and 60 seconds, though the process may take longer if your internet connection is weak. You will know the package has finished installing when the > symbol re-appears in the console.

Loading R Packages

Every time you start R, only the commands contained within base R are automatically loaded. If you wish to use a command contained in a package, you must first load that package. This process must be repeated every time you start R (assuming you wish to use that package during your session), but the command only needs to be run once at the start of your session. After installing a package for the first time (see above), you will still need to load it each time you begin a new R session, using the commands below.

To load the e1071 and caret packages, run the following commands:

    library(e1071)
    library(caret)

Note that unlike install.packages(), the library() command does not use quotes around the package name.

A best practice is to save all relevant library() calls (those required to execute the work contained in a given script) at the beginning of your script, so that if you revisit the script at a later date, you already know which packages are needed, and are reminded to load them.

Training the model

When predictions are generated via Naïve Bayes, an expanded version of Bayes’ Theorem is used to calculate a posterior probability distribution of the target variable for each row in the dataset. The terms in the theorem (prior probability, evidence, and likelihood) reflect patterns that are believed to govern the system of variables being analyzed. Model training refers to the step in which these patterns are analyzed and stored, so that they can be used to generate predictions on future cases.

To train a Naïve Bayes model, use the naiveBayes() command.
This command takes the following parameters:

formula: the target variable (that which we would like to predict the outcome of), and the predictors (those variables we believe contain useful information that helps us predict the outcome of the target variable). Formulas take the form y~a+b+c, where y is the target variable, and a, b, c are the predictors. The list of predictors may be extended indefinitely, each separated by the + symbol. For example: in the iris dataset, if you wished to predict the species of an iris using information about its petal length and width, you would write the formula as:

    Species ~ Petal.Length + Petal.Width

A few notes on the formula:
• The use of spacing between variable names is optional; you may prefer to insert spaces to make your code easier to read.
• Variable names are written without $ notation, and no quotes are used around them.

Shortcut: If you wish to use all remaining variables (besides the target variable) as predictors, you can simply write a period after the ~ symbol in your formula. For example, the following formula assigns Species as the target variable, and uses all four remaining variables in the iris dataset as predictors:

    Species ~ .

data: the dataset containing the target variable and predictors, to be used to train your model. For example, to train your model on the iris dataset, you would write:

    data = iris

Assign the output of the naiveBayes() command to an object with a name of your choice. Continuing the above example, the complete naiveBayes() call would appear as: nb
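As a rough illustration of the formula and data parameters described above, the following sketch trains two classifiers on the built-in iris dataset and then asks one of them for posterior probabilities. The object names iris_model, iris_model_all, and posterior are arbitrary choices made for this example and are not taken from the worksheet; the sketch also assumes the e1071 package is already installed.

```r
# Load the package that provides naiveBayes(); assumes e1071 is installed
library(e1071)

# Train on the built-in iris dataset using two named predictors
# (iris_model is an arbitrary object name chosen for this sketch)
iris_model <- naiveBayes(Species ~ Petal.Length + Petal.Width, data = iris)

# Same target variable, using the "." shortcut for all remaining variables
iris_model_all <- naiveBayes(Species ~ ., data = iris)

# Class counts for the target variable, stored during training
print(iris_model$apriori)

# Posterior probabilities of each species for the first three rows
posterior <- predict(iris_model, newdata = iris[1:3, ], type = "raw")
print(posterior)
```

Each row of posterior holds one probability per species, and the probabilities in a row sum to 1. Calling predict() without type = "raw" would return the most probable class instead.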

Running Head: MGT 3: Quantitative Methods in Business

MGT 3: Quantitative Methods in Business


Outline: Extra Credit 1—Naïve Bayes
The Breast Cancer Dataset
For this exercise, the data has been split into two sets:
• A training set containing ~80% of the original data (221 rows)
• A testing set containing the remaining ~20% (56 rows)
You will build your model on the training set and generate predictions on the testing set.



Extra Credit 1—Naïve Bayes
Q1. Write the command to train a model on the training dataset that predicts the
recurrence of breast cancer using only information about the patient’s age. Save the output
as an object called nb_1. (2 points)
Solution:
# Required packages
library(e1071)
library(caret)
# Load the training and test data
bc_train <- read.csv("E:\\Expert Paul\\2020\\18th Feb\\bc_dataset\\bc_train.csv")
bc_test <- read.csv("E:\\Expert Paul\\2020\\18th Feb\\bc_dataset\\bc_test.csv")
# Train a model on the training dataset to predict recurrence of breast cancer from age
...
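Following the formula syntax from the worksheet, the Q1 command would plausibly take the form shown below. This is a sketch under stated assumptions, not the posted answer: it assumes the training data has columns named recur and age, as listed in the variable table, and it substitutes a tiny made-up data frame (toy_train, a hypothetical name) for bc_train so that the example runs on its own.

```r
library(e1071)  # assumes e1071 is installed

# Made-up stand-in for bc_train (hypothetical values, NOT the real dataset)
toy_train <- data.frame(
  recur = factor(c("no", "no", "yes", "no", "yes", "no")),
  age   = factor(c("30-39", "40-49", "40-49", "50-59", "50-59", "60-69"))
)

# Predict recurrence using only the patient's age bucket
nb_1 <- naiveBayes(recur ~ age, data = toy_train)

# Inspect the conditional probabilities of each age bucket given recur
print(nb_1$tables$age)
```

With the real training set loaded as bc_train, the same call would read nb_1 <- naiveBayes(recur ~ age, data = bc_train).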

