Logistic Regression
Datasets are often found as links on a website. Click on the link and save
the data into a text editor such as Notepad.
Make sure you know which directory (i.e. folder) you save it in, because
you will need to locate it in order to read it into R. Use the getwd()
command with empty parentheses to have R tell you what your current working
directory is. In other words, this command tells you which folder R currently
has open and is looking at. If you are not sure whether you saved the file of
interest in that directory, use the dir() command with empty parentheses.
This has R print a list of all the files in the current working directory.
If the file is not there, you will need to change the directory to the one
where the file is located. This requires the setwd() command, where you insert
the path to the new directory, in quotation marks, inside the parentheses.
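Here is a minimal sketch of that getwd() / dir() / setwd() routine. It assumes nothing about your computer: tempdir() stands in for "the folder where you saved the file", and demo.txt is a made-up placeholder file, not the lobster data.

```r
# Remember where we started so we can come back afterwards.
old.dir <- getwd()

# tempdir() is only a stand-in for the folder where you saved your file.
demo.dir <- tempdir()

# Create a tiny placeholder file (demo.txt is a hypothetical name).
writeLines(c("length survived", "30 0"), file.path(demo.dir, "demo.txt"))

setwd(demo.dir)                              # change the working directory
found.in.listing <- "demo.txt" %in% dir()    # is the file in the listing?
found.directly   <- file.exists("demo.txt")  # same question, asked directly

setwd(old.dir)                               # go back to where we started
back.home <- identical(getwd(), old.dir)
```

If found.in.listing comes back FALSE for your own file, you are most likely looking in the wrong folder.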
I manually added column names in the text file (via Notepad),
saved the data as a .txt file, named it lobster.txt and stored it in my
Documents folder. Read it into R with the read.csv()
command.
getwd()
[1] "C:/Users/Joshua/Desktop"
setwd("C:/Users/Joshua/Documents")
# This code should have no output because it is merely changing directories, so nothing visible should happen
I want to point out that the path in my setwd() command will look different from yours. I am using my personal computer to retrieve the
.txt file, so the path inside the parentheses is unique to my
computer. If you run my code verbatim, you will get an error along the lines
of "cannot change working directory". What this really means is that no such
directory exists on your computer. If
you need to, manually locate the file on your computer, right click on it,
click on Properties, and in the Properties window look for something
that says Location or Folder Path, then copy and paste that into the setwd() command. Make sure you put quotation marks around the
path and replace all the backslashes ( \ ) with forward slashes ( / ).
R treats a backslash inside a quoted string as an escape character, so use
forward slashes instead. Okay, let’s finally read this text file into
R and store it in an object named lobster.
Honestly, you can name it whatever you like; heck, name it bob or steve
if it tickles your fancy. I simply chose
to name the object lobster because that is what the data is about.
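As a quick aside on the slash fix described above, here is a small sketch of turning a copied Windows path into one R will accept. The path is the one from my own computer, so treat it purely as an example; note that inside an R string each backslash has to be typed twice.

```r
# A Windows-style path; each single backslash must be written as \\ in R.
win.path <- "C:\\Users\\Joshua\\Documents"

# Swap every backslash for a forward slash (fixed = TRUE means plain text, no regex).
r.path <- gsub("\\", "/", win.path, fixed = TRUE)
r.path   # "C:/Users/Joshua/Documents"

# file.path() glues folder and file name together with forward slashes.
full.path <- file.path(r.path, "lobster.txt")
full.path   # "C:/Users/Joshua/Documents/lobster.txt"
```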
lobster <- read.csv("lobster.txt", header=TRUE, sep=" ")
lobster = pd.read_csv("C:/Users/Joshua/Documents/lobster.txt", sep=" ")
length <- lobster$length
survived <- lobster$survived
lobster <- cbind(length, survived)
lobster <- as.data.frame(lobster)
All I did here was select the desired columns and shove each one
into an object which I decided to name after the column. Next I wanted to put them together into a
data frame like structure with only the length and the survived columns, so I
used the cbind() command (short for
column bind), which glues the two columns side by side. Unfortunately, cbind() on two
plain vectors returns a matrix rather than a data frame,
so I used the as.data.frame() command to coerce it
from a matrix to a data frame. We need to work
with data frames in order to do our analyses.
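To see the matrix versus data frame distinction for yourself, here is a small sketch on a made-up toy data frame (the values and the id column are hypothetical, not the lobster data). It also shows a one-step alternative: picking the two columns directly with square brackets.

```r
# Toy stand-in for the file that was read in (values are made up).
toy <- data.frame(id = 1:4,
                  length = c(30L, 35L, 40L, 45L),
                  survived = c(0L, 0L, 1L, 1L))

# cbind() on two vectors gives a matrix, not a data frame.
glued <- cbind(length = toy$length, survived = toy$survived)
is.matrix(glued)       # TRUE
is.data.frame(glued)   # FALSE

# as.data.frame() converts the matrix into a data frame.
as.df <- as.data.frame(glued)

# One-step alternative: select the columns by name.
direct <- toy[, c("length", "survived")]
```

Both as.df and direct end up as data frames holding just the length and survived columns.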

Look at the newly created data frame again with the View() or head() command
and you will see that it looks exactly the way we found it online. Now that we’ve played maid or butler and
cleaned the data, let’s plot the data and see the fruits of our labor.
plot(x=lobster$length, y=lobster$survived, sub="WHAT?! NOT HELPFUL AT ALL!!!!!!!!")
This scatterplot depicts the main problem when working with Boolean (0/1)
target variables. Every point lands on either 0 or 1, so scatterplots of this
nature are very unhelpful. They don’t help
flesh out the hidden story of the data.
We need to go back to the drawing board and hash out a new strategy for
modeling this data. We need to find a
way to transform the data. Let’s begin
by thinking about what we are attempting to show in a graph. Logistic models describe proportions, so let’s
try finding a way to calculate the proportion of lobsters that
survived while being tethered at each length.
library(data.table)
lobster <- as.data.table(lobster)
lobster.die <- lobster[, sum(survived==0), by=length]
lobster.survive <- lobster[, sum(survived==1), by=length]
I just introduced a new package into the fray via the library() command. I already had it installed on my computer,
but if you have not yet done so, use the install.packages() command that we learned in previous
modules. The data.table package lets us subset and summarize our data inside the
square brackets [ ]. If you are in
any way familiar with the apply()
family of commands, you will notice that this method of applying a function
to each group of rows (here, each length) is a little bit easier, and the computer can usually crunch the
numbers a lot faster. If you haven’t
yet, take a look at the newly created data tables (lobster.die and
lobster.survive). Their second columns were given
the default name V1, so it is time to fix that.
colnames(lobster.die) <- c("length", "total.die")
colnames(lobster.survive) <- c("length", "total.live")
This code is rather self-explanatory. It changes the names of the columns to whatever floats your boat. Fair warning: depending on your version of data.table, R may scream at you with a bunch of warnings, but you can ignore them for now. They are just letting you know that renaming columns this way makes a copy of the whole table, which could be a problem with an extraordinarily large dataset. (data.table’s setnames() function renames columns without making that copy.) For a small dataset like ours, your computer can cope with holding the extra copy in memory for a little while.
Alright, let’s utilize the cbind() command and merge together the useful parts of each data table.
lobster2 <- cbind(lobster.survive, lobster.die$total.die)
colnames(lobster2) = c("length", "total.live", "total.die")
We had to rename the columns because the total.die column name got lost in the
carry over, but that gave us another chance to become more acquainted with the colnames() command. Now we can see how many of the lobsters
survived while tethered and how many died while tethered and, more
importantly, how many at each length. However, we
need the total number of lobsters at each length in order to find the
proportion that survived at each length. Well, that is an easy fix. We merely add the number that lived to the
number that died at each length and create a new column with that information
in it. It sounds like a daunting
task, but with the right tools these jobs become rather simple. Let me introduce yet another package to help
us out.
library(dplyr)
lobster3 <- mutate(lobster2, total = total.live+total.die)
This mutate() command is quite a useful tool. It takes the data from whichever column or columns you choose, performs some calculation on them, and puts the output in a brand new column whose name you give inside the mutate() command. Now that we have the lengths, the number that survived at each length AND the total number at each length, we can plot the proportions as follows.
plot(x=lobster3$length, y=lobster3$total.live/lobster3$total,
main="Lobster Survival Data Logistic Curve",
sub="Clearly a linear model won't fit the data")
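If you would rather not install any packages, the counting steps above (survivors, deaths, totals and proportion at each length) can also be sketched in base R. This is only an illustration on made-up toy values, not the lobster data, and tapply() is my substitute for the data.table and dplyr calls used in this post.

```r
# Toy data: one row per lobster (values are made up for illustration).
toy <- data.frame(length   = c(30, 30, 30, 40, 40, 50, 50, 50),
                  survived = c(0,  0,  1,  0,  1,  1,  1,  0))

# Count survivors and deaths within each length.
total.live <- tapply(toy$survived == 1, toy$length, sum)
total.die  <- tapply(toy$survived == 0, toy$length, sum)

# Assemble one row per length, then add the total and the proportion.
toy.counts <- data.frame(length     = as.numeric(names(total.live)),
                         total.live = as.vector(total.live),
                         total.die  = as.vector(total.die))
toy.counts$total     <- toy.counts$total.live + toy.counts$total.die
toy.counts$prop.live <- toy.counts$total.live / toy.counts$total

toy.counts
```

The prop.live column plays the same role as total.live/total in the plot above.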
Theoretical example of what a logistic function should look
like.
See how it forms a sigmoidal, S-shaped curve.
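For the curious, here is a short sketch of the function behind that S shape, p = 1 / (1 + exp(-(b0 + b1*x))). The helper name logistic and the default values b0 = 0 and b1 = 1 are my own choices for illustration; plogis() is R's built-in version of the same curve.

```r
# Logistic function: squeezes any number into the range 0 to 1.
# b0 (intercept) and b1 (slope) are illustrative defaults, not fitted values.
logistic <- function(x, b0 = 0, b1 = 1) {
  1 / (1 + exp(-(b0 + b1 * x)))
}

x <- seq(-6, 6, by = 2)
p <- logistic(x)

# Values climb from near 0, through 0.5 at x = 0, up toward 1.
data.frame(x = x, p = round(p, 3))

# Check against R's built-in logistic distribution function.
matches.builtin <- isTRUE(all.equal(p, plogis(x)))
```

Running curve(logistic(x), from = -6, to = 6) afterwards should draw the S-shaped curve itself.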
Now that we know what we are looking for and have set
ourselves a goal, let’s go and create that logistic model for our lobster data.
# weights=total tells glm() how many lobsters each proportion is based on
glm.out <- glm((total.live/total) ~ length, family=binomial(logit), weights=total, data=lobster3)
plot((total.live/total) ~ length, data=lobster3)
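ord <- order(lobster3$length)  # sort by length so the fitted line is drawn left to right
lines(lobster3$length[ord], glm.out$fitted.values[ord], type="l", col='red')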
title(main="Lobster Survival Data with Red Logistic Regression")
It seems to look pretty much like that sigmoidal S-shaped
curve, ergo it appears to be a good fit!
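As a closing sketch, here is how a grouped logistic fit can be checked and then used for a prediction. Everything in it is an assumption for illustration: the toy counts are made up, and the cbind(successes, failures) response is simply another common way of writing the same model as proportions with weights.

```r
# Toy grouped data (made-up counts, not the lobster data).
toy <- data.frame(length     = c(30, 35, 40, 45, 50),
                  total.live = c(1, 3, 5, 8, 9),
                  total.die  = c(9, 7, 5, 2, 1))
toy$total <- toy$total.live + toy$total.die

# Form 1: proportion as the response, group sizes as weights.
fit.prop <- glm(total.live / total ~ length, family = binomial(logit),
                weights = total, data = toy)

# Form 2: two-column response of (survivors, deaths).
fit.counts <- glm(cbind(total.live, total.die) ~ length,
                  family = binomial(logit), data = toy)

# Both forms should give the same intercept and slope.
same.fit <- isTRUE(all.equal(coef(fit.prop), coef(fit.counts), tolerance = 1e-6))

# Predicted survival probability at a hypothetical length of 42.
p42 <- predict(fit.counts, newdata = data.frame(length = 42), type = "response")
```

With type = "response" the prediction comes back as a probability between 0 and 1 rather than on the logit scale.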