Monday, July 20, 2015

R & Python Module 4





Logistic Regression





In this module we are going to learn about a type of linear regression: logistic regression.  In statistics, standard linear regression is a tactic of determining a model that dictates the relationship between the scalar dependent variable Y and one or more explanatory variables (or independent variables) typically denoted as X.  In this regular version of regression both the X and the Y variables are continuous types of data.  An example of simple linear regression is trying to determine the length of some specific breed of fish based off the weight of the fish alone.  Notice how both of these attributes of the fish are continuous data types.  Logistic regression varies from this nature of regression in one simple way: the target variable (i.e. the Y variable) is categorical in nature but more importantly Binomial or more specifically Boolean.   A rather complicated version of logistic regression can be found in sports.  In American football, the weather can play a huge role in determining whether the placekicker will make the attempted field goal or not.  What makes this problem complex is that there are many factors of the weather that may or may not be interacting with each other causing different results.  Nonetheless, the desired target variable is intrinsically Boolean – 1 for if the field goal is complete or 0 if the field goal is incomplete.  The problem with this kind of regression is that if you merely plot the points of success on the graph then there will be a dichotomization of points – some mapped to 1 and the rest mapped to 0.  This does not remotely look anything like a curve whatsoever.  Allow me to illustrate my point with the utilization of an example from the Lobster Survival dataset.  Go to http://www.stat.ufl.edu/~winner/datasets.html and scroll down to the section marked 1D and labeled Binary Response Regression; you will find this to be the last link in the section.  Download and save the data into either Notepad or Text Wrangler (Windows or Mac users respectively) and follow along with the following tutorial.  Wait!  Before you leave the website, make sure to click the link that says Description.  We need to read what the dataset is about.  Knowing is half the battle! (any GI Joe fans out there?)
A lot of times datasets will be found as links on a website.  Click on them and save them into a text reader program like Notepad.  Make sure you know to which directory (i.e. folder) you save it because you will need to locate it in order to read it into R.  Use the getwd() command with empty parentheses for R to tell you what your current Working Directory is.  In other words, this command tells you which folder is in your hand, have opened and are currently looking at.  If you are not sure if you saved the file of interest in that particular directory then use the dir() command with empty parentheses.  This will have R print out a list of all the files in the current working directory.  If it is not there then you will need to change the directory to the one where the file in mind is located – this requires the setwd() command where you insert the path to the new directory inside the parentheses. 
I manually added column names in the text file (via Notepad), saved the data as a .txt file, named it lobster.txt and stored it in my Documents folder. Read it into R with the read.csv() command.


getwd()
[1] "C:/Users/Joshua/Desktop"

setwd(“C:/Users/Joshua/Documents”) #This code should have no output because it is merely changing directories so nothing visible should happen


I want to point out that my code in the setwd() command should look different than yours.  I am using my personal computer to retrieve the .txt file so the pathway inserted in the parentheses in unique to my computer.  Therefore, if you try running my code verbatim then you will get an error stating something along the lines that R cannot change working directory.  What this really means is that no such directory exists on your computer.  If you need to, manually locate the file on your computer, right click on it, click on Properties and then once in the properties window look for something that says Location or Folder Path and then copy and paste that into the setwd() command.  But make sure you put quotations around the file path as well as change all the forward slashes ( \ ) with backslashes ( / ).  R uses backslashes.  Okay, let’s finally read this text file into R and store it into an object named lobster.  Honestly, you can name it whatever you like – heck, name it bob or steve if it tickles your fancy.  I simply chose to name the object lobster because that is what the data is about.


lobster <- read.csv(“lobster.txt”, header=T, sep=” “)


lobster = pd.read_csv("C:/Users/Joshua/Documents/lobster.txt", sep=" ")

If you are using R Studio you should now see an object named lobster in your global environment.  I want to take the time to point out here the use of the code header=T.  If you do not set it to true, then R will automatically default the selection to false.  This code will read any available column names and assign them to their designated column in the newly created data frame.  The sep=” “ code tells R how the data are separated in the text file.  In general, a CSV file (Comma Separated Value file) is laid out in such a way that all the data points are separated by commas.  The default for the sep command is set for commas, but can be changed to fit whatever punctuation separates the data.  It is not uncommon, however, to find that the data is separated by white space, tabs, or semi colons.  In our case, the only things separating the data were white spaces (i.e. the character created when pushing the spacebar), hence I left a blank white space between the quotation marks.  Now it is time for us to don our yellow rubber gloves and do some cleaning.  The data is messy in the sense that there are extra unneeded columns of NAs that need to be cleaned up and taken out.  Go ahead and use the View() or head() commands to see what the data looks like.  You will notice that there are multiple columns with NAs (data that is Not Available aka missing data).  In this case it is not missing, the format of the data was a little wonky so the computer read in extra columns that were not a part of the dataset.


length <- lobster$length

survived <- lobster$survived

lobster <- cbind(length, survived)

lobster <- as.data.frame(lobster)

All I did here was select the desired columns and shove them into an object which I decided to name after the columns.  Next I wanted to put them together into a data frame like structure with only the length and the survived columns so I used the cbind() command (short for column bind) which glues the two columns side by side.  Unfortunately, the two columns weren’t vectors or single column data frames.  They were lists of integers, so I used the as.data.frame() command in order to coerce the data type to change from list to data frame.  We need to work with data frames in order to do our analyses.  Look again at the View() or head() commands of the newly created data frame and you will see that it is exactly the way we found it online.  Now that we’ve played maid or butler and cleaned the data, let’s plot the data and see the fruits of our labor.


plot(x=lobster$length, y=lobster$survived, sub=”WHAT?! NOT HELPFUL AT ALL!!!!!!!!)
 
 



This scatterplot depicts the main problem when working with Boolean target variables.  Scatterplots of this nature are very unhelpful.  They don’t help flush out the hidden story of the data.  We need to go back to the drawing board and hash out a new strategy for modeling this data.  We need to find a way to transform the data.  Let’s begin by thinking about what we are attempting to generate for a graph.  Logistic models show proportions, so let’s try finding a way to create the proportion of the number of lobsters that survived while being tethered for each length size.


library(data.table)

lobster <- as.data.table(lobster)

lobster.die <- lobster[, sum(survived==0), by=length]

lobster.survive <- lobster[, sum(survived==1), by=length]
 


I just introduced a new package into the fray via the library() command.  I already had it installed on my computer, but if you have not yet done so then use the install.packages() command that we learned in previous modules.  The data.table package allows us to subset our data frames with the square brackets [ ]. If you are in any way familiar with the apply() family of commands then you will notice that this method of applying a function across all the rows is a little bit easier and the computer can crunch the numbers a lot faster.  If you haven’t yet, then take a look at the newly created data tables (lobster.die and lobster.survive).  They are missing their second column names – time to fix that.


colnames(lobster.die) <- c("length", "total.die")

colnames(lobster.survive) <- c("length", "total.live")


This code is rather self-explanatory.  It changes the names of the columns to whatever floats your boat.  To give fair warning, I want to let you know that R will scream at you and give you a bunch of warnings, but you can ignore that for now.  It is just letting you know that if you were working with an extraordinarily large dataset then it might not work.  This is due to the fact that it has to read in and store all the data when changing the column names.  But for such a small dataset like the one we are working with, your computer can cope with holding the data in its memory for a little while.  Alright, let’s utilize the cbind() command and merge together the useful parts of each data table.


lobster2 <- cbind(lobster.survive, lobster.die$total.die)

colnames(lobster2) = c("length", "total.live", "total.die")


We had to rename the columns because the total.die column name got lost in the carry over, but that gave us another chance to become more acquainted with the colnames() command.  Now we can see how many of the lobsters survived while being tethered and how many died while being tethered and more importantly at each length.  However, we need the total number of lobsters at each length in order to find the proportion of them that survived at each length.  Well, that is an easy fix.  We merely add the number that lived with the number that died at each length and create a new column with that information in it.  It kind of sounds like a daunting task, but with the right tools these jobs become rather simple.  Let me introduce yet another package to help us out.


library(dplyr)

lobster3 <- mutate(lobster2, total = total.live+total.die)


This mutate() command is quite a useful tool.  It allows us to take the data from whichever column or columns are desired and do some sort of mathematical operation on it and then takes the output and puts in a brand new column that you named within the mutate() command.  Now that we have the lengths, the number that survived for each length AND the total number for each length we can plot the proportions as follows.





 
plot(x=lobster3$length, y=lobster3$total.live/lobster3$total,
     main="Lobster Survival Data Logistic Curve",
     sub="Clearly a linear model won't fit the data")

 
 
 







It is somewhat difficult to see, but the data points create the stereotypical sigmoidal S-shaped curve that is associated with logistic regression.  This would suggest that creating a logistic model is a wise choice and should fit the data fairly well.  So let’s make that our new goal: create a logistic model, fit it to our data, and plot the line that from the model that best fits the points on our graph above.  Before we move forward, I would like to point out that the Lobster Survival dataset is a real world dataset and not all datasets create the ideal, theoretical shapes that we are accustomed to seeing in textbooks or on educational websites.  So I would like to emphasize that it is okay to see your data generate distorted or not so neat looking visualizations.



Theoretical example of what a logistic function should look like. 
See how it looks like a sigmoidal S-shape curve







Now that we know what we are looking for and have set ourselves a goal, let’s go and create that logistic model for our lobster data.


glm.out <- glm((total.live/total) ~ length, family=binomial(logit), data=lobster3)



The glm() command is how we create a model for our data.  It is short for Generalized Linear Model and it allows us to choose the family type of linear models.  There are many kinds of linear models so we needed to set the family code equal to binomial(logit) so R knows that we are fitting a logistic regression.  Also, whenever a person is trying to create a linear model of some sort in R the general layout of the code is: Y ~ X and is read Y regressed on X.  And it is customary to call X the predictor variable and Y the criterion variable or target variable.  This is just some statistical jargon and I thought it might be helpful to learn about it here in case you run across it elsewhere in your research and studies.  Now let’s see how well our model works.


plot((total.live/total) ~ length, data=lobster3)

lines(lobster2$length, glm.out$fitted, type="l", col='red')

title(main="Lobster Survival Data with Red Logistic Regression")
 
 




It seems to look pretty much like that sigmoidal S-shaped curve, ergo it appears to be a good fit!
 


 







 


No comments:

Post a Comment