Tuesday, July 28, 2015

R & Python Module 2

When you research a coding question online, people will often provide a small sample data frame to illustrate the difficulty they are having. This module will familiarize you with how these data frames are created, so that you can read the code easily and get right to the meat of answering your questions. In other words, participating in this module will improve both your code reading comprehension and your data frame building skills.



So now that we've explored a dataset in module 1, we're going to look at a way to try out techniques and check that they work before applying them to a real dataset that you might be working with. Basically, I'm thinking of the good old "guess and check" method that we often need to use when solving math problems. It's also similar to giving yourself a simpler problem first and then extending the method to a more complex one.

After we create a data frame, we will then manipulate it in various ways that may be applicable to the needs in a real world data problem. In module 3 you will work with a business problem and have to manipulate some datasets so you will use some of these skills right away.



So, let's get right to creating a dataset from scratch! Let's say that I want to create a data frame that has four columns. I want the first column to be the numbers 1 through 10.

In R you can create this data frame right away, but in Python you first need to import some packages. Both approaches produce a data frame with labeled columns.

In R, let's call this data frame: dfR.

To get started we use this code:


dfR <- data.frame (a=1:10)


This tells R that it is creating a data frame and that the first column will be labeled a.

You can either click on the name dfR in the environment or type


 View(dfR)


to see the data in the data frame. Or you can click on the arrow next to dfR and it will display the vector. This also shows the type of data you have: in this case, a vector of integers.

In Python, let's call the data frame: dfP.

To get started we use this code, which imports the packages needed to create the data frame.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


Next we will create an array of data and store it in a variable.

a = np.arange(1,11)

Then we put the array in a container called data (a Python dictionary). 'a' is the name of the column and a is the array of data itself. This step gets it ready to put into the data frame.

data={'a':a}

Now we can place the container of data in the data frame.

dfP = pd.DataFrame(data)
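
If you'd like to check your work, the three steps above (array, dictionary, data frame) can be run together as one short script. This is only a sketch for checking; the two print calls at the end are my additions and not part of the steps above.

```python
import numpy as np
import pandas as pd

a = np.arange(1, 11)       # integers 1 through 10 (the stop value 11 is excluded)
data = {'a': a}            # dictionary: column name -> array of values
dfP = pd.DataFrame(data)   # one labeled column, one row per value

print(dfP.shape)           # (number of rows, number of columns)
print(dfP['a'].tolist())   # the values in column a as a plain list
```
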


Ok so now we have one column of data in our data frame in each program.


Now, let's say that we want to create another column that starts at 10 and ends at 1. Before you look ahead, think to yourself: how would I do this in each language? You know there is a very different approach in each language, but what is it for each? Try some things in your consoles and see if you can figure it out on your own first. I'd like to encourage you to do this from here on out with all of the column creations in this module.

Did you figure something like this out? If you found a little different way, but it works, that's great too! Please share it in the comments below.

For the second column in R you just add another column with the data going from 10 to 1.



dfR <- data.frame (a = 1:10, b = 10:1)

For the Python code we need to create that array to get us started. You might expect Python to mirror the R code and try b = np.arange(10,0). Try that in your console. It returns an empty array rather than the one we are looking for, so that's not going to work.

Upon researching, you would find that np.arange accepts a third argument that gives us the column we want. We tell Python that we want an array starting at 10 and counting down toward zero, and we add a third number: the step (the counterpart of the by parameter in R's seq function), which says how far to move from one number to the next. A step of -1 counts down by one. Note that the stop value (0) is never included, so the array ends at 1. So this is the code we now need.



b = np.arange(10,0,-1)
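
To see the difference the third number makes, here is a small side-by-side sketch. The variable name default_step is just one I picked for illustration.

```python
import numpy as np

# With no step given, arange counts upward by 1, and there is
# nothing to count from 10 up to 0, so the result is empty.
default_step = np.arange(10, 0)

# A step of -1 counts downward; the stop value 0 is excluded.
b = np.arange(10, 0, -1)

print(default_step.size)   # how many values the first attempt produced
print(b.tolist())          # the values of the second attempt
```
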

We now put this array into the data container we were working with up above.


data={'a':a, 'b':b}


The data dictionary now contains both columns, so we can run the exact same data frame code and the result will include our new column.


dfP = pd.DataFrame(data)

And now we have two data frames that look like this.


[Screenshots: RStudio output, Spyder output, IPython notebook output]

Now I'd like to create a column of the same value repeated. Let's put a column of 5's in there.
Again, try to take some time to think about how you might create this or research it online to see what you think.

So, here's what I found.  For R you want to use the rep function and add this to your data frame.


dfR <- data.frame (a = 1:10, b = 10:1, c =rep(5,10))

Looking at the rep function, what do you think each of the numbers means? If you don't know right away, try running the code and then see what you think. Posing these types of questions to yourself as you're learning will help you to develop a more comprehensive understanding of the code as well as flexibility in your problem solving.

So you can see that the number 5 is repeated 10 times. The first number (5) is the value to repeat and the second number (10) is how many times to repeat it. Since our data frame has 10 rows, this fills the whole column with 5's. This is exactly what we want! Great!

Next we will write the Python code to make the column of 5's. Notice that this time we are using np.array rather than np.arange ("a range").



c = np.array([5.] * 10)


This code builds a Python list that contains 5. ten times and turns it into an array (the "5." is a type of data called a float, a number with a decimal point).

See if you can figure out what is missing in the following code, and complete it to get a three column data frame.
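
If you found a different way to build the column of 5's, it may well be one of these. The sketch below compares the approach above with two NumPy functions, np.full and np.repeat; the names c_full and c_rep are just labels I chose for the comparison.

```python
import numpy as np

c = np.array([5.] * 10)      # the approach used above: a list of ten 5.0 values
c_full = np.full(10, 5.0)    # NumPy's "fill an array with one value" function
c_rep = np.repeat(5.0, 10)   # closest in spirit to R's rep(5, 10)

print(c.dtype)               # the data type of the array
print(np.array_equal(c, c_full) and np.array_equal(c, c_rep))
```

All three produce the same ten floats, so use whichever reads most clearly to you.
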


data={'a':a, 'b':b,    }
dfP = pd.DataFrame(    )

Ok, so let's create that last column.  Let's say that we want to create a column of numbers from 3 to 30 counting every 3.

In R, see if you can decide where to place this piece of the code, d=1:10 *3, to complete our data frame. R builds the sequence 1:10 first and then multiplies each of the ten numbers by three, giving 3, 6, and so on up to 30.


dfR <- data.frame (a = 1:10, b = 10:1,c =rep(5,10))


And in Python, the code is similar to the code we used for array b, but you'll want to check your work carefully. Building on the pattern that we just used in R and the arange code we saw earlier in Python, you might think of d = np.arange(1,10,3). Let's find out what that code does. Go ahead and try it out in Python. Once you see what's happening there, you'll quickly find a way to fix it and get the same column in Python as we did in the R data frame.

Finally you'll want to put it all together into one data frame.


dfR <- data.frame(a = 1:10, b = 10:1, c = rep(5,10), d =1:10 *3)

dfP = pd.DataFrame(data)
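
If you got stuck on the Python side, here is one possible way to put the whole thing together. Treat it as a sketch rather than the only answer: the arange call for d is my own solution to the exercise above (multiplying np.arange(1, 11) by 3 would work just as well).

```python
import numpy as np
import pandas as pd

a = np.arange(1, 11)         # 1 through 10
b = np.arange(10, 0, -1)     # 10 down to 1
c = np.array([5.] * 10)      # ten 5.0 values
d = np.arange(3, 31, 3)      # start at 3, stop before 31, count by 3

data = {'a': a, 'b': b, 'c': c, 'd': d}
dfP = pd.DataFrame(data)

print(dfP)
```
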

Ok so now I want to get a basic idea of the summary of this data. Looking at the descriptive statistics will give me a nice overview.



summary(dfR)

dfP.describe()


What do you notice between the two outputs here?

The minimum, quartiles, median, mean and maximum appear in both, but the R output does not have standard deviations and counts. Count is straightforward in the example data frames we have created (every column has 10 values), but we do need a way to look at standard deviation in R.

We could calculate the standard deviations of each column one at a time, which would be fine for our small data frame, using this code.



sd(dfR$a)


This tells R to find the standard deviation (sd) of column a in the data frame dfR. The $ points to a particular column within the data frame.

But if we had a data frame with hundreds of columns, finding the standard deviations one at a time would be a tedious waste of time, so we need to use something from the family of apply functions to calculate all of the standard deviations at once.



apply(dfR,FUN=sd,2)



This code tells R to apply a function, standard deviation (FUN=sd), to all the columns (the 2) of the dfR data frame.  How do you think you would apply something to all of the rows?

Try it... Write code that will give you the standard deviations of the data across each row.
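
For comparison, on the Python side pandas can compute the standard deviation of every column in one call. The sketch below rebuilds the four-column data frame so that it runs on its own; the arange call for column d is one possible solution to the earlier exercise, and col_sd is a name I picked. Like R's sd, the pandas std method divides by n - 1 by default, so the numbers should agree with the R output.

```python
import numpy as np
import pandas as pd

dfP = pd.DataFrame({
    'a': np.arange(1, 11),
    'b': np.arange(10, 0, -1),
    'c': np.array([5.] * 10),
    'd': np.arange(3, 31, 3),
})

# axis=0 means "work down each column", giving one value per column
col_sd = dfP.std(axis=0)
print(col_sd)
```
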
.
.
.

Push the flexibility in your thinking!

Remember the rationale for creating one of these datasets: it is a stand-in or "dummy" dataset for practicing your data science techniques. This is a great way to make sure that a technique works the way you want it to before applying it to a big set of data. That matters for time efficiency, because it can take a while for the computer to run certain operations on larger, more complex datasets. If you find out afterwards that what you were doing isn't really what you meant to be doing, you have wasted the time it took to run on the large dataset. Figuring this out on a small sample is a much better use of your time and definitely cuts down on the frustration.

So now let's practice a few techniques.

Let's find the mean of two of these columns and create a new column with that average as its data. We're going to call the new column: mean.

In R, we first copy dfR into dfR2 so that the original data frame stays unchanged:

dfR2 <- dfR
dfR2$mean <- rowMeans(dfR2[, c("a", "b")], na.rm=TRUE)

In Python:

e = dfP[["a", "b"]].mean(axis=1)


data={'a':a, 'b':b, 'c':c,'d':d,'mean':e}

dfP2 = pd.DataFrame(data)

Take a look at the values in the mean column. Do they make sense?
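
One way to check is to reason it out first: column a counts up from 1 while column b counts down from 10, so in every row the two values add up to 11 and their average should be 5.5. A minimal sketch to confirm this, using only columns a and b so that it runs on its own:

```python
import numpy as np
import pandas as pd

dfP = pd.DataFrame({'a': np.arange(1, 11), 'b': np.arange(10, 0, -1)})

# axis=1 means "work across each row", giving one value per row
e = dfP[['a', 'b']].mean(axis=1)

print(e.unique())   # the distinct values that appear in the new column
```
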



I have just found out that I want to add a column representing the year that the data was created to every row of my data frame. To do that, I will add a column of repeated values to the data frame I have already defined.

In R, I will create one column and bind it with the existing dfR data set. You could also just simply insert the column into the data frame, but this is a new way: one in which we don't change the original data frame, dfR.

New column:

year <- c(rep(2015,10))


cbind is for column bind:

dfR2 <- cbind(dfR,year)


What do you think rbind means?
.
.
.


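Now, in Python we are going to use the .shape attribute. dfP.shape holds the number of rows and the number of columns, so dfP.shape[0] is the number of rows, and multiplying the list [2015] by it gives a list with one 2015 for every row. Notice that the name of our original data frame is there, so we are working on that particular data set.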


dfP['year'] = dfP.shape[0]*[2015]


Running that command did not create a new data frame. It added a new column to the existing dfP, using .shape only to find out how many rows to fill. Now call the data frame.

dfP

You will see that you have a new column in the original data frame entitled year.
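
If you found a different way to add the year, that's great too. As a comparison, pandas will also fill a whole column when you assign a single value to it. The sketch below shows both side by side on a one-column data frame; year_scalar is a column name I made up just for the comparison.

```python
import numpy as np
import pandas as pd

dfP = pd.DataFrame({'a': np.arange(1, 11)})

n_rows = dfP.shape[0]            # shape is (rows, columns); [0] picks the rows
dfP['year'] = n_rows * [2015]    # a list with one 2015 per row
dfP['year_scalar'] = 2015        # a single value is repeated for every row

print(n_rows)
print(dfP['year'].equals(dfP['year_scalar']))
```
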

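You can give the data frame a second name, if you'd like. We will do that here, to keep our naming conventions consistent between R and Python. Keep in mind that this does not copy the data: dfP2 and dfP are two names for the same data frame. Use dfP.copy() if you want an independent copy.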

dfP2 = dfP

Call it.

dfP2

dfP2 should look exactly the same as the most recent dfP did.
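
The reason they match is that dfP2 = dfP gives the same data frame a second name instead of making a copy. Here is a small sketch of the difference, using a tiny three-row data frame and a hypothetical dfP_copy to stand for an independent copy:

```python
import pandas as pd

dfP = pd.DataFrame({'a': [1, 2, 3]})

dfP2 = dfP               # a second name for the very same data frame
dfP_copy = dfP.copy()    # an independent copy of the data

# Change dfP after the fact and see which of the others follows along
dfP['year'] = dfP.shape[0] * [2015]

print(dfP2 is dfP)                   # True: same object
print('year' in dfP2.columns)        # True: the change shows up under both names
print('year' in dfP_copy.columns)    # False: the copy was not affected
```

This is different from the R code above, where cbind returns a new data frame and leaves dfR untouched.
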

In the next module we are going to ask you to combine many of these skills to help in analyzing a business problem. If you need to test out a technique prior to using it on the larger data set, remember this option here and create your own data frame to practice on.



