Monday, July 27, 2015

R & Python Module 3

Business Intelligence (BI) is an important area in data science. In this module, we are going to show you how to begin making sense of a typical business data set. As you prepare to import (R) or load (Python) your data set, a few questions should be percolating in your mind: Who are the stakeholders I will be reporting to (my audience), and what do they want to find out and understand from this data?


In this module your stakeholders are the owners of a ski lift in Arapahoe Basin, Colorado. They would like to find out who their customers are. Roughly where do they come from? How much are they spending? How long are they staying? Ultimately they want you to help them figure out how to get more of their "best customers" and how to find new ones.



So that's where you come in. The first step is to get the data. As we discussed in the intro, data can come to you in various ways. This particular data set lives in a GitHub repository. <more info about GitHub>

Step 1: GitHub steps

Step 2: Import data into R or load data into Python

We are naming the data Y2014 for the year 2014's data at the ski lift. This is a common convention for naming data from a particular year. Use the Y: names that begin with a digit aren't valid object names in R (or Python), so the prefix helps avoid problems with the structure of the data later. More on the structure of the data soon!


Y2014 <- read.csv("~/uno/ski-lift/.txt")

import pandas as pd
Y2014 = pd.read_csv('~/uno/ski-lift/.txt')
(GitHub path/address)


Step 3:  

R: Copy this code into a script window and save the script as Ski-lift module.
Python: Copy this code into a new file in the Spyder editor window and save as Ski-lift module.

Step 4:  Take a look at your data. 

R:  You can either click on the name of the data (not the arrow) in the Environment tab or you can use this code in the console.

Note: Clicking on the arrow will give you the structure of the data which will be helpful a little later in this module.

Python: Simply type the name of your data and it will show you all of your data in the IPython console.


View(Y2014)

Y2014
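If printing the whole data frame is more than you need in Python, a couple of standard pandas methods give a quicker look (nothing here is specific to this data set):

Y2014.head()   # the first five rows
Y2014.info()   # column names, types, and row count, i.e. the "structure" of the data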

Notice that we have 11 columns. As Josh mentioned in the first module, you will always want to make sure that you understand the information that is represented in each column. 

All of the columns are pretty self-explanatory here, except for what the charges cover and whether the customer number has any significance.

I asked the business owner for clarification and they provided me with the following information:

1. Charges represent the per day charge for that group.

2. The customer number does have a special meaning:
Going from left to right the first number represents the month of the reservation, the second and third the day of the month, and the fourth the number of days the party was at the ski lift. 
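Here is a minimal sketch in Python of how you might pull those pieces apart, assuming the customer numbers follow the four-digit pattern described above; the column name 'customer' is a placeholder, so substitute whatever your data set actually calls it:

num = str(1153)             # e.g. customer number 1153
month = int(num[0])         # 1  -> a January reservation
day = int(num[1:3])         # 15 -> the 15th of the month
days_stayed = int(num[3])   # 3  -> a three-day stay

# Applied to a whole pandas column (hypothetical column name 'customer'):
Y2014['month'] = Y2014['customer'].astype(str).str[0].astype(int)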

Step 5: Okay, I'm thinking about my data set, and knowing that it's from the year 2014, I start thinking that I may want to compare it with a previous year's data for a more meaningful analysis and future predictions.



So if I want each row of my data to be identified as being from 2014 how do I do that? I need to add a column of all 2014's.

c() stands for combine (concatenate), and rep() repeats 2014, 173 times (the number of rows we have).


year <- c(rep(2014,173))



Step 6: And now we will change our Y2014 data by combining the original DF (data frame) with the new column named year.


Y2014 <- cbind(Y2014,year)
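If you are following along in Python, a rough equivalent of Steps 5 and 6 (assuming Y2014 is a pandas DataFrame) is a single assignment; pandas broadcasts the scalar 2014 to all 173 rows for you:

Y2014['year'] = 2014   # adds a year column filled with 2014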


Now you're ready for any multi-year comparisons.


Step 7: Often in data science problems, viewing your data on a map that shows the geospatial relationships will be very informative for your stakeholders.




You will need to import (R) or load (Python) a public data set of latitude and longitude coordinates into your work.

As we have previously discussed (please see the beginning of module 1) there are many ways to acquire a dataset to work with in your analyses. We are now going to need to get a public dataset and will then later apply some of its information to the private dataset that we have been given by the ski-lift company (step 8).

First we have to find this public dataset. We did a Google search for <    > and found the dataset we wanted at <link>. I downloaded it onto my computer and stored it in a folder that I wanted to access for this analysis. Important note: this will of course be different on each person's computer, so get to know your machine and decide where you want to keep your information. I decided to store mine in a folder labeled ski within another folder named uno. This .txt file is comma-separated, so it's perfect for reading into R or Python as a data frame.

In the RStudio Environment window, you will see an Import Dataset button. Click it and it will ask you to choose either "From Text File" or "From Web URL". For this one we downloaded a text file, so that's what we'll pick. You will then pick the file from your computer and, voila, R creates a line of code with the file path. I highly recommend copying this code into your R script for simple loading henceforth. The following is the path that my computer needed. I am storing the data in coords, for the coordinates of latitude and longitude.


coords <- read.csv("~/uno/ski/coords.txt", stringsAsFactors=FALSE)
coords = pd.read_csv("~/uno/ski/coords.txt")

Step 8: Now we want to match each location in the ski-lift data frame with the latitude and longitude coordinates in the full data frame of coordinates that we just imported in step 7.


We really don't want to do that by hand, so how do we do this?

Upon inspection, we notice that both datasets have zip codes as well as city/state combinations. Let's go back to our question from the business: "Where are my customers coming from?" Either match would give us a pretty good answer, but zip code breaks things down a little more precisely in cities with multiple zip codes, and it saves us from deciding which latitude/longitude combination the computer should pick when one city/state pair covers several zip codes. So matching the company dataset's zip codes to the coords dataset's zip codes seems like the most efficient approach.

Now, the other interesting part is thinking about how to match these rows up: we will need some lat/long combinations to be repeated for customers who come from the same zip code, and of course we won't need all of the zip codes represented in the coords dataset.


So how do we repeat some rows and completely delete others? 

Well, R has an amazing way to do both at one time!
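To see why one command can do both jobs, here is a tiny made-up example in Python (the zip codes and coordinates below are invented purely for illustration and are not taken from either dataset); an inner merge repeats a coordinate row for every customer who shares its zip code and drops coordinate rows whose zip code never shows up in the customer data:

import pandas as pd

customers = pd.DataFrame({'zip': [80435, 80435, 68104]})   # two parties share a zip code
coords_toy = pd.DataFrame({'zip': [80435, 68104, 99999],
                           'latitude': [39.6, 41.3, 0.0],
                           'longitude': [-105.9, -96.0, 0.0]})

print(pd.merge(customers, coords_toy, on='zip'))
# The 80435 coordinates appear twice; the 99999 row is dropped entirely.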


First of all, we need the column names in both datasets to match. In the coords data frame the zip code column is named zip_code, but our company just labeled it zip. Let's go with the company's name: if we stick with their labels throughout, we won't have to explain any new definitions or names when we present our analysis to them later.

Renaming the columns of the coords dataset.


colnames(coords) <- c("zip","latitude","longitude","city","state","county")
coords.columns = ['zip','latitude','longitude','city','state','county']

And now we simply use R's merge command (it comes with base R, so no extra package is needed) to attach the matching coordinate information from the coords dataset to each row of the ski dataset.



sample1 <- merge(Y2014, coords, by="zip")   # match each ski-lift row to its coordinates by zip code
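The Python counterpart (assuming Y2014 and coords are both pandas DataFrames that now share a zip column) is pandas' own merge, which also defaults to an inner join:

sample1 = pd.merge(Y2014, coords, on='zip')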
