In this first module we want to learn how to explore data and the steps it takes to do so. The first step is getting the data and importing it into R. This can be a frustrating task because datasets are stored in so many different places. Sometimes the data is stored publicly online, on GitHub for example, or needs to be downloaded from various websites. Often it comes with a package that we installed in R and stays hidden until we invoke it with a simple command. Or, when working with businesses in the real world, it is sent via email or flash drive, and we need to know where it is saved on our own computer so we can load it into R.
Once we have located the data and gotten our hands on it, we need to import it into R. In this module we are going to focus on a dataset that comes with an R package: diamonds. Some datasets, like AirPassengers, BOD, or ChickWeight, come with R when we first download the program, while others, like diamonds, nasa and world.cities, can only be accessed after installing additional packages with the install.packages() command.
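As a quick illustration of the first group, the datasets named above can be used in a fresh R session without installing anything. This is a minimal sketch that assumes only a standard R installation:

```r
# These datasets ship with R itself, so no install step is needed.
class(AirPassengers)   # a monthly time series
head(BOD)              # first rows of a small data frame
dim(ChickWeight)       # number of rows, then number of columns
```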
Use the data() command with empty parentheses to list, with brief descriptions, the datasets available in the packages currently loaded in your R session. This list grows as you install and load packages that come with datasets. Let our foray into data science begin by installing our first package. Type the following command into the R console:
install.packages("ggplot2")
Upon executing the command you should see a bunch of words fly across the screen. Don’t be alarmed; this is normal. The messages are reporting, among other things, where the files are being stored on your computer. Having everything installed does not automatically mean we have access to the package. All we have done is gotten our hands on it. It is like purchasing a new book and setting it on the shelf with the rest of your books. In other words, we have added it to our library of packages. This brings us to our next command: the library() command.
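For ggplot2, the call is a single line. It is shown here as a minimal sketch that assumes the install step above completed without errors:

```r
# Attach ggplot2 for the current session so its functions and datasets are available.
library(ggplot2)
```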
We use the library() command when we want to use a package we have previously installed. It’s like grabbing that book off the shelf once we need what’s in it. After running library(ggplot2), we have access to all of the commands inside the package as well as any datasets hiding inside. The ggplot2 package is a really nice one. It contains functions for generating really spiffy graphs that could be rather difficult to create otherwise, as well as datasets like diamonds.
Before we import the diamonds dataset I want to give you a word of caution:
the library() and install.packages() commands look very similar and it is easy to mix them up. Try to be very conscious of the quotation marks: install.packages() requires them around the package name, while library() is normally written without them (it accepts the name either way). It happens a lot that we forget the quotes in install.packages(). R will scream at you with an error when this happens, so just add the quotes and eventually you will start to remember.
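A small sketch of why the quotation marks matter. It uses the stats package, which comes with R, so nothing is downloaded; not_a_defined_object is a made-up name used only for illustration:

```r
# library() accepts the package name with or without quotation marks.
library(stats)
library("stats")

# A bare name passed to an ordinary function is looked up as an object.
# If no object with that name exists, R stops with an error, which is
# what happens when install.packages() is given an unquoted package name.
result <- tryCatch(
  nchar(not_a_defined_object),
  error = function(e) conditionMessage(e)
)
result   # the text of the error message that R produced
```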
Okay, now that the package has been installed, let’s import that diamonds dataset. The command is rather simple:
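Assuming ggplot2 is installed and attached, a dataset that ships with a package is typically loaded with data(). The sketch below shows the usual form of that call; it is not a transcript of the original session, so printed output may differ slightly from what appears in this module:

```r
library(ggplot2)   # the package that carries the dataset

# Make the diamonds dataset from ggplot2 available in the global environment.
data(diamonds)
```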
This will create a new object named diamonds and it will appear in your global environment. Now mind you, this command will only work if you have both installed the ggplot2 package AND loaded the package with the library() command. I can’t say how many times I’ve been stymied (not to mention frustrated) because I forgot to reload the package with the library() command. Think of it this way: a person need only purchase the book once in order to place it on their shelf, but if they want to use it again they need to physically pull it off the shelf again.
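If you are ever unsure which of the two steps is missing, both can be checked from the console. A minimal sketch using only base R functions; is_installed and is_attached are names chosen here for illustration:

```r
# Is the package installed (sitting on the shelf)?
is_installed <- "ggplot2" %in% rownames(installed.packages())

# Is the package attached in this session (pulled off the shelf)?
is_attached <- "package:ggplot2" %in% search()

is_installed
is_attached
```

If the first value is FALSE, run install.packages("ggplot2"); if only the second is FALSE, run library(ggplot2).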
The library() command is our way of pulling the package off the shelf. Every time we exit R, it defaults back to its basic set of packages, so every time we restart R we need to reload any packages we want to use. Awesome, we have installed a new package, loaded it into R and extracted the diamonds dataset, so like divers in a cave let our spelunking and exploration of the cavernous data begin. We delve into the data by learning a few new functions: head(), tail(), and View(). Be wary, my young padawans: the “V” in View() is capitalized. These functions help us see our data frames and get a visual feel for how the data is laid out.
head(diamonds)
  X carat       cut color clarity depth table price    x    y    z
1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
tail(diamonds)
          X carat       cut color clarity depth table price    x    y    z
53935 53935  0.72   Premium     D     SI1  62.7    59  2757 5.69 5.73 3.58
53936 53936  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
53937 53937  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
53938 53938  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
53939 53939  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
53940 53940  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
These two commands show us all the columns along with the first 6 and last 6 rows of observations. The View() command shows the same data, except that it opens a separate spreadsheet-style window where you can scroll through far more rows and columns. To conserve space I intentionally did not show that output, because it would have taken up pages of text.
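Both functions also take an optional n argument for when six rows is not the amount you want. A short sketch, assuming the copy of diamonds that ships with ggplot2; first_three and last_ten are illustrative names:

```r
library(ggplot2)

# First three rows instead of the default six.
first_three <- head(diamonds, n = 3)

# Last ten rows.
last_ten <- tail(diamonds, n = 10)

nrow(first_three)
nrow(last_ten)
```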
Perhaps you start thinking to yourself, “Huh, I have a dataset that has hundreds of columns and I just want to look at the column names without printing rows of observations with them…what do I do?” Well, stress out no more, because the colnames() command will make your life so much easier. Simply pass the dataset to the command and it will spit out all the column names with no rows of data attached.
colnames(diamonds)
[1] "X"       "carat"   "cut"     "color"   "clarity" "depth"   "table"
[8] "price"   "x"       "y"       "z"
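Column names are only part of the picture. As an aside, base R has a few companion functions that summarize a data frame's shape and column types without printing the observations. The sketch assumes the copy of diamonds that ships with ggplot2, so the column count may not match the output above:

```r
library(ggplot2)

dim(diamonds)    # number of rows, then number of columns
ncol(diamonds)   # just the number of columns
str(diamonds)    # one line per column: name, type, and a few example values
```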
Okay, now that we have perused the data I am sure that questions are starting to percolate in your mind. When I first saw the diamonds dataset, I, Josh, as a guy, had no idea what some of the column names even meant. I had no idea that there were so many ways a person could categorize a tiny little rock. Anyway, if you ever come across this sort of problem you should really investigate and figure out what you are looking at. I mean, if you can’t understand what the categories are, then how can you perform any sort of meaningful statistical analysis of the data?
For example, a scatterplot of row number versus price would be a meaningless visualization, whereas a scatterplot of price versus carat paints a picture that explains a lot about the story of the data collected. A good way to solve this problem is to do a little research on the dataset. Datasets often come with a README file, or a PDF somewhere online, that explains in detail all the column names, or attributes, that make up the data. Comprehension of the data is a huge aspect of data science, so make sure you have a solid foundation of understanding before trekking onwards in the exploration. Let us make our first plot using the plot() command.
plot(x=diamonds$carat, y=diamonds$price, main="Carat VS Price",
     xlab="Carat", ylab="Price",
     sub="OUR FIRST SCATTERPLOT...MESSY AND HARD TO DECIPHER MUCH")
Notice that in the code all the labels and titles are in quotation marks. This means that R reads the words as character strings and not as the names of objects with values stored inside them. To learn more about the different data types and how to tell them apart, I found the following link to be super insightful: R Tutorial. Go through the tutorial and run the code on your console and hopefully you will begin to see the difference.
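The distinction can be seen directly in the console. In this sketch, axis_label is a hypothetical variable name chosen for illustration:

```r
# With quotation marks, R treats the text as a character string.
class("Price")

# Without them, R treats the word as the name of an object and looks up its value.
axis_label <- "Price"
class(axis_label)
identical(axis_label, "Price")
```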
I also want you to take note of the fact that I used the dollar symbol ($) in the code as well. In this particular instance $ acts as a pointer and tells R where to look: diamonds$price means “the column named price inside the data frame diamonds”. One could liken it to an Irish Setter, a breed of hunting dog that points at game. These dogs sniff out their prey and, once they have found it, freeze and point their noses at it so the hunter can take aim and shoot. Well, the dollar sign works the same way. It directs the computer to where it should grab the data for inspection, graphing, etc. Without the data frame name and the dollar sign, R will look for a standalone object named price when in reality we want the column named price. Enough about coding specifics; let’s discuss the actual scatterplot that we generated.
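For reference, a minimal sketch of what the dollar sign does, assuming the copy of diamonds that ships with ggplot2; prices is an illustrative name:

```r
library(ggplot2)

# diamonds$price pulls the price column out of the data frame as a vector.
prices <- diamonds$price
length(prices)

# The double-bracket form retrieves the same column by its name.
identical(prices, diamonds[["price"]])

# Once extracted, the column can be handed to other functions on its own.
summary(prices)
```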
Let me begin by saying “YIKES!” What have we done?! It looks like we created a blob monster! This scatterplot has so many overlapping points that it is difficult to determine what is actually going on. Fortunately our work was not entirely in vain, because we can glean the overall outline of the shape from the graph and infer that as the carat size increases so does the price. Our general intuition from everyday life could have told us this, but it is nice to see the same story being told by the data. For our first attempt at visualizing the data we did an okay job, but we are only depicting two columns of the data and there are so many more columns that we have left untouched. As a data scientist that should be a common thought running through your head: “How can I get the graph to show as much data as I can without it being overly messy?” So how can we introduce another variable into the visualization? One solution is to connect a variable to color. Let’s try it.
plot(x=diamonds$carat, y=diamonds$price, xlab="Carat", ylab="Price",
     main="Carat VS Price \n Color = Cut", col=diamonds$cut,
     sub="OUR SECOND SCATTERPLOT...STILL MESSY AND COLORS DON'T HELP MUCH")
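Because cut is stored as a factor in the ggplot2 copy of the data, col=diamonds$cut works by using each level's integer code as a color number. The plot has no key, so a legend built from those same codes can help. This is a sketch under that assumption and is not part of the original code; cut_levels is an illustrative name:

```r
library(ggplot2)

cut_levels <- levels(diamonds$cut)   # "Fair", "Good", ... in factor order

plot(diamonds$carat, diamonds$price, xlab = "Carat", ylab = "Price",
     main = "Carat VS Price \n Color = Cut", col = diamonds$cut)

# Level i was drawn in color number i, so the legend uses the same numbers.
legend("topleft", legend = cut_levels,
       col = seq_along(cut_levels), pch = 1, title = "Cut")
```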
It’s worth noting that in the code where I gave the graph its title I used the sequence “\n”. This is the same thing as hitting the Return or Enter key on the keyboard; it literally translates to “new line” in the computer’s eyes. The backslash symbol ( \ ) is one of a group of special characters often called metacharacters: symbols that the software engineers who wrote R designated to do more than stand for themselves. Inside a quoted string, the backslash signals that the character after it has a special meaning. In fact, we have already encountered another symbol with special powers in this module. The dollar symbol ($) does more than look pretty or express units of dollars: R uses it to pull a column out of a data frame. If you are curious and want to see the entire list of metacharacters and learn more about them, then take the time to peruse the following link: