Data Science in Minutes

Try your first data science project in minutes, not days.

This post is for everyone: you can be an experienced engineer looking for a refresher, or you can be completely new to coding. It doesn’t matter.

It also doesn’t matter whether you plan to become a data scientist or just want to understand what all the hype is about. Here you will get a feel for how things work, and by the end of the post you will have used machine learning to make predictions.

The War Plan

We will follow the steps below thoroughly, and at the end you won’t be a data scientist, but you will be on the right path.

  • What’s R
  • Why R
  • Install R
  • Install RStudio
  • Basic R
  • Packages
  • Analysis
  • Plotting
  • Modeling
  • Further Reading

You can follow the steps in one sitting, or you can save the post and work through them whenever you want. What you can’t change is their order (unless you already have R installed, in which case you can skip that part).

What’s R?

R is a programming language built by statisticians for statisticians. If you look for R on Google you’ll have a hard time, because in the end R is just a letter. Instead, search for “R language” or “R programming”, or use Rseek.

With R you can write scripts that download, load, analyze, visualize, model and share data in many different ways. It is an open source language which means that you can use it for free and you can even modify it if you want to.

Since R was built with statistics in mind, the most common things you do with data are extremely easy to perform. Let’s say we have a CSV with data about countries’ GDP and public debt. Having heard about the paper by Reinhart and Rogoff (which turned out to contain errors, by the way), we want to check for ourselves whether there is some sort of relationship between the two variables.

With the three lines of code below we load the dataset, make a scatter plot of the variables of interest and then we fit a linear regression. That’s it.
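Here is a minimal sketch of what those three lines could look like. The file name, the column names (debt, gdp_growth) and the numbers are all made up for illustration: the first few lines only simulate a CSV so that the example runs on its own.

```r
# Simulated stand-in for the countries CSV (hypothetical file and column names)
set.seed(1)
csv_path <- file.path(tempdir(), "countries.csv")
write.csv(data.frame(debt = runif(30, 20, 120), gdp_growth = rnorm(30, 2, 1.5)),
          csv_path, row.names = FALSE)

# The three lines: load the data, draw a scatter plot, fit a linear regression
countries <- read.csv(csv_path)
plot(countries$debt, countries$gdp_growth)
fit <- lm(gdp_growth ~ debt, data = countries)
```

Typing summary(fit) afterwards prints the estimated coefficients of the regression.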

Why R?

It is likely that during the previous search you found many results with comparisons between R and Python and about 1000 different opinions about which is the better one.

Let them discuss while you start actually doing data science, and go with R. The reason is pretty simple, especially if you have never coded before: R is easier to install, manage and learn.

Besides, you can do basically everything with R thanks to the community, which releases open source packages at a fast pace. If you really want to become a data scientist, eventually you’ll have to learn at least a bit of this and that, like SQL and Python, maybe also Scala, so there’s no point in wasting time agonizing over which one to pick.

Another good reason is that Python was created as a general-purpose language: you can build proper apps with it, while in R that would be a real pain. This can look like a plus for Python, but if you want to perform data analysis it is actually a downside: you’ll need several different packages just to manage data conveniently, you’ll write for loops everywhere, and it’s easier to break things.

Install R

To install R, go to CRAN. You will find three links at the top of the page; just click on the download link for your OS.

CRAN page

Don’t worry, there aren’t major differences between R for different OS and you can use the same code if you have more than one.

After downloading, run the installer. When it finishes you’ll have an R console (on Windows; on other operating systems you can start R from the terminal by simply running: R) and a GUI.

R in terminal

The GUI shipped with R isn’t that great, so just open the console (or the terminal) and let’s run a bit of code to check that everything’s working fine.
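A minimal sketch of such a check: it creates two variables named x and y filled with random numbers and prints them. Using runif() and five values per variable is an assumption; any small random vectors would do.

```r
# Generate two small vectors of random numbers between 0 and 1
x <- runif(5)
y <- runif(5)

# Typing a variable name prints its content
x
y

# Vectors are added element by element
x + y
```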

Running the code above, you should see a result like the one below. Be aware that the numbers printed for x and y won’t match mine, because they are generated randomly.

Checking R

If everything worked fine we can move on to the next step! Otherwise you can comment here and I will try to solve the issue.

Install RStudio

As I said before, the R GUI isn’t great, but doing data science in the console is a real pain too. You’ll have reproducibility issues, you’ll lose interactivity if you write scripts in a text file and run them in the console, and so on.

One of the best IDEs around is RStudio; you can get the free desktop version at this link. As before, choose the right version for your OS, download it and run the installer.

Once it is installed, just open it. RStudio will figure out by itself where R is and which version you’re using, so you don’t have to do any setup.

After opening RStudio, you should see something like this:

RStudio IDE

Don’t pay attention to colors and panel order; you can change them via Tools > Global Options whenever you want. What matters here is that you have a Console, which works just like R in the terminal (see the previous section), and an area for scripting, which you will recognize by the tab “Untitled 1”. There you can write code without running it immediately.

The other panels:

  • Environment: list of all variables, functions, datasets and so on loaded in the current environment
  • History: history of the last commands run
  • Files: you can browse your computer folders and files from here
  • Packages: a list of installed packages; you can load and unload them from here (more on packages later)
  • Plots: you will visualize plots in this panel
  • Help: a good thing about R is its built-in documentation. If you run ?runif in the console, you’ll see the documentation for runif in this panel. You can do the same for every function
  • Viewer: you can visualize HTML in this panel

Now try running the same code we ran in the terminal in the RStudio console, as below.

RStudio console

We get the result instantly, as in the terminal, but now the Environment panel lists two variables: x and y. This comes in handy once we work with dozens of different functions and variables.

To test the script panel, write exactly the same code as before in the panel. To run it, either put the cursor on the first line and click Run at the top-right of the panel, or press Ctrl+Enter.

To run everything at once, either highlight all the code and press Ctrl+Enter, or press Ctrl+Alt+R. As you can see, this way it’s much easier to read your code, and if there are mistakes you can debug them with ease.

Basic R

The first thing to do when learning to code, or when learning a new programming language, is to build the habit of actually reading code. For this reason this section is written entirely in code.
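What follows is a short tour of base R written as a sketch, not a complete reference. All variable and function names (a, b, v, square, df) are made up for illustration.

```r
# Assignment uses <- and comments start with #
a <- 5
b <- 2
a + b                 # arithmetic works as on a calculator
a / b

# Vectors: c() combines values; operations apply element by element
v <- c(1, 2, 3, 4)
v * 2
v[2]                  # indexing starts at 1, not 0
v[v > 2]              # keep only the elements greater than 2
length(v)
mean(v)

# Text and logical values
name <- "iris"
is_flower <- TRUE
paste("I like", name)

# Sequences
1:5
seq(0, 1, by = 0.25)

# Functions: the last evaluated expression is the return value
square <- function(n) {
  n^2
}
square(4)

# Conditionals and loops
if (a > b) {
  print("a is bigger")
} else {
  print("b is bigger or equal")
}
for (i in 1:3) {
  print(square(i))
}

# Data frames: tables with named columns
df <- data.frame(id = 1:3, score = c(90, 75, 60))
df$score              # select a column by name
df[df$score > 70, ]   # select the rows matching a condition
nrow(df)
```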

Packages

One of the greatest things about R is that there is a package for almost everything. A package is just a collection of already defined functions that extend the ones built into R.

The previous introduction to R was deliberately brief because of packages: it no longer makes sense to learn R without learning some of the most widely used ones.

CRAN (the website where you got R) hosts packages, but there are also other repositories such as Bioconductor, and many more packages live in the wild on GitHub (usually the stable version is on CRAN, while the development version is on GitHub).

It’s really easy to install packages in R: just run install.packages("package_name") and let R do its thing. Installed packages are not loaded automatically, which means that you have to load them explicitly in order to access their functionality.

To load a package just run library("package_name") and you’ll be able to use it. Because packages aren’t loaded until you ask for them, you rarely have to worry about name conflicts between the packages you have installed.
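A small sketch of the install-then-load pattern. It uses MASS as the example because that package normally ships with R, so the install step is usually skipped; the guard around install.packages() is a convenience I added, not something R requires.

```r
pkg <- "MASS"  # example package; normally bundled with R

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load it; character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)

# The loaded package now shows up in the search path
search()
```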

Some essential packages:

  • dplyr: data cleaning and wrangling
  • ggplot2: publication ready charts
  • lubridate: work easily with dates
  • data.table: manage “big” data
  • devtools: a collection of development tools; for the moment you only care about its ability to install packages directly from GitHub

Analysis

Data Loading

Now the fun part!

We will do some analysis on a very famous dataset in the machine learning community: iris. This dataset collects sepal and petal measurements for different iris species. You can take a look at this dataset and many others on the UCI Machine Learning Repository.

First you will load the data into R directly from the web:
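A minimal sketch of that call. To keep it runnable offline, a temporary file written from R’s bundled iris dataset stands in for the web address (that stand-in is my assumption); read.csv() accepts a URL string in exactly the same position as the file path. Note that the bundled data labels species as setosa rather than Iris-setosa.

```r
# Stand-in for the remote file: the bundled iris data written without a header row
csv_path <- file.path(tempdir(), "iris.data")
write.table(datasets::iris, csv_path, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# header = FALSE: the first line of the file is data, not column names.
# stringsAsFactors = TRUE turns text columns into factors
# (this is no longer the default since R 4.0).
iris_data <- read.csv(csv_path, header = FALSE, stringsAsFactors = TRUE)

# Show the first 6 rows
head(iris_data)
```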

By running the previous command you told R to reach that URL and read its content, which is in CSV format. Note that you can do the same thing with data sitting on your hard disk by replacing the URL with the file path on your system.

The second argument, header = FALSE, is needed in this case because the file has no row of column names. So what happens is the following:

R has loaded the data by assigning columns some default names: V1, V2, etc.

In the meantime you also used a new function, head(iris_data), which returns the first 6 records of the object passed as its argument. If you want to see more than 6 records, just add an integer as a second argument to head():

If you want to check the bottom of your data use tail(). It works the same as head() but starts from the bottom of your object instead of the top.
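A quick sketch of both functions. The first line rebuilds a stand-in for iris_data from R’s bundled iris dataset with the default V1 to V5 names, so the example runs on its own.

```r
# Stand-in for iris_data as loaded with header = FALSE
iris_data <- setNames(datasets::iris, paste0("V", 1:5))

head(iris_data)        # first 6 rows
head(iris_data, 10)    # first 10 rows
tail(iris_data)        # last 6 rows
tail(iris_data, 3)     # last 3 rows
```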

Now you want to give the columns meaningful names. Checking the UCI ML Repository documentation for iris, you can see that the columns are:

  • V1 = Sepal Length
  • V2 = Sepal Width
  • V3 = Petal Length
  • V4 = Petal Width
  • V5 = Species

Let’s rename them!
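One way to do it, sketched with names() and the underscore spelling used later in this post (Petal_Width and so on). The first line rebuilds a stand-in for iris_data from R’s bundled iris dataset.

```r
# Stand-in for iris_data with the default column names V1..V5
iris_data <- setNames(datasets::iris, paste0("V", 1:5))

# Assign the new names in column order
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

names(iris_data)
```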

Now that you have better column names you can start looking for some descriptive statistics.

There are many functions that can help you in this regard; let’s look at some of them.

With summary() you get some statistics organized by column: quartiles, min and max values, mean and median.

For non-numeric data such as Species you get a count of records by group. These are called categorical variables, and R manages them as factors. (In R 4.0 and later, read.csv() only converts text columns to factors if you pass stringsAsFactors = TRUE.)

Factors are integer-coded variables, meaning that R shows you Iris-setosa but stores it as an integer. You can check that Species really is a factor with the class() function.

You can also check the levels of a factor with… drumroll… the levels() function. Levels are the labels assigned to the integer codes to make them human readable. By default the levels are sorted alphabetically, so if we had male and female, R would encode them as female = 1 and male = 2.

In this case you have:

  • Iris-setosa = 1
  • Iris-versicolor = 2
  • Iris-virginica = 3
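The functions above can be tried together as follows. This is a sketch on a stand-in for iris_data built from R’s bundled iris dataset; the “Iris-” prefix is added by hand so that the labels match the ones shown in this post.

```r
# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")
levels(iris_data$Species) <- paste0("Iris-", levels(iris_data$Species))

summary(iris_data)                    # per-column statistics; counts for Species
class(iris_data$Species)              # "factor"
levels(iris_data$Species)             # the human-readable labels
head(as.integer(iris_data$Species))   # the integer codes stored underneath
```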

Exploratory Analysis and Plotting

The next step is to dig a bit deeper and try to make sense of this dataset. You already tried the summary() function; now you’ll use the str() function, which gives you a sense of the data and the types you have.

You can see the dimensions of the dataset (150 rows and 5 columns; use dim(iris_data) if you want to see just this info) and then every column name with its class and its first few observations. As you can see, Species is a factor, and str() shows you its integer codes, as you saw earlier.
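Both calls, sketched on a stand-in for iris_data built from R’s bundled iris dataset:

```r
# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

str(iris_data)   # structure: dimensions, column names, types, first values
dim(iris_data)   # 150 5
```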

You may want to check some summary statistics besides the ones given to you by summary(); for instance, it’s interesting to check means by species.

The first argument is the vector or column you want to calculate.
The second argument is the vector with the values R will use to split.
The last one is the function to apply, in this case mean().
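The three arguments described above fit the signature of base R’s tapply(), so here is a sketch using it on a stand-in for iris_data built from R’s bundled iris dataset (whose species labels have no “Iris-” prefix).

```r
# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

# values to summarize, grouping vector, function to apply
sepal_means <- tapply(iris_data$Sepal_Length, iris_data$Species, mean)
sepal_means
```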

You just got the Sepal Length mean by species. But it would be better to see every column mean split by Species.

As usual in R there are several ways to do it. The first approach uses aggregate() that groups data by factor and then applies a function to collapse data.
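A sketch of the aggregate() approach using its formula interface, where the dot means “every other column”. The data is again a stand-in built from R’s bundled iris dataset.

```r
# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

# Mean of every numeric column, one row per species
species_means <- aggregate(. ~ Species, data = iris_data, FUN = mean)
species_means
```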

The second approach uses dplyr, one of the most useful R packages around. This package combines the pipe operator (%>%) with a set of basic data wrangling functions; when you hear about data pipelines in R, think of dplyr.

dplyr code reads much more naturally than aggregate(). You are telling R to take your dataset iris_data, group it by Species, and then summarize each column with the mean() function.
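A sketch of that pipeline, assuming dplyr (version 1.0 or later, for across()) is installed. The data is a stand-in built from R’s bundled iris dataset.

```r
library(dplyr)

# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

# Take the data, group it by species, then average every remaining column
species_means <- iris_data %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

species_means
```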

As you can see, some columns vary a lot between species. To see this visually you can draw a scatter plot.

plot_iris

You can also draw a general plot of all the variables in your dataset with plot().
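Both plots can be sketched with base R as follows. Which pair of columns goes into the scatter plot is my choice (petal length against petal width); the data is a stand-in built from R’s bundled iris dataset.

```r
# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

# Scatter plot of two columns
plot(iris_data$Petal_Length, iris_data$Petal_Width,
     xlab = "Petal length", ylab = "Petal width")

# Passing the whole data frame draws every pair of variables against each other
plot(iris_data)
```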

plot_iris_all.png

As you can see, the points tend to cluster in two large groups. This is really interesting, but you might want to check it by introducing the Species variable in your plots.

I’m going to stop using the base plot functionality, since it is a bit outdated and there is another great package, ggplot2, that makes everything more intuitive and better looking.
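A minimal ggplot2 sketch that colors the points by species, assuming ggplot2 is installed. The choice of petal length and petal width for the axes is mine, and the data is a stand-in built from R’s bundled iris dataset.

```r
library(ggplot2)

# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

# Map two columns to the axes and Species to the point color
p <- ggplot(iris_data, aes(x = Petal_Length, y = Petal_Width, colour = Species)) +
  geom_point()

print(p)
```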

ggplot_iris_all.png

The species are almost perfectly split, except for some versicolor and virginica. You can try to predict the species by using the variables in the dataset; this is a classification problem.

Modeling

As a test you can try to predict the species from Petal_Width with a logistic regression, but to make things simpler and more interesting you’ll first drop the setosa observations, leaving only two classes.

Logistic regression works very well with binary classification problems, but before modeling you have to split the data into a training set and a test set. This guards against overfitting: you fit the model on the training set and then evaluate its predictions on the test set.

To run most models in R you pass the function a formula: Species ~ Petal_Width in this case tells R to model Species as the response with Petal_Width as the predictor variable. Then you supply the dataset R has to draw the data from.

In this case you also put family = binomial because you want a logistic regression and not a linear regression. If you want to experiment with linear regression you can use lm(), and you could try putting petal width as the response, like this: lm(Petal_Width ~ ., data = iris_data[, names(iris_data) != "Species"]). The ‘dot’ is a placeholder indicating that you want to use all the other columns as predictors.
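The steps above can be sketched as follows. The 70/30 split, the seed and the variable names (two_species, train, test, fit) are my assumptions, and the data is a stand-in built from R’s bundled iris dataset, so the numbers will not match the ones in this post exactly.

```r
# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

# Keep versicolor and virginica only, and drop the now-empty setosa level
two_species <- droplevels(iris_data[iris_data$Species != "setosa", ])

# Random 70/30 split into training and test sets
set.seed(42)
train_idx <- sample(nrow(two_species), size = 70)
train <- two_species[train_idx, ]
test  <- two_species[-train_idx, ]

# Logistic regression: Species as the response, Petal_Width as the predictor
fit <- glm(Species ~ Petal_Width, data = train, family = binomial)
summary(fit)
```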

What you care about are the coefficient estimates (Estimate) and the z value, from which significance is determined. You can also compare the Null deviance, which is the deviance of the null model, with the Residual deviance, which is the deviance of your model; a decrease is good.

Now you will predict the species of the test set with the predict() function.
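A self-contained sketch of the prediction step. It repeats a hypothetical 70/30 split and model fit on a stand-in for iris_data (R’s bundled iris dataset), then turns predicted probabilities into labels with a 0.5 cutoff, which is my choice.

```r
# Stand-in data: two species, renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")
two_species <- droplevels(iris_data[iris_data$Species != "setosa", ])

set.seed(42)
train_idx <- sample(nrow(two_species), size = 70)
train <- two_species[train_idx, ]
test  <- two_species[-train_idx, ]
fit <- glm(Species ~ Petal_Width, data = train, family = binomial)

# type = "response" returns the probability of the second level (virginica)
prob_virginica <- predict(fit, newdata = test, type = "response")
predicted <- ifelse(prob_virginica > 0.5, "virginica", "versicolor")

# Confusion table: predictions in rows, true species in columns
confusion <- table(predicted = predicted, actual = test$Species)
confusion

# Share of correct predictions
accuracy <- mean(predicted == test$Species)
accuracy
```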

You just made your first prediction! The result is very good since there is only one wrong prediction (the top-left and bottom-right cells are the correct predictions).

Since the result is so good you want to try to predict all three species. You could try a multinomial logistic regression, but for categorical responses with more than two classes there are much better classification methods.

Linear Discriminant Analysis (LDA) is the first method you will try on the full dataset. Let’s split it into training and test as previously.

LDA will try to draw separating lines between classes, so you’ll end up with two lines separating versicolor, virginica and setosa.
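A sketch of LDA with lda() from the MASS package, which normally ships with R. The 100/50 split and the seed are my assumptions, and the data is a stand-in built from R’s bundled iris dataset, so the error count will not necessarily match the one in this post.

```r
library(MASS)

# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

# 100 observations for training, the remaining 50 for testing
set.seed(42)
train_idx <- sample(nrow(iris_data), size = 100)
train <- iris_data[train_idx, ]
test  <- iris_data[-train_idx, ]

# LDA with petal width as the only predictor
lda_fit <- lda(Species ~ Petal_Width, data = train)
lda_pred <- predict(lda_fit, newdata = test)$class

table(predicted = lda_pred, actual = test$Species)
error_rate <- mean(lda_pred != test$Species)
error_rate
```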

By reintroducing all species and using LDA you just got 4 errors on 50 observations, or an 8% error rate. You can try to improve your model by feeding it all the columns and not just petal width.
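The same sketch with every column as a predictor: the only change is the dot on the right-hand side of the formula. As before, the split, the seed and the stand-in data (R’s bundled iris dataset) are assumptions.

```r
library(MASS)

# Stand-in for iris_data with the renamed columns
iris_data <- datasets::iris
names(iris_data) <- c("Sepal_Length", "Sepal_Width",
                      "Petal_Length", "Petal_Width", "Species")

set.seed(42)
train_idx <- sample(nrow(iris_data), size = 100)
train <- iris_data[train_idx, ]
test  <- iris_data[-train_idx, ]

# The dot stands for all columns other than the response
lda_all <- lda(Species ~ ., data = train)
lda_all_pred <- predict(lda_all, newdata = test)$class

table(predicted = lda_all_pred, actual = test$Species)
error_rate_all <- mean(lda_all_pred != test$Species)
error_rate_all
```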

Using all columns brings the error count from 4 to 3, which means we are close to the limit of what is reachable. You could try KNN to see if you can recover some of that error, but from here on we are probably just overfitting.

This is the best you can do with so little data and without turning to even more complicated algorithms.

Further Reading

There is a huge variety of resources you can use to improve your data science knowledge. My advice, if you are anywhere from a true beginner to an intermediate level, is to get ‘An Introduction to Statistical Learning’ by James, Witten, Hastie and Tibshirani.

The PDF version is free and you’ll be able to improve both your R and your statistics knowledge at the same time. Another great book is ‘Discovering Statistics Using R’ by Field, Miles and Field.

Other great resources below:

If you enjoyed this you can let me know in the comments below, or by spreading this post. You can also follow the blog and/or subscribe to the newsletter.

  • sammy

    Thanks for the well detailed introduction for beginners.

  • perfectlyGoodInk

    Wow, I’ve seen quite a few “crash courses” in R, but I think this has got to be the best. I already knew a lot of that, but I also didn’t know a lot of that, too.

    I’ve also just taken Coursera’s Practical Machine Learning course (alas, not free if you want to check your work via graded quizzes), and I discovered _Introduction to Statistical Learning_ there. I totally agree that it’s a real gem.

  • Rana Muhammad Kashif

    Appreciate your effort!!! Really helpful