Try your first data science project in minutes, not days.
This post is for everyone: whether you are an expert engineer looking for a refresher or completely new to coding, it doesn’t matter.
It also doesn’t matter whether you plan to become a data scientist or just want to understand what all the hype is about. Here you will get a feel for how things work, and by the end of the post you will have used machine learning to make predictions.
The War Plan
We will follow the steps below thoroughly, and at the end you won’t be a data scientist, but you will be on the right path.
- What’s R
- Why R
- Install R
- Install RStudio
- Basic R
- Further Reading
You can follow the steps in just one sitting, or you can save the post and follow them whenever you want. What you can’t change is their order (unless you already have R installed, in which case you can skip that part).
What’s R
R is a programming language built by statisticians for statisticians. Searching for just “R” on Google is frustrating because, in the end, R is only a letter. Search for “R language” or “R programming” instead, or use Rseek, a search engine dedicated to R.
With R you can write scripts that download, load, analyze, visualize, model and share data in many different ways. It is an open source language which means that you can use it for free and you can even modify it if you want to.
Since R was built with statistics in mind, the most common things you do with data are extremely easy to perform. Let’s say we have a CSV with data about countries’ GDP and public debt. Having heard about the paper by Reinhart and Rogoff (which was wrong, by the way), we want to check for ourselves whether there is some sort of relationship between the two variables.
With the three lines of code below we load the dataset, make a scatter plot of the variables of interest and fit a linear regression. That’s it.
# Read data from csv
df = read.csv('some_data.csv')
# Make a plot of the data
plot(df$debt, df$gdp)
# Fit a linear regression
lm(gdp ~ debt, data = df)
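If you don’t have such a CSV at hand, here is a minimal sketch of the same three steps on simulated data. The column names debt and gdp mirror the example above, but the numbers are made up purely for illustration (they say nothing about real economies), and the names df_sim and fit are my own.

```r
# Simulated stand-in for 'some_data.csv' (made-up numbers, illustration only)
set.seed(42)
df_sim <- data.frame(debt = runif(50, min = 20, max = 120))
df_sim$gdp <- 5 - 0.02 * df_sim$debt + rnorm(50)

# Same steps as above: plot the two variables, then fit a linear regression
plot(df_sim$debt, df_sim$gdp)
fit <- lm(gdp ~ debt, data = df_sim)

# coef() extracts the intercept and the slope of the fitted line
coef(fit)
```

Storing the model in a name, instead of just printing it, lets you inspect it later with functions such as coef() or summary().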
Why R
It is likely that during the previous search you found many comparisons between R and Python, and about 1000 different opinions about which one is better.
Let them discuss while you start actually doing data science, and go with R. The reason is pretty simple, especially if you have never coded before: R is easier to install, manage and learn.
Besides, you can do basically everything with R thanks to the community that releases open source packages at a fast pace. If you really want to become a data scientist, eventually you’ll have to learn at least a bit of this and that, like SQL and Python, maybe also Scala, so there’s no point in losing time to pick one.
Another good reason is that Python was created as a general-purpose language: you can build proper apps with it, while in R that would be a real pain. This can look like a plus for Python, but if you want to perform data analysis it is actually a downside: you’ll need many different packages just to manage data in an easy and functional way, you’ll write for loops everywhere and it’s easier to break things.
Install R
To install R, go to CRAN. You will find three links at the top of the page; just click the download link for your OS.
Don’t worry, there aren’t major differences between R on different operating systems, and you can use the same code on all of them.
After downloading, run the installer. When it finishes you’ll have an R console and a GUI (on Windows; on other operating systems you can start R from the terminal by simply running R).
The GUI shipped with R isn’t that great, so just open the console (or the terminal) and let’s run a bit of code to check that everything’s working fine.
# Generate some random numbers
x <- runif(10)
y <- runif(10)
# Print generated numbers
x
y
# Sum the two vectors
x + y
After running the code above you should see the two vectors printed, followed by their element-wise sum. Be aware that your numbers will differ from mine because they are generated randomly.
If everything worked fine we can move on to the next step! Otherwise you can comment here and I will try to solve the issue.
Install RStudio
As I said before, the R GUI isn’t great, and doing data science in the console is a real pain: you’ll have reproducibility issues, lose interactivity if you write scripts in a text file and run the code in the console, and so on. This is where RStudio, an IDE for R, comes in.
Once you have installed RStudio you can just open it; it will figure out by itself where R is and which version you’re using, so you don’t have to do any setup.
After opening RStudio, you should see something like this:
Don’t pay attention to colors and panel order; you can change them via Tools > Global Options whenever you want. What matters here is that you have a Console, which works just like R in the terminal (see the previous section), and a scripting area, which you will recognize by the tab “Untitled 1”. There you can write code without running it right away.
The other panels:
- Environment: list of all variables, functions, datasets and so on loaded in the current environment
- History: history of the last commands run
- Files: you can browse your computer folders and files from here
- Packages: a list of installed packages, you can activate and deactivate them from here (more on packages later)
- Plots: you will visualize plots in this panel
- Help: a good thing about R is its built-in documentation. If you run ?runif in the console you’ll see the documentation for runif in this panel, and you can do the same for every function
- Viewer: you can view HTML output in this panel
Now try to run in the RStudio console the same code we ran in the terminal.
We get the result instantly, as in the terminal, but now the Environment panel lists two variables: x and y. This comes in handy when you work with dozens of different functions and variables.
To test the script panel, write the exact same code as before in it. To run it, put the cursor on the first line and either click Run at the top-right of the panel or press Ctrl+Enter.
To run everything at once you can either highlight all the code and press Ctrl+Enter, or press Ctrl+Alt+R. As you can see, this way it’s much easier to read your code, and if there are mistakes you can debug them with ease.
Basic R
The first step when learning to code, or learning a new programming language, is to build the habit of actually reading code. For this reason this section is written entirely as commented code.
# As a primer you saw earlier that I wrote
# some text in the code, but it wasn't evaluated.
# These are comments, you can use them by simply
# using '#' before text.
###### I can also add more '#' to draw the attention
# on that comment.
# Another thing about R:
# '<-' and '=' are the same when assigning values
# Assign names
# We can do math with R as we would do in a simple calculator
5 + 6 # Returns 11
2 * 5 # Returns 10
2 ^ 2 # Returns 2 to the power of 2 = 4
10 / 2 # Returns 5
# But this is an inefficient way to deal with data:
# think about a 1000 rows dataset,
# and you want to find the sum.
# It's going to take a while to sum by hand every value in it
# So we can assign names to values
x = 5
x # This prints the value of x, so we will see 5
x = 7
x # As you can see the value of x has changed to 7
# We can store whatever we want in a name
x = 'hello world'
x
x = 5 * 5 # R will evaluate the expression 5 * 5 and then store its result in x
# We can do operations with variables
x = 5
y = 2
x * y # Returns 10
x + y # Returns 7
x # The values of x and y are unchanged
y
z = x + y * x / y
z
x
y
# Getting back to our previous example,
# to manage many elements we need vectors
x = c(1, 2, 3, 4, 5) # We create vectors with the c() function
x
# Vectors can contain whole numbers (as above), floats or strings
x = c(1.5, 2.3, 7.4) # Floats
y = c('hello', 'world', 'bye', 'moon') # Strings of text
# We can combine vectors,
# but remember that one vector can contain
# only one type among floats or strings
z = c(x, 3, 5)
z
z = c(x, y) # Here R converts the floats in x to strings
z
# Operations with vectors
x = c(1, 2, 3)
y = c(4, 5, 6)
x + y # R sums every value of x with the corresponding value of y
x * y
# What if we want to sum only the first value
# of x with the second value of y?
# We use indexing
x[1] # This prints the first element of x
y[2] # This prints the second element of y
x[1] + y[2] # Returns 6
x[1:2] # Returns the first two elements of x
x[1:2] + y[2:3] # Returns 6, 8
# Above we used ':' between two indexes
# we can use it also to generate sequences of values
x = 1:10
x # Returns 1 2 3 4 5 6 7 8 9 10
x = 10:1
x # Returns the same sequence as before, but inverted
# Returning to the same example, 1000 values to sum.
# We will use a couple of functions to demonstrate
# the speed and the capability of vectors and R
set.seed(123)
x = runif(1000)
# runif() is a function, a function takes various
# arguments and will do some operation with them.
# When in doubt go with help '?functionName'.
# In this case we generate 1000 random numbers.
# set.seed() is another function that fixes the seed
# for random generation, we need it otherwise you wouldn't get
# the same results as me
# (see previous sections Install R and Install RStudio)
sum(x) # sum() is a function that sums every element in a vector
# We can do the same for larger vectors
x = runif(100000)
sum(x) # Returns 49931.65
x = runif(1000000)
sum(x) # Returns 499638.4
# Data Frame
# Usually when we deal with data we want them in tabular format:
# organized by columns and rows.
# R has a native method to deal with tabular data: data frames
x = 1:10
y = letters[1:10] # We take the first 10 letters of the alphabet
z = rep(c('male', 'female'), 5) # rep() replicates the first argument as many times as the second one says (here 5)
df = data.frame(id = x, status = y, sex = z)
df
# We created a dataframe from 3 vectors,
# as you can see it's organized as a table, and every column is a vector.
# We can call single or groups of rows and columns and do operations on them
df[1,1] # Returns the value in the first row of the first column
df[1, ] # Returns the whole first row
df[ ,1] # Returns the first column
df[ ,'sex'] # Returns the sex column
# Another way to select columns is with the '$' operator
df$sex # Returns sex column
sum(df$id)
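Indexing also works with conditions, something the walkthrough above doesn’t cover. Here is a small sketch that rebuilds the same toy data frame so that it runs on its own; the names is_female and females are mine.

```r
# Same toy data frame as in the walkthrough
df <- data.frame(
  id = 1:10,
  status = letters[1:10],
  sex = rep(c('male', 'female'), 5)
)

# A comparison on a column returns one TRUE/FALSE value per row
is_female <- df$sex == 'female'

# Use that logical vector as a row index to keep only the matching rows
females <- df[is_female, ]
females

# Conditions can be combined with '&' (and) or '|' (or)
df[df$sex == 'male' & df$id > 5, 'id']
```

This is the base R counterpart of the filter() function from dplyr that shows up later in the post.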
One of the greatest things about R is that there is a package for almost everything. A package is just a collection of already defined functions that add to the regular ones in R.
The previous intro to R was limited because it left out packages: it doesn’t make sense anymore to learn R without learning some of the most used ones.
CRAN (the website where you got R) hosts packages, but there are also other repositories such as Bioconductor, and many more packages live in the wild on GitHub (usually the stable version is on CRAN, while the newer version is on GitHub).
It’s really easy to install packages in R: just run install.packages("package_name") and let R do its thing. Another feature of R is that packages are lazy, which means that you have to explicitly load them in order to access their functionality.
To load a package just run library("package_name") and you’ll be able to access it. This means that you won’t have to manage different environments to avoid conflicts between packages, because they aren’t loaded until you call them.
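A common pattern is to install a package only when it is missing and then load it. The helper below is a sketch of that idea; load_or_install is a name I made up, not a function that ships with R. It is called on stats, which always comes with R, so running the example does not trigger any download.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
}

# 'stats' ships with R, so nothing gets installed here
load_or_install('stats')
```

You would call load_or_install('dplyr') in the same way; the first call downloads the package from CRAN, later calls just load it.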
Some essential packages:
- dplyr: data cleaning and wrangling
- ggplot2: publication ready charts
- lubridate: work easily with dates
- data.table: manage “big” data
- devtools: a collection of development tools; for the moment you only care that it lets you install packages directly from GitHub
Now the fun part!
We will do some analysis on a very famous dataset in the machine learning community: iris. This dataset collects sepal and petal measurements for different iris species. You can take a look at this dataset and many others on the UCI Machine Learning Repository.
First you will load the data into R directly from the web:
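iris_data <- read.csv(
  'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
  header = FALSE,
  # Since R 4.0.0 strings are no longer converted to factors by default;
  # the rest of this post relies on the species column being a factor
  stringsAsFactors = TRUE
)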
By running the previous command you told R to reach that URL and read its content, which is in CSV format. Note that you can do the same thing with data sitting on your hard disk by substituting the URL with the file path on your system.
The second argument, header = FALSE, is needed in this case because the data doesn’t have column names. So what happens is the following:
head(iris_data)
##    V1  V2  V3  V4          V5
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
R has loaded the data by assigning columns some default names: V1, V2, etc.
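If the URL is unreachable, a possible fallback is the copy of iris that ships with R. This is only a sketch under a few assumptions: the built-in copy uses different column names and species labels, so the code reshapes it to look like the downloaded file, and as far as I know a few individual values differ from the UCI file, so later outputs may not match mine to the last digit.

```r
# Fallback: R ships with its own copy of the iris data (datasets::iris)
iris_data <- datasets::iris

# Mimic the shape of the downloaded file: default names V1..V5 ...
names(iris_data) <- c('V1', 'V2', 'V3', 'V4', 'V5')

# ... and species labels prefixed with 'Iris-', stored as a factor
iris_data$V5 <- factor(paste0('Iris-', iris_data$V5))

head(iris_data)
```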
In the meantime you also used a new function, head(iris_data), which returns the first 6 records of the object passed as an argument. If you want to see more than 6 records, just add an integer as a second argument to head(), for example head(iris_data, 10):
##     V1  V2  V3  V4          V5
## 1  5.1 3.5 1.4 0.2 Iris-setosa
## 2  4.9 3.0 1.4 0.2 Iris-setosa
## 3  4.7 3.2 1.3 0.2 Iris-setosa
## 4  4.6 3.1 1.5 0.2 Iris-setosa
## 5  5.0 3.6 1.4 0.2 Iris-setosa
## 6  5.4 3.9 1.7 0.4 Iris-setosa
## 7  4.6 3.4 1.4 0.3 Iris-setosa
## 8  5.0 3.4 1.5 0.2 Iris-setosa
## 9  4.4 2.9 1.4 0.2 Iris-setosa
## 10 4.9 3.1 1.5 0.1 Iris-setosa
If you want to check the bottom of your data use tail(): it works the same as head() but starts from the bottom of your object instead of the top.
Now you want to give meaningful names to the columns, so you check the UCI ML Repository documentation for iris and see that the columns are:
- V1 = Sepal Length
- V2 = Sepal Width
- V3 = Petal Length
- V4 = Petal Width
- V5 = Species
Let’s rename them!
# Check dataframe columns names
names(iris_data)
## [1] "V1" "V2" "V3" "V4" "V5"
# Create a vector with column names, use "_" as a best practice instead of whitespace
iris_names = c(
  'Sepal_Length',
  'Sepal_Width',
  'Petal_Length',
  'Petal_Width',
  'Species'
)
# Rename columns with the created vector
names(iris_data) = iris_names
# Check that everything went fine
head(iris_data)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width     Species
## 1          5.1         3.5          1.4         0.2 Iris-setosa
## 2          4.9         3.0          1.4         0.2 Iris-setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa
## 5          5.0         3.6          1.4         0.2 Iris-setosa
## 6          5.4         3.9          1.7         0.4 Iris-setosa
Now that you have better column names you can start looking for some descriptive statistics.
There are many functions that can help you in this regard; let’s look at some of them.
summary(iris_data)
##   Sepal_Length    Sepal_Width     Petal_Length    Petal_Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##             Species
##  Iris-setosa    :50
##  Iris-versicolor:50
##  Iris-virginica :50
With summary() you get some statistics organized by column: quartiles, min and max values, mean and median.
For non-numeric data such as Species you get a count of records by group. These are called categorical variables, and R manages them as factors.
Factors are integer-coded variables, meaning that R shows you Iris-setosa but stores and treats it as an integer. You can check that Species really is a factor with the class() function:
class(iris_data$Species)
## [1] "factor"
You can also check the levels of a factor with…drumroll… the levels() function. Levels are the labels assigned to the integer codes to make them human readable. As an example, if we had male and female, R would encode them as female = 1 and male = 2, because levels are sorted alphabetically by default.
levels(iris_data$Species)
## [1] "Iris-setosa" "Iris-versicolor" "Iris-virginica"
In this case you have:
- Iris-setosa = 1
- Iris-versicolor = 2
- Iris-virginica = 3
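To see the integer codes behind a factor for yourself, you can convert it with as.integer(). A tiny sketch on a made-up vector (sex and codes are names I chose for the illustration):

```r
# A small factor built from a made-up character vector
sex <- factor(c('male', 'female', 'female', 'male'))

# The labels, sorted alphabetically by default
levels(sex)

# The integer codes R stores behind the labels: female = 1, male = 2
codes <- as.integer(sex)
codes
```

The same call, as.integer(iris_data$Species), would show you the 1, 2, 3 codes listed above.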
Exploratory Analysis and Plotting
The next step is to dig a bit deeper and try to make sense of this dataset. You already tried the summary() function; now you’ll use the str() function, which gives you a sense of the data and the types you have.
str(iris_data)
## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...
You can see the dimensions of the dataset (150 rows and 5 columns; use dim(iris_data) if you want to see just this info), followed by every column name with its class and its first 10 observations. As you can see Species is a factor, and str() shows you the integer codes you saw earlier.
You may want to check some summary statistics besides the ones given to you by summary(). For instance, it’s interesting to check means by species.
# You can use tapply() to apply a function
# to a vector split by another vector
tapply(
  iris_data$Sepal_Length,
  iris_data$Species,
  mean
)
##     Iris-setosa Iris-versicolor  Iris-virginica
##           5.006           5.936           6.588
The first argument is the vector or column you want to calculate.
The second argument is the vector with the values R will use to split.
The last one is the function to apply, in this case mean().
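The three arguments are easier to follow on a toy example small enough to check by hand. The vectors below are made up for the illustration:

```r
# Five values and the group each one belongs to
values <- c(2, 4, 6, 8, 10)
groups <- c('a', 'a', 'b', 'b', 'b')

# Split 'values' by 'groups', then apply mean() to each piece
group_means <- tapply(values, groups, mean)
group_means
```

Group a holds 2 and 4, so its mean is 3; group b holds 6, 8 and 10, so its mean is 8.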
You just got the Sepal Length mean by species. But it would be better to see every column mean split by Species.
As usual in R there are several ways to do it. The first approach uses aggregate(), which groups data by a factor and then applies a function to collapse the data.
The second approach uses dplyr, one of the most useful R packages around. It combines the pipe operator with a set of basic data wrangling functions; when you hear about data pipelines in R, think of dplyr, the package that made R a fully fledged piping language.
# Use aggregate()
aggregate(
  iris_data[,1:4], # Columns you want to calculate and collapse
  list(iris_data$Species), # List of columns with splitting values
  mean # Function to calculate and collapse data
)
##           Group.1 Sepal_Length Sepal_Width Petal_Length Petal_Width
## 1     Iris-setosa        5.006       3.418        1.464       0.244
## 2 Iris-versicolor        5.936       2.770        4.260       1.326
## 3  Iris-virginica        6.588       2.974        5.552       2.026
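aggregate() also accepts a formula, which some people find easier to read. The sketch below uses the iris copy that ships with R so that it runs without the download; note that its column names (Sepal.Length and so on) and species labels differ from the ones used in this post.

```r
# '. ~ Species' reads as: summarize every other column, grouped by Species
by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
by_species
```

Compared with the call above, the grouping column keeps its own name (Species) instead of the generic Group.1.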
# To use dplyr you have to install it
install.packages('dplyr')
# Then just call it before using it
library(dplyr)
# The first argument is always the dataset
# This '%>%' is the pipe operator and tells R to take
# what comes before it and pass it to the following function
# (summarize_each() and funs() are deprecated in current dplyr,
# across() is their replacement, available since dplyr 1.0.0)
iris_data %>%
  group_by(Species) %>%
  summarize(across(everything(), mean))
## # A tibble: 3 x 5
##   Species         Sepal_Length Sepal_Width Petal_Length Petal_Width
## 1 Iris-setosa            5.006       3.418        1.464       0.244
## 2 Iris-versicolor        5.936       2.770        4.260       1.326
## 3 Iris-virginica         6.588       2.974        5.552       2.026
dplyr code translates into plain language much more naturally than aggregate(): you are telling R to take your dataset iris_data, group it by Species and then summarize each column with the mean() function.
As you can see, the means of some columns vary a lot between species. To see this visually you can draw a scatter plot.
You can also plot all the variables in your dataset against each other at once.
As you can see, points tend to cluster in two large groups. This is really interesting, but you might want to check it by introducing the Species variable in your plots.
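The plotting code is not shown here, so this is only a sketch of how such plots can be drawn with base graphics. I am assuming plot() for a single scatter plot and pairs() for the plot of all variables; the original calls may have been different. It uses the iris copy that ships with R, whose column names differ from the ones in this post.

```r
# Built-in iris as a stand-in for iris_data (column names differ)
num_cols <- iris[, 1:4]

# Scatter plot of two variables
plot(iris$Petal.Width, iris$Sepal.Width,
     xlab = 'Petal width', ylab = 'Sepal width')

# Scatter plot matrix: every numeric variable against every other one
pairs(num_cols)

# Same matrix, colored by species (the factor codes 1, 2, 3 pick the colors)
pairs(num_cols, col = as.integer(iris$Species))
```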
I’m going to stop using basic plot functionality since it is a bit outdated and there is another great package that makes everything more intuitive and better looking.
install.packages('ggplot2')
library(ggplot2)
# Call ggplot()
ggplot(iris_data, # The dataset is the first argument
       aes( # aes() = aesthetics
         Petal_Width, # x axis
         Sepal_Width, # y axis
         color = Species # color values by Species
       )
) + # ggplot() creates an empty plot
  geom_point() # add points to the empty plot
The species are almost perfectly split except for some versicolor and virginica. You can try to predict the species by using the variables in the dataset; this is a classification problem.
As a test you can try to predict the species from Petal_Width with a logistic regression, but in order to make things simpler and more interesting you’ll first eliminate the setosa observations, leaving a two-class problem.
# With filter() in dplyr you can filter rows by a condition,
# in this case we keep all rows where Species is not equal (!=)
# to Iris-setosa
iris_subset <- iris_data %>%
  filter(Species != 'Iris-setosa')
# R is smart, but not so smart, it will keep all levels for Species
# unless you tell it to drop unused levels
iris_subset$Species <- droplevels(iris_subset$Species)
dim(iris_subset)
## [1] 100   5
Logistic regression works very well with binary classification problems, but before modeling you have to split the data into a training set and a test set. This guards against overfitting: you fit the model on the training set and then evaluate its predictions on the test set, which it has never seen.
# Since you will use a random function it is always
# a best practice to set a seed
set.seed(123)
# With the sample() function we draw a random sequence of ids
train <- sample(1:nrow(iris_subset), 85)
# Train is just an index of ids we will use to subset data
head(train)
## [1] 29 79 41 86 91  5
# Then we subset iris_subset using the train index
iris_train <- iris_subset[train,]
iris_test <- iris_subset[-train,]
# Create a logistic regression model
iris_lr <- glm(
  Species ~ Petal_Width,
  data = iris_train,
  family = binomial
)
# Inspect the fitted model
summary(iris_lr)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -19.344      4.414  -4.383 1.17e-05 ***
## Petal_Width   12.024      2.773   4.337 1.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 117.541 on 84 degrees of freedom
## Residual deviance:  26.283 on 83 degrees of freedom
## AIC: 30.283
##
## Number of Fisher Scoring iterations: 7
To run most models in R you pass the function a formula: Species ~ Petal_Width in this case tells R to fit a model that takes Species as the response and Petal_Width as the predictor variable. Then you pass the dataset R has to draw the data from.
In this case you also set family = binomial because you want a logistic regression and not a linear regression. If you want to experiment with linear regression you can use lm(), and you could try to put petal width as the response, like this: lm(Petal_Width ~ ., data = iris_data[, -5]). The ‘dot’ is a placeholder indicating that you want to use all the other columns as predictors; the negative index -5 drops the fifth column, Species.
What you care about are the coefficient estimates (Estimate) and the z values, which you use to judge significance. You can also compare the Null deviance, the deviance of a model with only an intercept, with the Residual deviance, the deviance of your model: a decrease is good.
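To get a feel for what the two estimates mean, you can plug them into the logistic function by hand. This sketch copies the estimates from the summary output above; the petal widths are values I picked for illustration, and plogis() is base R’s inverse logit.

```r
# Coefficient estimates copied from the summary output above
intercept <- -19.344
slope <- 12.024

# Petal widths to score (hypothetical values chosen for illustration)
petal_width <- c(1.3, 1.6, 2.0)

# plogis() is the inverse logit: 1 / (1 + exp(-x))
probs <- plogis(intercept + slope * petal_width)
round(probs, 3)

# The predicted probability crosses 0.5 where intercept + slope * width = 0
boundary <- -intercept / slope
boundary
```

With these estimates the boundary sits at a petal width of roughly 1.6: flowers with wider petals get a probability above 0.5 of being virginica.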
Now you will predict the species of the test set with the predict() function.
# Predict the probability of the observation being virginica
probs <- predict(iris_lr, iris_test, type = 'response')
# Create a vector for the prediction with the base category
pred <- rep('Iris - versicolor', 15)
# Every probability > 0.5 means predicting virginica
pred[probs > 0.5] <- 'Iris - virginica'
# Build a confusion matrix
table(iris_test$Species, pred)
##                  pred
##                   Iris - versicolor Iris - virginica
##   Iris-versicolor                10                0
##   Iris-virginica                  1                4
You just made your first prediction! The result is very good since only one prediction out of 15 is wrong (the top-left and bottom-right cells hold the correct predictions).
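You can turn a confusion matrix into a single accuracy figure by dividing the diagonal (the correct predictions) by the total. A short sketch that retypes the counts from the table above by hand; conf_mat and accuracy are names I chose.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
conf_mat <- matrix(
  c(10, 1,   # first column: predicted versicolor
    0, 4),   # second column: predicted virginica
  nrow = 2,
  dimnames = list(
    actual = c('versicolor', 'virginica'),
    predicted = c('versicolor', 'virginica')
  )
)

# diag() picks the correct predictions, sum() counts all observations
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

That is 14 correct predictions out of 15, or about 93%. The same two functions work directly on the object returned by table().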
Since the result is so good, you now want to try to predict all three species. You could try a multinomial logistic regression, but when the response has more than two classes there are much better classification methods.
Linear Discriminant Analysis (LDA) is the first method you will try on the full dataset. Let’s split it into training and test as previously.
set.seed(999)
train <- sample(1:nrow(iris_data), 100)
iris_train <- iris_data[train,]
iris_test <- iris_data[-train,]
LDA will try to draw separating lines between classes, so you’ll end up with two lines separating versicolor, virginica and setosa.
# The lda() function is in the MASS package
library(MASS)
# As I told you before algos in R
# are fed in a similar manner
iris_lda <- lda(Species ~ Petal_Width, data = iris_train)
# Add '$class' at the end of predict to get
# only the predicted classes
preds <- predict(iris_lda, iris_test[,1:4])$class
# Confusion matrix as before
table(iris_test$Species, preds)
##                  preds
##                   Iris-setosa Iris-versicolor Iris-virginica
##   Iris-setosa              16               0              0
##   Iris-versicolor           0              18              1
##   Iris-virginica            0               3             12
By reintroducing all species and using LDA you got just 4 errors out of 50 observations, an 8% error rate. You can try to improve your model by feeding it all columns and not just petal width.
iris_lda <- lda(Species ~ ., data = iris_train)
preds <- predict(iris_lda, iris_test[,1:4])$class
table(iris_test$Species, preds)
##                  preds
##                   Iris-setosa Iris-versicolor Iris-virginica
##   Iris-setosa              16               0              0
##   Iris-versicolor           0              19              0
##   Iris-virginica            0               3             12
Using all columns brings the errors down from 4 to 3, which suggests we are close to the best achievable result. You could try KNN to see if it removes some of the remaining error, but from here on we are probably just overfitting.
set.seed(789)
# KNN is in the class package
library(class)
# knn works a bit differently than
# other methods we tried
# Take all the columns you want to
# use for prediction
knn_train <- iris_train[,-5]
knn_test <- iris_test[,-5]
# Take only the vector of correct
# classes for the train set
train_class <- iris_train[,5]
# Feed knn() with them, 'k' indicates the number of
# neighbours that vote on the class of each test observation
iris_knn <- knn(knn_train, knn_test, train_class, k = 6)
# Confusion matrix
table(iris_knn, iris_test$Species)
##
## iris_knn          Iris-setosa Iris-versicolor Iris-virginica
##   Iris-setosa              16               0              0
##   Iris-versicolor           0              19              2
##   Iris-virginica            0               0             13
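If you wonder what knn() does under the hood, the idea fits in a few lines: measure the distance from a new flower to every training flower, keep the k closest ones and let them vote. The function below is my own simplified sketch for a single observation, not the implementation in the class package, and it uses the iris copy that ships with R (species labels without the Iris- prefix).

```r
# Simplified k-nearest-neighbours for one new observation (illustration only)
knn_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance between new_x and every row of train_x
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Indexes of the k closest training rows
  nearest <- order(dists)[1:k]
  # Majority vote among their classes (ties go to the first level)
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# A flower with small petals, measured in the same order as the columns
new_flower <- c(5.0, 3.4, 1.5, 0.2)
predicted <- knn_one(iris[, 1:4], iris$Species, new_flower, k = 6)
predicted
```

The six nearest neighbours of this flower are all setosa, so the vote is unanimous. A larger k smooths out noisy neighbours, at the cost of blurring the borders between classes.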
This is the best you can do with so little data and without turning to even more complicated algorithms.
Further Reading
There is a huge variety of resources you can use to improve your data science knowledge. My advice, if you are anywhere from true beginner to intermediate level, is to get ‘An Introduction to Statistical Learning’ by James, Witten, Hastie and Tibshirani.
The pdf version is free and you’ll be able to improve both R and statistics knowledge at the same time. Another great book is ‘Discovering Statistics Using R’ by Field and Miles.
Other great resources below:
- R for Data Science – free online – beginner to intermediate
- Advanced R – free online – intermediate to advanced
- Datacamp – online classes free/paid – beginner to intermediate
- Data Science Specialization – mooc – beginner to intermediate
- Awesome R – list of tutorials – beginner to advanced
- Swirl – free R package – you can learn R within R