Learn R before Python

This is not the usual R vs Python post you can find online; in fact, I won’t discuss whether one is better than the other. Instead, I will show you why someone who wants to learn data science has an advantage by starting with R.

Vectors

What are vectors? If you know matrices, you know vectors. They can be seen as rows or columns of matrices, so what we have is a one-dimensional “list” of numbers. Vectors are usually used as the columns of data frames, because we can then be sure that every value in a column has the same type.

Float, integer, string, categorical, etc.: a vector always has exactly one type. This is important because it can make our code faster and clearer: the interpreter only has to check the type of the first record, and that’s it. As you may know, vectors are native in R; in fact, even a scalar is a vector of length one.

Vectorization

When performing data analysis or machine learning, I often work with data in a tabular format or, at a lower level, with a series of vectors. If I want to multiply every record in a vector by 2, it’s pretty natural to do:

In Python you can use lists to store your vectors, so let’s try the same with Python 3 (the fact that you have to worry about 2 vs 3 is another issue entirely):
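A minimal sketch of that attempt, assuming the example vector holds the values 5, 3 and 4 (the values and the variable names are chosen here for illustration):

```python
# Assumed example values; any list of numbers behaves the same way.
vec = [5, 3, 4]

# For a Python list, * means repetition, not element-wise multiplication.
doubled = vec * 2
print(doubled)  # [5, 3, 4, 5, 3, 4]
```

Instead of doubling each element, the list itself is repeated twice.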

WAT…

It turns out that the only way to get the same result in native Python is to perform a for loop:
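A minimal sketch of such a loop, assuming the example values 5, 3 and 4 (chosen here for illustration):

```python
vec = [5, 3, 4]  # assumed example values

# Visit each element in turn and print its double.
for x in vec:
    print(x * 2)
```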

You may want to store the result in a list as the input, so you have to initialize an empty list out of the loop and append results to it:
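Something along these lines, where the name result and the example values 5, 3 and 4 are assumptions made for illustration:

```python
vec = [5, 3, 4]  # assumed example values

# Start from an empty list and grow it one element at a time.
result = []
for x in vec:
    result.append(x * 2)

print(result)  # [10, 6, 8]
```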

The same code in R would be:

I would stress that this isn’t so much about less typing as about forming the “right” mental model. Many people complain that their R code is slow; 99% of the time this is because they didn’t vectorize it and instead coded “Python style” with loops, either hidden or explicit.

Random Walk Example

We will perform a random walk in R and Python; for the latter, the examples are taken from the book “From Python to NumPy”.

Let’s start from the most basic approach by looping:
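A sketch of the looping approach, written along the lines of the book’s pure-Python example; the function name random_walk and the walk length used below are assumptions made here for illustration:

```python
import random


def random_walk(n):
    """Return a walk of n steps of +1 or -1, starting at position 0."""
    position = 0
    walk = [position]
    for _ in range(n):
        # random.randint(0, 1) is 0 or 1, so the step is -1 or +1.
        position += 2 * random.randint(0, 1) - 1
        walk.append(position)
    return walk


walk = random_walk(1000)
print(walk[:10])
```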

This code can get slow for very long walks; we can improve it by using the itertools module:
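A sketch of what this can look like, assuming Python 3.6 or later (random.choices is not available before that); the function name random_walk_faster is an assumption made for illustration:

```python
import random
from itertools import accumulate


def random_walk_faster(n=1000):
    """Return a walk of n steps of +1 or -1, starting at position 0."""
    # Draw all the steps in one call instead of one per iteration.
    steps = random.choices([-1, +1], k=n)
    # accumulate yields the running sum of the steps.
    return [0] + list(accumulate(steps))


walk = random_walk_faster(1000)
print(walk[:10])
```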

Anyway, this isn’t vectorized yet. It’s just a more efficient way to loop. To reach full vectorization we need NumPy:
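A sketch of a vectorized version, assuming NumPy is installed; the function name random_walk_fastest is an assumption made for illustration:

```python
import numpy as np


def random_walk_fastest(n=1000):
    """Return the positions after each of n random steps of +1 or -1."""
    # np.random.choice draws all n steps at once as an array.
    steps = np.random.choice([-1, +1], n)
    # np.cumsum turns the steps into positions without a Python loop.
    return np.cumsum(steps)


walk = random_walk_fastest(1000)
print(walk[:10])
```

Unlike the two versions above, this sketch does not prepend the starting position 0, so the result has exactly n entries.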

Take a close look at the methods derived from NumPy.

The same R code:

No imports, no real need to define a function or a method, code packed in one line.

Conclusion

If you want to be a data “something”, or if you want to teach someone, start with R. Once you are confident with R, move on to Python.

If you enjoyed this, you can let me know by commenting below or by sharing this post. You can also follow the blog and/or subscribe to the newsletter.

  • Honza Beníšek

    Hi, I usually don’t like these R vs Python wars either, but your arguments against Python are not valid. Firstly, if you want to multiply a vector by a number in Python, you can simply do this: lst_mult = 2*np.array([5,3,4]). This is the same as in R in your example. Secondly, the same goes for the random walk example. You can simply write this in Python: b = np.cumsum(np.random.randint(-1, 1, size=10)). Again, the same as in R.

    • Alan

      It’s not in Python, it’s in NumPy, which is a vast library, and there is a full book (http://www.labri.fr/perso/nrougier/from-python-to-numpy/) about how to use it. For a beginner who’s learning a language, having to learn many libraries on top of it can be overwhelming.

      • Matthew Dornfeld

        R also has a vast amount of array functionality that is difficult for beginners to learn. Including it in the main namespace doesn’t make it any easier to learn than organizing it into a separate library like Numpy.

        • Alan

          I agree that R is not the easiest language to learn, but let’s not forget that I’m talking about learning for data science, so having c(1, 2) * 2 work out of the box is one of the most important things. Furthermore, in R there’s masking, meaning that if I write my own cumsum function I will mask the core one, which Python won’t let me do. Packages work similarly in Python and R, but package management is another area where R shines with its simplicity, while every Python resource for beginners has to start with: get either conda or venv or something similar, get pip, go to the command line, etc. For a total beginner this can be overwhelming and takes the focus away from the core language.

  • Enjoyed the article. Although if one is diving immediately into machine learning or language processing, I might recommend starting with Python.
    Just wanted to let you know your home link is directing users to an HTTPS URL, prompting errors as it appears SSL is not configured yet. (show -> homepage)

    • Alan

      Yes, true! Thanks for reporting it.

      I might agree with you if one only has to do machine learning for production, but usually one has to collect, clean and analyze data first. For this, R is in my opinion much better.

      • I agree with you there – Definitely a space where R shines.

  • Wintermute

    Ignoring the misuse of numpy: you don’t have to initialise an empty list and append to it in Python; you can use a list comprehension.
    res = [i*2 for i in [5, 3, 4]]

  • sumendar

    It’s True!

  • Matthew Dornfeld

    In the first part it’s kind of disingenuous to compare a Python list to an R vector. They’re different data types and serve different purposes. The Python equivalent to the R vector is a 1d numpy array. Sure Numpy isn’t a default library, but that’s really a superficial difference since that can be fixed with one command

    pip install numpy

    Using Numpy your example would look like

    import numpy as np

    vec = np.array([5, 3, 4]) * 2

    In your second example you don’t need to define a function either. You can do

    import numpy as np

    rw = np.cumsum( np.random.randint(-1, 2, 100) )

    You also mention Python’s namespaces as a hindrance, but they’re actually an asset. They help you organize functions that are logically connected and perform interrelated operations. They also don’t pollute your main namespace with thousands of functions you’re never going to use in your program, which makes it easier to name your own custom functions. Let’s say I want to create my own custom function called cumsum that does something similar to the default cumsum function but slightly different. In Python doing that is perfectly clear and easy. I don’t even think R supports function overloading, so you would have to call it something like cumsum_custom. That’s an extra layer of difficulty. Also, if you don’t want to deal with namespaces and you want all of your scientific functions in your main namespace for fast prototyping, you can just import pylab. The two examples become

    from pylab import *

    vec = array([5, 3, 4]) * 2
    rw = cumsum( randint(-1, 2, 100) )