Machine learning in Clojure with XGBoost

Clojure is a Lisp. So, being a Lisp, it has a lot of parentheses. Now that we got that off our chests we can move on to more serious stuff.

Obligatory xkcd comic

Why Clojure?

You’ve probably never heard of Clojure at all, let alone for data science and machine learning. So why would you be interested in using it for these things? I’ll tell you why: to make what matters (data) first class!

In Clojure we don’t deal with classes, objects and so on: everything is just data. And that data is immutable. This means that if you mess up your transformations, the original data will still be fine and you won’t have to start all over again.

That’s just one of the reasons popping out of my head; another one might be the JVM. Yeah, we all hate it to some extent, but make no mistake: it has been in development since 1991 and has always been production grade. This is why businesses all around the world still run and develop software on the JVM. Considering this, Clojure can be accepted even by the most conservative corporations out there, because down below it is fundamentally something they already know.

Another point I would like to make is the REPL. This isn’t like the usual language shells (for instance the Python REPL), which are usually very basic and annoying: it has superpowers! Would it be nice to see live what’s happening with that service in production? Done! Would you like to have a plotting service that can be used for both production and experimenting? Done! Do you happen to have nested data structures that you’d like to explore visually to understand them better? Done!

This means you don’t really need something like a Jupyter notebook, though if you feel more comfortable in such an environment there are options out there that work seamlessly: clojupyter is a Clojure kernel for Jupyter notebooks, and Gorilla REPL is a native Clojure notebook solution.

Data Reading

If you’ve never seen Clojure code or if you are just starting out, my advice is to take a look at Learn Clojure in Y minutes just to be able to follow along with this; one of my favourite resources when I started learning was Clojure for the Brave and True. Anyway, I’ll explain every step in the code carefully, so everyone will be able to follow along without many issues.

This is just an introduction, so we will work with the infamous Iris dataset. If you don’t already have Leiningen, get it and install it: it is really easy to use and is the de facto build tool for Clojure. Now you can create a new project skeleton from your command line.
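Assuming the standard app template (my choice here; any project name works, but it has to match the paths used later in this post), you would launch:

```
lein new app clj-boost-demo
```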

You should end up with a directory structure like the one below.
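Roughly this (a few generated files are omitted; note that Leiningen turns the dashes in the project name into underscores for directory names):

```
clj-boost-demo/
├── project.clj
├── README.md
├── resources/
├── src/
│   └── clj_boost_demo/
│       └── core.clj
└── test/
    └── clj_boost_demo/
        └── core_test.clj
```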

The files we care the most about are project.clj, which is the basis for Leiningen to correctly prepare and build the project, src, where we put the code for our application or library, and resources, where we put the Iris csv file that you can get from here. Now we can define the project.clj file, which is another ‘sanity’ touch about Clojure: you have to declare libraries and their versions explicitly, which is always a good thing.
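A minimal project.clj for this demo might look like the following; the version numbers are illustrative, so check Clojars for the current ones:

```clojure
(defproject clj-boost-demo "0.1.0-SNAPSHOT"
  :description "XGBoost on the Iris dataset with clj-boost"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [org.clojure/data.csv "0.1.4"]
                 [clj-boost "0.0.3"]]          ; check Clojars for the latest version
  :main clj-boost-demo.core)
```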

If we save project.clj and launch lein repl in the shell, Leiningen will get all the needed dependencies and start a REPL. We can move on to data loading and transforming by opening and modifying the src/clj_boost_demo/core.clj file. We will find some placeholder code in there; we can get rid of it and start writing our own.

In Clojure we work with namespaces, usually a file contains one namespace where we define our imports:
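(The csv library and the aliases shown here are the ones assumed throughout the rest of this post.)

```clojure
(ns clj-boost-demo.core
  (:require [clj-boost.core :as boost]
            [clojure.data.csv :as csv]
            [clojure.java.io :as io]))
```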

ns defines a new namespace, and it is good practice to declare imports in the namespace definition as we’re doing here. :require is a keyword, which is a type per se in Clojure, and we will see later why keywords are important. [clj-boost.core :as boost] means that we want to use the core namespace from the clj-boost library, but we want to refer to all the names under it with the name boost. If you know Python, this is the same as doing import library as lib.
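Next we point the program at the data; a minimal sketch, assuming the csv sits in resources/ under this file name:

```clojure
(def iris-path "resources/iris.csv")
```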

With def we create new vars globally in the current namespace. In this case we’re pointing at a string representing the path where our dataset lives. Usually in Clojure we don’t define many global names, except for function names. In fact iris-path will be the only global name we will use in this demo!

To read the Iris csv we use this code:
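(A sketch of the finished function, which we’ll take apart step by step below.)

```clojure
(defn generate-iris
  "Read the Iris csv and return a vector of rows, each split into
   [measurement-strings label-string]."
  [iris-path]
  (with-open [reader (io/reader iris-path)]
    (into []
          (comp (drop 1)                ; drop the csv header
                (map #(split-at 4 %)))  ; split measurements from the label
          (csv/read-csv reader))))
```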

This code defines (defn) a function named generate-iris that takes the path to the Iris csv as its argument. Then we open a connection to the file at the given path that is going to be closed when we’re done (with-open). When you see a vector with a symbol followed by some code right after a function call – [reader (io/reader iris-path)] – that’s a local binding.

Local bindings are useful to avoid cluttering the global namespace and to separate code execution into tiny bits. In this case we use a Java reader, to be precise a BufferedReader, to open and read the file. As you can see we are using the imported clojure.java.io namespace by doing io/reader: namespace/name is the syntax to access names residing in an imported namespace.

Now the following code might look a bit esoteric, but it’s just a matter of habit. Let’s decompose all the steps at the REPL, starting from the bare csv read:
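(The exact exception text may differ.)

```clojure
(with-open [reader (io/reader iris-path)]
  (csv/read-csv reader))
;; => java.io.IOException: Stream closed
```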

The code above throws an error because csv/read-csv is lazy. Laziness by default is another Clojure feature: many of Clojure’s sequence functions don’t actually produce their values until you need them. This is nice if you think about it: if we had a very large file we wouldn’t have to load it all into memory to process it; we could read and process it line by line while writing the result to another file.

To make the function eager we can use doall:
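(The header and values in the output comments are illustrative; they depend on the csv you downloaded.)

```clojure
(with-open [reader (io/reader iris-path)]
  (doall (csv/read-csv reader)))
;; => (["sepal_length" "sepal_width" "petal_length" "petal_width" "species"]
;;     ["5.1" "3.5" "1.4" "0.2" "setosa"]
;;     ["4.9" "3" "1.4" "0.2" "setosa"]
;;     ...)
```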

The result is a sequence of vectors containing strings. Sequences are among the most important data structures in Clojure: they are lazy, immutable and can be produced from any other Clojure data structure. For more info about them check the official Clojure docs. Vectors are very similar to Python lists, with the difference that they are persistent.

To better understand immutability and persistence let’s try a little experiment:
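(The concrete values are just for illustration.)

```clojure
(def a [1 2 3])

(conj a 4)
;; => [1 2 3 4]

a
;; => [1 2 3]
```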

As you can see we created a vector a, then conj’d 4 onto a (conj appends elements to vectors), and the result was a whole new data structure; in fact a still has its initial value.

Data Processing

In this case we don’t care about the header anymore since the "species" column is the class we want to predict and is the last one, so we can start processing the raw data right away. To make the process easier we define a demo-reader function to use in the REPL and start working on the resulting data:
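(A minimal sketch of such a helper.)

```clojure
(defn demo-reader
  "Read the whole Iris csv eagerly, for REPL experiments."
  []
  (with-open [reader (io/reader iris-path)]
    (doall (csv/read-csv reader))))
```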

Then we will use the threading macro to experiment and apply transformations step by step:
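(Output values are illustrative Iris rows.)

```clojure
(->> (demo-reader)
     (take 3))
;; => (["sepal_length" "sepal_width" "petal_length" "petal_width" "species"]
;;     ["5.1" "3.5" "1.4" "0.2" "setosa"]
;;     ["4.9" "3" "1.4" "0.2" "setosa"])
```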

->> lets us thread a value as the last argument of the subsequent functions: you surely remember that at school they taught you that to solve \(f(g(x))\) you should start by solving \(g(x)=x’\) and then \(f(x’)=x”\). The threading macro is just syntactic sugar to make code more readable.

Here’s a simple example:
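(The same tiny computation written both ways.)

```clojure
(dec (inc 1))
;; => 1

(-> 1 inc dec)
;; => 1
```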

We increase 1 by 1, so (dec (inc 1)) means increasing 1 and then decreasing the result – 2 – by 1, which gives 1 back. Basically we read from right to left to understand the order of application of the functions. With (-> 1 inc dec) we can go back to reading the operations from left to right. For further info on threading macros check the official Clojure docs.

Getting back to (->> (demo-reader) (take 3)): as you can see, as a result we get only the first 3 vectors from the csv. take does exactly what you think: it lazily takes n values from the given collection. This is very useful when experimenting; otherwise we would have to work with the whole sequence.

To remove the header from the sequence we can drop the first row from results:
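(Continuing the same REPL pipeline.)

```clojure
(->> (demo-reader)
     (take 3)
     (drop 1))
;; => (["5.1" "3.5" "1.4" "0.2" "setosa"]
;;     ["4.9" "3" "1.4" "0.2" "setosa"])
```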

Now since we want to separate \(X\) values from our \(Y\) (the classes we want to predict) it would be nice to do it in one go. If you come from Python or other C-like languages you might be tempted to use a loop to do it, but in Clojure we do things differently.

With map we can apply a function to all the values in a collection:
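(A trivial example.)

```clojure
(map inc [1 2 3])
;; => (2 3 4)
```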

In this case we want to split values, and guess what, there’s a split-at function waiting for us!
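Trying it on a single row (values illustrative):

```clojure
(split-at 4 ["5.1" "3.5" "1.4" "0.2" "setosa"])
;; => [("5.1" "3.5" "1.4" "0.2") ("setosa")]
```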

This is exactly what we need, so we will define an anonymous function and map it over our vectors:
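(Only the first two rows are shown.)

```clojure
(->> (demo-reader)
     (drop 1)
     (map #(split-at 4 %))
     (take 2))
;; => ([("5.1" "3.5" "1.4" "0.2") ("setosa")]
;;     [("4.9" "3" "1.4" "0.2") ("setosa")])
```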

To define named functions we use defn, which is a macro that saves us from typing (def my-func (fn [arg] "do something")) every time; by doing simply (fn [arg] "I have no name") we get an anonymous function.

#(split-at 4 %) is another short-hand that expands to the same thing as (fn [x] (split-at 4 x)), so it is just a way to save typing.

Now let’s put everything together by using transducers. As with the threading macro, with transducers we compose functions together, but transducers return a single function that passes over the data only once. Don’t worry if they look a bit obscure at first (it took some time for me as well to grasp the concept), and I suggest you read carefully this very well written series on transducers.
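A sketch of the same pipeline expressed as a transducer; xf is just a conventional name for the composed transforming function, bound locally here:

```clojure
(let [xf (comp (drop 1)                ; drop the header
               (map #(split-at 4 %)))] ; split measurements from the label
  (into [] xf (demo-reader)))
;; => [[("5.1" "3.5" "1.4" "0.2") ("setosa")]
;;     [("4.9" "3" "1.4" "0.2") ("setosa")]
;;     ...]
```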

With comp we compose functions together, getting back a single function, while with into we cycle over a collection and pour the results into the collection given as the first argument. I like to think about this process this way: it is like we’re pulling values from one collection into another, but while we do that we apply a function – xf – to all of them.

The result is the generate-iris function we started from:
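(The same sketch shown at the beginning of this section.)

```clojure
(defn generate-iris
  [iris-path]
  (with-open [reader (io/reader iris-path)]
    (into []
          (comp (drop 1)
                (map #(split-at 4 %)))
          (csv/read-csv reader))))
```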

Now we want to go from this ([("5.1" "3.5" "1.4" "0.2") ("setosa")] [("4.9" "3" "1.4" "0.2") ("setosa")]) to something we can process in an easier way: ([5.1 3.5 1.4 0.2 0] [4.9 3.0 1.4 0.2 0]). Basically we parse strings into numbers and convert classes (setosa, virginica and versicolor) to integers.

Let’s start by abstracting the needed transformations:
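(A quick REPL check of the two basic ideas; values are illustrative.)

```clojure
;; numeric columns arrive as strings, so we parse them into doubles
(Double/parseDouble "5.1")
;; => 5.1

;; class labels become integers
(case "setosa"
  "setosa"     0
  "versicolor" 1
  "virginica"  2)
;; => 0
```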

With these transformations we can build the \(X\) for our model. Let’s build a named transformer:
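(A sketch; the name transform-x and the row shape follow from the split-at step above.)

```clojure
(defn transform-x
  "Parse the measurement strings of a split row into a vector of doubles."
  [[measures _]]
  (mapv #(Double/parseDouble %) measures))
```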

For our \(Y\) instead:
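(Again a sketch, with transform-y as an illustrative name; note the let binding and the case.)

```clojure
(defn transform-y
  "Turn the species label of a split row into an integer class."
  [[_ species]]
  (let [l (first species)]
    (case l
      "setosa"     0
      "versicolor" 1
      "virginica"  2)))
```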

Stop just a second: with let we can create local bindings, that is, give names to data or functions that live only in the local scope, not globally. case is a way to avoid nested (if condition "this" "else this"). What we’re saying is: if l is equal to "setosa" return 0, if it is equal to "versicolor" return 1, and if it is equal to "virginica" return 2.

In this way, if the given value doesn’t match any of the three cases we get an error. This is nice for checking data quality as well.

Train – Test Split

One of the most important things to do when doing machine learning is to split the data into a train and a test set. A good split requires random sampling, so we will implement a very simple sampler from scratch:
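(A minimal sketch; the exact return shape is a choice of mine.)

```clojure
(defn train-test-split
  "Shuffle the rows and return [training-rows test-rows],
   with n rows going to the training set."
  [data n]
  (split-at n (shuffle data)))
```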

train-test-split takes a collection and a number of instances you want in the training set. The shuffle function simply shuffles the collection, so we get a random result every time and we can easily avoid repetition.

This solution is not optimal if you have a fairly large dataset; in that case you might want to take a look at sampling, a very nice library that takes care of everything about sampling.

With the above functions we generate the train and test sets as 2 maps with 2 keys: :x and :y. Maps in Clojure are first-class citizens and have very nice properties:
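(A throwaway map just to show the behaviour.)

```clojure
(def m {:x [1 2 3] :y 0})

;; keywords act as lookup functions
(:x m)
;; => [1 2 3]

;; maps are functions of their keys too
(m :y)
;; => 0

;; "updating" returns a new map, the original stays untouched
(assoc m :y 1)
;; => {:x [1 2 3], :y 1}
m
;; => {:x [1 2 3], :y 0}
```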

Maps are much more than this, if you don’t know about them you should take a look here.

Training & Prediction

XGBoost is an ensemble model that uses gradient boosting to minimize the loss function. If you can’t make sense of these words together my suggestion is to check this very nice explanation  (with pics and formulas) of the algorithm.

clj-boost gives you a Clojure interface to the underlying Java implementation of the library; in this way we can avoid Java interop and get the same results. To train a model we have to create a DMatrix from the data we want to feed the algorithm for learning. This isn’t a choice of mine, and I could have hidden the data transformation behind the API, but there’s an issue: once you get your data into a DMatrix you can’t touch or look at it anymore.
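Here’s a sketch of what training looks like with the parameters discussed below. The exact shapes accepted by dmatrix and fit are assumptions on my part, so check the clj-boost README before copying this; train-set stands for the {:x ... :y ...} map built in the previous section.

```clojure
(boost/fit (boost/dmatrix train-set)          ; serialize {:x ... :y ...} into a DMatrix
           {:params {:eta       0.00001
                     :objective "multi:softmax"
                     :num_class 3}
            :rounds 2
            :watches {:train (boost/dmatrix train-set)}
            :early-stopping 10})
;; => a trained Booster instance
```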

With dmatrix we serialize our training set; there are various ways to generate a DMatrix, so I advise you to take a look at the docs or at the README. The :params map acts like a config for XGBoost, and this is a very minimal example of what we can do with it; to know all the possible options always refer to the official docs.

Here we say XGBoost should do training with a learning rate – :eta – of 0.00001; since we’re doing classification on 3 classes – setosa, versicolor and virginica – we set the :objective to multi:softmax and tell XGBoost how many classes we have with :num_class.

XGBoost will do 2 :rounds of boosting, will evaluate performance on the training set itself (not a good practice, but this is just an example) passed via the :watches map, and if the evaluation metric doesn’t improve for 10 consecutive iterations it will stop training early because of the :early-stopping parameter.

Calling fit over the defined data and params trains an XGBoost model from scratch and returns a Booster instance. We can use the Booster for prediction, persist it to disk, or feed it as a baseline to another XGBoost model.

Though we passed the training set itself to check training performance, we will check accuracy on the test set we prepared previously. predict needs a model and the data we want to predict on as a DMatrix, and returns a vector of predictions.
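As a sketch, assuming model is the Booster returned by fit and test-set is the {:x ... :y ...} map for the test split:

```clojure
(boost/predict model (boost/dmatrix (:x test-set)))
;; => one predicted class per test row, e.g. [0.0 2.0 1.0 ...]
```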

Let’s wrap everything up into a -main function so that we can run the whole analysis from both the REPL and the command line.
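Below is a sketch of how such a -main could look, tying together the helpers defined above; the split size of 120 rows and the accuracy calculation at the end are my own choices, and the clj-boost call shapes should again be checked against the README.

```clojure
(defn -main
  [& args]
  (let [iris                 (generate-iris iris-path)
        [train test]         (train-test-split iris 120)
        ;; map a vector of functions over each split, instead of
        ;; mapping a single function over the rows
        [train-set test-set] (map (fn [rows]
                                    (zipmap [:x :y]
                                            (map #(mapv % rows)
                                                 [transform-x transform-y])))
                                  [train test])
        model       (boost/fit (boost/dmatrix train-set)
                               {:params {:eta       0.00001
                                         :objective "multi:softmax"
                                         :num_class 3}
                                :rounds 2
                                :watches {:train (boost/dmatrix train-set)}
                                :early-stopping 10})
        predictions (boost/predict model (boost/dmatrix (:x test-set)))
        hits        (count (filter true? (map == predictions (:y test-set))))]
    (println "Predictions:" predictions)
    (println "Accuracy:" (double (/ hits (count predictions))))))
```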

We generate the split set, then from it we use a nice little map trick: we map multiple functions over one collection and not the other way around. Then we train the XGBoost model and we get predictions.

The results you get might be slightly different from mine, since we didn’t fix a seed for the random number generator.

Flexibility vs Production

You probably didn’t notice it, but though the code we wrote was simple and flexible enough to be used for analysis and experimentation, it is also production ready. If you do lein run from the root of the project in your command line you’ll get the same results as doing (-main) from the REPL. So it would be trivial to add functionality to the program; for instance, you might want to feed it new data when the data changes and retrain your model.

If we did something like this with Python in a Jupyter notebook we would probably have assignments all over the place and imperative code that must be rewritten from scratch to make it somewhat production ready. And I haven’t even mentioned that, if performance during data munging becomes an issue, by composing transducers we can get parallelization almost for free.

Now you can go play a little bit with clj-boost. Don’t forget that there are docs available and that this is still a work in progress, so please let me know if there are issues, ideas, or ways to make it better, or even just that you’re using it and you’re happy with it.
