Data science with Clojure

My journey learning Clojure as a Lisp, alongside ML/DL/NLP techniques.

I originally intended to use this blog to write about my Common Lisp journey. Along the way, though, I found Clojure to be a much easier Lisp to get started with. That probably has more to do with the larger community and the more readily available documentation and projects to learn from. Besides, my intention in learning a Lisp was to learn functional techniques, and Clojure seems to enforce those on me better than Common Lisp does. I still love Common Lisp more than Clojure: the SBCL REPL feels definitely superior to the Clojure REPL, though the Clojure REPL is already quite fabulous, and I get the same REPL-driven workflow I learnt in Common Lisp. Hopefully, once I get more comfortable with Lisp in general, I'll be able to add more value to the Common Lisp community as well.

I am pretty new to Clojure. I've been lurking around their Slack, their Zulip and their forums for about 4-6 months now, and I've not written much Clojure code except for one ClojureScript project when I first began trying Clojure out. Between ClojureScript, Babashka and Clojure, the entire spectrum of application development is covered without having to learn a lot of different stuff. And as a web dev, I found it best to start with ClojureScript because it felt like known territory. I still miss Django when I have to write reasonably sized code, but I guess I must learn to think differently here.

Since I am spending a portion of my time learning data science, ML and DL techniques, I thought it might be best to learn Clojure through these. I can manage some level of ML and DL with Python, and I am quite comfortable with data analysis using Pandas. So last weekend, I started doing data analysis in Clojure, and there is a fantastic Pandas replacement in tablecloth. Well, I am using the word replacement quite loosely here; it is entirely possible that professional data scientists can point out what tablecloth lacks compared to Pandas. From where I stand, it seems it can do everything I need.
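Even just loading the data and poking at it feels a lot like the Pandas workflow. Here is a rough sketch of that first step; the file path and the :key-fn option are my assumptions about how the CSV gets loaded, not part of the snippet further down.

(require '[tablecloth.api :as tc])

;; Load the Kaggle Titanic training CSV (path is a placeholder),
;; turning string column names into keywords like :Age and :Sex.
(def titanic (tc/dataset "train.csv" {:key-fn keyword}))

(tc/shape titanic)        ;; number of rows and columns, roughly df.shape
(tc/column-names titanic) ;; roughly df.columns
(tc/info titanic)         ;; per-column summary, roughly df.describe()
(tc/head titanic 5)       ;; first five rows, like df.head()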

To test this out, I took the Titanic dataset and did a simple set of operations: drop some columns, replace some missing values with the mean, encode some columns as numbers, and so on. While it took me about half a day to write this (I might have done it in a couple of minutes in Pandas), I like how concise and simple the resulting code turned out to be. Here is a snippet of what I did.

(ns titanic
  (:require [tablecloth.api :as tc]
            [tech.v3.datatype.functional :as fns]
            [tech.v3.dataset :as tds]))

(defn prepare
  "Clean up a Titanic dataset so every remaining column is numeric."
  [ds]
  (as-> ds $
    ;; Encode the categorical columns as numbers.
    (tds/categorical->number $ [:Sex] [["male" 1] ["female" 2]])
    (tds/categorical->number $ [:Embarked] [["S" 1] ["C" 2] ["Q" 3]])
    ;; Fill missing ages with the column mean.
    (tc/replace-missing $ :Age :value fns/mean)
    ;; Drop columns I am not using, and rows still missing :Embarked.
    (tc/drop-columns $ [:Cabin :Name :Ticket])
    (tc/drop-missing $ :Embarked)))

The code is mostly my attempt to understand how to do these things with tablecloth, so I just chose the quickest way out. Once I implement a simple regression algorithm, I'll return to the preparation step so I can get better predictions from more advanced feature extraction.

While we can argue that Pandas does this too, most examples I've seen aren't as concise as this. And I did write each step independently, then thread them into a single composition, and finally turn that into a function I can call on both the train and test datasets. I always had trouble coming up with composed functions in Python. It is entirely possible that I've spent too much time in Python writing imperative or OOP code, and there is nothing in Python that stops me from staying functional either. It is just that old habits do die hard :-)
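To give a sense of what I mean, here is roughly how I expect to call it. This is only an illustrative sketch: the file names are placeholders for wherever the Kaggle CSVs happen to live.

;; Illustrative usage: train.csv and test.csv are assumed to be the
;; Kaggle Titanic files sitting in the project directory.
(def train-ds (prepare (tc/dataset "train.csv" {:key-fn keyword})))
(def test-ds  (prepare (tc/dataset "test.csv"  {:key-fn keyword})))

;; Both datasets now carry only the columns kept by prepare,
;; with :Sex and :Embarked encoded as numbers and :Age filled in.
(tc/column-names train-ds)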

The code is not yet on any repo. I'll update the post once I push it to my Git. And hopefully, I can help more people follow suit and write better Pandas-style code!