edX Scalable Machine Learning course review

For me, one of the great things about modern society is that high-quality education is more accessible than ever before. While the UK now has significant university fees (though still a lot cheaper than many US universities), since Stanford’s AI-Class and ML-Class offerings, sites like Coursera and edX have offered a wide range of online courses in association with major universities (including Stanford, Harvard, MIT, Berkeley and many more). Generally a ‘verified’ certificate is offered for a fee, but you can earn a basic certificate for nothing by following the same video lectures and completing the same weekly assignments.

For someone who is always keen to learn more or learn something new, this is an excellent opportunity. You can learn about subjects outside your career path without worrying whether there’s any financial benefit to doing so, and you can expand your knowledge in your chosen subject in a well-planned and demonstrable way. What better way to demonstrate your wish to excel and to keep your skills updated than by taking university undergraduate and postgraduate level courses in subjects that keep expanding your knowledge?

The latest course I’ve taken is Scalable Machine Learning, a 5-week course that (kind of) follows on from Introduction to Big Data with Apache Spark, which was also 5 weeks. Both were run with Berkeley on the edX.org site, and both used Python and Apache Spark via a virtual machine that provided an IPython notebook server for working on the assignments. The format on edX is a set of video lectures, broken into chunks of 10 minutes or less, accompanied by transcripts, slides and sometimes links to external sites for more information. Each week there are auto-graded assignments to work on, with a range of questions covering the information in the lectures and the results from the assignments.

The course had a Week 0 set of tasks to set up the necessary environment, identical to the Introduction to Big Data setup. These were easy to follow, and I quickly got an environment together. It assumes little system-configuration knowledge and uses a preconfigured virtual machine. You won’t know how to set up a Spark cluster after this, but you’ll have a VM you can explore a smaller volume of data with. This is probably about right for the course, although it would have been nice to at least reference some materials for creating a full-scale cluster – I’ve been getting my own server set up to run Hadoop, HDFS and Spark, and have had to look elsewhere for instructions on creating a physical server installation.

The proper weeks assume no knowledge of Machine Learning, but do assume a good standard of Python programming. If, like me, your primary programming language has been something else, so your knowledge of things like the NumPy library is patchy, then additional time is required to investigate the relevant libraries. I should say I’ve written a CMS in Python, albeit a while ago, so my Python skills were rusty rather than non-existent, but I’d had little exposure to NumPy and the matrix-manipulation functions required here, and occasionally I had to dig around for the correct way to do something. Most of the time this didn’t really get in the way, and between the Introduction course and this one I’ve gained a lot of confidence with Python’s take on lambda functions and list comprehensions, along with NumPy knowledge and PySpark-specific details.
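For anyone brushing up in the same way, these are the two idioms that crop up constantly in the assignments – a quick illustrative snippet of my own, not taken from the course:

```python
# Lambda functions and list comprehensions, as used throughout the PySpark labs.
squares = [x ** 2 for x in range(5)]      # list comprehension: [0, 1, 4, 9, 16]
double = lambda x: 2 * x                  # anonymous function, the kind passed to map()
doubled = [double(x) for x in squares]    # [0, 2, 8, 18, 32]
```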

Week 1 explores a range of core details, such as complexity issues (Big O notation), a bit of an introduction to NumPy and matrix manipulation in Python, an overview of Apache Spark and a run-through of some core Machine Learning concepts like the difference between supervised and unsupervised learning. This was the foundation week, bringing everyone up to speed on the array of concepts that were going to be used in the rest of the course. Much of this was already familiar from other courses I’ve taken on Machine Learning, but for those who haven’t taken other courses it seemed like a clear and reasonably scoped explanation of the key concepts. I mainly learned some more about the NumPy library here.
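As a flavour of the NumPy ground covered, here’s the kind of thing Week 1 gets you comfortable with (a toy snippet of mine, not a course exercise):

```python
import numpy as np

# Basic array creation, plus the difference between elementwise
# and proper matrix operations - a recurring Week 1 theme.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, -1.0])

elementwise = A * A          # multiplies entry by entry
matrix_prod = A.dot(v)       # matrix-vector product: [-1., -1.]
norm = np.linalg.norm(v)     # Euclidean norm, sqrt(2)
```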

Week 2 introduces Apache Spark, giving more detail on why it’s useful, how it aims to outperform the likes of Hadoop and what makes it different to processing the data at small scale in something like R. It introduces the core data structure – RDDs – and the concepts of transformations (which aren’t run immediately) and actions (which trigger the preceding steps), culminating in implementing some logic to process text data. This was a duplicate of one of the weeks from Introduction to Big Data, with the same assignment, so those taking the two courses as a sequence had an easy week here. After this you know how to build up a reasonably complex piece of logic through a pipeline of transformations, how to cache intermediate results, and how Spark splits, collects and processes the data.
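To give a feel for the transformation/action split, here’s a sketch along the lines of the text-processing logic – the file name and details are my own placeholders, not the course assignment:

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

lines = sc.textFile("sometext.txt")             # transformation: nothing runs yet
words = lines.flatMap(lambda l: l.split())      # transformation
pairs = words.map(lambda w: (w, 1))             # transformation
counts = pairs.reduceByKey(lambda a, b: a + b)  # transformation
counts.cache()                                  # mark the intermediate result for reuse

# An action finally triggers the whole pipeline above:
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
```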

Week 3 moved on to the Machine Learning side, looking primarily at Linear Regression. Here I was impressed by the quality of the course, as it not only explained how to use the tools, but started with the principles of Linear Regression and closed-form solutions, solving some simpler problems without the Spark libraries. It then moved to how the solution can be computed in a way that scales, giving an understanding of how Spark can solve the problem in a distributed manner, and solved some more problems using plain Spark code. Having demonstrated the principles, it moved to the libraries, ensuring an insight into what Spark is doing and how it is doing it, not just how to use its libraries. This also introduced the concepts of grid search and hyperparameter tuning. While there are other parameter-tuning strategies I would have liked to see discussed further, this seemed a reasonable scope for the course. Similarly, while it didn’t really explain in detail how quadratic features may help in certain circumstances, it did explore how to generate them and demonstrated that they had an impact on prediction accuracy for the data being explored.
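The closed-form solution in question is the normal equation, w = (XᵀX)⁻¹Xᵀy, which you can try on toy data in a few lines of NumPy before worrying about scale – my own illustration, not course code:

```python
import numpy as np

# Closed-form least squares on toy data: w = (X^T X)^-1 X^T y.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])        # first column of ones gives the intercept
y = np.array([2.1, 3.9, 6.2])

# solve() is numerically preferable to forming the inverse explicitly
w = np.linalg.solve(X.T.dot(X), X.T.dot(y))
predictions = X.dot(w)
```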

Week 4 covered Logistic Regression, specifically looking at Click-Through Rate prediction. Here the course moved onto something quite relevant for many interested in data analysis. It covered how to handle categorical (non-numeric) input variables, training a classifier to indicate whether something meets a criterion, and how to improve your model. Again, it started with the mathematical underpinnings of Logistic Regression, then walked through the data-processing steps involved in more real-world work, such as feature hashing, One Hot Encoding, hyperparameter tuning and interpreting the results using an ROC plot. Prediction accuracy was covered not just by getting a value, but by explaining how you might need to weigh False Positives against False Negatives (or vice versa), and how to measure whether your model is any good by comparing it to simple baselines (like predicting the most likely outcome). This week again went into good depth on the theory while covering key aspects that are sometimes glossed over but are necessary for getting a good working model.
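One Hot Encoding is simple enough to sketch in plain Python – this is my own toy illustration of the idea, not the course’s implementation:

```python
# Map each observed (feature, value) pair to a column index, then encode
# a sample as a 0/1 vector. Toy feature values, not the course's CTR data.
raw = [('animal', 'cat'), ('colour', 'red'), ('animal', 'dog')]
ohe_dict = {pair: i for i, pair in enumerate(sorted(set(raw)))}

def one_hot(features, mapping):
    """Return a dense 0/1 vector with a 1 for each observed pair."""
    vec = [0] * len(mapping)
    for f in features:
        if f in mapping:          # pairs unseen in training are simply dropped
            vec[mapping[f]] = 1
    return vec

print(one_hot([('animal', 'dog'), ('colour', 'red')], ohe_dict))  # [0, 1, 1]
```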

Week 5 wrapped up the course with some basic concepts from Neuroscience, why its data falls into the Big Data category, and why dimensionality reduction is so necessary for interpreting that data. Having outlined the domain knowledge required (with some nice video illustration of the images to be processed), it moves into dimensionality reduction, walking through the mathematical functions to build up Principal Component Analysis from basic matrix operations and applying it to a dataset to process the images in a number of ways. It takes what could be a set of matrix calculations whose purpose is relatively unclear and helps you understand how they perform an efficient aggregation, and how PCA allows massively complex data to be reduced to something that can train models or be visualised. This probably demonstrated PCA as more useful and understandable than any other explanation of it I’ve seen.
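Stripped of the distributed machinery, the PCA being built up is just centring, a covariance matrix and an eigendecomposition – a small NumPy sketch on random toy data (not the course’s neuroscience images):

```python
import numpy as np

# PCA from basic matrix operations: centre, covariance, top eigenvectors.
np.random.seed(0)
data = np.random.randn(100, 5)               # 100 samples, 5 features

centred = data - data.mean(axis=0)
cov = centred.T.dot(centred) / centred.shape[0]

eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: covariance is symmetric
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # two principal components

reduced = centred.dot(top2)                  # 100 samples down to 2 dimensions
```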

Throughout, the lab work was at times challenging, but generally when I got stuck it was only because I was tired and hadn’t read something properly – coming back with fresher eyes was usually the solution. There’s a lot to take in on the course, and the knowledge from it would be useful in Data Mining and Machine Learning. I was very impressed with the standard of the course, its scope and pace, and how it demonstrated applying the concepts to real data. My main complaint is that it could have covered more of the ETL side – data extraction, transformation, loading (and cleaning) – which it touches on early but could expand. On the whole, though, it’s a very high standard of course, and certainly the match of many or most paid university modules.
