machine learning – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

Adventures in Kaggle: Forest Cover Type Prediction
https://blog.scraperwiki.com/2015/04/adventures-in-kaggle-forest-cover-type-prediction/ (Sun, 26 Apr 2015)

Regular readers of this blog will know I've read quite a few machine learning books; now it's time to put that learning into action. We've done some machine learning for clients, but I thought it would be good to do something I could share. The Forest Cover Type Prediction challenge on Kaggle seemed to fit the bill. Kaggle is the self-styled home of data science; it hosts a variety of machine learning oriented competitions, ranging from introductory, knowledge-building ones (such as this) to commercial ones with cash prizes for the winners.

In the Forest Cover Type Prediction challenge we are asked to predict the type of tree found on 30x30m squares of the Roosevelt National Forest in northern Colorado. The features we are given include the altitude at which the land is found, its aspect (direction it faces), various distances to features like roads, rivers and fire ignition points, soil types and so forth. We are provided with a training set of around 15,000 entries where the tree types are given (Aspen, Cottonwood, Douglas Fir and so forth) for each 30x30m square, and a test set for which we are to predict the tree type given the “features”. This test set runs to around 500,000 entries. This is a straightforward supervised machine learning “classification” problem.

The first step must be to poke about at the data, and I did a lot of this in Tableau. The feature most obviously providing predictive power is the elevation, or altitude, of the area of interest. This is shown in the figure below for the training set: we see Ponderosa Pine and Cottonwood predominating at lower altitudes, transitioning to Aspen, Spruce/Fir and finally Krummholz at the highest altitudes. Reading Wikipedia, we discover that Krummholz is not actually a species of tree, but rather something that happens to trees of several species in the cold, windswept conditions found at high altitude.

[Figure: elevation distribution in the training set, by cover type]

Data inspection over, I used the scikit-learn library in Python to predict tree type from the features. scikit-learn makes it ridiculously easy to jump between classifier types: the interface for each classifier is the same, so once you have one running, swapping in another is a matter of a couple of lines of code. I tried out a couple of variants of Support Vector Machines, decision trees, k-nearest neighbours, AdaBoost and the extremely randomised trees ensemble classifier (ExtraTrees). This last was best at classifying the training set.
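
To make that concrete, here is a minimal sketch (not the competition code, and using a generated toy dataset rather than the Kaggle files) of how scikit-learn's uniform interface lets you swap classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy stand-in for the competition data: 1000 samples, 10 features, 4 classes.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every classifier exposes the same fit/predict/score interface,
# so swapping one for another is a one-line change.
for clf in [ExtraTreesClassifier(n_estimators=100),
            AdaBoostClassifier(),
            KNeighborsClassifier(),
            SVC()]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```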

The challenge is in mangling the data into the right shape and selecting the features to use; this is the sort of pragmatic knowledge learnt by experience rather than book-learning. As a long-time data analyst I took the opportunity to try something: essentially, my analysis programs would only run when the code had been committed to git source control, and the SHA of the commit, its unique identifier, was stored with the analysis. This means that I can return to any analysis output and recreate it from scratch. Perhaps unexceptional for those with a strong software development background, but a small novelty for a scientist.
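
A small sketch of the idea (my reconstruction, not the actual analysis scripts): a run-guard that refuses to run on an uncommitted working tree and records the commit SHA alongside the output.

```python
import subprocess

def current_commit_sha():
    # Fail if there are uncommitted changes, so every result is reproducible.
    dirty = subprocess.run(["git", "status", "--porcelain"],
                           capture_output=True, text=True, check=True).stdout
    if dirty.strip():
        raise RuntimeError("Commit your changes before running the analysis")
    sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True).stdout
    return sha.strip()

if __name__ == "__main__":
    sha = current_commit_sha()
    with open("analysis_output.txt", "w") as f:
        f.write(f"commit: {sha}\n")
        # ... analysis results would be written here ...
```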

Using a portion of the training set to do an evaluation, it looked like I was going to do really well on the Kaggle leaderboard, but on first uploading my competition solution things looked terrible! It turns out this is a common experience, and it is a result of the relative composition of the training and test sets. Put crudely, the test set is biased to higher altitudes than the training set, so a classifier trained on the unmodified training set gives poorer results than expected from measurements on a held-back part of the training set. You can see the distribution of elevation in the test set below, and compare it with the training set above.

[Figure: elevation distribution in the test set]

We can fix this problem by biasing the training set to more closely resemble the test set; I did this on the basis of the elevation (a sketch of the idea follows the figure below). This eventually got me to rank 430 on the leaderboard, shown in the figure below. We can see here that I'm somewhere up the long, shallow plateau of performance. There is a breakaway group of about 30 participants doing much better, and at the bottom there are people who perhaps made large errors in analysis but got rescued by the robustness of machine learning algorithms (I speak from experience here!).

[Figure: leaderboard scores, showing my position on the long plateau]
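
For the record, a minimal sketch of the elevation-based biasing (my reconstruction rather than the code I actually used, and it assumes pandas DataFrames with an Elevation column): resample the training rows so their elevation histogram matches the test set's.

```python
import numpy as np
import pandas as pd

def resample_to_match(train, test, column="Elevation", bins=30, random_state=0):
    # Histogram the test set's elevations and express each bin as a fraction.
    edges = np.histogram_bin_edges(test[column], bins=bins)
    test_counts, _ = np.histogram(test[column], bins=edges)
    test_frac = test_counts / test_counts.sum()
    # Assign each training row to a bin.
    train_bin = np.clip(np.digitize(train[column], edges) - 1, 0, bins - 1)
    train_frac = np.bincount(train_bin, minlength=bins) / len(train)
    # Weight each row by how under-represented its bin is in training
    # relative to test, then resample with replacement using those weights.
    weights = test_frac[train_bin] / np.maximum(train_frac[train_bin], 1e-12)
    return train.sample(n=len(train), replace=True,
                        weights=weights, random_state=random_state)
```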

There is no doubt some mileage in tuning the parameters of the different classifiers, and no doubt the winning entries use more sophisticated approaches. scikit-learn does pretty well out of the box, and tuning provides only marginal improvement. We observed this in our earlier machine learning work too.
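
For illustration, a sketch of what such tuning looks like with scikit-learn's grid search; the parameter grid here is illustrative, not the one used for the competition:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Toy data again; in practice this would be the (resampled) training set.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Exhaustively cross-validate every combination in the grid.
grid = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_features": ["sqrt", None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```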

I have mixed feelings about the Kaggle competitions. The data is nicely laid out, the problems are interesting and it's always fun to compete. They are a great way to dip your toes into semi-practical machine learning applications. The size of the awards means it doesn't make much sense to take part on a commercial basis.

However, the data are presented in a way that excludes the use of domain knowledge; the competitions are set up very much as machine learning challenges. Look down the list of competitions and see how many feature obfuscated data, likely for reasons of commercial confidence or to make a problem more "machine learning" and less amenable to domain knowledge. To a physicist this is just a bit offensive.

If you are interested in a slightly untidy, blow-by-blow account of my coding then it is available here in a Bitbucket repo.

Book review: Mining the Social Web by Matthew A. Russell
https://blog.scraperwiki.com/2014/01/book-review-mining-social-web-matthew-russell/ (Mon, 20 Jan 2014)

The Twitter search and follower tools are amongst the most popular on the ScraperWiki platform, so we are looking to provide more value in this area. To this end I've been reading "Mining the Social Web" by Matthew A. Russell.

In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc.) but after the first couple of chapters, on Twitter and Facebook, it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web page scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduce MongoDB as a document storage system. The final chapter is a Twitter cookbook which includes lots of short recipes for simple Twitter-related activities but no further analysis. The coverage of each topic isn't deep but it is practical, introducing the key libraries for each task. And it's alive with suggestions for further work, and references to help with that.
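
As a quick illustration of TF-IDF (my own example using scikit-learn, rather than the book's code): terms that are frequent in one document but rare across the corpus score highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms
terms = vectorizer.get_feature_names_out()

# The top-scoring term in the first document is one that distinguishes it.
row = tfidf[0].toarray().ravel()
print(terms[row.argmax()], row.max())
```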

The examples in the book are provided as IPython Notebooks, which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can be either code or simple descriptive text. A code cell can be executed, and the output from the code appears in an output cell. These notebooks are a really nice way to present example code, since the code has some context. The virtual machine approach is also a great innovation, since configuring Python libraries and the IPython server itself, in a platform-agnostic manner, is really difficult, and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself; almost too easy, in fact: I found myself clicking blindly through some of it. Potentially the book could have been presented simply as an IPython notebook. That is likely not economically practical, but it would be nice to collect the links to further reading there, where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unerringly quickly, as he did for many other readers. I also discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!

Mining the Social Web covers some material I had not come across in my earlier machine learning and data mining reading. There are a couple of chapters containing material on graph theory, using data from Facebook and GitHub. In the way of benefitting from reading about the same material in different places, Russell highlights that clustering and de-duplication are of course facets of the same subject.

I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information, and I have occasional recourse to try to look things up there. I've struggled to see the benefit of MongoDB over a SQL database, particularly having watched two of my colleagues spend a morning working out how to do what would be a simple SQL join in MongoDB. Mining the Social Web has made me wonder about giving MongoDB another chance.
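
For flavour, a minimal PyMongo sketch (my example, not the book's, and it assumes a MongoDB server running locally) of storing tweets as JSON-like documents and querying them:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
tweets = client.social_web.tweets   # database and collection names are made up

# No schema to declare up front; the nested JSON goes in as-is.
tweets.insert_one({
    "id_str": "12345",
    "text": "Testing MongoDB as a tweet store",
    "user": {"screen_name": "example_user", "followers_count": 42},
})

# Nested fields are queried with dot notation.
for tweet in tweets.find({"user.screen_name": "example_user"}):
    print(tweet["text"])
```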

The penultimate chapter is a discussion of the semantic web, introducing both microformats and RDF technology, although the discussion is much less concrete than in earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema; to give an example, the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary either to use custom scrapers on a page-by-page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web: a shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU Newsreader project, on which we are collaborators, and which seeks to generate this type of semantic markup for news articles using natural language processing.
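
To make the microformat point concrete, here is a small sketch (my own illustration, not from the book) of extracting coordinates from the geo microformat with a generic parser, no page-specific scraper required:

```python
from bs4 import BeautifulSoup

# An agreed schema: any page using the geo microformat marks coordinates
# with these class names, so one parser works for all such pages.
html = """<span class="geo">
            <span class="latitude">52.48</span>,
            <span class="longitude">-1.89</span>
          </span>"""

geo = BeautifulSoup(html, "html.parser").find(class_="geo")
lat = float(geo.find(class_="latitude").get_text())
lon = float(geo.find(class_="longitude").get_text())
print(lat, lon)
```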

Overall, definitely worth reading. We're interested in extending our tools for social media, and with this book in hand I'm confident we can do it, and be aware of more possibilities.

Book review: Data Mining – Practical Machine Learning Tools and Techniques by Witten, Frank and Hall
https://blog.scraperwiki.com/2014/01/book-review-data-mining-practical-machine-learning-tools-techniques-witten-frank-hall/ (Wed, 01 Jan 2014)

I’ve been doing more reading on machine learning, this time in the form of Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank and Mark A. Hall. This comes by recommendation of my academic colleagues on the Newsreader project, who rely heavily on machine learning techniques to do natural language processing.

Data mining is about finding structure in data, and the algorithms for doing this are found in the field of machine learning. The classic example is the Iris flower dataset. This dataset contains measurements of the parts of flowers from three different species of Iris; the challenge is to build a system which classifies a flower to its species from its measurements. More practical examples are found in the diagnosis of machine faults, credit assessment, detection of oil slicks, customer support analysis, marketing and sales.
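
To make the classic example concrete, here is a minimal sketch in Python with scikit-learn (the book itself works in Weka rather than Python) of classifying Iris flowers from their measurements:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The features are four flower measurements; the target is the species.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("predicted species:", iris.target_names[clf.predict(X_test[:3])])
```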

Previously I’ve reviewed Machine Learning in Action by Peter Harrington. Data Mining is a somewhat different book. The core contents are quite similar: background to machine learning, evaluating your results and a run through the core algorithms. Machine Learning in Action is a pretty quick run through the field touching on many subjects, with toy demonstrations built from scratch in Python. Data Mining, running to almost 600 pages, is a much more thorough reference. There is a place for both types of book, even on the same bookshelf.

Data Mining is written by three members of the University of Waikato's Computer Science Department and is based around the Weka machine learning system developed there. Weka is a complete framework, written in Java, which implements the algorithms described in the book as well as some others. Weka can be accessed via the command line or using a GUI. As well as the machine learning algorithms there are systems for preparing data, and for evaluating and visualising results; a collection of well-known demonstration data sets is included. I've no reason to doubt the quality of the implementations in Weka; the GUI interface is functional, if occasionally puzzling and not particularly slick. The book stands alone from the Weka framework, but the framework provides a good playground in which to try out the techniques discussed in the book, and Weka seems entirely suitable for conducting serious analysis. This approach is in contrast to that of Harrington in Machine Learning in Action, who provides toy implementations of algorithms in Python.

The first two parts of the book provide an overview of machine learning, followed by a more detailed look at how the key algorithms are implemented. The third section is dedicated to Weka; the first two sections refer to it but do not rely on it. The third section is divided into a discussion of Weka, covering all its key features, and then a tutorial. I found this a bit confusing, since the first part has the air of a tutorial but isn't, and the tutorial part keeps referring back to the overview section for its screenshots.

Coming to the book with some machine learning knowledge already, the things I learned from it were:

  • better methods, and subtleties in measuring the performance of machine learning algorithms;
  • the success of the one-rule algorithm, essentially a decision tree which gets the maximum benefit from a single rule. It turns out such an approach is surprisingly effective, and only bettered a little, if at all, by more sophisticated algorithms (see the sketch after this list);
  • getting enough, clean data to do machine learning is often a problem;
  • where to learn more!
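
Here is a compact sketch of the one-rule (1R) idea (my own illustration with toy data, not the book's Weka implementation): for each feature, map each of its values to the majority class, then keep the single feature whose rule makes the fewest training errors.

```python
from collections import Counter, defaultdict

def one_rule(X, y):
    best = None
    for feature in range(len(X[0])):
        # Count class labels per value of this feature...
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[row[feature]][label] += 1
        # ...and map each value to its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[feature]] != label for row, label in zip(X, y))
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best   # (feature index, value -> class mapping, training errors)

# Toy data: (outlook, windy) -> play or stay.
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
y = ["play", "play", "play", "stay"]
print(one_rule(X, y))
```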

The first edition of this book was published in 1999; my review is of the third edition, and the book does show some signs of age. Machine Learning in Action was written as a response to a poll published at the International Conference on Data Mining 2006 on the 10 most important machine learning algorithms (see the paper here). Whilst Data Mining mentions this survey, it is as something of an afterthought, and the authors seem bemused by the inclusion of the PageRank algorithm used by Google to rank web pages in search results. They mention the MOA framework for data stream mining, which focuses on techniques for large datasets, although they do not discuss it in any detail.

In summary: a well-written, well-structured and readable book on machine learning algorithms with demonstrations based on an extensive machine learning framework. Definitely one to read and come back to for reference.

Book Review: Machine Learning in Action by Peter Harrington
https://blog.scraperwiki.com/2013/04/book-review-machine-learning-in-action-by-peter-harrington/ (Wed, 17 Apr 2013)

Machine learning is about prediction, and prediction is a valuable commodity. This sounds pretty cool, and definitely the sort of thing a data scientist should be into, so I picked up Machine Learning in Action by Peter Harrington to get an overview of the area.

Amongst the examples covered in this book are:

  • Given that a customer bought these items, what other items are they likely to want?
  • Is my horse likely to die from colic given these symptoms?
  • Is this email spam?
  • Given that these representatives have voted this way in the past, how will they vote in future?

In order to make a prediction, machine learning algorithms take a set of features and a target for a training set of examples. Once the algorithm has been trained, it can take new feature sets and make predictions based on them. Let’s take a concrete example: if we were classifying birds, the birds’ features would include the weight, size, colour and so forth and the target would be the species. We would train the algorithm on an initial set of birds where we knew the species, then we would measure the features of unknown birds and submit these to the algorithm for classification.

In this case, because we know the target – a species of bird – the algorithms we use are referred to as "supervised learning." This contrasts with "unsupervised learning," where the target is unknown and the algorithm seeks to make its own classification. This would be equivalent to the algorithm creating species of birds by clustering those with similar features, as in the sketch below. Classification is the prediction of categories (e.g. eye colour, like/dislike); regression, by contrast, is used to predict the value of continuous variables (e.g. height, weight).
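
A sketch of the unsupervised side of that contrast (my illustration, using the Iris measurements with their species labels withheld):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data   # flower measurements only; no species labels supplied

# Unsupervised: ask for three groups and let the algorithm invent them.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clusters[:10])   # cluster ids the algorithm made up, not species names
```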

Machine Learning in Action is divided into four sections, covering classification, regression, unsupervised learning and "additional tools", the last of which includes algorithms for dimension reduction and MapReduce, a framework for parallelisation. Dimension reduction is the process of identifying which features (or combinations of features) are essential to a problem.
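
As a quick illustration of dimension reduction (my example, not the book's): principal component analysis compresses the four Iris measurements into two derived features that capture most of the variation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
# Project the four measurements onto the two directions of greatest variance.
X_2d = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_2d.shape)   # (150, 4) -> (150, 2)
```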

Each section includes Python code that implements the algorithms under discussion, applied to some toy problems. This gives the book the air of Numerical Recipes in FORTRAN, which is where I cut my teeth on numerical analysis. The mixture of code and prose is excellent for understanding exactly how an algorithm works, but it's better to use a library implementation in real life.

The algorithms covered are:

  • Classification – k-Nearest Neighbours, decision trees, naive Bayes, logistic regression, support vector machines, and AdaBoost;
  • Regression – linear regression, locally weighted linear regression, ridge regression, tree-based regression;
  • Unsupervised learning – k-means clustering, the Apriori algorithm, FP-growth;
  • Additional tools – principal component analysis and singular value decomposition.

Prerequisites for this book are relatively high: it assumes fair Python knowledge, some calculus, probability theory and matrix algebra.

I've seen a lot of mention of MapReduce without being clear what it was; now I am clearer: it is a simple framework for carrying out parallel computation. Parallel computing has been around for quite some time; the problem has always been designing algorithms that accommodate parallelisation (i.e. that allow problems to be broken into pieces which can be solved separately and then recombined). MapReduce doesn't solve this problem, but it gives a recipe for what is required to run on a commodity compute cluster.
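
To make the recipe concrete, here is a toy, single-process sketch of the MapReduce pattern (no cluster and no Hadoop, just the shape of the computation): mappers emit key/value pairs independently, and the reducer combines values per key.

```python
from collections import defaultdict

def map_phase(document):
    # Each mapper works on its own piece of the data in isolation...
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # ...and the reducer recombines the results, grouped by key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))   # e.g. {'the': 3, 'fox': 2, ...}
```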

As Harrington says: do you need to run MapReduce on a cluster to solve your data problem? Unless you are an operation on the scale of Google or Facebook, probably not. Current commodity desktop hardware is surprisingly powerful, particularly when coupled with subtle algorithms.

This book works better as an eBook than on paper, partly because the paper version is black and white and some figures require colour. The one drawback is that the programming listings are often images rather than text, so they don't rescale and the text in them remains small.
