API – ScraperWiki (https://blog.scraperwiki.com) – Extract tables from PDFs and scrape the web

Digging Olympic Data at Londinium MMXII https://blog.scraperwiki.com/2012/07/digging-olympic-data-at-londinium-mmxii/ Tue, 24 Jul 2012 09:50:23 +0000

This is a guest post by Makoto Inoue, one of the organisers of this weekend’s Londinium MMXII hackathon.

The Olympics! Only a few days to go until seemingly every news camera on the planet is pointed at the East End of London, for a month of sporting coverage. But for data diggers everywhere, this is also a gigantic opportunity to analyse and visualise whole swathes of sporting data, as well as create new devices and apps to amplify, manage and make sense of the data in interesting ways.

Remapping past Olympic results into London 2012 schedule to predict the medal ranking leader board

I’m organising the Londinium MMXII Hackathon, which happens the day after the opening of the Olympics, so that participants can do cool hacks using real-time data. But while you can use Twitter and Facebook to gather social buzz, or TfL, Google Maps and Foursquare to do geo mashups, it turns out the one dataset we’re missing is real-time game results. I spent a long time trying to find out whether there are publicly available data APIs, but in the end it looked like we were out of luck!

Out of luck, that is, until we found out about ScraperWiki. Rather than waiting for the data to come to us, ScraperWiki lets us go and grab the freshest data ourselves – after all, there will be tons of news sites publishing the Olympic schedule, and many (such as the BBC) are well structured enough to reliably scrape. Since the BBC publishes the schedule (and, from the look of it, the results) of each event, including, most importantly, the exact time of each sport, we can easily set periodic scheduler jobs to scrape the latest data as it is announced. Perfect!
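Here’s a minimal sketch of what such a periodic scraper might look like in ScraperWiki’s Python environment. The BBC URL and the CSS selector below are hypothetical placeholders – inspect the real schedule pages before relying on them:

import scraperwiki
import lxml.html

# Hypothetical schedule page -- substitute the real BBC page you want to track
URL = "http://www.bbc.co.uk/olympics/2012/schedule"

html = scraperwiki.scrape(URL)
root = lxml.html.fromstring(html)

records = []
for row in root.cssselect("table.schedule tr"):   # selector is a guess; check the page's markup first
    cells = [td.text_content().strip() for td in row.cssselect("td")]
    if len(cells) >= 3:
        records.append({"time": cells[0], "sport": cells[1], "event": cells[2]})

# Saving with "time" and "event" as unique keys means an hourly cron re-run
# updates existing rows instead of duplicating them
scraperwiki.sqlite.save(unique_keys=["time", "event"], data=records, table_name="schedule")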

I’ve already written one scraper while writing an Olympic medal rivalry article, so feel free to copy it as your own starting point. Setting an hourly cronjob on ScraperWiki is normally a premium service, but the guys at ScraperWiki are so keen to see what data the Londinium MMXII hackers can come up with that they’re allowing all participants free access to set an hourly cron for the duration of the hackathon (thanks, ScraperWiki!). So come join the hackathon and let’s hack together!!

The state of Twitter: Mitt Romney and Indonesian Politics https://blog.scraperwiki.com/2012/07/the-state-of-twitter/ Mon, 23 Jul 2012 09:16:53 +0000

It’s no secret that a lot of people use ScraperWiki to search the Twitter API or download their own timelines. Our “basic_twitter_scraper” is a great starting point for anyone interested in writing code that makes data do stuff across the web. Change a single line, and you instantly get hundreds of tweets that you can then map, graph or analyse further.

So, anyway, Tom and I decided it was about time to take a closer look at how you guys are using ScraperWiki to draw data from Twitter, and whether there’s anything we could do to make your lives easier in the process!

Getting under the hood at scraperwiki.com

As anybody who’s checked out our source code will know, we store a truck-load of information about each scraper and each run it’s ever made in a MySQL database. Of the 9,727 scrapers that had run since the beginning of June, 601 accessed a twitter.com URL. (Our database only stores the first URL that each scraper accesses on any particular run, so it’s possible that there are scripts that accessed Twitter, but not as the first URL.)

Twitter API endpoints

Getting more specific, these 601 scrapers accessed a number of Twitter’s endpoints, usually via the official API. We removed the querystring from each of the URLs and then looked for commonly accessed endpoints.
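As a rough illustration, the querystring-stripping boils down to something like this Python 2 sketch – the urls list standing in for the first-request URLs pulled from our run log is a hypothetical placeholder:

import urlparse
from collections import Counter

urls = []   # hypothetical: the first-request URLs pulled from the run log

endpoints = Counter()
for url in urls:
    parsed = urlparse.urlparse(url)               # split scheme, host, path and querystring
    endpoints[parsed.netloc + parsed.path] += 1   # keep host + path, drop the querystring

for endpoint, count in endpoints.most_common(10):
    print count, endpoint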

It turns out that search.json is by far the most popular entry point for ScraperWiki users to get Twitter data – probably because it’s the method used by the basic_twitter_scraper that has proved so popular on scraperwiki.com. It takes a search term (like a username or a hashtag) and returns a list of tweets containing that term. Simple!
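For reference, this is roughly what a call to that endpoint looked like in mid-2012 – a hedged Python 2 sketch against the old, unauthenticated v1 Search API (long since retired by Twitter); the search term is just an example:

import urllib
import json

search_term = "#ddj"   # example term -- swap in your own hashtag or username

# The old v1 Search API endpoint as it worked in 2012 (no authentication required back then)
url = "http://search.twitter.com/search.json?" + urllib.urlencode({"q": search_term, "rpp": 100})
results = json.load(urllib.urlopen(url))

for tweet in results.get("results", []):
    print tweet["from_user"], "-", tweet["text"].encode("utf-8")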

The next most popular endpoint – followers/ids.json – is a common way to find interesting user accounts to then scrape more details about. And, much to Tom’s amusement, the third endpoint, with 8 occurrences, was http://twitter.com/mittromney. We can’t quite tell whether that’s a good or bad sign for his 2012 candidacy, but if it makes any difference, only one solitary scraper searched for Barack Obama.

Searches

We also looked at what people were searching for. We found 398 search terms in the scrapers that accessed the twitter search endpoint, but only 45 of these terms were called in more than one scraper. Some of the more popular ones were “#ddj” (7 scrapers), “occupy” (3 scrapers), “eurovision” (3 scrapers) and, weirdly, an empty string (5 scrapers).

Even though each particular search term was only accessed a few times, we were able to classify the search terms into broad groups. We sampled from the scrapers who accessed the twitter search endpoint and manually categorized them into categories that seemed reasonable. We took one sample to come up with mutually exclusive categories and another to estimate the number of scrapers in each category.

A bunch of scripts made searches for people or for occupy shenanigans. We estimate that these people- and occupy-focussed queries together account for between two- and four-fifths of the searches in total.

We also invented some smaller categories that seemed to account for a few scrapers each – like global warming, developer and journalism events, towns and cities, and Indonesian politics (!?) – but really it doesn’t seem like there’s any major pattern beyond the people and occupy scripts.

Family Tree

Speaking of the basic_twitter_scraper, we thought it would also be cool to dig into the family history of a few of these scrapers. When you see a scraper you like on ScraperWiki, you can copy it, and that relationship is recorded in our database.

Lots of people copy the basic_twitter_scraper in this way, and then just change one line to make it search for a different term. With that in mind, we’ve been thinking that we could probably make some better tweet-downloading tool to replace this script, but we don’t really know what it would look like. Maybe the users who’ve already copied basic_twitter_scraper_2 would have some ideas…

After getting the scraper details and relationship data into the right format, we imported the whole lot into the open source network visualisation tool Gephi, to see how each scraper was connected to its peers.
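If you want to try something similar, “the right format” is essentially just an edge list that Gephi’s spreadsheet importer understands. A minimal sketch – the parent/copy pairs below are only an illustrative sample (the real list came from our copy-relationship table), and the Source/Target headings are the column names Gephi expects:

import csv

# Illustrative sample of (parent, copy) relationships -- not the full dataset
edges = [
    ("twitterhistory-scraper", "twitter_earthquake_history_scraper"),
    ("twitter_earthquake_history_scraper", "basic_twitter_scraper"),
    ("basic_twitter_scraper", "basic_twitter_scraper_2"),
]

with open("scraper_edges.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])   # column headings Gephi's CSV importer recognises
    writer.writerows(edges)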

By the way, we don’t really know what we did to make this network diagram because we did it a couple weeks ago, forgot what we did, didn’t write a script for it (Gephi is all point-and-click..) and haven’t managed to replicate our results. (Oops.) We noticed this because we repeated all of the analyses for this post with new data right before posting it and didn’t manage to come up with the sort of network diagram we had made a couple weeks ago. But the old one was prettier so we used that :-)

It doesn’t take long to notice basic_twitter_scraper_2’s cult following in the graph. In total, 264 scrapers are part of its extended family, with 190 of those being descendants, and 74 being various sorts of cousins – such as scrape10_twitter_scraper, which was a copy of basic_twitter_scraper_2’s grandparent, twitter_earthquake_history_scraper (the whole family tree, in case you’re wondering, started with twitterhistory-scraper, written by Pedro Markun in March 2011).

With the owners of all these basic_twitter_scraper(_2)’s identified, we dropped a few of them an email to find out what they’re using the data for and how we could make it easier for them to gather it in the future.

It turns out that Anna Powell-Smith wrote the basic_twitter_scraper at a journalism conference and Nicola Hughes reused it for loads of ScraperWiki workshops and demonstrations as basic_twitter_scraper_2. But even that doesn’t fully explain the cult following, because people still keep copying it. If you’re one of those users, make sure to send us a reply – we’d love to hear from you!!

Explore

We’ve posted our code for this analysis on Github, along with a table of information about the 594 Twitter scrapers that aren’t in vaults (out of 601 total Twitter scrapers), in case you’re as puzzled as we are by our users’ interest in Twitter data.

Now here’s video of a cat playing a keyboard.

Scraping the protests with Goldsmiths https://blog.scraperwiki.com/2011/12/scraping-the-protests/ Fri, 09 Dec 2011 12:25:51 +0000

Google Map of Occupy protests around the world

Zarino here, writing from carriage A of the 10:07 London-to-Liverpool (the wonders of the Internet!). While our new First Engineer, drj, has been getting to grips with lots of the under-the-hood changes which’ll make ScraperWiki a lot faster and more stable in the very near future, I’ve been deploying ScraperWiki out on the frontline, with some brilliant Masters students at CAST, Goldsmiths.

I say brilliant because these guys and girls, with pretty much no scraping experience but bags of enthusiasm, managed (in just three hours) to pull together a seriously impressive map of Occupy protests around the world. Using data from no less than three individual wikipedia articles, they parsed, cleaned, collated and geolocated almost 600 protests worldwide, and then visualised them over time using a ScraperWiki view. Click here to take a look.

Okay, I helped a bit. But still, it really pushed home how perfect ScraperWiki is for diving into a sea of data, quickly pulling out what you need, and then using it to formulate bigger hypotheses, flag up possible stories, or gather constantly-fresh intelligence about an untapped field. There was this great moment when the penny suddenly dropped and these journalists, activists and sociologists realised what they’d been missing all this time.

But the penny also dropped for me, when I saw how suited ScraperWiki is to a role in the classroom. The path to becoming a data science ninja is a long and steep one, and despite the amazing possibilities fresh, clean and accountable data holds for everybody from anthropologists to zoologists, getting that first foot on the ladder is a tricky task. ScraperWiki was never really built as a learning environment, but with so little else out there to guide learners, it fulfils the task surprisingly well. Students can watch their tutor editing and running a scraper in realtime, right alongside their own work, right inside their web browser. They can take their own copy, hack it, and then merge the data back into a classroom pool. They can use it for assignments, and when the built-in documentation doesn’t answer their questions, there’s a whole community of other developers on there, and a whole library of living, working examples of everything from Cabinet Office tweeting to global shark activity. They can try out a new language (maybe even their first language) without worrying about local installations, plugins or permissions. And then they can share what they make with their classmates, tutors, and the rest of the world.

Guys like these, with tools like ScraperWiki behind them, are going to take the world by storm. I can’t wait to see what they cook up.

How to scrape and parse Wikipedia https://blog.scraperwiki.com/2011/12/how-to-scrape-and-parse-wikipedia/ Wed, 07 Dec 2011 14:50:04 +0000

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right hand side of the page).

The first job was for me to design a Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion, between the immovable object (American cavers who believe cave locations are secret) and the irresistible force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Aquamole_Pot").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the Wikipedia API and get the raw source code in XML form, without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

# Ask the MediaWiki API for the latest revision (the raw wikitext) of the page
params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())   # build the querystring by hand (see note below)
url = "http://en.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')   # one <rev> element per revision returned

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia api site doesn’t accept.

The result is:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = [http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]
}}
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}
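That conversion step is tiny. A minimal sketch, assuming the infobox_ukcave dictionary and title from the snippets above, and saving into the caveinfo table that the query further down uses:

import scraperwiki

row = dict(infobox_ukcave)    # the parsed infobox from the previous snippet
del row[0]                    # drop the 0: 'Infobox ukcave' template-name entry
for field in ("length_metres", "depth_metres"):
    try:
        row[field] = float(row[field])
    except (KeyError, ValueError, TypeError):
        row[field] = None     # some caves have no recorded length or depth

row["name"] = row.get("name") or title
scraperwiki.sqlite.save(unique_keys=["name"], data=row, table_name="caveinfo")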

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire, which is then inside Category:Caves_of_England, which is finally inside Category:Caves_of_the_United_Kingdom.

So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")

All of this adds up to my current scraper wikipedia_longest_caves, which extracts those infobox tables from caves in the UK and puts them into a form where I can sort them by length to create this table, based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres desc:

name                   location_area                   length_metres  depth_metres
Ease Gill Cave System  United Kingdom Yorkshire Dales  66000.0        137.0
Dan-yr-Ogof            Wales                           15500.0
Gaping Gill            United Kingdom Yorkshire Dales  11600.0        105.0
Swildon’s Hole         Somerset                        9144.0         167.0
Charterhouse Cave      Somerset                        4868.0         228.0

If I was being smart I could make the scraping adaptive – that is, only updating the pages that have changed since the last scrape – by using all the data returned by GetWikipediaCategoryRecurse(), but it’s small enough at the moment.

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t mind your updates coming only once every six months or so, which prevents you from getting any feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t want to be stuck with the naïve semantic web notion that the boundaries between entities are a simple, straightforward and general concept, rather than what they really are: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as some sort of combined listing. And that’s why you’ll get further treating knowledge domains as special cases.

Job advert: Lead programmer https://blog.scraperwiki.com/2011/10/job-advert-lead-programmer/ Thu, 27 Oct 2011 11:04:47 +0000

Oil wells, marathon results, planning applications…

ScraperWiki is a Silicon Valley style startup, in the North West of England, in Liverpool. We’re changing the world of open data, and how data science is done together on the Internet.

We’re looking for a programmer who’d like to:

  • Revolutionise the tools for sharing data, and code that works with data, on the Internet.
  • Take a lead in a lean startup, having good hunches on how to improve things, but not minding when A/B testing means axing weeks of code.

In terms of skills:

  • Be polyglot enough to be able to learn Python, and do other languages (Ruby, Javascript…) where necessary.
  • Be able to find one end of a clean web API and a test suite from another.
  • We’re a small team, so you’ll need to be able to do some DevOps on Linux servers.
  • Desirable – able to make igloos.

About ScraperWiki:

  • We’ve got funding (Knight News Challenge winners) and are in the brand new field of “data hubs”.
  • We’re before product/market fit, so it’ll be exciting, and you can end up a key, senior person in a growing company.

Some practical things:

We’d like this to end up a permanent position, but if you prefer we’re happy to do individual contracts to start with.

Must be willing to either relocate to Liverpool, or be able to work from home and travel to our office here regularly (once a week). So somewhere nearby is preferred.

To apply – send the following:

  • A link to a previous project that you’ve worked on that you’re proud of, or a description of it if it isn’t publicly visible.
  • A link to a scraper or view you’ve made on ScraperWiki, involving a dataset that you find interesting for some reason.
  • Any questions you have about the job.

Send them along to francis@scraperwiki.com with the word swjob2 in the subject (and yes, that means no agencies, unless the candidates apply themselves).

Pool temperatures, company registrations, dairy prices…

Make RSS with an SQL query https://blog.scraperwiki.com/2011/09/make-rss-with-an-sql-query/ Wed, 21 Sep 2011 11:22:16 +0000

Lots of people have asked for it to be easier to get data out of ScraperWiki as RSS feeds.

The Julian has made it so.

The Web API now has an option to make RSS feeds as a format (i.e. instead of JSON, CSV or HTML tables).

For example, Anna made a scraper that gets alcohol licensing applications for Islington in London. She wanted an RSS feed to keep track of new applications using Google Reader.

To make it, she went to the Web API explorer page for the scraper, chose “rss2” for the format, and entered this SQL into the query box.

select licence_for as description,
       applicant as title,
       url as link,
       date_scraped as date
from swdata order by date_scraped desc limit 10

The clever part is the SQL “as” clauses. They let you select exactly what appears in the title and description and so on of the feed. The help that appears next to the “format” drop down when you choose “rss2” explains which fields need setting.

Since SQL is a general purpose language, you can do complicated things like concatenate strings if you need to. For most simple cases though, it is just a remapping of fields.
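To give a feel for what the explorer hands back, the URL it builds looks something like this – a hedged example only: the scraper name here is a made-up placeholder, and the SQL has to be URL-encoded, so copy the exact URL from the explorer rather than typing it by hand:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=islington_alcohol_licences&query=select+licence_for+as+description,+applicant+as+title,+url+as+link,+date_scraped+as+date+from+swdata+order+by+date_scraped+desc+limit+10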

This is Anna’s final RSS feed of Islington alcohol license applications. There’s a thread on the ScraperWiki Google Group with more details, including how Anna made the date_scraped column.

Meanwhile, pezholio decided to use the new RSS feeds to track food safety inspections in Walsall. He used the new ScraperWiki RSS SQL API to make that scraper into an RSS feed.

Of course, these days RSS isn’t enough, so he used the wonderful ifttt to map that RSS feed to Twitter. Now anyone can keep track of how safe restaurants in Walsall are by simply following @EatSafeWalsall.

Let us know if you ScraperWiki anything with RSS feeds!

P.S. Islington’s licensing system is run by Northgate, as are lots of others. It is likely that Anna’s scraper can easily be made to run for some other councils…

It’s SQL. In a URL. https://blog.scraperwiki.com/2011/05/its-sql-in-a-url/ Mon, 16 May 2011 11:09:46 +0000

Squirrelled away amongst the other changes to ScraperWiki’s site redesign, we made substantial improvements to the external API explorer.

We’re going to concentrate on the SQLite function here as it is the most important, but as you can see on the right there are other functions for getting out scraper metadata.

Zarino and Julian have made it a little bit slicker to find out the URLs you can use to get your data out of ScraperWiki.

1. As you type into the name field, ScraperWiki now does an incremental search to help you find your scraper, like this.

2. After you select a scraper, it shows you its schema. This makes it much easier to know the names of the tables and columns while writing your query.

3. When you’ve edited your SQL query, you can run it as before. There’s also now a button to quickly and easily copy the URL that you’ve made for use in your application.

You can get to the explorer with the “Explore with ScraperWiki API” button at the top of every scraper’s page. This makes it quite useful for quick and dirty queries on your data, as well as for finding the URLs for getting data into your own applications.

Let us know when you do something interesting with data you’ve sucked out of ScraperWiki!

Scrape it – Save it – Get it https://blog.scraperwiki.com/2011/04/scrape-it-save-it-get-it/ Mon, 11 Apr 2011 18:22:00 +0000

I imagine I’m talking to a load of developers. Which is odd, seeing as I’m not a developer. In fact, I decided to lose my coding virginity by riding the ScraperWiki digger! I’m a journalist interested in data as a beat, so all I need to do is scrape. All my programming will be done on ScraperWiki; as such, this is the only coding home I know. So if you’re new to ScraperWiki and want to make the site a scraping home-away-from-home, here are the basics for scraping, saving and downloading your data:
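A minimal sketch of those three steps in ScraperWiki’s Python environment – the URL and the CSS selector are hypothetical placeholders, so point it at whatever page you actually care about:

import scraperwiki
import lxml.html

# 1. Scrape -- fetch a page and pull something out of it
html = scraperwiki.scrape("http://example.com/results")   # hypothetical URL
root = lxml.html.fromstring(html)
rows = [{"name": li.text_content().strip()} for li in root.cssselect("li")]

# 2. Save -- write the rows into the ScraperWiki datastore (the swdata table by default)
scraperwiki.sqlite.save(unique_keys=["name"], data=rows)

# 3. Get -- once saved, the data can be downloaded as CSV or JSON from the scraper's
#    page, or queried over the web via the ScraperWiki API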

With these three simple steps you can take advantage of what ScraperWiki has to offer – writing, running and debugging code in an easy to use editor; collaborative coding with chat and user viewing functions; a dashboard with all your scrapers in one place; examples, cheat sheets and documentation; a huge range of libraries at your disposal; a datastore with API callback; and email alerts to let you know when your scrapers break.

So give it a go and let us know what you think!

ScraperWiki Datastore – The SQL. https://blog.scraperwiki.com/2011/04/scraperwiki-datastore-the-sql/ Wed, 06 Apr 2011 16:54:58 +0000

Recently at ScraperWiki we replaced the old datastore, which was creaking under the load, with a new, lighter and faster solution – all your data is now stored in Sqlite tables as part of the move towards pluggable datastores. In addition to the increase in performance, using Sqlite also provides some other benefits, such as allowing us to transparently modify the schema and letting you access the data using SQL, via the ScraperWiki API or via the Sqlite View. If you don’t know SQL, or just need to try and remember the syntax, there is a great SQL tutorial available at w3schools.com which might get you started.

For getting your data out of ScraperWiki you can try using the Sqlite View, which makes it easier to add the fields you want to query as well as to perform powerful queries on the data. To explain how you do this we’ll use the scraper created by Nicola in her recent post Special Treatment for Special Advisers in No. 10, which you can access on ScraperWiki, and from there create a new view. If you choose General Sqlite View, you’ll get a nice easy interface to query and study the data. This dataset shows data from the Cabinet Office (UK central government) and logs gifts given to the advisers of top ministers – all retrieved by Nicola after only having known how to program for three weeks.

If you’re more confident with your SQL, you can access a more direct interface after clicking the ‘Explore with ScraperWiki API’ link on the overview page for any scraper. This will also give you a link that you can use elsewhere to get direct access to your data in JSON or CSV format.  For those that are still learning SQL, or not quite as confident as they’d like to be, using the Sqlite View is a good place to start.  When you first get to the Sqlite View you’ll see something similar to the following, but without the data already shown.

As you can see, the view gives you a description of the fields in the Sqlite table (highlighted in yellow) and a set of fields where you can enter the information you require. If you are feeling particularly lazy you can simply click on the highlighted column names and they will be added to the SELECT field for you! Accessing data across scrapers is done slightly differently, and is hopefully the subject of a future post. By default this view will display the output data as a table, but you can change it to do what you wish by editing the HTML and Javascript underneath – it is pretty straightforward. Once you have added the fields you wish to find (making sure to surround any field names containing spaces with backticks), clicking the query button will make a request to the ScraperWiki API and display the results on your page. It also shows you the full query, so that you can copy it and save it away for future use.

Now that you have an interface where you can modify your SQL, you can access your data almost any way you want it! You can do simple queries by just leaving the SELECT field set to *, which will return all of the columns, or you can specify the individual columns and the order in which they will be retrieved. You can even set their titles by using the AS keyword: setting the SELECT field to “`Name of Organisation` AS Organisation” will show that field with the new, shorter column name.

Aside from ordering your results (putting a field name in ORDER BY, followed by desc if you want descending order), limiting your results (adding the number of records into LIMIT) and the aforementioned renaming of columns, one thing the Sqlite View will let you do is group your results to show information that isn’t immediately visible in the full result set. Using the Special Advisers scraper again, the following view shows how, by grouping the data on `Name of Organisation` and using the count function in the SELECT field, we can show the total number of gifts given by each organisation – surely a lot faster than counting how many times each organisation appears in the full output!

In addition to using the count function in SELECT you could also use sum, or even avg to obtain an average of some numerical values. Not only can you add these individual functions to your SELECT field, you can also combine them into much more complicated queries to get a better overall view of the data, as in the Arts Council Cuts scraper. Here you can see the output for the total revenue per year and the average percentage change by artform, and draw your own conclusions on where the cuts are, or are not, happening.

SELECT `Artform `,
    sum(`Total Revenue 10-11`) as `Total Revenue for this year`,
    sum(`11-12`) as `Total Revenue for 2011-2012`,
    sum(`12-13`) as `Total Revenue for 2012-2013`,
    sum(`13-14`) as `Total Revenue for 2013-2014`,
    (avg(`Real percent change -Oct inflation estimates-`)*100) 
    as `Average % change over 4 years (Oct inflation estimates)`
FROM swdata
GROUP BY `Artform `
ORDER BY `Total Revenue for this year` desc

If there is anything you’d like to see added to any of these features, let us know either in the comments or via the website.
