Scraping guides: Dates and times
https://blog.scraperwiki.com/2011/10/scraping-guides-dates-and-times/
Wed, 12 Oct 2011 12:58:32 +0000

Working with dates and times in scrapers can get really tricky. So we’ve added a brand new scraping guide to the ScraperWiki documentation page, giving you copy-and-paste code to parse dates and times, and save them in the datastore.

To get to it, follow the “Dates and times guide” link on the documentation page.

The guide is available in Ruby, Python and PHP. Enjoy!
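
For example, the Python version boils down to something like this – a minimal sketch assuming the python-dateutil library and the classic scraperwiki module, with illustrative field names:

```python
# A minimal sketch: parse a messy date string and save it in the
# datastore as a sortable ISO 8601 value. Field names are examples.
import dateutil.parser
import scraperwiki

raw = "Wed, 12 Oct 2011 12:58:32 +0000"
when = dateutil.parser.parse(raw)  # copes with many common formats

scraperwiki.sqlite.save(
    unique_keys=["posted"],
    data={"posted": when.isoformat(), "title": "Scraping guides"},
)
```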

Scraping guides: Parsing HTML using CSS selectors
https://blog.scraperwiki.com/2011/10/scraping-guides-parsing-html-using-css-selectors/
Mon, 03 Oct 2011 22:20:58 +0000

We’ve added a new scraping copy-and-paste guide, so you can quickly get the lines of code you need to parse an HTML file using CSS selectors. Get to it from the documentation page.

The HTML parsing guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which language at the top right of the page.

While the library used varies (lxml in Python, Nokogiri in Ruby, Simple HTML DOM in PHP), the principle is the same. You pull the text out of the page in the same way as you use CSS to style it.
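
In Python, for instance, that looks something like this – a sketch where the URL and selector are stand-ins, and lxml’s cssselect() needs the cssselect package installed:

```python
# A sketch of CSS-selector parsing with lxml; the URL and selector
# are stand-ins for whatever page you are scraping.
import urllib.request
import lxml.html

html = urllib.request.urlopen("https://example.com/").read()
root = lxml.html.fromstring(html)

# Target elements the same way a stylesheet would.
for cell in root.cssselect("table tr td"):
    print(cell.text_content().strip())
```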

It’s a popular technique – for example, around 30% of Python scrapers on ScraperWiki use lxml.

Scraping guides: Excel spreadsheets
https://blog.scraperwiki.com/2011/09/scraping-guides-excel-spreadsheets/
Wed, 14 Sep 2011 15:55:29 +0000

Following on from the CSV scraping guide, we’ve now added one about scraping Excel spreadsheets. You can get to them from the documentation page.

The Excel scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which language at the top right of the page.

As with CSV files, at first it seems odd to be scraping Excel spreadsheets, when they’re already at least semi-structured data. Why would you do it?

The format of Excel files can vary a lot – how columns are arranged, where tables appear, what worksheets there are. There can be errors and inconsistencies that are easiest to fix in code. Sometimes you’ll find the data is there but not formatted in cells – entire rows in one cell, or data stored in notes.
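
Which is why you end up reading the cells yourself. Here’s a rough Python sketch using xlrd (one common choice; the URL is a stand-in, and newer xlrd releases only read legacy .xls files):

```python
# A rough sketch of walking every cell in a workbook with xlrd;
# the URL is a stand-in for wherever the spreadsheet lives.
import urllib.request
import xlrd

data = urllib.request.urlopen("https://example.com/data.xls").read()
book = xlrd.open_workbook(file_contents=data)

for sheet in book.sheets():
    for row in range(sheet.nrows):
        values = [sheet.cell_value(row, col) for col in range(sheet.ncols)]
        print(sheet.name, values)
```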

We used an Excel scraper that pulls together 9 spreadsheets into one dataset for the brownfield sites map used by Channel 4 News.

Dave Hughes has one that converts a spreadsheet from an FOI request, making a nice dataset of temperatures in Cambridge’s botanical garden.

This merchant oil shipping scraper uses a few regular expressions to parse some text in one of the columns.
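
That kind of thing is a one-liner once the cell text is in hand; here’s a hypothetical illustration (the cell contents and pattern are made up):

```python
# A hypothetical example of regex-parsing free text in a column,
# e.g. pulling a tonnage figure out of a shipping description.
import re

cell = "MV Example, crude oil, 52,000 tonnes, ex Rotterdam"
match = re.search(r"([\d,]+)\s*tonnes", cell)
if match:
    tonnes = int(match.group(1).replace(",", ""))
    print(tonnes)  # 52000
```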

Next time – parsing HTML with CSS selectors.

Scraping guides: Values, separated by commas
https://blog.scraperwiki.com/2011/08/scraping-guides-values-separated-by-commas/
Thu, 25 Aug 2011 12:50:30 +0000

When we revamped our documentation a while ago, we promised guides to specific scraper libraries, such as lxml, Nokogiri and so on.

We’re now starting to roll those out. The first one is simple, but a good one. Go to the documentation page and you’ll find a new section called “scraping guides”.

The CSV scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which language at the top right of the page.

“CSV” stands for “comma-separated values”. It’s a basic but quite common format for publishing spreadsheet files. Take a look at the scrapers tagged CSV on ScraperWiki. Lots of ministerial meetings and government spending records.

The CSV scraping guide shows you how to download and parse a CSV file, and how to easily save it into the datastore.
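
The Python flavour is roughly this – a sketch assuming the classic scraperwiki module, with a stand-in URL and column names:

```python
# A sketch: download a CSV file, parse it, and save each row
# into the datastore. The URL and column names are stand-ins.
import csv
import io
import urllib.request
import scraperwiki

raw = urllib.request.urlopen("https://example.com/spending.csv").read()
reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))

for row in reader:
    # This is where you fix quirks: strip whitespace, skip blank rows...
    if row.get("Reference"):
        scraperwiki.sqlite.save(unique_keys=["Reference"], data=row)
```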

Why write a scraper just to load a CSV file in? First, there are the quirks you’ll inevitably find – inconsistencies in fields, extra rows and columns, dates that need reformatting and so on. Secondly, you might want to load in multiple files and merge them together. Finally, you get the data into ScraperWiki, giving you a JSON API and letting you make views.

You can do quite funky things even with just CSV scraping. For example, see Nicola’s @scraper_no10 Twitter bot.

Next time, Excel files.

‘Documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing’
https://blog.scraperwiki.com/2011/05/documentation-is-like-sex-when-it-is-good-it-is-very-very-good-and-when-it-is-bad-it-is-better-than-nothing/
Wed, 25 May 2011 01:02:42 +0000

You may have noticed that the design of the ScraperWiki site has changed substantially.

As part of that, we made a few improvements to the documentation. Lots of you told us we had to make our documentation easier to find, more reliable and complete.

We’ve reorganised it all under one contents page, called Documentation throughout the site, including within the code editor. All the documentation is listed there. (The layout was shamelessly inspired by Django.)

Of course, everyone likes different kinds of documentation – talk to a teacher and they’ll tell you all about different learning styles. Here’s what we have on offer, all available in Ruby, Python and PHP (thanks Tom and Ross!).

  • New style tutorials – very directed recipes that show you exactly how to make something specific in under 30 minutes. More on these in a future blog post.
  • Live tutorials – these are what we now call the ScraperWiki special sauce tutorials. Self-contained chunks of code with commentary that you fork, edit and run entirely in your browser. (thanks Anna and Mark!)
  • Copy and paste guides – a new type of reference to a library, which gives you code snippets you can quickly copy into your scraper. With one click. (thanks Julian!)
  • Interactive API documentation – for how to get data out of ScraperWiki. More on that in a later blog post. (thanks Zarino!)
  • Reference documentation – we’ve gone through it to make sure it covers exactly what we support.
  • Links for further help – an FAQ and our Google Group. But for more gnarly questions, ask on the Stack Overflow scraperwiki tag.

We’ve got more stuff in the works – screencasts and copy & paste guides to specific view/scraper libraries (lxml, Nokogiri, Google Maps…). Let us know what you want.

Finally, none of the above is what really matters about this change.

The most important thing is our new Documentation Policy (thanks Ross): our promise to keep documentation up to date, and equally available for all the languages that we support.

Normally on websites, it is much more important to have a user interface that doesn’t need documentation. Of course, you need it for when people get stuck, and it has to be good quality. But you really do want to get rid of it.

But programming is fundamentally about language. Coders need some documentation, even if it is just the quickest answer they can get Googling for an error message.

We try hard to make it so that as little documentation as possible is needed, but what’s left isn’t an add-on. It is a core part of ScraperWiki.

(The quote in the title of this blog post is attributed to Dick Brandon on lots of quotation sites on the Internet, but none very reliably.)

All recipes 30 minutes to cook
https://blog.scraperwiki.com/2011/05/all-recipes-30-minutes-to-cook/
Wed, 18 May 2011 10:13:52 +0000

The other week we quietly added two tutorials of a new kind to the site, snuck in behind a radical site redesign.

They’re instructive recipes, which anyone with a modicum of programming knowledge should be able to easily follow.

1. Introductory tutorial

For programmers new to ScraperWiki, to get an idea of what it does.

It runs through the whole cycle of scraping a page, parsing it, then outputting the data in a new form – all with the simplest possible example.

Available in Ruby, Python and PHP.
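
In Python, the whole cycle fits in a dozen lines. Here’s a sketch (again via lxml and the cssselect package) with a stand-in page, not the tutorial’s own code:

```python
# The scrape-parse-save cycle in miniature; the page and selector
# are stand-ins, not the tutorial's own example.
import urllib.request
import lxml.html
import scraperwiki

html = urllib.request.urlopen("https://example.com/").read()
root = lxml.html.fromstring(html)

for link in root.cssselect("a"):
    if link.get("href"):
        scraperwiki.sqlite.save(
            unique_keys=["url"],
            data={"url": link.get("href"), "text": link.text_content()},
        )
```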

2. Views tutorial

Find out how to output data from ScraperWiki in exactly the format you want – i.e. write your own API functions on our servers.

This could be a KML file, an iCal file or a small web application. This tutorial covers the basics of what a ScraperWiki View is.

Available in Ruby, Python and PHP.
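
For a feel of it, here’s a rough Python sketch of a view that republishes a scraper’s data as JSON – the attach/select/httpresponseheader calls follow the classic ScraperWiki API as best we can reconstruct it, and the scraper name is hypothetical:

```python
# A rough sketch of a view: attach another scraper's datastore,
# query it, and print JSON. The exact classic API names used here
# (utils.httpresponseheader, sqlite.attach/select) are assumptions.
import json
import scraperwiki

scraperwiki.utils.httpresponseheader("Content-Type", "application/json")
scraperwiki.sqlite.attach("example-scraper")  # hypothetical scraper name
rows = scraperwiki.sqlite.select("* from swdata limit 10")
print(json.dumps(rows))
```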

Hopefully these tutorials won’t take as long as Jamie Oliver’s recipes to make. Get in touch with feedback and suggestions!

It’s SQL. In a URL.
https://blog.scraperwiki.com/2011/05/its-sql-in-a-url/
Mon, 16 May 2011 11:09:46 +0000

Squirrelled away amongst the other changes to ScraperWiki’s site redesign, we made substantial improvements to the external API explorer.

We’re going to concentrate on the SQLite function here, as it is the most important, but there are also other functions for getting out scraper metadata.

Zarino and Julian have made it a little bit slicker to find out the URLs you can use to get your data out of ScraperWiki.

1. As you type into the name field, ScraperWiki now does an incremental search to help you find your scraper.

2. After you select a scraper, it shows you its schema. This makes it much easier to know the names of the tables and columns while writing your query.

3. When you’ve edited your SQL query, you can run it as before. There’s also now a button to quickly and easily copy the URL that you’ve made for use in your application.

You can get to the explorer with the “Explore with ScraperWiki API” button at the top of every scraper’s page. This makes it quite useful for quick and dirty queries on your data, as well as for finding the URLs for getting data into your own applications.
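
Calling that URL from your own code is then trivial. A Python sketch – the endpoint shape matches the classic API, but treat it, and the scraper name, as assumptions:

```python
# A sketch of querying the external API; the endpoint shape and
# scraper name are assumptions / stand-ins.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "format": "json",
    "name": "example-scraper",
    "query": "select * from swdata limit 10",
})
url = "https://api.scraperwiki.com/api/1.0/datastore/sqlite?" + params
rows = json.load(urllib.request.urlopen(url))
print(rows)
```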

Let us know when you do something interesting with data you’ve sucked out of ScraperWiki!
