coding – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

Lots of new libraries
https://blog.scraperwiki.com/2011/10/lots-of-new-libraries/
Wed, 26 Oct 2011 11:53:50 +0000

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know, yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install.

Also, let us know why you want them as it’s great to know what you guys are up to. Ross Jones has been busily adding them (he is powered by beer if ever you see him and want to return the favour).

Find them listed in the “3rd party libraries” section of the documentation.

In Python, we’ve added:

  • csvkit, a bunch of tools for handling CSV files, made by Christopher Groskopf at the Chicago Tribune. Christopher is now lead developer on PANDA, a Knight News Challenge winner that is building a newsroom data hub.
  • requests, a lovely new layer over Python’s HTTP libraries, made by Kenneth Reitz. It makes GET and POST requests much easier.
  • Scrapemark is a way of extracting data from HTML using reverse templates. You give it the HTML and the template, and it pulls out the values.
  • pipe2py was requested by Tony Hirst, and can be used to migrate from Yahoo Pipes to ScraperWiki.
  • PyTidyLib, for accessing HTML Tidy, the old classic C library that cleans up HTML files.
  • SciPy is at the analysis end, and builds on NumPy giving code for statistics, Fourier transforms, image processing and lots more.
  • matplotlib, which can almost magically make PNG charts. See this example that The Julian knocked up, with the boilerplate to run it from a ScraperWiki view.
  • Google Data (gdata) for calling various Google APIs to get data.
  • Twill, a browser automation scripting language of its own, built as a layer on top of Mechanize.
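To give a flavour of the reverse-template idea Scrapemark is built around, here’s a minimal sketch using only the standard library (this illustrates the concept, not Scrapemark’s actual API): the template’s {{name}} placeholders become named capture groups, and matching the result against the HTML pulls the values out.

```python
import re

def reverse_template(template, html):
    """Turn a template with {{name}} placeholders into a regex
    and extract the named values from matching HTML."""
    pattern = re.escape(template)
    # Replace the escaped {{name}} placeholders with named capture groups.
    pattern = re.sub(r'\\\{\\\{(\w+)\\\}\\\}', r'(?P<\1>.*?)', pattern)
    match = re.search(pattern, html, re.DOTALL)
    return match.groupdict() if match else None

html = '<tr><td>Acme Ltd</td><td>12345</td></tr>'
result = reverse_template('<td>{{name}}</td><td>{{number}}</td>', html)
# result == {'name': 'Acme Ltd', 'number': '12345'}
```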

In Ruby, we’ve added:

  • tmail for parsing emails.
  • typhoeus, a wrapper around curl with an easier syntax that lets you make parallel HTTP requests.
  • Google Data (gdata) for calling various Google APIs.

In PHP, we’ve added:

  • GeoIP for turning IP addresses into countries and cities.

Let us know if there are any libraries you need for scraping or data analysis!
OpenCorporates partners with ScraperWiki & offers bounties for open data scrapers
https://blog.scraperwiki.com/2011/03/opencorporates-partners-with-scraperwiki-offers-bounties-for-open-data-scrapers/
Fri, 25 Mar 2011 11:25:12 +0000

This is a guest post by Chris Taggart, co-founder of OpenCorporates

When we started OpenCorporates it was to solve a real need that we and a number of other people in the open data community had: whether it’s Government spending, subsidy info or court cases, we needed a database of corporate entities to match against, and not just for one country either.

But we knew from the first that we didn’t want this to be some heavily funded monolithic project that threw money at the problem in order to create a walled garden of new URIs unrelated to existing identifiers. It’s also why we wanted to work with existing projects like OpenKvK, rather than trying to replace them.

So the question was, how do we make this scale and at the same time do the right thing – that is, work with a variety of different people using different solutions and different programming languages? The answer to both, it turns out, was to use open data, and the excellent ScraperWiki.

How does it work? Well, the basics we need in order to create a company record at OpenCorporates are the company number, the jurisdiction and the company’s name. (If there’s a status field — e.g. dissolved/active — a company type, or a URL for more data, that’s a bonus.) So, all you need to do is write a scraper for a country we haven’t got data for, name the fields in a standard way (CompanyName, CompanyNumber, Status, EntityType, and RegistryUrl if the URL of the company page can’t be worked out from the company number), and bingo, we can pull it into OpenCorporates with just a couple of lines of code.
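To make the naming convention concrete, here’s a short Python sketch. The field names are the ones listed above; the helper function, the example values and the commented-out save call are ours, for illustration only.

```python
def company_record(number, name, status=None,
                   entity_type=None, registry_url=None):
    """Build a record using the standard OpenCorporates field names.
    Optional fields are included only when the registry provides them."""
    record = {
        'CompanyNumber': number,
        'CompanyName': name,
    }
    if status:
        record['Status'] = status
    if entity_type:
        record['EntityType'] = entity_type
    if registry_url:
        record['RegistryUrl'] = registry_url
    return record

# Hypothetical example values for an Isle of Man company:
record = company_record('000001V', 'Example Holdings Ltd',
                        status='Live',
                        registry_url='https://example.com/company/000001V')
# In a classic ScraperWiki scraper you would then save it, e.g.:
# scraperwiki.sqlite.save(unique_keys=['CompanyNumber'], data=record)
```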

Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and, in the US, the District of Columbia). It’s written in Ruby, because that’s what we at OpenCorporates code in, but ScraperWiki allows you to write scrapers in Python or PHP too, and the important thing here is the data, not the language used to produce it.

The Isle of Man company registry website is a .Net system which uses all sorts of hidden fields and other nonsense in the forms and navigation. This is normally a bit of a pain, but because you can use the Ruby Mechanize library to submit forms found on the pages (there’s even a tutorial scraper which shows how to do it), it becomes fairly straightforward.
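The hidden-field juggling can be sketched like this (a standard-library Python illustration of the idea, not the Ruby Mechanize code the scraper actually uses): collect the hidden inputs from the form, then post them back along with your search terms.

```python
from html.parser import HTMLParser

class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements,
    the __VIEWSTATE-style fields that .Net forms expect posted back."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.fields[attrs.get('name')] = attrs.get('value', '')

# A toy form in the style of a .Net page:
form = ('<form><input type="hidden" name="__VIEWSTATE" value="abc123">'
        '<input type="text" name="q"></form>')
collector = HiddenFieldCollector()
collector.feed(form)
# collector.fields == {'__VIEWSTATE': 'abc123'}
# Merge these into the POST data alongside the visible search fields.
```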

The code itself should be fairly readable to anyone familiar with Ruby or Python. Essentially it tackles the problem by doing multiple searches for companies beginning with two letters, starting with ‘aa’, then ‘ab’ and so on. For each letter pair it iterates through each page of results in turn, and each page is scraped to extract the data, which is saved under the standardised headings. That’s it.
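The search strategy above can be sketched in outline (in Python rather than the Ruby of the actual scraper; `search` and `scrape_page` are hypothetical callables standing in for the form submission and row extraction):

```python
from itertools import product
from string import ascii_lowercase

def search_prefixes():
    """Yield every two-letter search prefix: 'aa', 'ab', ..., 'zz'."""
    for first, second in product(ascii_lowercase, repeat=2):
        yield first + second

def scrape_all(search, scrape_page):
    """For each prefix, walk every page of results and scrape it."""
    for prefix in search_prefixes():
        for page in search(prefix):
            scrape_page(page)

prefixes = list(search_prefixes())
# 26 * 26 = 676 prefixes, from 'aa' through 'zz'
```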

In the space of a couple of hours, not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported into OpenCorporates.

However, that’s not all. In order to kickstart the effort OpenCorporates (technically Chrinon Ltd, the micro start-up that’s behind OpenCorporates) is offering a bounty for new jurisdictions opened up.

It’s not huge (we’re a micro-startup, remember): £100 for any jurisdiction that hasn’t been done yet, £250 for those territories we want to import sooner rather than later (Australia, France, Spain), and £500 for Delaware (there’s a captcha there, so we’re not sure it’s even possible), and there’s an initial cap of £2500 on the bounty pot (details at the bottom of this post).

However, often the scrapers can be written in a couple of hours, and it’s worth stressing again that neither the code nor the data will belong to OpenCorporates, but to the open data community, and if people build other things on it, so much the better. Of course we think it would make sense for them to use the OpenCorporates URIs to make it easy to exchange data in a consistent and predictable way, but, hey, it’s open data 😉

Small, simple pieces, loosely connected, to build something rather cool. So now you can do a search for, oh, say, Barclays, and get this:

The bounty details: how it works

Find a country/company registry that you fancy opening up the data for (here are a couple of lists of registries). Make sure it’s from the official registry, and not a commercial reseller. Check too that no-one has already written one, or is in the middle of writing one, by checking the scrapers tagged with opencorporates (be nice, and respect other people’s attempts, but feel free to start one if it looks as if someone’s given up on a scraper).

All clear? Go ahead and start a new scraper (useful tutorials here). Call it something like trial_fr_company_numbers (until it’s done and been OK’d) and get coding, using the headings detailed above for the CompanyNumber, CompanyName etc. When it’s done, and it’s churning away pulling in data, email us at info@opencorporates.com, and assuming it’s OK, we’ll pay you by Paypal, or by bank transfer (you’ll need to give us an invoice in that case). If it’s not, we’ll add comments to the scraper. Any questions, email us at info@opencorporates.com, and happy scraping.

Job advert: Product / UX lover
https://blog.scraperwiki.com/2011/02/job-advert-product-ux-lover/
Mon, 14 Feb 2011 15:35:38 +0000

ScraperWiki is a Silicon Valley style startup, but based in the UK. We’re changing the world of open data, and how programming is done together on the Internet.

We’re looking for a web product designer who is…

  • Able to make design decisions to launch features by themselves.
  • Capable of writing CSS and HTML, and some Javascript.

Other bits…

  • Loves to balance colour, size, order and prominence on websites.
  • Knows what a web scraper is, and would like to learn to write one.
  • Thinks that data can change the world, but only if we use it right.
  • Either good at working remotely, or willing to relocate to the North West.
  • Desirable – able to make igloos.

To apply – send the following:

  • An example of a website you’ve made that you’re proud of
  • If you have one, a visualisation you’ve made of some data (any data!)
  • Oh, and I guess we’d better see your CV

Along to francis@scraperwiki.com with the word swjob2 in the subject.

Job advert: Web designer/programmer
https://blog.scraperwiki.com/2011/01/job-advert-web-designerprogrammer/
Wed, 05 Jan 2011 11:29:30 +0000

Care about oil spills, newspapers or lost cats?

ScraperWiki is a Silicon Valley style startup, but in the North West of England, in Liverpool. We’re changing the world of open data, and how programming is done together on the Internet.

We’re looking for a web designer/programmer who is…

  • Capable of writing standards compliant CSS and HTML, and some Javascript.
  • Loves to balance colour, size, order and prominence on websites.
  • Good enough at Photoshop to make any mockups and icons required.
  • Likes to talk to and track users, and then do what’s needed to make their experience better.
  • Experienced with server-side coding (Python) – a plus, but not essential.
  • Knows what a web scraper is, and would like to learn to write one.
  • Thinks that data can change the world, but only if we use it right.
  • Desirable – able to make igloos.

Some practical things…

  • We’re early stage, spending our seed funding. So be aware things will go either way – we’ll crash and burn, or you’ll be a key, senior person in a growing company.
  • We’d like this to end up a permanent position, but if you prefer we’re happy to do individual contracts to start with.
  • Must be willing either to relocate to Liverpool, or to work from home and travel here regularly (once a week), so somewhere nearby is preferred.

To apply – send the following:

  • An example of a website you’ve made that you’re proud of
  • If you have one, a visualisation of some data (any data!)

Along to francis@scraperwiki.com with the word swjob1 in the subject.
