Open Data Camp 2

I’m back from Open Data Camp 2; and I’m finding it difficult to make a coherent whole of it all. Perhaps it’s the nature of the lack of structure of an un-conference. Maybe the different stakeholders in the open data community throughout the various hierarchies have a common aim but different levers to pull: the […]

Spreadsheets are code: EuSpRIG conference.

I’m back from presenting a talk on DataBaker at the EuSpRIG conference. It’s amazing to see a completely different world of how people use Excel – I’ve been busy tearing the data out of spreadsheets for the Office of National Statistics and using macros to open PDF files in Excel directly using PDFTables. So whilst I’ve […]

DataBaker – making spreadsheets machine-readable

Spreadsheets are often the way of choice for publishing data. They look great, are understandable by people who don’t use databases, and with judicious use of formatting you can represent complicated datasets in a way people can understand. The down side is that machines can’t understand them. Sure, you can export the file as CSV, but that […]

Scraping Spreadsheets with XYPath

Spreadsheets are great. They’re ubiquitously available, beaten only by the web pages and the word processor documents. Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated […]

Table Scraping Is Hard

The Problem NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency; A well known B2B publisher came to us to aggregate that data and provide them with information spanning across the hundreds of different trusts, such as: who are the biggest contractors across the NHS? […]

ScraperWiki

Extract tables from PDFs and scrape the web

Archive by Author