Francis Irving – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

The Sensible Code Company is our new name
https://blog.scraperwiki.com/2016/08/the-sensible-code-company-is-our-new-name/
Tue, 09 Aug 2016

For a few years now, people have said “but you don’t just do scraping, and you’re not a wiki, why are you called that?”

We’re pleased to announce that we have finally renamed our company.

We’re now called The Sensible Code Company.

The Sensible Code Company logo

Or just Sensible Code, if you’re friends!

We design and sell products that turn messy information into valuable data.

As we announced a couple of weeks ago, the ScraperWiki product is now called QuickCode.

Our other main product, PDFTables, converts PDFs into spreadsheets. You can try it out for free.

We’re also working on a third product – more about that when the time is right.

You’ll see our company name change on social media, on websites and in email addresses over the next day or two.

It’s been great being ScraperWiki for the last 6 years. We’ve had an amazing time, and we hope you have too. We’re sticking to the same vision: making it easy for everyone to make full use of data.

We’re looking forward to working with you as The Sensible Code Company!

Remote working at ScraperWiki
https://blog.scraperwiki.com/2016/08/remote-working-at-scraperwiki/
Tue, 02 Aug 2016

We’ve just posted our first job advert for a remote worker. Take a look, especially if you’re a programmer.

Throughout ScraperWiki’s history, we’ve had some staff working remotely – one for a while as far away as New York!

Sometimes staff have had to move away for other reasons, and have continued to work for us remotely. Other times they are from the north of England, but just far enough away that it is only practical to come into the office for a couple of days a week.

Collaborative tools are better now – when we first started coding ScraperWiki in late 2009, GitHub wasn’t ubiquitous, and although we had IRC, it was much harder to get bizdevs to use it than Slack is now. It’s hard to believe, but you had to pay for Skype group video calls, and Hangouts hadn’t launched yet.

Meanwhile, our customers and research partners are all over the world.

So we’re quite used to remote working.

We love Liverpool, and we can tell you the advantages and help you move here if that’s what you want.

If it isn’t though, and if you’ve always wanted to work for ScraperWiki as a software engineer remotely, now’s your chance.

QuickCode is the new name for ScraperWiki (the product)
https://blog.scraperwiki.com/2016/07/quickcode-is-the-new-name-for-scraperwiki-the-product/
Thu, 14 Jul 2016

Our original browser coding product, ScraperWiki, is being reborn.

We’re pleased to announce it is now called QuickCode.

QuickCode front page

We’ve found that the most popular use for QuickCode is to increase coding skills in numerate staff, while solving operational data problems.

What does that mean? I’ll give two examples.

  1. Department for Communities and Local Government run clubs for statisticians and economists to learn to code Python on QuickCode’s cloud version. They’re doing real projects straight away, such as creating an indicator for availability of self-build land. Read more

  2. Office for National Statistics save time and money using a special QuickCode on-premises environment, with custom libraries to get data from spreadsheets and convert it into the ONS’s internal database format. Their data managers are learning to code simple Python scripts for the first time. Read more

Why the name change? QuickCode isn’t just about scraping any more, and it hasn’t been a wiki for a long time. The new name reflects its broader use: easy data science using programming.

We’re proud to see ScraperWiki grow up into an enterprise product, helping organisations get data deep into their soul.

Does your organisation want to build up coding skills, and solve thorny data problems at the same time?

We’d love to hear from you.

Learning to code bots at ONS
https://blog.scraperwiki.com/2016/07/learning-to-code-bots-at-ons/
Tue, 12 Jul 2016

The Office for National Statistics releases over 600 national statistics every year. They came to ScraperWiki to help improve their backend processing, so they could build a more usable web interface for people to download data.

We created an on-premises environment where their numerate staff learnt a minimal amount of coding; they now create short scripts to transform data they previously didn’t have the resources to process.

Matthew Jukes, Head of Product, Office for National Statistics said:

Who knew a little Python app spitting out CSVs could make people so happy but thank you team @ScraperWiki – great stuff 🙂

Spreadsheets

The data the team were processing was in spreadsheets which look like this:

(Not shown: splits by age range, seasonal adjustments, or the whole pile of other similar spreadsheets)

They needed to turn them into a standard CSV format used internally at the ONS. Each spreadsheet could contain tens of thousands of observations, turning into an output file with that many database rows.

We created an on-premises ScraperWiki environment for the ONS, using standard text editors and Python. Each type of spreadsheet needs a short recipe written for it, which is just a few lines of Python expressing the relative relationships of headings, sub-headings and observations.
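
To give a feel for what such a recipe involves, here is a minimal sketch, assuming the sheet has already been loaded into a grid of rows; the helper names and the layout are invented for illustration, not the ONS’s actual recipes:

```python
# A sketch of a "recipe" for one spreadsheet layout, assuming the sheet has
# already been loaded into a grid of rows. The helper names and layout are
# invented for illustration; they are not the ONS's actual recipes.

CSV_HEADER = ["region", "period", "observation"]

def parse_sheet(grid):
    """Yield (region, period, value) rows from a simple cross-tab layout:
    periods run across the top row, regions run down the first column."""
    periods = grid[0][1:]                        # headings along the top
    for row in grid[1:]:
        region = row[0]                          # sub-headings down the side
        for period, value in zip(periods, row[1:]):
            if value not in ("", "..", None):    # skip missing observations
                yield region, period, float(value)

if __name__ == "__main__":
    import csv, sys
    example = [
        ["",           "2014 Q1", "2014 Q2"],
        ["North East", "1200",    "1250"],
        ["North West", "3400",    ".."],         # ".." marks a missing value
    ]
    writer = csv.writer(sys.stdout)
    writer.writerow(CSV_HEADER)
    writer.writerows(parse_sheet(example))
```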

The environment included a coloured debugger for identifying that headings and cells were correctly matched:

Databaker highlighter

Most of the integration work involved making it easy to code scripts which could transform the data ONS had – coping with specific ways numbers are written, and outputting the correct CSV file format.
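
Much of that “coping with specific ways numbers are written” is small but fiddly normalisation. A hedged sketch of the kind of helper involved – the markers shown are common statistical-spreadsheet conventions, not necessarily the exact ones the ONS uses:

```python
def clean_number(cell):
    """Normalise one spreadsheet cell to a float, or None for a missing-value
    marker. The markers below are common statistical-spreadsheet conventions,
    shown for illustration; the real scripts handled ONS's own conventions."""
    if cell is None:
        return None
    text = str(cell).strip()
    if text in ("", "..", "-", ":", "n/a"):      # typical "no data" markers
        return None
    text = text.replace(",", "")                 # strip thousands separators
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1]                  # bracketed negatives
    return float(text)

assert clean_number("1,234") == 1234.0
assert clean_number("(56)") == -56.0
assert clean_number("..") is None
```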

Training

As part of the deployment, we gave a week of hands-on script development training to 3 members of staff. Numerate people learning some coding is, we think, vital to improving how organisations use data.

Before the training, Darren Barnes (Open Datasets Manager) said learning to code felt like crossing a “massive chasm”.

EOT team

Within a couple of hours he was writing scripts that were then used operationally.

He said it was much easier to write code than to use the complex graphical data applications he often has to work with.

Conclusion

Using graphical ETL software, it took two weeks for an expert consultant to make the converter for one type of spreadsheet. With staff in the business coding Python in ScraperWiki’s easy environment themselves, it takes a couple of hours.

This has saved the ONS time on the initial conversion of each type of spreadsheet. When new statistics come out in later months, those spreadsheets can easily be converted again, with any problems fixed quickly and locally, saving even more.

The ONS have made over 40 converters so far. ScraperWiki has been transformational.

Running a code club at DCLG
https://blog.scraperwiki.com/2016/06/running-a-code-club-at-dclg/
Wed, 08 Jun 2016

The Department for Communities and Local Government (DCLG) has to track activity across more than 500 local authorities and countless other agencies.

They needed a better way to handle this diversity and complexity of data, so decided to use ScraperWiki to run a club to train staff to code.

Martin Waudby, data specialist, said:

I didn’t want us to just do theory in the classroom. I came up with the idea of having teams of 4 or 5 participants, each tasked to solve a challenge based on a real business problem that we’re looking to solve.

The business problems being tackled were approved by Deputy Directors.

Phase one

The first club they ran had 3 teams, and lasted for two months so participants could continue to do their day jobs whilst finding the time to learn new skills. They were numerate people – statisticians and economists (just as in our similar project at the ONS). During that period, DCLG held support workshops, and “show and tell” sessions between teams to share how they solved problems.

As ever with data projects, lots of the work involved researching sources of data and their quality. The teams made data gathering and cleaning bots in Python using ScraperWiki’s “Code in Browser” product – an easy way to get going, without anything to install and without worrying about where to store data, or how to download it in different formats.
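
For readers who haven’t written one, a data gathering bot of this kind is only a short Python script. Here is a minimal sketch assuming the requests and BeautifulSoup libraries; the target URL and CSS selector are made up for illustration:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical listing page, purely for illustration -- the teams pointed
# scripts like this at the sources they had researched themselves.
URL = "https://www.example.gov.uk/council-announcements"

def scrape_listing(url):
    """Fetch a listing page and pull out the title and link of each item."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("article a"):        # the selector depends on the site
        yield {"title": link.get_text(strip=True), "url": link.get("href")}

if __name__ == "__main__":
    with open("announcements.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(scrape_listing(URL))
```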

Here’s what two of the teams got up to…

Team Anaconda

The goal of Team Anaconda (they were all named after snakes, to keep the Python theme!) was to gather data from Local Authority (and other) sites to determine intentions relating to Council Tax levels. The business aim is to spot trends and patterns, and to pick up early on rises which don’t comply with the law.

Local news stories often talk about proposed council tax changes.

Council tax change article

The team in the end set up a Google Alert for search terms around council tax changes, and imported that into a spreadsheet. They then downloaded the content of those pages, creating an SQL table with a unique key for each article talking about changes to council tax.


They used regular expressions to find the phrases describing a percentage increase / decrease in Council Tax.
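
To illustrate the idea (the team’s actual patterns aren’t published here, so this one is invented), a regular expression for that kind of phrase might look like:

```python
import re

# Illustrative pattern only -- matches phrases such as
# "council tax will rise by 3.99%" or "council tax to increase by 2 per cent".
PATTERN = re.compile(
    r"council tax\D{0,40}?(rise|increase|cut|reduce|fall)\D{0,20}?"
    r"(\d+(?:\.\d+)?)\s*(%|per ?cent)",
    re.IGNORECASE,
)

text = "The council proposes that council tax will rise by 3.99% from April."
match = PATTERN.search(text)
if match:
    direction, amount = match.group(1), float(match.group(2))
    print(direction, amount)                     # prints: rise 3.99
```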

The team liked using ScraperWiki – it was easy to collaborate on scrapers there, and easier to get into SQL.

The next steps will be to restructure the data to be more useful to the end user, and improve content analysis, for example by extracting local authority names from articles.

Team Boa Constrictor

It’s Government policy to double the number of self-built homes by 2020, so this team was working on parsing sites to collect baseline evidence of the number being built.

They looked at various sources – quarterly VAT receipts, forums, architecture websites, sites which list plots of land for sale, planning application data, industry bodies…

The team wrote code to get data from PlotBrowser, a site which lists self-build land for sale.

Plot Browser scraper

And analysed that data using R.

PlotBrowser data graph

They made scripts to get planning application data, for example in Hounslow, although they found the data they could easily get from within the applications wasn’t enough for what they needed.


They liked ScraperWiki, especially once they understood the basics of Python.

The next step will be to automate regular data gathering from PlotBrowser, and count when plots are removed from sale.
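
One simple way to count removals is to diff the set of plot IDs seen on each run. A sketch under that assumption – an illustration, not the team’s actual code:

```python
import json
from datetime import date

def load_snapshot(path):
    """Load the set of plot IDs saved by a previous scraping run."""
    with open(path) as f:
        return set(json.load(f))

def removed_plots(previous_ids, current_ids):
    """Plots seen last run but missing now have been taken off sale."""
    return previous_ids - current_ids

if __name__ == "__main__":
    previous = {"plot-101", "plot-102", "plot-103"}   # e.g. yesterday's snapshot
    current = {"plot-101", "plot-103", "plot-104"}    # e.g. today's snapshot
    gone = removed_plots(previous, current)
    print(date.today(), len(gone), "plots removed:", sorted(gone))
```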

Phase two

At the end of the competition, teams presented what they’d learnt and done to Deputy Directors. Team Boa Constrictor won!

The teams developed a better understanding of the data available, and the level of effort needed to use it. There are clear next steps to take the projects onwards.

DCLG found the code club so useful, they are running another more ambitious one. They’re going to have 7 teams, extending their ScraperWiki license so everyone can use it. A key goal of this second phase is to really explore the data that has been gathered.

We’ve found at ScraperWiki that a small amount of coding skills, learnt by numerate staff, goes a long way.

As Stephen Aldridge, Director of the Analysis and Data Directorate, says:

ScraperWiki added immense value, and was a fantastic way for team members to learn. The code club built skills at automation and a deeper understanding of data quality and value. The projects all helped us make progress at real data challenges that are important to the department.

Highlights of 3 years of making an AI newsreader
https://blog.scraperwiki.com/2016/04/highlights-of-3-years-of-making-an-ai-newsreader/
Wed, 06 Apr 2016

We’ve spent three years working on a research and commercialisation project making natural language processing software to reconstruct chains of events from news stories, representing them as linked data.

If you haven’t heard of Newsreader before, our one year in blog post is a good place to start.

We recently had our final meeting in Luxembourg. Some highlights from the three years:

Papers: The academic partners have produced a barrage of papers. This constant, iterative improvement to knowledge of Natural Language Processing techniques is a key thing that comes out of research projects like this.

Open data: As an open data fan, I like some of the new components which came out of the project and will be of permanent use to anyone in NLP. For example, the MEANTIME corpus of news articles in multiple languages, annotated with their events for use in training.

Meantime example

Open source: Likewise, as an open source fan, I’m pleased that Newsreader’s policy was to produce open source software, and that it made lots. As an example, the PIKES Knowledge Extraction Suite applies NLP tools to a text.

PIKES overview

Exploitation: This is happening via integration into existing commercial products. All three commercial consortium members are working on this in some way (often confidentially for now). Originally at ScraperWiki, we thought it might plug into our Code in Browser product. Now our attention is more on using PDFTables with additional natural language processing.

Simple API: A key part of our work was developing the Simple API, making the underlying SPARQL database of news events accessible to hackers via a simpler REST API. This was vital for the Hackdays, and for making the technology more accessible.
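
The appeal is that a hacker can make an ordinary HTTP request instead of writing SPARQL. A rough sketch of the kind of call involved – the base URL, route and parameters below are invented for illustration, not the real Simple API endpoints:

```python
import requests

# Hypothetical base URL and route, for illustration only -- not the real
# Simple API endpoints.
BASE = "https://simple-api.example.org"

def events_mentioning(actor, page=1):
    """Ask the REST wrapper for news events mentioning an actor, rather than
    writing the equivalent SPARQL query by hand."""
    response = requests.get(
        BASE + "/events", params={"actor": actor, "page": page}, timeout=30
    )
    response.raise_for_status()
    return response.json()

# for event in events_mentioning("dbpedia:Volkswagen"):
#     print(event)
```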

Hackdays: We ran several across the course of the project (example). They were great fun, working on World Cup and automotive-related news article datasets, drawing a range of people from students to businesses.

Thanks Newsreader for a great project!

Together, we improved the quality of news data extraction, began the process of assembling that into events, and made steps towards commercialisation.

Saving time with GOV.UK design standards
https://blog.scraperwiki.com/2016/02/draft-design-standards-save-time/
Thu, 04 Feb 2016

While building the Civil Service People Survey (CSPS) site, ScraperWiki had to deal with the complexities of suppressing data to avoid privacy leaks and of building technology to process tens of millions of rows in a fraction of a second.

That didn’t leave us time to spend on basic web design as well. Luckily the Government’s Resources for designers, part of the Government Service Design Manual, saved us from having to.

In this blog post I talk through specific things where the standards saved us time and increased quality.

This is useful for managers – it’s important to know some details to avoid getting too distant from how projects really work. If you’re a developer or designer who’s about to make a site for the UK Government, there are lots of practical links!

Header and footer

To style the header and the footer we used the govuk_template, via a mustache version which automatically updates itself from the original. This immediately looks good.

CSPS about page

The look takes advantage of years of GDS work to be responsive, accessible and have considered typography. Without using the template we’d have made something useful, but ugly and inaccessible without extra budget for design work.

It also reduces maintenance. The templates are constantly updated, and every now and again we quickly update the copy of them that we include. This keeps our design up to date with the standard, fixes bugs, and ensures compatibility with new devices.

Buttons

The template doesn’t have styling for your content. That’s over in a separate module called the govuk_frontend_template. It has a bunch of useful CSS and Javascript pieces. GOV.UK elements is a good guide to how to use them, with its live examples.

For example, there aren’t many forms and buttons in CSPS; nevertheless, they look good.

CSPS login form

The frontend template is full of useful tiny things, such as easily styling external links.

CSPS external link

And making just the right alpha or beta banner.

CSPS alpha banner

The tools aren’t perfect. We would have liked slightly more styling to use inside the pages. Some inside Government are arguing for a comprehensive framework similar to Bootstrap.

All I really want is not to have to define my own <h1>!

Cookies

Privacy is important to users. Despite the flaws in the EU regulations on telling consumers about browser cookies, the goal of informing people how they are being tracked is really important.

Typically in a web project this would involve a bit of discussion over how and when to do so, and what it should look like. For us, it just magically came with the Government’s template.

CSPS cookies

I say magically – we also had to carefully note down all the cookies we use. A useful thing to do anyway!

Error messages

There are now lots of recently made Government digital services to poach bits of web design from.

The probate service has some interesting tabs and other navigation. The blood donation service dashboard has grey triangles to show data increased/decreased. The new digital marketplace has a complex search form with various styles.

I can’t even remember where we took this error message format from – lots of places use it.

CSPS login error

If you’d like to find more, doing a web search for site:service.gov.uk is a great way to start exploring.

6 lessons from sharing humanitarian data
https://blog.scraperwiki.com/2015/10/6-lessons-from-sharing-humanitarian-data/
Tue, 13 Oct 2015

This post is a write-up of the talk I gave at Strata London in May 2015 called “Sharing humanitarian data at the United Nations”. You can find the slides on that page.

The Humanitarian Data Exchange (HDX) is an unusual data hub. It’s made by the UN, and is successfully used by agencies, NGOs, companies, Governments and academics to share data.

They’re doing this during crises such as the Ebola epidemic and the Nepal earthquakes, and every day to build up information in between crises.

There are lots of data hubs which are used by one organisation to publish data, but far fewer which are used by lots of organisations to share data. The HDX project did a bunch of things right. What were they?

Here are six lessons…

1) Do good design

HDX started with user needs research. This was expensive, and was immediately worth it because it stopped a large part of the project which wasn’t needed.

The user needs led to design work which has made the website seem simple and beautiful – particularly unusual for something from a large bureaucracy like the UN.

HDX front page

2) Build on existing software

When making a hub for sharing data, there’s no need to build something from scratch. Open Knowledge’s CKAN software is open source; this stuff is a commodity. HDX has developers who modify and improve it for the specific needs of humanitarian data.

ckan

3) Use experts

HDX is a great international team – the leader is in New York, most of the developers are in Romania, there’s a data lab in Nairobi. Crucially, they bring in specific outside expertise: frog design do the user research and design work; ScraperWiki, experts in data collaboration, provide operational management.

ScraperWiki logo

4) Measure the right things

HDX’s metrics are about both sides of its two sided network. Are users who visit the site actually finding and downloading data they want? Are new organisations joining to share data? They’re avoiding “vanity metrics”, taking inspiration from tech startup concepts like “pirate metrics“.

HDX metrics

5) Add features specific to your community

There are endless features you can add to data hubs – most add no value, and end up as a cost to maintain. HDX adds specific features that are valuable to its community.

For example, much humanitarian data is in “shape files”, a standard for geographical information. HDX automatically renders a beautiful map of these – essential for users who don’t have ArcGIS, and a good check for those that do.
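
To make the value concrete, here is a rough sketch of rendering that kind of preview in Python with the geopandas library – an illustration of the idea, not how HDX itself is implemented:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

def render_preview(shapefile_path, out_path="preview.png"):
    """Draw a quick map preview of a shapefile so users can sanity-check the
    data without opening a GIS package."""
    frame = gpd.read_file(shapefile_path)        # reads the .shp and its sidecar files
    ax = frame.plot(figsize=(8, 8), edgecolor="black", linewidth=0.3)
    ax.set_axis_off()
    plt.savefig(out_path, dpi=150, bbox_inches="tight")

# render_preview("admin_boundaries.shp")
```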

Syrian border crossing

6) Trust in the data

The early user research showed that trust in the data was vital. For this reason, not just anyone can come along and add data. New organisations have to apply – proving either that they’re known in humanitarian circles, or that they have quality data to share. Applications are checked by hand. It’s important to get this kind of balance right – being too ideologically open or closed doesn’t work.

Apply HDX

Conclusion

The detail of how a data sharing project is run really matters. Most data in organisations gets lost, left in spreadsheets on dying file shares. We hope more businesses and Governments will build a good culture of sharing data in their industries, just as HDX is building one for humanitarian data.

Over a billion public PDFs
https://blog.scraperwiki.com/2015/09/over-a-billion-public-pdfs/
Tue, 15 Sep 2015

You can get a guesstimate for the number of PDFs in the world by searching for filetype:pdf on a web search engine.

These are the results I got in August 2015 – follow the links to see for yourself.

                        Google        Bing
Number of PDFs          1.8 billion   84 million
Number of Excel files   14 million    6 million

The numbers are inexact, but that’s likely over a billion PDF files. Of course, it’s only the visible ones…

But the fact is that the vast majority of PDFs are in corporate or governmental repositories. I’ve heard various government agencies (throughout the world) comment that they have tens of millions (or more) in their own libraries/CMS’s. Various engineering businesses, such as Boeing and Airbus are also known to have tens of millions (or more) in their repositories.

– Leonard Rosenthol, Adobe’s PDF Architect

Digging a bit deeper by also adding site: to the search, you can find out what percentage of documents that a search engine has indexed are PDFs.

       Number of PDFs   Total number of pages   % PDFs
.com   547 million      25 billion              2%
.gov   316 million      839 million             38%

UK – HM Treasury Summer Budget 2015

That’s proportionately a lot more PDFs published by the US Government than by commercial sites!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
We’re hiring! Technical Architect
https://blog.scraperwiki.com/2015/08/were-hiring-technical-architect/
Mon, 24 Aug 2015

We’ve lots of interesting projects on – with clients like the United Nations and the Cabinet Office, and with our own work building products such as PDFTables.com.

Currently we’re after a Technical Architect; full details are on our jobs page.

We’re a small company, so roles depend on individual people. Get in touch if something doesn’t quite fit, but you’re interested!

Digger hiring
