David Jones – ScraperWiki
Extract tables from PDFs and scrape the web
https://blog.scraperwiki.com

Technology Radar Report
https://blog.scraperwiki.com/2015/06/technology-radar-report/
Fri, 26 Jun 2015 07:43:33 +0000

Creating a sustainable technology company involves keeping up with technology. The thing about technology is that it changes, and we have to look to the future, and invest our time now in things that will be valuable in the future. Or, we could switch to doing SharePoint consultancy for the rest of our lives, but I think most of us here would regard that as “checking out”.

This is a partly personal perspective of the future as I see it from our little hill in the Northwest of England (Brownlow Hill). Iʼm really just sketching out a few things that I see as being important for ScraperWiki. And since theyʼre important for ScraperWiki, theyʼre important for you! Or at least, you might be interested too.


The future is already here – itʼs just not evenly distributed. — William Gibson

Gibsonʼs quote certainly applies to the software industry. All of the things I highlight already exist and are in use (some for quite a long time now), they just haven’t reached saturation yet. So looking to the near future is a matter of looking to the now, and making an educated guess as to what technologies will become increasingly abundant.

Python 3

(I have been saying this for 6 years now, but) Python 3 is a real thing and in five yearsʼ time we will have stopped using Python 2 and all switched to Python 3. If this seems obvious, consider that we can’t say the same about the transition from Perl 5 to Perl 6 (which lives in a perpetual state of being “out by Christmas”) or from LaTeX 2e to LaTeX 3.

Encouragingly, in 2015 real people are using it for real projects (including ScraperWiki!). I would now consider it foolish to start a greenfield Python project in Python 2. If you maintain a Python library, it is starting to look negligent if it doesn’t work with Python 3.

Your Python 2 programming skills will mostly transfer to Python 3. There will be some teething trouble: print() and urllib still fox me sometimes, and I find myself using list() a lot more when debugging (because more things are generators). Niggly details aside, basically everything works and most things are a bit better.
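For the curious, here is a minimal sketch of the niggles I mean; nothing ScraperWiki-specific, just stock Python 3 behaviour:

[sourcecode language="python"]
# print is now a function, not a statement:
print("hello", "world", sep=", ")

# Many things that used to return lists now return lazy iterators,
# so wrap them in list() to actually see the values while debugging:
squares = map(lambda x: x * x, range(5))
print(squares)        # <map object at 0x...>, not the values
print(list(squares))  # [0, 1, 4, 9, 16]

# urllib was reorganised; Python 2's urllib2.urlopen now lives here:
from urllib.request import urlopen
[/sourcecode]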

The Go Programming Language

Globally, I think the success of Go (the programming language) remains uncertain, but its ecosystem is now large enough to sustain it in its own right. The risks here are not particularly technical but in the community. I think we would have difficulty hiring a Go programmer (we would have to find a programmer and train them).

The challenge for the next year or so is to work out what existing skills people have that transfer to Go, and, related to that, what a good framework of precursor skills for learning Go looks like. Personally speaking, when learning Go my C skills help me a lot, as does the fact that I already know what a coroutine is. I would say that knowledge of Java interfaces will also help.

I don’t think there’s a good path to learning Go yet; it will be interesting to see what develops. For the “Go curious”, the Tour of the Go Programming Language is worth a look.

Docker / containers

Docker is healthy, and while it might not win the “container wars”, containers are clearly going to be technically useful for the next few years (flashback to OS VM). Effort spent learning Docker is likely to be useful in other “API over container” solutions too.

Services

Increasingly software is accessed not via a library but via a service available on the web (Software as a Service, SaaS). For example, ScraperWiki has a service to convert PDFs to tables.

ScraperWiki already use a few of these (for email delivery, database storage, accounting, payments, uptime alerts, notifications), and we’ll almost certainly be using more in the future. The obvious difference compared to using a library or building it yourself is that Software as a Service has a direct monetary cost. But that doesn’t necessarily make it more expensive. Consider e-mail delivery. ScraperWiki definitely has the technical expertise to manage our own mail delivery, but as a startup we don’t have the time to maintain mail servers or the desire to keep our mail-server skills up to date. We’d rather buy that expertise in the form of the service that SendGrid offers.
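To make the “buy” side concrete: using a hosted delivery service mostly means pointing bog-standard SMTP code at the provider’s relay. A minimal sketch (the hostname and credentials below are placeholders, not our real configuration):

[sourcecode language="python"]
import smtplib
from email.mime.text import MIMEText

msg = MIMEText("Your dataset has finished processing.")
msg["Subject"] = "ScraperWiki notification"
msg["From"] = "noreply@example.com"
msg["To"] = "user@example.com"

# The relay host, port and credentials come from the e-mail provider.
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()                  # encrypt the connection
    server.login("apikey", "SECRET")   # provider-issued credentials
    server.send_message(msg)
[/sourcecode]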

The future is much like the present. We will continue to make buy/build decisions, and increasingly the “buy” side will be a SaaS. The challenges will be in evaluating the offerings. Do they have a nice icon?

Amazon Web Services (AWS)

The mother of all SaaS.

It’s not going away and it’s getting increasingly complex. Amazon release new products every few weeks or so, and the web console becomes increasingly bewildering. I think @frabcus’s observation that “operating the AWS console” is a skill is spot on. I think there is an analogy (suggested by @IanHopkinson_) with the typing pool to desktop word processor transition: a low-paid workforce skilled in typing got replaced by giving PCs with word processors to high-paid executives with no typing skills. We no longer need IT technicians to build racks and wire them together, but instead relatively well paid devops staff do it virtually.

Cloud Formation. It’s a giant “JSON language” that describes how to create and wire together any piece of AWS infrastructure.

sigh

Probably the thing to look at though. Even if we don’t use it directly (for example, we might use some replacement for Elastic Beanstalk or generate Cloud Formation files with scripts), knowing how to read it will be useful.
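For a flavour of the “generate Cloud Formation files with scripts” option, here is a minimal sketch; the template below is illustrative, not one of our real stacks:

[sourcecode language="python"]
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Example stack: a single S3 bucket",
    "Resources": {
        "DataBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "example-data-bucket"},
        },
    },
}

with open("stack.template.json", "w") as f:
    json.dump(template, f, indent=2)
[/sourcecode]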

Big instances versus MapReduce

Whilst I think MapReduce will remain an important technology for the sector as a whole, it will be in opposition to the “single big instance”. Don’t get too hung up on terminology: I’m really using MapReduce as a placeholder for all MapReduce- and Hadoop-like “big data query” technologies.

Amazon Web Services makes it possible to rent “High Performance Computing” class nodes for reasonable amounts of money. In 2015, you can get a 16-core (32 hyperthread) instance with 60 or 244 gigabytes of RAM for a couple of bucks per hour. I think the gap between laptops and big instances is widening, meaning that more ad hoc analysis will be done on a transient instance. You can process some pretty big datasets with 244 GB of RAM without needing to go all Hadoopy.
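A quick back-of-envelope calculation shows why (assuming 8-byte floats and ignoring overheads):

[sourcecode language="python"]
ram_bytes = 244 * 2**30               # 244 GiB of RAM
bytes_per_value = 8                   # one double-precision float
print(ram_bytes // bytes_per_value)   # roughly 32.7 billion values
# In practice you want plenty of headroom, but that is still a lot of
# room for, say, pandas on a single machine before reaching for Hadoop.
[/sourcecode]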

That is not to say that we should ignore MapReduce, but the challenge may be to find datasets of interest that actually require it.

Crypto

Snowden’s revelations tell us that the NSA, and other state-level actors, are basically everywhere. In particular, there are hostile actors in the data centre. We should consider node to node communications as going across the public internet, even if they are in the same data centre. Practically speaking, this means HTTPS / TLS everywhere.
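In Python terms that is mostly a matter of speaking HTTPS and never switching certificate checking off; a tiny sketch, with a placeholder URL standing in for one of our internal services:

[sourcecode language="python"]
import requests

# requests verifies the server's TLS certificate by default; leave it on.
response = requests.get("https://internal-service.example.com/health")
response.raise_for_status()

# And refuse to fall back to plain HTTP.
assert response.url.startswith("https://")
[/sourcecode]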

If we provide a data service to our clients using AWS then ideally only the client, us, and AWS should have access to that data. It is unfortunate that AWS have to have access to the data, but it is a practical necessity. Having trusted AWS, we can’t stop them shipping all of our data to the NSA (or even know if they do), so it is a matter of their reputation that they not do that. At least if we encrypt our network traffic, AWS have to take fairly aggressive steps to send our data to anyone else (they would have to fish our session keys out of their RAM, or mass-transfer the contents of their RAM somewhere).

There is lots more to do and discuss here. Fortunately ScraperWiki is pretty healthy in this regard: we are sensitive to it and we’re always discussing security.

Browser IDE

Here I’m talking about the “behind the scenes” world that is accessed from the Developer Tools. There is an awesome box of tools there. Programmers are probably all aware of the JavaScript Console and the Web Inspector, but these are the tip of a very large and featureful iceberg. Almost everything is dynamic: adding and disabling CSS rules updates the page live, as does editing the HTML. There is a fully featured single-step debugger that includes a code editor. Only the other day I learnt of the “emulate mobile device” mode for screen size and network.

Spend time poking about with the Developer Tools.

Machine Learning

Although it’s not an area that I know much about, I suspect that it’s not just a buzzword and it may turn out to be useful.

git / Version Control

git is great and there is a lot to learn, but don’t forget its broader historical context. Believe it or not, git is not the first version control tool to come along, and github.com is not the first Software Configuration Management company. Just because git does it one particular way doesn’t mean that that way is best; it means that it is merely good enough for one person to manage the flow of patches that go to make up the Linux kernel. I would also remind everyone that git != github. Practically, be aware of which bits of your workflow are git, and which are github.

(I’m bound to say something like that, Software Configuration Management used to be part of my consultancy expertise)

Google have declared this race won. They’ve shut down their own online code management product and have started hosting projects on github.

A plausible future is where everyone uses git and most people are blind to there being anything better and most people think that git == github. Whingeing aside, that future is a much better place to work in than if sourceforge had won.

The Future Technology Radar Report

Who knows what will be on the radar in the future.

End User Programming at the Office for National Statistics
https://blog.scraperwiki.com/2015/06/end-user-programming-at-the-office-for-national-statistics/
Wed, 10 Jun 2015 16:32:00 +0000

The Office for National Statistics (ONS) approached us regarding a task which involves transforming data in a spreadsheet. Basically, unpivoting it.

Data transformation is quite a general problem, but one with recurring patterns. Marginal variables are usually, well, somewhere in the margin. Cells generally refer to an observation or the name or value of a marginal variable. But there is enough variation that we cannot hope to capture all the possibilities in a GUI tool. Enter the formal language.
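To make “unpivoting” concrete, here is the simplest possible case, sketched with pandas rather than DataBaker (the data is invented; real ONS spreadsheets are far messier, which is exactly why a formal language is needed):

[sourcecode language="python"]
import pandas as pd

wide = pd.DataFrame({
    "Region": ["North East", "North West"],
    "2013": [100, 200],
    "2014": [110, 210],
})

tidy = pd.melt(wide, id_vars="Region", var_name="Year", value_name="Value")
print(tidy)
#        Region  Year  Value
# 0  North East  2013    100
# 1  North West  2013    200
# 2  North East  2014    110
# 3  North West  2014    210
[/sourcecode]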

[Image: ONS and ScraperWiki working on DataBaker]

DataBaker, introduced in Dragon’s earlier article, is essentially a formal language for describing particular ways of transforming data. It is a dialect of Python, in that it is Python, but specialised for describing spatial relationships within spreadsheets (SQLAlchemy and numpy are more famous examples that can also be considered dialects of Python).

It might seem unusual to invent a formal language for this task, but we have read Nardi’s “A Small Matter of Programming” and are encouraged by quotes such as “ordinary people unproblematically learn and use formal languages and notations in everyday life”.

Earlier this week I interviewed Darren, lead for the ONS team that approached ScraperWiki. This team had essentially no previous programming experience, and are now successfully using DataBaker in their work. They are not professional programmers using a general purpose programming language, they are domain specialists using an end user programming language.

We chose Python because of its clarity and its proven ability to be learned quickly by relative newcomers (for example, Python is a cornerstone of Software Carpentry’s bootcamps to help scientists learn to code). Darren’s team have no interest in learning Python per se, only in using DataBaker to do their job. It’s testimony to our success that they never have to think “I’m programming in Python”.

We are sneaking programming in by the backdoor, and this works because staff at ONS are already familiar with the domain of spreadsheets, and this makes it easier for them to understand the core concepts behind DataBaker. As Nardi says “people are likely to be better at learning and using computer languages that closely match their interests and their domain knowledge”.

Another part of the success of this project was that the ONS team had what Nardi refers to as a local developer. These are “domain experts who happen to have an intrinsic interest in computers and have more advanced knowledge of a particular program” (Nardi, again). Their local developer is the team’s go-to person for programming problems, and writes scripts, helps curate knowledge, and trains the team peer-to-peer.

A programming language provides the ultimate flexibility, but should only be used as a solution with care, and whilst being attentive to the situation and expertise of the end users. The task for which Darren’s team use DataBaker has no alternative solution: without DataBaker, the task wouldn’t be done. End User Programming for the win!

Footnote

The Nardi quotes are from Bonnie Nardi’s most excellent and sadly little known book: “A Small Matter of Programming”.

npm install urchin
https://blog.scraperwiki.com/2013/06/npm-install-urchin/
Mon, 24 Jun 2013 16:36:59 +0000

[Image: Urchin tests]

Urchin, the shell testing framework for extreme hipster superheroes (I’m not including myself in that group, I should add), is now available as an npm package.

That means you can install it using npm:

sudo npm install -g urchin

If you’re not hipster enough to use npm then you can still wget it from github:

cd /usr/local/bin
wget https://raw.github.com/scraperwiki/urchin/master/urchin
chmod +x urchin

For Urchin, a test is just a program that has exit status 0 for success, and non-0 for fail. Put them all in a directory called test and run:

urchin test
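For example, a test can be any executable at all. Here is a made-up one written in Python (the file name is hypothetical; remember to chmod +x it):

#!/usr/bin/env python
# test/check_addition: a hypothetical Urchin test, not part of Urchin itself.
# Exit status 0 means pass; anything else means fail.
import sys

sys.exit(0 if 2 + 2 == 4 else 1)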

I’ve been using it to organise and run the tests for one of my sillier recent projects.

A sea of data
https://blog.scraperwiki.com/2013/04/a-sea-of-data/
Tue, 30 Apr 2013 11:14:29 +0000

My friend Simon Holgate of Sea Level Research has recently “cursed” me by introducing me to tides and sea-level data. Now I’m hooked. Why are tides interesting? When you’re trying to navigate a super-tanker into San Francisco Bay and you only have a few centimetres of clearance, whether the tide is in or out could be quite important!

The French port of Brest has the longest historical tidal record. The Joint Archive for Sea Level has hourly readings from 1846. Those of you wanting to follow along at home should get the code:

[sourcecode language="text"]
git clone git://github.com/drj11/sea-level-tool.git
cd sea-level-tool
virtualenv .
. bin/activate
pip install -r requirements.txt
[/sourcecode]

After that lot (phew!), you can get the data for Brest by going:

[sourcecode language="text"]
code/etl 822a
[/sourcecode]

The sea level tool is written in Python and uses our scraperwiki library to store the sea level data in a sqlite database.
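The same data can be pulled straight back out with Python’s sqlite3 module. The table and column names here follow the R query below, and I’m assuming the height column is called z, matching the notation in the text:

[sourcecode language="python"]
import sqlite3

db = sqlite3.connect("scraperwiki.sqlite")
rows = db.execute(
    "SELECT t, z FROM obs WHERE jaslid = ? ORDER BY t", ("h822a",)
)
for t, z in rows.fetchmany(5):
    print(t, z)
[/sourcecode]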

Tide data can be surprisingly complex (the 486 pages of [PUGH1987] are testimony to that), but in essence we have a time series of heights, z. Often even really simple analyses can tell us interesting facts about the data.

As Ian tells us, R is good for visualisations. And it turns out it has an installable RSQLite package that can load R dataframes from a sqlite file. And I feel like a grown-up data scientist when I use R. The relevant snippet of R is:

[sourcecode language="r"]
library(RSQLite)
db <- dbConnect(dbDriver('SQLite'), dbname='scraperwiki.sqlite', loadable.extensions=TRUE)
bre <- dbGetQuery(db, 'SELECT * FROM obs WHERE jaslid == "h822a" ORDER BY t')
[/sourcecode]

I’m sure you’re all aware that the sea level goes up and down to make tides and some tides are bigger than others. Here’s a typical month at Brest (1999-01):

[Figure: bre-ts, hourly sea levels at Brest for January 1999]

There are well over 1500 months of data for Brest. Can we summarise the data? A histogram works well:

[Figure: bre-hist, histogram of hourly sea level observations at Brest]

Remember that this is a histogram of hourly sea level observations. The two humps show that the observations cluster around two heights which are more commonly observed than all others: the mean low tide and the mean high tide. The range, the distance between mean low tide and mean high tide, is about 2.5 metres (big tides, big data!).

This is a comparatively large range, certainly compared to a site like St Helena (where the British imprisoned Napoleon after his defeat at Waterloo). Let’s plot St Helena’s tides on the same histogram as Brest, for comparison:

[Figure: sth2-hist, histograms of hourly sea level at Brest and St Helena]

Again we have a mean low tide and a mean high tide, but this time the range is about 0.4 metres, and the entire span of observed heights including extremes fits into 1.5 metres. St Helena is a rock in the middle of a large ocean, and this small range is typical of oceanic tides. It’s the shallow waters of a continental shelf and the complex basin dynamics of northwest Europe (and Kelvin waves; see Lucy’s IgniteLiverpool talk for more details) that give ports like Brest a high tidal range.

Notice that St Helena has some negative sea levels. Sea level is measured to a 0-point that is fixed for each station but varies from station to station. It is common to pick that point as being the lowest sea level (either observed or predicted) over some period, so that almost all actual observations are positive. Brest follows the usual convention, almost all the observations are positive (you can’t tell from the histogram but there are a few negative ones). It is not clear what the 0-point on the St Helena chart is (it’s clearly not a low low water, and doesn’t look like a mean water level either), and I have exhausted the budget for researching the matter.

Tides are a new subject for me, and when I was reading Pugh’s book, one of the first surprises was the existence of places that do not get two tides a day. An example is Fremantle, Australia, which instead of getting two tides a day (semi-diurnal) gets just one tide a day (diurnal):

[Figure: fre-ts, hourly sea levels at Fremantle, showing one tide a day]

The diurnal tides are produced predominantly by the effect of lunar declination. When the moon crosses the equator (twice a nodical month), its declination is zero, the effect is reduced to zero, and so are the diurnal tides. This is in contrast to the twice-daily tides: while they exhibit large (spring) and small (neap) tides, we still get tides whatever the time of the month. Because of this modulation of the diurnal tide there is no “mean low tide” and “mean high tide”; tides of all heights are produced, and we get a single hump in the distribution (adding the Fremantle data in red):

[Figure: fre3-hist, the Fremantle histogram (red) added to those for Brest and St Helena]

So we’ve found something interesting about the Fremantle tides from the kind of histogram which we probably learnt to do in primary school.

Napoleon died on St Helena, but my investigations into St Helena’s tides will continue on the ScraperWiki data hub, using a mixture of standard platform tools, like the summarise tool, and custom tools, like a tidal analysis tool.

Image: “Napoleon at Saint-Helene”, by Francois-Joseph Sandmann, public domain, from Wikipedia.

So web scraping is easy?
https://blog.scraperwiki.com/2013/01/so-web-scraping-is-easy/
Thu, 24 Jan 2013 11:09:59 +0000

Journalists, academics and budding open data hackers often praise ScraperWiki for making web scraping easy. And while it’s true our platform and powerful APIs let you get more done, more easily, the statement still creates some head-scratching at ScraperWiki HQ.

That’s because, as far as we can tell, scraping is hard, no matter what platform you’re using.

For example, let’s pretend you’re scraping a fairly ordinary web page that has some data as a table. Barely a sentence in and we already need to know about HTML and URLs. We need to access this page programmatically, so we need to pick a language to write a scraper in.  Say Python.  How do we select the elements we need from the table? A CSS selector. The header is blue, so how do we detect the colour of an element? RGB hex-triples…
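That single paragraph already implies something like the following; the URL, the selector and the hex-triple are all made up for illustration:

[sourcecode language="python"]
import requests
from bs4 import BeautifulSoup

# Fetch the page (HTTP, URLs) and parse it (HTML).
html = requests.get("http://www.example.com/stats.html").text
soup = BeautifulSoup(html, "html.parser")

for row in soup.select("table#results tr"):   # a CSS selector
    style = row.get("style") or ""
    if "#0000ff" in style:                    # the blue header, as an RGB hex-triple
        continue                              # skip it
    cells = [td.get_text(strip=True) for td in row.select("td")]
    if cells:
        print(cells)
[/sourcecode]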

A little bit more thinking like this leads to something like this diagram:

[Diagram: Web Scraping Concepts]

If you want to do web scraping, you need to know all of that. Admittedly, you don’t need to be an expert (not for most scraping tasks), but you do need to know at least a little bit about lots of things before you can even begin to get something useful out of a web page.

And that’s just a web page. What if you’re scraping PDFs? There’s a whole extra little “tree of knowledge” to do with PDFs.  And one for KML.  And another one for SVG. And another one for Excel and another for CSVs. And another one for those hipster fixie formats that the climate science community like to use.  And then, once you need to visualise your findings, you’re into CSS (styles this time, rather than just selectors), Javascript libraries and plugins, and maybe even (La)TeX.

That’s why we’re changing ScraperWiki. Knowing all this stuff gives you immense power and flexibility, but it’s a tall ask when you just want to quickly grab and analyse some data off the web.  By using pre-built data tools on the new ScraperWiki, you get to perform the most common tasks, quickly and easily, without having to take evening classes in Computer Science.  And then, once you hit the tool’s limitations (because eventually you always will) you can fork and tweak the code, without having to start again from scratch.  And in the process, maybe you’ll even learn something about HTML, CSS, XPath, JSON, Javascript, CSVs, Excel, PDFs, KML, SVGs…
