GOV.UK – Contracts Finder… £1billion worth of carrots! https://blog.scraperwiki.com/2015/10/gov-uk-contracts-finder-1billion-worth-of-carrots/ Wed, 07 Oct 2015 09:23:14 +0000 https://blog.scraperwiki.com/?p=758224098

Carrots by Fovea Centralis / CC BY-ND-2.0

This post is about the government Contracts Finder website. This site has been created with a view to helping SMEs win government business by providing a “one-stop-shop” for public sector contracts.

The government has been doing some great work transitioning its departments to GOV.UK and giving a range of online services a makeover. We’ve been involved in this work: in the first instance scraping the departmental content for GOV.UK, then making performance dashboards for content managers on the Performance Platform.

More recently we’ve scraped the content for databases such as the Air Accident Investigation Board, and made the new Civil Service People Survey website.

As well as this we have an interest in other re-worked government services such as the Charity Commission website, data.gov.uk and the new Companies House website.

Getting back to Contracts Finder: there’s an archive site listing opportunities posted before 26th February 2015, and a live site, the new Contracts Finder website, carrying opportunities posted after that date. Central government departments and their agencies were required to advertise contracts over £10k on the old Contracts Finder. The wider public sector could advertise contracts there too, but wasn’t required to (on the new Contracts Finder it is required to for contracts over £25k).

The confusingly named Official Journal of the European Union (OJEU) also publishes calls to tender. These are required by EU law above a certain threshold value, which depends on the area of business in which they are placed. Details of these thresholds can be found here. Contracts Finder also lists opportunities over these thresholds, but it is not clear whether it is required to do so.

The interface of the new Contracts Finder website is OK, but there is far more flexibility to probe the data if you scrape it from the website. For the archive data this is more a case of downloading the CSV files provided, although it is worth scraping the detail pages linked from the downloads to get additional information, such as the supplier to which work was awarded.

The headline data published in an opportunity comprises the title and description, the name of the customer with contact details, the industry (a categorisation of the requirements), a contract value, and a closing date for applications.

We run the scrapers on our Platform, which makes it easy to download the data as an Excel spreadsheet or CSV that we can then load into Tableau for analysis. Tableau allows us to make nice visualisations of the data, and to carry out our own ad hoc queries free from the constraints of the source website. There are about 15,000 entries on the new site, and about 40,000 in the archive.
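
Once the data is exported, a quick ad hoc query can also be run in Python before it goes anywhere near Tableau. Below is a minimal sketch using pandas; the file name and the column names ("Customer", "Value", "Industry") are assumptions for illustration, not the actual export schema.

    # Minimal sketch of an ad hoc query on the exported Contracts Finder data.
    # The file name and column names ("Customer", "Value", "Industry") are
    # assumed for illustration; the real export may use different headings.
    import pandas as pd

    df = pd.read_csv("contracts_finder.csv")
    df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

    # Opportunities in a given value band and sector, counted per customer
    band = df[(df["Value"] >= 10000) & (df["Value"] <= 250000)]
    it_band = band[band["Industry"].str.contains("Computer", na=False)]
    print(it_band.groupby("Customer").size().sort_values(ascending=False).head(20))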

The initial interest for us was just getting an overview of the data: how many contracts were available, and in what price range? As an example we looked at proposals in the range £10k–£250k in the Computer and Related Services sector. The chart below shows the number of opportunities in this range grouped by customer.

[Chart: number of £10k–£250k opportunities, grouped by customer]

These opportunities are actually all closed. How long were opportunities open for? The histogram below shows that most adverts are open for 2–4 weeks; however, a significant number have closing dates before their publication dates – it’s not clear why.

[Histogram: how long adverts on the new site were open for]

There is always fun to be found in a dataset of this size. For example, we learn that Shrewsbury Council appears to have tendered for up to £1bn worth of fruit and vegetables (see here). With a kilogram of carrots costing less than £1, that is a lot of veg – or maybe a mis-entry in the data!

Closer to home, we discover that Liverpool Council spent £12,000 on a fax service for 2 years! There is also a collection of huge contracts for the MOD, which appears to do its contracting from Bristol.

Getting down to more practical business, we can use the data to see what opportunities we might be able to apply for. We found the best way to address this was to build a search tool in Tableau which allows us to search and filter on multiple criteria (words in title, description, customer name, contract size) and view the results grouped together. So it is easy, for example, to see that Leeds City Council has tendered for £13 million in Computer and Related Services, the majority of which went on a framework contract with Fujitsu Services Ltd, or that Oracle won a contract for £6.5 million from the MOD for its services. You can see the austere interface we have made to this data below.

[Screenshot: the Tableau search interface]

Do you have some data which you want exploring? Why not get in touch with us!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
Henry Morris (CEO and social mobility start-up whizz) on getting contacts from PDF into his iPhone https://blog.scraperwiki.com/2015/09/henry-morris-entrepreneur-for-social-mobility-on-getting-contacts-from-pdf-into-his-iphone/ Wed, 30 Sep 2015 14:11:16 +0000 https://blog.scraperwiki.com/?p=758224084

Henry Morris

Meet @henry__morris! He’s the inspirational serial entrepreneur who set up PiC and upReach. They’re amazing businesses that focus on social mobility.

We interviewed him for PDFTables.com

He’s been using it to convert delegate lists that come as PDFs into Excel, and then into his Apple iPhone.

It’s his preferred personal Customer Relationship Management (CRM) system: a simple and effective solution for keeping his contacts up to date and in context.

Read the full interview

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

 

Which car should I (not) buy? Find out, with the ScraperWiki MOT website… https://blog.scraperwiki.com/2015/09/which-car-should-i-not-buy-find-out-with-the-scraperwiki-mot-website/ Wed, 23 Sep 2015 15:14:58 +0000 https://blog.scraperwiki.com/?p=758223689 I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich: it is designed to give you facts and figures, and to provide an interface for you to interactively select and view the data you are interested in – in this case the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to either view the top faults (failure modes) or the pass rate for the selected make and model. There is the option of “filter by year” which selects vehicles by the first year on the road in order to narrow down your search to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults; if you want to break each one down into more detailed descriptions, click the ‘detail’ button:

[Screenshot: detailed fault categories]

What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, there are certain disadvantages associated with the practice. The most prominent is that a database connection effectively has to be maintained at all times. In addition, retrieval of data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead.

mot.scraperwiki.com, by contrast, uses a dictionary stored in memory as its data source. A dictionary is a data structure in Python similar to a HashMap in Java: it consists of key–value pairs and is known to be efficient for fast lookups. But where do we get a dictionary from, and what should its structure be? Let’s back up a little bit, maybe to the very beginning. The following general procedure was followed to get us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix data manipulation functions (mostly command-line) were used to extract the columns of interest.
  • Data was then loaded into a PostgreSQL database where data integration and analysis was carried out. This took the form of joining tables, grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built using this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised, this text file is converted to a nested dictionary – that is to say, a dictionary of dictionaries, where the value associated with a key is another dictionary which can be accessed using a further key.
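
As a rough illustration (not the production code), building that nested dictionary might look something like the sketch below; the tab-delimited layout and exact column names are assumptions based on the columns listed above.

    # Rough sketch: build a nested dictionary keyed make -> model -> year,
    # each leaf holding (testresult, count) pairs. The tab-delimited layout
    # and column names are assumptions; the real file may differ.
    import csv
    from collections import defaultdict

    def load_mot_data(path):
        data = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
        with open(path) as f:
            for row in csv.DictReader(f, delimiter="\t"):
                data[row["make"]][row["model"]][row["year"]].append(
                    (row["testresult"], int(row["count"])))
        return data

    mot = load_mot_data("mot_aggregated.txt")
    print(sorted(mot["FORD"].keys()))   # models available for a make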

When you select a car make, this dictionary is queried to retrieve the models for you; in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the computations of the top faults or pass rate are made on the dictionary. When you don’t select a specific year, the data in the dictionary is aggregated across all years.
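
Continuing the sketch above, a pass-rate query over that structure might look like the following, aggregating across all years when no year is selected. The test result code ("P" for a pass) is an assumption, not necessarily what the real data uses.

    # Sketch: pass rate for a make/model, optionally restricted to one model year.
    # Continues from the nested dictionary built above; the "P" pass code is assumed.
    def pass_rate(data, make, model, year=None):
        years = [year] if year else list(data[make][model].keys())
        passed = total = 0
        for y in years:
            for result, count in data[make][model][y]:
                total += count
                if result == "P":
                    passed += count
        return passed / float(total) if total else None

    print(pass_rate(mot, "FORD", "FIESTA"))          # aggregated across all years
    print(pass_rate(mot, "FORD", "FIESTA", "2008"))  # a single model year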

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me, or ScraperWiki about this work and …enjoy!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
Civil Service People Survey – Faster, Better, Cheaper https://blog.scraperwiki.com/2015/09/civil-service-people-survey-faster-better-cheaper/ Tue, 08 Sep 2015 13:46:21 +0000 https://blog.scraperwiki.com/?p=758223952

Civil Service Reporting Platform

The Civil Service is one of the UK’s largest employers. Every year it asks every civil servant what they think of their employer: UK plc.

For Sir Jeremy Heywood the survey matters. In his blog post “Why is the People Survey Important?” he says

“The survey is one of the few ways we can objectively compare, on the basis of concrete data, how things are going across departments and agencies.  …. there are common challenges such as leadership, improving skills, pay and reward, work-life balance, performance management, bullying and so on where we can all share learning.”

The data is collected by a professional survey company called ORC International.  The results of the survey have always been available to survey managers and senior civil servants as PDF reports. There is access to advanced functionality within ORC’s system to allow survey managers more granular analysis.

So here’s the issue. The Cabinet Office wants to give access to all civil servants, in a fast and reliable way. It wants to give more choice and speed in how the data is sliced and diced – in real time. Like all government departments, it is also under pressure to cut costs.

ScraperWiki built a new Civil Service People Survey Reporting Platform, and it’s been challenging. It’s a moderately large data set: there are close to half a million civil servants, over 250,000 of whom answered the last survey, which contains 100 questions. There are 9,000 units across government. This means 30,000,000 rows of data per annum, and we’ve ingested 5 years of data.

The real challenges were around:

  • Data Privacy
  • Real Time Querying
  • Design

Data privacy

The civil servants are answering questions on their attitudes to their work, their managers and the organisations they work in, along with questions on who they are: gender, ethnicity, sexual orientation – demographic information. Their responses are strictly confidential, and one of the core challenges of the work is maintaining this confidentiality in a tool available over the internet, with a wide range of data filtering and slicing functionality.

A naïve implementation would reveal an individual’s responses either directly (e.g. if they are the only person in a particular demographic group in a particular unit) or indirectly, by taking the data from two different views and taking the difference to reveal the individual. ScraperWiki researched and implemented a complex set of suppression algorithms to allow publishing of the data without breaking confidentiality.
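
The production rules are considerably more involved than this, but a minimal sketch of the core idea, suppressing any cell whose respondent count falls below a threshold so that small groups cannot be read off directly, might look like the following (the threshold value is an assumed placeholder, not the real figure).

    # Minimal sketch of small-count suppression. The real People Survey rules are
    # more complex (they also guard against recovering a value by differencing two
    # views); the threshold here is an assumed placeholder.
    SUPPRESSION_THRESHOLD = 10

    def suppress(cells):
        """cells: dict mapping a group label to a respondent count."""
        return {group: (count if count >= SUPPRESSION_THRESHOLD else None)  # None = suppressed
                for group, count in cells.items()}

    print(suppress({"Grade 7, Unit A": 42, "Grade 6, Unit B": 3}))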

Real-time queries

Each year the survey generates 30,000,000 data points, one for each answer given by each person, and this is multiplied by five years of historical data. To enable a wide range of queries, our system processes this data for every user request rather than relying on pre-computed tables, which would limit the range of available queries.

Aside from the moderate size, the People Survey data is rich because of the complexity of the Civil Service organisational structure. There are over 9,000 units in the hierarchy, which is in places up to 9 levels deep. The hierarchy is used to determine how the data are aggregated for display.
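
To make the role of the hierarchy concrete, here is a hedged sketch of rolling counts up from each unit to all of its ancestors; the parent/child mapping and figures are illustrative only, not the platform’s actual schema or data.

    # Sketch: roll respondent counts up a unit hierarchy so each unit's figure
    # includes everything beneath it. The mapping and counts are illustrative.
    def rollup(parents, counts):
        totals = dict(counts)
        for unit, count in counts.items():
            parent = parents.get(unit)
            while parent is not None:          # walk up towards the root
                totals[parent] = totals.get(parent, 0) + count
                parent = parents.get(parent)
        return totals

    parents = {"Agency A": "Department", "Team A1": "Agency A", "Department": None}
    counts = {"Department": 500, "Agency A": 200, "Team A1": 50}
    print(rollup(parents, counts))   # {'Department': 750, 'Agency A': 250, 'Team A1': 50}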

Standard design

An early design decision was to use the design guidelines and libraries developed by the Government Digital Service for the GOV.UK website. This means the Reporting Platform has the look and feel of GOV.UK and, we hope, follows its excellent usability guidelines.

Going forward

The People Survey digital reporting platform alpha was put into the hands of survey managers at the end of last year. We hope to launch the tool to the whole Civil Service after the 2015 survey, which will be held in October. If you aren’t a survey manager, you can get a flavour of the People Survey Digital Reporting Platform in the screenshots in this post.

Do you have statistical data you’d like to publish more widely, and query in lightning fast time? If so, get in touch.

GP Prescribing data for the UK https://blog.scraperwiki.com/2015/08/gp-prescribing-data-for-the-uk/ Wed, 12 Aug 2015 10:36:10 +0000 https://blog.scraperwiki.com/?p=758223623 Over the past few weeks I have been looking at GP Prescribing data from the Health & Social Care Information Centre, which presents the number of items and the cost of all the different medications prescribed and dispensed by GP practices across the UK. The dataset amounts to millions of rows of data each month. I am trying to find trends and patterns in the number of items recorded in this data.

As part of my internship, provided by the Q-step programme,  I am trying to think more quantitatively.

One of the things I have learnt is that, when given a dataset, the first thing to do is to break it down and make sure the meaning of everything is understood. So, for the data I am looking at, I researched the meaning of each column heading. In this blog post I will explain what each of these terms means.

The British National Formulary (BNF)

Central to the GP prescribing data is the BNF. This is the British National Formulary which is produced by The British Medical Association and the Royal Pharmaceutical Society. It is used to give doctors and nurses advice on the selection, prescribing, dispensing and administration of medication in the UK. The BNF classifies medicines into therapeutic groups which are known as BNF Chapters. There are 15 BNF chapters and some ‘pseudo BNF chapters’ (numbered 18 to 23) that include items such as dressings and appliances. The 15 BNF Chapters are:

  • Chapter 1: Gastro-intestinal System
  • Chapter 2: Cardiovascular System
  • Chapter 3: Respiratory System
  • Chapter 4: Central Nervous System
  • Chapter 5: Infection
  • Chapter 6: Endocrine System
  • Chapter 7: Obstetrics, Gynaecology and Urinary-tract Disorders
  • Chapter 8: Malignant Diseases and Immunosuppression
  • Chapter 9: Nutrition and Blood
  • Chapter 10: Musculoskeletal and Joint Diseases
  • Chapter 11: Eye
  • Chapter 12: Ear, Nose, and Oropharynx
  • Chapter 13: Skin
  • Chapter 14: Immunological Products and Vaccines
  • Chapter 15: Anaesthesia

Under each BNF Chapter there are subsections: for example, under Chapter 2 (Cardiovascular System) one of the subsections is 2.12 Lipid-regulating drugs.

The BNF Code is the unique code that each medication has. An example of a BNF Code is 0212000U0AAADAD, which is for the drug Simvastatin Tablet 40mg. The BNF Code for each drug is formed as follows (a short parsing sketch follows the list):

  • Characters 1 & 2 show the BNF Chapter (02)
  • 3 & 4 show the BNF Section (12)
  • 5 & 6 show the BNF paragraph (00)
  • 7 shows the BNF sub-paragraph (0)
  • 8 & 9 show the Chemical Substance (U0)
  • 10 & 11 show the Product (AA)
  • 12 & 13 indicate the Strength and Formulation (AD)
  • 14 & 15 show the equivalent (AD). The ‘equivalent’ is defined as follows:
    • If the presentation is a generic, the 14th and 15th character will be the same as the 12th and 13th character.
    • Where the product is a brand, the 14th and 15th characters will match those of the generic equivalent, unless the brand does not have a generic equivalent, in which case A0 will be used.
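
Because the positions are fixed, pulling a code apart is a matter of string slicing. The sketch below simply applies the character positions listed above to the Simvastatin example.

    # Split a 15-character BNF code into its parts using the character positions
    # listed above (Python slices are zero-based, the positions above are one-based).
    def parse_bnf_code(code):
        return {
            "chapter": code[0:2],                 # characters 1-2
            "section": code[2:4],                 # characters 3-4
            "paragraph": code[4:6],               # characters 5-6
            "sub_paragraph": code[6],             # character 7
            "chemical_substance": code[7:9],      # characters 8-9
            "product": code[9:11],                # characters 10-11
            "strength_formulation": code[11:13],  # characters 12-13
            "equivalent": code[13:15],            # characters 14-15
        }

    parts = parse_bnf_code("0212000U0AAADAD")   # Simvastatin Tablet 40mg
    print(parts["chapter"], parts["section"])   # 02 12 -> Cardiovascular System, Lipid-regulating drugs
    print(parts["strength_formulation"] == parts["equivalent"])  # True -> a generic presentation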

The BNF Name is the individual preparation name for each drug. It includes the name of the drug, which could be branded or generic, followed by the form it comes in and the strength of the medication. On the GP Prescribing Data – Presentation Level dataset I used, the BNF names were often presented in abbreviated form due to the limited number of characters available in the dataset.

Other terms

  • A Strategic Health Authority (SHA) is an NHS organisation established to lead the strategic development of the local health service and manage Primary Care Trusts and NHS Trusts; SHAs are responsible for organising working relationships through service level agreements.
  • Primary Care Trusts (PCTs) sit under SHAs and are local organisations responsible for managing health services in the community, such as GP surgeries, NHS walk-in centres, dentists and opticians. In the last 2–3 years PCTs have been replaced by Clinical Commissioning Groups (CCGs), but much of the data I am looking at refers to PCTs.
  • Items are defined as the number of items that were dispensed in the specified month. A prescription item is a single supply of a medicine, dressing or appliance written on a prescription form. If one prescription form includes four medicines, it is counted as four prescription items.
  • Quantity is the amount of the drug dispensed, measured in units that depend on the form of the medication: for a tablet or capsule the quantity is the number of tablets or capsules, whereas for a cream or gel it is the weight in grammes.
  • Net Ingredient Cost (NIC) is the price of the drug written on the price list or drug tariff.
  • Actual Cost is the estimated cost to the NHS. It is calculated by subtracting the average percentage discount per item (based on the previous month) from the Net Ingredient Cost, and adding in the cost of a container for each prescription item (see the short worked example after this list). It is usually lower than the NIC.
  • Period is the year and month that the dataset covers.
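
As flagged above, here is a short worked sketch of the Actual Cost calculation as described. The discount percentage and container allowance below are made-up placeholder figures for illustration; the real values are published separately.

    # Sketch of the Actual Cost calculation described above. The discount
    # percentage and container allowance are placeholder figures only.
    def actual_cost(net_ingredient_cost, discount_percent, container_cost, items):
        discounted = net_ingredient_cost * (1 - discount_percent / 100.0)
        return discounted + container_cost * items

    # e.g. an NIC of £100 across 4 items, assuming a 7% discount and a 10p container allowance
    print(round(actual_cost(100.0, 7.0, 0.10, 4), 2))   # 93.4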

Now that I understand the meaning of each column of the dataset, I am trying to find new things in it. Feel free to refer back to this post when reading my future blogs on my findings, especially if you stumble upon something whose meaning you have forgotten.

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
Burn the digital paper! A call to arms https://blog.scraperwiki.com/2015/08/burn-the-digital-paper-a-call-to-arms/ Fri, 07 Aug 2015 14:58:04 +0000 https://blog.scraperwiki.com/?p=758223526 This is a blog post version of a lunchtime talk I gave at the Open Data Institute. You may prefer to listen to it or use the slides.

Stafford Beer

Stafford Beer was a British cybernetician.

Stafford Beer

He described four stages that happen when you get a computer.

Each stage ends in disappointment.

1. Amazement

It’s an electronic brain!

This is a homeostat from the article “The Electronic Brain” in the March 1949 issue of Radio Electronics.

Homeostat

At this stage you get no benefit from the computer. It is just amazing.

2. Digital paper

Most of us, in every day business use of computers, are at this stage.

We write documents, which in form look much like this one from 1946.

Argentine Situation

We send the documents to each other using an electronic metaphor for postal mail.

Letter Carrier Delivering Mail

Sure, copies are cheaper to make than real paper, and they arrive in an instant. However, the underlying process is the same. That’s why it was so easy to learn.

Even our data is digital paper.

This spreadsheet, a financial ledger from the World’s Columbian Exposition in 1893, looks much the same as any business spreadsheet today.

World’s Columbian Exposition spreadsheet

The numbers in the columns can add themselves up now, but it isn’t making full use of our computers.

This overuse of paper methods causes problems. Errors can waste billions of dollars.

Is Excel the most dangerous piece of software in the world?

It’s disappointing. Electronic brains must be able to do more!

3. Gather up as data

Let’s structure everything as data and put it all together in a massive store!

At this stage of computer use, the hope is that once everything is uniform, we’ll have amazing power and analysis. Everything will be better!

To do this, the only current method is to hire some nerds and get them to write complex, expensive software.

Computer Engineer

The engineer makes sure inputs are consistent. Consistency is what “data” means.

Apps are an example. They gather the structure as a side effect of user actions. It’s what social networks (and digital spies!) do.

Facebook like on a wall

You can reduce this effort by making better use of existing data. The Humanitarian Data Exchange (which ScraperWiki works on) is an example.

[Photo: responder in the field]

There are products which help convert digital paper into data.

PDF Tables screenshot

But ultimately, current methods rely on people entering data really, really carefully – which we humans are not very good at.

Typewriter

I can only see two ways to radically improve this, and get more data.

Firstly, more digital literacy. Most people learn to drive, or else cars would be useless. Likewise, can everyone learn to tag things consistently, so CRMs work better? Should everyone learn to code, so we can make infrequent business processes create structured data?

Digital literacy

Secondly, improving artificial intelligence. That is, to have a computation model more sophisticated than current coding, which doesn’t need pickily consistent “data” any more.

Hal

So, you’ve now got lots of data. Alas, it’s still not enough!

Stafford Beer says that even when organisations get everything structured, their data lakes overflowing, their tables linked across the web… Even then they are disappointed.

4. Feedback loops

Next, you realise you need a feedback system to effectively use your data.

Cybernetic factory

Think of your own body, where tangled hierarchical layers of proteins, cells and organs have feedback loops within and between each other.

There’s no point in the Government opening data if it doesn’t alter policy decisions. There’s no point rolling out Business Intelligence software if it doesn’t make your enterprise choose better.

Viable systems model

More deeply, what should organisations look like now we have computers?

The question which asks how to use the computer in the enterprise, is, in short, the wrong question.

A better formulation is to ask how the enterprise should be run given that computers exist.

The best version of all is the question asking what, given computers, the enterprise now is.

– Stafford Beer, “Brain of the Firm”, 1972

Stafford Beer with his assistant Sonia Mordojovich

If you’ve got some digital paper you want to turn into data, ScraperWiki has various products to help. For more on Stafford Beer, including the wild story of Cybersyn and Allende’s Chile, watch my lightning talk at Liverpool Ignite. All the pictures above are links to further information.

….and suddenly I could convert my bank statement from PDF to Excel… https://blog.scraperwiki.com/2015/08/and-suddenly-i-could-convert-my-bank-statement-from-pdf-to-excel/ Wed, 05 Aug 2015 13:03:59 +0000 https://blog.scraperwiki.com/?p=758223405 Do you ever:

  • Need an old bank statement, only to find out that the bank has archived it and wants to charge you to get it back?
  • Spot check to make sure there are no fraudulent transactions on your account?
  • Like to summarise all your big ticket items for a period?
  • Need to summarise business expenses?

It’s been difficult for me to do any of these as bank transaction systems are Luddite.

15 years after signing up for my smile internet bank account, I received a groundbreaking message.

“Your paperless statement is now available to view when you login to online banking”.

I logged in excited, expecting an incredible new interface.

Eureka – a PDF statement

No … it meant I can now download a PDF!

Don’t get me wrong – PDF is the “Portable Document Format” – so at least I can keep my own records which is a step forward. But it’s just as clumsy to analyse a PDF as it is to trawl through the bank’s online system (see The Tyranny of the PDF to understand why).

We know a lot about the problems with PDFs at ScraperWiki, and we made PDFTables.com. I’m able to convert my PDF to Excel and get a list of transactions which I can analyse and store in some order. Yes – I have to do some post-processing, but I can automate this with a spreadsheet macro.
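
As an alternative to a spreadsheet macro, the same post-processing can be sketched in pandas once the statement has been converted to Excel. The column names below ("Date", "Description", "Amount") are assumptions about the converted layout; real statements will vary by bank.

    # Alternative sketch to a spreadsheet macro: tidy the converted statement with
    # pandas. Column names are assumed; the converted layout will vary by bank.
    import pandas as pd

    stmt = pd.read_excel("statement.xlsx")
    stmt = stmt.dropna(subset=["Date", "Amount"])            # drop stray header/footer rows
    stmt["Amount"] = pd.to_numeric(stmt["Amount"], errors="coerce")

    big_ticket = stmt[stmt["Amount"].abs() >= 100]           # summarise big-ticket items
    print(big_ticket[["Date", "Description", "Amount"]])
    print("Total:", stmt["Amount"].sum())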

You can see from the example I have included that the alignment of the transactions is spot on, and I could even use our DataBaker product to take out the transaction descriptions and values and put them into another system.

Although we’d love everything to be structured data all the way through, the number of PDFs on the web is still increasing exponentially.  Hooray for PDFTables.com!

Statement #173

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
Book review: Docker Up & Running by Karl Matthias and Sean P. Kane https://blog.scraperwiki.com/2015/07/book-review-docker-up-running-by-karl-matthias-and-sean-p-kane/ Fri, 17 Jul 2015 11:00:56 +0000 https://blog.scraperwiki.com/?p=758223401 This last week I have been reading Docker Up & Running by Karl Matthias and Sean P. Kane, a newly published book on Docker – a container technology which is designed to simplify the process of application testing and deployment.

Docker is a very new product, first announced in March 2013, although it is based on older technologies. It has seen rapid uptake by a number of major web-based companies who have open-sourced their tooling for using Docker. We have been using Docker at ScraperWiki for some time, and our most recent projects use it in production. It addresses a common problem for which we have tried a number of technologies in search of a solution.

For a long time I have thought of Docker as providing some sort of cut down virtual machine, from this book I realise this is the wrong mindset – it is better to think of it as a “process wrapper”. The “Advanced Topics” chapter of this book explains how this is achieved technically. This makes Docker a much lighter weight, faster proposition than a virtual machine.

Docker is delivered as a single binary containing both client and server components. The client gives you the power to build Docker images and query the server which hosts the running Docker images. The client part of this system will run on Windows, Mac and Linux systems. The server will only run on Linux due to the specific Linux features that Docker utilises in doing its stuff. Mac and Windows users can use boot2docker to run a Docker server, boot2docker uses a minimal Linux virtual machine to run the server which removes some of the performance advantages of Docker but allows you to develop anywhere.

The problem Docker and containerisation are attempting to address is that of capturing the dependencies of an application and delivering them in a convenient package. It allows developers to produce an artefact, the Docker Image, which can be handed over to an operations team for deployment without to and froing to get all the dependencies and system requirements fixed.

Docker can also address the problem of a development team onboarding a new member who needs to get the application up and running on their own system in order to develop it. Previously such problems were addressed with a flotilla of technologies with varying strengths and weaknesses, things like Chef, Puppet, Salt, Juju, virtual machines. Working at ScraperWiki I saw each of these technologies causing some sort of pain. Docker may or may not take all this pain away but it certainly looks promising.

The Docker image is compiled from instructions in a Dockerfile which has directives to pull down a base operating system image from a registry, add files, run commands and set configuration. The “image” language is probably where my false impression of Docker as virtualisation comes from. Once we have made the Docker image there are commands to deploy and run it on a server, inspect any logging and do debugging of a running container.

Docker is not a “total” solution, it has nothing to say about triggering builds, or bringing up hardware or managing clusters of servers. At ScraperWiki we’ve been developing our own systems to do this which is clearly the approach that many others are taking.

Docker Up & Running is pretty good at laying out what it is you should do with Docker, rather than what you can do with Docker. For example, the book makes clear that Docker is best suited to hosting applications which have no state. You can copy files into a Docker container to store data, but then you’d need to work out how to preserve those files between instances. Docker containers are expected to be volatile – here today, gone tomorrow, or even here now, gone in a minute. The expectation is that you should preserve state outside of a container using environment variables, Amazon’s S3 service or an externally hosted database etc. – depending on the size of the data. The material in the “Advanced Topics” chapter highlights the possible Docker runtime options (and then advises you not to use them unless you have very specific use cases). There are a couple of whole chapters on Docker in production systems.
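
To make the “no state inside the container” idea concrete, here is a hedged sketch of the pattern: configuration arrives through environment variables and results are persisted to S3 rather than the container’s filesystem. The variable and bucket names are made up for illustration.

    # Sketch of keeping state outside a container: configuration comes from
    # environment variables and output is persisted to S3, not local disk.
    # The variable and bucket names are illustrative only.
    import os
    import boto3

    bucket = os.environ["OUTPUT_BUCKET"]       # injected when the container is run
    db_url = os.environ.get("DATABASE_URL")    # externally hosted database, if any

    result_path = "/tmp/results.csv"
    with open(result_path, "w") as f:          # stand-in for the real work
        f.write("example,output\n")

    boto3.client("s3").upload_file(result_path, bucket, "results/results.csv")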

If my intention was to use Docker “live and in anger” then I probably wouldn’t learn how to do so from this book, since the landscape is changing so fast. I might use it to identify what it is that I should do with Docker, rather than what I can do with Docker. For the application side of ScraperWiki’s business the use of Docker is obvious; for the data science side it is not so clear. For our data science work we make heavy use of Python’s virtualenv system, which captures most of our dependencies without being opinionated about data (state).

The book has information in it up until at least the beginning of 2015. It is well worth reading as an introduction and overview of Docker.

Dr Ian Hopkinson is Senior Data Scientist at ScraperWiki, where we often use Docker to help customers manage their data. You can read more about our professional services here.

Spreadsheets are code: EuSpRIG conference. https://blog.scraperwiki.com/2015/07/eusprig/ Thu, 16 Jul 2015 09:35:28 +0000 https://blog.scraperwiki.com/?p=758223366 EuSpRIG logo

I’m back from presenting a talk on DataBaker at the EuSpRIG conference. It’s amazing to see a completely different world of how people use Excel – I’ve been busy tearing the data out of spreadsheets for the Office for National Statistics and using macros to open PDF files in Excel directly using PDFTables. So whilst I’ve been thinking of spreadsheets as sources of raw data, it’s easy to forget how everyone else uses them. The conference reminded me particularly of one simple fact about spreadsheets that often gets ignored:

Spreadsheets are code.

And spreadsheets are a way of writing code which hasn’t substantially changed since the days of VisiCalc in 1978 (the same year that the book which defined the C programming language came out).

Programming languages have changed enormously in this time, promoting higher-level concepts like object orientation, whilst the core of the spreadsheet has remained the same. Certainly, there’s a surprising number of new features in Excel, but few of these help with the core tasks of programming within the spreadsheet.

Structure and style are important: it’s easy to write code which is a nightmare to read. Paul Mireault spoke of his methodology for reducing the complexity of spreadsheets by adhering to a strict set of rules involving copious use of array formulae and named ranges. It also involves working out your model before you start work in Excel, which culminates in a table of parameters, intermediate equations, and outputs.

And at this point I’m silently screaming: STOP! You’re done! You’ve got code!

Sure, there’s the small task of identifying which of these formulae are global, and which are regional and adding appropriate markup; but at this stage the hard work is done; converting that into your language of choice (including Excel) should be straightforward. Excel makes this process overly complicated, but at least Paul’s approach gives clear instructions on how best to handle this conversion (although his use of named ranges is as contentious as your choice of football team or, for programmers, editor.)

Tom Grossman’s talk on reusable spreadsheet code was a cry for help: is there a way of being able to reuse components in a reliable way? But Excel hampers us at every turn.

We can copy and paste cells, but there is so much magic involved. We’re implicitly writing formulae of the form “the cell three to the left” — but we never explicitly say that: instead we read a reference to G3 in cell J3. And we can’t easily replace these implicit references if we’re copy-pasting formulae snippets; we need to be pasting into exactly the right cell in the spreadsheet.

In most programming languages, we know exactly what we’ll get when we copy-and-paste within our source code: a character-by-character identical copy. But copy-and-paste programming is considered a bad ‘smell’: we should be writing reusable functions. In Excel, though, without stepping into the realm of macros, each individual invocation of what would be a function needs to be a separate set of cells. There are possibilities for making this work with custom macro functions or plugins – but so many people can’t use spreadsheets containing macros or won’t have installed those plugins. It’s a feature missing from the very core of Excel, which makes it so much more difficult and long-winded to work in.

Not having these abstractions leads to errors. Ray Panko spoke of the errors we never see, and how base error rates of a few percent are endemic across all fields of human endeavour. These error rates apply when code is first written, and are per instruction. We can hope to reduce them through testing, peer review and pairing. Excel hinders testing and promotes copy-paste repetition, increasing the number of operations and the potential for errors. Improving code reuse would also help enormously: the easiest code to test is the code that isn’t there.

A big chunk of the problem is that people think about Excel the same wrong way they think about Word. In Word, it’s not a major problem, so long as you don’t need to edit the document: so long as it looks correct, that might be good enough, even if the slightest change breaks the formatting. That’s simply not true of spreadsheets where a number can look right but be entirely wrong.

Maria Csernoch’s presentation of Sprego – Spreadsheet Lego – described an approach for teaching programming through spreadsheets which is designed to get people thinking about solving the problems they face methodically, from the inside out, rather than repeatedly trying a ‘Trial-and-Error Wizard-based’ approach with minimal understanding.

It’s interesting to note the widespread use of array formulae across a number of the talks – if you’re making spreadsheets and you don’t know about them, it might be worth spending a while learning about them!

In short, Excel is broken. And I strongly suspect it can’t be fixed. Yet it’s ubiquitous and business critical. We need to reinvent the wheel and change all four whilst the car is driving down the motorway — and I don’t know how to do that…

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia https://blog.scraperwiki.com/2015/07/book-review-learning-spark-by-holden-karau-andy-konwinski-patrick-wendell-and-matei-zaharia/ Mon, 06 Jul 2015 10:00:46 +0000 https://blog.scraperwiki.com/?p=758223243 Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce – a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style system) or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD; for example, filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output; for example, count() returns the number of elements in an RDD.
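
For example, the basic transformation/action pattern in the Python API looks like the minimal sketch below, which assumes a local text file and follows the Spark 1.x API the book covers.

    # Minimal PySpark sketch (Spark 1.x API, as covered by the book): build an RDD
    # from a text file, apply a transformation, then an action.
    from pyspark import SparkContext

    sc = SparkContext("local", "learning-spark-example")

    lines = sc.textFile("data.txt")                       # RDD from a file (local or Hadoop-style)
    errors = lines.filter(lambda line: "ERROR" in line)   # transformation -> a new RDD
    print(errors.count())                                 # action -> a plain number

    sc.stop()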

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, for example by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated, since Spark might preemptively execute the same piece of processing twice because of problems on a node.

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data, such as log files, which are continually growing over time, using a DStream structure comprised of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.

All of this functionality looks pretty straightforward to access, and example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine, so appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second-class citizen: some functionality, particularly in I/O, is missing Python support. This does beg the question as to whether one should start analysis in Python and make the switch as and when required, or whether to start in Scala or Java, where you may well be forced anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book is pitched at two audiences, data scientists and software engineers, as is Spark itself. This would explain the support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit, looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command-line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might otherwise do using pandas or scikit-learn locally, and if necessary you could scale up onto a cluster with relatively little additional effort, rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing: Spark is currently at version 1.4 as of early June 2015, the book covers version 1.1, and things are happening fast. For example, GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark; I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!
