Open Data – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

Which car should I (not) buy? Find out, with the ScraperWiki MOT website…
https://blog.scraperwiki.com/2015/09/which-car-should-i-not-buy-find-out-with-the-scraperwiki-mot-website/
Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich. This means that it is designed to give you facts and figures and to provide an interface for you to interactively select and view the data that interests you, in this case the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to view either the top faults (failure modes) or the pass rate for the selected make and model. There is also a “filter by year” option, which selects vehicles by their first year on the road so that you can narrow down your search to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults. If you want to break each one down into more detailed descriptions, click the ‘detail’ button:

What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead.

mot.scraperwiki.com, by contrast, uses a dictionary stored in memory as a data source. A dictionary is a data structure in Python similar to a HashMap in Java. It consists of key-value pairs and offers fast lookups. But where do we get a dictionary from and what should be its structure? Let’s back up a little bit, maybe to the very beginning. The following general procedure was followed to get us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix data manipulation functions (mostly command-line) were used to extract the columns of interest.
  • Data was then loaded into a PostgreSQL database where data integration and analysis were carried out. This took the form of joining tables, then grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built using this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised, this text file is converted to a nested dictionary. That is to say, it is a dictionary of dictionaries: the value associated with a key is another dictionary, which can in turn be accessed using a different key.
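To make the idea of a nested dictionary concrete, here is a minimal sketch of how such a file could be turned into one. It is not the code behind the website: the comma-separated layout, the header row, the nesting order (make, then model, then year, then testresult), the result codes "P" and "F" and the function name build_nested_dict are all assumptions made for illustration, and a small in-memory sample stands in for the file stored in S3.

```python
import csv
import io

# Hypothetical sample of the aggregated export. The real file's delimiter,
# column order and test result codes are assumptions for illustration only.
SAMPLE = """make,model,year,testresult,count
FORD,FIESTA,2008,P,120
FORD,FIESTA,2008,F,80
FORD,FIESTA,2009,P,150
FORD,FOCUS,2008,P,90
"""


def build_nested_dict(lines):
    """Return {make: {model: {year: {testresult: count}}}}."""
    nested = {}
    for row in csv.DictReader(lines):
        # setdefault creates each inner dictionary the first time its key is seen.
        by_model = nested.setdefault(row["make"], {})
        by_year = by_model.setdefault(row["model"], {})
        by_result = by_year.setdefault(row["year"], {})
        by_result[row["testresult"]] = int(row["count"])
    return nested


mot = build_nested_dict(io.StringIO(SAMPLE))
print(mot["FORD"]["FIESTA"]["2008"])
```

Each level of the structure is an ordinary dictionary, so reaching a count is a chain of key lookups rather than a database query.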

When you select a car make, this dictionary is queried to retrieve the models for you, and in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the computations of the top faults or pass rate are made on the dictionary. When you don’t select a specific year the data in the dictionary is aggregated across all years.
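The lookups described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the website's actual code: the dictionary contents, the "P" code for a pass and the helper names models_for, years_for and pass_rate are all hypothetical, and only the pass rate calculation is shown, not the top faults.

```python
# Hypothetical nested dictionary: make -> model -> year -> testresult -> count.
MOT = {
    "FORD": {
        "FIESTA": {
            "2008": {"P": 120, "F": 80},
            "2009": {"P": 150, "F": 50},
        },
        "FOCUS": {"2008": {"P": 90, "F": 10}},
    }
}


def models_for(data, make):
    """Models available for a make (used to fill the model selector)."""
    return sorted(data[make])


def years_for(data, make, model):
    """Years available for a make and model (used for 'filter by year')."""
    return sorted(data[make][model])


def pass_rate(data, make, model, year=None):
    """Pass rate for a make and model; aggregates across all years if year is None."""
    years = data[make][model]
    selected = [years[year]] if year is not None else list(years.values())
    passed = sum(results.get("P", 0) for results in selected)
    total = sum(sum(results.values()) for results in selected)
    return passed / total if total else None


print(models_for(MOT, "FORD"))
print(years_for(MOT, "FORD", "FIESTA"))
print(pass_rate(MOT, "FORD", "FIESTA", "2008"))
print(pass_rate(MOT, "FORD", "FIESTA"))
```

Leaving out the year simply sums the counts over every year before dividing, which mirrors the aggregation across all years described above.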

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me, or ScraperWiki about this work and …enjoy!

Summary – Big Data Value Association June Summit (Madrid)
https://blog.scraperwiki.com/2015/07/summary-of-big-data-value-association-june-summit-madrid/
Tue, 21 Jul 2015 10:13:39 +0000

Summit Programme

In late June, 375 Europeans + 1 attended the Big Data Value Association (BDVA) Summit in Madrid. The BDVA is the private part of the Big Data Public Private Partnership (PPP). The public part is the European Commission. The delivery mechanism is Horizon 2020, with €500m of funding. The PPP commenced in 2015 and runs to 2020.

Whilst the conference title included the word ‘BIG’, the content did not discriminate. The programme was designed to focus on concrete outcomes. A key instrument of the PPP is the concept of a ‘lighthouse’ project. The summit had arranged tracks that focused on identifying such projects: large-scale and within candidate areas like manufacturing, personalised medicine and energy.

What proved most valuable was meeting the European corporate representatives who ran the vertical market streams. Telecom Italia, Orange and Nokia shared a platform to discuss their sector. Philips drove a discussion around health and wellbeing. Jesus Ruiz, Director of Open Innovation in Santander Bank Corporate Technology, led the finance industry track. He tried to get people to think about ‘innovation’ in the layer above traditional banking services. I suspect he meant the space where companies like Transferwise (cheaper foreign currency conversion) play. These services improve the speed and reduce the cost of transactions. However, the innovating company never ‘owns’ an individual or corporate bank account. As a consequence, they’re not subject to tight financial regulation. It’s probably obvious to most, but I was unaware of the distinction.

I had an opportunity to talk to many people from the influential Fraunhofer Institute. It’s an ‘applied research’ organisation and a significant contributor to Germany’s manufacturing success. Last year it had a revenue stream of €2bn. It was seriously engaged at the event and is active in finding leading-edge ‘lighthouse projects’. We’re in the transport #TIMON consortium with it – Happy Days 🙂

BDVA – You can join!

Networking is the big bonus at events like these and, with representatives from 28 countries and delegates from Palestine and Israel, there were many people to meet. The UK was poorly represented and ScraperWiki was the only UK technology company showing its wares. It was a shame given the UK’s torch-carrying when it comes to data. Maurizio Pilu (@Maurizio_Pilu), Executive Director, Collaborative R&D at Digital Catapult, gave a keynote. The ODI is mentioned in the PPP Factsheet, which is good.

There was a strong sense that the PPP initiative is looking to the long term, and that some of the harder problems of extracting ‘value’ have not yet been addressed. There was also an acknowledgement of the importance of standards, and a track was run by Phil Archer, Data Activity Lead at the W3C.

Stuart Campbell, Director, CEO at Information Catalyst, and a professional pan-European team managed the proceedings and it all worked beautifully. We’re in FP7 and Horizon 2020 consortia so we decided to sponsor and actively support #BDVASummit. I’m glad we did!

The next big event is the European Data Forum in Luxembourg (16-17 Nov 2015). We’re sponsoring it and we’ll talk about our data science work, PDFTables.com and DataBaker. The event will be opened by Jean-Claude Juncker, President of the European Commission, and Günther Oettinger, European Commissioner for Digital Economy and Society.

It seems a shame that the mainstream media in the UK focuses so heavily on subjects like #Grexit and #Brexit. Maybe they could devote some of their column inches to the companies and academics that are making a very significant commitment to finding products and services that make the EU more competitive and also a better place to work and to live.
