Summary – Big Data Value Association June Summit (Madrid) https://blog.scraperwiki.com/2015/07/summary-of-big-data-value-association-june-summit-madrid/ Tue, 21 Jul 2015 10:13:39 +0000

Summit Programme

In late June, 375 Europeans + 1 attended the Big Data Value Association (BDVA) Summit in Madrid. The BDVA is the private part of the Big Data Public Private Partnership (PPP); the public part is the European Commission. The delivery mechanism is Horizon 2020, with €500m of funding. The PPP commenced in 2015 and runs to 2020.

Whilst the conference title included the word ‘BIG’, the content did not discriminate. The programme was designed to focus on concrete outcomes. A key instrument of the PPP is the concept of a ‘lighthouse’ project, and the summit had arranged tracks focused on identifying such projects: large in scale and within candidate areas like manufacturing, personalised medicine and energy.

What proved most valuable was meeting the European corporate representatives who ran the vertical market streams. Telecom Italia, Orange and Nokia shared a platform to discuss their sector. Philips drove a discussion around health and wellbeing. Jesus Ruiz, Director of Open Innovation at Santander Bank Corporate Technology, led the finance industry track. He tried to get people to think about ‘innovation’ in the layer above traditional banking services. I suspect he meant the space where companies like Transferwise (cheaper foreign currency conversion) play. These services improve the speed and reduce the cost of transactions. However, the innovating company never ‘owns’ an individual or corporate bank account, and as a consequence it is not subject to tight financial regulation. It’s probably obvious to most, but I was unaware of the distinction.

I had an opportunity to talk to many people from the influential Fraunhofer Institute! It’s an ‘applied research’ organisation and a significant contributor to Deutschland’s manufacturing success. Last year it had a revenue stream of €2b. It was seriously engaged at the event and is active in finding leading-edge ‘lighthouse projects’. We’re in the transport #TIMON consortium with it – happy days 🙂

BDVA – You can join!

Networking is the big bonus at events like these, and with representatives from 28 countries and delegates from Palestine and Israel there were many people to meet. The UK was poorly represented and ScraperWiki was the only UK technology company showing its wares. It was a shame, given the UK’s torch carrying when it comes to data. Maurizio Pilu, @Maurizio_Pilu, Executive Director, Collaborative R&D at Digital Catapult, gave a keynote. The ODI is mentioned in the PPP Factsheet, which is good.

There was a strong sense that the PPP initiative is looking to the long term, and that some of the harder problems in extracting ‘value’ have not yet been addressed. There was also an acknowledgement of the importance of standards, and a track was run by Phil Archer, Data Activity Lead at the W3C.

Stuart Campbell, Director and CEO at Information Catalyst, and a professional pan-European team managed the proceedings, and it all worked beautifully. We’re in FP7 and Horizon 2020 consortia, so we decided to sponsor and actively support #BDVASummit. I’m glad we did!

The next big event is the European Data Forum in Luxembourg (16–17 Nov 2015). We’re sponsoring it and we’ll talk about our data science work, PDFTables.com and DataBaker. The event will be opened by Jean-Claude Juncker, President of the European Commission, and Günther Oettinger, European Commissioner for Digital Economy and Society.

It seems a shame that the mainstream media in the UK focuses so heavily on subjects like #Grexit and #Brexit. Maybe they could devote some of their column inches to the companies and academics that are making a very significant commitment to finding products and services that make the EU more competitive and also a better place to work and to live.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia https://blog.scraperwiki.com/2015/07/book-review-learning-spark-by-holden-karau-andy-konwinski-patrick-wendell-and-matei-zaharia/ Mon, 06 Jul 2015 10:00:46 +0000

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was done in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce – a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style filesystem) or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD; for example, filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output; for example, count() returns the number of elements in an RDD.
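
To make the transformation/action distinction concrete, here is a minimal PySpark sketch – my own illustration rather than an example from the book – assuming a local Spark installation and a hypothetical input file logs.txt:

```python
# Minimal sketch of the RDD transformation/action pattern (local mode);
# "logs.txt" is a hypothetical input file.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

lines = sc.textFile("logs.txt")                # create an RDD from a file
errors = lines.filter(lambda l: "ERROR" in l)  # transformation: builds a new RDD lazily
print(errors.count())                          # action: triggers execution, returns a number

sc.stop()
```

Nothing is actually computed until count() is called; the transformations just record the lineage of RDDs to be evaluated.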

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, for example by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated, since Spark might pre-emptively execute the same piece of processing twice because of problems on a node.
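
As a hedged illustration of those two shared-variable mechanisms (again my own sketch, with made-up lookup data):

```python
# Sketch of Broadcast Variables and Accumulators in local mode, with made-up data.
from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-variables")

lookup = sc.broadcast({"GB": "United Kingdom", "DE": "Germany"})  # read-only copy shipped to each node
blanks = sc.accumulator(0)                                        # counter aggregated back on the driver

def expand(code):
    if not code:
        blanks.add(1)  # beware: may over-count if Spark re-runs a task after a node problem
    return lookup.value.get(code, "unknown")

codes = sc.parallelize(["GB", "DE", "", "FR"])
print(codes.map(expand).collect())  # ['United Kingdom', 'Germany', 'unknown', 'unknown']
print(blanks.value)                 # 1

sc.stop()
```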

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data, such as log files, which are continually growing over time, using a DStream structure composed of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.
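
By way of illustration, a Spark SQL query might look like the sketch below. Note this is my own hedged example using the newer DataFrame API (the book itself covers the SchemaRDD era), and people.json is a hypothetical file with one JSON object per line:

```python
# Hedged Spark SQL sketch; "people.json" is a hypothetical file with lines like
# {"name": "Ada", "age": 36}.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "sql-example")
sqlContext = SQLContext(sc)

people = sqlContext.read.json("people.json")  # infer a schema and build a DataFrame
people.registerTempTable("people")            # expose the DataFrame to SQL
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()

sc.stop()
```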

All of this functionality looks pretty straightforward to access, and example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine, so it appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second-class citizen: some functionality, particularly in I/O, is missing Python support. This does beg the question as to whether one should start analysis in Python and make the switch as and when required, or start in Scala or Java, to which you may well be forced eventually anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book is pitched at two audiences, data scientists and software engineers, as is Spark. This would explain the support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit, looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to undertake with Spark the sort of analysis you might do locally using pandas or scikit-learn, and if necessary you could scale up onto a cluster with relatively little additional effort, rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing: Spark is at version 1.4 as of early June 2015, the book covers version 1.1, and things are happening fast. For example, GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark, and I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!

Book review: Big Data by Viktor Mayer-Schönberger and Kenneth Cukier https://blog.scraperwiki.com/2014/09/book-review-big-data-by-viktor-mayer-schnberger-and-kenneth-cukier/ Mon, 08 Sep 2014 07:28:36 +0000

We hear a lot about “Big Data” at ScraperWiki. We’ve always been a bit bemused by the tag, since it seems to be used indiscriminately. Just what is big data, and is there something special I should do with it? Is it even a uniform thing?

I’m giving a workshop on data science next week and one of the topics of interest for the attendees is “Big Data”, so I thought I should investigate in a little more depth what people mean by “Big Data”. Hence I have read Big Data by Viktor Mayer-Schönberger and Kenneth Cukier, subtitled “A Revolution That Will Transform How We Live, Work and Think” – chosen for the large number of reviews it has attracted on Amazon. The subtitle is a guide to their style and exuberance.

Their thesis is that we can define big data, in contrast to earlier “little data”, by three things:

  • It’s big but not necessarily that big; their definition of big is that n = all. That is to say, in some domain you take all of the data you can get hold of. They use as one example a study on bout fixing in sumo wrestling, based on data on 64,000 bouts – which would fit comfortably into a spreadsheet. Other data sets discussed are larger, such as credit card transaction data, mobile telephony data, Google’s search query data…;
  • Big data is messy: it is perhaps incomplete or poorly encoded. We may not have all the data we want from every event, it may be encoded using free text rather than strict categories, and so forth;
  • Working with big data we must discard an enthusiasm for causality and replace it with correlation: we shouldn’t mind too much if our results are just correlations rather than explanations (causation);
  • An implicit fourth element is that the analysis you are going to apply to your big data is some form of machine learning.

I have issues with each of these “novel” features:

Scientists have long collected datasets and done calculations that are at the limit of (or beyond) their ability to process the data produced. Think protein X-ray crystallography, astronomical data for navigation, the CERN detectors, and so on. You can think of the decadal censuses run by countries such as the US and UK as n = all, or the data fed to the early LEO computer to calculate the deliveries required for each of their hundreds of teashops. The difference today is that people and companies are able to effortlessly collect a larger quantity of data than ever before; they’re able to collect data without thinking about it first. The idea of n = all is not really a help. The straw man against which it is placed is the selection of a subset of data by sampling.

They say that big data is messy, implying that what went before was not. One of the failings of the book is its disregard for the researchers who have gone before. According to the authors, the new big data analysts are comfortable with messiness and uncertainty, unlike those fuddy-duddy statisticians! But small data is messy too: scientists and statisticians have long dealt with messy and incomplete data.

The third of their features is that we must be comfortable with correlation rather than demand causation. There are many circumstances where correlation is OK – such as when Amazon uses my previous browsing and purchase history to suggest new purchases – but the field of machine learning / data mining has long struggled with messiness and causality.

This is not to say nothing has happened in the last 20 or so years regarding data. The ubiquity of computing devices, cheap storage and processing power, and the introduction of frameworks like Hadoop are all significant innovations of the last 20 years. But they grow on things that went before; they are not a paradigm shift. Labelling something as ‘big data’, so ill-defined, provides no helpful insight as to how to deal with it.

The book could be described as the “What Google Did Next…” playbook. It opens with Google’s work on flu trends, and passes through Google’s translation work and the Google Books project. It includes examples from many other players, but one gets the impression that it is Google they really like. They are patronising of Amazon for not making full use of the data it gleans from its Kindle ebook ecosystem. They pay somewhat cursory attention to issues of data privacy and consent, and have the unusual idea of creating a cadre of ‘algorithmists’ who would vet the probity of algorithms and applications, in the manner of accountants doing audit or data protection officers.

So what is this book good for? It provides a nice range of examples of data analysis and some interesting stories about the uses to which it has been put. It gives a fair overview of the value of data analysis and some of the risks it presents. It highlights that the term “big data” is used so broadly that it conveys little meaning. This confusion over what is meant by “Big Data” is reflected on the datascience@Berkeley blog, which lists definitions of big data from 30 people in the field. Finally, it provides me with sufficient cover to make a supportable claim that I am a “Big Data scientist”!

To my mind, the best definition of big data that I’ve seen is that it is like teenage sex…

  • Everyone talks about it,
  • nobody really knows how to do it,
  • everyone thinks everyone else is doing it,
  • so everyone claims they are doing it too!
Hip Data Terms https://blog.scraperwiki.com/2013/02/hip-data-terms/ Tue, 26 Feb 2013 11:23:40 +0000

“Big Data” and “Data Science” tend to be terms whose meaning is defined the moment they are used. They are sometimes meaningful, but their meaning is dependent on context. Drawing on the agendas of many hip and not-so-hip data talks, we have come up with some definitions that people seem to mean, and will try to describe how big data and data science are used now.

Big data

When some people say big data, they are describing something physically big – in terms of bytes. So a petabyte of data would be big, at least today. Other people think of big data in terms of thresholds of big: if data don’t fit into random-access memory, or cannot be stored on a single hard drive, they’re talking about big data. More generally, we might say that if the data can’t be accessed by Excel (the world’s standard data analysis tool), it is certainly big, and you need to know something more about computing in order to store and access the data.

Judging data-bigness by physical size sometimes works today, but sizes that seem big today are different from what seemed big twenty years ago and from what will seem big in twenty years. Let’s look at two descriptions of big data that get to the causes of data-bigness. One presenter at Strata London 2012 proposed that big data comes about when it becomes less expensive to store data than to decide whether or not to delete it. Filing cabinets and libraries are giving way to Hadoop clusters and low-power hard drives, so it has recently become reasonable to just save anything.

The second thing to look at is where all of this data comes from. Part of this big data thing is that we can now collect much more data automatically. Before computers, if the post office wanted to study where mail is sent, it could sample letters at various points and record their destinations, return addresses, and routes. Today we already have all our emails, Twitter posts and other correspondence in reasonably standard formats, so the process is far more automatic, and we can collect much more data.

Data science

So, what is ‘data science’? It broadly seems to be some combination of ‘statistics’ and ‘software engineering’. They’re in quotes because these categories are ambiguous and because they are difficult to define except in terms of one another. Let’s define ‘data science’ by relating it to ‘statistics’ and ‘software engineering’, and we’ll start with statistics.

‘Data science’ and ‘statistics’

First off, the statistical methods used in ‘data science’ and ‘big data’ seem quite unsophisticated compared to those used in ‘statistics’. Often, it’s just search. For example, the data team at La Nación demonstrated how they’re acquiring loads of documents and allowing journalists to search them. Certainly, they will eventually start doing crude quantitative analyses on the overall document sets, but even the search has already been valuable. Their team pulls in another hip term: ‘data journalism’.

Quantitative analyses that do happen are often quite simple. Consider the Foursquare check-in analyses that a couple of people from Foursquare demoed at DataGotham. The demo mostly comprised scatterplots of check-ins on top of a map, sometimes played over time. They touched on the models they were using to guess where someone wanted to check in, but they emphasised the knowledge gained from looking at check-in histories, and these simple plots were helpful for conveying this.

In other cases, ‘data science’ simply implies ‘machine learning’. Compared to ‘statistics’, though, ‘machine learning’ implies a focus on prediction rather than inference. Statisticians seem to make use of more complex models on simpler datasets, and are more concerned with consuming and applying data than they are with the modelling of the data.

‘Data science’ and ‘software engineering’

The products of ‘software engineering’ tend to be tools, and the products of ‘data science’ tend to be knowledge. We can break that distinction into some components for illustration. (NB: These components exaggerate the differences.)

Realtime v. batch: If something is ‘realtime’, it is the result of ‘software engineering’; ‘data science’ is usually done in batches. (Let’s avoid worrying too much about what ‘realtime’ means. We could take ‘realtime’ to mean push rather than pull, and that could work for a reasonable definition of ‘realtime’.)

Organization: ‘Data scientists’ are embedded within organizations that have questions about data (typically about their own data, though that depends on how we think of ownership). Consider any hip web startup with a large database. ‘Software engineers’, on the other hand, make products to be used by other organizations or by other departments within a large organization. Consider any hip web startup ever. Also consider some teams within large companies; I know someone who worked at Google as a ‘software engineer’ writing code for packaging Chromebooks.

What about ‘analysts’?

If we simplify the world to a two-dimensional space, ‘data scientists’, ‘statisticians’, ‘software engineers’ and ‘engineers’ might land as in the diagram below. (The chart says ‘developer’ where this post says ‘software engineer’.)

diagram of data people

Conflating ‘data science’ and ‘big data’

Some people conflate ‘data science’ and ‘big data’. For some definitions of these two phrases, the conflation makes perfect sense – like when ‘big data’ means that the data are big enough that you need to know something about computers. Some people are more concerned with ‘data science’ than they are with ‘big data’, and vice versa. For example, ‘big data’ is much talked-about at Strata, but ‘data science’ isn’t discussed as much; perhaps ‘big data’ is buzzier and more popular among the marketing departments? Other people talk more about ‘data science’, in part to emphasise that they can do useful things with small datasets too. It might simply be that we want some word to describe what we do: ‘statistician’ and ‘software developer’ aren’t close enough, but ‘data scientist’ is decent.

Utility of these definitions

Consider taking this post with a grain of salt. Some definitions may be more clear to one group of people than to another, and they may be over-simplified here. On the other hand, these definitions are intended to be descriptive rather than prescriptive, so they might be more useful than some other definitions that you’ve heard. No matter how you define a hip or un-hip term, it is impossible to avoid all ambiguities.

International Data Journalism Awards….deadline fast approaching..(10th April 2012) https://blog.scraperwiki.com/2012/03/international-data-journalism-awards-deadline-fast-approaching-10th-april-2012/ Mon, 26 Mar 2012 17:00:29 +0000

Everybody is talking about, and trying to do, ‘data journalism’, and the first ever International Data Journalism Awards have been established to recognise the huge effort that people are making in this field. It’s a great opportunity to showcase your work. Backed by Google, the prizes are generous – €45,000 (over $55,000) to six winners – and the process is being managed by the Global Editors Network.

The main objectives are to a) contribute to setting high standards and highlighting the best practices in data journalism, and b) demonstrate the value of data journalism among editors and media executives.

There are three categories:

  1. Data-driven investigative journalism
  2. Data visualisation & storytelling
  3. Data-driven applications

The competition is open to media companies, non-profit organisations, freelancers and individuals. Applicants are welcome to submit their best data journalism projects before 10 April 2012 at http://datajournalismawards.org/submit-your-work/.

To find out more about the competition and how to apply, check out datajournalismawards.org. If you have any questions about the competition, get in touch with the lovely Liliana Bounegru, DJA Coordinator (bounegru [at] ejc [dot] net). Liliana works at the European Journalism Centre.

Happy New Year and Happy New York! https://blog.scraperwiki.com/2012/01/happy-new-year-and-happy-new-york/ Tue, 03 Jan 2012 20:32:42 +0000

We are really pleased to announce that we will be hosting our very first US two-day Journalism Data Camp event, in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation, on February 3rd and 4th 2012.

We have been working with Emily Bell @emilybell, Director of the Tow Center, and Susan McGregor @SusanEMcG, Assistant Professor at the Columbia J School, to plan the event. The main objective is to liberate and use New York data for the purposes of holding business and power accountable.

After a short introduction on the first day, we will split the event into three parallel streams: journalism data projects, liberating New York data, and ‘learn to scrape’. We plan to inject some fun by running a derby for the project stream and also by awarding prizes in all of the streams. We hope to make the event engaging and enjoyable.

We need journalists, media professionals, students of journalism, political science or information technology, coders, statisticians and public data boffins to dig up the data!

Please pick a stream and sign up to help us make New York a data-driven city!

Our thanks to Columbia University, Civic Commons, The New York Times, and CUNY for allowing us to use their premises as we sojourned in the Big Apple.

Zarino has created a map of our US events, which we will update as we add locations: https://scraperwiki.com/events/

‘Big Data’ in the Big Apple https://blog.scraperwiki.com/2011/09/big-data-in-the-big-apple/ Thu, 29 Sep 2011 15:05:25 +0000

My colleague @frabcus captured the main theme of Strata New York #strataconf in his most recent blog post. This was our first official speaking engagement in the USA as a Knight News Challenge 2011 winner. Here is my twopence worth!

At first we were a little confused by the way in which the week-long conference was split into three consecutive mini-conferences with what looked like repetitive content. The reality was that the one-day Strata Jump Start was like an MBA for people trying to understand the meaning of ‘Big Data’. It gave a 50,000-foot view of what is going on and made us think about the legal stuff, how it will impact the demand for skills, and how the pace with which data is exploding will dramatically change the way in which businesses operate – every CEO should attend or watch the videos and learn!

Big Apple…and small products

The following two days, called the Strata Summit, were focused on what people need to think about strategically to get business ready for the onslaught. In his welcome address Edd Dumbill, program chair for O’Reilly, said: “Computers should serve humans… we have been turned into filing clerks by computers… we spend our day sifting, sorting and filing information… something has gone upside down. Fortunately the systems that we have created are also part of the solution… big data can help us… it may be the case that big data has to help us!”

To use the local lingo, we took a ‘deep dive’ into various aspects of the challenges. The sessions were well choreographed and curated. We particularly liked the session ‘Transparency and Strategic Leaking’ by Dr Michael Nelson (Leading Edge Forum – CSC), where he talked about how companies need to be pragmatic in an age when it is impossible to stop data leaking out of the door. Companies, he said, ‘are going to have to be transparent’ and ‘are going to have to have a transparency policy’. He referred to a recent article in the Economist, ‘The Leaking Corporation’, and its assertion that corporations that leak their own data ‘control the story’.

Simon Wardley’s (Leading Edge Forum – CSC) ‘Situation Normal Everything Must Change’ segment made us laugh, especially the philosophical little quips that came from his encounter with a London taxi driver – he conducted it at lightning speed, and his explanation of ‘ecosystems’ and how big data offers a potential solution to the ‘Innovation Paradox’ was insightful. It was a heavy-duty session but worth it!

There were tons of excellent sessions to peruse. We really enjoyed Cathy O’Neil’s ‘What kinds of people are needed for data management’, which talked about data scientists and how they can help corporations to discern ‘noise’ from signal.

Our very own Francis Irving was interviewed about how ScraperWiki relates to Big Data and Investigative Journalism.

Unfortunately we did not manage to see many of the technology exhibitors #fail. However, we did see some very sexy ideas, including a wonderful software start-up called Kaggle.com – a platform for data prediction competitions – whose Chief Data Scientist, Jeremy Howard, gave us some great ideas on how to manage ‘labour markets’.

..Oh yes and we checked out why it is called Strata….

We had to leave early to attend the Online News Association #ONA event in Boston, so we missed part III, the two-day Strata Conference itself – it is designed for people at the cutting edge of data: the data scientists and data activists! I just hope that we manage to get to Strata 2012 in Santa Clara next February.

In his closing address, ‘Towards a global brain’, Tim O’Reilly gave a list of 10 scary things that are leading into the perfect humanitarian storm, including climate change, financial meltdown, disease control, government inertia… so we came away thinking of a T-shirt theme… Hmm, we’re f**ked so let’s scrape!!!

Four data trends to rule them all, the data scientist king to bind them https://blog.scraperwiki.com/2011/09/four-data-trends-to-rule-them-all-the-data-scientist-king-to-bind-them/ Mon, 26 Sep 2011 16:07:16 +0000

My favourite soundbite from O’Reilly’s Strata data conference was a definition of big data. John Rauser, Amazon’s main data scientist, said to me that “data is big data when you can’t process it on one machine”. And naturally, small data is data that you can process on one machine.

What’s nice about this definition is it makes it immediately clear that as time passes, big data is getting smaller.  And of course, small data is getting bigger. This is linked to four interrelating data technology and business trends.

1. Super Moore’s law for data. Even without any specific new technology, what we can process on “one computer” would be getting larger anyway with Moore’s law applied to processors, RAM and disks.

But that’s not all that’s happening: we’re also in the middle of the commoditisation of what was once part of Google’s competitive advantage – distribution of work over clusters of bog-standard servers, using things like Hadoop.

Right now you still need to hire special engineers to do that, but it is only a matter of time before it is just a service you buy with your credit card – process any amount of data at any speed, with just a slider that any wealthy businessman’s data scientist can drag.

The feeling is: rollercoaster! We’re going faster than Moore’s law with data right now.

2. Business of big data. The result of the first trend is that every company can now store and process as much data as once only the tech giants did. This is very significant strategically and tactically, in very specific ways for each industry. Given the right use of the data (see the next two trends), it changes everything.

We’ve seen this in book selling for years now, because Amazon was ahead of the curve. But imagine both basic algorithms, and fancy ones like those cunningly used to reproduce images from our visual cortex this week, applied in as yet untouched areas.

3. Collaboration of data. The above two trends do not need the Internet. Even if we were still all locked in isolated corporate data centres, through a freak historic accident preventing the invention of global packet switching, we would still be getting the transistor cheapness of Moore’s law, and we’d still be running clusters (“clouds”) of map/reduce servers with just local networking.

The Internet isn’t about raw CPU power. It’s about collaboration. Collaboration is changing how we work with documents, how we share news, how we keep in touch with our friends, how we build software. Why wouldn’t it change how we work with data? Of course, it already is.

There are quite different ways it can happen more. It can create marketplaces for the commercial exchange of data, more transparent than the existing, siloed data resellers. It can create tools for socialising the analysis, visualisation, quality checking and gathering of data. It can allow governments, and corporations, to be radically transparent in opening data in all cases except their unique competitive advantage.

It’s remarkable how old the Internet is, and how badly we collaborate on working with data.

4. The data scientist is king. A term cooked up (so I’m told) a few years ago by the chief data scientists at LinkedIn and Facebook while in a bar deciding what to name their new teams, this is the latest, newly trending iteration of job titles for what was once called a statistician or data analyst.

But it isn’t the same. They’re not just geeks buried away. Yes, a data scientist is a data geek. They love data, and interesting data, more than anything. They know how to program, but in scripting languages and SQL, not hard core software engineering. They understand statistics as if they were brought up by a prior probability.

But they also care about the business, and they know how to communicate. They can give presentations to senior management, making hard stats clear. Volunteer, non-profit data scientists have an unassailable passion for their mission.

Data scientists are the glue, linking data to decisions.  Cathy O’Neil gave a fantastic talk at Strata describing these beasts, useful if you are one but didn’t realise it had a buzzy new marketing term, or if you are getting into the business of data and need to hire some.

You can’t make full and accurate use of any of the other trends above, without data scientists.
