Ignition!

https://blog.scraperwiki.com/2013/11/ignition/
Wed, 13 Nov 2013 08:46:11 +0000

Team ScraperWiki has been in London today, at the Strata Conference, on a trip organised late in the day after I received word that my Ignite talk had been accepted. For those unfamiliar with the Ignite format, it’s 5 minutes and 20 slides, with strictly 15 seconds per slide: a headlong plunge. The event was hosted by Nicola Hughes (former scraperwikian) and Duncan Ross. My eleven fellow speakers were an impressive bunch: they spoke on communities, creativity, gambling on American football, tagging the news, DevOps, the lessons of being a volunteer firefighter, the new consumerfinance.gov website, evaluating humanitarian aid efforts (www.evidencefordevelopment.org), the software engineering approach to fixing a leak, and running Hadoop on a Raspberry Pi cluster, and Piers sang a song. The talks were all entertaining and several of them had content I want to follow up on; it’s a great format for getting an overview of a bunch of stuff very quickly.

You can find all the Ignite speakers on twitter here, and we’re now all on YouTube.

For my part I spoke about the scraping work we’ve been doing with the UN Office for the Coordination of Humanitarian Affairs (UN-OCHA). I opted for a set of pictorial slides; you can get a flavour from the thumbnails below:

[Image: thumbnails of the Ignite slides]

The short format, and automatically advancing slides, is tricky to master. I’d practiced and occasionally hit my timing in the office but during the day I’d largely forgotten the content of my slides. However, it all seemed to go fine and I think the audience were laughing with me rather than at me.

It’s been a while since I’ve attended a conference, and Strata London wasn’t what I’m used to, given my former life as a physical scientist. An academic conference is far more focused on the content of the data than on the technology or just Data. I don’t really care about the technology; it’s the data that interests me.

The day started with a plenary session. Mark Madsen spoke about the sociology of data, looking at previous scientific discoveries through the lens of the modern hype bubble. Gavin Starks spoke about the Open Data Institute (Britain can make some claim to being a world leader in Open Data), and Julie Steele talked about telling stories, from the point of view of people who tell stories for a living. Scientists (of all types) sort of play at this in telling stories with data.

James Burke gave a barnstorming keynote talk. I remember him from my youth, from programs like Tomorrow’s World, and here he is 30 years later, sharp as a tack, speaking nineteen to the dozen and hitting his timing for a 35-minute talk to the minute. Given my competence at timing a 5-minute talk, this impressed me greatly. His theme was the unintended consequences of new discoveries, culminating in a link between the sinking of the British fleet in 1707 and the invention of the toilet roll. He also made a plea for finding connections between different areas of research, something close to my heart as a chemical physicist and then biological physicist.

You can watch videos of all these plenary speakers here.

After that we headed into parallel sessions. I saw Rajappa Iyer speaking about ETL at LinkedIn, a process which uses Oracle databases, Hadoop and Teradata in one workflow. Noel Welsh spoke about Bandit Algorithms, which are ways of testing and selecting between options efficiently. Andy Cotgreave spoke about visualising social media data using Tableau. Andy and I had met somewhat accidentally earlier in the day when I went to the Tableau stand. I use Tableau quite heavily for getting an overview of the data we scrape. I’ve been following Andy for a while on Twitter and it was good to meet in person. I was rather pleased to see ScraperWiki was the second icon in his bookmark bar; he uses the Search for Tweets tool to get data to visualise in Tableau. As an aside, I met a few of the people I was already following on Twitter, and picked up some new friends. It’s a great mechanism for being ambiently aware of someone, so that they’re not just a stranger in a room. I actually discovered people at the conference as they tweeted on the #strataconf hashtag.

I then headed deep underground for a talk on using d3 for making interactive visualisations in the pharmaceutical industry: an interesting idea, but a significant effort using the low-level d3 library. Next up was a talk on probabilistic data matching. I enjoyed this, not least because the speaker provided free dolly mixtures but also because matching and de-duplication are key activities for us in writing scrapers.

The last talk of the day for me was by Adam Kocoloski, who spoke on data handling for particle physics. It reminded me of departmental seminars I attended long ago, since he spoke much more about the physics than the technology. The number of former physicists at Strata was striking; I guess we have the appropriate mix of mathematical, programming and pragmatic data analysis skills.

The Ignite talks were at the end of the day; the audience were enjoying wine and canapés whilst the presenters sweated and worried!
