views – ScraperWiki
https://blog.scraperwiki.com – Extract tables from PDFs and scrape the web

5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories)
https://blog.scraperwiki.com/2012/05/5-yr-old-goes-potty/
Fri, 25 May 2012

It’s 9:54am in Torquay on a Wednesday morning:

One appliance from Torquay’s fire station was mobilised to reports of a child with a potty seat stuck on its head.

On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head.

Crews used vaseline and the finger kit to remove the seat from the child’s head to leave her uninjured.

A couple of different interests directed me to scrape the latest incidents of the Devon and Somerset Fire and Rescue Service. The scraper that has collected the data is here.

Why does this matter?

Everybody loves their public safety workers — Police, Fire, and Ambulance. They save lives, give comfort, and are there when things get out of hand.

Where is the standardized performance data for these incident response workers? Real-time, rich data would revolutionize their governance and administration: it would give real evidence of whether there are too many or too few police, fire or ambulance personnel, vehicles or stations in any locale, and it would enable imaginative and realistic policies that deliver major efficiency and resilience improvements throughout the system.

For those of you who want to skip all the background discussion, just head directly over to the visualization.

A rose diagram showing incidents handled by the Devon and Somerset Fire Service

The easiest way to monitor the needs of these organizations is to see how much work each employee is doing, and to add or remove staff depending on their workloads. The problem is that an emergency service exists on standby for unforeseen events, so there needs to be a level of idle capacity in the system. There will also be a degree of unproductive make-work in any organization — indeed, a lot of form filling currently happens around the place, despite there being no accessible data at the end of it.

The second easiest method of oversight is to compare one area with another. I have an example from California City Finance, where the Excel spreadsheet of Fire Spending by City even has a breakdown of spending per capita and as a percentage of the total city budget. The city to look at is Vallejo, which entered bankruptcy in 2008. Many of its citizens blamed this on the exorbitant salaries and benefits of its firefighters and police officers. I can’t quite see it in this data, and the journalism on the story doesn’t provide an unequivocal picture.

The best method for determining the efficient and robust provision of such services is to have an accurate and comprehensive computer model on which to run simulations of the business and experiment with different strategies. This is what Tesco or Walmart or any large corporation would do in order to drive up its efficiency and monitor and deal with threats to its business. There is bound to be a dashboard in Tesco HQ monitoring the distribution of full fat milk across the country. They would know to three decimal places what percentage of the product was being poured down the drain because it got past its sell-by date, and, conversely, how often too little had been delivered and stocks ran out. They would use the data to work out what circumstances caused changes in demand, such as school holidays.

I have surveyed many of the documents within the Devon & Somerset Fire & Rescue Authority website, and have come up with no evidence of such data or its analysis anywhere within the organization. This is quite a surprise, and perhaps I haven’t looked hard enough, because the documents are extremely boring and strikingly irrelevant.

Under the hood – how it all works

The scraper itself has gone through several iterations. It currently operates through three functions: MainIndex(), MainDetails() and MainParse(). Data for each incident is put into several tables joined by the IncidentID value derived from the incident’s static URL, e.g.:

http://www.dsfire.gov.uk/News/Newsdesk/IncidentDetail.cfm?IncidentID=7901&siteCategoryId=3&T1ID=26&T2ID=41
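The IncidentID is just the number in that query string; here is a minimal sketch of pulling it out (an illustration using the standard re module, not the scraper’s actual code):

import re

url = "http://www.dsfire.gov.uk/News/Newsdesk/IncidentDetail.cfm?IncidentID=7901&siteCategoryId=3&T1ID=26&T2ID=41"

# The IncidentID becomes the join key shared by swdata, swdetails and swparse
m = re.search(r"IncidentID=(\d+)", url)
incident_id = m.group(1) if m else None
print(incident_id)  # "7901"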

MainIndex() works through their incident search form, grabbing 10 days at a time and saving the URL of each individual incident page into the table swdata.

MainDetails() downloads each of those incident pages, parses the obvious metadata, and saves the remaining HTML content of the description into the database. (It used to attempt to parse the text as well, but I moved that into the third function so I could develop it more easily.) A good way to find the URLs that have not yet been downloaded and saved into the table swdetails is the following SQL statement:

select swdata.IncidentID, swdata.urlpage 
from swdata 
left join swdetails on swdetails.IncidentID=swdata.IncidentID 
where swdetails.IncidentID is null 
limit 5

We then download the HTML from each of those five urlpages, save it into the table under the column divdetails, and repeat until no more unmatched records are retrieved.
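In outline, that loop looks something like the sketch below. It assumes the classic ScraperWiki library’s scrape() and sqlite helpers; the real MainDetails() also parses the metadata before saving.

import scraperwiki

while True:
    # incidents indexed in swdata but not yet downloaded into swdetails
    rows = scraperwiki.sqlite.select("""swdata.IncidentID, swdata.urlpage
        from swdata
        left join swdetails on swdetails.IncidentID = swdata.IncidentID
        where swdetails.IncidentID is null
        limit 5""")
    if not rows:
        break
    for row in rows:
        html = scraperwiki.scrape(row["urlpage"])
        scraperwiki.sqlite.save(unique_keys=["IncidentID"],
                                data={"IncidentID": row["IncidentID"], "divdetails": html},
                                table_name="swdetails")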

MainParse() performs the same progressive operation on the HTML contents of divdetails, saving the results into the table swparse. Because I was developing this function experimentally, to see how much information I could obtain from the free-form text, I frequently had to drop and recreate enough of the table for the join command to work:

scraperwiki.sqlite.execute("drop table if exists swparse")
scraperwiki.sqlite.execute("create table if not exists swparse (IncidentID text)")

After marking the text down (by replacing the <p> tags with linefeeds), we have text that reads like this (emphasis added):

One appliance from Holsworthy was mobilised to reports of a motorbike on fire. Crew Commander Squirrell was in charge.

On arrival one motorbike was discovered well alight. One hose reel was used to extinguish the fire. The police were also in attendance at this incident.
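The marking-down step itself need be nothing fancier than a couple of substitutions; here is a sketch (a hypothetical helper, not the scraper’s actual code):

import re

def markdown_details(divdetails):
    # turn each HTML paragraph into a plain-text line
    text = re.sub(r"(?i)</p\s*>", "\n", divdetails)
    text = re.sub(r"(?i)<p[^>]*>", "", text)
    # strip any other stray tags
    text = re.sub(r"<[^>]+>", "", text)
    return text.strip()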

We can get who is in charge and what their rank is using this regular expression:

re.findall("(crew|watch|station|group|incident|area)s+(commander|manager)s*([w-]+)(?i)", details)

You can see the whole table here, including silly names, misspellings, and clear flaws in my regular expression, such as not being able to handle the case of a first name and a last name being included. (The personnel misspellings suggest that either these incident reports are not integrated with the actual incident logs, where you would expect people to be identified by their code numbers, or their record keeping is terrible.)

For detecting how many vehicles were in attendance, I used this algorithm:

appliances = re.findall("(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)(?: from ([A-Za-z]+))?(?i)", details)
nvehicles = 0
# scount is the word before "appliance(s)" etc.; town is the station it came from
for scount, fire, engine, town in appliances:
    if town and "town" not in data:
        data["town"] = town.lower()
    if re.match("one|1|an?|another(?i)", scount):  count = 1
    elif re.match("two|2(?i)", scount):            count = 2
    elif re.match("three(?i)", scount):            count = 3
    elif re.match("four(?i)", scount):             count = 4
    else:                                          count = 0
    nvehicles += count
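Whatever gets extracted is then written back keyed on the same IncidentID, along the lines of this sketch (the column names are my guesses at the swparse schema):

import scraperwiki

def save_parse(incidentid, data, nvehicles):
    # store the fields parsed from the free-form text, keyed on IncidentID
    # so they join back to swdata and swdetails
    data["IncidentID"] = incidentid
    data["nvehicles"] = nvehicles
    scraperwiki.sqlite.save(unique_keys=["IncidentID"], data=data, table_name="swparse")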

And now onto the visualization

It’s not good enough to have the data. You need to do something with it. See it and explore it.

For some reason I decided that I wanted to graph the hour of the day at which each incident took place, and produced this time rose: a polar bar graph with one sector per hour showing the number of incidents occurring in that hour.

You can filter by the day of the week, the number of vehicles involved, the category, year, and fire station town. Then click on one of the sectors to see all the incidents for that hour, and click on an incident to read its description.
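The counts behind each sector are just a group-by on the hour of day; here is a sketch of the aggregation (the date column name is an assumption about the swparse schema):

import scraperwiki

# number of incidents in each hour of the day, for the rose diagram
rows = scraperwiki.sqlite.select(
    "strftime('%H', date) as hour, count(*) as n from swparse group by hour")
for row in rows:
    print(row["hour"], row["n"])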

Now, if we matched our stations against the list of all stations, and geolocated the incident locations using the Google Maps API (subject to not hitting OVER_QUERY_LIMIT), then we would be able to plot a map of how far the appliances were driving to respond to each incident. Even better, I could post the start and end locations to the Google Directions API, and get journey times and an idea of which roads and junctions are the most critical.
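Geolocating a station or incident location is one HTTP call per place; here is a rough sketch against the Google Geocoding web service (quota handling as it was at the time; these days an API key is required):

import json
import urllib.parse
import urllib.request

def geocode(place):
    # status comes back as "OK", "OVER_QUERY_LIMIT", "ZERO_RESULTS", etc.
    url = ("https://maps.googleapis.com/maps/api/geocode/json?" +
           urllib.parse.urlencode({"address": place + ", Devon, UK"}))
    result = json.load(urllib.request.urlopen(url))
    if result["status"] != "OK":
        return None  # back off and retry later if OVER_QUERY_LIMIT
    location = result["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]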

There’s more. What if we could identify when the response did not come from the closest station, because it was over capacity? What if we could test whether closing down or expanding one of the other stations would improve performance against the database of times, places and severities of each incident? What if each journey time was logged to find where the road traffic bottlenecks are? How about cross-referencing the fire service logs for each incident with the equivalent logs held by the police and ambulance services, to identify the Total Response Cover for the whole incident – information that’s otherwise balkanized and duplicated among the three historically independent services?

Sometimes it’s also enlightening to see what doesn’t appear in your datasets. In this case, one incident I was specifically looking for strangely doesn’t appear in these Devon and Somerset Fire logs: On 17 March 2011 the Police, Fire and Ambulance were all mobilized in massive numbers towards Goatchurch Cavern – but the Mendip Cave Rescue service only heard about it via the Avon and Somerset Cliff Rescue. Surprise surprise, the event’s missing from my Fire logs database. No one knows anything of what is going on. And while we’re at it, why are they separate organizations anyway?

Next up, someone else can do the Cornwall Fire and Rescue Service and see if they can get their incident search form to work.

Scraping the protests with Goldsmiths
https://blog.scraperwiki.com/2011/12/scraping-the-protests/
Fri, 09 Dec 2011

Google Map of Occupy protests around the world

Zarino here, writing from carriage A of the 10:07 London-to-Liverpool (the wonders of the Internet!). While our new First Engineer, drj, has been getting to grips with lots of the under-the-hood changes which’ll make ScraperWiki a lot faster and more stable in the very near future, I’ve been deploying ScraperWiki out on the frontline, with some brilliant Masters students at CAST, Goldsmiths.

I say brilliant because these guys and girls, with pretty much no scraping experience but bags of enthusiasm, managed (in just three hours) to pull together a seriously impressive map of Occupy protests around the world. Using data from no fewer than three separate Wikipedia articles, they parsed, cleaned, collated and geolocated almost 600 protests worldwide, and then visualised them over time using a ScraperWiki view. Click here to take a look.

Okay, I helped a bit. But still, it really brought home how perfect ScraperWiki is for diving into a sea of data, quickly pulling out what you need, and then using it to formulate bigger hypotheses, flag up possible stories, or gather constantly-fresh intelligence about an untapped field. There was this great moment when the penny suddenly dropped and these journalists, activists and sociologists realised what they’d been missing all this time.

But the penny also dropped for me, when I saw how suited ScraperWiki is to a role in the classroom. The path to becoming a data science ninja is a long and steep one, and despite the amazing possibilities fresh, clean and accountable data holds for everybody from anthropologists to zoologists, getting that first foot on the ladder is a tricky task. ScraperWiki was never really built as a learning environment, but with so little else out there to guide learners, it fulfils the task surprisingly well. Students can watch their tutor editing and running a scraper in realtime, right alongside their own work, right inside their web browser. They can take their own copy, hack it, and then merge the data back into a classroom pool. They can use it for assignments, and when the built-in documentation doesn’t answer their questions, there’s a whole community of other developers on there, and a whole library of living, working examples of everything from Cabinet Office tweeting to global shark activity. They can try out a new language (maybe even their first language) without worrying about local installations, plugins or permissions. And then they can share what they make with their classmates, tutors, and the rest of the world.

Guys like these, with tools like ScraperWiki behind them, are going to take the world by storm. I can’t wait to see what they cook up.

Protect your scrapers!
https://blog.scraperwiki.com/2011/06/protect-your-scrapers/
Fri, 17 Jun 2011

You know how it is.

You wrote your scraper on a whim. Because it’s a wiki, some other people found it, and helped fix bugs in it and extend it.

Time passes.

And now your whole business depends on it.

For when that happens, we’ve just pushed an update that lets you protect scrapers.

This stops new people editing them. It’s for when your scraper is being used in a vital application, or if you’re storing data that you can never get back again just by rescraping it.

To protect one of your scrapers, or indeed views, scroll down on the overview page to the Contributors section.

It says “This scraper is public” – the default, ultra liberal policy. So radical that Zarino, ScraperWiki’s designer, suggested that the icon next to that should be a picture of Richard Stallman. We went for a dinky group of people instead.

If you’re the owner, pick edit to change the security of the scraper. Then choose “Protected”.

You can then use the “Add a new editor” button to let someone new work on the scraper, or “demote” to stop someone working on it.

Protecting scrapers is completely free. Later in the summer we’ll be introducing premium accounts so you can have private scrapers.

All recipes 30 minutes to cook
https://blog.scraperwiki.com/2011/05/all-recipes-30-minutes-to-cook/
Wed, 18 May 2011

The other week we quietly added two tutorials of a new kind to the site, snuck in behind a radical site redesign.

They’re instructive recipes, which anyone with a modicum of programming knowledge should be able to easily follow.

1. Introductory tutorial

For programmers new to ScraperWiki, to get an idea of what it does.

It runs through the whole cycle of scraping a page, parsing it, and then outputting the data in a new form, using the simplest possible example.

Available in Ruby, Python and PHP.
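In Python, that cycle is only a few lines. Here is a minimal sketch in the spirit of the tutorial (example.com stands in for whichever page the tutorial actually uses):

import scraperwiki
import lxml.html

# scrape a page, parse it, store the data: the whole cycle in miniature
html = scraperwiki.scrape("http://example.com/")
root = lxml.html.fromstring(html)
for i, heading in enumerate(root.xpath("//h1")):
    scraperwiki.sqlite.save(unique_keys=["id"],
                            data={"id": i, "text": heading.text_content()})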

2. Views tutorial

Find out how to output data from ScraperWiki in exactly the format you want – i.e. write your own API functions on our servers.

This could be a KML file, an iCal file or a small web application. This tutorial covers the basics of what a ScraperWiki View is.

Available in Ruby, Python and PHP.
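A view is just a script whose output is served over HTTP; here is a sketch of a tiny CSV "API" over a scraper’s data (the httpresponseheader helper is the classic ScraperWiki way of setting the content type, and the column names are invented):

import scraperwiki

scraperwiki.utils.httpresponseheader("Content-Type", "text/csv")
print("name,date,url")
for row in scraperwiki.sqlite.select("name, date, url from swdata order by date"):
    print("%s,%s,%s" % (row["name"], row["date"], row["url"]))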

Hopefully these tutorials won’t take as long as Jamie Oliver’s recipes to make. Get in touch with feedback and suggestions!

Views part 2 – Lincoln Council committees
https://blog.scraperwiki.com/2011/01/views-part-2-lincoln-council-committees/
Tue, 04 Jan 2011

(This is the second of two posts announcing ScraperWiki “views”, a new feature that Julian, Richard and Tom worked away at and secretly launched a couple of months ago. Once you’ve scraped your data, how can you get it out again in just the form you want? See also: Views part 1 – Canadian weather stations.)

Lincoln Council committee updates

Sometimes you don’t want to output a visualisation, but instead some data in a specific form for use by another piece of software. You can think of this as using the ScraperWiki code editor to write the exact API you want on the server where the data is. This saves the person providing the data having to second guess every way someone might want to access it.

Andrew Beekan, who works at Lincoln City Council, has used this to make an RSS feed for their committee meetings. Their CMS software doesn’t have this facility built in, so he has to use a scraper to do it.

First he wrote a scraper in ScraperWiki for a “What’s new” search results page from Lincoln Council’s website. This creates a nice dataset containing the name, date and URL of each committee meeting. Next Andrew made a ScraperWiki view and wrote some Python to output exactly the XML that he wants.
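The view boils down to looping over the scraped rows and printing RSS items; something like this sketch (not Andrew’s actual code, and the column names are guesses):

import scraperwiki
from xml.sax.saxutils import escape

scraperwiki.utils.httpresponseheader("Content-Type", "application/rss+xml")
print('<?xml version="1.0"?><rss version="2.0"><channel>')
print('<title>Lincoln City Council committee meetings</title>')
print('<link>http://www.lincoln.gov.uk/</link>')
for row in scraperwiki.sqlite.select("name, date, url from swdata order by date desc"):
    print('<item><title>%s (%s)</title><link>%s</link></item>' %
          (escape(row["name"]), escape(str(row["date"])), escape(row["url"])))
print('</channel></rss>')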

Andrew then wraps the RSS feed in Feedburner for people who want email updates. This is all documented in the Council’s data directory. They used to use Yahoo Pipes to do this, but Andrew is finding ScraperWiki easier to maintain, even though some knowledge of programming is required.

Since then, Andrew has gone on to make a map for the Lincoln decent homes scheme, also using ScraperWiki views – he’s written a blog post about it.

Views part 1 – Canadian weather stations
https://blog.scraperwiki.com/2010/12/views-part-1-canadian-weather-stations/
Fri, 03 Dec 2010

(This is the first of two posts announcing ScraperWiki “views”, a new feature that Julian, Richard and Tom worked away at and secretly launched a couple of months ago. Once you’ve scraped your data, how can you get it out again in just the form you want? See also: Views part 2 – Lincoln Council committees.)

Canadian weather stations

Clear Climate Code is a timely project to reimplement the software of climate science academics in nicely structured and commented Python. David Jones has been using ScraperWiki views to find out which areas of the world they don’t have much surface temperature data for, so they can look for more sources.

Take a look at his scraper Canada Climate Sources. If you scroll down, there’s a section “Views using this data from this scraper”. That’s where you can make new views – small pieces of code that output the data the way you want. Think of them as little CGI scripts you can edit in your browser. This is a screenshot of the Canada Weather Station Map view.

It’s a basic Google Map, made for you from a template when you choose “create new view”. But David then edited it, to add conditional code to change the colours and letters on the pins according to the status of the stations.

This is what makes ScraperWiki views so powerful – even if you start with a standard chart or map, you have the full power of the visualisation APIs you are using, and of HTML, JavaScript and CSS, to do more interesting things later.

There’s more about ScraperWiki and the Canada weather stations in the posts Canada and Analysis of Canada Data on the Clear Climate Code blog.

Next week – part 2 will be about how to use views to output your data in the machine readable format that you want.
