opencorporates – ScraperWiki
Extract tables from PDFs and scrape the web
https://blog.scraperwiki.com

Conquering Copyright and Scaling Open Data Projects – How Chris Taggart is Counting Culture
Fri, 09 Sep 2011
https://blog.scraperwiki.com/2011/09/how-chris-taggart-is-counting-culture/

Chris Taggart is a founder of OpenlyLocal and OpenCorporates. He says, “When people ask what I do, I say I open up data, sometimes whether people like it or not.” In the beginning he didn’t really expect much to come of his first scrapers, “other than maybe being told off by the councils, because all the councils at that time had got things on their website saying this is copyright”.

He did it anyway with a very profound outcome:

I expected them to send me a take down notice … actually that didn’t happen. What did happen is that a couple of councils contacted us and said we like what you’re doing, will you start scraping us.

His first success spurred him on to create an even more ambitious project: corporate data. He knew he’d be looking at a vast array of sources scattered across the web, in different languages and formats. So he made a call out on ScraperWiki for OpenCorporates. It currently has information on 22 million companies across 28 jurisdictions. And it’s an alpha! I caught up with him on Skype to find out what he’s learnt about conquering copyright and scaling open data projects.

Constructing the Open Data Landscape
Wed, 07 Sep 2011
https://blog.scraperwiki.com/2011/09/constructing-the-open-data-landscape/

In an article in today’s Telegraph regarding Francis Maude’s Public Data Corporation, Michael Cross asks: “What makes the state think it can be at the cutting edge of the knowledge economy?” He writes in terms of market and business share, giving the example of the satnav market, worth over $100bn a year, yet built on free data from the US Government’s GPS system.

He credits the internet revolution for transforming public sector data into a ‘cashable proposition’. We, along with many other start-ups, foundations and civic coding groups, are part of this ‘geeky world’ of Open Data. So we’d like to add our piece concerning the Open Data movement.

Michael is right to ask this question, because a constant custodial battle is being fought, every day, in every scrape and every script on the web, over the rights to data. So let me tell you about the geeks’ take on Open Data.

Video: http://www.vimeo.com/21711338

The idea(l) behind Open Data is to create sustainable Open Data projects with purpose. This has been championed in the last couple of years by civic data projects such as MySociety, the Open Knowledge Foundation, Code for America, OpenAustralia and Open Development Cambodia (which is following me on Twitter!). Older, more established organizations are also being converted to the Open Data ethos. For instance, The World Bank is one major organization turning to Open Data in a big way.

However, much of the public sector data published so far has been pretty much useless. Governments are finally beginning to realize that data has little value unless people understand its context and provenance. They are beginning to see that opening up their data can reduce the cost and responsibility of getting it to the end user, as the Open Declaration on European Public Services clearly says: “The needs of today’s society are too complex to be met by government alone”.

The key to a sustainable Open Data landscape lies not in the organisational heads of government bodies but in the provenance of the data they release and the ways in which it is released. The goal should be to gain the 5 stars of linked open data. For this to be achieved, the data needs to be pared down to its raw ingredients. In a research paper entitled “Open Data, Open Society” (see the end of this post), Marco Fioretti explains:

Public data are really useful only when they are raw, really open and linked … only when data are published online in that way every citizen or organization will be able to automatically analyze and present them in easy to understand forms

This is where ScraperWiki really excels in terms of opening up data. Not only is our data open and accessible in various forms (CSV, database, API), but even the extraction process is open, in the form of a code wiki. In terms of data, we are rawer than raw. If government ordered an open data steak it would ask for rare, data hubs would ask for raw, and ours would still be mooing!
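
For readers who want to see what that openness looks like in practice, here is a minimal sketch of pulling rows out of a scraper’s datastore over HTTP. The endpoint shape, parameters and scraper name below are assumptions for illustration, not a verbatim recipe; check the ScraperWiki API documentation for the exact details.

    # A minimal sketch of pulling rows from a ScraperWiki scraper's datastore.
    # NOTE: the endpoint, parameters and scraper name are illustrative assumptions;
    # consult the ScraperWiki API documentation for the exact details.
    import csv
    import io
    import urllib.parse
    import urllib.request

    API = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"  # assumed endpoint
    params = {
        "format": "csv",
        "name": "isle_of_man_company_register",    # hypothetical scraper short name
        "query": "select * from swdata limit 10",  # swdata is the default table
    }

    with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as resp:
        rows = list(csv.DictReader(io.StringIO(resp.read().decode("utf-8"))))

    for row in rows:
        print(row)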

We’re providing some of the heavy machinery needed to construct the Open Data landscape. What it will look like very much depends on the civic cyber-community getting involved. A leader in this community is Chris Taggart, creator of OpenlyLocal and OpenCorporates, and a prolific ScraperWiki user. So I Skyped him to see what he makes of the state thinking it can be at the cutting edge of the knowledge economy:

Speaking of the linked economy, do check out all the links in this post. All the media included here is under a Creative Commons license.

If you are interested in getting more involved in the Open Data scene, check out the Open Knowledge Foundation.

Open Data, Open Society

Meet the User – Pall Hilmarsson
Fri, 08 Jul 2011
https://blog.scraperwiki.com/2011/07/meet-the-user-pall-hilmarsson/

Our digger has been driven around colder climes by one of our star users, Icelander Pall Hilmarsson. Driving such a heavy vehicle on icy surfaces and through volcanic ash may seem daunting to most people, but Pall has not only ventured forth undeterred, he has given passers-by a lift. One such hitch-hiker is Chris Taggart with his OpenCorporates project. I caught up (electronically) with Pall.

What’s your background and what are your particular interests when it comes to collecting data?

I have work-related experience in design – I started working as a designer 12 years ago, almost by accident. At one point I thought I’d study it, and I did try, for the whole of ten days! Fortunately I quit and went for a B.A. degree in anthropology. Somehow I’ve ended up doing design again. Currently I work for the Reykjavík Grapevine magazine.

I’m particularly interested in freeing data that has some social relevance, something that gives us a new way of seeing and understanding society. That comes from the anthropology: data that has social meaning.

How have you found using ScraperWiki and what do you find it useful for?

ScraperWiki has been a fantastic tool for me. I had written scrapers before, mostly small scripts to make RSS feeds, and only in Perl. ScraperWiki has led me to teach myself Python and write more complex scrapers. It has opened up a whole new set of possibilities. I really like being able to study other people’s scrapers and to help others with theirs. I’ve learned so much from ScraperWiki.

Are there any data projects you’re working on at the moment?

Right now I’m involved in scraping some national company registers for the brilliant OpenCorporates site. I’m also compiling a rather large dataset on foreclosures in Iceland over the last 10 years, trying to get a picture of where the financial meltdown is hitting hardest. I’m hoping to make it into an interactive map application. So far the data shows some interesting things: going into the project I had a notion that the Reykjavík suburbs, with their new apartment buildings, would account for the bulk of foreclosures. It seems, though, that the old downtown area is actually where most apartments are going up for auction.

How is the data landscape in the area you’re interested in? Is it accessible, formatted, consistent?

Governmental data over here is not easily accessible, but that might change. A new bill introduced in Parliament aims to free a lot of data and to strengthen citizens’ right to access information. But of course it will never be enough. Data begets more data.

So watch out Iceland – you’re being ScraperWikied!

OpenCorporates partners with ScraperWiki & offers bounties for open data scrapers
Fri, 25 Mar 2011
https://blog.scraperwiki.com/2011/03/opencorporates-partners-with-scraperwiki-offers-bounties-for-open-data-scrapers/

This is a guest post by Chris Taggart, co-founder of OpenCorporates

When we started OpenCorporates it was to solve a real need that we and a number of other people in the open data community had: whether it’s Government spending, subsidy info or court cases, we needed a database of corporate entities to match against, and not just for one country either.

But we knew from the first that we didn’t want this to be some heavily funded monolithic project that threw money at the problem in order to create a walled garden of new URIs unrelated to existing identifiers. It’s also why we wanted to work with existing projects like OpenKvK, rather than trying to replace them.

So the question was: how do we make this scale, and at the same time do the right thing – that is, work with a variety of different people using different solutions and different programming languages? The answer to both, it turns out, was to use open data, and the excellent ScraperWiki.

How does it work? Well, the basics we need in order to create a company record at OpenCorporates are the company number, the jurisdiction and the company’s name. (If there’s a status field, e.g. dissolved/active, a company type or a URL for more data, that’s a bonus.) So, all you need to do is write a scraper for a country we haven’t got data for and name the fields in a standard way (CompanyName, CompanyNumber, Status, EntityType, and RegistryUrl if the URL of the company page can’t be worked out from the company number), and bingo, we can pull it into OpenCorporates with just a couple of lines of code.
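
As a rough, hedged illustration (this is not OpenCorporates’ or ScraperWiki’s own code), a Python scraper on ScraperWiki would save each company as a record keyed on those standardised headings, along the lines of the sketch below; the registry values are invented.

    # Sketch of saving one company record under the standardised OpenCorporates
    # headings. The values are invented; only the field names matter here.
    import scraperwiki  # provided in the ScraperWiki environment (also on PyPI)

    record = {
        "CompanyNumber": "123456C",                  # hypothetical number
        "CompanyName": "Example Holdings Limited",   # hypothetical name
        "Status": "Live",                            # optional bonus field
        "EntityType": "Limited Company",             # optional bonus field
        "RegistryUrl": "http://registry.example.gov/company/123456C",  # optional
    }

    # Keying on CompanyNumber means re-running the scraper updates rows
    # rather than duplicating them.
    scraperwiki.sqlite.save(unique_keys=["CompanyNumber"], data=record)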

Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland and, in the US, the District of Columbia). It’s written in Ruby, because that’s what we at OpenCorporates code in, but ScraperWiki allows you to write scrapers in Python or PHP too, and the important thing here is the data, not the language used to produce it.

The Isle of Man company registry website is a .Net system which uses all sorts of hidden fields and other nonsense in the forms and navigation. This is normally a bit of a pain, but because you can use the Ruby Mechanize library to submit the forms found on the pages (there’s even a tutorial scraper which shows how to do it), it becomes fairly straightforward.
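
The Isle of Man scraper does this in Ruby; purely as an illustration of the same idea, here is a rough Python sketch using Python’s own mechanize library. The URL and the visible field name are hypothetical, but the point stands: the library carries the hidden fields back with the form, so the .Net postback machinery is satisfied.

    # Illustrative Python analogue of the Ruby Mechanize approach: the library
    # parses the form, keeps hidden fields such as __VIEWSTATE intact and
    # submits them back with the rest of the form.
    # The URL and the visible field name are hypothetical.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open("https://registry.example.im/company-search")  # hypothetical URL

    br.select_form(nr=0)        # pick the first form on the page
    br["SearchText"] = "aa"     # hypothetical name of the search box
    response = br.submit()      # hidden fields go along for the ride

    html = response.read()      # results page, ready to be parsed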

The code itself should be fairly readable to anyone familiar with Ruby or Python. Essentially it tackles the problem by doing multiple searches for companies beginning with two letters, starting with ‘aa’, then ‘ab’ and so on; for each letter pair it iterates through each page of results in turn, and each page is scraped to extract the data, which is saved under the standardised headings. That’s it.
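
In outline, and hedged as a sketch of the strategy just described rather than the actual Isle of Man scraper (which is in Ruby), the control flow looks something like this; search() and parse_companies() are stubs standing in for the form-submission and HTML-parsing details.

    # Sketch of the two-letter prefix strategy: search 'aa', 'ab', ..., 'zz',
    # walk every page of each result set, and save each company under the
    # standard headings. search() and parse_companies() are stubs.
    import string
    from itertools import product

    import scraperwiki

    def search(prefix, page):
        """Stub: submit the registry search form for `prefix` and return the
        result rows on page `page` (an empty list once past the last page)."""
        return []

    def parse_companies(rows):
        """Stub: turn raw result rows into dicts keyed on CompanyNumber,
        CompanyName and friends."""
        return []

    for a, b in product(string.ascii_lowercase, repeat=2):
        prefix = a + b            # 'aa', 'ab', ... 'zz'
        page = 1
        while True:
            rows = search(prefix, page)
            if not rows:          # no more results for this prefix
                break
            for company in parse_companies(rows):
                scraperwiki.sqlite.save(unique_keys=["CompanyNumber"], data=company)
            page += 1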

In the space of a couple of hours, not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported into OpenCorporates.

However, that’s not all. In order to kickstart the effort, OpenCorporates (technically Chrinon Ltd, the micro start-up behind it) is offering a bounty for each new jurisdiction opened up.

It’s not huge (we’re a micro-startup, remember): £100 for any jurisdiction that hasn’t been done yet, £250 for those territories we want to import sooner rather than later (Australia, France, Spain), and £500 for Delaware (there’s a captcha there, so we’re not sure it’s even possible). There’s an initial cap of £2500 on the bounty pot (details at the bottom of this post).

However, the scrapers can often be written in a couple of hours, and it’s worth stressing again that neither the code nor the data will belong to OpenCorporates but to the open data community; if people build other things on it, so much the better. Of course we think it would make sense for them to use the OpenCorporates URIs to make it easy to exchange data in a consistent and predictable way, but, hey, it’s open data 😉

Small, simple pieces, loosely connected, to build something rather cool. So now you can do a search for, oh say Barclays, and get this:

The bounty details: how it works

Find a country/company registry that you fancy opening up the data for (here are a couple of lists of registries). Make sure it’s from the official registry, and not a commercial reseller. Check too that no-one has already written one, or is in the middle of writing one, by checking the scrapers tagged with opencorporates (be nice, and respect other people’s attempts, but feel free to start one if it looks as if someone’s given up on a scraper).

All clear? Go ahead and start a new scraper (useful tutorials here). Call it something like trial_fr_company_numbers (until it’s done and been OK’d) and get coding, using the headings detailed above for the CompanyNumber, CompanyName etc. When it’s done, and it’s churning away pulling in data, email us at info@opencorporates.com, and assuming it’s OK, we’ll pay you by PayPal or by bank transfer (you’ll need to give us an invoice in that case). If it’s not, we’ll add comments to the scraper. Any questions? Email us at info@opencorporates.com, and happy scraping.
