Which car should I (not) buy? Find out, with the ScraperWiki MOT website… https://blog.scraperwiki.com/2015/09/which-car-should-i-not-buy-find-out-with-the-scraperwiki-mot-website/ Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data, and this week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich: it is designed to give you facts and figures, and to provide an interface for interactively selecting and viewing the data you are interested in, in this case the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to either view the top faults (failure modes) or the pass rate for the selected make and model. There is also a “filter by year” option, which selects vehicles by their first year on the road, letting you narrow your search to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you see the overall pass rate for your selected make and model.

When you opt to view top faults, you see the top 10 faults discovered for the selected car make and model, along with a visual representation.

These are broad categorisations of the faults; if you want to break each one down into more detailed descriptions, you click the ‘detail’ button.


What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection effectively has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead.

mot.scraperwiki.com, by contrast, uses a dictionary stored in memory as its data source. A dictionary is a data structure in Python similar to a HashMap in Java: it consists of key-value pairs and supports fast lookups. But where do we get a dictionary from, and what should its structure be? Let’s back up a little bit, maybe to the very beginning. The following general procedure got us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix data manipulation functions (mostly command-line) were used to extract the columns of interest.
  • Data was then loaded into a PostgreSQL database where data integration and analysis was carried out. This took the form of joining tables, grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built from this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised, this text file is converted into a nested dictionary: that is to say, a dictionary of dictionaries, where the value associated with a key is itself another dictionary that can be accessed using a further key.
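To make that structure concrete, here is a minimal sketch of how such a nested dictionary could be built from the exported file. The file name, delimiter and column names are assumptions for illustration; this is not the website’s actual code.

```python
import csv
from collections import defaultdict

def build_lookup(path="mot_aggregates.txt"):
    """Build {make: {model: {year: {testresult: count}}}} from the export.

    Assumes a comma-delimited file with make, model, year, testresult and
    count columns; the real file's name and layout may differ.
    """
    lookup = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            year_counts = lookup[row["make"]][row["model"]][row["year"]]
            year_counts[row["testresult"]] = int(row["count"])
    return lookup
```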

When you select a car make, this dictionary is queried to retrieve the available models; in turn, when you select a model, the dictionary gives you the available years. When you submit your selection, the top-faults or pass-rate computations are made on the dictionary. When you don’t select a specific year, the data in the dictionary is aggregated across all years.
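Continuing the sketch above, a pass-rate query with optional year filtering might look like the following. The result codes (“P” for pass, “F” for fail) are an assumption; the real data may label test results differently.

```python
def pass_rate(lookup, make, model, year=None):
    """Pass rate for a make/model, aggregated over all years if year is None.

    Assumes test results are recorded as "P" (pass) and "F" (fail).
    """
    years = lookup[make][model]
    selected = [years[year]] if year is not None else list(years.values())
    passed = sum(counts.get("P", 0) for counts in selected)
    failed = sum(counts.get("F", 0) for counts in selected)
    total = passed + failed
    return passed / total if total else None

# e.g. pass_rate(lookup, "FORD", "FIESTA", year="2008")
```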

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me or ScraperWiki about this work, and enjoy!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
And fast streaming CSV download… https://blog.scraperwiki.com/2014/06/and-fast-streaming-csv-download/ Fri, 20 Jun 2014 11:54:58 +0000

We’re rolling out a series of performance improvements to ScraperWiki.

Yesterday, we sped up the Tableau/OData connector.

Today, it’s the turn of the humble CSV.

When you go to “Download a spreadsheet” you’ll notice the CSV file is now always described as “live”.

Live CSV files

This means it is always up to date, and streams at full speed to your browser. Even free account users don’t need to press the refresh button any more.

If you’ve any external software that automatically downloads the CSV file, you’ll want to update the URL it reads from.

(Developers’ note: it even sets the little-known ‘header’ parameter from RFC 4180! And our code is an extremely short call out to the SQLite binary.)
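For the curious, here is a minimal sketch of what a streaming CSV endpoint along these lines could look like: it shells out to the sqlite3 binary and sets the RFC 4180 ‘header’ parameter on the Content-Type. The Flask app, file path and table name are illustrative assumptions, not ScraperWiki’s actual code.

```python
import subprocess
from flask import Flask, Response, stream_with_context

app = Flask(__name__)

@app.route("/download/<dataset>.csv")
def download_csv(dataset):
    # Ask the sqlite3 CLI for CSV output with a header row, and stream its
    # stdout straight to the browser. The path and table name are assumptions.
    proc = subprocess.Popen(
        ["sqlite3", "-header", "-csv",
         "/data/%s.sqlite" % dataset, "SELECT * FROM swdata"],
        stdout=subprocess.PIPE,
    )

    def generate():
        for line in proc.stdout:
            yield line

    # RFC 4180 defines an optional 'header' parameter for text/csv.
    return Response(stream_with_context(generate()),
                    content_type="text/csv; header=present")
```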

Super-faster Tableau integration https://blog.scraperwiki.com/2014/06/super-faster-tableau-integration/ Thu, 19 Jun 2014 14:29:39 +0000

We’ve just rolled out a change to make our OData endpoint much faster.

For example, importing 150,000 Tweets is down from 20 minutes to 6 minutes.

If you have Tableau, or other software that can read OData, please try it out!


If you’ve already got a connection set up, you need to go and get the URL from the OData tool again, as the new faster one is different.

Full instructions on how to use it are in this blog post. If you’re a programmer and you’re interested in how we did it, you can see the new code here.
Introducing “30 day free trial” accounts https://blog.scraperwiki.com/2014/04/introducing-30-day-free-trial-accounts/ Wed, 16 Apr 2014 15:27:19 +0000

Last May, we launched free Community accounts on ScraperWiki.

We’ve since found that the limit on the number of datasets isn’t enough to convert heavy users into paying customers. This matters, because we want to invest more in improving the product, and adding new tools.

Today, we’re pleased to announce that we’re introducing a new Free Trial plan. This replaces the Community plan for new users, and lets people try out ScraperWiki for free for 30 days.

Free Trial

Existing Community accounts will stay free forever, as a thank you for supporting ScraperWiki early on. We may, however, add new features and tools in future that won’t be available on Community accounts.

We continue to support journalists; if you are one, just ask for a free upgrade.

If you have any questions, please do ask in the comments below, or email me, the CEO, at francis@scraperwiki.com.

Scraperwiki’s response to the Heartbleed security failure https://blog.scraperwiki.com/2014/04/scraperwikis-response-to-the-heartbleed-security-failure/ Wed, 09 Apr 2014 17:07:17 +0000

“Catastrophic” is the right word. On the scale of 1 to 10, this is an 11.

― Security expert Bruce Schneier, responding to Heartbleed

On Monday the 7th of April 2014, a software flaw was identified which exposed approximately two thirds of the web to the risk of catastrophic security failure. The flaw has been dubbed “Heartbleed”.

The potential for exploiting this has now been mitigated by many providers, including ScraperWiki. The ramifications are only slowly becoming understood.

We at ScraperWiki recommend that you change your passwords on all websites of importance to you, especially your bank, email and anything that can be used to impersonate you, regardless of whether you have used those passwords anywhere else.

What’s the problem?

It turns out that there was a programming mistake in a piece of software which underpins a significant portion of the web. Anyone who understood the mistake could ask most websites on the internet to tell them the credentials (passwords and usernames) of random people.

On Monday night, the mistake became known to hundreds of thousands of people around the world, good guys and bad. Since the attack can be automated to rapidly divulge potentially millions of credentials, it is very likely that large numbers of our passwords are now compromised.

The nature of the leak means that it is very difficult, if not impossible, to know whether information was stolen at any point during the whole time the mistake was present, since 2012. However, as of writing, there is no positive evidence that it was exploited before the announcement on Monday evening.

What does that mean?

It means that for a period of approximately 12-48 hours anyone could download a program which could be pointed at many websites on the internet — including the likes of banks, social media websites, email and ScraperWiki — and obtain passwords for users who recently logged in, along with other data which could be used to impersonate them, with no audit trail.

How has ScraperWiki responded?

Immediately upon learning of the vulnerability, we upgraded our servers and restarted them, making them safe against this attack.

Out of an abundance of caution we re-keyed our servers, obtained new SSL certificates and invalidated all login sessions – meaning you will have had to re-enter your password to access your data on ScraperWiki.

We’ve also reviewed our security practices and beefed up our servers to enable the latest encryption technology to keep your ScraperWiki credentials and data safe, should other attacks of this nature be discovered.

The effects of Heartbleed may be felt for some time. The internet hosts of the world are reeling from this event. It is worth your while to take a moment to protect yourself by changing your passwords now.

A systems administrator hearing about Heartbleed for the first time (courtesy of “Devops reactions”)
ScraperWiki Classic retirement guide https://blog.scraperwiki.com/2014/03/scraperwiki-classic-retirement-guide/ Fri, 07 Mar 2014 14:06:04 +0000

In July last year, we announced some exciting changes to the ScraperWiki platform, and our plans to retire ScraperWiki Classic later in the year.

That time has now come. If you’re a ScraperWiki Classic user, here’s what will be changing, and what it means for you:

Today, we’re adding a button to all ScraperWiki Classic pages, giving you single-click migration to Morph.io, a free cloud scraping site run by our awesome friends at OpenAustralia. Morph.io is very similar to the ScraperWiki Classic platform, allowing you to share the data you have scraped. If you’re an open data activist, or you work on public data projects, you should check them out!

From 12th March onwards, all scrapers on ScraperWiki Classic will be read-only: you will no longer be able to edit their code. You’ll still be able to migrate to Morph.io, or to copy the code and paste it into the “Code in your browser” tool on the new ScraperWiki. Scheduled scrapers will continue running until 17th March.

On 17th March, scheduled scrapers will stop running. We’re going to take a final copy of all public scrapers on ScraperWiki Classic, and upload them as a single repository to GitHub, in addition to the read-only archive on classic.scraperwiki.com.

Retiring ScraperWiki Classic helps us focus on our new platform and tools. The “Code in your browser” and “Open your data” tools on the new platform are perfect for journalists and researchers starting to code, and our free 20-dataset Journalist accounts are still available. So you have no excuse not to create an account and go liberate some data! 🙂

If you have any other questions, make sure to visit our ScraperWiki Classic retirement guide for more info and FAQs.

In summary…

ScraperWiki Classic is retiring on 17th March 2014.

You can migrate to Morph.io or our new “Code in your browser” tool at any point.

We’re going to keep your public code and data available in a read-only form on classic.scraperwiki.com for as long as we’re able.

9 things you need to know about the “Code in your browser” tool https://blog.scraperwiki.com/2013/08/9-things-you-need-to-know-about-the-code-in-your-browser-tool/ Mon, 05 Aug 2013 15:38:58 +0000

ScraperWiki has always made it as easy as possible to code scripts to get data from web pages. Our new platform is no exception. The new browser-based coding environment is a tool like any other.

Here are 9 things you should know about it.

1. You can use any language you like. We recommend Python, as it is easy to read and has particularly good libraries for doing data science.

2. You write your code using the powerful ACE editor. It has similar keystrokes and features to desktop programmers’ editors.

3. It’s easy to transfer a scraper from ScraperWiki Classic. Find your scraper, choose “View source”, then copy and paste the code into the new “Code in your browser” tool. Make sure you keep the new first line that says the language, e.g. “#!/usr/bin/env python” (there’s a minimal example at the end of this post).

4. There are tutorials on GitHub, if you want to learn to scrape. It’s a wiki; please help improve them! The tutorials work just as well on your own laptop too.

5. To run the code, press the “Run” button; to stop it, press the “Stop” button.

6. The code carries on running in the background even if you leave the page. You can come back and see the output log, or even see a scheduled run happening mid-flow.

7. It has flexible scheduling. As well as hourly, daily and monthly, you can choose the time of day you want it to run.

8. You can SSH in, if you need to do something the tool doesn’t do. Your scraper is in “code/scraper”. You can install new libraries, add extra files, edit the crontab, access the SQLite database from the command line, use the Python debugger… Whatever you need.


9. It’s open source. You can report bugs and make improvements to the tool’s interface. Please send us pull requests!

Want to know more? Try this quick start guide. Read the tool’s FAQ. Or find out 10 technical things you didn’t know about the new ScraperWiki.
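As promised in point 3, here is a minimal sketch of what a scraper in the tool can look like, with the language line first. The target URL, the table columns and the use of the scraperwiki library’s sqlite.save helper are illustrative assumptions rather than a prescription.

```python
#!/usr/bin/env python
# A minimal illustrative scraper: fetch a page, pull out its links,
# and save them into this dataset's SQLite database.
import requests
import lxml.html
import scraperwiki  # ScraperWiki's Python library

html = requests.get("http://example.com/").text  # placeholder URL
root = lxml.html.fromstring(html)

for link in root.xpath("//a[@href]"):
    scraperwiki.sqlite.save(
        unique_keys=["url"],
        data={"url": link.get("href"), "text": link.text_content()},
    )
```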

Uploading a (structured) spreadsheet https://blog.scraperwiki.com/2013/07/uploading-a-structured-spreadsheet/ Wed, 24 Jul 2013 03:36:47 +0000

We’ve made a new tool to help you upload a structured spreadsheet. That is to say, one that contains a table with headers.

Upload spreadsheet

I’m trying it out with an old spreadsheet of expenses from when I worked at mySociety.

If your spreadsheet isn’t consistent enough, it tells you where you’ve gone wrong. In my case, I didn’t have a clear header.

No header

It was easy to add the header, then the spreadsheet uploaded into ScraperWiki.

I then used the “Summarise this data” tool on my expenses data.

You can immediately see that, of the expenses in that period, the median amount I claimed was 27 pounds (and there was one large amount of 1,800-odd).

Also, I claimed a lot of train tickets! (Nobody at mySociety at the time had a car, and we were scattered all over the country, meeting up every couple of weeks.)

Try for yourself! Register on ScraperWiki, and pick the “Upload spreadsheet” tool from the “New dataset” chooser.

P.S. What if the spreadsheet is a bit more chaotic? We’re working on that too, in a separate tool. If you’re a coder, for a taste of what’s behind it, see the okfn/messytables and scraperwiki/xypath Python libraries.
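For a flavour of what messytables does, here is a minimal sketch based on its documented usage: it guesses where the header row is and what type each column holds, then streams typed rows. The file name is a placeholder, and the exact API may have changed since this post was written.

```python
from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess, types_processor)

with open("expenses.csv", "rb") as fh:  # placeholder file name
    row_set = CSVTableSet(fh).tables[0]

    # Guess which row holds the headers, then skip everything above it.
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))

    # Guess column types (dates, integers, decimals...) and cast the cells.
    types = type_guess(row_set.sample, strict=True)
    row_set.register_processor(types_processor(types))

    for row in row_set:
        print([cell.value for cell in row])
```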

We’ve migrated to EC2 https://blog.scraperwiki.com/2013/07/weve-migrated-to-ec2/ Wed, 17 Jul 2013 15:36:35 +0000

When we started work on the ScraperWiki beta, we decided to host it ‘in the cloud’ using Linode, a VPS (virtual private server) provider. For the uninitiated, Linode allows people to host their own virtual Linux servers without having to worry about things like maintaining their own hardware.

On April 15th 2013, Linode were hacked via a ColdFusion zero-day exploit. The hackers were able to access some of Linode’s source code, one of their web servers, and notably, their customer database. In a blog post released the next day, they assured us that all the credit card details they store are encrypted.

Soon after, however, we noticed fraudulent purchases on the company credit card we had associated with our Linode account. It seems that we were not alone in this. We immediately cancelled the card and started to make plans to switch to another VPS provider.

These days, one of the biggest names in cloud hosting is Amazon AWS. They’re the market leader, and their ecosystem and SLA are more in line with the expectations of our corporate customers. Their API is also incredibly powerful. It’s no wonder that, even prior to the Linode hack, we had investigated migrating the ScraperWiki beta platform to Amazon EC2.

Since mid-June, all code and data on scraperwiki.com has been stored on Amazon’s EC2 platform. Amongst other improvements, you should all have started to notice a significant increase in the speed of ScraperWiki tools.

We have a lot of confidence in the EC2 platform. Amazon have been around for a long time and have an excellent track record in cloud hosting, where they have earned a reputation for reliability and security. It is for these reasons that we feel confident putting our users’ data on their servers.

The integrity of any data stored on our service is paramount. We are therefore greatly encouraged by AWS’s EBS, which we are currently using for backups: it allows us to store our backups in two different geographical regions. Should a region ever go down, we are able to easily and quickly restore ScraperWiki, ensuring a minimum of disruption for our customers.

Finally, we’re excited to announce that we’re using Canonical’s Juju to manage how we deploy our servers. We’re impressed with what we’ve seen of it so far: it seems to be a really powerful, feature-rich product, and it has saved us a lot of time. We’re looking forward to it allowing us to scale our product better and spend less time on migrations and deployments. It will also allow us to easily migrate our servers to any OpenStack provider, should we wish to.

The changes we’re making to our platform will result in ScraperWiki being faster and more resistant to disruption. As developers and data scientists ourselves, we understand the need for reliable tools, and we’re really looking forward to you – the user – having an even better ScraperWiki experience.

Your questions about the new ScraperWiki answered https://blog.scraperwiki.com/2013/07/your-questions-about-the-new-scraperwiki-answered/ Mon, 08 Jul 2013 16:49:27 +0000

You may have noticed we launched a completely new version of ScraperWiki last week. Here’s a suitably meta screengrab of last week’s #scraperwiki twitter activity, collected by the new “Search for tweets” tool and visualised by the “Summarise this data” tool, both running on our new platform.

Twitter summary

These changes have been a long time coming, and it’s really exciting to finally see the new tool-centric ScraperWiki out in the wild. We know you’ve got a load of questions about the new ScraperWiki, and how it affects our old platform, now lovingly renamed “ScraperWiki Classic”. So we’ve created an FAQ that hopefully answers all of your questions about what’s going on.

Take a look: https://scraperwiki.com/help/scraperwiki-classic

If there’s anything missing, or any questions left unanswered, let us know. We want to keep that FAQ up to date as the Classic migration goes on, and we’d love your help improving it.
