Steven Maude – ScraperWiki
https://blog.scraperwiki.com – Extract tables from PDFs and scrape the web

Elasticsearch and elasticity: building a search for government documents
https://blog.scraperwiki.com/2015/06/elasticsearch-and-elasticity-building-a-search-for-government-documents/ (Mon, 22 Jun 2015)

[Image: a photograph of clouds under a magnifying glass]

“Examining Clouds” by Kate Ter Harr, licensed under CC BY 2.0.

Based in Paris, the OECD is the Organisation for Economic Co-operation and Development. As the name suggests, the OECD’s job is to develop and promote new social and economic policies.

One part of their work is researching how open countries trade. Their view is that fewer trade barriers benefit consumers, through lower prices, and companies, through cost-cutting. By tracking how countries vary, they hope to give legislators the means to see how they can develop policies or negotiate with other countries to open trade further.

This is a huge undertaking.

Trade policies vary not only across countries, but also by industry. Investigating current legislation therefore requires a team of experts to carry out painstaking research and detective work.

Recently, they asked us for advice on how to make better use of the information available on government websites. A major problem they have is searching through large collections of documents to find relevant legislation. Even very short sections may be crucial in establishing a country’s policy on a particular aspect of trade.

Searching for documents

One question we considered is: what options do they have to search within documents?

  1. Use a web search engine. If you want to find documents available on the web, search engines are the first tool of choice. Unfortunately, search engines are black boxes: you input a term and get results back without any knowledge of how those results were produced. For instance, there’s no way of knowing what documents might have been considered in any particular search. Personalised search also governs the results you actually see. One normal-looking search of a government site gave us a suspiciously low number of results on both Google and Bing. Though later searches found far more documents, this is illustrative of the problems of search engines for exhaustive searching.
  2. Use a site’s own search feature. This is more likely to give us access to all the documents available. But every site has a different layout, and there’s no unified interface for searching across multiple sites at once. For a one-off search, manually visiting and searching several sites isn’t onerous; repeating this for a large number of searches soon becomes very tedious.
  3. Build our own custom search tool. To do this, we need to collect all the documents from sites and store those in a database that we run. This way we know what we’ve collected, and we can design and implement searches according to what the OECD need.

Elasticsearch

Enter Elasticsearch: a database designed for full text search and one which seemed to fit our requirements.

Getting the data

To see how Elasticsearch might help the OECD, we collected several thousand government documents from one website.

We needed to do very little in the way of processing. First, we extracted text from each web page using Python’s lxml. Along with the URL and the page title, we then created structured documents (JSON) suitable for storing in Elasticsearch.
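
To make that concrete, here’s a minimal sketch of that preparation step. It isn’t the exact script we ran: the field names (url, title, text) and the example URL are illustrative assumptions.

import json

import lxml.html
import requests

def page_to_document(url):
    # Fetch the page and parse it with lxml.
    html = lxml.html.fromstring(requests.get(url).content)
    title = (html.findtext(".//title") or "").strip()
    # text_content() strips the markup, leaving just the visible text.
    text = html.text_content().strip()
    # A structured JSON document ready to send to Elasticsearch.
    return json.dumps({"url": url, "title": title, "text": text})

# Hypothetical example page:
with open("my_document.json", "w") as f:
    f.write(page_to_document("http://www.example.gov/trade/legislation.html"))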

Running Elasticsearch and uploading documents

Running Elasticsearch is simple. Visit the release page, download the latest release and just start it running. One sensible thing to do out of the box is change the default cluster name — the default is just elasticsearch. Making sure Elasticsearch is firewalled off from the internet is another sensible precaution.

When you have it running, you can simply send documents to it for storage using an HTTP client like curl:

curl "http://localhost:9200/documents/document" -X POST -d @my_document.json

For the few thousand documents we had, this wasn’t sluggish at all, though it’s also possible to upload documents in bulk should this prove too slow.
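
For reference, here’s a rough sketch of what a bulk upload could look like using Elasticsearch’s _bulk endpoint, keeping the documents/document index and type from the curl example above; the docs/ directory layout is an assumption.

# Sketch of bulk indexing over HTTP; assumes one JSON document per file in docs/.
import glob
import json

import requests

lines = []
for path in glob.glob("docs/*.json"):
    with open(path) as f:
        source = f.read().strip()
    # Each document is preceded by an action line telling Elasticsearch where to index it.
    lines.append(json.dumps({"index": {"_index": "documents", "_type": "document"}}))
    lines.append(source)

# The bulk body is newline-delimited JSON and must end with a newline.
body = "\n".join(lines) + "\n"
response = requests.post("http://localhost:9200/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
print("Errors during indexing?", response.json()["errors"])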

Querying

Once we have documents stored, the next thing to do is query them!

Other than very basic queries, Elasticsearch queries are written in JSON, like the documents it stores, and there’s a wide variety of query types bundled into Elasticsearch.

Query JSON is not difficult to understand, but it can become tricky to read and write due to the Russian doll-like structure it quickly adopts. In Python, the addict library is useful for writing queries out more directly, without getting lost in an avalanche of {curly brackets}.

As a demo, we implemented a simple phrase matching search using the should keyword.

This allows combination of multiple phrases, favouring documents containing more matches. If we use this to search for, e.g. "immigration quota"+"work permit", the results will contain one or both of these phrases. However, results with both phrases are deemed more relevant.
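
As a sketch of what such a query might look like (the text field name is an assumption about how we indexed the documents), addict lets us build the nested structure without writing all the braces by hand:

import json

import requests
from addict import Dict

def phrase_query(*phrases):
    # Build {"query": {"bool": {"should": [...]}}} without nesting literal braces.
    query = Dict()
    query.query.bool.should = [{"match_phrase": {"text": p}} for p in phrases]
    return query.to_dict()

# Documents matching either phrase are returned; those matching both score higher.
body = json.dumps(phrase_query("immigration quota", "work permit"))
response = requests.post("http://localhost:9200/documents/_search", data=body,
                         headers={"Content-Type": "application/json"})
for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])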

The Elasticsearch Tool

[Screenshot: the Elasticsearch search tool]

With our tool, researchers can enter a search, and very quickly get back a list of URLs, document titles and a snippet of a matching part of the text.

[Screenshot: search results from the Elasticsearch tool]

What we haven’t implemented is automating queries, which could also save the OECD a lot of time. Just as document upload is automated, we could run periodic keyword searches on our data. This way, Elasticsearch could be scheduled to look out for phrases that we wish to track. From those results, we could generate a summary or report of the top matches, which might prompt an interested researcher to investigate further.
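
A rough sketch of what that automation might look like, run daily from a scheduler such as cron; the phrase list, index name and field names are illustrative assumptions rather than anything we built.

import json

import requests

TRACKED_PHRASES = ["immigration quota", "work permit", "import tariff"]
SEARCH_URL = "http://localhost:9200/documents/_search"

def top_matches(phrase, size=5):
    # Return the highest-scoring documents containing the phrase.
    query = {"query": {"match_phrase": {"text": phrase}}, "size": size}
    response = requests.post(SEARCH_URL, data=json.dumps(query),
                             headers={"Content-Type": "application/json"})
    return response.json()["hits"]["hits"]

# Print a simple report of the top matches for each tracked phrase.
for phrase in TRACKED_PHRASES:
    print("=== {} ===".format(phrase))
    for hit in top_matches(phrase):
        print("{:.2f}  {}  {}".format(hit["_score"], hit["_source"]["title"],
                                      hit["_source"]["url"]))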

Future directions

For (admittedly small scale) searching, we had no problems with a single instance of Elasticsearch. To improve performance on bigger data sets, Elasticsearch also has built-in support for clustering, which looks straightforward to get running.

Clustering also ensures there is no single point of failure. However, there are known issues: current versions of Elasticsearch can suffer document loss if nodes fail.

Provided Elasticsearch isn’t used as the only data store for documents, this is a less serious problem. It is possible to keep checking that all documents that should be in Elasticsearch are indeed there, and re-add them if not.

Elasticsearch is powerful, yet easy to get started with. For instance, its text analysis features support a large number of languages out of the box. This is important for the OECD who are looking at documents of international origin.

It’s definitely worth investigating if you’re working on a project that requires search. You may find that, having found Elasticsearch, you’re no longer searching for a solution.

Where do tweets come from?
https://blog.scraperwiki.com/2014/06/where-do-tweets-come-from/ (Mon, 16 Jun 2014)

Geography of Twitter @replies by Eric Fisher, reproduced under a Creative Commons Attribution 2.0 Generic license.

In our Twitter search tool, we provide the location of tweets via the latitude and longitude data Twitter offers. Unfortunately, if you want to know where the user who wrote a particular tweet was, most Twitter users (including me) don’t enable this feature. What you usually find are rare sightings of latitude and longitude amongst mostly empty columns.

However, you can often get a good idea of a user’s location either from what they enter as location in their profile or from their time zone. We already get this information when you use the Twitter Friends tool, but not when searching for tweets. Now we’ve added it to our Twitter search too, so you can get an idea of where individual tweets were sent from.

This snippet of a search shows you what we now get and highlights the clear difference between the lonely lat, lng columns and the much busier user location and time zone:

[Screenshot: location and time zone data in a Twitter search dataset]

Create a new Twitter search dataset and you should see this extra data too!

What’s Twitter time zone data good for?
https://blog.scraperwiki.com/2014/06/whats-twitter-time-zone-data-good-for/ (Thu, 05 Jun 2014)

“Curioso elemento el tiempo” by leoplus, available under a Creative Commons Attribution-ShareAlike license.

The Twitter friends tool has just been improved to retrieve the time zone of users. This is actually more useful than it might first sound.

If you’ve looked at Twitter profiles before, you’ve probably noticed that users can, and sometimes do, enter anything they like as their location.

Looking at @ScraperWiki‘s followers, we can see from a small snippet of users that this can sometimes give us messy data:

...Denver. & Beyond
Hyper Island | Stockholm
London
Manchester
Niteroi, Brazil
Somerset
There's a wine blog too .....
London / Berkshire...

People may enter the same location in a number of ways, and may provide data that isn’t even a location.

Locations from time zones

If we look at users’ time zones instead, Twitter only allows users to pick from a fixed set of well-defined time zones. (There are 141 in total; I’ve collated the entire set here.) The data this returns is much neater, and we’d expect it typically to reflect the user’s home location:

...Abu Dhabi
Adelaide
Alaska
Almaty
America/Toronto
Amsterdam...

We find far fewer unique time zone entries than unique location entries for @ScraperWiki’s followers: there are 1586 different location entries, but just 106 time zones. If we wanted to discover which countries or regions our users are likely to be in, the time zone data would be far simpler to work with.

Furthermore, time zone data can give us insight into the likely location of Twitter users who don’t specify a location but have selected a time zone.

For ScraperWiki’s followers, we found 670 of them had an empty location and around the same number had an empty time zone. But far fewer accounts (only 255) have both of these fields empty. So, in some cases, we can now make a good guess at the location of users we previously couldn’t place from the data the tool was providing.
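
That check is straightforward to repeat on the SQLite database the tool produces. Here’s a sketch; the table and column names (twitter_followers, location, time_zone) are assumptions about the tool’s output.

import sqlite3

con = sqlite3.connect("scraperwiki.sqlite")  # the dataset downloaded from ScraperWiki

def count(where):
    # Count followers matching a SQL condition.
    sql = "SELECT COUNT(*) FROM twitter_followers WHERE " + where
    return con.execute(sql).fetchone()[0]

print("Empty location:   ", count("location IS NULL OR location = ''"))
print("Empty time zone:  ", count("time_zone IS NULL OR time_zone = ''"))
print("Both fields empty:", count("(location IS NULL OR location = '') "
                                  "AND (time_zone IS NULL OR time_zone = '')"))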

We’re always working to improve the Twitter tools! If you have ideas for features you’d like to see, let us know!

Our new US stock market tool
https://blog.scraperwiki.com/2014/05/our-new-us-stock-market-tool/ (Wed, 14 May 2014)

In a recent blog post, Ian talked about getting stock market data into Tableau using our Code in a Browser tool. We thought this was so useful that we’ve wrapped it up into an easy-to-use tool. Now you can get stock data by pressing a button and choosing the stocks you’re interested in, no code required!

All you have to do is enter some comma-separated ticker symbols, for example AAPL,FB,MSFT,TWTR, and then press the Get Stocks button to collect all the data that’s available. Once you’ve set the tool running, the data automatically updates daily. Just as with any other ScraperWiki dataset, you can view it in a table, query it with SQL or download it as a spreadsheet for use elsewhere. With our new OData connector, you can also import the data directly into Tableau.

You can see Ian demonstrating the use of the US stock market tool, and using the OData tool to connect to Tableau in this YouTube video:

Getting Twitter connections
https://blog.scraperwiki.com/2014/04/getting-twitter-connections/ (Tue, 01 Apr 2014)

Introducing the Get Twitter Friends tool

“Chain Linkage” by Max Klingensmith is licensed under CC BY-ND 2.0

Our Twitter followers tool is one of our most popular: enter a Twitter username and it scrapes the followers of that account.

We were often asked whether it’s possible not only to get the users that follow a particular account, but also the users that account follows. It’s a great idea, so we’ve developed a tool to do just that, which we’re testing before we roll it out to all users very soon.

The new “Get Twitter friends” tool is as simple as ever to use. The difference is that when you look at the results with the “View in a table” tool now, you’ll see two tables: twitter_followers and twitter_following. Together, these show you all of a user’s Twitter connections.

How close is @ScraperWiki to our followers?

[Screenshot: our lovely Twitter followers 🙂]

With this new tool, we can get an idea of how well connected any Twitter user is to their followers. For instance, how many of ScraperWiki’s followers does ScraperWiki follow back?

Using the “Download as a spreadsheet” tool, we can import the data into Excel. Using filters, we can discover how many users appear in both lists: those who follow the account and those it follows. Alternatively, if you’re intrepid enough to use SQL, you can perform this query directly on ScraperWiki’s platform using the “Query with SQL” tool:

SELECT COUNT(*) FROM
(
SELECT id FROM twitter_followers
INTERSECT
SELECT id FROM twitter_following
);

This gives 774, over half of the total number of users that @ScraperWiki follows. We’re certainly interested in what many of our own followers are doing!

Finding new users to follow

To get suggestions for other Twitter users to look out for, you could track which accounts the users you’re particularly interested in are following.

For instance, Tableau make one of the data visualisation packages of choice here at ScraperWiki, and we’re fans of O’Reilly’s Strata conferences too. Which Twitter accounts do both of these accounts follow that ScraperWiki isn’t already following?

It wasn’t too tricky to answer this question using SQL queries on the SQLite databases that the Twitter tool outputs. (Again, you could download the data as spreadsheets and use Excel to do your analysis.)
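
For the curious, here’s roughly how that query could be put together with sqlite3, attaching the three datasets’ databases; the file names and the screen_name column are assumptions, but the twitter_following table is the one the tool produces.

import sqlite3

# Attach the three datasets: who @tableau, @StrataConf and @ScraperWiki follow.
con = sqlite3.connect("tableau.sqlite")
con.execute("ATTACH DATABASE 'strataconf.sqlite' AS strata")
con.execute("ATTACH DATABASE 'scraperwiki.sqlite' AS sw")

# Followed by both @tableau and @StrataConf, minus anyone @ScraperWiki follows.
query = """
SELECT screen_name FROM twitter_following
INTERSECT
SELECT screen_name FROM strata.twitter_following
EXCEPT
SELECT screen_name FROM sw.twitter_following
"""

for (screen_name,) in con.execute(query):
    print(screen_name)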

It turns out that there were 86 accounts that are followed by @tableau and @StrataConf, but not by @ScraperWiki already. Almost all of these are working with data. There are individuals like Hadley Wickham, responsible for R’s ggplot2, and Doug Cutting, who’s one of the creators of Hadoop. And there are businesses like Gnip and Teradata, all relevant suggestions.

[Screenshot: Twitter accounts followed by @tableau and @StrataConf, but not by @ScraperWiki]

Followed by @tableau and @StrataConf; not by us… yet!

It’s also possible to easily sort these results by user follower count. This lets you see the most popular Twitter accounts that probably belong to companies or people who are prominent in their field. At the same time, you might want to track accounts with relatively few followers: in our example, if both @tableau and @StrataConf are following them, then no doubt they’re doing something interesting.

Want to try it?

We’ve released it to all users now, so visit your ScraperWiki datahub, create a new dataset and add the tool!

(Machine) Learning about ScraperWiki’s Twitter followers
https://blog.scraperwiki.com/2013/12/machine-learning-about-scraperwikis-twitter-followers/ (Tue, 03 Dec 2013)

Machine learning is commonly used these days. Even if you haven’t used it directly, you’ve almost certainly encountered it. From checking your credit card purchases to prevent fraudulent transactions, to sites like Amazon or IMDB suggesting things you might like, it’s a way of making sense of the large amounts of data that are increasingly accessible.

Supervised learning involves taking a set of data to which you have assigned labels and training a classifier on it. This classifier can then be applied to similar data where the labels (or classes) are unknown. Unsupervised learning is where we let machine learning cluster our data for us, and hence identify classes automatically.

A frequently used demonstration is the automatic identification of different plant species. The measurements of parts of their flowers are the data and the species is equivalent to a label or class designation. It’s easy to see how these methods can be extended to the business world, for example:

  • Given certain things I know about a manufacturing process, how do I best configure my production line to minimise defects in my product?
  • Given certain things I know about one of my customers, how likely are they to take up an offer I want to make them?

You might have a list of potential customers who have signed up to a newsletter, providing you with some profile information and a set of outcomes: those customers from the list who have bought your product. You then train your classifier to identify likely spenders based on the profiles of those you know of already. When new people sign up to your mailing list, you can then evaluate them with your trained classifier to discover if they are likely customers and thus how much time you should lavish on them.

Scraping our Twitter followers

As ScraperWiki’s launching a fantastic new PDF scraping product that’s of interest to businesses, we wondered whether we could apply machine learning to find out which of our Twitter followers link to businesses in their account profiles.

First, we scraped the followers of ScraperWiki’s Twitter account. With the aid of a Python script, we used the expandurl API to convert the Twitter follower URLs from the shortened Twitter t.co form to their ultimate destination, and then we scraped a page from each site.
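
We used the expandurl API for the unshortening; as an illustration of the idea, requests can achieve much the same by simply following the redirects itself (the example t.co link below is a placeholder):

import requests

def expand_and_fetch(short_url):
    # Follow the t.co redirect chain and return the final URL plus its HTML.
    response = requests.get(short_url, allow_redirects=True, timeout=30)
    return response.url, response.text

final_url, html = expand_and_fetch("https://t.co/xxxxxxx")  # placeholder link
print(final_url)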

Building a classifier

For our classifier, we used the content we’d scraped from each website, along with a classification of business or not business for each site.

We split the follower URLs into around a thousand sites to build the classifier with and around eight hundred sites to actually try the classifier on. Ian and I spent several hours classifying the websites linked to by ScraperWiki’s followers to see whether they appeared to be businesses or not. With these classifications collated and the sites of interest scraped, we could feed the processed HTML content into a scikit-learn classifier.

We used a linear support vector classifier which was simple to create. The more challenging part is actually deciding on what features to retrieve from each website and storing this in an appropriate matrix of features.
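
The post doesn’t spell out the exact features we used, but as a minimal sketch, TF-IDF weighted word counts from the page text feeding a linear SVC would look something like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_classifier(texts, labels):
    # texts: processed page content strings; labels: 1 for business, 0 for not.
    classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
    classifier.fit(texts, labels)
    return classifier

# classifier = train_classifier(training_texts, training_labels)
# predictions = classifier.predict(unlabelled_texts)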

Automatically classifying followers

Classifying one thousand followers by hand was an arduous day of toil. By contrast, running the classifier was far more pleasant: once started, it happily went off and classified the remaining eight hundred accounts for us. The lift curve below shows how the classifier actually performed on the sites we tested it on (red curve), compared to the expected performance if we simply took a random sample of sites (blue line), as we work through the list of websites.

Lift curve showing how the business classifier performs.

Looking at the top 25% of the classifier’s predictions, we actually find the majority of the businesses. To confirm each one of these predictions by hand would take us just a quarter of the time it would take to look through the entire set, yet we’d find 70% of all of the businesses, a big time saving.
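
A lift curve like the one above can be computed by ranking the test sites by classifier score and tracking the cumulative fraction of businesses found; a sketch, assuming 0/1 labels and scores from the classifier’s decision function:

import numpy as np

def lift_curve(y_true, scores):
    # Rank sites by classifier score, most business-like first.
    order = np.argsort(scores)[::-1]
    hits = np.asarray(y_true)[order]
    # Cumulative fraction of all businesses found as we work down the ranking.
    found = np.cumsum(hits) / hits.sum()
    examined = np.arange(1, len(hits) + 1) / len(hits)
    return examined, found

# e.g. scores = classifier.decision_function(test_texts)
# examined, found = lift_curve(test_labels, scores)
# found[examined <= 0.25][-1]  # fraction of businesses in the top 25% of predictions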

Feeding the classifier’s predictions into the Contact Details Tool is then a convenient workflow to help us figure out which of our followers are businesses and then how we could go and contact them.

If you’d like to chat to us to see how this type of classification could help your business, get in touch with us and let us know what you’re interested in discovering!

Finding contact details from websites
https://blog.scraperwiki.com/2013/11/finding-contact-details-from-websites/ (Mon, 25 Nov 2013)

Since August, I’ve been an intern at ScraperWiki. Unfortunately, that time is shortly coming to an end. Over the past few months, I’ve learnt a huge amount. I’ve been surprised at just how fast-moving things are in a startup, and I’ve been involved with several exciting projects. Before the internship ends, I thought it would be a good time to blog about some of them.

Visiting a company website to find, say, an email address or telephone number can be a frustrating experience if it’s not immediately clear where that information is. If you’re carrying out a research project and have a list of companies or organisations to repeat this process for, it becomes tedious very quickly.

Enter the Contact Details Tool!

User interface of the Contact Details Tool

As its straightforward name suggests, the Contact Details Tool is designed to get contact details from websites automatically. The user interface is definitely still a prototype (a barebones Bootstrap frontend), but perfectly functional! All we need to do is type in website URLs and then click “Get Details”. In this case, let’s suppose that we want to find out about ScraperWiki itself (yes, very meta).

A few seconds (and some behind the scenes scraping) later, we get back a table of results:

Contacts extracted using ScraperWiki's Contact Details Tool

Addresses extracted using ScraperWiki's Contact Details Tool

Everything you need to say hello to us in just about every way possible: by Twitter, email, telephone or maybe even, if you’re in Liverpool, in person!

With a couple of clicks, these results can be downloaded directly as a spreadsheet for offline use. You can also run quick searches, or more complicated SQL queries, on the results directly on ScraperWiki’s platform to carry out any further filtering. For instance, we might only be interested in the Twitter accounts we’ve retrieved. (If you’re then interested in more about particular Twitter users, you can search for their tweets or find their Twitter followers using ScraperWiki’s other tools.)

Where the Contact Details Tool really starts saving you time is when you need information from several websites. Instead of sitting there having to scrape websites by hand or tediously conducting lots of internet searches, you can just enter the URLs of interest and let the tool do the work for you.
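
This isn’t how the Contact Details Tool itself is implemented (that isn’t described here), but as a rough sketch of the idea, even a couple of regular expressions over each page get you surprisingly far:

import re

import requests

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TWITTER_RE = re.compile(r"twitter\.com/([A-Za-z0-9_]{1,15})")

def contact_details(url):
    # Fetch the page and pull out anything that looks like an email or Twitter handle.
    html = requests.get(url, timeout=30).text
    return {
        "url": url,
        "emails": sorted(set(EMAIL_RE.findall(html))),
        "twitter": sorted(set(TWITTER_RE.findall(html))),
    }

for site in ["https://scraperwiki.com", "https://example.com"]:  # your list of URLs
    print(contact_details(site))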

For a short project that Sean (another intern) and I put together in a few weeks, with adult supervision from Zarino, it’s been great to see a prototype product go from concept to reality and, moreover, prove useful.

As you’ve seen, it’s easy to contact us! So, if you’re interested in what the Contact Details Tool can do for you, please send us an email and we’ll get back in touch.

Hi, I’m Steve
https://blog.scraperwiki.com/2013/09/hi-im-steve/ (Mon, 02 Sep 2013)

Hi, I’m Steve and I’m the most recent addition to ScraperWiki’s burgeoning intern ranks. So, how exactly did I end up here?

Looking at ScraperWiki’s team page, you can see that scientists are a common theme here. I’m no different in that regard. Until recently, I was working as a university research scientist (looking at new biomedical materials).

As much as I’ve enjoyed that work, I began to wonder what other problems I could tackle with my scientific training. I’ve always had a big interest in technology. And, thanks to the advent of free online courses from the likes of edX and Coursera, I’ve recently become more involved with programming. When I heard about data science a few months ago, it seemed like it might be an ideal career for me, using skills from both of these fields.

Having written a web scraper myself to help in my job searching, I had some idea of what that involves. I’d also previously seen ScraperWiki’s site while reading about scrapers. When I heard that ScraperWiki were advertising for a data science intern, I knew it would be a great chance to gain a greater insight into what this work entails.

Since I didn’t have any prior notions of what working in a technology company or a startup involves, I’m pleased that it’s been so enjoyable. From an outsider coming in, there are many positive aspects of how the company works:

ScraperWiki is small (but perfectly formed): the fact that everyone is based in the same office makes it easy to put a question directly to the most relevant person. Even when people are working remotely, they’re in contact via the company’s Internet Relay Chat channel or through Google Hangouts. This also means that I’m seeing both sides of the company: what the Data Services team do, and the ongoing work to improve the platform.

Help’s on hand: having knowledgeable and experienced people around in the office is a huge benefit when I encounter a technical problem, even if it’s not related to ScraperWiki. When I’m struggling to find a solution myself, I can always ask and get a quick response.

There’s lots of collaboration: pair programming is a great way to pick up new skills. Rather than struggling to get started with, say, some new module or approach, you can see someone else start working with it and pick up tips to push you past the initial inertia of trying something new.

And there’s independence too: as well as working with others on what they are doing and trying to help where I can, I’ve also been given some small projects of my own. Even in the short time I’m here, I should be able to construct some useful tools that might be made publicly available via ScraperWiki’s platform.

(Oh, I shouldn’t miss out refreshments: as Matthew, another intern, recently pointed out, lunch often involves a fun outing to one of Liverpool’s many fine eateries. As well as that, tea is a regular office staple.)

It’s definitely been an interesting couple of weeks for me here. You can usually see what I’m up to via Twitter or my own blog. Over the next few weeks, I’m looking forward to writing here again about what I’ve been working on.
