Table Xtract – ScraperWiki: Extract tables from PDFs and scrape the web

Hiding invisible text in Table Xtract
https://blog.scraperwiki.com/2014/05/hiding-invisible-text-in-table-xtract/ – Mon, 19 May 2014

As part of my London Underground visualisation project I wanted to get data out of a table on Wikipedia; you can see it below. It contains data on every London Underground station, including things like the name of the station, the opening date, which zone it is in, how many passengers travel through it, and so forth.

[Image: Wikipedia's table of London Underground stations]

Such tables can be copy-pasted directly into Excel or Tableau, but the result is a mess, with extraneous lines and so forth that need manual editing before we can work with the data. Alternatively, we can use the ScraperWiki Table Xtract tool to get the data in rather cleaner form; you can see the result of doing this below. It looks pretty good: the Station name and Lines columns come out nicely, there is only one row per station, and there are no blank rows. But something weird is going on in the numeric and date fields – characters have been appended to the data we can see in the table.

[Image: the same table extracted with Table Xtract, showing extra characters in the numeric and date columns]

It turns out these extra characters are the result of invisible text added to the tables to make them sortable by those columns. This "invisible" text can be seen by inspecting the source of the HTML page. There are various ways of making text invisible on a web page, but Wikipedia seems to use just one in its sortable tables. Once I had identified the issue it was just a case of writing some code to hide the invisible text in the Table Xtract tool. To do this I modified the messytables library on which Table Xtract is built; you can see the modification here. The stringent code review requirements at ScraperWiki meant I had two goes at making the change!
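The actual fix lives in the messytables HTML handling linked above, but the idea can be sketched in a few lines of Python with lxml: drop any element hidden with an inline display:none style before reading a cell's text. This is a minimal sketch of the technique, not the messytables patch itself; the file name and the wikitable class selector are just illustrative.

```python
from lxml import html

def visible_cell_text(cell):
    # Wikipedia's sortable tables hide sort keys in elements styled display:none;
    # remove them before reading the visible text of the cell.
    for hidden in cell.xpath('.//*[contains(@style, "display:none")]'):
        hidden.drop_tree()
    return cell.text_content().strip()

doc = html.parse("london_underground_stations.html")
table = doc.xpath('//table[contains(@class, "wikitable")]')[0]
for row in table.xpath('.//tr'):
    print([visible_cell_text(cell) for cell in row.xpath('./th|./td')])
```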

You can see the result in the screenshot below: the Opened, Mainline Opened and Usage columns are now free of extraneous text. This fix should apply across Wikipedia, and also to tables on other web pages which use the same method to make text invisible.

[Image: the extracted table after the fix, with the Opened, Mainline Opened and Usage columns cleaned up]

We’re keen to incrementally improve our tools, so if there’s a little fix to any of our tools that you want us to make then please let us know!

The Tyranny of the PDF
https://blog.scraperwiki.com/2013/12/the-tyranny-of-the-pdf/ – Fri, 27 Dec 2013

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

Why is ScraperWiki so interested in PDF files?

Because the world is full of PDF files. The treemap above shows the scale of their dominance: the area a segment covers is proportional to the number of examples we found. We found 508,000,000 PDF files, over 70% of the non-HTML web. The treemap was constructed by running Google queries for different file types (e.g. filetype:csv). After filtering out the basic underpinnings of the internet (HTML, HTM, aspx, php and so forth) we are left with hosted files: the sorts of files you might also find on your local hard drive or on corporate networks.

These are just the files that Google has indexed. There are likely to be many, many more in private silos such as company reports databases, academic journal archives, bank statements, credit card bills, material safety data sheets, product catalogues, product specifications…

The number of PDFs on the indexed internet amounts to one for every five or so people with internet access, but I know that personally I am responsible for another couple of hundred which are unique to me – mainly bank and credit card statements. That's without mentioning the internal company reports I've generated, and the numerous receipts and invoices.

So we’ve established there’s an awful lot of PDF files in the world.

Why are there so many of them?

The first version of PDF was released in 1993, around the birth of the web. It is a mechanism for describing the appearance of a page on screen or on paper. It does not care about the semantics of the content: the titles, paragraphs, footnotes and so forth. It just cares about how the page looks, and how a page looks is important. For its original application it is the reader's job to parse the structure of the document; the job of the PDF document is solely to look consistent on a range of devices.

PDF documents are about appearance.

They are also about content control: there are explicit mechanisms in the format to control access with passwords and to limit the ability to copy or even print the content. PDF is designed as a read-only medium; the intention of a PDF document is that you should read it, not edit it. There is a variant, PDF/A, designed explicitly as an archival format. For publishers who wish to limit the uses to which the data they sell can be put, PDF is therefore an ideal format: ubiquitous, high-fidelity, but difficult to re-purpose.

PDF documents are about content control.

These are the reasons the format is so common: it is optimised for appearance and for read-only use, and there are a lot of people who want to generate such content.

What’s in them?

PDF files contain all manner of content. There are scientific reports, bank statements, the verbatim transcripts of the UN assembly, product information, football scores and statistics, insurance policy documents, SEC filings, planning applications, traffic surveys, government information revealed by Freedom of Information requests, the membership records of learned societies going back for hundreds of years, the financial information of public, private and charitable bodies, climate records…

We can divide the contents of PDFs crudely into free-form text and structured data: data in things that look like tables or database records. Sometimes these contents are mixed together in the same document; for example, a scientific paper or a report may contain both text and tables of data. Sometimes PDF files contain only tables of data. And even apparently free-form text can be quite structured.

The point about structured contents is that they are ripe for re-use, if only they can be prised from the grip of a format that is solely interested in appearance.

Our customers are particularly interested in PDF files which contain tables of data: things like production data for crops, component specifications, or election results. But even free-form text can be useful once liberated. For example, you may wish to mine the text of your company reports in order to draw previously unseen connections using machine learning algorithms. Or, to take the UN transcripts as an example, we can add value to the PDF by restructuring the contents to record who said what, when, and how they voted, in a manner that is far more accessible than the original document.

What’s the problem?

The problem with PDF files is that they describe the appearance of the page but do not mark up the logical content. Current mainstream tools allow you to search for text inside a PDF file, but they provide little context for that text and, if you are interested in numbers, search tools are of little use to you at all.

In general PDF documents don't even know about words: a program that converts PDF to text has to reconstruct the words and sentences from the raw positions of the letters found in the PDF file.
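To see what that reconstruction yields, here is a minimal sketch using the pdfminer.six library (a maintained fork of the pdfminer library mentioned elsewhere in this feed); the file name is a placeholder. What comes back is a flat stream of rebuilt text with none of the document's structure attached.

```python
from pdfminer.high_level import extract_text

# extract_text rebuilds words and lines from raw character positions;
# the result is plain text with no titles, tables or paragraphs marked up.
text = extract_text("annual_report.pdf")
print(text[:500])
```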

This is why ScraperWiki is so interested in PDF files, and why we made the extraction of tables from PDF files a core feature of our Table Xtract product.

What data do you have trapped in your PDF files?

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
Time to try Table Xtract
https://blog.scraperwiki.com/2013/12/time-to-try-table-xtract/ – Wed, 18 Dec 2013

Getting data out of websites and PDFs has been a problem for years, with the default solution being the prolific use of copy and paste.

ScraperWiki has been working on a way to accurately extract tabular data from these sources and make it easy to export to Excel or CSV format. We have been internally testing and improving it with a seemingly infinite array of tables of different complexities. We're now confident enough in Table Xtract's accuracy that we are proud to put our name on it and launch a free beta version for you to try.

We are not stopping there though, as many of you are asking for the ability to create datasets from multiple files and websites and, more importantly, to refresh them when the source files are updated. Look out for a Table Xtract Enterprise announcement soon!

Try it for free!

Table Scraping Is Hard
https://blog.scraperwiki.com/2013/11/table-scraping-is-hard/ – Tue, 05 Nov 2013

The Problem

NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency. A well-known B2B publisher came to us to aggregate that data and provide them with information spanning the hundreds of different trusts, such as: who are the biggest contractors across the NHS?

It’s a common problem – there’s lots of data out there which isn’t in a nice, neat, obviously usable format, for various reasons. What we’d like to have is all that data in a single database so we can slice it by time and place, and look for patterns. There’s no magic bullet for this, yet; we’re having to solve issues at every step of the way.

Where’s the data?

A sizeable portion of the data is stored on data.gov.uk, but significant chunks were stored elsewhere, on individual NHS trust websites. Short of spidering every NHS trust website we wouldn’t have been able to find the spreadsheets, so we automated the search through Google.

Google make it difficult to scrape them – ironic, given that that’s what they do to everyone else: building a search engine means scraping websites and indexing the content. We also found that the data was held in a variety of formats: whilst most of this spending data was in spreadsheets, we also found tables in web pages and in PDFs. Usually each of these would need separate software to understand the tables, but we’ve been building new tools that let us extract tables from all these different formats, so we don’t need to worry about where a table originally came from.

The requirement from central government to provide this spending data has led to some consistency in the types of information provided, but there are still difficulties in matching up the different columns of data: Dept Family, Department family, and Departmental Family are all obviously the same thing to a human, but it’s more difficult to work out how to describe such things to a computer. Where one table has both “Gross” and “Net”, which should be matched up with another table’s “Amount”?
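Header matching like this can be approximated with a bit of normalisation and fuzzy matching, though the hard cases (Gross and Net versus Amount) still need a human decision. Here is a rough sketch, assuming an illustrative list of canonical column names; it is not the code we ran, just the shape of the approach.

```python
import difflib
import re

# Illustrative canonical schema for spend-over-£25,000 files (an assumption, not the real list).
CANONICAL = ["department family", "entity", "date", "expense type", "expense area", "supplier", "amount"]

def normalise(header):
    # Lower-case, strip punctuation, expand a common abbreviation, collapse whitespace.
    h = re.sub(r"[^a-z ]", " ", header.lower())
    h = h.replace("departmental", "department").replace("dept", "department")
    return re.sub(r"\s+", " ", h).strip()

def match_header(header, cutoff=0.6):
    # Return the closest canonical column name, or None if nothing is close enough.
    close = difflib.get_close_matches(normalise(header), CANONICAL, n=1, cutoff=cutoff)
    return close[0] if close else None

for raw in ["Dept Family", "Departmental Family", "Supplier Name", "Gross"]:
    print(raw, "->", match_header(raw))
# "Gross" comes back as None: whether it maps to "amount" is a human decision.
```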

Worse still, it’s sometimes the data in the tables that needs to be matched, company names for instance: where an entry exists for “BT” it needs to be matched to “British Telecommunications PLC”, rather than to “BT Global Services”. Doing this reliably, even with access to Companies House data, is still not as easy as it should be. Hopefully projects such as OpenCorporates, which also uses ScraperWiki, will make this an easier job in the future.

To handle the problem of providing a uniform interface to tables in PDFs, on web pages and in Excel files, we made a library which we built into our recently launched product, Table Xtract.
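The point of that uniform interface is that the calling code doesn’t care what kind of file it was given. A minimal sketch of the sort of call involved, using the messytables any_tableset entry point; the file name is a placeholder and, depending on the version, PDF input may need the pdftables extension installed alongside.

```python
from messytables import any_tableset

# any_tableset picks a parser from the file type (CSV, Excel, HTML, ...) and
# always hands back the same table_set/row_set interface.
with open("nhs_trust_spend.xls", "rb") as fh:
    table_set = any_tableset(fh, extension="xls")
    for row_set in table_set.tables:
        for row in row_set:
            print([cell.value for cell in row])
```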

Try Table Xtract for free!

pdftables – a Python library for getting tables out of PDF files
https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/ – Mon, 29 Jul 2013

Got PDFs you want to get data from?
Try our web interface and API over at PDFTables.com!

One of the top searches bringing people to the ScraperWiki blog is “how do I scrape PDFs?” The answer is typically “with difficulty”, but things are getting better all the time.

PDF is a page description format: it has no knowledge of the logical structure of a document, such as where the titles are, or the paragraphs, or whether it is laid out in one column or two. It just knows where characters are on the page. The plot below shows how characters are laid out for a large table in a PDF file.

[Plot: character (LTChar) positions for a large table in a PDF file – AlmondBoard7_LTChar]
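The data behind a plot like this can be pulled out with pdfminer; the sketch below uses the maintained pdfminer.six fork, and the file name and page number are placeholders. Each LTChar carries a single character and its bounding box, and nothing more.

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

# Walk one page of the layout tree and print every character with its bounding box.
for page_layout in extract_pages("AlmondBoard.pdf", page_numbers=[6]):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for line in element:
                for obj in line:
                    if isinstance(obj, LTChar):
                        print(repr(obj.get_text()), obj.bbox)  # (x0, y0, x1, y1) in points
```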

This makes extracting structured data from PDF a little challenging.

Don’t get me wrong, PDF is a useful format in the right place. If someone sends me a CV I expect to get it in PDF, because it’s a read-only format; send it in Microsoft Word format and the implication is that I can edit it, which makes no sense.

I’ve been parsing PDF files for a few years now: to start with using simple online PDF-to-text converters, then with pdftohtml, which gave me better location data for the text, and now using the Python pdfminer library, which extracts non-text elements as well as bonding characters into words, sentences and coherent blocks. This classification is shown in the plot below: the blue boxes show where pdfminer has joined characters together to make text boxes (which may be words or sentences), and the red boxes show lines and rectangles (i.e. non-text elements).

[Plot: pdfminer text boxes (blue) and lines/rectangles (red) for the same page – AlmondBoard7]
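The same layout tree gives you those text boxes and graphical rules directly, which is roughly what the blue and red boxes in the plot are. Another hedged sketch with pdfminer.six, with the file name and page number again placeholders:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTRect, LTLine

for page_layout in extract_pages("AlmondBoard.pdf", page_numbers=[6]):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            # A text box: characters already bonded into words and lines by pdfminer.
            print("text", element.bbox, element.get_text().strip()[:40])
        elif isinstance(element, (LTRect, LTLine)):
            # A drawn rule or rectangle: often the border of a table.
            print("rule", element.bbox)
```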

More widely at ScraperWiki, we’ve been processing PDF since our inception with the tools I’ve described above, and also with the commercial Abbyy software.

As well as processing text documents such as parliamentary proceedings, we’re also interested in tables of numbers. This is where the pdftables library comes in. We’re working towards making scrapers which are indifferent to the format in which a table is stored, receiving them via the OKFN messytables library, which takes adapters for different file types. We’ve already added support to messytables for HTML; now it’s time for PDF support, using our new, version-much-less-than-one pdftables library.

Amongst the alternatives to our own efforts are Tabula, which is written in Ruby and requires the user to draw around the target table, and Abbyy’s software, which is commercial rather than open source.

pdftables can take a file handle and tell you which pages have tables on them; it can extract the contents of a specified page as a single table, and by extension it can return all of the tables in a document (at the rate of one per page). For simple tables it’s possible to do this with no parameters, but for more difficult layouts it currently takes hints, in the form of words found on the top and bottom rows of the table you are looking for. Each table is returned as a list of rows, each row a list of strings, along with a diagnostic object which you can use to make plots. If you’re using the messytables library you just get back a tableset object.

It turns out the defining characteristic of a data scientist is that I plot things at the drop of a hat: I want to see the data I’m handling. And so it is with the development of the pdftables algorithms. The method used is inspired by image analysis algorithms, similar to the Hough transforms used in Tabula. A Hough transform will find arbitrarily oriented lines in an image, but our problem is a little simpler – we’re only interested in horizontal and vertical lines.

To find the rows and columns we project the bounding boxes of the text on a page onto the horizontal axis (to find the columns) and onto the vertical axis (to find the rows). By projection we mean counting up the number of text elements along a given horizontal or vertical line. The row and column boundaries are marked by low values, gullies, in the plot of the projection, while the rows and columns of the table form high mountains; you can see this clearly in the plot below. Here we are looking at the PDF page at the level of individual characters; the plots at the top and left show the projections, and the black dots show where pdftables has placed the row and column boundaries.

[Plot: horizontal and vertical projections of the character bounding boxes, with the detected row and column boundaries marked as black dots – AlmondBoard8_projection]
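The projection itself is simple enough to sketch outside the library. This is not the pdftables code, just the idea: count how many bounding boxes overlap each thin strip along one axis, then treat empty runs (the gullies) as row or column boundaries.

```python
import numpy as np

def find_boundaries(boxes, lo, hi, axis=0, resolution=1.0):
    """Project bounding boxes onto one axis and return boundary positions.

    boxes: iterable of (x0, y0, x1, y1) tuples in page coordinates.
    axis=0 projects onto x (column boundaries); axis=1 onto y (row boundaries).
    """
    edges = np.arange(lo, hi + resolution, resolution)
    counts = np.zeros(len(edges) - 1)
    for x0, y0, x1, y1 in boxes:
        a, b = (x0, x1) if axis == 0 else (y0, y1)
        counts += (edges[:-1] < b) & (edges[1:] > a)  # bins this box overlaps

    # Boundaries sit in the "gullies": runs of bins that no box overlaps.
    boundaries, in_gap, start = [], False, 0
    for i, empty in enumerate(counts == 0):
        if empty and not in_gap:
            in_gap, start = True, i
        elif not empty and in_gap:
            in_gap = False
            boundaries.append(lo + resolution * (start + i) / 2.0)  # middle of the gap
    return boundaries
```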

pdftables is currently useful for supervised use, but not so good if you want to just throw PDF files at it. You can find pdftables on GitHub, and you can see the functionality we are still working on in the issue tracker. Top priorities are finding more than one table on a page, and identifying multi-column text layouts to help with this process.

You’re invited to have a play and tell us what you think – ian@scraperwiki.com

Got PDFs you want to get data from?
Try our web interface and API over at PDFTables.com!