The recipes are written in a special language. They look like this:
from databaker.constants import * def per_file(tableset): return tableset def per_tab(tab): obs = tab.filter("MGSL").assert_one().shift(DOWN).fill(RIGHT).fill(DOWN).is_number().is_not_italic() tab.col('A').one_of(['Male', 'Female', 'All Persons']).dimension('gender', CLOSEST, ABOVE) tab.col('A').is_date().dimension(TIME, DIRECTLY, LEFT) tab.regex("All aged .*").dimension('ages', CLOSEST, ABOVE) tab.filter("Total economically active").fill(LEFT).fill(RIGHT).is_not_blank().dimension('indicator_', DIRECTLY, ABOVE) return obs
You can just edit or copy recipes to make basic changes. If you need to do anything more complicated, it is worth knowing they are written in the Python programming language. There's lots of online help and training for Python.
from databaker.constants import *
This is just boilerplate: it needs to be at the start of each file.
def per_file(tableset): ... return wanted_sheets
Here we say which tabs in the Excel spreadsheet we're interested in.
- Explicit names of tables as strings
def per_file(tabs): return "My Table" # one table called 'My Table' # or return ["This Table", "That Table"] # two tables with specific names # or return "*" # all tables
- Using a list of table names to inform decisions:
def per_file(tabs): tablist = tabs.names # get a list of names tablist.discard('Notes') # remove names from list tablist.discard('CV notes') return tablist
- Returning the tables themselves
def per_file(tabs): return tabs # all tables
This last one is potentially useful for selecting tables based on the contents of their cells:
def per_file(tabs): for tab in tabs: if tab.excel_ref("A1").value == '2015': yield tab
def per_tab(tab): .... return observations
As sugggested by the name, this code is run for every tab in the spreadsheet. It needs to return the observations and tell the tab which cells are which headers.
These are done with:
def per_tab(tab): tab.dimension('table_header', selector().value) # get text describing sheet - dark blue tab.selector().dimension(GEOG, DIRECTLY, LEFT) # places - red tab.selector().dimension('top_header', DIRECTLY, ABOVE) # what the number is - green obs = tab.selector() # observations - yellow return obs
(in this example,
selector() isn't an actual
databaker function. It's
a placeholder for functions that pick relevant cells, e.g.
.is_not_italic(). For real selectors, see the
You can return multiple segments by using
yield. This will lead to the previous headers being wiped; you can use variables to avoid having to recalculate the unchanged headers.
The output probably won't be in the correct order, however: each tab is processed in turn, rather than each segment. You can use
sortcsv --ons <csvfile> to change the file in-place.
For dimensions that are the same for all observations on a tab, a literal string can be used. (This might be
tab.name, the name of the tab, or a selector which selects a single cell followed by
.dimension(name, method, direction)
For dimensions where the value for a particular observation is in a single cell, we need to say how to get from the observations to the headers: we need to specify whether we stay in the same column or row (
DIRECTLY) or just need to get the closest in that direction, regardless (
CLOSEST). (Where it isn't, see
The region examined for headers typically includes the row/column the observation is in.
CLOSEST changes which cell is selected:
This does lead to slightly counterintuitive behaviour when selecting
CLOSEST cells which are off in one direction from the observations.
For the table below, we'd look for lowercase letters using
DIRECTLY, LEFT, but for the uppercase letters with
[above: to look
DIRECTLY in the same row, you need the closest cell to the
[above: whereas the
CLOSEST is found at the closest relevant cell
Sometimes the value of the dimension isn't all in the same cell: the
TIME for the cell below should be "2012 Q3" but they're in different cells.
We need subdimensions to pick out the different locations for the text. (We can't
.glue the cells together since we need the year in multiple places)
This would look like:
tab.dimension(DATE, [years.subdim(CLOSEST, ABOVE), quarters.subdim(DIRECTLY, LEFT)])
Note that subdimensions are exactly the same as normal dimensions except:
- they don't have a label (i.e. their first parameter isn't
YEARor a column name)
- they don't appear independently in the CSV output, only joined together.
- they're not chainable:
.subdim()doesn't return a bag.
For example, `tab.subdim('literal string') is also a valid subdimension.
Theoretically this might mean you can nest subdimensions, but this hasn't been tested.
There is also an optional
join_function parameter which controls how the subdimension values are concatenated. By default this inserts a space between each pair (i.e.
join_function = ' '.join).
Dimensions such as
GEOG have special places in the output CSV format.
Dimensions which are strings are appended to the end of each row in order.
If an observation isn't a number,
DATAMARKER is automatically populated with the value instead, and the observation field is left empty.
This won't happen if the
DATAMARKER field isn't blank; you'll get a warning instead.
TIMEUNIT is automatically populated if
TIME contains a well-formatted date-string. (Doesn't work on Excel date-formatted items, yet. Yell if that's a problem.)
Well formatted date string examples:
2000 Q1 and
Jan-Mar 2000 (Quarter);
Jan 2000 (Month).
STATUNIT have no special properties above going in particular columns in the output.