By Jonathan Ackerman, Solution Architect, ClearPoint

Houston, we have a problem

Reference data is everywhere, and for most organisations it is a major chore to deal with. Not only do we have to ensure that it is accurate, we also have to ensure that it is available to the myriad systems that need it. This often leads to a raft of different replication, caching and manual “persontronic” strategies that are expensive to maintain and often fail.

And the risk of inaccurate reference data can be surprisingly high. A 2005 study estimated that up to 70% of financial transactions contain reference data, and that the 15 largest US-based financial organisations each lost between $238 million and $1,242 million annually due to bad reference data!

Obviously, having one source of the truth, that all systems can access, can greatly cut down on these issues and it makes reference data maintenance much easier. Luckily, we have proven architecture patterns to use here in the form of REST APIs and microservices.

Unfortunately, this centralisation uncovers another problem. Not all clients of our reference data need to use it in the same way.

This means that we either:

  • create a ton of client-specific RESTful endpoints that are expensive to maintain
  • or build a generic set of endpoints that clients will need to navigate, extracting and merging data as they go. But this is inefficient, slow and error-prone.
  • or, perhaps the worst solution of all, expose an endpoint that dumps out a big JSON blob and leave it up to the client to deal with it

So to summarise:

  • We want to know what “version” of the reference data is where – this will keep the auditors happy and ensure that all the systems are singing from the same reference data. Ideally we want this all in one place.
  • We want it to be efficient, fast and easy for systems to consume the reference data
  • Ideally, we don’t want to write, and more importantly, maintain any code

Sounds simple, doesn’t it? No? Well, there is a cunning plan that can make it so…

The cunning plan

Firstly, let’s containerise our reference data. If our reference data is held in an immutable, labeled Docker image, we can meet our versioning requirements. If the data changes, we just build a new version of the container and deploy it. It also means that we are maintaining our reference data all in one place. That will keep the auditors happy.

This also helps with our “fast” requirement: we can deploy our container into a nice auto-scaling cluster to make sure that the data is always available and delivery is always speedy.

The containerisation approach can also help if for some reason the reference data cannot be centrally deployed. In this case, different systems can deploy their own instance of the reference data, pulling the same container from a central container repository.

Right, so we have solved the auditing, versioning, maintenance and replication problems. Now we need to solve our expensive-to-maintain, hard-to-design service endpoint problem.

For this we are going to use the awesome GraphQL. GraphQL is a query language for your API. You can think of it as SQL for data exposed via APIs. This analogy is selling GraphQL a bit short, but it is a good starting point to understanding what it is.

By exposing our reference data using GraphQL, clients can craft requests for the data in the exact form they need, navigating relationships in the data (the graph) and requesting only the fields they need (the query). This can often be done in a single round-trip, aiding our quest for efficient and fast. It also means we only need to create a single GraphQL endpoint.

Now for the last piece of the cunning plan. How can we do this without any coding? And the answer is… we are going to use the fantastic PostGraphile tool. This open source, MIT licensed tool takes a PostgreSQL database and automagically wraps it with GraphQL. “No code” achievement unlocked!

Ok, so our plan of attack is this:

  • Create a PostgreSQL database containing our reference data
  • Wrap the database with PostGraphile
  • Bake all of this into a Docker container
  • And finally sprinkle with some automation to help us reach reference data Nirvana

Putting it into practice

This GitHub repo has a full-blown example, but I’ll walk you through the interesting bits…

First off the data. Let’s say that for this example the data set is countries, cities and airports of the world. This sample data was extracted from the OpenFlights dataset (see their license).

In the data set, airports are mapped to cities and countries, and cities are mapped to countries. This data has been dumped into three .csv files, which form the master source for our reference data:
Snippet of countries.csv

PP,Papua New Guinea

Snippet of cities.csv

2,PP,Mount Hagen,Pacific/Port_Moresby

Snippet of airports.csv

0,Goroka Airport,0,PP,GKA,AYGA,-6.0816898345900015,145.391998291,5282,Pacific/Port_Moresby
1,Madang Airport,1,PP,MAG,AYMD,-5.20707988739,145.789001465,20,Pacific/Port_Moresby
2,Mount Hagen Kagamuga Airport,2,PP,HGU,AYMH,-5.826789855957031,144.29600524902344,5388,Pacific/Port_Moresby

Next up we have setupdatabase.sql. This script creates the PostgreSQL database tables (the cities table is shown here):

CREATE TABLE cities (
    id int PRIMARY KEY,
    country_code varchar(2) NOT NULL REFERENCES countries(country_code),
    name varchar(100) NOT NULL,
    timezone_id varchar(30) NOT NULL
);

And loads the .csv files into the tables:

COPY cities
FROM '/tmp/cities.csv' DELIMITER ',' CSV NULL AS '\N';
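For completeness, here is a sketch of how the countries table (which cities references) might be created and loaded. The column sizes are my assumptions; the real script may differ:

```sql
-- Hypothetical sketch of the countries table, matching the
-- two-column countries.csv (country_code, name)
CREATE TABLE countries (
    country_code varchar(2) PRIMARY KEY,
    name varchar(100) NOT NULL
);

COPY countries
FROM '/tmp/countries.csv' DELIMITER ',' CSV NULL AS '\N';
```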

The next moving part is the startup script. This is used to start up PostgreSQL and PostGraphile in the container. You’ll notice a bunch of interesting command-line options being passed to PostGraphile; I’ll cover these below.

service postgresql start
postgraphile -n -c postgres://graphqlref:graphqlref@ --no-ignore-rbac --disable-default-mutations --dynamic-json --append-plugins @graphile-contrib/pg-simplify-inflector --enhance-graphiql

The last part of the puzzle is the Dockerfile that creates the container. There isn’t anything out of the ordinary here. It just installs the required packages and copies the required files into the container. The only slightly odd step is that it fires up PostgreSQL during the build and populates the reference data database. This allows the container to start up with a pre-baked database:

USER postgres
RUN service postgresql start && \
    psql -f /tmp/setupdatabase.sql
USER root

Now all we need to do is build and run the container. We can do this locally to have a play:

docker build -t graphqlref .
docker run --rm -it -p 5000:5000 graphqlref

Browse to http://localhost:5000/graphiql

This will bring up a UI that you can use to create and test your GraphQL queries:

PostGraphile Web UI

Because we used the --enhance-graphiql parameter for PostGraphile, we get the explorer panel on the left, which lets us use point-and-click to construct our queries. You will also see that we get very natural names for the entities in our graph based on our database tables. This is the work of the --append-plugins @graphile-contrib/pg-simplify-inflector parameter, which activates the simplify inflector plugin.

Explorer panel

Show me the data

Let’s write our first query and grab all the cities in New Zealand from our dataset. To do this, we will start by retrieving the country with a code of NZ and then navigate the country-to-cities relationship. We can do this by pointing and clicking in the explorer, or by typing the query into the query field:

Example query - cities in New Zealand
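In text form, the query looks something like this. Treat it as a sketch: the exact field names depend on the inflector, so `country` and the connection's `nodes` wrapper are my assumptions:

```graphql
# Look up New Zealand by its primary key, then follow the
# country -> cities relationship
{
  country(countryCode: "NZ") {
    name
    cities {
      nodes {
        name
      }
    }
  }
}
```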

And when we run this query we get:

Query results - cities in New Zealand

If you look closely at the explorer you will notice a bunch of fields such as countryByName, airportByIataCode and airportByIcaoCode. These are automatically created by PostGraphile because those database columns are defined as UNIQUE in the database. For an example of how these mappings happen, check out this part of the PostGraphile documentation.

In our case, we could use countryByName to write our query like this instead:

Example query - cities in New Zealand using country name
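As a sketch (again, the field names are my assumptions based on the naming conventions above):

```graphql
# Look up the country via its UNIQUE name column instead
{
  countryByName(name: "New Zealand") {
    countryCode
    cities {
      nodes {
        name
      }
    }
  }
}
```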

Notice that in this example, I have also requested the countryCode of the country along with the list of cities.

Speaking of which, that country code field is a good example of the flexibility of PostGraphile. By default, it expects primary keys to be named id and foreign key columns to be named table_id. This allows it to generate nice, simplified names. However, this is not the case with our countries table, which has a primary key named country_code.

PostGraphile provides a host of features to work around the peculiarities of SQL databases. The simplest is using smart comments to give PostGraphile hints on how we want it to do things.

In our country code case, if you look in the setupdatabase.sql script you will see these two lines:

COMMENT ON CONSTRAINT "cities_country_code_fkey" on "cities" is E'@foreignFieldName cities\n@fieldName country';
COMMENT ON CONSTRAINT "airports_country_code_fkey" on "airports" is E'@foreignFieldName airports\n@fieldName country';

These give PostGraphile the hints it needs to generate the nice, simplified names we are after.

Ok, let’s do one more example query before moving on. This one navigates “up” the tree from an airport, pulling related city and country data out of the database:

Example query - Auckland Airport Information
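A sketch of what such an “upwards” query could look like, assuming an airportByIataCode lookup and the city and country field names produced by the smart comments:

```graphql
# Start at the airport and navigate up to its city and country
{
  airportByIataCode(iataCode: "AKL") {
    name
    city {
      name
    }
    country {
      name
    }
  }
}
```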

Obviously, I have just touched on some of the features of GraphQL here. There are all sorts of filters that can be applied to do pagination, sorting and other conditions. In fact, there is another PostGraphile plugin that could be particularly useful for querying reference data. It is the connection filter plugin, which adds a host of new conditional filters such as “contains”, “greaterThan” and “startsWith”. There are even plugins to do geospatial queries using the PostGIS extension for PostgreSQL. Using this, we could, for example, query for “nearby” airports using a GPS position.
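For illustration, with the connection filter plugin installed, a filtered query might look something like this. The filter argument shape comes from that plugin, and the field names are assumptions:

```graphql
# Find airports whose name starts with "Auckland"
{
  airports(filter: { name: { startsWith: "Auckland" } }) {
    nodes {
      name
      iataCode
    }
  }
}
```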

That’s a wrap

Before wrapping up, I just want to touch on some other key parameters I pass to PostGraphile in the startup script.

The first is --disable-default-mutations. This stops PostGraphile from creating its default CRUD mutations, making the reference data read-only (yep, GraphQL can be used to update data too).

Slightly related is the --no-ignore-rbac parameter. This one ensures that PostGraphile honours the database permissions when exposing data from the database. PostGraphile can also use PostgreSQL Row Level Security policies along with JWT tokens to further lock down access.
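As an illustration of the RBAC idea, the database role that PostGraphile connects as could be granted read-only access. This is a hypothetical sketch; the role and schema names are my assumptions:

```sql
-- Hypothetical read-only role; with --no-ignore-rbac, PostGraphile
-- will only expose what this role is allowed to SELECT
CREATE ROLE ref_reader NOLOGIN;
GRANT USAGE ON SCHEMA public TO ref_reader;
GRANT SELECT ON countries, cities, airports TO ref_reader;
```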

Lastly, left as an exercise for the reader, is automation. You can easily imagine a process where a pull request is raised for a change to some reference data. Once approved and merged, a continuous integration server would then build, label and publish a container to a Docker repo. After that, an automated continuous deployment process would kick off to deploy the new reference data throughout the organisation. Applying automated acceptance tests, load tests and security tests along the way, of course.

And there we have it; reference data nirvana!


Get in touch with us if you need any help or advice on your own reference data, containerisation, continuous delivery pipelines or Cloud friendly architectures.