Basic geoinferencing with Shapely, GeoJSON and OpenRefine

Posted on Mon 25 February 2013 in misc


A year ago I wrote a small geocoding webservice for an organisation in Ghent, using the open data on administrative boundaries provided by the municipality of Ghent. It wrapped the geocoding service of Google, and added a bit of geoinferencing, so that each address was located within its proper statistical sector and block. This allowed, in combination with OpenRefine (former Google Refine), the organisation to batch-clean their addresses.

Trouble is, each time I fired up the webservice after a few months, I got cryptic errors relating to geospatial libraries and their Python bindings I was using. Main culprit seemed to be libgeos, of which the direct Python bindings are no longer maintained. Sometimes reinstalling the Python libraries did the trick, but the latest try this did not even work. E.g. reinstalling pyspatialite results in download errors, which should be resolvable by installing a specific version, but are not.

Admittedly, the stack the webservice used was kind of large for the basic spatial inference the webservice was doing. It used a spatially aware SQLite database (needs sqlite + pysqlite and spatialite + pyspatialite) to store the geometric objects (needs libgeos + Python bindings) defining the boundaries. Once it was running you simply did a SQL-query checking if the geocoded address was contained within the sector/block, e.g.

SELECT sector_name, sector_code FROM sectors 
WHERE ST_Contains(Geometry, MakePoint(lat,long))

Pretty standardized, extendable and fast. But in the end, running a small RDBMS (and the entire dependency-stack supporting it) was a bit overkill. I needed an as pure-Python solution as possible.

Enter Shapely

The former maintainers of the libgeos Python bindings referenced Shapely, which was a perfect fit:

"The first premise of Shapely is that Python programmers should be able to perform PostGIS type geometry operations outside of an RDBMS."

It is decently documented, and defines Python objects such as a Point and a Polygon, with methods that could test their relation, e.g. Polygon.contains(Point). This is all I need, and it only required the (maintained) C version of the GEOS library as dependency.

So the rewrite was largely reduced to getting the custom JSON or KML data provided by the municipality of Ghent into Shapely. This step however took a lot longer then expected, so I am documenting it here at length for the sake of people searching for the same errors I encountered.

Shapley focusses on geometric operations, and keeps its read/write functionality minimal, i.e. reading or writing the WKT/WKB format. My options for getting data into Shapely thus were:

  1. Converting the JSON or KML-file to WKT/WKB and reading this in with the native functions.
  2. Using fastkml to read in the KML-file and provide Shapely-objects.
  3. Convering the JSON or KML-file to GeoJSON and converting them to Shapely objects.

Before trying to read in data provided by a third party, it is useful to be sure your input is valid. You can validate the KML against the XML schema using e.g. xmllint (included with libxml):

curl -O
curl -O
xmllint --schema ogckml22.xsd sectoren.kml --noout

The original KML files contain some capitalisation errors on the elements, nothing major to fix.

The JSON files are provided by the underlying Datatank framework, and are a custom format, i.e. not GeoJSON.

Reading in spatial data

1. The WKT/WKB option

The gdal-bin package contains the very extensive ogr2ogr GIS-format conversion command. Converting the KML file to a GIS-format of choice is a CLI oneliner:

ogr2ogr -f CSV wijken.csv wijken.kml -lco GEOMETRY=AS_WKT
ogr2ogr -f SQLite wijken.wkb wijken.kml

You can read this into Shapely using

from shapely import wkt, wkb
f = open('../kml/wijken.wkb')

This gives however the following error (same for wkt()):

ReadingError: Could not create geometry because of errors while reading input.

As it provides no further info, and the Python code at this point is a wrapper for the underlying C-functions, I do not know what is wrong with the format.

2. The fastkml route

The fastkml Python library parses KML and returns Shapely objects if Shapely is installed. This works, but it seems to lose the ExtendedData properties in the KML-file (or at least does not pass them to the Shapely objects).

So if you only need the geometries, this is a fast and workable solution. However, I needed the geometry metatadata for the sectors and blocks, so this was not a solution.

3. The GeoJSON route

On first sight GeoJSON would be a good fit with Shapely, as it works with the Python geo interface. I overestimated the degree of fit however, as Shapely is concerned with the lower-level geometry objects, and not the more higher level GeoJSON constructs such as FeatureCollection.

After some validation by Sean Gillis, the author of Shapely, that the GeoJSON route was doable (with a small workaround) I settled for choice three.

As mentioned, the original JSON format is not GeoJSON, so not directly usable. It is however possible to convert the KML-file to a GeoJSON file using ogr2ogr.

ogr2ogr -f GeoJSON sectors.json sectors.kml

Unfortunately, a.t.m. the version in the stable Ubuntu/Debian repositories only contains a partial KML-driver, which does not include the ExtendedData-attributes I need. The libkml driver that does support the ExtendedData-elements should be included with GDAL starting from version 1.9.2.

Again, two steps forward, one backward: while GDAL 1.9.2 is available from an Ubuntu PPA, for some reason it does not come with the needed version of libkml. As getting the full GDAL-stack to compile would likely lead to further yak-shaving, I abandoned that route.

That leaves us with one, non-reusable route: custom parsing the ExtendedData KML-attributes from the description element, i.e. a Python script that converts the KML to GeoJSON and reparses the KML to add the properties.

Finally, geoinferencing!

Once you have you spatial data in Shapely, the core code itself is quite simple:

import json
from shapely import shape, Point

# load GeoJSON file containing sectors
with ('sectors.json', 'r') as f:
    js = json.load(f)

# construct point based on lat/long returned by geocoder
point = Point(45.4519896, -122.7924463)

# check each polygon to see if it contains the point
for feature in js['features']:
    polygon = shape(feature['geometry'])
    if polygon.contains(point):
        print 'Found containing polygon:', feature

We simply loop over the 11 blocks and 200 odd sectors, and return the matching ones. Not the most optimal solution (e.g. we could nest the sectors in the blocks first), but as the webservice spends most time on talking to the geocoder, it is not really an issue (or can be optimized later).

The final webservice adds a cache using Redis, and exposes the very simple API using Flask. I.e. a simply GET


gives you a JSON-document, containing the information of Googles geocoding service, plus the sector and block information:

  "geocode_status": "OK", 
  "sector_code": "A02", 
  "number": "3", 
  "full_address": "Korte Meer 3, 9000 Gent, Belgium", 
  "street": "Korte Meer", 
  "postal_code": "9000", 
  "lat": "51.0520538", 
  "query": "Korte Meer 3 Gent", 
  "geoinference_status": "OK", 
  "lng": "3.7229667", 
  "sector_name": "KOUTER", 
  "country": "Belgium", 
  "locality": "Gent", 
  "block_name": "Binnenstad", 
  "valid_address": "

The service is most effectively used in combination with OpenRefine. Using the function "Add column by fetching URLs" on a column containing addresses in Ghent will pull in all the data in the form of a JSON structure which you then can futher process.

Use a GREL expression such as '' + escape(value, 'url') to fetch the data and value.parseJson()["sector_code"] to process the the resulting data into usable columns.