Filling a Geofence with Geohashes

The nature of my work means that I spend a decent amount of time dealing with geospatial data. Whilst there are many different ways to represent areas on the globe, geohashes are particularly fascinating.

A Quick Primer on Geohashes

If you're already familiar with geohashes, go ahead and skip this section.

A geohash is an arbitrary-precision value which represents an area on the Earth's surface. In string form, geohashes are encoded with a base 32 alphabet, meaning each character in the string carries 5 bits of information. A geohash with precision 6 (i.e. with 6 characters, or 30 bits of information) covers a cell of roughly 1.2km × 0.6km. A geohash with precision 12 covers a cell only a few centimetres across, which is more than enough for pretty much anything.

Without going into the details of the algorithm too much, the way I like to think of geohashes is to imagine the world divided up into 32 sections. Each of these sections has a corresponding character from the base 32 alphabet. Any time you add another character to the geohash, you subdivide that area into a further 32 parts. For example, here are all of the geohashes at precision 2 under d and e:

One of the most useful properties is that by adding another character, you know that the geohash you create is contained in the previous one. For example, you can quickly tell that dpr74ht is contained by dpr74h because they share a common prefix. This, of course, can be extrapolated to any precision.
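Because containment is just a matter of shared prefixes, checking whether one geohash falls inside another needs nothing more than a string comparison. A minimal sketch (hashContains is a hypothetical helper, not part of any particular library):

// hashContains reports whether the geohash inner falls within the geohash
// outer, relying purely on the shared-prefix property described above.
func hashContains(outer, inner string) bool {
    return strings.HasPrefix(inner, outer)
}

So hashContains("dpr74h", "dpr74ht") returns true, and so does hashContains("", "dpr74ht"), since the empty geohash covers the whole world.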

The Real World is Not Rectangular

Herein lies the problem. If we want to represent a very specific area of the world, it's extremely unlikely that the area will line up exactly with the borders of a geohash. For these cases, we have a much better solution: GeoJSON. With GeoJSON, we can define a polygon which represents an exact area of the world. For example, if we want to completely represent the area of Regent's Park in London, we could define this geofence:
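A GeoJSON Polygon for the park takes this general shape (the coordinates below are illustrative placeholders rather than the park's actual outline):

{
  "type": "Polygon",
  "coordinates": [
    [
      [-0.153, 51.523],
      [-0.146, 51.537],
      [-0.158, 51.540],
      [-0.168, 51.532],
      [-0.153, 51.523]
    ]
  ]
}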

To represent the same geofence with geohashes, we can choose to use gcpvh, which also includes most of Camden, or we can continue to subdivide that hash into smaller, more accurate chunks. But then we need to use the union of gcpvhq, gcpvhw, gcpvhj, gcpvhm, gcpvht, gcpvhh, gcpvhk, gcpvhs, gcpvh7 and gcpvhe, and we're still pretty far from only containing the park. As we generate a better approximation of the real area, the number of geohashes continues to grow.

So if we can get a more accurate representation of reality with geofences, why bother with geohashes at all?

An Interesting Use Case for Geohashes

There are many great use cases for geohashes, but one in particular recently required some interesting work. On this project, we were collecting millions of geographical points every day, and we wanted to count how frequently they appeared in different regions (across all time). However, the exact regions that our customers wanted were defined by geofences and were not known until runtime. The product would effectively let a user draw any geofence on the map, and calculate how many points fell within that area.

With a small number of points, this is a simple problem to solve. We could use a Point in Polygon (PiP) algorithm to simply check each point against the polygon that the user submits as a query. However, with hundreds of millions of points (and especially with geofences with a large number of vertices) this computation becomes incredibly expensive.

Instead, what we chose to do was to encode the points we were receiving to a fixed-precision geohash at write time and store those geohashes in Redis. For example, we could store a map of geohash -> count:

{
  gcpvhm: 143,
  gcpvhk: 271,
  ....
}
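At write time, this amounts to one encode and one increment per point. A minimal sketch, assuming the go-redis client and the github.com/mmcloughlin/geohash encoder purely for illustration (the counts: key prefix is also just an assumption):

// recordPoint encodes an incoming point to a fixed-precision geohash and
// increments that geohash's counter in Redis.
func recordPoint(ctx context.Context, rdb *redis.Client, lat, lng float64) error {
    hash := geohash.EncodeWithPrecision(lat, lng, 6) // fixed precision chosen at write time
    return rdb.Incr(ctx, "counts:"+hash).Err()
}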

Now, when the user submits their geofence to query for the counts we need to:

  1. Figure out which geohashes fall within their geofence.
  2. Fetch all of the counts for those geohashes from Redis.
  3. Sum the counts (steps 2 and 3 are sketched below).
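Steps 2 and 3 are mechanical. A rough sketch, again assuming go-redis and the hypothetical counts: key prefix from above:

// countPointsInFence sums the stored counts for every geohash that falls
// within the user's geofence.
func countPointsInFence(ctx context.Context, rdb *redis.Client, hashes []string) (int64, error) {
    keys := make([]string, len(hashes))
    for i, h := range hashes {
        keys[i] = "counts:" + h
    }

    vals, err := rdb.MGet(ctx, keys...).Result()
    if err != nil {
        return 0, err
    }

    var total int64
    for _, v := range vals {
        s, ok := v.(string)
        if !ok {
            continue // no points recorded for this geohash
        }
        n, err := strconv.ParseInt(s, 10, 64)
        if err != nil {
            return 0, err
        }
        total += n
    }
    return total, nil
}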

The approach sounds pretty simple, but step 1 in particular turns out to have quite a few nuances, and there's not a lot of pre-existing open source work either. So let's build it!

Part 1 - Inside or Out

We know we want to fill a geofence with multiple geohashes, so our 'base case' - so to speak - is whether a single geohash falls within a geofence or not. However, what does it really mean to fall within a geofence? This image shows the 3 possibilities that can occur:

Two things are clearly evident: the red box shouldn't be included and the green one should be. But what about the blue box: should it be included or not? That's a decision we might want to make by default or leave to the user, so let's design our algorithm in such a way that it works with both options. We'll call these modes Contains, for the case where we only want geohashes which are fully contained by the geofence, and Intersects, for the case where we also want geohashes which intersect the boundary of the geofence.
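In code, these two modes can be modelled as a small enum. The later snippets assume definitions along these lines:

// FillMode controls how geohashes on the boundary of the geofence are treated.
type FillMode int

const (
    // FillContains only includes geohashes fully contained by the geofence.
    FillContains FillMode = iota
    // FillIntersects also includes geohashes which cross the geofence boundary.
    FillIntersects
)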

Now for the geometric checks themselves. Unfortunately, there isn't a single package which provides both GeoJSON types and the geometric predicates we need, so we'll be using the twpayne/go-geom library to represent our geofence and the paulsmith/gogeos library for our geometric operations (contains/intersects). We need to start off with some boring code to convert between the two packages' types (because neither accepts points as an interface!):

func geomToGeosCoord(coord geom.Coord) geos.Coord {  
    return geos.Coord{
        X: coord.X(),
        Y: coord.Y(),
    }
}

func geomToGeosCoords(coords []geom.Coord) []geos.Coord {  
    out := make([]geos.Coord, len(coords))
    for i := 0; i < len(coords); i++ {
        out[i] = geomToGeosCoord(coords[i])
    }
    return out
}

// hashToGeometry converts a geohash to a geos polygon by taking its bounding box.
func hashToGeometry(hash string) *geos.Geometry {  
    bounds := geohash.BoundingBox(hash)
    return geos.Must(geos.NewPolygon([]geos.Coord{
        geos.NewCoord(bounds.MinLng, bounds.MinLat),
        geos.NewCoord(bounds.MinLng, bounds.MaxLat),
        geos.NewCoord(bounds.MaxLng, bounds.MaxLat),
        geos.NewCoord(bounds.MaxLng, bounds.MinLat),
        geos.NewCoord(bounds.MinLng, bounds.MinLat),
    }))
}


func polygonToGeometry(geofence *geom.Polygon) *geos.Geometry {  
    // Convert the outer shell to geos format.
    shell := geofence.LinearRing(0).Coords()
    shellGeos := geomToGeosCoords(shell)

    // Convert each hole to geos format. Ring 0 is the outer shell, so the
    // holes start at ring 1.
    numHoles := geofence.NumLinearRings() - 1
    holes := make([][]geos.Coord, numHoles)
    for i := 0; i < numHoles; i++ {
        holes[i] = geomToGeosCoords(geofence.LinearRing(i + 1).Coords())
    }

    return geos.Must(geos.NewPolygon(shellGeos, holes...))
}

We can now write our simple Intersects and Contains functions:

// Intersects tests if the hash intersects the geofence by doing a geos intersection.
func Intersects(geofence *geom.Polygon, hash string) (bool, error) {  
    hashGeo := hashToGeometry(hash)
    fence := polygonToGeometry(geofence)
    return fence.Intersects(hashGeo)
}

// Contains tests if the geofence contains the hash by doing a geos contains.
func Contains(geofence *geom.Polygon, hash string) (bool, error) {  
    hashGeo := hashToGeometry(hash)
    fence := polygonToGeometry(geofence)
    return fence.Contains(hashGeo)
}

We can now write some tests to ensure this is working correctly (the tests for Contains and the helper functions have been left out for brevity):

func TestIntersects(t *testing.T) {  
    geofence := readFileAsGeometry(t, "testdata/regents.geojson")
    cases := []struct {
        name            string
        hash            string
        shouldIntersect bool
    }{
        {"outside", "q", false},
        {"inside", "gcpvhkz", true},
        {"boundary", "gcpvhk", true},
        {"empty", "", true},
    }

    for _, test := range cases {
        t.Run(test.name, func(t *testing.T) {
            inter, err := Intersects(geofence, test.hash)
            assert.NoError(t, err)
            assert.Equal(t, test.shouldIntersect, inter)
        })
    }
}

Note that the check for "" returns true, as the geohash "" represents the whole world (this will be important later).

Part 2 - The Fill

Now that we know we can test whether a geohash should be contained in our result set or not, we need to figure out which geohashes we need to test in order to completely fill our geofence.

We can quite quickly see that a brute force approach is out of the picture, as even at precision 6 we'd need 32^6 (or 1,073,741,824) hashes to check. The first thing that struck me was that this might be solved using a simple breadth-first search. The approach was to:

  1. Start in the centre of the geofence.
  2. Check if the hash is contained by/intersects the geofence.
  3. Calculate the neighbouring geohashes.
  4. Repeat for each one.

The implementation of which looks like this:

func BreadthFirstFill(poly *geom.Polygon) ([]string, error) {
    centroid := Centroid(poly)
    hashes := make([]string, 0)
    seen := make(map[string]struct{}, 0)

    queue := []string{geohash.Encode(centroid)}
    for len(queue) > 0 {
        next := queue[0]
        queue = queue[1:]
        neighbors, _ := geohash.Neighbors(next)
        for _, hash := range neighbors {
            // If we've already seen this hash before, ignore it.
            if _, found := seen[hash]; found {
                continue
            }
            seen[hash] = struct{}{}

            contains, err := Contains(poly, hash)
            if err != nil {
                return nil, err
            }
            if contains {
                queue = append(queue, hash)
                hashes = append(hashes, hash)
            }
        }
    }
    return hashes, nil
}

This is a pretty simple function, and it works! Mostly. Here's the visualised output of the code for our previous geofence with precision 7 hashes:

All fine with this simple geofence, but there are two problems with this implementation. Firstly, we can only produce fixed-precision coverage of our geofence. This is fine for our use case, but not ideal for the general case. Secondly, and most importantly, this algorithm might produce the wrong output if the geofence has holes or if, counter-intuitively, the centroid falls outside the geofence (which can happen with concave shapes).

Part 3 - The Recursive Fill

After talking about these problems with my colleague, he suggested that I think about a recursive solution instead. It would work something like this:

  1. Start with the entire world ("").
  2. If the hash is completely contained by the geofence, add it to the result set and stop recursing.
  3. If the hash doesn't intersect the geofence at all, discard it and stop recursing.
  4. If the hash intersects the geofence and we've reached our maximum precision, add it to the result set (in Intersects mode only) and stop recursing.
  5. Otherwise, recurse with all 32 children of the hash.

This solves our centroid problem as we always start with a geohash that intersects the geofence (the whole world) and go from there. It also has the nice property that if the geohash is completely contained at, let's say, precision 6, we don't need to enumerate all the hashes (1024 of them) to represent the same space in precision 8.
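Recursing with "all 32 children" just means appending each character of the geohash alphabet in turn, so the snippets below assume a definition along these lines:

// geohashBase32Alphabet contains the 32 characters used to encode geohashes
// (the digits and lowercase letters, excluding a, i, l and o).
var geohashBase32Alphabet = strings.Split("0123456789bcdefghjkmnpqrstuvwxyz", "")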

We'll define a new function called computeVariableHashes which will recursively compute this set of hashes:

func computeVariableHashes(fence *geom.Polygon, mode FillMode, hash string, maxPrecision int) ([]string, error) {
    cont, err := Contains(fence, hash)
    if err != nil {
        return nil, err
    }
    if cont {
        return []string{hash}, nil
    }

    inter, err := Intersects(fence, hash)
    if err != nil {
        return nil, err
    }
    if !inter {
        return nil, nil
    }

    if len(hash) == maxPrecision {
        // If we hit the max precision and we intersected but didn't contain,
        // it means we're at the boundary and can't go any smaller. So if we're
        // using FillIntersects, include the hash, otherwise don't.
        if mode == FillIntersects {
            return []string{hash}, nil
        }
        return nil, nil
    }

    // We didn't reach the max precision, so recurse with the next hash down.
    hashes := make([]string, 0)
    for _, next := range geohashBase32Alphabet {
        out, err := computeVariableHashes(fence, mode, hash+next, maxPrecision)
        if err != nil {
            return nil, err
        }
        hashes = append(hashes, out...)
    }
    return hashes, nil
}
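Kicking the recursion off from the whole-world geohash "" (which is why the empty-string case mattered earlier) can then be wrapped up like this; ComputeHashes is a hypothetical entry point rather than the library's actual API:

// ComputeHashes fills the geofence with variable-precision geohashes by
// starting the recursion from the whole world and descending to maxPrecision.
func ComputeHashes(fence *geom.Polygon, mode FillMode, maxPrecision int) ([]string, error) {
    return computeVariableHashes(fence, mode, "", maxPrecision)
}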

The resulting set of hashes looks like this:

Here's an animation that shows how this algorithm works:

The final piece of this algorithm is the ability to generate the fixed-precision set if needed. Once we have the variable set, we can simply iterate through each hash and, if it's shorter than the precision we want, extend that hash out with all possible children:

func extendHashToMaxPrecision(hash string, maxPrecision int) []string {
    if len(hash) == maxPrecision {
        return []string{hash}
    }
    hashes := make([]string, 0, 32)
    for _, next := range geohashBase32Alphabet {
        out := extendHashToMaxPrecision(hash+next, maxPrecision)
        hashes = append(hashes, out...)
    }
    return hashes
}
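Putting it together, expanding a variable-precision result into the fixed-precision set could look something like this (toFixedPrecision is a hypothetical wrapper, included only to show the shape of the loop):

// toFixedPrecision expands every hash in a variable-precision set out to the
// target precision, producing a fixed-precision covering of the geofence.
func toFixedPrecision(hashes []string, maxPrecision int) []string {
    out := make([]string, 0, len(hashes))
    for _, h := range hashes {
        out = append(out, extendHashToMaxPrecision(h, maxPrecision)...)
    }
    return out
}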

There are some more optimisations that we can do to improve how this works, but this is all we need for a basic, working solution. The complete code for this library is available on GitHub if you'd like to take a look.