Zonal Statistics Algorithm with Python in 4 Steps

It is a common need to summarize information from a gridded dataset within an irregularly shaped area. While at first glance this may seem simple, reconciling differences between raster (gridded) and vector (polygon) datatypes can quickly become complicated. This article shows how to implement a zonal statistics algorithm in Python in 4 steps.

1. Load raster data and vector polygons
2. Rasterize polygon features
3. Mask input data to polygon extent
4. Calculate zonal statistics for the polygon extent

1. Load raster data and vector polygons

Start by importing the necessary Python modules.

```import gdal  # raster management
import ogr  # vector managment
import numpy as np  # gridded data
import os  # file managment
import csv  # for saving results in csv format```

Now set the file paths for the raster and vector data and use `gdal` and `ogr` to load the raster and vector data, respectively. Access the layer which contains the polygon data from the vector data source that was loaded. Then get the GeoTransform information (positions the gridded raster data and specifies cell sizes) and no data value for the raster data.

```fn_raster = "path/to/raster"
fn_zones = "path/to/vector/polygons"

# load rater and vector data
r_ds = gdal.Open(fn_raster)
p_ds = ogr.Open(fn_zones)

# get vector layer
lyr = p_ds.GetLayer()

# get geotransform and no data for raster
geot = r_ds.GetGeoTransform()
nodata = r_ds.GetRasterBand(1).GetNoDataValue()

zstats = []  # empty list to hold the zonal statistics once they are calculated```

We’re also going to do a little more setup with `gdal` and `ogr` to prepare for determining the overlap between the raster grid and vector polygons. Let’s get `gdal` and `ogr` drivers to create temporary raster and vector layers in memory and set the name for a temporary shapefile.

```mem_driver = ogr.GetDriverByName("Memory")
mem_driver_gdal = gdal.GetDriverByName("MEM")
shp_name = "temp"```

2. Rasterize polygon features

Once the raster and vector data are loaded it’s time to determine where they overlap. This will be the most complicated and intensive part of the algorithm. To start, read the first polygon feature from the vector layer. Then start a `while` loop that will continue as long as there is another feature in the vector layer. Below is the basic way to set up the loop. This loop will be modified as we continue.

```p_feat = lyr.GetNextFeature()

while p_feat:
p_feat = lyr.GetNextFeature()```

We need to create some functions that will help us create a new raster to store the rasterized polygon features. These functions will make more sense when we call them with actual data. For now, add the function definitions at the top of the script. First we need to convert the convert the coordinates of the bounding box containing a polygon feature to cell coordinates, or offsets, (row and column numbers) that to correspond the input raster. The bounding box of a polygon is the rectangle that contains the entire polygon feature, it consists of minimum and maximum X and Y coordinates.

```def boundingBoxToOffsets(bbox, geot):
col1 = int((bbox[0] - geot[0]) / geot[1])
col2 = int((bbox[1] - geot[0]) / geot[1]) + 1
row1 = int((bbox[3] - geot[3]) / geot[5])
row2 = int((bbox[2] - geot[3]) / geot[5]) + 1
return [row1, row2, col1, col2]```

Next write a function that creates a new GeoTransform with the cell offsets that were calculated in the function above. These two functions will allow us to create a new, smaller raster that only covers the area overlapped by a polygon feature.

```def geotFromOffsets(row_offset, col_offset, geot):
new_geot = [
geot[0] + (col_offset * geot[1]),
geot[1],
0.0,
geot[3] + (row_offset * geot[5]),
0.0,
geot[5]
]
return new_geot```

Now comes the most complicated part of the algorithm: creating a new empty raster in memory and converting the polygon features to rasters. Follow the comments in code snippet below to understand what is accomplished by each line of code. You may also find it useful to review the GDAL and OGR documentation.

```while p_feat:
if p_feat.GetGeometryRef() is not None:
if os.path.exists(shp_name):
mem_driver.DeleteDataSource(shp_name)  # delete temporary raster if it already exists
tp_ds = mem_driver.CreateDataSource(shp_name)  # create a new, empty raster in memory
tp_lyr = tp_ds.CreateLayer('polygons', None, ogr.wkbPolygon)  # create a temporary polygon layer
tp_lyr.CreateFeature(p_feat.Clone())  # copy the current feature to the temporary polygon layer
offsets = boundingBoxToOffsets(p_feat.GetGeometryRef().GetEnvelope(),\
geot)  # get the bounding box of the polygon feature and convert the coordinates to cell offsets
new_geot = geotFromOffsets(offsets[0], offsets[2], geot)  # calculate the new geotransform for the polygonized raster

tr_ds = mem_driver_gdal.Create(\
"", \
offsets[3] - offsets[2], \
offsets[1] - offsets[0], \
1, \
gdal.GDT_Byte)  # create the raster for the rasterized polygon in memory

tr_ds.SetGeoTransform(new_geot)  # set the geotransfrom the rasterized polygon
gdal.RasterizeLayer(tr_ds, [1], tp_lyr, burn_values=[1])  # rasterize the polygon feature

offsets[2],\
offsets[0],\
offsets[3] - offsets[2],\
offsets[1] - offsets[0])  # read data from the input raster that corresponds to the location of the rasterized polygon

id = p_feat.GetFID()  # get the ID of the polygon feature, you can use a different attribute field```

3. Mask input data to polygon zones

At this point we have two concurrent (overlapping) `numpy` arrays: `tr_array` and `r_array`. `tr_array` has a value of 1 that indicates where the polygon has been rasterized. `r_array` has the values we want to use to calculate zonal statistics for each polygon.

First, make sure that the input raster exists. Next create a mask array. We want to get data from `r_array` where it does not contain no-data values and where the corresponding value in `tr_array` equals 1 (or `True`).

```if r_array is not None:
r_array,\

Now we’ll use `maskarray` to calculate zonal statistics for each polygon feature.

4. Calculate zonal statistics for each polygon extent

Write one more function that takes a number of values as inputs and creates a dictionary of values. Place this function with the others at the top of the zonal statistics script.

```def setFeatureStats(fid, min, max, mean, median, sd, sum, count, names=["min", "max", "mean", "median", "sd", "sum", "count"]):
featstats = {
names[0]: min,
names[1]: max,
names[2]: mean,
names[3]: median,
names[4]: sd,
names[5]: sum,
names[6]: count,
names[7]: fid,
}
return featstats```

Now, calculate the values from `maskarray` and pass them to the `setFeatureStats` function. Only do this if the array exists (`is not None`). Otherwise, pass the no data value to set the statistics.

```
zstats.append(setFeatureStats(\
id,\
else:
zstats.append(setFeatureStats(\
id,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata))
else:
zstats.append(setFeatureStats(\
id,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata))```

The Full Zonal Statistics Script

Congratulations! You have created a Python zonal statistics script that can be applied to many different applications. The full script is available below.

```import gdal
import ogr
import os
import numpy as np
import csv

def boundingBoxToOffsets(bbox, geot):
col1 = int((bbox[0] - geot[0]) / geot[1])
col2 = int((bbox[1] - geot[0]) / geot[1]) + 1
row1 = int((bbox[3] - geot[3]) / geot[5])
row2 = int((bbox[2] - geot[3]) / geot[5]) + 1
return [row1, row2, col1, col2]

def geotFromOffsets(row_offset, col_offset, geot):
new_geot = [
geot[0] + (col_offset * geot[1]),
geot[1],
0.0,
geot[3] + (row_offset * geot[5]),
0.0,
geot[5]
]
return new_geot

def setFeatureStats(fid, min, max, mean, median, sd, sum, count, names=["min", "max", "mean", "median", "sd", "sum", "count", "id"]):
featstats = {
names[0]: min,
names[1]: max,
names[2]: mean,
names[3]: median,
names[4]: sd,
names[5]: sum,
names[6]: count,
names[7]: fid,
}
return featstats

mem_driver = ogr.GetDriverByName("Memory")
mem_driver_gdal = gdal.GetDriverByName("MEM")
shp_name = "temp"

fn_raster = "C:/pyqgis/raster/USGS_NED_13_n45w116_IMG.img"
fn_zones = "C:/temp/zonal_stats/zones.shp"

r_ds = gdal.Open(fn_raster)
p_ds = ogr.Open(fn_zones)

lyr = p_ds.GetLayer()
geot = r_ds.GetGeoTransform()
nodata = r_ds.GetRasterBand(1).GetNoDataValue()

zstats = []

p_feat = lyr.GetNextFeature()
niter = 0

while p_feat:
if p_feat.GetGeometryRef() is not None:
if os.path.exists(shp_name):
mem_driver.DeleteDataSource(shp_name)
tp_ds = mem_driver.CreateDataSource(shp_name)
tp_lyr = tp_ds.CreateLayer('polygons', None, ogr.wkbPolygon)
tp_lyr.CreateFeature(p_feat.Clone())
offsets = boundingBoxToOffsets(p_feat.GetGeometryRef().GetEnvelope(),\
geot)
new_geot = geotFromOffsets(offsets[0], offsets[2], geot)

tr_ds = mem_driver_gdal.Create(\
"", \
offsets[3] - offsets[2], \
offsets[1] - offsets[0], \
1, \
gdal.GDT_Byte)

tr_ds.SetGeoTransform(new_geot)
gdal.RasterizeLayer(tr_ds, [1], tp_lyr, burn_values=[1])

offsets[2],\
offsets[0],\
offsets[3] - offsets[2],\
offsets[1] - offsets[0])

id = p_feat.GetFID()

if r_array is not None:
r_array,\

zstats.append(setFeatureStats(\
id,\
else:
zstats.append(setFeatureStats(\
id,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata))
else:
zstats.append(setFeatureStats(\
id,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata,\
nodata))

tp_ds = None
tp_lyr = None
tr_ds = None

p_feat = lyr.GetNextFeature()

fn_csv = "C:/temp/zonal_stats/zstats.csv"
col_names = zstats[0].keys()
with open(fn_csv, 'w', newline='') as csvfile:
writer = csv.DictWriter(csvfile, col_names)