Zonal Statistics Algorithm with Python in 4 Steps

It is a common need to summarize information from a gridded dataset within an irregularly shaped area. While at first glance this may seem simple, reconciling differences between raster (gridded) and vector (polygon) datatypes can quickly become complicated. This article shows how to implement a zonal statistics algorithm in Python in 4 steps.

Load raster data and vector polygons
Rasterize polygon features
Mask input data to polygon extent
Calculate zonal statistics for the polygon extent

1. Load raster data and vector polygons

Start by importing the necessary Python modules.

import gdal  # raster management
import ogr  # vector managment
import numpy as np  # gridded data
import os  # file managment
import csv  # for saving results in csv format

Now set the file paths for the raster and vector data and use gdal and ogr to load the raster and vector data, respectively. Access the layer which contains the polygon data from the vector data source that was loaded. Then get the GeoTransform information (positions the gridded raster data and specifies cell sizes) and no data value for the raster data.

fn_raster = "path/to/raster"
fn_zones = "path/to/vector/polygons"

# load rater and vector data
r_ds = gdal.Open(fn_raster)
p_ds = ogr.Open(fn_zones)

# get vector layer
lyr = p_ds.GetLayer()

# get geotransform and no data for raster
geot = r_ds.GetGeoTransform()
nodata = r_ds.GetRasterBand(1).GetNoDataValue()

zstats = []  # empty list to hold the zonal statistics once they are calculated

We’re also going to do a little more setup with gdal and ogr to prepare for determining the overlap between the raster grid and vector polygons. Let’s get gdal and ogr drivers to create temporary raster and vector layers in memory and set the name for a temporary shapefile.

mem_driver = ogr.GetDriverByName("Memory")
mem_driver_gdal = gdal.GetDriverByName("MEM")
shp_name = "temp"

2. Rasterize polygon features

Once the raster and vector data are loaded it’s time to determine where they overlap. This will be the most complicated and intensive part of the algorithm. To start, read the first polygon feature from the vector layer. Then start a while loop that will continue as long as there is another feature in the vector layer. Below is the basic way to set up the loop. This loop will be modified as we continue.

p_feat = lyr.GetNextFeature()

while p_feat:
    p_feat = lyr.GetNextFeature()

We need to create some functions that will help us create a new raster to store the rasterized polygon features. These functions will make more sense when we call them with actual data. For now, add the function definitions at the top of the script. First we need to convert the convert the coordinates of the bounding box containing a polygon feature to cell coordinates, or offsets, (row and column numbers) that to correspond the input raster. The bounding box of a polygon is the rectangle that contains the entire polygon feature, it consists of minimum and maximum X and Y coordinates.

def boundingBoxToOffsets(bbox, geot):
    col1 = int((bbox[0] - geot[0]) / geot[1])
    col2 = int((bbox[1] - geot[0]) / geot[1]) + 1
    row1 = int((bbox[3] - geot[3]) / geot[5])
    row2 = int((bbox[2] - geot[3]) / geot[5]) + 1
    return [row1, row2, col1, col2]

Next write a function that creates a new GeoTransform with the cell offsets that were calculated in the function above. These two functions will allow us to create a new, smaller raster that only covers the area overlapped by a polygon feature.

def geotFromOffsets(row_offset, col_offset, geot):
    new_geot = [
    geot[0] + (col_offset * geot[1]),
    geot[1],
    0.0,
    geot[3] + (row_offset * geot[5]),
    0.0,
    geot[5]
    ]
    return new_geot

Now comes the most complicated part of the algorithm: creating a new empty raster in memory and converting the polygon features to rasters. Follow the comments in code snippet below to understand what is accomplished by each line of code. You may also find it useful to review the GDAL and OGR documentation.

while p_feat:
    if p_feat.GetGeometryRef() is not None:
        if os.path.exists(shp_name):
            mem_driver.DeleteDataSource(shp_name)  # delete temporary raster if it already exists
        tp_ds = mem_driver.CreateDataSource(shp_name)  # create a new, empty raster in memory
        tp_lyr = tp_ds.CreateLayer('polygons', None, ogr.wkbPolygon)  # create a temporary polygon layer
        tp_lyr.CreateFeature(p_feat.Clone())  # copy the current feature to the temporary polygon layer
        offsets = boundingBoxToOffsets(p_feat.GetGeometryRef().GetEnvelope(),\
        geot)  # get the bounding box of the polygon feature and convert the coordinates to cell offsets
        new_geot = geotFromOffsets(offsets[0], offsets[2], geot)  # calculate the new geotransform for the polygonized raster
        
        tr_ds = mem_driver_gdal.Create(\
        "", \
        offsets[3] - offsets[2], \
        offsets[1] - offsets[0], \
        1, \
        gdal.GDT_Byte)  # create the raster for the rasterized polygon in memory
        
        tr_ds.SetGeoTransform(new_geot)  # set the geotransfrom the rasterized polygon
        gdal.RasterizeLayer(tr_ds, [1], tp_lyr, burn_values=[1])  # rasterize the polygon feature
        tr_array = tr_ds.ReadAsArray()  # read data from the rasterized polygon
        
        r_array = r_ds.GetRasterBand(1).ReadAsArray(\
        offsets[2],\
        offsets[0],\
        offsets[3] - offsets[2],\
        offsets[1] - offsets[0])  # read data from the input raster that corresponds to the location of the rasterized polygon
        
        id = p_feat.GetFID()  # get the ID of the polygon feature, you can use a different attribute field

3. Mask input data to polygon zones

At this point we have two concurrent (overlapping) numpy arrays: tr_array and r_array. tr_array has a value of 1 that indicates where the polygon has been rasterized. r_array has the values we want to use to calculate zonal statistics for each polygon.

First, make sure that the input raster exists. Next create a mask array. We want to get data from r_array where it does not contain no-data values and where the corresponding value in tr_array equals 1 (or True).

if r_array is not None:
            maskarray = np.ma.MaskedArray(\
            r_array,\
            maskarray=np.logical_or(r_array==nodata, np.logical_not(tr_array)))

Now we’ll use maskarray to calculate zonal statistics for each polygon feature.

4. Calculate zonal statistics for each polygon extent

Write one more function that takes a number of values as inputs and creates a dictionary of values. Place this function with the others at the top of the zonal statistics script.

def setFeatureStats(fid, min, max, mean, median, sd, sum, count, names=["min", "max", "mean", "median", "sd", "sum", "count"]):
    featstats = {
    names[0]: min,
    names[1]: max,
    names[2]: mean,
    names[3]: median,
    names[4]: sd,
    names[5]: sum,
    names[6]: count,
    names[7]: fid,
    }
    return featstats

Now, calculate the values from maskarray and pass them to the setFeatureStats function. Only do this if the array exists (is not None). Otherwise, pass the no data value to set the statistics.

            
            if maskarray is not None:
                zstats.append(setFeatureStats(\
                id,\
                maskarray.min(),\
                maskarray.max(),\
                maskarray.mean(),\
                np.ma.median(maskarray),\
                maskarray.std(),\
                maskarray.sum(),\
                maskarray.count()))
            else:
                zstats.append(setFeatureStats(\
                id,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata))
        else:
            zstats.append(setFeatureStats(\
                id,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata))

The Full Zonal Statistics Script

Congratulations! You have created a Python zonal statistics script that can be applied to many different applications. The full script is available below.

import gdal
import ogr
import os
import numpy as np
import csv


def boundingBoxToOffsets(bbox, geot):
    col1 = int((bbox[0] - geot[0]) / geot[1])
    col2 = int((bbox[1] - geot[0]) / geot[1]) + 1
    row1 = int((bbox[3] - geot[3]) / geot[5])
    row2 = int((bbox[2] - geot[3]) / geot[5]) + 1
    return [row1, row2, col1, col2]


def geotFromOffsets(row_offset, col_offset, geot):
    new_geot = [
    geot[0] + (col_offset * geot[1]),
    geot[1],
    0.0,
    geot[3] + (row_offset * geot[5]),
    0.0,
    geot[5]
    ]
    return new_geot


def setFeatureStats(fid, min, max, mean, median, sd, sum, count, names=["min", "max", "mean", "median", "sd", "sum", "count", "id"]):
    featstats = {
    names[0]: min,
    names[1]: max,
    names[2]: mean,
    names[3]: median,
    names[4]: sd,
    names[5]: sum,
    names[6]: count,
    names[7]: fid,
    }
    return featstats

mem_driver = ogr.GetDriverByName("Memory")
mem_driver_gdal = gdal.GetDriverByName("MEM")
shp_name = "temp"

fn_raster = "C:/pyqgis/raster/USGS_NED_13_n45w116_IMG.img"
fn_zones = "C:/temp/zonal_stats/zones.shp"

r_ds = gdal.Open(fn_raster)
p_ds = ogr.Open(fn_zones)

lyr = p_ds.GetLayer()
geot = r_ds.GetGeoTransform()
nodata = r_ds.GetRasterBand(1).GetNoDataValue()

zstats = []

p_feat = lyr.GetNextFeature()
niter = 0

while p_feat:
    if p_feat.GetGeometryRef() is not None:
        if os.path.exists(shp_name):
            mem_driver.DeleteDataSource(shp_name)
        tp_ds = mem_driver.CreateDataSource(shp_name)
        tp_lyr = tp_ds.CreateLayer('polygons', None, ogr.wkbPolygon)
        tp_lyr.CreateFeature(p_feat.Clone())
        offsets = boundingBoxToOffsets(p_feat.GetGeometryRef().GetEnvelope(),\
        geot)
        new_geot = geotFromOffsets(offsets[0], offsets[2], geot)
        
        tr_ds = mem_driver_gdal.Create(\
        "", \
        offsets[3] - offsets[2], \
        offsets[1] - offsets[0], \
        1, \
        gdal.GDT_Byte)
        
        tr_ds.SetGeoTransform(new_geot)
        gdal.RasterizeLayer(tr_ds, [1], tp_lyr, burn_values=[1])
        tr_array = tr_ds.ReadAsArray()
        
        r_array = r_ds.GetRasterBand(1).ReadAsArray(\
        offsets[2],\
        offsets[0],\
        offsets[3] - offsets[2],\
        offsets[1] - offsets[0])
        
        id = p_feat.GetFID()
        
        if r_array is not None:
            maskarray = np.ma.MaskedArray(\
            r_array,\
            maskarray=np.logical_or(r_array==nodata, np.logical_not(tr_array)))
            
            if maskarray is not None:
                zstats.append(setFeatureStats(\
                id,\
                maskarray.min(),\
                maskarray.max(),\
                maskarray.mean(),\
                np.ma.median(maskarray),\
                maskarray.std(),\
                maskarray.sum(),\
                maskarray.count()))
            else:
                zstats.append(setFeatureStats(\
                id,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata))
        else:
            zstats.append(setFeatureStats(\
                id,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata,\
                nodata))
        
        tp_ds = None
        tp_lyr = None
        tr_ds = None
        
        p_feat = lyr.GetNextFeature()
        
fn_csv = "C:/temp/zonal_stats/zstats.csv" 
col_names = zstats[0].keys()
with open(fn_csv, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, col_names)
    writer.writeheader()
    writer.writerows(zstats)

Zonal Statistics Algorithm with Python in 4 Steps

1. Load raster data and vector polygons

2. Rasterize polygon features

3. Mask input data to polygon zones

4. Calculate zonal statistics for each polygon extent

The Full Zonal Statistics Script

How to Install Mamba for Python

Intro to NetCDF with Python (netCDF4)

Vectorize Moving Window Grid Operations on NumPy Arrays

Python: Geographic Object-Based Image Analysis (GeOBIA) – Part 2: Image Classification

numpy: Methods For Creating Arrays

1. Load raster data and vector polygons

2. Rasterize polygon features

3. Mask input data to polygon zones

4. Calculate zonal statistics for each polygon extent

The Full Zonal Statistics Script

Similar Posts