CyFi: Cyanobacteria Finder

CyFi is a command line tool that uses satellite imagery and machine learning to estimate cyanobacteria levels in small, inland water bodies. Cyanobacteria blooms are a type of harmful algal bloom (HAB); they can produce toxins that are poisonous to humans and their pets, and can threaten aquatic ecosystems.

The goal of CyFi is to help water quality managers better allocate resources for in situ sampling, and make more informed decisions around public health warnings for critical resources like lakes and reservoirs.

Ultimately, more accurate and more timely detection of algal blooms helps keep both the people and the aquatic life that rely on these water bodies safe and healthy.

For a brief overview of CyFi's main capabilities and uses, check out the CyFi demo deck.

Figure: Stylized view of severity estimates for points on a lake with a cyanobacteria bloom. Base image from NASA Landsat Image Gallery.

Quickstart

Install

Note: There is a known issue with the pip installation on M1 Macs due to LightGBM. If you're on a Mac, we recommend installing CyFi with conda, shown in the second option below.

Install CyFi with pip:

pip install cyfi

Alternatively, CyFi can be installed with conda:

conda install -c conda-forge cyfi

For detailed instructions aimed at those installing Python for the first time, see the Installation docs.

Generate batch predictions

Generate batch predictions at the command line with cyfi predict.

First, specify your sample points in a csv with the following columns:

  • latitude
  • longitude
  • date

For example, sample_points.csv could be:

latitude longitude date
41.424144 -73.206937 2023-06-22
36.045 -79.0919415 2023-07-01
35.884524 -78.953997 2023-08-04
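
If your sampling points come from another system, a file like this can also be written programmatically. The following is a minimal sketch using only Python's standard library; the helper name `write_sample_points` is hypothetical and not part of CyFi, and the rows are the three example points above.

```python
import csv

# The three example points shown above.
sample_points = [
    {"latitude": 41.424144, "longitude": -73.206937, "date": "2023-06-22"},
    {"latitude": 36.045, "longitude": -79.0919415, "date": "2023-07-01"},
    {"latitude": 35.884524, "longitude": -78.953997, "date": "2023-08-04"},
]


def write_sample_points(path, points):
    """Write points to a csv with latitude, longitude, and date columns."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["latitude", "longitude", "date"])
        writer.writeheader()
        writer.writerows(points)


# Hypothetical helper, not a CyFi function.
write_sample_points("sample_points.csv", sample_points)
```

The resulting sample_points.csv has one header row followed by one row per sampling point.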

Then run:

cyfi predict sample_points.csv

This will output a preds.csv that contains a column for cyanobacteria density and a column for the associated severity level:

sample_id date latitude longitude density_cells_per_ml severity
7ff4b4a56965d80f6aa501cc25aa1883 2023-06-22 41.424144 -73.206937 34,173 moderate
882b9804a3e28d8805f98432a1a9d9af 2023-07-01 36.045 -79.0919415 7,701 low
10468e709dcb6133d19a230419efbb24 2023-08-04 35.884524 -78.953997 4,053 low

To see all of the available options, run cyfi predict --help.
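
Once preds.csv exists, it can be summarized with a few lines of Python. This is a sketch rather than part of CyFi: the function names are hypothetical, the column names are taken from the example output above, and because it is not stated here whether the file stores density with thousands separators, the code strips commas defensively.

```python
import csv
from collections import Counter


def load_predictions(path):
    """Read a predictions csv into a list of dicts with numeric density."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        # Assumption: density may or may not contain thousands separators.
        raw = row["density_cells_per_ml"].replace(",", "")
        row["density_cells_per_ml"] = float(raw)
    return rows


def summarize(rows):
    """Count points per severity level and find the highest-density point."""
    counts = Counter(row["severity"] for row in rows)
    worst = max(rows, key=lambda row: row["density_cells_per_ml"])
    return counts, worst
```

For example, `counts, worst = summarize(load_predictions("preds.csv"))` gives the number of points at each severity level and the row with the highest estimated density, which may help decide where to sample first.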

Generate prediction for a single point

Or, generate a cyanobacteria estimate for a single point on a single date using cyfi predict-point.

Just specify the latitude, longitude, and date as arguments at the command line.

cyfi predict-point --lat 41.2 --lon -73.2 --date 2023-09-14

This will print out the estimated cyanobacteria density and associated severity level.

2023-10-04 16:25:40.581 | SUCCESS  | cyfi.cli:predict_point:154 - Estimate generated:
date                    2023-09-14
latitude                      41.2
longitude                    -73.2
density_cells_per_ml        32,820
severity                  moderate

To see all of the available options, run cyfi predict-point --help.

A note on severity levels

Severity levels are based on World Health Organization (WHO) cyanobacteria density thresholds.

  • Low: 0 - 20,000 cells/ml
  • Moderate: 20,000 - 100,000 cells/ml
  • High: > 100,000 cells/ml

However, users should feel free to apply their own thresholds where that better suits their needs.
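
To apply different thresholds, the density column can be re-binned after prediction. The sketch below is illustrative only: `severity_from_density` is a hypothetical helper, not a CyFi function, and the handling of values that fall exactly on a boundary (20,000 counts as moderate, 100,000 counts as moderate) is an assumption, since the ranges above do not say which side a boundary value belongs to.

```python
def severity_from_density(density, moderate_from=20_000, high_above=100_000):
    """Map a density in cells/ml to a severity label.

    Defaults follow the WHO-based thresholds listed above.
    Boundary handling is an assumption, not CyFi's documented behavior.
    """
    if density < 0:
        raise ValueError("density must be non-negative")
    if density > high_above:
        return "high"
    if density >= moderate_from:
        return "moderate"
    return "low"


# Default thresholds, using densities from the examples above.
print(severity_from_density(34_173))  # moderate
print(severity_from_density(7_701))   # low

# Stricter, user-chosen thresholds (values here are purely illustrative).
print(severity_from_density(7_701, moderate_from=5_000, high_above=50_000))  # moderate
```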

Visualizing predictions

Launch the CyFi Explorer to view the Sentinel-2 imagery used to generate each cyanobacteria estimate!


About the model

CyFi was born out of the Tick Tick Bloom machine learning competition, hosted by DrivenData. The goal in that challenge was to detect and classify the severity of cyanobacteria blooms in small, inland water bodies using publicly available satellite, climate, and elevation data. Labels were based on "in situ" samples that were collected manually by many organizations across the U.S. The model in CyFi is based on the winning solutions from that challenge, and has been optimized for generalizability and efficiency.

Why use machine learning

Machine learning is particularly well-suited to this task because indicators of cyanobacteria are visible from free, routinely collected data sources. Whereas manual water sampling is time- and resource-intensive, machine learning models can generate estimates in seconds. This allows water managers to prioritize where water sampling will be most beneficial, and can provide a bird's-eye view of water conditions across the state.

Data sources

CyFi relies on two data sources as input:

Sentinel-2 satellite imagery

  • Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission. It supports the monitoring of vegetation, soil and water cover, as well as observation of inland waterways and coastal areas. The Sentinel-2 Multispectral Instrument (MSI) samples 13 spectral bands: four bands at 10 metres, six bands at 20 metres and three bands at 60 metres spatial resolution. The mission provides global coverage of the Earth's land surface every 5 days. Sentinel-2 data is accessed through Microsoft's Planetary Computer.

Land cover map

  • The Climate Research Data Package (CRDP) Land Cover Gridded Map (2020) classifies land surface into 22 classes, which have been defined using the United Nations Food and Agriculture Organization's Land Cover Classification System (LCCS). This map is based on data from the Medium Resolution Imaging Spectrometer (MERIS) sensor on board the polar-orbiting Envisat-1 environmental research satellite by the European Space Agency. This data comes from the CCI-LC database hosted by the ESA Climate Change Initiative's Land Cover project.

Overview of the model

Each observation (or "sampling point") is a unique combination of date, latitude, and longitude.

Example input csv row:

latitude longitude date
41.424144 -73.206937 2023-06-22

Satellite imagery feature generation for each observation is as follows:

  • identify relevant Sentinel-2 tiles based on
    • a bounding box of 2,000m around the sampling point
    • a time range of 30 days prior to (and including) the sampling date
  • select the most recent image that has a bounding box containing fewer than 5% of cloud pixels
  • filter the pixels in the bounding box to the water area using the scene classification (SCL) band
  • generate summary statistics (e.g., mean, max, min) and ratios (e.g., NDVI) using the 15 Sentinel-2 bands
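
The search and selection steps above can be sketched in plain Python. This is not CyFi's implementation: all function names are hypothetical, the 2,000m is assumed to be the full side length of a square centered on the point, the degree conversion is a rough spherical approximation, and each candidate image is represented by a made-up dict with `date` and `cloud_fraction` keys.

```python
import math
from datetime import date, timedelta

METERS_PER_DEGREE_LAT = 111_320  # rough approximation


def bounding_box(latitude, longitude, size_m=2_000):
    """Return (min_lon, min_lat, max_lon, max_lat) for a square around the point.

    Assumption: size_m is the full side length of the box.
    """
    half = size_m / 2
    dlat = half / METERS_PER_DEGREE_LAT
    dlon = half / (METERS_PER_DEGREE_LAT * math.cos(math.radians(latitude)))
    return (longitude - dlon, latitude - dlat, longitude + dlon, latitude + dlat)


def search_window(sample_date, days=30):
    """Return (start, end): the `days` before the sampling date, inclusive of it."""
    return sample_date - timedelta(days=days), sample_date


def select_image(candidates, max_cloud_fraction=0.05):
    """Pick the most recent candidate under the cloud limit, or None.

    Each candidate is a hypothetical dict with "date" and "cloud_fraction".
    """
    clear = [c for c in candidates if c["cloud_fraction"] < max_cloud_fraction]
    return max(clear, key=lambda c: c["date"]) if clear else None


bbox = bounding_box(41.424144, -73.206937)
start, end = search_window(date(2023, 6, 22))
candidates = [
    {"date": date(2023, 6, 20), "cloud_fraction": 0.40},  # too cloudy
    {"date": date(2023, 6, 15), "cloud_fraction": 0.01},
    {"date": date(2023, 6, 5), "cloud_fraction": 0.00},
]
chosen = select_image(candidates)  # the 2023-06-15 image
```

Note that the most recent image overall is skipped here because it exceeds the cloud limit, so the estimate would rely on imagery from a week before the sampling date.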

The land cover value for each sampling point is looked up from the static land cover map, and added to the satellite features.

Example features csv row:

B01_mean B02_mean B03_mean B04_mean B05_mean B06_mean B07_mean B08_mean B09_mean B11_mean B12_mean B8A_mean WVP_mean AOT_mean percent_water green95th green5th green_red_ratio green_blue_ratio red_blue_ratio green95th_blue_ratio green5th_blue_ratio NDVI_B04 NDVI_B05 NDVI_B06 NDVI_B07 AOT_range month days_before_sample land_cover
548.1 1341.6 1607.3 1613.8 234.0 287.7 265.3 2929.3 3316.7 362.7 153.3 171.7 1742.8 76.0 7.14e-05 3919.0 711.6 0.996 1.2 1.2 2.9 0.5 0.3 0.9 0.8 0.8 0.0 5 6 130
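
The ratio and NDVI columns can be related back to the band means. The sketch below is a reading of the example row, not CyFi's documented method: it assumes green, blue, red, and near-infrared correspond to Sentinel-2 bands B03, B02, B04, and B08, that ratios are taken between band means, and that `NDVI_Bxx` is the normalized difference between B08 and band `Bxx`. Under those assumptions the results agree with the example values after rounding.

```python
def normalized_difference(nir, other):
    """Normalized difference index: (nir - other) / (nir + other)."""
    return (nir - other) / (nir + other)


# Band means taken from the example features row above.
band_means = {"B02": 1341.6, "B03": 1607.3, "B04": 1613.8, "B05": 234.0, "B08": 2929.3}

# Assumed definitions; see the hedges in the paragraph above.
features = {
    "green_red_ratio": band_means["B03"] / band_means["B04"],
    "green_blue_ratio": band_means["B03"] / band_means["B02"],
    "red_blue_ratio": band_means["B04"] / band_means["B02"],
    "NDVI_B04": normalized_difference(band_means["B08"], band_means["B04"]),
    "NDVI_B05": normalized_difference(band_means["B08"], band_means["B05"]),
}

for name, value in features.items():
    print(f"{name}: {value:.3f}")
```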

Cyanobacteria estimates are then generated by a LightGBM model, a gradient-boosted decision tree algorithm. The model was trained using "in situ" labels collected manually by many organizations across the U.S.

Density values are discretized into severity levels using the WHO guidelines, and a prediction csv is written out.

Example predictions csv row:

date latitude longitude density_cells_per_ml severity
2019-08-26 38.9725 -94.67293 426,593 high