There are three key types of data you need to collect to process information from a Mothbox.
- Deployment
- Metadata
- Species List
Basic Organization
We organize the photographic data from Mothbox “Deployments” as shown below:
.
└── Dataset/ (Collection of deployments)
    └── Dataset_PROJECT_SITE_DEVICE_YYYY-MM-DD/ (Deployment Folder)
        ├── metadata.csv
        ├── species_list.csv
        ├── YYYY-MM-01 (first nightly folder of a deployment)
        ├── YYYY-MM-02 (second nightly folder...)
        └── YYYY-MM-03/ (Third nightly folder...)
            ├── DEVICE_YYYY-MM-DD-HH-MM-SS_HDR0.jpg (Raw Image collected)
            ├── DEVICE_YYYY-MM-DD-HH-MM-SS_HDR0_botdetection.json (auto-created Yolo detection with auto-ID)
            ├── DEVICE_YYYY-MM-DD-HH-MM-SS_HDR0.json (human, ground truth detection data, optional)
            └── patches/ (an automatically created folder made by the detection script)
                └── DEVICE_YYYY-MM-DD-HH-MM-SS_HDR0_PATCHINDEX_DETECTIONMODEL.pt.jpg.jpg (cropped image of one detection)
In addition to the Deployment photo data, there are two other files you will need to completely process your data.
- Metadata CSV
  - This ties the photos and IDs to metadata like location and date.
- Species List
  - This improves the automatic identification process by limiting its guesses to creatures that could occur in your chosen location and taxonomic group (e.g. Insecta, or more broadly Arthropoda).
These two files don’t have to be organized in any special way, but we keep examples of these files in the AI folder of the GitHub repo. You can also place them inside each Deployment folder if you want, as we show above.
Please Organize Your Data like this!
There might be better ways to organize this stuff, and we are open to suggestions! But for now, it’s important to try to organize your data like this because the scripts we are whipping together to process your images rely on an organization and naming structure like this to tie together the different types of data!
Below, we will discuss more particulars about good ways to organize the Deployment, Metadata, and Species List.
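As a rough illustration of the layout described above, here is a minimal sketch that reports what it finds in one Deployment folder. The function name `summarize_deployment` is hypothetical and not part of the Mothbox scripts. Because metadata.csv and species_list.csv may live elsewhere, the sketch only reports whether they are present and does not treat their absence as an error.

```python
import re
from pathlib import Path

# Nightly folders are named YYYY-MM-DD.
NIGHT_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def summarize_deployment(deployment_dir):
    """Report which expected pieces exist in one Deployment folder.

    Hypothetical helper for illustration only.
    """
    root = Path(deployment_dir)
    nights = sorted(
        p.name for p in root.iterdir() if p.is_dir() and NIGHT_RE.match(p.name)
    )
    return {
        "has_metadata": (root / "metadata.csv").is_file(),
        "has_species_list": (root / "species_list.csv").is_file(),
        "nights": nights,
    }
```

Calling `summarize_deployment("path/to/deployment")` returns a small dictionary that you can inspect before running the processing scripts.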
Deployment
Each “deployment” is the data from one device left out in the field somewhere.
Deployment Name
The deployment has a unique name like this:
PROJECT_SITE_DEVICE_YYYY-MM-DD
The “Project” is the broad project that you are collecting this data for. You could name it something like “BatSurvey” or “MtTotumasDrySeason” (No spaces).
The “Site” is a human name for the very specific place you left the Mothbox, like “TreeNearLodge” (No Spaces)
The “Device” is a unique name that the Mothbox calls itself. These names are derived from the internal serial number of the Raspberry Pi on the Mothbox, combined with a list we made of Spanish and English verbs, nouns, and adjectives, for example “FuerteFrog”.
Then there is a date stamp that marks the first day the Mothbox was left out in the field, like 2024-04-30. The format is YYYY-MM-DD.
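To make the naming convention concrete, here is a minimal sketch that splits a deployment name into its parts. `parse_deployment_name` and `DeploymentName` are hypothetical names, not part of the Mothbox scripts. The sketch assumes the last four underscore-separated pieces are PROJECT, SITE, DEVICE, and the date, and keeps anything before them as an optional prefix.

```python
from datetime import date, datetime
from typing import NamedTuple, Tuple


class DeploymentName(NamedTuple):
    prefix: Tuple[str, ...]  # anything before PROJECT (assumed optional)
    project: str
    site: str
    device: str
    start_date: date


def parse_deployment_name(name: str) -> DeploymentName:
    """Split PROJECT_SITE_DEVICE_YYYY-MM-DD into named parts."""
    parts = name.split("_")
    if len(parts) < 4:
        raise ValueError(f"expected PROJECT_SITE_DEVICE_YYYY-MM-DD, got {name!r}")
    *prefix, project, site, device, day = parts
    # strptime raises ValueError if the last piece is not a valid date.
    start = datetime.strptime(day, "%Y-%m-%d").date()
    return DeploymentName(tuple(prefix), project, site, device, start)


parsed = parse_deployment_name("MtTotumasDrySeason_TreeNearLodge_FuerteFrog_2024-04-30")
print(parsed.device, parsed.start_date)  # FuerteFrog 2024-04-30
```

This is also why the Project and Site names should not contain spaces or underscores of their own: an extra underscore would shift the pieces.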
Nightly Folders
A deployment usually has several nights. Each night is collected in its own folder. The nightly folders are automatically created by the Mothbox and have a basic format:
YYYY-MM-DD
A special note about Mothbox “nights”: since most of our data collection happens at night, each night for these folders runs from 12:00 PM (noon) of the first day until 11:59 AM of the next day. In this way, images captured at, for instance, 3 AM are considered part of the same night that started 8 hours earlier at 7 PM the preceding day.
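The noon-to-noon rule can be sketched by shifting a timestamp back 12 hours and taking the date. The function name `night_folder` is hypothetical, and the Mothbox's own code may compute this differently. The sketch assumes timestamps are in the device's local time.

```python
from datetime import datetime, timedelta


def night_folder(taken_at: datetime) -> str:
    """Return the YYYY-MM-DD nightly folder name for a photo timestamp.

    A "night" runs from noon to just before noon the next day, so shifting
    the timestamp back 12 hours yields the date the night started on.
    """
    return (taken_at - timedelta(hours=12)).date().isoformat()


print(night_folder(datetime(2024, 4, 30, 19, 0)))  # 2024-04-30
print(night_folder(datetime(2024, 5, 1, 3, 0)))    # 2024-04-30
print(night_folder(datetime(2024, 5, 1, 12, 0)))   # 2024-05-01
```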
Samples
Each data “sample” consists of a set of grouped files.
DEVICE_YYYY-MM-DD-HH-MM-SS.jpg (Raw Image collected)
DEVICE_YYYY-MM-DD-HH-MM-SS_botdetection.json (Bot created labels)
DEVICE_YYYY-MM-DD-HH-MM-SS.json (Human created labels)
- Raw Image
The “raw” photos we capture look like this. They are insects on a white background.
- Bot created labels
Each sample photo might also have a similarly named file next to it, but the file type is “.json” and the file name ends with “botdetection.” These are files generated by automated means to detect where the insects are in the photo. Generally these files are made by the Mothbot_Detect.py script.
- Human created labels
There are files that have the same name as the Raw Image but end with “.json”. These are human-created “Ground-Truth” datasets. They don’t have “botdetection” at the end of their file names.
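Since the three files in a sample share a base name, they can be grouped by that name. The following is a minimal sketch under that assumption. `collect_samples` is a hypothetical helper and not one of the Mothbox scripts. It looks only at the top level of a nightly folder, so images inside patches/ are not picked up.

```python
from pathlib import Path

BOT_SUFFIX = "_botdetection.json"


def collect_samples(night_dir):
    """Map each raw image's base name to its image and label files.

    Label entries are None when the corresponding file does not exist.
    """
    night = Path(night_dir)
    samples = {}
    # glob("*.jpg") is not recursive, so subfolders are skipped.
    for image in sorted(night.glob("*.jpg")):
        stem = image.stem
        bot = night / f"{stem}{BOT_SUFFIX}"
        human = night / f"{stem}.json"
        samples[stem] = {
            "image": image,
            "bot_labels": bot if bot.is_file() else None,
            "human_labels": human if human.is_file() else None,
        }
    return samples
```

A sample with `human_labels` set has ground-truth data, and one with only `bot_labels` has been through automatic detection but not human review.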
Metadata
Just as important as the photographic data you collect is the metadata about your deployments. We need to create a metadata file for each raw photo. It contains information about the sampling, such as:
- occurrenceID (file name with the unique timestamp of the specific individual photo, e.g. “gradoVerdín_2024_07_25__21_12_05_HDR0_crop_0.jpg”)
- basisOfRecord (e.g. MACHINE_DETECTED)
- deployment ID
- eventDate (timestamp)
- GPS data
- raw_photo (location of the original “raw photo”)
- identifier (Who did the most up-to-date ID? e.g. “Mothbot” or “Hubert Szczygiel”)
- cv_confidence (how confident the AI was in detecting this, if machine detected)
- Taxonomic information: class, order, family, genus, species, commonName, scientificName
You should fill out a row on that form for each of your deployments.
We have printable forms that field technicians can take to their sites:
Alternatively, fill out
Remember to collect this metadata for EVERY SINGLE DEPLOYMENT or else it is not useful in the end!
All photos from a single deployment should be in a folder named with the convention: “PROJECT_SITE_MOTHBOXID_YYYY-MM-DD” (the COUNTRY_ prefix is optional)
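As an illustration of the fields listed in this section, here is a minimal sketch that writes them as CSV columns. The exact column names for “deployment ID” and “GPS data” are assumptions (`deploymentID`, `decimalLatitude`, `decimalLongitude`), and the example values for `deploymentID`, `raw_photo`, and `cv_confidence` are placeholders. Check the example files in the repo for the real header.

```python
import csv
import io

# Column names follow the field list above; deploymentID and the two GPS
# columns are assumed names, not confirmed by the Mothbox scripts.
FIELDS = [
    "occurrenceID", "basisOfRecord", "deploymentID", "eventDate",
    "decimalLatitude", "decimalLongitude", "raw_photo", "identifier",
    "cv_confidence", "class", "order", "family", "genus", "species",
    "commonName", "scientificName",
]


def rows_to_csv(rows):
    """Serialize metadata rows to CSV text; missing fields become empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


example_row = {
    "occurrenceID": "gradoVerdín_2024_07_25__21_12_05_HDR0_crop_0.jpg",
    "basisOfRecord": "MACHINE_DETECTED",
    "deploymentID": "PROJECT_SITE_gradoVerdín_2024-07-25",  # placeholder
    "eventDate": "2024-07-25T21:12:05",
    "raw_photo": "gradoVerdín_2024_07_25__21_12_05_HDR0.jpg",  # placeholder
    "identifier": "Mothbot",
    "cv_confidence": "0.87",  # placeholder value
    "class": "Insecta",
}
csv_text = rows_to_csv([example_row])
print(csv_text)
```

Fields that are not known yet, such as GPS coordinates or lower taxonomic ranks, are simply left empty in this sketch.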
Species List
The species list is used by the identification script to narrow down the possibilities of what it is trying to guess. Using GBIF’s species list generator, you can narrow down the possibilities by taxa or location. For example, you could download this list of only the insects that are in Panama.
If you want to go super broad, you could just try to get a list of all arthropods, or you could limit things to a specific family of moths. It’s up to you!
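If you already have a broad list and want a narrower one, filtering by a taxonomic column is one option. The sketch below assumes a tab-separated file with columns named `scientificName`, `class`, `order`, and `family`. These column names and the three sample rows are illustrative only and are not taken from a real download, so adjust them to match your file.

```python
import csv
import io

# Illustrative rows only; a real species list will have many more columns.
SAMPLE = (
    "scientificName\tclass\torder\tfamily\n"
    "Rothschildia lebeau\tInsecta\tLepidoptera\tSaturniidae\n"
    "Urania fulgens\tInsecta\tLepidoptera\tUraniidae\n"
    "Apis mellifera\tInsecta\tHymenoptera\tApidae\n"
)


def load_species(text):
    """Parse tab-separated species list text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_by_rank(rows, rank, value):
    """Keep rows whose `rank` column equals `value`, ignoring case."""
    wanted = value.strip().lower()
    return [r for r in rows if (r.get(rank) or "").strip().lower() == wanted]


rows = load_species(SAMPLE)
lepidoptera = filter_by_rank(rows, "order", "Lepidoptera")
saturniids = filter_by_rank(rows, "family", "Saturniidae")
print(len(lepidoptera), [r["scientificName"] for r in saturniids])
```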
Processing Pipeline
Detection
The first thing we need to do when processing mothbox data is to “detect” where all the creatures are in a “raw photo.”
In other words, we want to go from a raw photo that might have many insects:
to a collection of many small photos that each only show one insect.
You can run this stage of the processing automatically with the Mothbot Detection script.
Identification
Next in the processing steps, we feed all those detections in another pass to a different script called Mothbox_ID.py, which uses BioCLIP to automatically ID the different creatures detected.
It will try to give each detection a label based on what type of creature it thinks it is. It will also perform an additional filtering step and label any incorrect detections as an Error (for instance, if a piece of dirt or a blurry photo was detected).
You can run this stage of the processing automatically with the Mothbot ID script.
Editing the Database
Finally, there are some remaining scripts that let a human expert go through the automatically detected and identified data to identify things to deeper taxonomic levels or fix incorrect IDs.
You can run this stage of the processing automatically with the Mothbot Create Database script.
Start Processing
Go to the next steps in this section to start processing your data!