At ruumi we help farmers with rotational grazing by using satellite data to fit statistical models for estimating grass cover.
Multi-spectral images work well on cloudless, sunny days, and thin and cirrus clouds can be removed with machine learning. To tackle thick clouds, we are integrating satellite radar data, which pierces through all cloud types, day and night.
The Sentinel-1 constellation provides us with satellite radar data at least twice a week for this lovely planet we call Earth. But the Sentinel-1 data comes at a price: the images cannot be used as-is and need compute-heavy pre-processing.
Compare, for example, a multi-spectral image, the raw Sentinel-1 data, and the same data after applying the recommended pre-processing steps from the literature.
For one week's worth of data, Europe alone amounts to roughly 2.5 TB. We needed a way to easily scale out the compute-heavy pre-processing at a reasonable price and at planet scale.
So Much Data, So Little Time⌗
To scale out the compute-heavy pre-processing, we decided to run auto-scaling clusters on the spot market (for massive cost savings) with AWS Batch. This allows us to automatically scale out to thousands of workers when we need to, and to shut down the clusters when we are done.
The high-level ideas can be summarized as follows:
- We provide a Docker image for each task to run, for example reading a Sentinel-1 tile from S3, pre-processing it, and saving the results back to S3
- We deploy the batch environment on AWS, orchestrating auto-scaling clusters, limits and constraints, and job queues for scheduling messages
- We enqueue one message for every Sentinel-1 tile; the messages get assigned to the clusters, which scale out automatically
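As a minimal sketch of the last step, one AWS Batch job could be submitted per tile with boto3 (the repository itself uses the CDK; the queue name, job definition name, and tile ID below are hypothetical placeholders):

```python
def build_job_request(tile_id: str) -> dict:
    """Build the arguments for one Batch submit_job call per Sentinel-1 tile.

    The queue and job definition names are placeholders, not the real ones.
    """
    return {
        "jobName": f"preprocess-{tile_id}",
        "jobQueue": "s1-preprocess-queue",
        "jobDefinition": "s1-preprocess",
        # The worker container reads the tile ID from an environment variable.
        "containerOverrides": {
            "environment": [{"name": "TILE_ID", "value": tile_id}],
        },
    }


def submit_tiles(tile_ids: list[str]) -> None:
    """Enqueue one job per tile; Batch then scales the cluster out."""
    import boto3  # imported lazily so the request builder stays dependency-free

    batch = boto3.client("batch")
    for tile_id in tile_ids:
        batch.submit_job(**build_job_request(tile_id))
```

Keeping the request builder separate from the client call makes the enqueue logic easy to unit-test without touching AWS.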
While we use the AWS CDK to maintain our infrastructure as code, we ran into some issues with the finer details. That is why we decided to open-source a starter repository, to benefit the broader community and share what we have learned.
The project contains everything you need to scale out your own workload in a few easy steps. In addition, we shed light on some darker corners of the CDK’s Batch integration:
- How to provide additional scratch disk space for your workloads
- How to inject secrets into the container using AWS Secrets Manager
- How to use multiple compute environments with different priorities on the spot market
- Everything else you need around the tasks, like ECR for managing the Docker images
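To make the first two points concrete, here is a sketch of the raw AWS Batch `containerProperties` they translate to (the CDK generates an equivalent structure; the image URI, secret ARN, paths, and resource sizes below are illustrative assumptions, not values from the repository):

```python
def build_container_properties(image_uri: str, secret_arn: str) -> dict:
    """Batch job-definition containerProperties with scratch space and a secret.

    All names, paths, and sizes here are placeholder assumptions.
    """
    return {
        "image": image_uri,
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "16384"},  # MiB
        ],
        # Scratch space: a host volume mounted into the container.
        "volumes": [{"name": "scratch", "host": {"sourcePath": "/scratch"}}],
        "mountPoints": [
            {"sourceVolume": "scratch", "containerPath": "/scratch", "readOnly": False}
        ],
        # A Secrets Manager entry exposed to the container as an env var.
        "secrets": [{"name": "API_TOKEN", "valueFrom": secret_arn}],
    }
```

The secret value never appears in the job definition itself; Batch resolves the ARN and injects it at container start.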
Let us know what you think! And of course contributions are welcome.
About The Author⌗
Britta is an engineer at ruumi, working on planet-scale solutions for satellite images. In recent weeks she has been focusing on the satellite radar integration.
Want to leave feedback? Reach out at firstname.lastname@example.org ❤️