OlmoEarth v1.1: A More Efficient Family of Earth Observation Models
Introduction
We released OlmoEarth (v1) in November 2025. Since then, our partners have used it for a wide range of tasks, including mangrove change tracking, forest loss classification, and producing country-scale crop-type maps within days. Every release brings us closer to our goal: making state-of-the-art AI accessible for organizations and communities dedicated to protecting people and the planet.
Increasing Efficiency
The OlmoEarth models are transformer-based architectures, one of the dominant approaches in machine learning today. To process remote sensing data like Sentinel-2 imagery, we convert it into a sequence of tokens that the model can ingest. Two key levers control efficiency: model size and token sequence length.
The cost of self-attention scales quadratically with token sequence length, so even modest reductions in sequence length can significantly cut running costs. We designed OlmoEarth v1.1 to reduce compute by up to 3x while maintaining performance on a range of research benchmarks and tasks we’ve developed with our partners.
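As a rough illustration of why sequence length matters, the sketch below counts approximate forward-pass FLOPs for one standard transformer encoder layer. It is a back-of-envelope estimate, not a measurement of OlmoEarth: the function names, the 4x MLP expansion, and the example sizes are assumptions chosen for illustration.

```python
def layer_flops(n_tokens: int, d_model: int) -> int:
    """Approximate forward FLOPs for one standard transformer encoder layer.

    Assumes a 4x MLP expansion and ignores layer norms, softmax, and biases.
    """
    # Q, K, V, and output projections (8 * n * d^2) plus the MLP (16 * n * d^2).
    linear = 24 * n_tokens * d_model ** 2
    # Attention scores (Q @ K^T) plus the weighted sum over V: quadratic in n.
    attention = 4 * n_tokens ** 2 * d_model
    return linear + attention


def speedup(n_before: int, n_after: int, d_model: int) -> float:
    """Ratio of estimated layer cost before and after shortening the sequence."""
    return layer_flops(n_before, d_model) / layer_flops(n_after, d_model)


# Hypothetical sizes: cut the sequence length to one third in two regimes.
short = speedup(n_before=192, n_after=64, d_model=768)
long = speedup(n_before=12288, n_after=4096, d_model=768)
print(f"short sequences: {short:.2f}x, long sequences: {long:.2f}x")
```

Under these assumptions, the per-token terms shrink linearly with sequence length while the attention term shrinks quadratically, so the estimated benefit of a shorter sequence grows as sequences get longer. Measured savings also depend on hardware, implementation, and the parts of the pipeline that do not depend on sequence length.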
Efficiency through Token Length Reduction
Of the two levers that control efficiency in transformer-based models, model size lets users pick the model that fits their compute budget, while token sequence length determines how much work each forward pass requires.
Lowering the token sequence length means fewer operations are needed for each forward pass, making the computation cheaper and faster.
(Figure: the y-axis is inverted because lower average rank values indicate better efficiency. Labels show different model families and sizes.)
Designing the Token
This raises a crucial question: what should a token represent in transformer-based remote sensing models? Take Sentinel-2 imagery, a common modality we process. A Sentinel-2 input is typically represented as a tensor with dimensions [H, W, T, D=12], where H and W are the spatial dimensions (pixels along latitude and longitude), T is the number of timesteps, and D is the number of spectral bands.
Currently, we split this data into resolution-based patches. For example, if we pick a spatial patch size of p, our overall Sentinel-2 image is divided into patches of size p x p:
- A single token per timestep per resolution is created for each patch.
- This results in H/p x W/p x T x 3 tokens for a 12-band input (D=12), because Sentinel-2 bands come in three native resolutions (10 m, 20 m, and 60 m).
Using a unique token per resolution is common, as seen with models like Galileo and SatMAE. However, collapsing all resolutions into a single token per patch and timestep reduces the number of tokens significantly, leading to substantial compute savings across pretraining, fine-tuning, and inference.
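The token counts for the two designs can be compared with a small calculation. This is a minimal sketch under stated assumptions: `token_count` is a hypothetical helper, and the image size, number of timesteps, and patch size are example values, not OlmoEarth defaults.

```python
def token_count(height: int, width: int, timesteps: int,
                patch_size: int, tokens_per_patch: int) -> int:
    """Number of tokens for an [H, W, T, D] input split into p x p patches.

    tokens_per_patch is 3 when each resolution group gets its own token,
    and 1 when all resolutions are merged into a single token.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    patches = (height // patch_size) * (width // patch_size)
    return patches * timesteps * tokens_per_patch


# Hypothetical input: 128 x 128 pixels, 12 timesteps, patch size 8.
per_resolution = token_count(128, 128, 12, 8, tokens_per_patch=3)
merged = token_count(128, 128, 12, 8, tokens_per_patch=1)
print(per_resolution, merged)  # 9216 3072
```

For this example, merging the three resolution tokens cuts the sequence to one third of its original length, independent of the patch size or the number of timesteps.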
Merging Tokens Without Performance Loss
To combine these tokens without compromising performance, we had to modify our pre-training regimen. We detailed these changes in our paper. The result is a model family that does more with less: at every size, OlmoEarth v1.1 runs up to three times cheaper than OlmoEarth v1. This makes frequent, planet-scale map refreshes more affordable for any team using OlmoEarth.
For Developers
If you’re using a model from the original OlmoEarth family, try OlmoEarth v1.1. It provides similar performance to OlmoEarth v1 while requiring as little as one third of the compute.
For Researchers
We train OlmoEarth v1.1 on the same dataset as OlmoEarth v1, so any differences between the two can be attributed to methodological changes rather than to the data. This helps advance the scientific understanding of pretraining for remote sensing models.
Get Started
Check out the OlmoEarth v1.1 weights and training code, including the weights for our Base, Tiny, and Nano models.
Key Takeaways
- OlmoEarth v1.1 is a more efficient family of Earth observation models that cuts compute costs by up to 3x while maintaining performance on key tasks.
- The efficiency gains come from reducing the token sequence length, which lowers the computation needed for each forward pass.
- For researchers, v1.1 is trained on the same dataset as v1, so differences between the two reflect changes in pretraining methodology rather than changes in data.