Calvalus - Fast Processing and Exploitation of Full-Mission Datasets
Boettcher, Martin; Fomferra, Norman; Zuehlke, Marco; Brockmann, Carsten
Brockmann Consult GmbH, GERMANY

Earth Observation (EO) missions provide a unique dataset of observations of our environment. To make full use of it, a powerful multi-mission processing infrastructure is required. It must allow full-mission products to be generated with new algorithms and processor versions, results to be aggregated in the temporal and spatial dimensions, calibration and validation to be supported as the basis for reliable scientific conclusions, and new ideas to be tested in a rapid prototyping approach. ESA Climate Change Initiative (ESA-CCI) projects face this challenge, which will grow further with the data volumes produced by the Sentinel instruments.

Key features of a multi-mission processing infrastructure are the following:

Network transfer limits scalability in traditional archive-based systems and also in the cloud. In contrast, the data-local approach uses a distributed file system (DFS) to store inputs directly on the computing nodes of a cluster. Intermediate and output products are usually stored on the local processing node as well, so that only requests controlled by the job scheduler travel over the network. The exception, of course, is aggregation, since it combines data from different sources. It can be supported by an efficient concurrent approach called "map-reduce", which streams the data in a controlled way and avoids storing intermediate results.
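
To illustrate the pattern, the following is a minimal sketch of a map-reduce aggregation in Java using the Hadoop MapReduce API: the mapper emits each observation keyed by its spatial bin, and the reducer combines all observations of a bin into a mean. The record format, class names, and the per-bin mean are illustrative assumptions, not the actual Calvalus binning code.

    // Minimal map-reduce aggregation sketch (illustrative, not Calvalus code).
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Maps one text record "binIndex,value" to the pair (binIndex, value). */
    class BinMapper extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            long binIndex = Long.parseLong(fields[0]);
            double value = Double.parseDouble(fields[1]);
            // Emit the observation keyed by its spatial bin; the framework
            // streams and groups these pairs without user-visible intermediate files.
            context.write(new LongWritable(binIndex), new DoubleWritable(value));
        }
    }

    /** Reduces all values of one bin to their mean. */
    class BinReducer extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
        @Override
        protected void reduce(LongWritable binIndex, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(binIndex, new DoubleWritable(sum / count));
        }
    }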

Fault tolerance is a must when processing large amounts of data. This concerns both the hardware infrastructure and the processing. Data replication provides reliable storage, and the failure of a computing node automatically leads to re-scheduling of its processing tasks on other nodes. Failure to process a certain input leads to automatic retries and finally to labelling and error reporting, without necessarily stopping the overall production. After an interruption, only steps that had not yet been completed are re-run.
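
As a rough illustration, a Hadoop-based cluster can express this behaviour through configuration properties such as the ones below; the concrete values are assumptions for the sketch, not the settings used in Calvalus.

    // Sketch of replication and retry configuration for a Hadoop job
    // (property values are illustrative assumptions).
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class FaultTolerantJob {
        public static Job configure() throws IOException {
            Configuration conf = new Configuration();
            // Keep each HDFS block on three nodes so the loss of a node loses no data.
            conf.setInt("dfs.replication", 3);
            // Retry a failed map task on another node up to four times
            // before reporting it as failed.
            conf.setInt("mapreduce.map.maxattempts", 4);
            // Tolerate a small fraction of failed inputs (here 5 % of map tasks)
            // without aborting the whole production.
            conf.setInt("mapreduce.map.failures.maxpercent", 5);
            return Job.getInstance(conf, "full-mission production");
        }
    }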

Product quality assurance for inputs, outputs, and often also intermediate products has turned out to be essential to avoid invalidating an aggregation result by a single erroneous input. QA comprises automated as well as visual steps. Instant match-ups with in situ observations and trend analyses are enabled as additional means of validation. They are examples of information reduction over complete missions that can efficiently be done in such a processing infrastructure.
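
The match-up idea can be sketched as a simple criterion: a satellite pixel is accepted as a match-up for an in situ measurement if both lie within a time window and a spatial radius. The 3-hour window and 1 km radius below are illustrative assumptions, not the protocol of any particular project.

    // Sketch of a match-up criterion for validation against in situ data
    // (time window and radius are illustrative assumptions).
    import java.time.Duration;
    import java.time.Instant;

    public class MatchupCriterion {
        private static final Duration MAX_TIME_DIFF = Duration.ofHours(3);
        private static final double MAX_DISTANCE_KM = 1.0;

        /** True if a satellite pixel observation matches an in situ measurement. */
        public static boolean matches(Instant pixelTime, double pixelLat, double pixelLon,
                                      Instant situTime, double situLat, double situLon) {
            boolean closeInTime =
                    Duration.between(situTime, pixelTime).abs().compareTo(MAX_TIME_DIFF) <= 0;
            return closeInTime
                    && haversineKm(pixelLat, pixelLon, situLat, situLon) <= MAX_DISTANCE_KM;
        }

        /** Great-circle distance in kilometres. */
        static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
            double earthRadiusKm = 6371.0;
            double dLat = Math.toRadians(lat2 - lat1);
            double dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                     * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
        }
    }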

A processor repository, automated deployment, and support for concurrent versions enable the use of the same infrastructure for reliable production and for testing. Revised versions of algorithms, new products, instrument trends, and calibration refinements can be evaluated efficiently on vast datasets or mission-long time series. In this way, development also has access to all data, though it may get a different share of the processing resources. Fair scheduling allows the infrastructure to be used in a multi-mission, multi-project way: whenever a project requests resources, it gets at least its share.
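
On a Hadoop-based cluster, one way such a share could be requested is by submitting each production to a project-specific scheduler queue, as in the sketch below; the queue name is a placeholder, not a Calvalus convention.

    // Sketch of assigning a production request to a per-project queue
    // so the scheduler can enforce fair shares (queue name is a placeholder).
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ProjectQueueExample {
        public static Job submitToProjectQueue(String projectQueue) throws IOException {
            Configuration conf = new Configuration();
            // With the Fair or Capacity Scheduler, each project queue is
            // guaranteed at least its configured share of the cluster.
            conf.set("mapreduce.job.queuename", projectQueue);
            return Job.getInstance(conf, "production for " + projectQueue);
        }
    }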

A system exemplifying this approach has been developed on the basis of Apache Hadoop in the Calvalus project. The initial Calvalus concept was developed under an ESA LET-SME study. Apache Hadoop is a middleware that allows for the distributed processing of large datasets across clusters of computers using the map-reduce programming model and a robust distributed file system. These technologies, initially developed and published by Google Inc., provide massive parallelisation of tasks by data-local processing, thereby avoiding data transfer bottlenecks. Calvalus applies these technologies to mission-wide EO datasets and combines them with ESA BEAM as a processor framework and several additional executable processors. Calvalus systems are currently operational for several projects, for example ESA CoastColour, the ESA Land Cover and Ocean Colour CCI projects, Fronts (German BSH), and SIOCS (ESA). The ongoing evolution of Calvalus comprises features like processor deployment by users (to transfer algorithms to the data), improved resource management, and efficient ingestion and extraction for NRT applications.
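
The integration of BEAM as a processor framework can be sketched as follows: on each node, a processing step reads an input product, runs a GPF operator on it, and writes the result locally. The operator name "MyProcessor" and its parameters are placeholders for an actual deployed processor, not part of the Calvalus code.

    // Sketch of one data-local processing step using the BEAM GPF API
    // (the operator name and parameters are placeholders).
    import java.io.IOException;
    import java.util.HashMap;
    import org.esa.beam.framework.dataio.ProductIO;
    import org.esa.beam.framework.datamodel.Product;
    import org.esa.beam.framework.gpf.GPF;

    public class BeamProcessingStep {
        public static void process(String inputPath, String outputPath) throws IOException {
            // Read the EO input product from the node-local copy.
            Product source = ProductIO.readProduct(inputPath);
            // Parameters of the hypothetical processor "MyProcessor".
            HashMap<String, Object> params = new HashMap<>();
            params.put("algorithmVersion", "1.0");
            Product target = GPF.createProduct("MyProcessor", params, source);
            // Write the result locally in BEAM-DIMAP format.
            ProductIO.writeProduct(target, outputPath, "BEAM-DIMAP");
        }
    }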