Both distributed aggregation and replication for high availability (yes, I am thinking of CRDTs) are techniques that can help tackle geo-replication, offline operation and edge/fog computing. Distributed aggregation often shares many properties with CRDT-style convergent replication, but they are not the same concept. I have seen people struggle to separate the two concepts in many settings, and this prompted me to attempt a clarification.
The main difference is that in replication there is an abstraction of a single replicated state that can be updated at any of the locations where a replica is present. This state is not owned by any given replica, but any replica can evolve it by applying operations that transform the shared state. This notion applies in both strong consistency and high availability settings, the difference being that in highly available replication the replicas are allowed to diverge and later reconcile. Another factor is that operations that lead to state changes are often the result of the activity of an external user who interacts with the system, e.g. adjusting the target room temperature up by 2 degrees. As such, different users can perform conflicting actions, either concurrently or in sequence (most of us, in our childhood, had on/off light switch fights with other kids and adults).
Distributed data aggregation refers to several data aggregation techniques that are common in sensor network settings and datacenter infrastructure monitoring. In contrast to replication, each node/location has access to its own local data, e.g. CPU utilisation or a local measurement of humidity levels, and typically this data can evolve continuously. Also, the data to be aggregated is often not directly controlled by users; it usually results from an external physical process or from complex system evolutions. Thus, each sensing node usually has exclusive access to a local input value that evolves in time. The aggregation process is then tasked with collecting and transforming this information, e.g. calculating the average or the maximum value, and making it available at a specified location (sink) or disseminating it back to the nodes (by broadcasting the aggregate result). In aggregation, the source of truth for each individual measurement is the actual node that provided it.
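To make the aggregation side concrete, here is a minimal Python sketch of computing an average at a sink. It is an illustration under assumptions, not a protocol from any of the papers listed below: the node names, the readings and the `Partial` type are hypothetical, and message loss and duplication are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Partial aggregate for an average: a running total and a sample count."""
    total: float
    count: int

    def combine(self, other: "Partial") -> "Partial":
        # Reduction step: two partials collapse into one of the same size.
        return Partial(self.total + other.total, self.count + other.count)

    def average(self) -> float:
        return self.total / self.count


def local_partial(reading: float) -> Partial:
    """Each node contributes only its own local measurement."""
    return Partial(reading, 1)


# Hypothetical CPU utilisation readings, one per node (each node owns its value).
readings = {"node_a": 40.0, "node_b": 55.0, "node_c": 46.0}

# The sink folds the contributions into a single summary.
at_sink = Partial(0.0, 0)
for reading in readings.values():
    at_sink = at_sink.combine(local_partial(reading))

print(at_sink.average())
```

Note that `combine` is not idempotent: if the same partial reached the sink twice, the average would be skewed, which is one reason aggregation protocols have to care about how data travels.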
Sometimes the two concepts have in common the notion of data merging. In state-based CRDTs, operations are reflected in a semi-lattice state that can be combined with others using a join function. In data aggregation there is also often a notion of joining data together, but there is an additional aspect of data reduction and summarisation that is usually not present in CRDT designs. To add to the confusion, it is possible to combine the two concepts in a single system, as we did in the design of Scalable Eventually Consistent Counters, which combines a hierarchical CRDT design with a global aggregation and reporting facet.
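As a rough illustration of what a join function looks like, the following Python sketch implements a grow-only counter, a standard textbook state-based CRDT. The replica identifiers and function names are hypothetical, and this is not the design used in Scalable Eventually Consistent Counters.

```python
def increment(state: dict, replica_id: str, amount: int = 1) -> dict:
    """Return a new state in which replica_id has counted `amount` more."""
    new_state = dict(state)
    new_state[replica_id] = new_state.get(replica_id, 0) + amount
    return new_state


def join(a: dict, b: dict) -> dict:
    """Semi-lattice join: pointwise maximum over per-replica entries."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state: dict) -> int:
    """The counter value is the sum of all per-replica entries."""
    return sum(state.values())


# Two replicas of the same counter diverge, then reconcile.
at_a = increment({}, "a", 2)
at_b = increment({}, "b", 3)
merged = join(at_a, at_b)

assert join(at_a, at_b) == join(at_b, at_a)  # commutative
assert join(merged, at_a) == merged          # idempotent: re-merging is harmless
print(value(merged))
```

The join keeps the full per-replica map rather than reducing it to a single number, which is the kind of summarisation step that aggregation adds and CRDT designs usually avoid.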
However, ignoring corner cases, the difference can be quite clear, and recognising it can help in selecting the right tools. A final take-away example is to consider the control of room temperature: the plus/minus control that sets the set point temperature can be captured by a CRDT; the combining of different temperature sensors across the room to obtain the average temperature is distributed aggregation.
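The room temperature example can be sketched in Python as follows. Everything here is assumed for illustration: the `SetPoint` class, the controller names, the shared base value of 20 degrees and the sensor readings are all hypothetical. The set point is modelled as an increment/decrement counter in the style of a PN-Counter, while the sensor readings are simply averaged.

```python
from statistics import mean


class SetPoint:
    """Replicated set point: any controller may adjust the shared value."""

    def __init__(self, replica_id: str, base: int = 20):
        self.replica_id = replica_id
        self.base = base   # assumed to be identical at every replica
        self.ups = {}      # per-replica total of upward adjustments
        self.downs = {}    # per-replica total of downward adjustments

    def adjust(self, delta: int) -> None:
        side = self.ups if delta >= 0 else self.downs
        side[self.replica_id] = side.get(self.replica_id, 0) + abs(delta)

    def merge(self, other: "SetPoint") -> None:
        # Join: pointwise maximum on both maps.
        for mine, theirs in ((self.ups, other.ups), (self.downs, other.downs)):
            for key, amount in theirs.items():
                mine[key] = max(mine.get(key, 0), amount)

    def value(self) -> int:
        return self.base + sum(self.ups.values()) - sum(self.downs.values())


# Replication: two users act concurrently on the same logical set point.
wall = SetPoint("wall_panel")
phone = SetPoint("phone_app")
wall.adjust(+2)
phone.adjust(-1)
wall.merge(phone)
phone.merge(wall)
assert wall.value() == phone.value() == 21

# Aggregation: each sensor owns its reading; the room average is a summary.
sensor_readings = [20.0, 21.0, 22.0]
room_average = mean(sensor_readings)
print(wall.value(), room_average)
```

In the first half no controller owns the set point and both adjustments survive the merge; in the second half each reading belongs to the sensor that produced it, and only the summary is shared.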
In the context of the H2020 LightKone project we are trying to advance the state of the art in both CRDT-based highly available replication solutions and general-purpose distributed aggregation protocols. Below I leave some pointers to recent results in each of the categories.
CRDT-based high availability:
- Albert van der Linde, Pedro Fouto, João Leitão, Nuno Preguiça, Santiago Castiñeira and Annette Bieniusa. Legion: Enriching Internet Services with Peer-to-Peer Interactions, Proceedings of the 26th International Conference on World Wide Web (WWW 2017), Perth, Australia, April 3-7, 2017.
- Paulo Sérgio Almeida, Ali Shoker and Carlos Baquero. Delta state replicated data types, Journal of Parallel and Distributed Computing, Elsevier, 2018.
Distributed data aggregation:
- Pedro Costa and João Leitão. Practical Continuous Aggregation in Wireless Edge Environments, Proceedings of the 37th IEEE International Symposium on Reliable Distributed Systems (SRDS 2018), October 2-5, 2018.
- Ziad Kassam, Ali Shoker, Paulo Sérgio Almeida and Carlos Baquero. Aggregation Protocols in Light of Reliable Communication, Proceedings of the 16th IEEE International Symposium on Network Computing and Applications (NCA 2017), IEEE Computer Society, Cambridge, MA, USA, October 2017.