What is a Swarm?

A Swarm is a dynamic group of nodes that together form a distinct storage system. Data is fragmented, distributed, and replicated within a Swarm according to its configuration.

A Swarm's architecture can range from a simple flat topology to multi-layered, multi-site arrangements. This flexibility improves data resilience and distribution efficiency, even when only a handful of sites or data centers are available.

Swarm overview

A Swarm consists of Nexuses, Nodes, and Agents, which are organized into Rings and grouped under Redundancy Classes. The following sections explain each of these components in detail.

Nexus

A Nexus is a subset of nodes within a Swarm that share a strict relationship, analogous to an availability zone. It groups nodes that are geographically close, such as those located in the same region, or whose downtime probabilities or hardware performance are strongly correlated.

In enterprise-level installations, the nodes of a Nexus are typically located within the same data center, which simplifies operations and network management.

A Nexus can contain a large number of nodes. This number is not fixed and is expected to grow over time, so a Nexus can scale as storage and network demands evolve.

Nexus

Redundancy class

A Redundancy Class is a configurable data protection policy that defines the distribution and fault tolerance levels within a Swarm, based on a unique set of five parameters. These parameters are designed to optimize network resilience and data integrity in distributed systems using the Reed-Solomon error correction model.

  • Geographical Redundancy: Defined by the variables geo_n and geo_k, their sum determines the total number of Nexuses involved in the Swarm. Here, geo_n specifies the number of Nexuses required to reconstruct the original data, while geo_k represents the additional Nexuses providing redundancy and fault tolerance. This ensures robust coverage and effective data recovery, minimizing the risk of data loss in complex network environments.
  • Local Redundancy: Defined by the variables local_n and local_k, their sum indicates the total number of nodes within each Nexus. Here, local_n represents the nodes essential for data operations, while local_k adds redundancy, enhancing the resilience of each Nexus against node failures. At this level, Reed-Solomon error correction enables efficient recovery by leveraging the local network's speed and the proximity of agents, ensuring data integrity even when some nodes fail.
  • Anti-Affinity Group (AAG): Represented by the parameter aag, this defines the maximum number of nodes that can reside on the same physical machine. It prevents over-reliance on a single machine, reducing the risk of simultaneous failures. Within the Reed-Solomon framework, this ensures that data is not only distributed across nodes but also across physically separate machines, further enhancing the network’s robustness.

By integrating these parameters into the Reed-Solomon framework, the Redundancy Class provides a comprehensive approach to data redundancy and error correction. At each level, n + k sets the total number of Nexuses or nodes involved and k sets how many of them are redundant, while the aag parameter regulates how nodes are spread across physical machines. Together, these settings safeguard against data loss and corruption and keep data available when Nexuses, nodes, or machines fail.
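The five parameters can be pictured as a small record. The following is a minimal sketch, not the product's API: the class name RedundancyClass and its derived properties are hypothetical, and the total node count assumes every Nexus holds exactly local_n + local_k nodes, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyClass:
    """Hypothetical container for the five parameters described above."""
    geo_n: int    # Nexuses required to reconstruct the data
    geo_k: int    # additional Nexuses providing redundancy
    local_n: int  # nodes per Nexus essential for data operations
    local_k: int  # additional redundant nodes per Nexus
    aag: int      # maximum number of nodes on the same physical machine

    def __post_init__(self):
        # Basic sanity checks; the real system may enforce different rules.
        if self.geo_n < 1 or self.local_n < 1:
            raise ValueError("geo_n and local_n must be at least 1")
        if self.geo_k < 0 or self.local_k < 0:
            raise ValueError("geo_k and local_k cannot be negative")
        if self.aag < 1:
            raise ValueError("aag must be at least 1")

    @property
    def total_nexuses(self) -> int:
        return self.geo_n + self.geo_k

    @property
    def nodes_per_nexus(self) -> int:
        return self.local_n + self.local_k

    @property
    def total_nodes(self) -> int:
        # Assumption: every Nexus holds the same number of nodes.
        return self.total_nexuses * self.nodes_per_nexus


rc = RedundancyClass(geo_n=2, geo_k=1, local_n=4, local_k=2, aag=1)
print(rc.total_nexuses, rc.nodes_per_nexus, rc.total_nodes)  # 3 6 18
```

With these example values, and given the usual Reed-Solomon property that any n of the n + k pieces are enough to rebuild the data, the layout would be expected to tolerate the loss of one Nexus (geo_k) and of two nodes in each remaining Nexus (local_k).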

note

The size overhead introduced by the redundancy class is called ratio and is equal to $\frac{n_{geo} + k_{geo}}{n_{geo}} \times \frac{n_{local} + k_{local}}{n_{local}}$.
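As a worked example of the ratio, the sketch below evaluates the formula for two illustrative parameter sets. The function name overhead_ratio is hypothetical, and the subscripted symbols in the formula are assumed to correspond to geo_n, geo_k, local_n, and local_k.

```python
from fractions import Fraction


def overhead_ratio(geo_n: int, geo_k: int, local_n: int, local_k: int) -> Fraction:
    """Size overhead: (geo_n + geo_k) / geo_n * (local_n + local_k) / local_n."""
    geo = Fraction(geo_n + geo_k, geo_n)
    local = Fraction(local_n + local_k, local_n)
    return geo * local


# Three Nexuses (2 + 1), six nodes per Nexus (4 + 2): 3/2 * 6/4 = 9/4
print(float(overhead_ratio(2, 1, 4, 2)))  # 2.25

# Single Nexus with no geographical redundancy: 1 * 6/4 = 3/2
print(float(overhead_ratio(1, 0, 4, 2)))  # 1.5
```

A ratio of 2.25 means that storing 1 TB of data would consume roughly 2.25 TB of raw capacity across the Swarm.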

Redundancy Class

Ring

A Ring is a collection of agents within a Swarm, organized to collaboratively ensure data availability, integrity, and resilience against failures or downtime.

Rings operate under the governance of a specific redundancy class, which dictates their operational rules, such as data distribution, replication, and fault tolerance. Multiple rings can exist within the same redundancy class, each serving as an independent unit of storage capacity.

This modularity allows scalability: when additional storage is needed for a redundancy class, new rings can be seamlessly created by assigning the necessary nodes and aligning them with the redundancy policies of that class. By structuring nodes into rings, the system balances storage efficiency, fault tolerance, and operational flexibility.
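Aligning a new ring with its redundancy class can be pictured as a validation step over the candidate nodes. The sketch below is illustrative only: the Node record, the ring_violations function, and the assumption that the aag limit is checked per ring are all hypothetical and not taken from the product.

```python
from collections import Counter
from typing import List, NamedTuple


class Node(NamedTuple):
    """Hypothetical description of a candidate node."""
    name: str
    nexus: str    # Nexus the node belongs to
    machine: str  # physical machine hosting the node


def ring_violations(nodes: List[Node], geo_n: int, geo_k: int,
                    local_n: int, local_k: int, aag: int) -> List[str]:
    """Return the reasons a candidate ring does not match the redundancy class."""
    problems = []
    per_nexus = Counter(node.nexus for node in nodes)
    per_machine = Counter(node.machine for node in nodes)

    # The ring should span geo_n + geo_k Nexuses.
    if len(per_nexus) != geo_n + geo_k:
        problems.append(f"expected {geo_n + geo_k} Nexuses, found {len(per_nexus)}")
    # Each Nexus should contribute local_n + local_k nodes.
    for nexus, count in sorted(per_nexus.items()):
        if count != local_n + local_k:
            problems.append(
                f"Nexus {nexus}: expected {local_n + local_k} nodes, found {count}")
    # No physical machine may host more than aag nodes.
    for machine, count in sorted(per_machine.items()):
        if count > aag:
            problems.append(f"machine {machine}: {count} nodes exceeds aag={aag}")
    return problems


candidate = [
    Node("a1", "nexus-a", "m1"),
    Node("a2", "nexus-a", "m2"),
    Node("b1", "nexus-b", "m3"),
    Node("b2", "nexus-b", "m3"),  # shares machine m3 with b1
]
violations = ring_violations(candidate, geo_n=1, geo_k=1, local_n=1, local_k=1, aag=1)
print(violations)  # ['machine m3: 2 nodes exceeds aag=1']
```

Raising aag to 2 in this example would make the same candidate pass, at the cost of losing two nodes at once if machine m3 fails.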

note

An agent may be part of multiple rings. Within each Ring, nodes are uniquely identified by a sequence number, facilitating efficient data retrieval and recovery operations.

Node

Nodes are the building blocks of the Swarm, directly impacting the distributed system's availability, resilience, and performance. They can be physical, such as servers, or virtual, like virtual machines. Each node hosts a lightweight, containerized agent that facilitates management within the system, contributing to its overall capacity and participating in data storage and communication operations.