Redundancy

In geo-distributed object storage, keeping nodes online is crucial: it is what guarantees the highest standards of durability (99.999999999%) and availability (99.95%).

To do that, many centralized cloud companies create multiple copies of the data via strategies like RAID or bucket replication. This diversifies the risk of downtime, but it saturates bandwidth and storage capacity. In the case of RAID, the data also remains vulnerable to local disasters like fires and blackouts.

Instead, Cubbit achieves redundancy via a Reed-Solomon error-correcting code. This ensures 99.999999999% durability and 99.95% availability without considerable sacrifices in performance or storage capacity, while also guaranteeing offsite geo-distribution.

This algorithm, applied to a Galois field of 256 elements (GF(2^8)), divides a file into n shards and creates k additional redundancy shards, for a total of n + k shards. Any n of these shards are enough to retrieve the complete object or file.

Currently, n and k are set to:

  • n = 20
  • n + k = 60 (i.e., k = 40)

This choice allows payloads to persist on the network with a storage ratio of just 3 times the original payload size ((n + k) / n = 60 / 20 = 3), with higher performance and the same (or even higher) durability compared to RAID and bucket replication.
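
To make the scheme concrete, here is a minimal, self-contained sketch of this kind of erasure coding. Note the assumptions: for readability it works over the prime field GF(257) rather than the GF(2^8) Cubbit actually uses, and the functions are illustrative only, not Cubbit's implementation.

```python
import random

# Illustrative parameters mirroring the docs: n = 20 data shards,
# k = 40 parity shards; any 20 of the 60 shards recover the payload.
# NOTE: toy sketch over the prime field GF(257) for readability;
# Cubbit's actual code works over GF(2^8).
N, K, P = 20, 40, 257

def _interp(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (x_j, y_j) points, via Lagrange
    interpolation over GF(P)."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num = den = 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P  # den^-1 mod P
    return total

def encode(data):
    """Treat the n data bytes as evaluations of a degree-(n-1) polynomial
    at x = 1..n, then extend with k parity evaluations at x = n+1..n+k
    (a systematic Reed-Solomon code)."""
    assert len(data) == N
    shards = list(enumerate(data, start=1))  # (x, y) data shards
    shards += [(x, _interp(shards[:N], x)) for x in range(N + 1, N + K + 1)]
    return shards

def decode(shards):
    """Reconstruct the payload from ANY n of the n + k shards."""
    assert len(shards) >= N
    return bytes(_interp(shards[:N], x) for x in range(1, N + 1))

data = bytes(random.randrange(256) for _ in range(N))  # one 20-byte payload
shards = encode(data)
random.shuffle(shards)                                 # lose up to k shards
assert decode(shards[:N]) == data                      # any 20 suffice
```

Because any n points uniquely determine a polynomial of degree at most n - 1 over a field, any 20 of the 60 shards reconstruct the payload exactly; up to k = 40 shards can be lost with no data loss.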

Even though the Reed-Solomon algorithm makes the probability of losing a file very small, a significant number of agents in the same pool could still go offline at the same time (e.g., due to a blackout, fire, or other local disaster).

Heuristics on pool selection

To mitigate this risk, the Cubbit Coordinator selects nodes based on these five criteria (a simplified scoring sketch follows the list):

  1. Uptime: a peer with high uptime is more reliable and more likely to be available to receive shards.
  2. Geolocation: there is a balance between choosing close peers (to optimize performance) and distant ones (to diversify the risk of pool downtime due to blackouts and local disasters).
  3. Load average: the more shards a peer handles, the greater the probability of losing a payload when that peer goes offline. For this reason, a good pool should not handle too many related shards.
  4. Bandwidth: the bandwidth of the peers should be balanced inside the pool to provide a uniform transfer speed to all users.
  5. Peer history: nodes that are systematically turned off (e.g., at night) receive a low rank, so more reliable peers are chosen instead.
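
To make the ranking idea concrete, below is a hypothetical scoring sketch. The Peer fields, weights, and formulas are assumptions made for illustration; they are not the Coordinator's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Peer:
    uptime: float           # fraction of time online, in [0, 1]
    distance_km: float      # distance from the uploading client
    related_shards: int     # shards of the same payload already held
    bandwidth_mbps: float   # measured upstream bandwidth
    offline_at_night: bool  # systematically switched off by its owner

def rank(peer: Peer, pool_avg_bw_mbps: float) -> float:
    """Hypothetical score combining the five criteria; higher is better."""
    score = peer.uptime                                  # 1. uptime
    # 2. geolocation: reward some distance (risk diversification) but
    #    cap it, so nearby peers (better performance) stay competitive.
    score += 0.5 * min(peer.distance_km / 1000.0, 1.0)
    # 3. load average: penalize peers already holding related shards.
    score -= 0.5 * min(peer.related_shards / 10.0, 1.0)
    # 4. bandwidth: penalize deviation from the pool average to keep
    #    transfer speed uniform across users.
    score -= 0.3 * abs(peer.bandwidth_mbps - pool_avg_bw_mbps) / max(pool_avg_bw_mbps, 1.0)
    # 5. peer history: demote nodes that are regularly offline at night.
    if peer.offline_at_night:
        score -= 0.5
    return score

def select_pool(candidates: list[Peer], size: int = 60) -> list[Peer]:
    """Pick the best-ranked peers for the n + k = 60 shards of a payload."""
    avg_bw = sum(p.bandwidth_mbps for p in candidates) / len(candidates)
    return sorted(candidates, key=lambda p: rank(p, avg_bw), reverse=True)[:size]
```

The point of the sketch is only to show how the five criteria can combine into a single rank used to pick the n + k = 60 target nodes; the actual Coordinator would weigh these signals against live network measurements.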

On top of peer selection, Cubbit also implements a file recovery strategy that crowdsources the power of peer-to-peer nodes. You can check out the file recovery documentation.