What is Cubbit DS3?
Intro
Cubbit DS3 is a geo-distributed cloud object storage platform. Unlike centralized cloud providers, Cubbit runs on a peer-to-peer (P2P) network where each node is connected to multiple others, with a central hub — the Coordinator — optimizing the network and making it increasingly faster and more efficient with time.
The intelligence of the Coordinator, combined with the power of distribution, is what makes Cubbit one of a kind: a geo-distributed cloud object storage that provides superior security, scalability, and cost-effectiveness.
Key concepts of Cubbit DS3 include:
- P2P network: This section explains the foundation of the Cubbit network, its benefits, and how it operates to provide increased resilience against node failures and maintain data integrity.
- Architecture: This section delves into the architecture of Cubbit DS3, describing its components and their interactions, which enable optimized storage efficiency, scalability, and accessibility.
- Object storage: This section describes the object storage employed by Cubbit DS3, which enables the efficient storage and management of data in a scalable and secure manner. It also discusses the advantages of object storage over other storage methods and how it is implemented within the Cubbit platform.
- Redundancy: This section discusses the redundancy features of Cubbit DS3, explaining how data is fragmented and stored across multiple nodes in the P2P network to ensure high availability.
- Coordinator: This section covers the role and functionality of the Coordinator, the central hub responsible for optimizing the network, enhancing fault tolerance, and ensuring efficient file recovery.
- File recovery: This section outlines the file recovery process in Cubbit DS3, detailing how the platform supports seamless file recovery by reconstructing data from the available fragments, even in the event of node failures.
P2P Network
Cubbit's P2P network is the physical layer that serves as the backbone of Cubbit. The network — also called Swarm — is composed of multiple Cubbit nodes, working together to provide secure and scalable cloud storage for users.
Nodes can be either physical or virtual. A physical node is a Cubbit Cell, a standalone device that acts as both a data storage unit and relay node, enabling data to be securely stored and shared on the network, while a virtual node is a Cubbit Cell virtualized on top of the customer's existing on-prem infrastructure.
To ensure maximum security, users' data is encrypted with AES-256 and split into chunks, which are then processed into multiple redundant shards via Reed-Solomon error-correcting codes and safely spread across the network through peer-to-peer, end-to-end encrypted channels. Because of this, no Cubbit Cell stores any file or object in its entirety, not even its owner's files. Instead, it stores encrypted shards of multiple people's files.
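To make the flow above more concrete, here is a minimal conceptual sketch in Python. It assumes the `cryptography` package; the chunk size, key handling, and file name are illustrative only, and the Reed-Solomon and distribution steps performed by the real Agent are summarized in comments rather than implemented.

```python
# Conceptual sketch of the client-side pipeline: encrypt with AES-256, then
# split the ciphertext into fixed-size chunks. Parameters are illustrative.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MiB chunks


def encrypt_and_chunk(payload: bytes) -> list[bytes]:
    """Encrypt the payload with AES-256-GCM, then split the ciphertext."""
    key = AESGCM.generate_key(bit_length=256)  # the key stays with the owner
    nonce = os.urandom(12)
    ciphertext = nonce + AESGCM(key).encrypt(nonce, payload, None)
    return [ciphertext[i:i + CHUNK_SIZE]
            for i in range(0, len(ciphertext), CHUNK_SIZE)]


# Each chunk is then erasure-coded into n data shards plus k redundancy shards
# (Reed-Solomon), and every shard travels to a different node over an
# end-to-end encrypted peer-to-peer channel, so no single Cubbit Cell ever
# holds a whole file.
chunks = encrypt_and_chunk(open("report.pdf", "rb").read())
```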
Architecture
The architecture of Cubbit DS3 revolves around three entities: the Agent, the Coordinator, and the SDK.
- The Agent is a small piece of software enabling a Cubbit storage node within Cubbit's peer-to-peer network. It runs inside the Cubbit Cell.
- The Coordinator is a set of centralized microservices designed to coordinate and optimize the Swarm.
- The Cubbit SDK is a collection of tools and resources that developers can use to build software applications on top of Cubbit. It interacts with the Agent, the Coordinator, and Cubbit's S3 Gateway to enable Cubbit's S3-compatible cloud object storage.
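Because DS3 exposes an S3-compatible interface through the S3 Gateway, any standard S3 client can talk to it. Below is a minimal sketch using Python and boto3; the endpoint URL, credentials, and bucket name are placeholders to be replaced with the values from your own Cubbit account.

```python
# Minimal sketch: pointing a standard S3 client at Cubbit's S3 Gateway.
# The endpoint, credentials, and bucket name below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-cubbit-gateway.com",  # your DS3 endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.create_bucket(Bucket="my-bucket")
s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz")
```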
Object storage
In computer science, "object" refers to binary data, often known as a Binary Large OBject (BLOB). BLOBs can encompass images, audio files, spreadsheets, and even binary executable code.
Object storage refers to a platform that provides specialized tools for storing, retrieving, and locating BLOBs. In terms of practical applications, it is a type of data storage architecture designed for large amounts of unstructured data, such as videos, audio files, images, and documents. It organizes objects into buckets, similar to folders in a file system, with each bucket able to hold an unlimited number of objects. Unlike traditional file and block storage, object storage does not utilize a hierarchical file system. Instead, it has a flat address space, where data is stored as objects containing raw data, metadata, and a unique identifier.
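The sketch below illustrates the flat address space in practice: the object key looks like a path but is just a name, and custom metadata travels with the object itself. As in the previous sketch, the endpoint, credentials, bucket, key, and metadata values are placeholders.

```python
# Flat namespace in practice: the key is a plain identifier, not a directory
# path, and the metadata is stored alongside the raw data.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-cubbit-gateway.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

with open("launch.mp4", "rb") as body:
    s3.put_object(
        Bucket="my-bucket",
        Key="videos/2024/launch.mp4",   # unique identifier within the bucket
        Body=body,                      # raw data
        Metadata={"camera": "a7-iv", "project": "spring-launch"},  # metadata
    )

# Reading the object's head returns its metadata without downloading the data.
head = s3.head_object(Bucket="my-bucket", Key="videos/2024/launch.mp4")
print(head["Metadata"])
```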
Object storage is superior to block and file storage in several ways:
- Security: object storage provides higher levels of data durability by replicating data across multiple nodes and storing it in a flat address space, minimizing the risk of data loss in the event of a node failure.
- Scalability: object storage is designed to handle large amounts of unstructured data in a high-performance manner, making it ideal for storing and accessing vast quantities of files, images, videos, and other data types.
- Cost-effectiveness: object storage eliminates the need for expensive data tiering, making it more cost-effective than traditional file and block storage.
File storage vs block storage vs object storage
Data storage methods have progressed over the years to adapt to the changing nature of data. File-based and block-based storage are well-suited to structured data, but as organizations face increasing volumes of unstructured data, object-based storage has emerged as the superior solution.

File storage organizes data within folders and is based on a hierarchy of directories and subdirectories. It works well for small, easily organized data, but as the number of files grows, it becomes cumbersome and time-consuming.

Block storage breaks down a file into equally-sized blocks and stores them separately, offering improved efficiency and performance for critical business applications and transactional databases.

Object storage, on the other hand, treats objects as discrete units of data stored in a structurally flat environment, with each object including raw data, metadata, and a unique identifier. It offers cost-effective storage capacity for unstructured data and is ideal for data that does not change frequently or at all. Additionally, it provides more descriptive metadata than file storage, allowing for customization and further analysis.
Cubbit DS3
Cubbit DS3, short for Distributed Simple Storage Service, is a geo-distributed, S3-compatible object storage platform. Its buckets have comparable functionality to AWS S3 buckets, providing reliable and scalable storage solutions for unstructured data. Additionally, the geo-distributed nature of the Cubbit DS3 solution provides users with enhanced security and cost-effectiveness compared to traditional centralized cloud storage solutions. Being S3 compatible, Cubbit can offer state-of-the-art solutions for data protection, distribution, and retrieval (see the sketch after this list), such as:
- Object Locking
- Bucket Versioning
- Multipart Upload
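As a concrete illustration of this compatibility, the sketch below enables Bucket Versioning and applies an Object Lock retention period using boto3. The endpoint, credentials, bucket names, key, and retention date are placeholders; for Object Locking, the target bucket must have been created with Object Lock enabled.

```python
# Sketch: exercising standard S3 features against DS3 buckets.
# Endpoint, credentials, bucket names, key, and dates are placeholders.
import datetime
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-cubbit-gateway.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Bucket Versioning: keep every version of every object.
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Object Locking: write-once-read-many protection for an existing object in a
# bucket that was created with Object Lock enabled.
s3.put_object_retention(
    Bucket="my-locked-bucket",
    Key="backups/backup.tar.gz",
    Retention={
        "Mode": "COMPLIANCE",
        "RetainUntilDate": datetime.datetime(2031, 1, 1, tzinfo=datetime.timezone.utc),
    },
)
```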
Use cases
Backup
Cubbit DS3 is the ideal solution for organizations looking to automate their off-site backups in a fast, secure, and immutable manner. The platform is compatible with Veeam or any S3 client, making it easy to store your backups on Cubbit and protect your data from disaster. Thanks to its advanced data management features and S3 compatibility, users can easily and cost-effectively retrieve their data whenever needed. The geo-distributed network infrastructure of Cubbit provides scalable storage for large amounts of data without sacrificing performance. Data security is a top priority, with multi-layered encryption and robust disaster recovery options ensuring that your data is always protected.
Hybrid Cloud
With Cubbit DS3 and the right hybrid cloud strategy, you can virtually extend your NAS and virtual machines without disrupting your workflow, ensuring quick recovery of your cold backups while collaborating locally in a secure manner. Whether you're dealing with massive data sets or demanding applications, Cubbit DS3 can help you execute a simple, secure, and high-performance hybrid cloud strategy while breaking free from bandwidth limitations.
Cloud-to-cloud
With Cubbit DS3, you can automate your scripting processes, effortlessly migrate massive amounts of data, and stay compliant with regulations. With a simple change of endpoint and connection to Cubbit via RClone/AWS CLI, you can implement a multi-cloud strategy, diversify your risk, and provide your customers with a secure exit plan that meets GDPR requirements.
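The same endpoint switch mentioned above for RClone and the AWS CLI can also be expressed in code. The following is a rough Python/boto3 migration sketch, not a production tool: every endpoint, credential, and bucket name is a placeholder.

```python
# Sketch: copying objects from another S3-compatible provider into Cubbit DS3
# by switching endpoints. All endpoints, credentials, and names are placeholders.
import boto3

source = boto3.client(
    "s3",
    endpoint_url="https://s3.other-provider.example.com",
    aws_access_key_id="SOURCE_KEY",
    aws_secret_access_key="SOURCE_SECRET",
)
destination = boto3.client(
    "s3",
    endpoint_url="https://s3.example-cubbit-gateway.com",
    aws_access_key_id="DEST_KEY",
    aws_secret_access_key="DEST_SECRET",
)

# Stream every object from the legacy bucket into the DS3 bucket.
paginator = source.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="legacy-bucket"):
    for obj in page.get("Contents", []):
        body = source.get_object(Bucket="legacy-bucket", Key=obj["Key"])["Body"]
        destination.upload_fileobj(body, "ds3-bucket", obj["Key"])
```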
Hybrid Backup
With two-way synchronization between your NAS and object storage, Cubbit DS3 lets you schedule backup jobs and safeguard your data against potential threats such as ransomware and natural disasters. By utilizing deduplication technology, you can also compress your data, minimize bandwidth usage, and shorten transfer times. Implementing a comprehensive backup plan gives you peace of mind, knowing that your data is secure and easily accessible in case of an emergency.
Cloud-native applications
With Cubbit DS3 APIs, you can streamline your app development process and focus on building without worrying about data storage. Cubbit DS3 can serve as a persistent data store for building or transitioning to cloud-native applications, providing you with a highly scalable, flexible, and cost-effective solution. Simply change your endpoint and deploy it on Cubbit’s S3-compatible object storage for immediate results.
Rich media storage and delivery
Cubbit DS3 is designed to be highly cost-effective for storing and distributing large and rich media files, such as music, video, images, and more complex multimedia objects. With its powerful global distribution capabilities, organizations can quickly and easily distribute their media content to a global audience, reducing costs and improving user experience. Cubbit DS3 is perfect for media organizations with rapidly growing data storage needs, as it ensures that their storage solutions will keep pace with their business requirements.
Big data analytics
Cubbit DS3 can be used to store large amounts of any data type, including big data. With S3-compatible third-party apps, organizations can execute big data analysis and gain valuable insights into customer behavior, operations, and market trends. The data can be stored in its raw form, allowing for flexibility in analyzing it and deriving meaningful insights. This can help organizations make more informed decisions and drive growth, without having to worry about the limitations of traditional storage systems.
Internet of Things
Cubbit DS3 is designed to handle large amounts of machine-to-machine data efficiently and cost-effectively. Since Cubbit DS3 is S3 compatible, it can be leveraged through third-party apps to support artificial intelligence and advanced analytics applications to turn data generated by IoT devices into actionable insights. With Cubbit DS3, IoT-first organizations can streamline their data management processes, reduce costs, and improve the accuracy of their data.
Redundancy
When it comes to geo-distributed object storage, it's crucial to ensure that nodes stay online. This will guarantee the highest standards of durability (99.999999999%) and availability (99.95%).
To do that, many centralized cloud companies create multiple copies of data via strategies like RAID or bucket replication. This diversifies the risk of downtime but saturates bandwidth and storage capacity. In the case of RAID, the data is also vulnerable to local disasters like fires and blackouts.
Instead, Cubbit achieves redundancy via Reed-Solomon error-correcting codes. This ensures 99.999999999% durability and 99.95% availability without significant sacrifices in performance or storage capacity, while also ensuring off-site geo-distribution.
This algorithm, applied to a Galois field of 256 elements (GF(2^8)), divides a file into n shards and creates k additional redundancy shards, for a total of n + k shards. Any n of these shards are enough to retrieve the complete object or file.
Currently, n and k are set so that k = 2n.
This choice allows payloads to persist on the network with a storing ratio of just 3x the original payload size ((n + k) / n = 3) — with higher performance and the same (or even higher) durability of RAID and bucket replication.
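A quick back-of-the-envelope check of these numbers is shown below. The specific values of n and k are illustrative, chosen only to reproduce the 3x storing ratio described above; they are not Cubbit's actual parameters.

```python
# Illustrative erasure-coding arithmetic; n and k are example values chosen to
# match the 3x storing ratio described above, not Cubbit's real parameters.
n = 10                               # data shards needed to rebuild the payload
k = 2 * n                            # redundancy shards, so n + k = 3n in total

total_shards = n + k
storing_ratio = total_shards / n     # stored bytes vs. original payload bytes
tolerated_losses = total_shards - n  # shards that can be offline simultaneously

print(f"total shards:     {total_shards}")
print(f"storing ratio:    {storing_ratio:.1f}x the original payload size")
print(f"tolerated losses: any {tolerated_losses} shards can be unavailable")
```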
Even though there is only a small probability of losing a file by leveraging the Reed-Solomon algorithm, the chance that a significant number of agents in the same pool goes offline at the same time still exists (e.g., a blackout, local disaster, fire).
Heuristics on pool selection
To mitigate the risk described at the end of the section above, the Cubbit Coordinator selects nodes based on these 5 criteria (a hypothetical scoring sketch follows the list):
- Uptime: if a peer's uptime is high, it will be more reliable and available to receive shards.
- Geolocation: there is a balance between the choice of close peers (to optimize performance) and distant ones (to diversify the risk of pool downtime due to blackouts and local disasters).
- Load average: the more shards a peer handles, the greater the probability of losing a payload when a peer goes offline. For this reason, a good pool should not handle too many related shards.
- Bandwidth: the bandwidth of the peers should be balanced inside the pool to provide a uniform transfer speed to all users.
- Peer history: nodes that are systematically turned off at night get a low rank, so more reliable peers are chosen instead.
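One way to picture how these criteria could be combined is a per-node score, as in the hypothetical sketch below. Every field, weight, and formula here is invented for illustration and does not reflect Cubbit's actual selection algorithm.

```python
# Hypothetical illustration of multi-criteria peer scoring. The fields, weights,
# and formula are invented and do not reflect Cubbit's real algorithm.
from dataclasses import dataclass


@dataclass
class Peer:
    uptime_ratio: float          # 0.0-1.0, observed availability
    distance_km: float           # distance from the payload's other shards
    related_shards: int          # shards of the same payload already hosted
    bandwidth_mbps: float
    nightly_shutdowns: int       # history of systematic disconnections


def score(peer: Peer) -> float:
    """Higher is better: reward uptime, distance, and bandwidth; penalize
    concentration of related shards and a history of shutdowns."""
    return (
        3.0 * peer.uptime_ratio
        + 1.0 * min(peer.distance_km / 1000.0, 1.0)    # diversify geographically
        - 2.0 * peer.related_shards                     # avoid pool concentration
        + 0.5 * min(peer.bandwidth_mbps / 100.0, 1.0)   # balance transfer speed
        - 1.5 * min(peer.nightly_shutdowns / 10.0, 1.0)
    )


candidates = [Peer(0.99, 350.0, 0, 200.0, 0), Peer(0.80, 5.0, 3, 50.0, 8)]
best_first = sorted(candidates, key=score, reverse=True)
```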
On top of peer selection, Cubbit also has a file recovery strategy that crowdsources the power of peer-to-peer nodes. You can check it out in the File recovery section below.
Coordinator
The Coordinator is Cubbit's centralized intelligence that is in charge of:
- Ensuring the security of transactions.
- Facilitating communications between agents.
- Optimizing the payload distribution across the network.
- Keeping track of file locations and uptime via metadata.
- Triggering recovery procedures for files in the swarm.
The Coordinator is cryptographically blind — it only knows the files' location and their functional metadata. All payloads are stored on the distributed storage nodes, ensuring high performance and resilience through geo-distribution.
The Cubbit Coordinator has a key role in data distribution, peer selection, and recovery procedures.
The role of the Coordinator in file distribution and peer selection
Every file stored on Cubbit is encrypted with AES-256, split into chunks, multiplied into shards (using an algorithm called Reed-Solomon), and geo-distributed on the peer-to-peer network of nodes. The original file can be recovered from any subset of n shards out of the n + k total.
After the Reed-Solomon redundancy procedure, the Coordinator determines which agents are most suitable for hosting the encrypted shards of files. Each of the n + k shards is stored on a different agent.
To do so, the Coordinator runs machine learning algorithms to minimize the probability of losing files and maintain steady network performance. The Coordinator spreads the shards as far as possible from each other while minimizing network latency and balancing other factors (bandwidth usage, storage optimization, etc.).
How the Coordinator triggers the recovery procedure
Of the n + k shards created by the redundancy protocol, you only need n shards to download a file from the Swarm.
To ensure the highest standards of durability and availability, the Coordinator monitors the uptime status of each Cubbit Cell, triggering a recovery procedure when the total number of online shards drops to a certain security threshold.
When the recovery procedure is triggered, the Coordinator alerts the remaining online storage nodes that host that file to contact a set of newly available nodes to fully restore the number of online shards to the maximum level.
File recovery
Even though the probability of a coordinated downtime that results in an unavailable file is negligible, the chance that a significant number of peers in a pool goes offline simultaneously still exists (e.g., a blackout, local disaster, fire).
To ensure the highest standard of durability (99.999999999%) and availability (99.95%), Cubbit employs 2 procedures:
- Heuristics on pool selection: you can check the 5 criteria used for peer selection in the Redundancy section.
- Lazy recovery strategy: we'll dive deeper below.
Lazy recovery strategy
If, after a series of cumulative disconnections, the redundancy of a file (i.e., the number of online shards in excess of n) falls below a chosen security threshold, the Coordinator triggers a recovery procedure called the "Lazy recovery strategy" (a schematic sketch follows this list):
- The Coordinator identifies the alternative members of the pool that will replace the offline ones.
- The Coordinator instructs a node to retrieve n shards from the damaged pool.
- The chosen node retrieves the n shards and inverts Reed-Solomon to obtain an encrypted chunk (note that it is not necessary to know the AES key used to encrypt the file to invert the redundancy process).
- The node redistributes the recovered shards to the new members of the original pool.
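The schematic sketch below walks through these steps in Python form. All types, method names, and objects are hypothetical; in the real system, shards move between nodes over Cubbit's encrypted peer-to-peer channels rather than through local function calls.

```python
# Schematic, hypothetical sketch of the lazy recovery steps listed above.
def lazy_recovery(pool, n, coordinator):
    """Rebuild missing shards once online redundancy drops below the threshold."""
    online = [shard for shard in pool.shards if shard.node.is_online]
    missing = len(pool.shards) - len(online)

    # 1. The Coordinator picks replacement nodes for the offline pool members.
    replacements = coordinator.select_nodes(count=missing, exclude=pool.nodes)

    # 2-3. One node fetches any n online shards and inverts Reed-Solomon,
    #      recovering the encrypted chunk without ever needing the AES key.
    worker = online[0].node
    chunk = worker.reed_solomon_decode(online[:n])

    # 4. The worker re-encodes the chunk and sends the rebuilt shards to the
    #    replacement nodes, restoring the pool to full redundancy.
    rebuilt = worker.reed_solomon_encode(chunk)
    for shard, node in zip(rebuilt[-missing:], replacements):
        node.store(shard)
```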