What Is a Distributed Storage System?
A distributed storage system is infrastructure that splits data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for synchronizing and coordinating data between cluster nodes.
Partitioning
How does the system manage distribution of data across nodes?
Mutation
What support does the system have for modifying data?
Read paths
How is data in the system accessed?
Availability and Consistency
What tradeoffs does the system make in terms of availability of the system versus consistency of data?
Use Cases
What problems does the system solve?
Examples - Google File System (GFS), BigTable, Spanner, and Google Search
Components of their architecture are similar.
They are key–value stores.
They are designed to scale well.
They enforce ordering as part of how they store data on disk and index it.
Under the hood, data is immutable, but there are patterns that allow data to be mutable (see the sketch after this list).
They aim to solve similar use cases (for example, fast point access to records) and have similar performance characteristics.
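To make the immutability point concrete, here is a minimal Python sketch of the append-only pattern (all names are hypothetical, not taken from any of these systems): a "mutation" is simply another appended version of a record, the read path returns the latest version, and compaction later discards superseded versions.

```python
import time


class AppendOnlyStore:
    """Toy append-only store: writes append new versions, reads return the latest."""

    def __init__(self):
        self._log = []  # immutable log of (timestamp, key, value) entries

    def put(self, key, value):
        # A "mutation" is just another append; older versions stay in the log.
        self._log.append((time.time(), key, value))

    def get(self, key):
        # Read path scans backward for the most recent version of the key.
        for ts, k, v in reversed(self._log):
            if k == key:
                return v
        return None

    def compact(self):
        # Compaction rewrites the log, keeping only the latest version per key.
        latest = {}
        for ts, k, v in self._log:
            latest[k] = (ts, k, v)
        self._log = sorted(latest.values())


store = AppendOnlyStore()
store.put("user:1", {"name": "Ada"})
store.put("user:1", {"name": "Ada Lovelace"})  # logical update, physical append
print(store.get("user:1"))  # -> {'name': 'Ada Lovelace'}
```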
Partitioning
Distributed storage systems scale because they partition information. Here, partitioning refers to how the system determines which nodes store data: the mechanism used to distribute data across the system. In general, distributed storage systems offer a limited number of options for data partitioning: centralized, range, and hash partitioning.
CENTRALIZED PARTITIONING - GFS and HDFS
Pros: The centralized node can make sure the data is partitioned evenly, even in the case of data node failures. The central service knows where everything is and can point to it at any time.
Cons: A single metadata service is a potential bottleneck, because it is constrained both in memory and in request throughput in multitenant environments.
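As a rough illustration of the centralized approach (this is not the actual GFS or HDFS protocol; every name here is hypothetical), a single metadata service decides where each block lives and answers location queries from clients:

```python
import itertools


class CentralMetadataService:
    """Toy 'namenode': one service decides where every block is placed."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}  # block id -> list of data nodes holding a replica
        self._rr = itertools.cycle(data_nodes)  # simple round-robin placement

    def allocate_block(self, block_id):
        # The central service picks replica locations, so it can keep placement even.
        nodes = [next(self._rr) for _ in range(self.replication)]
        self.block_map[block_id] = nodes
        return nodes

    def locate_block(self, block_id):
        # Clients ask the central service where to read a block from.
        return self.block_map[block_id]


meta = CentralMetadataService(["dn1", "dn2", "dn3", "dn4"])
meta.allocate_block("file.txt#0")
print(meta.locate_block("file.txt#0"))  # e.g. ['dn1', 'dn2', 'dn3']
```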
RANGE PARTITIONING - A good real-world example of range partitioning is a dictionary: each volume covers a contiguous range of the alphabet. A range partitioning strategy has one fundamental problem, known as skew. Skew occurs when one partition holds significantly more content than the other partitions.
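A small Python sketch of how skew arises under range partitioning (the boundaries and sample keys are invented for illustration): keys are assigned to partitions by comparing them against range boundaries, so a key distribution that is heavy in one range overloads one partition.

```python
from collections import Counter

# Hypothetical split into four partitions by first letter of the key,
# like volumes of a dictionary: 0 = a-f, 1 = g-m, 2 = n-s, 3 = t-z.
boundaries = ["g", "n", "t"]


def range_partition(key):
    for i, upper in enumerate(boundaries):
        if key.lower() < upper:
            return i
    return len(boundaries)


words = ["apple", "avocado", "apricot", "almond", "anise", "banana", "mango", "zucchini"]
counts = Counter(range_partition(w) for w in words)
print(counts)  # Counter({0: 6, 1: 1, 3: 1}) -> partition 0 is badly skewed
```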
HASH PARTITIONING - An effective hash function and smart use of hash keys spread data values evenly across partitions, and therefore evenly across the cluster. Systems such as Cassandra and Elasticsearch use hash partitioning, and in general a system that uses range partitioning can get the same effect by prepending a hash to the keys used by the system.
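A minimal sketch of hash partitioning in Python, using a stable hash rather than the built-in hash() (which is salted per process); the salted_key helper at the end illustrates the "prepend a hash to the key" trick for range-partitioned systems. Both helpers are hypothetical names for illustration.

```python
import hashlib


def hash_partition(key, num_partitions):
    # A stable hash spreads keys roughly uniformly across partitions.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def salted_key(key, buckets=4):
    # A range-partitioned system can get hash-like spreading by prefixing a hash.
    return f"{hash_partition(key, buckets)}:{key}"


# The same 'a'-heavy keys that skewed the range scheme now spread out.
for word in ["apple", "avocado", "apricot", "almond", "anise", "banana"]:
    print(word, "->", hash_partition(word, 4), salted_key(word))
```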
Mutation Options
APPEND ONLY
FILE VERSUS RECORD
RECORD SIZE
MUTATION LATENCY
Read Paths
INDEXING
Indexing is widely used in distributed systems, but often in different ways. Some systems index only the start of a file, leaving all the content in that file for a scan to filter through. Other systems index every field in a record, and even content within a record. In general, there are four categories of indexing in distributed systems:
Indexing at the file level: one index for chunks of data.
Indexing at the record level: every record or document has a primary key and is indexed by that value.
Simple secondary indexing: simple secondary indexes on fields that are not the primary key.
Reverse indexing: Lucene-based indexing in which everything can be indexed; used mainly for search or faceting use cases.
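To illustrate the record-level and simple secondary categories, here is a toy Python sketch (the records and field names are invented): a primary-key index gives fast point lookups, and a secondary index maps a non-key field back to primary keys.

```python
from collections import defaultdict

records = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Grace", "city": "Arlington"},
    {"id": 3, "name": "Linus", "city": "Helsinki"},
]

# Record-level index: every record is reachable by its primary key.
primary_index = {r["id"]: r for r in records}

# Simple secondary index: a non-primary field maps to the primary keys holding it.
city_index = defaultdict(list)
for r in records:
    city_index[r["city"]].append(r["id"])

print(primary_index[2])                                   # point lookup by primary key
print([primary_index[i] for i in city_index["London"]])   # lookup via secondary index
```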
ROW-BASED VERSUS COLUMNAR STORAGE
Availability Versus Consistency
Best Practices while building your Data Warehouse
Sources
🔗 Read more about Snowflake here
🔗 Read more about Cassandra here
🔗 Read more about Elasticsearch here
🔗 Read more about Kafka here
🔗 Read more about Spark here
🔗 Read more about Data Lakes here
🔗 Read more about Redshift vs Snowflake here
🔗 Read more about Best Practices on Database Design here