Distributed Storage Systems

Nidhi Vichare
5 minute read
October 15, 2020
Distributed storage systems
Data Engineering
Data Warehouse
Databases
Usage-driven design



What is a Distributed Storage System? A distributed storage system is infrastructure that splits data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.

Attributes of Distributed Storage Systems

  • Partitioning

    How does the system manage distribution of data across nodes?

  • Mutation

    What support does the system have for modifying data?

  • Read paths

    How is data in the system accessed?

  • Availability and Consistency

    What tradeoffs does the system make in terms of availability of the system versus consistency of data?

  • Use Cases

    To what problems does a system provide a solution?

    • LARGE SCANS
    • RANDOM ACCESS TO DATA
    • CUBING
    • TIME SERIES
    • HIGH MUTABILITY

Storage System Genealogy

Examples - Google File System (GFS), BigTable, Spanner, and Google Search

  • Cassandra and HBase are both based on BigTable:
    • Components of their architecture are similar.
    • They are key–value stores.
    • They are designed to scale well.
    • They enforce ordering as part of how they store data on disk and index it.
    • Under the hood, data is immutable, but there are patterns that make it effectively mutable (see the sketch below).
    • They aim to solve similar use cases (for example, fast point access to records) and have similar performance characteristics.
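To make the "immutable under the hood, yet effectively mutable" point concrete, here is a minimal Python sketch of the log-structured (LSM-style) write path these systems follow. The class and method names are illustrative only, not actual Cassandra or HBase APIs.

```python
# Illustrative sketch of the LSM-style pattern: writes go to an in-memory table,
# which is flushed to immutable, sorted segments; newer records shadow older ones.

class TinyLSMStore:
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # mutable in-memory buffer, sorted on flush
        self.segments = []          # immutable, sorted segments (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Updates never modify old data in place; they just write a newer record.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, None)         # None acts as a tombstone

    def get(self, key):
        # Read path: check the memtable first, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None

    def _flush(self):
        # Flushing produces an immutable, key-ordered segment (like an SSTable).
        self.segments.append(dict(sorted(self.memtable.items())))
        self.memtable = {}


store = TinyLSMStore(memtable_limit=2)
store.put("user:1", "alice")
store.put("user:2", "bob")          # memtable full -> flushed to an immutable segment
store.put("user:1", "alice-v2")     # an "update" is just a newer record shadowing the old one
print(store.get("user:1"))  # alice-v2 (memtable wins over the older segment)
print(store.get("user:2"))  # bob (served from the immutable segment)
store.delete("user:2")
print(store.get("user:2"))  # None: the tombstone shadows the segment value
```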

Partitioning

Distributed storage systems scale because they partition data. Partitioning refers to how the system determines which nodes store data; in other words, it is the mechanism used to distribute data across the system. In general, distributed storage systems offer a limited number of partitioning options: centralized, range, and hash partitioning.

CENTRALIZED PARTITIONING - GFS and HDFS

Pros: The centralized node can make sure the data is partitioned evenly, even when data nodes fail. The central service knows where everything is and can point to it at any time.

Cons: The single metadata service is a potential bottleneck, because it is constrained both in memory and in request throughput, particularly in multitenant environments.
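As an illustration of the centralized approach, the sketch below shows a single metadata service deciding where each chunk of a file lives and answering location lookups for clients. The names and round-robin placement are made up for illustration; this is not the GFS or HDFS NameNode API.

```python
# Illustrative sketch of centralized partitioning: one metadata service decides
# where each chunk lives, and clients ask it for locations before reading.

class MetadataService:
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.chunk_locations = {}   # (filename, chunk_index) -> node
        self.next_node = 0

    def assign_chunk(self, filename, chunk_index):
        # Central placement decision: round-robin keeps nodes evenly loaded,
        # and the service can reassign chunks if a node fails.
        node = self.data_nodes[self.next_node % len(self.data_nodes)]
        self.next_node += 1
        self.chunk_locations[(filename, chunk_index)] = node
        return node

    def locate(self, filename, chunk_index):
        # Every read goes through this lookup, which is why the single metadata
        # service can become a memory and throughput bottleneck at scale.
        return self.chunk_locations.get((filename, chunk_index))


meta = MetadataService(["node-a", "node-b", "node-c"])
for i in range(5):
    meta.assign_chunk("events.log", i)
print(meta.locate("events.log", 3))  # node-a (round-robin placement)
```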

RANGE PARTITIONING - A good real-world example of range partitioning is a dictionary, where entries are split into volumes by alphabetical range. The fundamental problem with a range partitioning strategy is skew: skew occurs when one partition holds significantly more content than the other partitions.
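The sketch below (with made-up split points) shows how range partitioning routes keys by comparing them against boundaries, and how skew appears when most keys fall into one range.

```python
# A small sketch of range partitioning and how skew arises. Keys are routed to a
# partition by comparing against fixed split points (the boundaries are made up).

from bisect import bisect_right
from collections import Counter

split_points = ["g", "n", "t"]   # partition 0: < "g", 1: < "n", 2: < "t", 3: the rest

def range_partition(key):
    return bisect_right(split_points, key[0].lower())

keys = ["alpha", "beta", "gamma", "delta", "epsilon", "echo", "eagle", "zeta"]
print(Counter(range_partition(k) for k in keys))
# e.g. Counter({0: 6, 1: 1, 3: 1}) -- most keys fall in the a-f range, so partition 0 is skewed
```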

HASH PARTITIONING - An effective hash function and smart use of hash keys can spread values evenly across partitions, ensuring that data is evenly distributed across the cluster. Some systems, such as Cassandra or Elasticsearch, use hash partitioning, but in general a system that uses range partitioning can behave like a hash-partitioned one by prepending a hash to the keys used by the system.
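Here is a minimal sketch of hash partitioning, including the hash-prefixed key trick mentioned above. The partition count and helper names are arbitrary choices for illustration.

```python
# A minimal sketch of hash partitioning: a stable hash of the key, modulo the
# number of partitions, spreads keys evenly regardless of their natural order.

import hashlib

NUM_PARTITIONS = 4

def hash_partition(key: str) -> int:
    # md5 is used only as a stable, well-distributed hash, not for security
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def hash_prefixed_key(key: str) -> str:
    # The trick mentioned above: a range-partitioned system can behave like a
    # hash-partitioned one by prepending a hash bucket to the sort key.
    return f"{hash_partition(key)}:{key}"

for k in ["user:1", "user:2", "user:3", "user:4"]:
    print(k, "->", hash_partition(k), hash_prefixed_key(k))
```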

Mutation Options

APPEND ONLY

FILE VERSUS RECORD

RECORD SIZE

MUTATION LATENCY

Read Paths

INDEXING

Indexing is widely used in distributed systems, but often in different ways. Some systems index only the start of a file, leaving all the content in that file for a scan to filter through. Other systems index every field in a record, and even content within a record. In general, there are four categories of indexing in distributed systems:

  • File-level indexing: one index entry per chunk of data.
  • Record-level indexing: every record or document has a primary key and is indexed by that value.
  • Simple secondary indexing: secondary indexes on fields that are not the primary key.
  • Reverse indexing: Lucene-based indexing in which everything can be indexed; used mainly for search or faceting use cases.
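As a toy illustration of the reverse (inverted) indexing category, the sketch below builds a term-to-document index in plain Python; real Lucene-based systems add tokenization, scoring, and on-disk structures on top of this idea.

```python
# A toy inverted ("reverse") index: every term maps to the documents that
# contain it, which is what makes search and faceting lookups fast.

from collections import defaultdict

docs = {
    1: "distributed storage systems scale out",
    2: "columnar storage speeds up large scans",
    3: "hash partitioning spreads data across nodes",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["storage"]))  # [1, 2]
print(sorted(inverted_index["hash"]))     # [3]
```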

ROW-BASED VERSUS COLUMNAR STORAGE

Availability Versus Consistency

Best Practices While Building Your Data Warehouse

  • Metadata management – Documenting the metadata related to all the source tables, staging tables, and derived tables is critical to deriving actionable insights from your data. The ETL tool can be designed so that even data lineage is captured, and several popular ETL tools already do a good job of tracking lineage.
  • Logging – Logging is another aspect that is often overlooked. Having a centralized repository where logs can be visualized and analyzed can go a long way in fast debugging and creating a robust ETL process.
  • Joining data – Most ETL tools have the ability to join data in extraction and transformation phases. It is worthwhile to take a long hard look at whether you want to perform expensive joins in your ETL tool or let the database handle that. In most cases, databases are better optimized to handle joins.
  • Keeping the transaction database separate – The transaction database should be isolated from extract jobs; it is best to run extracts against a staging or replica table so that the performance of the primary operational database is unaffected.
  • Monitoring/alerts – Monitoring the health of the ETL/ELT process and having alerts configured is important in ensuring reliability.
  • Point-in-time recovery – Even with the best monitoring, logging, and fault tolerance, these complex systems do go wrong. The ability to recover the system to a previous state should also be considered during data warehouse design.


Further Reading

🔗 Read more about Snowflake here

🔗 Read more about Cassandra here

🔗 Read more about Elasticsearch here

🔗 Read more about Kafka here

🔗 Read more about Spark here

🔗 Read more about Data Lakes here

🔗 Read more about Redshift vs Snowflake here

🔗 Read more about Best Practices on Database Design here