M3DB, a distributed time series database

Overview

M3DB is written entirely in Go and does not have any required dependencies. For larger deployments, one may use an etcd cluster to manage M3DB cluster membership and topology definition. High Level Goals Some of the high level goals for the project are defined as: Monitoring support: M3DB was primarily developed for collecting a high volume of monitoring time series data, distributing the storage in a horizontally scalable manner and most efficiently leveraging the hardware.

Storage Engine

M3DB is a time series database that was primarily designed to be horizontally scalable and able to handle high data throughput. Time Series Compression One of M3DB’s biggest strengths as a time series database (as opposed to using a more general-purpose horizontally scalable, distributed database like Cassandra) is its ability to compress time series data resulting in huge memory and disk savings. There are two compression algorithms used in M3DB: M3TSZ and protobuf encoding.

Sharding

Timeseries keys are hashed to a fixed set of virtual shards. Virtual shards are then assigned to physical nodes. M3DB can be configured to use any hashing function and a configured number of shards. By default murmur3 is used as the hashing function and 4096 virtual shards are configured. Benefits Shards provide a variety of benefits throughout the M3DB stack: They make horizontal scaling easier and adding / removing nodes without downtime trivial at the cluster level.

Consistency Levels

M3DB provides variable consistency levels for read and write operations, as well as cluster connection operations. These consistency levels are handled at the client level. Write consistency levels One: Corresponds to a single node succeeding for an operation to succeed. Majority: Corresponds to the majority of nodes succeeding for an operation to succeed. All: Corresponds to all nodes succeeding for an operation to succeed. Read consistency levels One: Corresponds to reading from a single node to designate success.

Storage

Overview The primary unit of long-term storage for M3DB are fileset files which store compressed streams of time series values, one per shard block time window size. They are flushed to disk after a block time window becomes unreachable, that is the end of the time window for which that block can no longer be written to. If a process is killed before it has a chance to flush the data for the current time window to disk it must be restored from the commit log (or a peer that is responsible for the same shard if replication factor is larger than 1.

Commit Logs And Snapshot Files

Overview M3DB has a commit log that is equivalent to the commit log or write-ahead-log in other databases. The commit logs are completely uncompressed (no M3TSZ encoding), and there is one per database (multiple namespaces in a single process will share a commit log.) Integrity Levels There are two integrity levels available for commit logs: Synchronous: write operations must wait until it has finished writing an entry in the commit log to complete.

Peer Streaming

Client Peer streaming is managed by the M3DB client. It fetches all blocks from peers for a specified time range for bootstrapping purposes. It performs the following steps: Fetch all metadata for blocks from all peers who own the specified shard Compares metadata from different peers and determines the best peer(s) from which to stream the actual data Streams the block data from peers Steps 1, 2 and 3 all happen concurrently.

Caching

Overview Blocks that are still being actively compressed / M3TSZ encoded must be kept in memory until they are sealed and flushed to disk. Blocks that have already been sealed, however, don’t need to remain in-memory. In order to support efficient reads, M3DB implements various caching policies which determine which flushed blocks are kept in memory, and which are not. The “cache” itself is not a separate datastructure in memory, cached blocks are simply stored in their respective in-memory objects with various different mechanisms (depending on the chosen cache policy) determining which series / blocks are evicted and which are retained.