Storage
Overview
The primary unit of long-term storage for M3DB are fileset files which store compressed streams of time series values, one per shard block time window size.
They are flushed to disk after a block time window becomes unreachable, that is the end of the time window for which that block can no longer be written to. If a process is killed before it has a chance to flush the data for the current time window to disk it must be restored from the commit log (or a peer that is responsible for the same shard if replication factor is larger than 1.)
FileSets
A fileset has the following files:
- Info file: Stores the block time window start and size and other important metadata about the fileset volume.
- Summaries file: Stores a subset of the index file for purposes of keeping the contents in memory and jumping to section of the index file that within a few pages of linear scanning can find the series that is being looked up.
- Index file: Stores the series metadata, including tags if indexing is enabled, and location of compressed stream in the data file for retrieval.
- Data file: Stores the series compressed data streams.
- Bloom filter file: Stores a bloom filter bitset of all series contained in this fileset for quick knowledge of whether to attempt retrieving a series for this fileset volume.
- Digests file: Stores the digest checksums of the info file, summaries file, index file, data file and bloom filter file in the fileset volume for integrity verification.
- Checkpoint file: Stores a digest of the digests file and written at the succesful completion of a fileset volume being persisted, allows for quickly checking if a volume was completed.
┌───────────────────────┐
┌─────────────────────┐ ┌─────────────────────┐ │ Index File │
│ Info File │ │ Summaries File │ │ (sorted by ID) │
├─────────────────────┤ │ (sorted by ID) │ ├───────────────────────┤
│- Block Start │ ├─────────────────────┤ ┌─>│- Idx │
│- Block Size │ │- Idx │ │ │- ID │
│- Entries (Num) │ │- ID │ │ │- Size │
│- Major Version │ │- Index Entry Offset ├──┘ │- Checksum │
│- Summaries (Num) │ └─────────────────────┘ │- Data Entry Offset ├──┐
│- BloomFilter (K/M) │ │- Encoded Tags │ │
│- Snapshot Time │ │- Index Entry Checksum │ │
│- Type (Flush/Snap) │ └───────────────────────┘ │
│- Snapshot ID │ │
│- Volume Index │ │
│- Minor Version │ │
└─────────────────────┘ │
│
┌─────────────────────┐ ┌─────────────────────────────┘
┌─────────────────────┐ │ Bloom Filter File │ │
│ Digests File │ ├─────────────────────┤ │ ┌─────────────────────┐
├─────────────────────┤ │- Bitset │ │ │ Data File │
│- Info file digest │ └─────────────────────┘ │ ├─────────────────────┤
│- Summaries digest │ │ │List of: │
│- Index digest │ └─>│ - Marker (16 bytes)│
│- Data digest │ │ - ID │
│- Bloom filter digest│ │ - Data (size bytes)│
└─────────────────────┘ └─────────────────────┘
┌─────────────────────┐
│ Checkpoint File │
├─────────────────────┤
│- Digests digest │
└─────────────────────┘
In the diagram above you can see that the data file stores compressed blocks for a given shard / block start combination. The index file (which is sorted by ID and thus can be binary searched or scanned) can be used to find the offset of a specific ID.
FileSet files will be kept for every shard / block start combination that is within the retention period. Once the files fall out of the period defined in the configurable namespace retention period they will be deleted.