POD5 Format Specification
Overview
The file format is, at its core, a collection of Apache Arrow tables, stored in the Apache Feather 2
(also know as Apache Arrow IPC File) format, and bundled into a container format. The container file
has the extension .pod5.
Table Schemas
POD5 files are a custom wrapper format around arrow that contain several arrow tables.
All the tables should have the following custom_metadata fields set on them:
Name |
Example Value |
Notes |
|---|---|---|
MINKNOW:pod5_version |
1.0.0 |
The version of this specification that the schema was based on. |
MINKNOW:software |
MinNOW Core 5.2.3 |
A free-form description of the software that wrote the file, intended to help pin down the source of files that violate the specification. |
MINKNOW:file_identifier |
cbf91180-0684-4a39-bf56-41eaf437de9e |
Must be identical across all tables. Allows checking that the files correspond to each other. |
Extension Types
Several fields in the table schemas use custom arrow types.
minknow.uuid
The schemas make extensive use of UUIDs to identify reads. This is stored using an extension type, with the following properties:
Name: "minknow.uuid"
Physical storage: FixedBinary(16)
minknow.vbz
Storage for VBZ-encoded data:
Name: "minknow.vbz"
Physical storage: LargeBinary
Tables
The Reads, Signal and Run Info tables must all be present in a POD5 file. Note that some very early POD5 files produced by pre-0.1 versions of the pod5 library did not include a Run Info table, instead including that information in the Reads table.
Reads Table
The Reads table contains a single row per read, and describes the metadata for each read. The
signal column of the read links to the Signal table, and allows a reads signal to be retrieved.
The run_info column links to the the Run Info table, providing more context for the read and
avoiding duplicating data that is common to many or all reads in the file.
Some fields of the Reads table are dictionaries: the contents of the table are stored in a lookup written prior to each batch of read rows and the read row itself then contains an integer index. This allows space savings on fields that would otherwise be repeated. Only simple types are stored in dictionaries as third party tools have limited support for dictionaries of structs.
[tables/reads.toml] contains specific information about fields in the reads table.
Signal Table
The signal table contains the (optionally compressed) signal data where one row contains sequence of sample data, and some information about the sample data origin.
[tables/signal.toml] contains specific information about fields in the signal table.
Run Info Table
The run info table contains a single row per MinKNOW run that any read in the file came from.
Several fields of the Reads table are dictionaries, the contents of the table are stored in a lookup written prior to each batch of read rows, the read row itself then contains an integer index. This allows space savings on fields that would otherwise be repeated.
[tables/run_info.toml] contains specific information about fields in the reads table.
Combined file Layout
Layout
<signature "\213POD\r\n\032\n">
<section marker: 16 bytes>
<embedded file 1 (padded to 8-byte boundary)><section marker: 16 bytes>
...
<embedded file N (padded to 8-byte boundary)><section marker: 16 bytes>
<footer magic: "FOOTER\000\000">
<footer (padded to 8-byte boundary)>
<footer length: 8 bytes little-endian signed integer>
<section marker: 16 bytes>
<signature "\213POD\r\n\032\n">
All padding bytes should be zero. They ensure memory mapped files have the alignment that Arrow expects.
Signature
The first and last eight bytes of the file are both a fixed set of values:
| Decimal | 139 | 80 | 79 | 68 | 13 | 10 | 26 | 10 |
| Hexadecimal | 0x8B | 0x50 | 0x4F | 0x44 | 0x0D | 0x0A | 0x1A | 0x0A |
| ASCII C Notation | \213 | P | O | D | \r | \n | \032 | \n |
The format of the signature is based on the PNG file signature, and inherits several useful features from it for detecting file corruption:
The first byte is non-ASCII to reduce the probability it is interpreted as a text file.
The first byte has the high bit set to catch file transfers that clear the top bit.
The \r\n (CRLF) sequence and the final \n (LF) byte check that nothing has attempted to standardise line endings in the file.
The second-last byte (\032) is the CTRL-Z sequence, which stops file display under MS-DOS.
Rationale
A unique, fixed signature for the file type allows quickly identifying that the file is in the
expected format, and provides an easy way for tools like the UNIX file command to determine the
file type.
Placing it at the end allows quickly checking whether the file is complete.
Section marker
The section marker is a 16-byte UUID, generated randomly for each file. All the section markers in a given file must be identical.
Rationale
This aids in recovery of partially-written files (that are missing a footer) - while most of the embedded Arrow IPC files can be scanned easily, it may not be obvious where the footer ends. A given randomly-generated 16-byte value is highly unlikely to occur in actual data, and can be scanned for to find the end of the embedded file for certain. The first section marker is just so that recovery tools know what to look for.