pod5_subset

Tool for subsetting pod5 files into one or more outputs

assert_filename_template(template: str, subset_columns: List[str], ignore_incomplete_template: bool) → None

Get the keys named in the template to assert that they exist in subset_columns
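
The key check can be sketched with the standard library's string.Formatter, which yields the field names in a format string. This is a minimal illustration and not the tool's own code: template_keys and check_template are hypothetical names, and both the KeyError and the treatment of the ignore flag (skipping the check that every subset column is used) are assumptions.

```python
from string import Formatter
from typing import List, Set


def template_keys(template: str) -> Set[str]:
    """Return the named fields of a format string, e.g. "{a}_{b}.pod5" gives {"a", "b"}."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_template(
    template: str, subset_columns: List[str], ignore_incomplete: bool = False
) -> None:
    """Hypothetical stand-in: raise KeyError if template and columns disagree."""
    keys = template_keys(template)
    allowed = set(subset_columns)

    # Every key named in the template must be one of the subset columns.
    unexpected = keys - allowed
    if unexpected:
        raise KeyError(f"template keys not in subset_columns: {sorted(unexpected)}")

    # Assumption: unless told otherwise, every subset column must be used,
    # since rows differing only in an unused column would share one filename.
    unused = allowed - keys
    if unused and not ignore_incomplete:
        raise KeyError(f"subset_columns missing from template: {sorted(unused)}")
```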

assert_overwrite_ok(output: Path, names: Iterable[str], force_overwrite: bool) → None

Given the output directory path and target filenames, assert that no existing file will be overwritten unless force_overwrite is set, raising a FileExistsError otherwise.
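
The overwrite guard can be pictured as a scan of the output directory for the target names. The sketch below is an assumption-laden illustration: check_overwrite is a hypothetical name, and it only reports conflicts; whatever the real function does with existing files when force_overwrite is set is not shown here.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def check_overwrite(output: Path, names: Iterable[str], force_overwrite: bool) -> None:
    """Hypothetical stand-in: refuse to clobber existing targets unless forced."""
    existing = [name for name in names if (output / name).exists()]
    if existing and not force_overwrite:
        raise FileExistsError(f"refusing to overwrite: {existing}")


# Usage: an empty directory passes; an existing target fails unless forced.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    check_overwrite(out, ["a.pod5"], force_overwrite=False)  # nothing exists yet
    (out / "a.pod5").touch()
    check_overwrite(out, ["a.pod5"], force_overwrite=True)  # allowed when forced
    try:
        check_overwrite(out, ["a.pod5"], force_overwrite=False)
        blocked = False
    except FileExistsError:
        blocked = True
```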

calculate_transfers(inputs: List[Path], read_targets: Dict[str, Set[Path]], missing_ok: bool, duplicate_ok: bool) → Dict[Path, Dict[Path, Set[str]]]

Calculate the transfers, which store each collection of read_ids together with its source and destination.
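
To make the nested return type concrete, here is a sketch that works on in-memory sets of read_ids instead of opening pod5 files. Several things are assumptions: the name sketch_transfers, the reads_by_input argument, the choice of destination as the outer key and source as the inner key, and the reading of missing_ok as "requested read_ids absent from every input are tolerated". Duplicate handling is left out entirely.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Set


def sketch_transfers(
    reads_by_input: Dict[Path, Set[str]],
    read_targets: Dict[str, Set[Path]],
    missing_ok: bool = False,
) -> Dict[Path, Dict[Path, Set[str]]]:
    """Hypothetical stand-in: group requested read_ids by destination, then source."""
    transfers: Dict[Path, Dict[Path, Set[str]]] = defaultdict(lambda: defaultdict(set))
    found: Set[str] = set()

    for source, read_ids in reads_by_input.items():
        # Only read_ids that were actually requested take part in a transfer.
        for read_id in read_ids & set(read_targets):
            found.add(read_id)
            for dest in read_targets[read_id]:
                transfers[dest][source].add(read_id)

    missing = set(read_targets) - found
    if missing and not missing_ok:
        raise ValueError(f"read_ids not found in any input: {sorted(missing)}")

    # Convert the nested defaultdicts to plain dicts for the caller.
    return {dest: dict(sources) for dest, sources in transfers.items()}
```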

create_default_filename_template(subset_columns: List[str]) → str

Create the default filename template from the subset_columns selected
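
One plausible shape for such a default is a "column-{column}" pair per subset column, joined with underscores. The exact format used by the tool is not stated here, so the function name default_template, the separators, and the .pod5 suffix below are all assumptions.

```python
from typing import List


def default_template(subset_columns: List[str]) -> str:
    """Hypothetical default: one "name-{name}" segment per column."""
    # Doubled braces are literal braces inside an f-string, so each
    # segment keeps a {name} placeholder for str.format to fill later.
    segments = (f"{col}-{{{col}}}" for col in subset_columns)
    return "_".join(segments) + ".pod5"


template = default_template(["barcode", "mux"])
filename = template.format(barcode="b01", mux=2)
```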

launch_subsetting(transfers: Dict[Path, Dict[Path, Set[str]]], show_pbar: bool = False) → None

Iterate over the transfers one target at a time, opening sources and copying the required read_ids. Wait for the repacker to finish before moving on to ensure we don’t have too many open file handles.

main()

Main entry point for pod5 subset

parse_csv_mapping(csv_path: Path) → Dict[str, Set[str]]

Parse the csv direct mapping of output target to read_ids
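
As an illustration of the returned structure, the sketch below assumes a CSV layout in which the first field of each row is the output target and the remaining fields are read_ids. That layout, the name parse_csv_lines, and the choice to accept lines instead of a path are assumptions made to keep the example self-contained.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Set


def parse_csv_lines(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Hypothetical stand-in: map each target name to the read_ids on its rows."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        target, *read_ids = (field.strip() for field in row)
        # A target may appear on several rows; its read_ids are merged.
        mapping[target].update(read_id for read_id in read_ids if read_id)
    return dict(mapping)


csv_mapping = parse_csv_lines(["a.pod5, r1, r2", "b.pod5, r3", "a.pod5, r4"])
```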

parse_direct_mapping_targets(csv_path: Optional[Path] = None, json_path: Optional[Path] = None) → Dict[str, Set[str]]

Parse either the csv or json direct mapping of output target to read_ids

Return type:

dictionary mapping of output target to read_ids

parse_json_mapping(json_path: Path) → Dict[str, Set[str]]

Parse the json direct mapping of output target to read_ids
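
A JSON mapping of this kind could be a single object whose keys are output targets and whose values are lists of read_ids. The sketch assumes that layout; parse_json_text is a hypothetical name and it takes the JSON text directly instead of a path.

```python
import json
from typing import Dict, Set


def parse_json_text(text: str) -> Dict[str, Set[str]]:
    """Hypothetical stand-in: turn {"target": ["read_id", ...]} into sets."""
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise TypeError("expected a JSON object of target -> list of read_ids")

    mapping: Dict[str, Set[str]] = {}
    for target, read_ids in raw.items():
        # Reject a bare string, which set() would split into characters.
        if not isinstance(read_ids, list):
            raise TypeError(f"read_ids for {target!r} must be a list")
        mapping[target] = set(read_ids)
    return mapping


json_mapping = parse_json_text('{"a.pod5": ["r1", "r2", "r1"], "b.pod5": ["r3"]}')
```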

parse_table_mapping(summary_path: Path, filename_template: Optional[str], subset_columns: List[str], read_id_column: str = 'read_id', ignore_incomplete_template: bool = False) → Dict[str, Set[str]]

Parse a table using pandas to create a mapping of output targets to read_ids
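
The grouping step can be shown without pandas by treating the table as a sequence of row dictionaries and filling the filename template from the subset columns of each row. This swaps pandas for plain Python to stay dependency-free; table_mapping and its rows argument are hypothetical, and the way the tool reads the table from disk is not covered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set


def table_mapping(
    rows: Iterable[Mapping[str, str]],
    template: str,
    subset_columns: List[str],
    read_id_column: str = "read_id",
) -> Dict[str, Set[str]]:
    """Hypothetical stand-in: group read_ids by the filename their row formats to."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        # Fill the template using only the columns selected for subsetting.
        target = template.format(**{col: row[col] for col in subset_columns})
        mapping[target].add(row[read_id_column])
    return dict(mapping)


rows = [
    {"read_id": "r1", "barcode": "b01"},
    {"read_id": "r2", "barcode": "b02"},
    {"read_id": "r3", "barcode": "b01"},
]
by_barcode = table_mapping(rows, "barcode-{barcode}.pod5", ["barcode"])
```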

resolve_targets(output: Path, mapping) → Dict[str, Set[Path]]

Resolve the targets from the mapping
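
Judging only from the signatures, the mapping of target name to read_ids appears to be turned into a mapping of read_id to target paths under the output directory, which matches the read_targets argument of calculate_transfers. The sketch below illustrates that inversion; the name invert_mapping and the interpretation itself are assumptions.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Mapping, Set


def invert_mapping(output: Path, mapping: Mapping[str, Set[str]]) -> Dict[str, Set[Path]]:
    """Hypothetical stand-in: map each read_id to every path it should be written to."""
    read_targets: Dict[str, Set[Path]] = defaultdict(set)
    for name, read_ids in mapping.items():
        for read_id in read_ids:
            # A read_id listed under several targets collects several paths.
            read_targets[read_id].add(output / name)
    return dict(read_targets)


resolved = invert_mapping(Path("out"), {"a.pod5": {"r1", "r2"}, "b.pod5": {"r2"}})
```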

subset_pod5(inputs: List[Path], output: Path, csv: Optional[Path], json: Optional[Path], table: Optional[Path], columns: List[str], threads: int, template: str, read_id_column: str, missing_ok: bool, duplicate_ok: bool, ignore_incomplete_template: bool, force_overwrite: bool) → Any

Prepare the subsetting mapping and run the repacker

subset_pod5s_with_mapping(inputs: Iterable[Path], output: Path, mapping: Mapping[str, Set[str]], threads: int = 1, missing_ok: bool = False, duplicate_ok: bool = False, force_overwrite: bool = False) → List[Path]

Given an iterable of input pod5 paths and an output directory, create output pod5 files containing the read_ids specified in the given mapping of output filename to set of read_ids.