pod5_subset

Tool for subsetting pod5 files into one or more outputs

assert_filename_template(template: str, subset_columns: List[str], ignore_incomplete_template: bool) → None: Get the keys named in the template and assert that they exist in subset_columns
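The key check described above can be sketched with the standard library alone. This is an illustration, not the pod5 implementation: the names `template_keys` and `check_template` are hypothetical, the template is assumed to use `str.format` style fields, and the treatment of `ignore_incomplete_template` (tolerating columns that the template does not use) is an assumption.

```python
from string import Formatter
from typing import List, Set


def template_keys(template: str) -> Set[str]:
    """Return the field names used in a str.format-style template."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_template(
    template: str,
    subset_columns: List[str],
    ignore_incomplete_template: bool = False,
) -> None:
    """Hypothetical stand-in for the template assertion."""
    keys = template_keys(template)
    columns = set(subset_columns)

    # Every key named in the template must be one of the selected columns.
    unknown = keys - columns
    if unknown:
        raise ValueError(f"template keys not in subset_columns: {sorted(unknown)}")

    # Assumption: columns missing from the template are an error unless ignored.
    unused = columns - keys
    if unused and not ignore_incomplete_template:
        raise ValueError(f"subset_columns not used in template: {sorted(unused)}")
```

For example, `check_template("{barcode}.pod5", ["barcode", "mux"])` raises in this sketch because `mux` is unused, while passing `ignore_incomplete_template=True` lets it through.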

assert_overwrite_ok(output: Path, names: Iterable[str], force_overwrite: bool) → None: Given the output directory path and target filenames, assert that no existing file will be overwritten unless force_overwrite is set, raising a FileExistsError otherwise
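A minimal sketch of such an overwrite guard is shown below. The function name `check_overwrite` is hypothetical, and the sketch only checks for existing files; whatever the real function does with existing files when `force_overwrite` is set is not covered here.

```python
from pathlib import Path
from typing import Iterable


def check_overwrite(output: Path, names: Iterable[str], force_overwrite: bool) -> None:
    """Hypothetical guard: refuse to clobber existing outputs unless forced."""
    # Collect every target path that already exists in the output directory.
    existing = [output / name for name in names if (output / name).exists()]
    if existing and not force_overwrite:
        listed = ", ".join(str(path) for path in existing)
        raise FileExistsError(f"output files already exist: {listed}")
```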

calculate_transfers(inputs: List[Path], read_targets: Dict[str, Set[Path]], missing_ok: bool, duplicate_ok: bool) → Dict[Path, Dict[Path, Set[str]]]: Calculate the transfers, which store each collection of read_ids together with its source and destination.
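The nested return type can be illustrated with plain dictionaries. The sketch below is not the pod5 implementation: `sketch_transfers` is a hypothetical name, the read_ids of each input are supplied as an in-memory `source_read_ids` dictionary instead of being read from pod5 files, the outer key is assumed to be the destination and the inner key the source, and `duplicate_ok` handling is left out.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Set


def sketch_transfers(
    source_read_ids: Dict[Path, Set[str]],
    read_targets: Dict[str, Set[Path]],
    missing_ok: bool = False,
) -> Dict[Path, Dict[Path, Set[str]]]:
    """Group requested read_ids as destination -> source -> read_ids."""
    transfers: Dict[Path, Dict[Path, Set[str]]] = defaultdict(lambda: defaultdict(set))
    found: Set[str] = set()

    for source, read_ids in source_read_ids.items():
        # Only read_ids that were requested need to be transferred.
        for read_id in read_ids.intersection(read_targets):
            found.add(read_id)
            for destination in read_targets[read_id]:
                transfers[destination][source].add(read_id)

    # Requested read_ids that no input contains are an error unless allowed.
    missing = set(read_targets) - found
    if missing and not missing_ok:
        raise KeyError(f"{len(missing)} requested read_ids were not found")

    return {dest: dict(sources) for dest, sources in transfers.items()}
```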

create_default_filename_template(subset_columns: List[str]) → str: Create the default filename template from the subset_columns selected
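One way to derive a template from the selected columns is sketched below. The exact pattern (`column-{column}` parts joined by underscores with a `.pod5` suffix) is an assumption made for illustration, and `sketch_default_template` is a hypothetical name.

```python
from typing import List


def sketch_default_template(subset_columns: List[str]) -> str:
    """Build a str.format-style filename template from column names."""
    # Doubled braces in the f-string emit literal braces, so "barcode"
    # becomes the text "barcode-{barcode}".
    parts = [f"{column}-{{{column}}}" for column in subset_columns]
    return "_".join(parts) + ".pod5"
```

Under these assumptions, `sketch_default_template(["barcode", "mux"])` gives `barcode-{barcode}_mux-{mux}.pod5`, which can then be filled in with `str.format` for each combination of column values.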

launch_subsetting(transfers: Dict[Path, Dict[Path, Set[str]]], show_pbar: bool = False) → None: Iterate over the transfers one target at a time, opening sources and copying the required read_ids. Wait for the repacker to finish before moving on to ensure we don’t have too many open file handles.

main(): pod5 subsample main

parse_csv_mapping(csv_path: Path) → Dict[str, Set[str]]: Parse the csv direct mapping of output target to read_ids
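As a rough illustration of this kind of parsing, the sketch below assumes a CSV layout in which the first cell of each row is the output target and the remaining cells are read_ids. That layout and the name `sketch_parse_csv_mapping` are assumptions, not a statement of the format pod5 accepts.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, Set


def sketch_parse_csv_mapping(csv_path: Path) -> Dict[str, Set[str]]:
    """Parse rows of 'target, read_id, read_id, ...' into target -> read_ids."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    with csv_path.open(newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            target, *read_ids = (cell.strip() for cell in row)
            # Rows that repeat a target are merged; empty cells are dropped.
            mapping[target].update(rid for rid in read_ids if rid)
    return dict(mapping)
```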

parse_direct_mapping_targets(csv_path: Optional[Path] = None, json_path: Optional[Path] = None) → Dict[str, Set[str]]: Parse either the csv or json direct mapping of output target to read_ids

Return type: dictionary mapping of output target to read_ids

parse_json_mapping(json_path: Path) → Dict[str, Set[str]]: Parse the json direct mapping of output target to read_ids
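A JSON version of the same mapping might be loaded as follows. The sketch assumes the file holds a single object whose keys are output targets and whose values are lists of read_ids; both that layout and the name `sketch_parse_json_mapping` are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Dict, Set


def sketch_parse_json_mapping(json_path: Path) -> Dict[str, Set[str]]:
    """Load {'target': ['read_id', ...]} and convert each list to a set."""
    with json_path.open() as handle:
        raw = json.load(handle)

    if not isinstance(raw, dict):
        raise ValueError("expected a JSON object of target -> list of read_ids")

    mapping: Dict[str, Set[str]] = {}
    for target, read_ids in raw.items():
        if not isinstance(read_ids, list):
            raise ValueError(f"read_ids for {target!r} must be a list")
        # Converting to a set removes any repeated read_ids within a target.
        mapping[target] = {str(read_id) for read_id in read_ids}
    return mapping
```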

parse_table_mapping(summary_path: Path, filename_template: Optional[str], subset_columns: List[str], read_id_column: str = 'read_id', ignore_incomplete_template: bool = False) → Dict[str, Set[str]]: Parse a table using pandas to create a mapping of output targets to read ids

resolve_targets(output: Path, mapping) → Dict[str, Set[Path]]: Resolve the targets from the mapping
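Judging only from the signatures, the result has the same type as the `read_targets` argument of calculate_transfers, so resolving plausibly inverts the target-to-read_ids mapping into read_id-to-paths. The sketch below shows that inversion under this assumption; `sketch_resolve_targets` is a hypothetical name and the real function may differ.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Mapping, Set


def sketch_resolve_targets(
    output: Path, mapping: Mapping[str, Set[str]]
) -> Dict[str, Set[Path]]:
    """Invert target -> read_ids into read_id -> set of output paths."""
    resolved: Dict[str, Set[Path]] = defaultdict(set)
    for target, read_ids in mapping.items():
        for read_id in read_ids:
            # A read_id requested by several targets gets several paths.
            resolved[read_id].add(output / target)
    return dict(resolved)
```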

subset_pod5(inputs: List[Path], output: Path, csv: Optional[Path], json: Optional[Path], table: Optional[Path], columns: List[str], threads: int, template: str, read_id_column: str, missing_ok: bool, duplicate_ok: bool, ignore_incomplete_template: bool, force_overwrite: bool) → Any: Prepare the subsampling mapping and run the repacker

subset_pod5s_with_mapping(inputs: Iterable[Path], output: Path, mapping: Mapping[str, Set[str]], threads: int = 1, missing_ok: bool = False, duplicate_ok: bool = False, force_overwrite: bool = False) → List[Path]: Given an iterable of input pod5 paths and an output directory, create output pod5 files containing the read_ids specified in the given mapping of output filename to set of read_ids.