pod5_subset

Tool for subsetting pod5 files into one or more outputs

class WorkQueue(context: SpawnContext, transfers: LazyFrame)

Bases: object

__init__(context: SpawnContext, transfers: LazyFrame) -> None

join() -> None

Call join on the work queue, waiting for all tasks to be done

shutdown() -> int

Shut down all queues, returning the count of all remaining items

assert_filename_template(template: str, subset_columns: List[str], ignore_incomplete_template: bool) -> None

Get the keys named in the template to assert that they exist in subset_columns

assert_overwrite_ok(targets: LazyFrame, force_overwrite: bool) -> None

Given the target filenames, assert that no existing file would be overwritten unless force_overwrite is set, raising a FileExistsError otherwise. If force_overwrite is set, existing target files are unlinked.
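The check described above can be sketched with plain `pathlib` paths. This is a minimal illustration, not the library's implementation: `check_overwrite` is a hypothetical name, and it takes an iterable of paths rather than a LazyFrame.

```python
from pathlib import Path
from typing import Iterable


def check_overwrite(targets: Iterable[Path], force_overwrite: bool) -> None:
    """Hypothetical helper mirroring the behaviour described above."""
    existing = [path for path in targets if path.exists()]
    if not existing:
        return
    if not force_overwrite:
        # Refuse to clobber anything unless the caller asked for it.
        raise FileExistsError(
            f"{len(existing)} target file(s) already exist; "
            "set force_overwrite to replace them"
        )
    for path in existing:
        # Remove the old file so the new output starts from scratch.
        path.unlink()
```

Unlinking up front, rather than at write time, means a forced run never appends to or partially reuses a stale output file.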

calculate_transfers(sources: LazyFrame, targets: LazyFrame, missing_ok: bool) -> LazyFrame

Produce the transfers dataframe which maps the read_ids, source and destination
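The mapping can be pictured as a join on read_id between the sources and the targets. The sketch below uses plain dictionaries instead of LazyFrames; `plan_transfers` is a hypothetical name, and the treatment of `missing_ok` (tolerating requested read_ids that are absent from every source) and the choice of `KeyError` are assumptions, not confirmed library behaviour.

```python
from typing import Dict, List, Tuple


def plan_transfers(
    sources: Dict[str, str],        # read_id -> source filename
    targets: Dict[str, List[str]],  # destination filename -> read_ids
    missing_ok: bool,
) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: return (read_id, source, destination) rows."""
    transfers: List[Tuple[str, str, str]] = []
    missing: List[str] = []
    for destination, read_ids in targets.items():
        for read_id in read_ids:
            source = sources.get(read_id)
            if source is None:
                # Requested read_id was not found in any input file.
                missing.append(read_id)
                continue
            transfers.append((read_id, source, destination))
    if missing and not missing_ok:
        raise KeyError(f"{len(missing)} requested read_id(s) not found in sources")
    return transfers
```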

column_keys_from_template(template: str) -> List[str]

Get a list of placeholder keys in the template
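One way to extract placeholder keys from a format-style template is the standard library's `string.Formatter`. The helper name `placeholder_keys` is hypothetical, and the source does not state that the library uses this approach.

```python
from string import Formatter
from typing import List


def placeholder_keys(template: str) -> List[str]:
    """Hypothetical helper: list the named placeholders in a template."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for a trailing run of literal text.
    return [name for _, name, _, _ in Formatter().parse(template) if name]


print(placeholder_keys("{barcode}_mux-{mux}.pod5"))  # ['barcode', 'mux']
```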

create_default_filename_template(subset_columns: List[str]) -> str

Create the default filename template from the subset_columns selected

default_filename_template(subset_columns: List[str]) -> str

Create the default filename template from the subset_columns selected
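As an illustration of what a default template built from the subset columns might look like, the sketch below joins one `column-{column}` segment per column. The exact layout (the `-` and `_` separators and the `.pod5` suffix) is an assumption, and `sketch_default_template` is a hypothetical name.

```python
from typing import List


def sketch_default_template(subset_columns: List[str]) -> str:
    """Hypothetical helper: build a template such as 'mux-{mux}.pod5'."""
    # Doubled braces produce literal braces, so each segment keeps a
    # named placeholder to be filled in later per output file.
    segments = [f"{column}-{{{column}}}" for column in subset_columns]
    return "_".join(segments) + ".pod5"


template = sketch_default_template(["barcode", "mux"])
print(template)  # barcode-{barcode}_mux-{mux}.pod5
```

Embedding the column name next to its value keeps output filenames self-describing when several columns are combined.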

fstring_to_polars(template: str) -> Tuple[str, List[str]]

Replace f-string keyed placeholders with positional ones and return the keys in their respective positions
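The conversion can be sketched with a regular expression: each `{key}` becomes a bare `{}` and the keys are collected in order, which is the placeholder style that `polars.format` accepts. `to_positional` is a hypothetical name, and this simplified version assumes plain word-character keys with no format specs or escaped braces.

```python
import re
from typing import List, Tuple

# Matches simple keyed placeholders such as {barcode}; an assumption of
# this sketch is that keys contain only letters, digits and underscores.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def to_positional(template: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: return the positional template and its keys."""
    keys = PLACEHOLDER.findall(template)
    positional = PLACEHOLDER.sub("{}", template)
    return positional, keys


print(to_positional("{barcode}_mux-{mux}.pod5"))
# ('{}_mux-{}.pod5', ['barcode', 'mux'])
```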

get_separator(path: Path) -> str

Inspect the first line of the file at path and attempt to determine the field separator as either tab or comma, depending on the number of occurrences of each. Returns “,” or “<tab>”.
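A minimal sketch of this kind of separator detection follows. `guess_separator` is a hypothetical name, and falling back to a comma when the counts are equal is an assumption of the sketch.

```python
from pathlib import Path


def guess_separator(path: Path) -> str:
    """Hypothetical helper: return a tab or a comma based on the first line."""
    with path.open("r") as handle:
        first_line = handle.readline()
    # Pick whichever candidate occurs more often in the header line;
    # ties (including a line with neither) fall back to a comma.
    if first_line.count("\t") > first_line.count(","):
        return "\t"
    return ","
```

Only the first line is read, so the check stays cheap even for very large summary tables.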

launch_subsetting(transfers: LazyFrame, threads: int = 2) -> None

Iterate over the transfers dataframe, subsetting reads from sources to destinations

main()

pod5 subsample main

overall_progress(queue: WorkQueue)

parse_csv_mapping(csv_path: Path) -> LazyFrame

Parse the CSV direct mapping of output targets to read_ids into a targets dataframe

parse_source(path: Path) -> LazyFrame

Reads the read ids available in a given pod5 file, returning a dataframe with the formatted read_ids and the source filename

parse_source_process(paths: JoinableQueue, parsed_sources: Queue)

Parse sources until paths queue is consumed

parse_sources(paths: Set[Path], duplicate_ok: bool, threads: int = 2) -> LazyFrame

Reads all inputs and returns a formatted lazy dataframe

parse_table_mapping(summary_path: Path, filename_template: Optional[str], subset_columns: List[str], read_id_column: str = 'read_id', ignore_incomplete_template: bool = False) -> LazyFrame

Parse a table using polars to create a mapping of output targets to read ids

process_subset_tasks(queue: WorkQueue, process: int)

Consumes work from the queue and launches subsetting tasks

resolve_output_targets(targets: LazyFrame, output: Path) -> LazyFrame

Prepend the output path to the target filename and resolve the complete string
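The idea can be shown with plain `pathlib` objects in place of a LazyFrame column. `resolve_targets` is a hypothetical name, and the use of `Path.resolve` to obtain the complete path is an assumption about what "resolve" means here.

```python
from pathlib import Path
from typing import List


def resolve_targets(filenames: List[str], output: Path) -> List[Path]:
    """Hypothetical helper: join each target filename onto the output directory."""
    # Path.resolve() makes the result absolute and collapses any ".." parts;
    # the file itself does not need to exist yet.
    return [(output / name).resolve() for name in filenames]


for target in resolve_targets(["barcode-bc01.pod5"], Path("subset_out")):
    print(target)
```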

subset_pod5(inputs: List[Path], output: Path, columns: List[str], csv: Optional[Path] = None, table: Optional[Path] = None, threads: int = 2, template: str = '', read_id_column: str = 'read_id', missing_ok: bool = False, duplicate_ok: bool = False, ignore_incomplete_template: bool = False, force_overwrite: bool = False, recursive: bool = False) -> Any

Prepare the subsampling mapping and run the repacker

subset_pod5s_with_mapping(inputs: Set[Path], output: Path, targets: LazyFrame, threads: int = 2, missing_ok: bool = False, duplicate_ok: bool = False, force_overwrite: bool = False) -> None

Given a set of input pod5 paths and an output directory, create output pod5 files containing the read_ids specified in the given mapping of output filename to set of read_ids.

subset_reads(dest: Path, sources: DataFrame, process: int) -> None

Copy the reads in sources into a new pod5 file at dest