pod5_subset
Tool for subsetting pod5 files into one or more outputs
- assert_filename_template(template: str, subset_columns: List[str], ignore_incomplete_template: bool) -> None
Get the keys named in the template to assert that they exist in subset_columns
- assert_overwrite_ok(targets: LazyFrame, force_overwrite: bool) -> None
Given the target filenames, assert that no existing file will be overwritten unless force_overwrite is set, raising a FileExistsError otherwise. If force_overwrite is set, any existing target files are unlinked.
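The overwrite check described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `check_overwrite` is a hypothetical helper that takes plain paths rather than a LazyFrame, and the empty placeholder file stands in for a real pod5 output.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def check_overwrite(targets: Iterable[Path], force_overwrite: bool) -> None:
    """Hypothetical helper: refuse to clobber existing targets unless forced."""
    existing = [path for path in targets if path.exists()]
    if existing and not force_overwrite:
        raise FileExistsError(f"Refusing to overwrite: {existing}")
    # Only reached when nothing exists or when overwriting was requested.
    for path in existing:
        path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "subset.pod5"
    target.touch()  # empty placeholder, not a valid pod5 file

    try:
        check_overwrite([target], force_overwrite=False)
        refused = False
    except FileExistsError:
        refused = True

    # With force_overwrite set, the existing file is removed instead.
    check_overwrite([target], force_overwrite=True)
    removed = not target.exists()
```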
- calculate_transfers(sources: LazyFrame, targets: LazyFrame, missing_ok: bool) -> LazyFrame
Produce the transfers dataframe which maps the read_ids, source and destination
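As a rough illustration of what such a transfers mapping contains, the sketch below pairs each requested read_id with its source and destination using plain dictionaries. It is an assumption-laden stand-in for the dataframe join: `plan_transfers`, the dictionary shapes, and the choice of `KeyError` for missing reads are all hypothetical and not taken from the library.

```python
from typing import Dict, List, Set, Tuple


def plan_transfers(
    sources: Dict[str, str],       # read_id -> source filename (assumed shape)
    targets: Dict[str, Set[str]],  # destination filename -> read_ids (assumed shape)
    missing_ok: bool,
) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: list (read_id, source, destination) rows."""
    transfers = []
    missing = []
    for destination, read_ids in sorted(targets.items()):
        for read_id in sorted(read_ids):
            source = sources.get(read_id)
            if source is None:
                missing.append(read_id)
                continue
            transfers.append((read_id, source, destination))
    # The exception type here is an illustrative choice.
    if missing and not missing_ok:
        raise KeyError(f"read_ids not found in any source: {missing}")
    return transfers


sources = {"r1": "a.pod5", "r2": "b.pod5"}
targets = {"out1.pod5": {"r1", "r2"}, "out2.pod5": {"r3"}}
transfers = plan_transfers(sources, targets, missing_ok=True)
```

With `missing_ok=True` the unknown read `r3` is skipped; with `missing_ok=False` the same call raises.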
- column_keys_from_template(template: str) -> List[str]
Get a list of placeholder keys in the template
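One way to extract placeholder keys from a format-style template is the standard library's `string.Formatter().parse`. The sketch below assumes templates use `{name}` placeholders; `placeholder_keys` is a hypothetical name, and the library may do this differently.

```python
from string import Formatter
from typing import List


def placeholder_keys(template: str) -> List[str]:
    """Hypothetical helper: named placeholder keys in order of appearance."""
    # parse() yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text and "" for a bare "{}".
    return [field for _, field, _, _ in Formatter().parse(template) if field]


keys = placeholder_keys("{barcode}_{channel}.pod5")
no_keys = placeholder_keys("output.pod5")
```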
- create_default_filename_template(subset_columns: List[str]) -> str
Create the default filename template from the subset_columns selected
- default_filename_template(subset_columns: List[str]) -> str
Create the default filename template from the subset_columns selected
- fstring_to_polars(template: str) -> Tuple[str, List[str]]
Replace f-string keyed placeholders with positional ones and return the keys in their respective positions
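The conversion from keyed to positional placeholders can be sketched as follows. This is an illustration under assumptions rather than the library's code: `to_positional` is a hypothetical name, format specs and conversions are dropped, and escaped braces (`{{` and `}}`) are not preserved.

```python
from string import Formatter
from typing import List, Tuple


def to_positional(template: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: swap each {key} for {} and collect the keys."""
    parts: List[str] = []
    keys: List[str] = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        parts.append(literal)
        if field is not None:
            parts.append("{}")  # positional placeholder
            keys.append(field)  # key for the same position
    return "".join(parts), keys


positional, keys = to_positional("{barcode}_{channel}.pod5")
```

The returned keys line up with the `{}` slots, so the pair can be handed to any formatter that fills placeholders by position.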
- get_separator(path: Path) -> str
Inspect the first line of the file at path and attempt to determine the field separator as either tab or comma, depending on the number of occurrences of each. Returns “,” or “<tab>”
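A minimal sketch of this kind of separator detection is shown below. The function names are hypothetical, and the tie-breaking rule (falling back to a comma when tabs do not outnumber commas) is an assumption that the description does not specify.

```python
from pathlib import Path


def separator_from_line(first_line: str) -> str:
    """Hypothetical helper: pick tab or comma by counting occurrences."""
    tabs = first_line.count("\t")
    commas = first_line.count(",")
    # Assumption: ties (including no separators at all) fall back to comma.
    return "\t" if tabs > commas else ","


def guess_separator(path: Path) -> str:
    """Read only the first line of the file and classify it."""
    with path.open("r") as handle:
        return separator_from_line(handle.readline())


tsv_separator = separator_from_line("read_id\tbarcode\tchannel\n")
csv_separator = separator_from_line("read_id,barcode,channel\n")
```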
- launch_subsetting(transfers: LazyFrame, threads: int = 2) -> None
Iterate over the transfers dataframe subsetting reads from sources to destinations
- parse_csv_mapping(csv_path: Path) -> LazyFrame
Parse the csv direct mapping of output target to read_ids to a targets dataframe
- parse_source(path: Path) -> LazyFrame
Reads the read ids available in a given pod5 file, returning a dataframe with the formatted read_ids and the source filename
- parse_source_process(paths: JoinableQueue, parsed_sources: Queue)
Parse sources until paths queue is consumed
- parse_sources(paths: Set[Path], duplicate_ok: bool, threads: int = 2) -> LazyFrame
Reads all inputs and returns a formatted lazy dataframe
- parse_table_mapping(summary_path: Path, filename_template: Optional[str], subset_columns: List[str], read_id_column: str = 'read_id', ignore_incomplete_template: bool = False) -> LazyFrame
Parse a table using polars to create a mapping of output targets to read ids
- process_subset_tasks(queue: WorkQueue, process: int)
Consumes work from the queue and launches subsetting tasks
- resolve_output_targets(targets: LazyFrame, output: Path) -> LazyFrame
Prepend the output path to the target filename and resolve the complete string
- subset_pod5(inputs: List[Path], output: Path, columns: List[str], csv: Optional[Path] = None, table: Optional[Path] = None, threads: int = 2, template: str = '', read_id_column: str = 'read_id', missing_ok: bool = False, duplicate_ok: bool = False, ignore_incomplete_template: bool = False, force_overwrite: bool = False, recursive: bool = False) -> Any
Prepare the subsampling mapping and run the repacker
- subset_pod5s_with_mapping(inputs: Set[Path], output: Path, targets: LazyFrame, threads: int = 2, missing_ok: bool = False, duplicate_ok: bool = False, force_overwrite: bool = False) -> None
Given a set of input pod5 paths and an output directory, create output pod5 files containing the read_ids specified in the given mapping of output filename to set of read_ids.