Graph: module to model a GFA graph in memory

Models a graph object.

class gfagraphs.graph.Graph(gfa_file: str | None = None, with_sequence: bool = True, low_memory: bool = False, with_reverse_edges: bool = False, regexp: str = '.*')

Models a GFA graph in memory from a .gfa file.

Returns: an object made of dicts holding information about the data structure
Return type: Graph

add_dovetails() → None

Adds dovetails on the tips of the graph (at the start and end of each path).

add_edge(source: str, ori_source: str, sink: str, ori_sink: str, **metadata: dict) → None

Adds an edge to the current graph.

Parameters

source (str) – the node from which the edge leaves
ori_source (str) – the orientation in which the edge leaves the source node
sink (str) – the node to which the edge goes
ori_sink (str) – the orientation in which the edge enters the sink node
**metadata (Any) – optional, supplementary GFA-compatible tags.

Raises

ValueError – specified orientation is not compatible with GFA format
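
The orientation check described above can be pictured with a small standalone sketch. It does not use the library: the `add_edge_sketch` helper, the `edges` dict layout and the `VALID_ORIENTATIONS` set are all assumptions made for illustration. Only `+` and `-` are accepted here; the library itself may accept further symbols.

```python
# Hypothetical, standalone sketch: not the gfagraphs implementation.
VALID_ORIENTATIONS = {"+", "-"}  # assumption: only the two GFA link orientations


def add_edge_sketch(edges: dict, source: str, ori_source: str,
                    sink: str, ori_sink: str, **metadata) -> None:
    """Record an oriented edge, rejecting orientations outside the allowed set."""
    for orientation in (ori_source, ori_sink):
        if orientation not in VALID_ORIENTATIONS:
            raise ValueError(
                f"Orientation {orientation!r} is not compatible with GFA format"
            )
    # Assumed layout: one list of edge records per (source, sink) pair
    edges.setdefault((source, sink), []).append(
        {"orientation": (ori_source, ori_sink), **metadata}
    )


edges: dict = {}
add_edge_sketch(edges, "1", "+", "2", "-", overlap="0M")
```

Keying the store by `(source, sink)` mirrors the `tuple[str, str]` keys returned by the edge accessors documented on this page, but the actual storage used by `Graph` may differ.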

add_node(name: str, sequence: str, **metadata: dict) → None

Adds a node to the currently edited graph.

Parameters

name (str) – a name for the node to be added
sequence (str) – a label (substring) associated to the node
**metadata (Any) – optional, additional information for the node (must be GFA-compatible)

add_path(identifier: str, name: str, chain: list[tuple[str, gfagraphs.abstractions.Orientation]], start: int = 0, end: int | None = None, origin: str | None = None, **metadata: dict) → None

Adds a path to the currently edited graph. Please note that it does not add any nodes or edges that may be missing (as we could not assume the length of nodes nor the orientation of edges).

Parameters

name (str) – name of the path
chain (list[tuple[str, Orientation]]) – a series of tuples, each describing (node_name, orientation)
start (int, optional) – starting offset for the path, by default 0
end (int | None, optional) – ending offset of the path (length - start), by default None
origin (str | None, optional) – alternative name, used for W-line formatting, by default None

compute_child_nodes() → None

For each edge in the graph, annotates the outgoing (child) nodes from the edge information. This function is O(n), with n being the number of edges.

compute_neighbors() → None

Computes both predecessors and successors. This function is O(n), with n being the number of edges.

compute_orientations() → None

Computes both predecessors and successors, by their orientations. This function is O(n), with n being the number of edges.

compute_parent_nodes() → None

For each edge in the graph, annotates the incoming (parent) nodes from the edge information. This function is O(n), with n being the number of edges.
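
To see why these annotations are linear in the number of edges, here is a minimal standalone sketch of a single pass that fills both successor and predecessor sets. The edge list and the `compute_neighbors_sketch` helper are hypothetical and do not reflect how `Graph` stores its data.

```python
from collections import defaultdict

# Hypothetical edge list of (source, sink) pairs, assumed for this sketch
edges = [("1", "2"), ("1", "3"), ("2", "4"), ("3", "4")]


def compute_neighbors_sketch(edges):
    """Build successor and predecessor sets in one pass over the edges."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for source, sink in edges:  # each edge is visited exactly once: O(n)
        successors[source].add(sink)
        predecessors[sink].add(source)
    return successors, predecessors


successors, predecessors = compute_neighbors_sketch(edges)
```

Each edge contributes one constant-time set insertion per direction, which is where the O(n) bound comes from.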

get_edges(node_name: str) → list[tuple[tuple[str, str], dict]]

Returns all the edges of a node.

Parameters: node_name (str) – a node in the graph
Returns: for each edge matching criterion, the source and target as well as the supplementary tags
Return type: list[tuple[tuple[str, str], dict]]

get_free_node_name() → str

Asks the generator for the next available node name.

Returns: the next available node name. The generator must have been computed beforehand, and it won’t work in low_memory mode
Return type: str

get_in_edges(node_name: str) → list[tuple[tuple[str, str], dict]]

Returns all the entering edges of a node.

Parameters: node_name (str) – a node in the graph
Returns: for each edge matching criterion, the source and target as well as the supplementary tags
Return type: list[tuple[tuple[str, str], dict]]

get_next_unused_node_name() → str

Returns the next available integer, as a str, to identify a new node to be created, within the min-max range of nodes defined in the graph.

Returns: a possible node name in the graph which is not used currently
Return type: str
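
One way to picture this lookup is the standalone sketch below. It assumes that all node names are strings of integers and that the collection is not empty. The `next_unused_name_sketch` helper and its fallback to `max + 1` when the range has no gap are assumptions, not documented behavior.

```python
def next_unused_name_sketch(node_names) -> str:
    """Return the first integer name missing from the min-max range of names."""
    used = {int(name) for name in node_names}  # assumes integer-like names
    for candidate in range(min(used), max(used) + 1):
        if candidate not in used:
            return str(candidate)
    # Assumption of this sketch: with no gap in the range, use max + 1
    return str(max(used) + 1)


gap_name = next_unused_name_sketch(["1", "2", "4"])
fallback_name = next_unused_name_sketch(["1", "2", "3"])
```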

get_out_edges(node_name: str) → list[tuple[tuple[str, str], dict]]

Returns all the exiting edges of a node.

Parameters: node_name (str) – a node in the graph
Returns: for each edge matching criterion, the source and target as well as the supplementary tags
Return type: list[tuple[tuple[str, str], dict]]
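
The three edge accessors (`get_edges`, `get_in_edges`, `get_out_edges`) share the same return shape. The standalone sketch below shows how such a filter could look over a plain dict. The `{(source, sink): tags}` layout and the `get_edges_sketch` helper with its `incoming`/`outgoing` flags are hypothetical.

```python
# Assumed layout for this sketch: {(source, sink): tags}
edges = {
    ("1", "2"): {"overlap": "0M"},
    ("2", "3"): {},
    ("3", "2"): {},
}


def get_edges_sketch(edges, node_name, incoming=True, outgoing=True):
    """List ((source, sink), tags) for the edges touching node_name."""
    return [
        ((source, sink), tags)
        for (source, sink), tags in edges.items()
        if (outgoing and source == node_name) or (incoming and sink == node_name)
    ]


all_edges = get_edges_sketch(edges, "2")
in_edges = get_edges_sketch(edges, "2", outgoing=False)
out_edges = get_edges_sketch(edges, "2", incoming=False)
```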

global_offset(reference: str, threads: int = 1) → None

Creates a global offset (GO) for each node: the position its sequence would have if the sequences were represented as a left-normalized multiple alignment, with gaps. Positions are stored in the segments, under the “GO” tag.

Warning: if the reference has loops, positions will be ambiguous. Moreover, in this first version, only one coordinate is assigned per node, meaning that loops won’t be annotated twice.

As of now, this function is NOT RECOMMENDED for production use. It is RECURSIVE and will FAIL on HUGE GRAPHS.

Parameters

reference (str) – name of the path we want to use as backbone for our position system
threads (int, optional) – number of threads to use for computation (max parallel deep searches), by default 1

merge_segments(*segs: str, merge_name: str | None = None) → None

Given a series of nodes, merges them into the first of the series.

Parameters

merge_name (str | None, optional) – the name to merge to. If not specified, uses the first of the series, by default None
*segs (Series[str]) – a series of nodes to be merged. They must be consecutive and must not disturb other paths.

reconstruct_sequences() → dict[str, Generator]

Reads the paths (if they exist) that describe genomes in the graph, and aggregates the nodes (following their reading direction) per path.

Returns: mapping between name of path and generator of every substring in the path
Return type: dict[str, Generator]
Raises: RuntimeError – if the graph does not have paths
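
The path-to-generator mapping described above can be sketched without the library. Here `segments`, `paths`, `COMPLEMENT`, `read_path` and `reconstruct_sequences_sketch` are hypothetical names, and reverse-oriented nodes are assumed to be read as their reverse complement.

```python
from typing import Generator

# Assumed complement table and toy data for this sketch
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
segments = {"1": "ACG", "2": "TT", "3": "GA"}
paths = {"sample_a": [("1", "+"), ("2", "-"), ("3", "+")]}


def read_path(segments: dict, chain: list) -> Generator[str, None, None]:
    """Yield the sequence of each node of a path, in its reading direction."""
    for node_name, orientation in chain:
        sequence = segments[node_name]
        if orientation == "-":
            # Reverse-oriented nodes are read as their reverse complement
            sequence = "".join(COMPLEMENT[base] for base in reversed(sequence))
        yield sequence


def reconstruct_sequences_sketch(segments: dict, paths: dict) -> dict:
    """Map each path name to a lazy generator of its node sequences."""
    if not paths:
        raise RuntimeError("The graph does not have paths")
    return {name: read_path(segments, chain) for name, chain in paths.items()}


genome = "".join(reconstruct_sequences_sketch(segments, paths)["sample_a"])
```

Returning generators keeps each genome lazy, so a caller can stream a path without holding the whole concatenated sequence in memory.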

rename_node(old_name: str, new_name: str) → None

Replaces the node name and all references to it, be they in paths, node accessions, or edges.

Parameters

old_name (str) – the current name of the node
new_name (str) – the new name to be given to the node
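
As an illustration of what "all references" covers, the standalone sketch below renames a node across three plain containers. The `segments`, `edges` and `paths` layouts and the `rename_node_sketch` helper are hypothetical, and name collisions are not handled.

```python
def rename_node_sketch(segments: dict, edges: dict, paths: dict,
                       old_name: str, new_name: str) -> None:
    """Rename a node in segments, edge keys and path chains, in place."""
    segments[new_name] = segments.pop(old_name)
    # Iterate over a copy of the keys, since the dict is edited in the loop
    for source, sink in list(edges):
        if old_name in (source, sink):
            tags = edges.pop((source, sink))
            new_key = (
                new_name if source == old_name else source,
                new_name if sink == old_name else sink,
            )
            edges[new_key] = tags
    for chain in paths.values():
        chain[:] = [
            (new_name if node == old_name else node, orientation)
            for node, orientation in chain
        ]


segments = {"1": "ACG", "2": "TT"}
edges = {("1", "2"): {}, ("2", "2"): {}}
paths = {"w1": [("1", "+"), ("2", "-")]}
rename_node_sketch(segments, edges, paths, "2", "9")
```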

save_graph(output_file: str, minimal: bool = False, output_format: bool | Any = False) → None

Given a GFA graph loaded in memory, writes it to disk in a GFA-compatible format.

Parameters

output_file (str) – path on disk where to output the GFA file
minimal (bool, optional) – whether only the required tags should be written in the output file, by default False
output_format (bool | Any, optional) – a GFA subformat to write to, by default False

sequence_offsets(recalculate: bool = False) → None

Calculates the offsets of each node within each path.

Here, we aim to extend the current GFA tag format by adding tags that respect the GFA naming convention: a JSON string, PO (Path Offset), holding positions relative to paths. Hence, PO:J:{'w1':[(334,335,'+')],'w2':[(245,247,'-')]} tells that the walk/path w1 contains the sequence starting at position 334 and ending at position 335, that the walk/path w2 contains the sequence starting at offset 245 and ending at 247, and that the two sequences are reversed relative to each other. Note that any walk not referenced in this field does not contain the node.

Parameters: recalculate (bool, optional) – If the offsets should be re-computed from scratch, by default False
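
The PO layout described above can be reproduced with a short standalone sketch. It assumes half-open `(start, end)` intervals counted from the start of each path; the library's exact end convention is not stated here and may differ. The `segments`, `paths` and `sequence_offsets_sketch` names are hypothetical.

```python
def sequence_offsets_sketch(segments: dict, paths: dict) -> dict:
    """For each node, list (start, end, orientation) per path containing it."""
    offsets = {name: {} for name in segments}
    for path_name, chain in paths.items():
        position = 0
        for node_name, orientation in chain:
            end = position + len(segments[node_name])  # assumed exclusive end
            offsets[node_name].setdefault(path_name, []).append(
                (position, end, orientation)
            )
            position = end
    return offsets


segments = {"1": "ACG", "2": "TT"}
paths = {"w1": [("1", "+"), ("2", "+")], "w2": [("2", "-")]}
offsets = sequence_offsets_sketch(segments, paths)
```

As in the PO description, a path that never visits a node simply has no entry in that node's mapping.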

split_segments(segment_name: str, future_segment_name: list, position_to_split: list) → None

Given a segment to split and one or several new names and positions, breaks the node into multiple nodes and includes the resulting splits in the Graph data.

If you want to split the segment A into A, B, … you must provide self.split_segments(A,[A,B, …],[(start_A,end_A),(start_B,end_B), …])

Parameters

segment_name (str) – the node to split
future_segment_name (list) – the future names of the nodes. The current name will be used for the first node of the splitting series
position_to_split (list) – a list of breakpoints at which to split.

Raises

ValueError – the number of specified breakpoints is incompatible with the number of names provided
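
The relationship between names and breakpoints can be illustrated on a bare sequence. This standalone sketch only cuts the string; it does not update edges or paths as the method is described to do. The `split_sequence_sketch` helper and the half-open `(start, end)` slicing convention are assumptions.

```python
def split_sequence_sketch(sequence: str, future_names: list,
                          positions: list) -> dict:
    """Cut a sequence into named pieces, one (start, end) pair per name."""
    if len(future_names) != len(positions):
        raise ValueError(
            "The number of breakpoints is incompatible with the number of names"
        )
    # Assumption: (start, end) pairs follow Python slicing (end excluded)
    return {
        name: sequence[start:end]
        for name, (start, end) in zip(future_names, positions)
    }


pieces = split_sequence_sketch("ACGTAC", ["A", "B"], [(0, 2), (2, 6)])
```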

unfold() → None

[EXPERIMENTAL, WIP] Applies an unfolding on cycles, which allows them to be linearized.

WARNING: May solely be used on graphs with paths.
WARNING: Not fully tested yet, use at your own discretion.
TODO: fix closing edge of cycle not destroyed.

Raises

NotImplementedError – the graph does not have paths
RuntimeError – the graph was loaded in incorrect mode

gfagraphs.graph.futures_collector(func: Callable, argslist: list, kwargslist: list[dict] | None = None, num_processes: int = 2) → list

Spawns len(argslist) instances of func and executes them num_processes instances at a time.

Parameters

func (Callable) – a function
argslist (list) – a list of tuples, the arguments of each func
kwargslist (list[dict]) – a list of dicts, the kwargs of each func
num_processes (int) – max number of concurrent instances. Default: number of available logical cores
memory (float | None) – ratio of memory to be used, ranging from .05 to .95. Will not work if resource is incompatible.
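
The batching pattern described above (one call per entry of `argslist`, at most `num_processes` running at once) can be sketched with the standard library. Note that this sketch deliberately uses a `ThreadPoolExecutor` rather than processes so that it runs anywhere; it is not the library's implementation. The `futures_collector_sketch` and `scaled_sum` names are hypothetical, and returning results in submission order is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def futures_collector_sketch(func, argslist, kwargslist=None, num_processes=2):
    """Run func once per args tuple, with a bounded number of workers."""
    if kwargslist is None:
        kwargslist = [{} for _ in argslist]
    if len(kwargslist) != len(argslist):
        raise ValueError("argslist and kwargslist must have the same length")
    # Threads stand in for processes here (a swapped-in technique)
    with ThreadPoolExecutor(max_workers=num_processes) as executor:
        futures = [
            executor.submit(func, *args, **kwargs)
            for args, kwargs in zip(argslist, kwargslist)
        ]
        # Results are collected in submission order
        return [future.result() for future in futures]


def scaled_sum(a, b, factor=1):
    """Toy workload used only to exercise the collector."""
    return (a + b) * factor


results = futures_collector_sketch(
    scaled_sum, [(1, 2), (3, 4)], [{"factor": 2}, {}]
)
```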

gfagraphs.graph.revcomp(string: str, compl: dict = {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A'}) → str

Tries to compute the reverse complement of a sequence.

Parameters

string (str) – original character set
compl (dict, optional) – dict of correspondences, by default {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A'}

Returns: the reverse-complemented string
Return type: str
Raises: IndexError – if revcomp encounters a char that is not in the dict