Graph: module to model a GFA graph in memory
Models a graph object.
- class gfagraphs.graph.Graph(gfa_file: str | None = None, with_sequence: bool = True, low_memory: bool = False, with_reverse_edges: bool = False, regexp: str = '.*')
Models a GFA graph in memory from a .gfa file.
- Returns
object made of dicts holding information about the data structure
- Return type
- add_dovetails() → None
Adds dovetails on the tips of the graph (at the start and end of each path).
- add_edge(source: str, ori_source: str, sink: str, ori_sink: str, **metadata: dict) → None
Adds an edge to the current graph.
- Parameters
source (str) – the node from which the edge originates
ori_source (str) – the orientation with which the edge leaves the source node
sink (str) – the node to which the edge goes
ori_sink (str) – the orientation with which the edge enters the sink node
**metadata (Any) – optional, supplementary GFA-compatible tags
- Raises
ValueError – specified orientation is not compatible with GFA format
- add_node(name: str, sequence: str, **metadata: dict) → None
Adds a node to the graph currently being edited.
- Parameters
name (str) – a name for the node to be added
sequence (str) – a label (substring) associated with the node
**metadata (Any) – optional, additional information for the node (must be GFA-compatible)
- add_path(identifier: str, name: str, chain: list[tuple[str, gfagraphs.abstractions.Orientation]], start: int = 0, end: int | None = None, origin: str | None = None, **metadata: dict) → None
Adds a path to the graph currently being edited. Please note that it does not add any missing nodes or edges (as we cannot assume the length of nodes or the orientation of edges).
- Parameters
name (str) – name of the path
chain (list[tuple[str, Orientation]]) – a series of tuples, each describing (node_name, orientation)
start (int, optional) – starting offset for the path, by default 0
end (int | None, optional) – ending offset of the path (length - start), by default None
origin (str | None, optional) – alternative name, used for W-line formatting, by default None
- compute_child_nodes() → None
For each edge in the graph, annotates extruding nodes from the edges info. This function is O(n), with n being the number of edges.
- compute_neighbors() → None
Computes both predecessors and successors. This function is O(n), with n being the number of edges.
- compute_orientations() → None
Computes both predecessors and successors, by their orientations. This function is O(n), with n being the number of edges.
- compute_parent_nodes() → None
For each edge in the graph, annotates intruding nodes from the edges info. This function is O(n), with n being the number of edges.
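The single pass over the edges that gives these functions their O(n) cost can be illustrated with a minimal, standalone sketch. It does not use the library's internal data structures: `neighbors_sketch` and the `(source, sink)` edge list are hypothetical names chosen for illustration only.

```python
from collections import defaultdict


def neighbors_sketch(edges):
    """Build predecessor and successor sets in one pass over the edges.

    `edges` is assumed to be an iterable of (source, sink) node-name pairs;
    this is an illustrative layout, not the library's storage format.
    """
    predecessors = defaultdict(set)
    successors = defaultdict(set)
    for source, sink in edges:
        # Each edge is visited exactly once, hence O(n) in the number of edges.
        successors[source].add(sink)
        predecessors[sink].add(source)
    return dict(predecessors), dict(successors)


edges = [("1", "2"), ("1", "3"), ("2", "4"), ("3", "4")]
predecessors, successors = neighbors_sketch(edges)
```

Nodes without incoming edges simply do not appear in `predecessors`, which is one way of spotting the tips of the graph.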
- get_edges(node_name: str) → list[tuple[tuple[str, str], dict]]
Returns all the edges of a node.
- Parameters
node_name (str) – a node in the graph
- Returns
for each edge matching the criterion, the source and target as well as the supplementary tags
- Return type
list[tuple[tuple[str, str], dict]]
- get_free_node_name() → str
Asks the generator for the next available node name
- Returns
the next available node name; the generator must already be computed, and this does not work in low_memory mode
- Return type
str
- get_in_edges(node_name: str) → list[tuple[tuple[str, str], dict]]
Returns all the entering edges of a node.
- Parameters
node_name (str) – a node in the graph
- Returns
for each edge matching the criterion, the source and target as well as the supplementary tags
- Return type
list[tuple[tuple[str, str], dict]]
- get_next_unused_node_name() → str
Returns the next available integer, as a str, to identify a new node to be created, within the min-max range of nodes defined in the graph.
- Returns
a possible node name in the graph which is not used currently
- Return type
str
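One plausible way to find an unused integer name within the min-max range is sketched below. This is not the library's implementation: `next_unused_name_sketch` is a hypothetical helper, it assumes every node name is an integer written as a string, and the fallback to `max + 1` when the range has no gap is an assumption the documentation does not confirm.

```python
def next_unused_name_sketch(node_names):
    """Return the lowest integer name that is unused between min and max.

    Assumes `node_names` is a non-empty collection of integer-like strings.
    """
    used = {int(name) for name in node_names}
    for candidate in range(min(used), max(used) + 1):
        if candidate not in used:
            return str(candidate)
    # Assumption: when the range is fully occupied, extend it by one.
    return str(max(used) + 1)


gap_name = next_unused_name_sketch(["1", "2", "4"])
full_name = next_unused_name_sketch(["1", "2", "3"])
```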
- get_out_edges(node_name: str) → list[tuple[tuple[str, str], dict]]
Returns all the exiting edges of a node.
- Parameters
node_name (str) – a node in the graph
- Returns
for each edge matching the criterion, the source and target as well as the supplementary tags
- Return type
list[tuple[tuple[str, str], dict]]
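The return shape shared by get_edges, get_in_edges and get_out_edges, a list of ((source, sink), tags) tuples, can be illustrated with a standalone sketch. It assumes edges are held in a dict keyed by (source, sink) pairs, which the return type suggests but the documentation does not state; `get_edges_sketch` and its `incoming` and `outgoing` flags are hypothetical.

```python
def get_edges_sketch(edges, node_name, incoming=True, outgoing=True):
    """Select the edges touching `node_name` from a {(source, sink): tags} dict."""
    return [
        ((source, sink), tags)
        for (source, sink), tags in edges.items()
        if (outgoing and source == node_name) or (incoming and sink == node_name)
    ]


edges = {
    ("1", "2"): {"overlap": "0M"},
    ("2", "3"): {},
    ("1", "3"): {},
}
all_edges = get_edges_sketch(edges, "2")
in_edges = get_edges_sketch(edges, "2", outgoing=False)
out_edges = get_edges_sketch(edges, "2", incoming=False)
```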
- global_offset(reference: str, threads: int = 1) → None
We want to create a global offset (GO) for each node, which consists of the positions the sequences would have if they were represented as a left-normalized multiple alignment, with gaps. Positions are stored in the segments, with the "GO" tag. Warning: if the reference has loops, positions are going to be ambiguous. Moreover, in this first version, only one coordinate per node is assigned, meaning loops won't be annotated twice. As of now, this function is NOT RECOMMENDED for production use. This function is RECURSIVE and will FAIL on HUGE GRAPHS.
- Parameters
reference (str) – name of the path we want to use as backbone for our position system
threads (int, optional) – number of threads to use for computation (max parallel deep searches), by default 1
- merge_segments(*segs: str, merge_name: str | None = None) → None
Given a series of nodes, merges them into the first of the series.
- Parameters
merge_name (str | None, optional) – the name to merge to. If not specified, uses the first of the series, by default None
*segs (Series[str]) – a series of nodes to be merged. They must be consecutive and must not disturb other paths.
- reconstruct_sequences() → dict[str, Generator]
Reads the paths (if they exist) that describe genomes in the graph, and aggregates the nodes (by their reading direction) per path.
- Returns
mapping between name of path and generator of every substring in the path
- Return type
dict[str, Generator]
- Raises
RuntimeError – if the graph does not have paths
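The idea of mapping each path name to a generator of oriented substrings can be sketched without the library. Everything here is a stand-in: `reconstruct_sketch`, the `segments` dict (node name to sequence) and the `paths` dict (path name to a chain of (node_name, orientation) tuples, with "+" and "-" as orientations) are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement a sequence made of A, C, G, T and N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def reconstruct_sketch(segments, paths):
    """Map each path name to a generator of its node sequences, as read."""
    if not paths:
        raise RuntimeError("the graph does not have paths")

    def substrings(chain):
        for node_name, orientation in chain:
            seq = segments[node_name]
            # Reverse-oriented nodes are read on the opposite strand.
            yield seq if orientation == "+" else reverse_complement(seq)

    return {name: substrings(chain) for name, chain in paths.items()}


segments = {"1": "AC", "2": "GGT"}
paths = {"w1": [("1", "+"), ("2", "-")]}
genomes = {name: "".join(parts) for name, parts in reconstruct_sketch(segments, paths).items()}
```

Because the values are generators, each one can be consumed only once; joining them, as in the last line, materializes the full sequence.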
- rename_node(old_name: str, new_name: str) → None
Replaces the node name and all its references, be it in paths, node accessions, or edges.
- Parameters
old_name (str) – the current name of the node
new_name (str) – the new name to be given to the node
- save_graph(output_file: str, minimal: bool = False, output_format: bool | Any = False) → None
Given a GFA graph loaded in memory, writes it to disk in a GFA-compatible format.
- Parameters
output_file (str) – path on disk where to output the GFA file
minimal (bool, optional) – if only required tags should be written in the output file, by default False
output_format (bool | Any, optional) – a GFA subformat to write to, by default False
- sequence_offsets(recalculate: bool = False) → None
Calculates the offsets within each path for each node. Here, we aim to extend the current GFA tag format by adding tags that respect the GFA naming convention: a JSON string, PO (Path Offset), holding positions relative to paths. Hence, PO:J:{'w1':[(334,335,'+')],'w2':[(245,247,'-')]} tells that the walk/path w1 contains the sequence starting at position 334 and ending at position 335, that the walk/path w2 contains the sequence starting at offset 245 and ending at 247, and that the two sequences are reversed relative to each other. Note that any walk not referenced in this field does not contain the node.
- Parameters
recalculate (bool, optional) – If the offsets should be re-computed from scratch, by default False
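How per-path offsets of this kind could be accumulated is sketched below, independently of the library. The names `path_offsets_sketch`, `lengths` and `paths` are hypothetical, and the half-open (start, end) convention is an assumption: the documentation does not say whether the end position is inclusive.

```python
import json


def path_offsets_sketch(lengths, paths):
    """Compute, for each node, its (start, end, orientation) in every path.

    `lengths` maps node names to sequence lengths; `paths` maps path names
    to chains of (node_name, orientation) tuples.
    """
    offsets = {name: {} for name in lengths}
    for path_name, chain in paths.items():
        position = 0
        for node_name, orientation in chain:
            end = position + lengths[node_name]
            # A node visited several times by a path gets several entries.
            offsets[node_name].setdefault(path_name, []).append(
                (position, end, orientation)
            )
            position = end
    return offsets


lengths = {"1": 4, "2": 3}
paths = {"w1": [("1", "+"), ("2", "+")], "w2": [("2", "-")]}
offsets = path_offsets_sketch(lengths, paths)
# Serialized in the spirit of the PO:J tag (JSON turns tuples into lists).
po_tag = "PO:J:" + json.dumps(offsets["2"])
```

A path that never visits a node leaves no key for it in that node's mapping, matching the note that unreferenced walks do not contain the node.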
- split_segments(segment_name: str, future_segment_name: list, position_to_split: list) → None
Given a segment to split and one or several new names and positions, breaks the node into multiple nodes and includes them in the Graph data.
If you want to split the segment A into A, B, … you must call self.split_segments(A, [A, B, …], [(start_A, end_A), (start_B, end_B), …])
- Parameters
segment_name (str) – the node to split
future_segment_name (list) – the future names of the nodes. The current name will be used for the first node of the splitting series
position_to_split (list) – a list of breakpoints where to split to.
- Raises
ValueError – the number of specified breakpoints is incompatible with the number of names provided
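The relation between names and (start, end) positions can be shown on a bare sequence. This sketch only slices a string and does not touch edges or paths as the real method must; `split_sequence_sketch` is a hypothetical helper, and treating the positions as half-open Python slices is an assumption.

```python
def split_sequence_sketch(sequence, names, positions):
    """Cut `sequence` into named pieces, one (start, end) pair per name."""
    if len(names) != len(positions):
        raise ValueError(
            "the number of breakpoints is incompatible with the number of names"
        )
    return {name: sequence[start:end] for name, (start, end) in zip(names, positions)}


pieces = split_sequence_sketch("ACGTAC", ["A", "B"], [(0, 2), (2, 6)])
```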
- unfold() → None
[EXPERIMENTAL, WIP] Applies an unfolding on cycles, which allows them to be linearized. WARNING: May solely be used on graphs with paths. WARNING: Not fully tested yet, use at your own discretion. TODO: fix closing edge of cycle not destroyed.
- Raises
NotImplementedError – the graph does not have paths
RuntimeError – the graph was loaded in incorrect mode
- gfagraphs.graph.futures_collector(func: Callable, argslist: list, kwargslist: list[dict] | None = None, num_processes: int = 2) → list
Spawns len(argslist) instances of func and executes them num_processes at a time.
- Parameters
func (Callable) – a function
argslist (list) – a list of tuples, the arguments of each func
kwargslist (list[dict]) – a list of dicts, the kwargs of each func
num_processes (int) – max number of concurrent instances. Default: number of available logical cores
memory (float | None) – ratio of memory to be used, ranging from .05 to .95. Will not work if resource is incompatible.
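The pattern of running one call per argument tuple with a cap on concurrency can be sketched with the standard library. This is not the library's implementation: `collect_sketch` is a hypothetical name, and it deliberately uses a thread pool rather than processes so that the example runs anywhere without pickling concerns.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_sketch(func, argslist, kwargslist=None, max_workers=2):
    """Call func once per entry of argslist, at most max_workers at a time.

    Results are returned in submission order, not completion order.
    """
    if kwargslist is None:
        kwargslist = [{} for _ in argslist]
    if len(kwargslist) != len(argslist):
        raise ValueError("argslist and kwargslist must have the same length")
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(func, *args, **kwargs)
            for args, kwargs in zip(argslist, kwargslist)
        ]
        return [future.result() for future in futures]


powers = collect_sketch(pow, [(2, 3), (3, 2)])
parsed = collect_sketch(int, [("ff",), ("10",)], [{"base": 16}, {"base": 2}])
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` gives process-based parallelism, at the cost of requiring picklable functions and arguments.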
- gfagraphs.graph.revcomp(string: str, compl: dict = {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A'}) → str
Tries to compute the reverse complement of a sequence.
- Parameters
string (str) – original character set
compl (dict, optional) – dict of correspondences. Defaults to {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}.
- Raises
IndexError – happens if revcomp encounters a char that is not in the dict
- Returns
str: the reverse-complemented string
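A reverse complement driven by a correspondence dict can be written in a few lines. The sketch below is a standalone illustration, not the library's code: `revcomp_sketch` is a hypothetical name, and with a plain dict lookup an unknown character raises KeyError here, which differs from the IndexError documented above.

```python
DEFAULT_COMPL = {"A": "T", "C": "G", "G": "C", "N": "N", "T": "A"}


def revcomp_sketch(string, compl=None):
    """Reverse the sequence and complement each base using `compl`."""
    # A None default avoids sharing one mutable dict across calls.
    table = DEFAULT_COMPL if compl is None else compl
    return "".join(table[base] for base in reversed(string))


result = revcomp_sketch("ACGTN")
```

Passing a custom `compl` dict, for example one that also maps lowercase bases, extends the alphabet without changing the function.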