Graph: module to model a GFA graph in memory

Models a graph object.

class gfagraphs.graph.Graph(gfa_file: str | None = None, with_sequence: bool = True, low_memory: bool = False, with_reverse_edges: bool = False, regexp: str = '.*')

Models a GFA graph in memory from a .gfa file.

Returns

an object made of dicts holding information about the data structure

Return type

Graph

add_dovetails() -> None

Adds dovetails on the tips of the graph (at the start and end of each path).

add_edge(source: str, ori_source: str, sink: str, ori_sink: str, **metadata: dict) -> None

Adds an edge to the current graph.

Parameters
  • source (str) – the node from which the edge leaves

  • ori_source (str) – the orientation in which the edge leaves the source node

  • sink (str) – the node to which the edge goes

  • ori_sink (str) – the orientation in which the edge enters the sink node

  • **metadata (Any) – optional, supplementary GFA-compatible tags

Raises

ValueError – if a specified orientation is not compatible with the GFA format
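As an illustration of the orientation check described above, here is a minimal sketch that stores edges in a plain dict. It is not the library's implementation: the name add_edge_sketch, the edges dict, and the assumption that the only valid orientations are '+' and '-' (the two symbols used on GFA link lines) are choices made for this example.

```python
VALID_ORIENTATIONS = {"+", "-"}  # assumption: the two symbols used on GFA link lines


def add_edge_sketch(edges: dict, source: str, ori_source: str,
                    sink: str, ori_sink: str, **metadata) -> None:
    """Store an oriented edge in a plain dict (illustration only)."""
    for ori in (ori_source, ori_sink):
        if ori not in VALID_ORIENTATIONS:
            raise ValueError(f"orientation {ori!r} is not GFA-compatible")
    # Key the edge by its endpoints; keep the orientations next to the optional tags.
    edges[(source, sink)] = {"orientation": (ori_source, ori_sink), **metadata}


edges = {}
add_edge_sketch(edges, "1", "+", "2", "-", overlap="0M")
```

Validating before storing means a rejected edge leaves the dict untouched.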

add_node(name: str, sequence: str, **metadata: dict) -> None

Adds a node to the graph being edited.

Parameters
  • name (str) – a name for the node to be added

  • sequence (str) – a label (substring) associated with the node

  • **metadata (Any) – optional, additional information for the node (must be GFA-compatible)

add_path(identifier: str, name: str, chain: list[tuple[str, gfagraphs.abstractions.Orientation]], start: int = 0, end: int | None = None, origin: str | None = None, **metadata: dict) -> None

Adds a path to the graph being edited. Note that it does not add any missing nodes or edges, as neither the length of the nodes nor the orientation of the edges can be assumed.

Parameters
  • name (str) – name of the path

  • chain (list[tuple[str, Orientation]]) – a series of tuples, each describing (node_name, orientation)

  • start (int, optional) – starting offset for the path, by default 0

  • end (int | None, optional) – ending offset of the path (length - start), by default None

  • origin (str | None, optional) – alternative name, used for W-line formatting, by default None

compute_child_nodes() -> None

For each edge in the graph, annotates the child (outgoing) nodes using the edge information. This function is O(n), with n being the number of edges.

compute_neighbors() -> None

Computes both predecessors and successors. This function is O(n), with n being the number of edges.

compute_orientations() -> None

Computes both predecessors and successors, by their orientations. This function is O(n), with n being the number of edges.

compute_parent_nodes() -> None

For each edge in the graph, annotates the parent (incoming) nodes using the edge information. This function is O(n), with n being the number of edges.
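The linear cost stated for the neighbor computations above can be pictured with a single pass over an edge list. This is a minimal sketch under assumptions: edges are given as (source, sink) pairs, and neighbors_sketch is a hypothetical helper, not a function of the library.

```python
from collections import defaultdict


def neighbors_sketch(edges):
    """Single pass over (source, sink) pairs: O(n) in the number of edges."""
    successors = defaultdict(set)    # child nodes: where edges leaving a node go
    predecessors = defaultdict(set)  # parent nodes: where edges entering a node come from
    for source, sink in edges:
        successors[source].add(sink)
        predecessors[sink].add(source)
    return predecessors, successors


predecessors, successors = neighbors_sketch(
    [("1", "2"), ("1", "3"), ("2", "4"), ("3", "4")]
)
```

Each edge is visited exactly once and triggers two constant-time set insertions, which is where the O(n) bound comes from.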

get_edges(node_name: str) -> list[tuple[tuple[str, str], dict]]

Returns all the edges of a node.

Parameters

node_name (str) – a node in the graph

Returns

for each edge matching the criterion, the source and target, as well as the supplementary tags

Return type

list[tuple[tuple[str, str], dict]]

get_free_node_name() -> str

Asks the generator for the next available node name.

Returns

the next available node name; the generator must have been computed beforehand, and this does not work in low_memory mode

Return type

str

get_in_edges(node_name: str) -> list[tuple[tuple[str, str], dict]]

Returns all the entering edges of a node.

Parameters

node_name (str) – a node in the graph

Returns

for each edge matching the criterion, the source and target, as well as the supplementary tags

Return type

list[tuple[tuple[str, str], dict]]

get_next_unused_node_name() -> str

Returns the next available integer, as a str, to identify a new node to be created, within the min-max range of nodes defined in the graph.

Returns

a possible node name that is not currently used in the graph

Return type

str
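One way to read "next available integer within the min-max range" is as a search for the first gap among the numeric node names. The sketch below follows that reading. The helper name next_unused_name_sketch, the fallback to max + 1 when there is no gap, and the starting value "1" for a graph without numeric names are assumptions of this example, not documented behavior.

```python
def next_unused_name_sketch(node_names):
    """Return the smallest unused integer name, as a string."""
    used = {int(name) for name in node_names if name.isdigit()}
    if not used:
        return "1"  # assumption: start numbering at 1 when no numeric name exists
    for candidate in range(min(used), max(used) + 1):
        if candidate not in used:
            return str(candidate)
    # Assumption: when the range has no gap, fall back to one past the maximum.
    return str(max(used) + 1)


gap_name = next_unused_name_sketch(["1", "2", "4"])
dense_name = next_unused_name_sketch(["1", "2", "3"])
```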

get_out_edges(node_name: str) -> list[tuple[tuple[str, str], dict]]

Returns all the exiting edges of a node.

Parameters

node_name (str) – a node in the graph

Returns

for each edge matching the criterion, the source and target, as well as the supplementary tags

Return type

list[tuple[tuple[str, str], dict]]
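The three edge getters share one return shape, list[tuple[tuple[str, str], dict]]. The sketch below shows how such lists could be produced, assuming edges are held in a dict keyed by (source, sink) and mapped to their tags. That storage layout and the *_sketch helper names are assumptions for illustration.

```python
def out_edges_sketch(edges, node_name):
    """Edges whose source is node_name."""
    return [(key, tags) for key, tags in edges.items() if key[0] == node_name]


def in_edges_sketch(edges, node_name):
    """Edges whose sink is node_name."""
    return [(key, tags) for key, tags in edges.items() if key[1] == node_name]


def all_edges_sketch(edges, node_name):
    """Edges touching node_name on either end."""
    return [(key, tags) for key, tags in edges.items() if node_name in key]


edge_store = {("1", "2"): {}, ("2", "3"): {"overlap": "0M"}, ("3", "1"): {}}
leaving = out_edges_sketch(edge_store, "2")
entering = in_edges_sketch(edge_store, "2")
touching = all_edges_sketch(edge_store, "2")
```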

global_offset(reference: str, threads: int = 1) -> None

Creates a global offset (GO) for each node: the position its sequence would have if the sequences were represented as a left-normalized multiple alignment with gaps. Positions are stored in the segments under the “GO” tag.

Warning: if the reference has loops, positions will be ambiguous. Moreover, in this first version only one coordinate is assigned per node, meaning loops will not be annotated twice. As of now, this function is NOT RECOMMENDED for production use. It is RECURSIVE and will FAIL on HUGE GRAPHS.

Parameters
  • reference (str) – name of the path we want to use as backbone for our position system

  • threads (int, optional) – number of threads to use for computation (max parallel deep searches), by default 1

merge_segments(*segs: str, merge_name: str | None = None) -> None

Given a series of nodes, merges them into the first node of the series.

Parameters
  • merge_name (str | None, optional) – the name to merge to. If not specified, uses the first of the series, by default None

  • *segs (Series[str]) – a series of nodes to be merged. They must be consecutive, and merging them must not disturb other paths.

reconstruct_sequences() -> dict[str, Generator]

Reads the paths (if they exist) that describe genomes in the graph, and aggregates the nodes (in their reading direction) per path.

Returns

mapping between the name of each path and a generator over every substring in the path

Return type

dict[str, Generator]

Raises

RuntimeError – if the graph does not have paths
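The idea of returning one lazy generator per path can be sketched as follows. The inputs (a paths dict of (node_name, orientation) chains and a sequences dict), the '+'/'-' orientation strings, and the helper names are assumptions of this example. The library's internal structures may differ.

```python
# Translation table for complementing bases; reversal is done by slicing.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def walk_sequences(chain, sequences):
    """Lazily yield the substring contributed by each step of a path."""
    for node_name, orientation in chain:
        seq = sequences[node_name]
        # A node read in reverse contributes its reverse complement.
        yield seq if orientation == "+" else seq[::-1].translate(_COMPLEMENT)


def reconstruct_sketch(paths, sequences):
    """Map each path name to a generator over its substrings."""
    if not paths:
        raise RuntimeError("the graph does not have paths")
    return {name: walk_sequences(chain, sequences) for name, chain in paths.items()}


node_sequences = {"1": "AC", "2": "GGT"}
generators = reconstruct_sketch({"w1": [("1", "+"), ("2", "-")]}, node_sequences)
genome = "".join(generators["w1"])
```

Because each value is a generator, a path is only materialized when it is consumed, and it can be consumed only once.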

rename_node(old_name: str, new_name: str) -> None

Replaces the node name and all references to it, whether in paths, node accessions, or edges.

Parameters
  • old_name (str) – the current name of the node

  • new_name (str) – the new name to be given to the node
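Renaming has to touch every place a node name can appear. The sketch below illustrates that on three plain containers: a nodes dict, an edges dict keyed by (source, sink), and a paths dict of chains. These containers and the rename_node_sketch helper are assumptions for this example, not the library's data model.

```python
def rename_node_sketch(nodes, edges, paths, old_name, new_name):
    """Rename a node in place across nodes, edges, and paths."""
    nodes[new_name] = nodes.pop(old_name)
    # Iterate over a snapshot of the keys, since the dict is modified in the loop.
    for source, sink in list(edges):
        if old_name in (source, sink):
            new_key = (new_name if source == old_name else source,
                       new_name if sink == old_name else sink)
            edges[new_key] = edges.pop((source, sink))
    for chain in paths.values():
        chain[:] = [(new_name if name == old_name else name, orientation)
                    for name, orientation in chain]


nodes = {"1": "AC", "2": "G"}
links = {("1", "2"): {}, ("2", "2"): {}}
walks = {"w": [("1", "+"), ("2", "+")]}
rename_node_sketch(nodes, links, walks, "2", "9")
```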

save_graph(output_file: str, minimal: bool = False, output_format: bool | Any = False) -> None

Given a GFA graph loaded in memory, writes it to disk in a GFA-compatible format.

Parameters
  • output_file (str) – path on disk where to output the GFA file

  • minimal (bool, optional) – whether only required tags should be written in the output file, by default False

  • output_format (bool | Any, optional) – a GFA subformat to write to, by default False

sequence_offsets(recalculate: bool = False) -> None

Calculates the offsets within each path for each node.

This extends the current GFA tag format with a tag that respects the GFA naming convention: PO (Path Offset), a JSON string giving positions relative to paths. Hence, PO:J:{'w1':[(334,335,'+')],'w2':[(245,247,'-')]} means that the walk/path w1 contains the sequence starting at position 334 and ending at position 335, that the walk/path w2 contains the sequence starting at offset 245 and ending at offset 247, and that the two occurrences are in reverse orientation relative to each other. Any walk not referenced in this field does not contain the node.

Parameters

recalculate (bool, optional) – whether the offsets should be recomputed from scratch, by default False
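To make the PO tag concrete, the following sketch accumulates per-path offsets by walking each chain and summing node lengths. It assumes 0-based, half-open (start, end) coordinates and plain dict inputs. Neither the coordinate convention nor the helper name path_offsets_sketch is confirmed by the documentation above.

```python
import json


def path_offsets_sketch(paths, sequences):
    """Map each node to {path_name: [(start, end, orientation), ...]}."""
    offsets = {name: {} for name in sequences}
    for path_name, chain in paths.items():
        position = 0  # running offset along the current path
        for node_name, orientation in chain:
            end = position + len(sequences[node_name])
            offsets[node_name].setdefault(path_name, []).append(
                (position, end, orientation)
            )
            position = end
    return offsets


seqs = {"1": "ACGT", "2": "GG"}
chains = {"w1": [("1", "+"), ("2", "+")], "w2": [("2", "-")]}
offsets = path_offsets_sketch(chains, seqs)
po_tag = "PO:J:" + json.dumps(offsets["2"])
```

Note that json.dumps serializes the tuples as JSON arrays, so the resulting string is valid JSON. A node visited several times by the same path gets one entry per visit.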

split_segments(segment_name: str, future_segment_name: list, position_to_split: list) -> None

Given a segment to split and one or more new names and positions, breaks the node into multiple nodes and includes the resulting nodes in the Graph data.

If you want to split the segment A into A, B, … you must provide self.split_segments(A, [A, B, …], [(start_A, end_A), (start_B, end_B), …])

Parameters
  • segment_name (str) – the node to split

  • future_segment_name (list) – the future names of the nodes. The current name will be used for the first node of the splitting series

  • position_to_split (list) – a list of breakpoints at which to split

Raises

ValueError – the number of specified breakpoints is incompatible with the number of names provided
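The relationship between names and breakpoints can be illustrated on a bare sequence. This sketch only cuts the string. It does not update edges or paths, and it assumes each (start, end) pair is a 0-based, half-open Python slice, which the documentation above does not specify. The name split_sequence_sketch is hypothetical.

```python
def split_sequence_sketch(sequence, future_names, positions):
    """Cut a sequence into named pieces, one per (start, end) pair."""
    if len(future_names) != len(positions):
        raise ValueError(
            f"{len(positions)} breakpoints given for {len(future_names)} names"
        )
    return {
        name: sequence[start:end]
        for name, (start, end) in zip(future_names, positions)
    }


pieces = split_sequence_sketch("ACGTAC", ["A", "B"], [(0, 4), (4, 6)])
```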

unfold() -> None

[EXPERIMENTAL, WIP] Unfolds cycles so that they can be linearized.

WARNING: May only be used on graphs with paths.

WARNING: Not fully tested yet; use at your own discretion.

TODO: fix closing edge of cycle not destroyed.

Raises
  • NotImplementedError – the graph does not have paths

  • RuntimeError – the graph was loaded in incorrect mode

gfagraphs.graph.futures_collector(func: Callable, argslist: list, kwargslist: list[dict] | None = None, num_processes: int = 2) -> list

Spawns len(argslist) instances of func and executes them num_processes at a time.

Parameters
  • func (Callable) – a function

  • argslist (list) – a list of tuples, the arguments for each call of func

  • kwargslist (list[dict]) – a list of dicts, the kwargs for each call of func

  • num_processes (int) – maximum number of concurrent instances. Default: number of available logical cores

  • memory (float | None) – ratio of memory to be used, ranging from .05 to .95. Will not work if resource is incompatible.
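The calling convention above (one argument tuple and one kwargs dict per call, with a cap on concurrency) can be sketched with the standard library. This example deliberately uses a thread pool from concurrent.futures to stay self-contained. Whether the library uses processes, threads, or something else is not stated here, and futures_collector_sketch is a hypothetical stand-in that ignores the memory parameter.

```python
from concurrent.futures import ThreadPoolExecutor


def futures_collector_sketch(func, argslist, kwargslist=None, num_processes=2):
    """Run func once per argument tuple, at most num_processes at a time."""
    if kwargslist is None:
        kwargslist = [{} for _ in argslist]
    if len(kwargslist) != len(argslist):
        raise ValueError("kwargslist must match argslist in length")
    # Assumption: threads stand in for the workers of the real function.
    with ThreadPoolExecutor(max_workers=num_processes) as executor:
        futures = [executor.submit(func, *args, **kwargs)
                   for args, kwargs in zip(argslist, kwargslist)]
        # Collect results in submission order.
        return [future.result() for future in futures]


def scaled_sum(a, b, factor=1):
    return (a + b) * factor


results = futures_collector_sketch(scaled_sum, [(1, 2), (3, 4)], [{"factor": 2}, {}])
```

Collecting with future.result() in submission order keeps the output aligned with argslist, regardless of which call finishes first.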

gfagraphs.graph.revcomp(string: str, compl: dict = {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A'}) -> str

Tries to compute the reverse complement of a sequence.

Parameters
  • string (str) – original character set

  • compl (dict, optional) – dict of correspondences, by default {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A'}

Raises

IndexError – if revcomp encounters a character that is not in the dict

Returns

the reverse-complemented string

Return type

str
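A reverse complement with a configurable correspondence dict can be sketched in a few lines. This is an independent illustration named revcomp_sketch, not the library's code. In particular, an unknown character here surfaces as the KeyError of a plain dict lookup, whereas the documented function reports an IndexError.

```python
DEFAULT_COMPL = {"A": "T", "C": "G", "G": "C", "N": "N", "T": "A"}


def revcomp_sketch(string, compl=None):
    """Reverse the string, then map each character through compl."""
    mapping = DEFAULT_COMPL if compl is None else compl
    # A character missing from the mapping raises KeyError in this sketch.
    return "".join(mapping[char] for char in reversed(string))


forward = "ACGTN"
reverse = revcomp_sketch(forward)
lowercase = revcomp_sketch("aac", {"a": "t", "c": "g", "g": "c", "t": "a"})
```

Passing None as the default and resolving it inside the function avoids sharing one mutable dict across calls.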