Scott
Scott is a software able to compute, for any labelled (edge and node) graph, a canonical tree representative of its isomorphism class, that we can derive to a canonical trace (string) or adjacency matrix.
Written and developed by :
- Nicolas BLOYET (See-d, IRISA Expression, LMBA)
- Pierre-François MARTEAU (IRISA Expression)
- Emmanuel FRÉNOD (See-d, LMBA)
Full paper is available on Hal.
Table of contents
Overview
A graph G is a graph, defined by a set of vertices V and a set of edges E, which are pairs of vertices. In some cases, both edges and vertices can carry a label (or color) as additional local information.
We present below the example of the cafeine molecule, as a graph in which vertices represent atoms, labelled by chemical element (C, O, N, H, etc.), and the edges represent covalent bonds, labelled following the bond modality (simple, double, triple, etc.).
Graph Isomorphism
As long as we describe a graph by an enumeration of its elements, there are several possible descriptions of the same structure.
Let G and H be two graphs. They are said to be isomorphics if there exists a bijection between their respective vertices sets which preserves edges. Below the Wikipedia example, where you will find more informations about the problem.
Isomorphism is an equivalence relation, and so partitions the ensemble of graphs into equivalence classes, packing together in a same class all the graphs representing the same structure. How many ?
While determining (or not) if two graphs are isomophics seems trivial for small graphs, it is actually a problem which remains unresolved in polynomial time in the general case (polynomial heuristics do exist for restricted class of graphs). There is still uncertaincy about this problem theoretical complexity, despite some very recent works [1] seem to prove its appartenance to P (Polynomial).
There are a lot of practical applications of this problem [2, 3], as in many graph related problem, we want to consider graphs belonging to the same isomorphism class as equals.
Graph Canonization
The graph canonization is a related problem, consisting in finding for a graph a canonical representant, unique for its isomorphism class. Two graphs are isomorphics if and only if their canonical forms are equal.
This problem is at least as difficult as graph isomorphism, as it answers to it explicitly. Actually, it is very often less efficient to find a canonical representative than testing isomorphism between two graphs, as there are some shortcuts leading to an early decision in the second case (e.g. not the same number of edges/vertices, not the same degrees, etc.).
However, once the canonical representant of a graph is computed, it can be stored and re-used, making this method of resolution suitable in (sub-)graph matchings : given a population of graphs for which we previously computed their canonical form, we can tell if a new graph is already present in the collection without testing the candidate against every known graph, considering it is trivial to compare two canonical representants.
State of Art
Several algorithms already exist, the most used being nauty
[4], bliss
[5], traces
[6] for canonization or conauto
[7], saucy
[8] for isomorphism testing. These algorithms are highly efficient, but unfortunately, among algorithms able to give a canonical form, none of them is able to natively deal with labelled edges otherwise than rewriting the graph in an edge-unlabelled way, increasing the problem size. Moreover, those algorithms are sequential, and do not take advantage of multi-threaded hardware.
Another approach is gspan
, which can handle both edge and vertice labelling, but as it is based on finding a lexicographic minimal description of a graph among the enumeration of them, it is not suitable for whole graphs as long as their size grow, explaining why it is mainly used for (small) subgraphs mining.
We adress through
Scott
the problem of canonizing each edge-labelled, vertice-labelled graph.
Key idea
We propose here an algorithm based on graph rewriting. Scott execution follows three main steps, illustrated below.
- Levelling of vertices, according to an elected root
- Re-writing of cycles without information loss
- Canonical encoding of the tree obtained
The root identity can be obvious is the best case (combination of label, degree, degree of neighboorhood-<1:n>, etc.), but in the worst case where there are several candidates, they are computed, the minimal trace obtained being unique for an isomorphism class.
It can be proved that the following set of three rewritings is sufficient to transform a levelled-graph into a unique tree.
By applying successive re-writings, aiming to avoid all form of cycle without any loss of information, the graph converges to a tree. Those graphs editions are applied following an order derived on the graph itself, ensuring the tree obtained is a canonical representant of the isomorphism class of the graph.
As it is possible to recursively define an order relation on a tree (known property), we can use this canonical tree to obtain some compacts isomorphism-invariants representation of the graph, such as trace (string) or standardized adjacency matrix.
For more details about the algorithm, please refer to the paper or the Python implementation we propose.
Getting started
From source code
Simply clone the repo in a local repertory
# get the code
git clone https://github.com/theplatypus/scott.git
cd ./scott
# (optionnal) install using setuptools
python3 setup.py install
Then you should be able to import scott
package from python :
import scott as st
From Pypi
pip install <todo>
From Docker
To get scott
in an environment with additional dependencies installed (chemical librabries, jupyter notebooks,etc.), a Docker container is available :
# Build the image containing all the stuff for a simple standalone install
docker build -t scott .
# or pull it
docker pull <todo>
# run an interactive shell, where you can import scott in python default interpreter
docker run --rm -it scott
# or run a jupyter notebook including scott
docker run -it -p 8888:8888 scott /bin/bash -c "jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser --allow-root"
For specific uses, you have access to alternative Dockerfiles
in /dockerfiles
, each of them being a tag that you can also pull.
Pypy
An image including the Pypy interpreter, a high-performance alternative to the classic CPython. Use it with ipython
or launch a jupyter server.
docker build -t scott:pypy -f dockerfiles/pypy/Dockerfile .
docker run -it --rm -p 8888:8888 scott:pypy
# > ipython
# > jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser --allow-root
Debian
A debian-based image, if you are not an Anaconda supporter.
docker build -t scott:debian -f dockerfiles/debian/Dockerfile .
docker run -it --rm -p 8888:8888 scott:debian
# > ipython
# > jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser --allow-root
PySpark
An image designed to run inside a Spark cluster.
docker build -t scott:spark -f dockerfiles/pyspark/Dockerfile .
docker run -it -v $(pwd):/home/scott scott:spark pyspark --master local[*]
Usage
Build a Graph
You can describe a Graph
algorithmically.
import scott as st
# describe a graph from scratch
graph = st.structs.graph.Graph()
n1 = st.structs.node.Node("1", "C")
n2 = st.structs.node.Node("2", "O")
n3 = st.structs.node.Node("3", "H")
n4 = st.structs.node.Node("4", "H")
e1 = st.structs.edge.Edge("1", n1, n2, modality=2)
e2 = st.structs.edge.Edge("2", n1, n3)
e3 = st.structs.edge.Edge("3", n1, n4)
graph.add_node(n1)
graph.add_nodes([n2, n3, n4])
graph.add_edge(e1)
graph.add_edge(e2)
graph.add_edge(e3)
print(n1)
print(e1)
print(graph)
Import Graphs
Scott is also able to parse a few graph formats files. Note that a parsing function always returns a list, even if there is one molecule in the file.
# Parse a .sdf file (chemical file standard) :
# from blob
CAFEINE_URL = "https://drive.google.com/uc?id=1lXeFVGS77oK_qL3NESDV_UjJknPyiICx"
file_content = urlopen(CAFEINE_URL).read().decode()
compounds = st.parse.from_sdf(file_content, ignore_hydrogens = False)
cafeine = compounds[0]
print(cafeine)
# from file path - note we ignore hydrogens this time
compounds = st.parse.from_sdf(file_path='./data/molecule/cafeine.sdf', ignore_hydrogens = True)
cafeine_without_H = compounds[0]
print(cafeine_without_H)
# Parse a SMILES string (RDKit required)
smile = st.parse.parse_smiles('CCOCOCC=CCONC')
# we can iterate over graph vertices
for id_node in smile.V :
print("Node #%s : %s" % (str(id_node), str(smile.V[id_node].label)))
# Parse a .dot file
cfi1 = st.parse.from_dot(file_path = './data/isotest/cfi-rigid-t2-dot/cfi-rigid-t2-0016-04-1.dot')[0]
# Parse a .dimacs file
cfi2 = st.parse.from_dimacs(file_path = './data/isotest/cfi-rigid-t2/cfi-rigid-t2-0016-04-1')[0]
Canonical traces
A canonical trace is a string representation of the tree representative of an isomorphism class.
simple = st.parse.from_pubchem_xml(file_path= './data/molecule/simple.xml')[0]
# we get a CGraph object from the function `st.canonize.to_cgraph`...
simple_cgraph = st.canonize.to_cgraph(simple)
# ...that we can print directly or convert to string
simple_canon = str(simple_cgraph)
assert simple_canon == '(H:1, H:1, (H:1, H:1, ((((C:1).#2{$1}:2)C:2)C:2, ((((C:1).#2{$1}:2)C:2)C:1).#2{$2}:1)C:1)C:1, (O:2, (((((C:1).#2{$1}:2)C:2)C:1).#2{$2}:1)O:1)C:1)C'
If you only need a hash function for graphs, you can apply a string hash function (md5
, sha
, etc.) to the trace obtained.
import hashlib
assert hashlib.sha224(simple_canon.encode()).hexdigest() == 'a90f308ea4c2cd8a1003a32507b4769f5ef5f31bb4f2602856200982'
G = st.parse.from_dimacs(file_path = './data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-01-1')[0]
H = st.parse.from_dimacs(file_path = './data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-01-2')[0]
E = st.parse.from_dimacs(file_path = './data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-02-2')[0]
F = st.parse.from_dimacs(file_path = './data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-02-2')[0]
# Can take few seconds
Gc = st.canonize.to_cgraph(G)
Hc = st.canonize.to_cgraph(H)
Ec = st.canonize.to_cgraph(E)
Fc = st.canonize.to_cgraph(F)
# So, following the nomenclature G == H and E == F, ...
assert str(Gc) == str(Hc)
assert str(Ec) == str(Fc)
# ... but G != E/F and H != E/F
assert str(Gc) != str(Ec)
assert str(Hc) != str(Fc)
Canonical Adjacency Matrices
On a graph G
of N
vertices, an adjacency matrix A
is a N*N
array describing the link between two vertices.
If G
is non-directed, then A
is symetric. If edges are not labelled, then A
is binary.
import numpy as np
np.array(g.adjacency_matrix())
# array([
# [0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1],
# [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1],
# [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1],
# [0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
# [1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
# [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0],
# [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0],
# [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0],
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
# [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
# [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0],
# [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
# [1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1],
# [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1],
# [1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
# [1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1],
# [1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
# [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
# [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
# [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]])
We typically use as input in a Graph Neural Network the 3-tuple :
A
, the binary adjacency matrixX
, aV*D1
matrix mapping each vertice with a vector ofD1
elements (color, etc.)E
, aV*V*D2
matrix mapping each edge with a vector ofD2
elements
We can merge A
and E
to get an adjacency tensor of shape N*N*D2
, depending of the GNN implementation used. Note that if edges labels are qualitative, you should always use a one-hot encoding, and so a E
matrix. The only case where a “flat” A
is acceptable is when edges labels are purely quantitative (or if edges are not labelled).
In any case, Scott
can help to get a standardized adjacency matrix, such as isomophic graph will have the exact same adjacency matrices, which can help learning process by bringing the “same elements towards the same neurons”.
# Let g and h be two isomorphic graphs
g = st.parse.from_dot(file_path="./data/isotest/cfi-rigid-t2-dot/cfi-rigid-t2-0020-02-2.dot")[0]
h = st.parse.from_dot(file_path="./data/isotest/cfi-rigid-t2-dot/cfi-rigid-t2-0020-02-1.dot")[0]
# if we compare their adjacency matrix, it is very unlikely to get the two exact same matrices,
# as there is no order on vertices
g.adjacency_matrix() == h.adjacency_matrix()
# False
# but if we induce an order based on the representant tree given by scott,
# there is only one canonical adjacecny matrix
g.adjacency_matrix(canonic = True) == h.adjacency_matrix(canonic = True)
# True
Citation
If you use or fork scott
in further works, please cite the following :
@inproceedings{bloyet2019scott,
title={Scott: A method for representing graphs as rooted trees for graph canonization},
author={Bloyet, Nicolas and Marteau, Pierre-Fran{\c{c}}ois and Frenod, Emmanuel},
booktitle={International Conference on Complex Networks and Their Applications},
pages={578--590},
year={2019},
organization={Springer}
}
You will find the paper edited by Springer
DOI :
10.1007/978-3-030-36687-2_48
The paper is available on HAL as well.
Licence
The Python source code we provide is available on the GitHub repo, under the MIT public licence.
Feel free to improve it.