IEEE Finding Patterns in Three Dimensional Graphs Algorithms and Applications to Scientific Data Mining


IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002 731
Finding Patterns in
Three-Dimensional Graphs: Algorithms and
Applications to Scientific Data Mining
Xiong Wang, Member, IEEE, Jason T.L. Wang, Member, IEEE,
Dennis Shasha, Bruce A. Shapiro, Isidore Rigoutsos, Member, IEEE, and Kaizhong Zhang
AbstractÐThis paper presents a method for finding patterns in 3D graphs. Each node in a graph is an undecomposable or atomic unit
and has a label. Edges are links between the atomic units. Patterns are rigid substructures that may occur in a graph after allowing for
an arbitrary number of whole-structure rotations and translations as well as a small number (specified by the user) of edit operations in
the patterns or in the graph. (When a pattern appears in a graph only after the graph has been modified, we call that appearance
ªapproximate occurrence.º) The edit operations include relabeling a node, deleting a node and inserting a node. The proposed method
is based on the geometric hashing technique, which hashes node-triplets of the graphs into a 3D table and compresses the label-
triplets in the table. To demonstrate the utility of our algorithms, we discuss two applications of them in scientific data mining. First, we
apply the method to locating frequently occurring motifs in two families of proteins pertaining to RNA-directed DNA Polymerase and
Thymidylate Synthase and use the motifs to classify the proteins. Then, we apply the method to clustering chemical compounds
pertaining to aromatic, bicyclicalkanes, and photosynthesis. Experimental results indicate the good performance of our algorithms and
high recall and precision rates for both classification and clustering.
Index TermsÐKDD, classification and clustering, data mining, geometric hashing, structural pattern discovery, biochemistry,
medicine.
ć
1 INTRODUCTION
TRUCTURAL pattern discovery finds many applications in the data mining field, where automated discovery of
Snatural sciences, computer-aided design, and image patterns, classification and clustering rules is one of the
processing [8], [33]. For instance, detecting repeatedly
main tasks. We establish a framework for structural pattern
occurring structures in molecules can help biologists to
discovery in the graphs and apply our approach to
understand functions of the molecules. In these domains,
classifying proteins and clustering compounds. While the
molecules are often represented by 3D graphs. The tertiary
domains chosen here focus on biochemistry, our approach
structures of proteins, for example, are 3D graphs [5], [9],
can be generalized to other applications where graph data
[18]. As another example, chemical compounds are also
occur commonly.
3D graphs [21].
1.1 3D Graphs
In this paper, we study a pattern discovery problem for
Each node of the graphs we are concerned with is an
graph data. Specifically, we propose a geometric hashing
undecomposable or atomic unit and has a 3D coordinate.1
technique to find frequently occurring substructures in a set
Each node has a label, which is not necessarily unique in a
of 3D graphs. Our study is motivated by recent advances in
graph. Node labels are chosen from a domain-dependent
alphabet . In chemical compounds, for example, the
. X. Wang is with the Department of Computer Science, California State
alphabet includes the names of all atoms. A node can be
University, Fullerton, CA 92834. E-mail: wang@ecs.fullerton.edu.
identified by a unique, user-assigned number in the graph.
. J.T.L. Wang is with the Department of Computer and Information Science,
New Jersey Institute of Technology, University Heights, Newark, NJ Edges in the graph are links between the atomic units. In
07102. E-mail: jason@cis.njit.edu.
the paper, we will consider 3D graphs that are connected.
. D. Shasha is with the Courant Institute of Mathematical Sciences, New
For disconnected graphs, we consider their connected
York University, 251 Mercer St., New York, NY 10012.
E-mail: shasha@cs.nyu.edu. components [16].
. B.A. Shapiro is with the Laboratory of Experimental and Computational
A graph can be divided into one or more rigid
Biology, Division of Basic Sciences, National Cancer Institutes, Frederick,
substructures. A rigid substructure is a subgraph in which
MD 21702. E-mail: bshapiro@ncifcrf.gov.
the relative positions of the nodes in the substructure are
. I. Rigoutsos is with the IBM T.J. Watson Research Center, Yorktown
Heights, NY 10598. E-mail: rigoutso@us.ibm.com.
fixed, under some set of conditions of interest. Note that the
. K. Zhang is with the Department of Computer Science, The University of
rigid substructure as a whole can be rotated (we refer to this
Western Ontario, London, Ontario, Canada N6A 5B7.
as a ªwhole-structureº rotation or simply a rotation when
E-mail: kzhang@csd.uwo.ca.
the context is clear). Thus, the relative position of a node in
Manuscript received 19 May 1999; revised 18 Oct. 2000; accepted 18 Jan.
2001; posted to Digital Library 7 Sept. 2001.
For information on obtaining reprints of this article, please send e-mail to: 1. More precisely, the 3D coordinate indicates the location of the center of
tkde@computer.org, and reference IEEECS Log Number 109849. the atomic unit.
1041-4347/02/$17.00 ß 2002 IEEE
732 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
TABLE 1
Identifiers, Labels and Global Coordinates of the
Nodes of the Graph in Fig. 1
Fig. 1. A graph G.
the substructure and a node outside the substructure can be
changed under the rotation. The precise definition of a
ªsubstructureº is application dependent. For example, in
chemical compounds, a ring is often a rigid substructure.
Example 1. To illustrate rigid substructures in a graph,
1.2 Patterns in 3D Graphs
consider the graph G in Fig. 1. Each node is associated
We consider a pattern to be a rigid substructure that may
with a unique number, with its label being enclosed in
occur in a graph after allowing for an arbitrary number of
parentheses. Table 1 shows the 3D coordinates of the
rotations and translations as well as a small number
nodes in the graph with respect to the Global Coordinate
(specified by the user) of edit operations in the pattern or
Frame. We divide the graph into two rigid substructures:
in the graph. We allow three types of edit operations:
Str0 and Str1. Str0 consists of nodes numbered 0, 1, 2, 3,
relabeling a node, deleting a node and inserting a node.
4, and 5 as, well as, edges connecting the nodes (Fig. 2a).
Relabeling a node v means to change the label of v to any
Str1 consists of nodes numbered 6, 7, 8, 9, and 10 as, well
valid label that differs from its original label. Deleting a
as, edges connecting them (Fig. 2b). Edges in the rigid
node v from a graph means to remove the corresponding
substructures are represented by boldface links. The
atomic unit from the 3D Euclidean space and make the
edge f5; 6g connecting the two rigid substructures is
edges touching v connect with one of its neighbors v0.
represented by a lightface link, meaning that the two
Inserting a node v into a graph means to add the
substructures are rotatable with respect to each other
corresponding atomic unit to the 3D Euclidean space and
around the edge.2 Note that a rigid substructure is not
make a node v0 and a subset of its neighbors become the
necessarily complete. For example, in Fig. 2a, there is no
neighbors of v.3
edge connecting the node numbered 1 and the node
We say graph G matches graph G0 with n mutations if by
numbered 3.
applying an arbitrary number of rotations and translations
We attach a local coordinate frame SF0 (SF1, respec-
as well as n node insert, delete or relabeling operations, 1) G
tively) to substructure Str0 (Str1, respectively). For instance,
and G0 have the same size,4 2) the nodes in G geometrically
let us focus on the substructure Str0 in Fig. 2a. We attach a
match those in G0, i.e., they have coinciding 3D coordinates,
local coordinate frame to Str0 whose origin is the node and 3) for each pair of geometrically matching nodes, they
numbered 0. This local coordinate frame is represented by have the same label. A substructure P approximately occurs
in a graph G (or G approximately contains P) within
three basis points Pb1, Pb2, and Pb3, with coordinates
n mutations if P matches some subgraph of G with
Pb1x0; y0; z0Ä…, Pb2x0 ‡ 1; y0; z0Ä…, and Pb3x0; y0 ‡ 1; z0Ä…, re-
n mutations or fewer where n is chosen by the user.
spectively. The origin is Pb1 and the three basis vectors are
~ ~ ~ ~ ~
Vb1;b2, Vb1;b3, and Vb1;b2 Vb1;b3. Here, Vb1;b2 represents the
Example 2. To illustrate patterns in graphs, consider the set
~
vector starting at point Pb1 and ending at point Pb2. Vb1;b2
S of three graphs in Fig. 3a. Suppose only exactly
~
Vb1;b3 stands for the cross product of the two corresponding
coinciding substructures (without mutations) occurring
vectors. We refer to this coordinate frame as Substructure
3. Here, exactly which subset of the neighbors is chosen is unimportant,
Frame 0, or SF0. Note that the basis vectors of SF0 are
as the proposed geometric hashing technique hashes nodes only, ignoring
orthonormal. That is, the length of each vector is 1 and the
edges among the nodes. Notice that when a node v is inserted or deleted,
the nodes surrounding v do not move, i.e., their coordinates remain the
angle between any two basis vectors has 90 degrees. Also
same. Note also that we do not allow multiple edit operations to be applied
note that, for any node numbered i in the substructure Str0 to the same node. Thus, for example, inserting a node with label m followed
by relabeling it to n is considered as inserting a node with label n. The three
with global coordinate Pixi; yi; ziÄ…, we can find a local
edit operations are extensions of the edit operations on sequences; they arise
coordinate of the node i with respect to SF0, denoted Pi0,
naturally in graph editing [10] and molecule evolution [25]. As shown in
where Section 4, based on these edit operations, our algorithm finds useful
patterns that can be used to classify and cluster 3D molecules effectively.
4. The size of a graph is defined to be the number of nodes in the graph,
~
Pi0 ˆ Vb1;i ˆxi x0; yi y0; zi z0Ä…: 1Ä…
since our geometric hashing technique considers nodes only. Furthermore,
in our target applications, e.g., chemistry, nodes are atomic units and
2. Such an edge is referred to as a rotatable edge. If two rigid determine the size of a compound. Edges are links between the nodes and
substructures cannot be rotated with respect to each other around the edge have a different meaning from the atomic units; as a consequence, we
connecting them, that edge is a nonrotatable edge. exclude the edges from the size definition.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 733
quality of the patterns by using them to classify the
compounds. Here, we extend the work in [35] by
1. considering more general edit operations including
node insert, delete, and relabeling,
2. presenting the theoretical foundation and evaluating
the performance and efficiency of our pattern-
finding algorithm,
3. applying the discovered patterns to classifying
3D proteins, which are much larger and more
complicated in topology than chemical compounds,
and
Fig. 2. The rigid substructures of the graph in Fig. 1.
4. presenting a technique to cluster 3D graphs based on
the patterns occurring in them.
in at least two graphs and having size greater than three
Specifically, we conducted two experiments. In the first
are considered as ªpatterns.º Then, S contains one
experiment, we applied the proposed method to locating
pattern shown in Fig. 3b. If substructures having size
frequently occurring motifs (substructures) in two families
greater than four and approximately occurring in all the
of proteins pertaining to RNA-directed DNA Polymerase
three graphs within one mutation (i.e., one node delete,
and Thymidylate Synthase, and used the motifs to classify
insert, or relabeling is allowed in matching a substruc- the proteins. Experimental results showed that our method
achieved a 96.4 percent precision rate. In the second
ture with a graph) are considered as ªpatterns,º then S
experiment, we applied our pattern-finding algorithm to
contains one pattern shown in Fig. 3c.
discovering frequently occurring patterns in chemical
Our strategy to find the patterns in a set of 3D graphs is
compounds chosen from the Merck Index pertaining to
to decompose the graphs into rigid substructures and, then,
aromatic, bicyclicalkanes and photosynthesis. We then used
use geometric hashing [14] to organize the substructures
the patterns to cluster the compounds. Experimental results
and, then, to find the frequently occurring ones. In [35], we
showed that our method achieved 99 percent recall and
applied the approach to the discovery of patterns in
precision rates.
chemical compounds under a restricted set of edit opera- The rest of the paper is organized as follows: Section 2
tions including node insert and node delete, and tested the presents the theoretical framework of our approach and
Fig. 3. (a) The set S of three graphs, (b) the pattern exactly occurring in two graphs in S, and (c) the pattern approximately occurring, within one
mutation, in all the three graphs.
734 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
4
describes the pattern-finding algorithm in detail. Section 3
5 ˆ 20 node-triplets:
3
evaluates the performance and efficiency of the pattern-
finding algorithm. Section 4 describes the applications of
There are several alternative ways to decompose
our approach to classifying proteins and clustering com- 3D graphs into rigid substructures, depending on the
pounds. Section 5 discusses related work. Section 6 application at hand and the nature of the graphs. For the
purposes of exposition, we describe our pattern-finding
concludes the paper.
algorithm based on a partitioning strategy. Our approach
assumes a notion of atomic unit which is the lowest level of
2 PATTERN-FINDING ALGORITHM
description in the case of interest. Intuitively, atomic units
are fundamental building elements, e.g., atoms in a
2.1 Terminology
molecule. Edges arise as bonds between atomic units. We
Let S be a set of 3D graphs. The occurrence number of a
break a graph into maximal size rigid substructures (recall
pattern P is the number of graphs in S that approxi-
that a rigid substructure is a subgraph in which the relative
mately contain P within the allowed number of muta-
positions of nodes in the substructure are fixed). To
tions. Formally, the occurrence number of a pattern P
accomplish this, we use an approach similar to [16] that
with respect to mutation d and set S, denoted
employs a depth-first search algorithm, referred to as DFB,
occur nod P Ä…, is k if there are k graphs in S that contain
S to find blocks in a graph.5 The DFB works by traversing the
P within d mutations. For example, consider Fig. 3 again.
graph in a depth-first order and collecting nodes belonging
Let S contain the three graphs in Fig. 3a. Then,
to a block during the traversal, as illustrated in the example
occur no0 P1Ä… = 2; occur no1 P2Ä… = 3. below. Each block is a rigid substructure. We merge two
S S
Given a set S of 3D graphs, our algorithm finds all the rigid substructures B1 and B2 if they are not rotatable with
respect to each other, that is, the relative position of a node
patterns P where P approximately occurs in at least
n1 2 B1 and a node n2 2 B2 is fixed. The algorithm
Occur graphs in S within the allowed number of mutations
maintains a stack, denoted ST K, which keeps the rigid
Mut and jP j Size, where jP j represents the size, i.e., the
substructures being merged. Fig. 4 shows the algorithm,
number of nodes, of the pattern P. (Mut, Occur, and Size
which outputs a set of rigid substructures of a graph G. We
are user-specified parameters.) One can use the patterns in
then throw away the substructures P where jP j < Size. The
several ways. For example, biologists or chemists may
remaining substructures constitute the candidate patterns
evaluate whether the patterns are significant; computer
generated from G. This pattern-generation algorithm runs
scientists may use the patterns to classify or cluster
in time linearly proportional to the number of edges in G.
molecules as demonstrated in Section 4.
Example 3. We use the graph in Fig. 5 to illustrate how the
Our algorithm proceeds in two phases to search for the
Find_Rigid_Substructures algorithm in Fig. 4 works.
patterns: 1) find candidate patterns from the graphs in S;
Rotatable edges in the graph are represented by lightface
and 2) calculate the occurrence numbers of the candidate
links; nonrotatable edges are represented by boldface
patterns to determine which of them satisfy the user-
links. Initially, the stack ST K is empty. We invoke DFB
specified requirements. We describe each phase in turn
to locate the first block (Step 4). DFB begins by visiting
below.
the node numbered 0. Following the depth-first search,
DFB then visits the nodes numbered 1, 2, and 5. Next,
2.2 Phase 1 of the Algorithm
DFB may visit the node numbered 6 or 4. Without loss of
In phase 1 of the algorithm, we decompose the graphs into
generality, assume DFB visits the node numbered 6 and,
rigid substructures. Dividing a graph into substructures is
then, the nodes numbered 10, 11, 13, 14, 15, 16, 18, and
necessary for two reasons. First, in dealing with some
17, in that order. Then, DFB visits the node numbered 14,
molecules such as chemical compounds in which there may
and realizes that this node has been visited before. Thus,
exist two substructures that are rotatable with respect to
DFB goes back to the node numbered 17, 18, etc., until it
each other, any graph containing the two substructures is
returns to the node numbered 14. At this point, DFB
not rigid. As a result, we decompose the graph into identifies the first block, B1, which includes nodes
numbered 14, 15, 16, 17, and 18. Since the stack ST K is
substructures having no rotatable components and consider
empty now, we push B1 into ST K (Step 7).
the substructures separately. Second, our algorithm hashes
In iteration 2, we call DFB again to find the next block
node-triplets within rigid substructures into a 3D table.
(Step 4). DFB returns the nonrotatable edge f13; 14g as a
When a graph as a whole is too large, as in the case of
block, denoted B2.6 The block is pushed into ST K
proteins, considering all combinations of three nodes in the
(Step 7). In iteration 3, DFB locates the block B3, which
graph may become prohibitive. Consequently, decompos-
includes nodes numbered 10, 11, 12, and 13 (Step 4).
ing the graph into substructures and hashing node-triplets
Since B3 and the top entry of the stack, B2 are not
of the substructures can increase efficiency. For example,
rotatable with respect to each other, we push B3 into
consider a graph of 20 nodes. There are
5. A block is a maximal subgraph that has no cut-vertices. A cut-vertex of
20 a graph is one whose removal results in dividing the graph into multiple,
ˆ 1140 node-triplets:
disjointed subgraphs [16].
3
6. In practice, in graph representations for molecules, one uses different
notation to distinguish nonrotatable edges from rotatable edges. For
On the other hand, if we decompose the graph into five
example, in chemical compounds, a double bond is nonrotatable. The
substructures, each having four nodes, then there are only different notation helps the algorithm to determine the types of edges.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 735
Fig. 4. Algorithm for finding rigid substructures in a graph.
ST K (Step 7). In iteration 4, we continue to call DFB and is rotatable with respect to B9, we pop out f5; 6g to form
get the single edge f6; 10g as the next block, B4 (Step 4).
a rigid substructure (Step 9) and push B9 into ST K
Since f6; 10g is rotatable and the stack ST K is nonempty,
(Step 10). Finally, since there is no block left in the graph,
we pop out all nodes in ST K, merge them and output
we pop out all nodes in B9 to form a rigid substructure
the rigid substructure containing nodes numbered 10, 11,
and terminate (Step 13).
12, 13, 14, 15, 16, 17, 18 (Step 9). We then push B4 into
2.3 Phase 2 of the Algorithm
ST K (Step 10).
Phase 2 of our pattern-finding algorithm consists of two
In iteration 5, DFB visits the node numbered 7 and,
subphases. In subphase A of phase 2, we hash and store the
then, 8 from which DFB goes back to the node numbered
candidate patterns generated from the graphs in phase 1 in
7. It returns the single edge f7; 8g as the next block, B5 a 3D table H. In subphase B, we rehash each candidate
(Step 4). Since f7; 8g is connected to the current top entry
pattern into H and calculate its occurrence number. Notice
of ST K, f6; 10g, via a rotatable edge f6; 7g, we pop out
that in the subphase B, one does not need to store the
f6; 10g, which itself becomes a rigid substructure (Step 9).
candidate patterns in H again.
We then push f7; 8g into ST K (Step 10). In iteration 6,
In processing a rigid substructure (pattern) of a 3D graph,
DFB returns the single edge f7; 9g as the next block, B6 we choose all three-node combinations, referred to as node-
(Step 4). Since B6 and the current top entry of ST K, B5 =
triplets, in the substructure and hash the node-triplets. We
f7; 8g, are not rotatable with respect to each other, we
hash three-node combinations, because to fix a rigid
push B6 into ST K (Step 7). In iteration 7, DFB goes back
substructure in the 3D Euclidean space one needs at least
from the node numbered 7 to the node numbered 6 and
three nodes from the substructure and three nodes are
returns the single edge f6; 7g as the next block, B7 sufficient provided they are not collinear. Notice that the
(Step 4). Since f6; 7g is rotatable, we pop out all nodes in
proper order of choosing the nodes i; j; k, in a triplet is
ST K, merge them and output the resulting rigid
significant and has an impact on the accuracy of our
substructure containing nodes numbered 7, 8, and 9
approach, as we will show later in the paper. We determine
(Step 9). We then push B7 = f6; 7g into ST K (Step 10).
the order of the three nodes by considering the triangle
formed by them. The first node chosen always opposes the
In iteration 8, DFB returns the block B8 = f5; 6g
longest edge of the triangle and the third node chosen
(Step 4). Since f5; 6g and f6; 7g are both rotatable, we pop
opposes the shortest edge. For example, in the triangle in
out f6; 7g to form a rigid substructure (Step 9). We then
Fig. 6, we choose i; j; k, in that order. Thus, the order is
push B8 into ST K (Step 10). In iteration 9, DFB returns
unique if the triangle is not isosceles or equilateral, which
the block B9 containing nodes numbered 0, 1, 2, 3, 4, and
usually holds when the coordinates are floating point
5 (Step 4). Since the current top entry of ST K, B8 = f5; 6g,
Fig. 5. The graph used for illustrating how the Find_Rigid_Substructures
algorithm works. Fig. 6. The triangle formed by the three nodes i; j; k.
736 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
TABLE 2
The Node Labels of the Graph in Fig. 1 and
Their Indices in the Array A
numbers. On the other hand, when the triangle is isosceles,
we hash and store the two node-triplets ‰i; j; kŠ and ‰i; k; jŠ,
assuming that node i opposes the longest edge and edges
fi; jg, fi; kg have the same length. When the triangle is
equilateral, we hash and store the six node-triplets ‰i; j; kŠ,
Fig. 7. Calculation of the coordinates of the basis points Pb , Pb , Pb of
1 2 3
‰i; k; jŠ, ‰j; i; kŠ, ‰j; k; iŠ, ‰k; i; jŠ, and ‰k; j; iŠ.
Substructure Frame 0 (SF0) with respect to the local coordinate frame
The labels of the nodes in a triplet form a label-triplet,
LF‰i; j; kŠ.
which is encoded as follows: Suppose the three nodes
chosen are v1, v2, v3, in that order. We maintain all node
Let
labels in the alphabet in an array A. The code for the
d1 ˆl1 ‡ l2Ä… mod P rime1 mod Nrow
labels is an unsigned long integer, defined as
d2 ˆl2 ‡ l3Ä… mod P rime2 mod Nrow :
L1 P rime ‡ L2Ä… P rimeÄ… ‡L3;
d3 ˆl3 ‡ l1Ä… mod P rime3 mod Nrow
where P rime > j j is a prime number, L1, L2, and L3 are
Prime1, P rime2, and P rime3 are three prime numbers and
the indices for the node labels of v1, v2, and v3, respectively,
in the array A. Thus, the code of a label-triplet is unique.
Nrow is the cardinality of the hash table in each dimension.
This simple encoding scheme reduces three label compar-
We use three different prime numbers in the hope that the
isons into one integer comparison.
distribution of the hash function values is not skewed even
Example 4. To illustrate how we encode label-triplets,
if pairs of l1, l2, l3 are correlated. The node-triplet ‰i; j; kŠ is
consider again the graph G in Fig. 1. Suppose the node
hashed to the 3D bin with the address h‰d1Š‰d2Š‰d3Š.
labels are stored in the array A, as shown in Table 2.
Intuitively, we use the squares of the lengths of the three
Suppose P rime is 1,009. Then, for example, for the three
nodes numbered 2, 0, and 1 in Fig. 1, the code for the edges connecting the three chosen nodes to determine the
corresponding label-triplet is 2 1; 009 ‡ 0Ä… 1; 009Ä… ‡
hash bin address. Stored in that bin are the graph
1 ˆ 2; 036; 163:
identification number, the substructure identification num-
ber, and the label-triplet code. In addition, we store the
2.3.1 Subphase A of Phase 2 coordinates of the basis points Pb1, Pb2, Pb3 of Substructure
In this subphase, we hash the candidate patterns generated Frame 0 (SF0) with respect to the three chosen nodes.
in phase 1 of the pattern-finding algorithm into a 3D table.
Specifically, suppose the chosen nodes i, j, k are not
For the purposes of exposition, consider the example
collinear. We can construct another local coordinate frame,
substructure Str0 in Fig. 2a, which is assumed to be a
~ ~ ~ ~
denoted LF ‰i; j; kŠ, using Vi;j, Vi;k and Vi;j Vi;k as basis
candidate pattern. We choose any three nodes in Str0 and
vectors. The coordinates of Pb1, Pb2, Pb3 with respect to the
calculate their 3D hash function values as follows: Suppose
local coordinate frame LF ‰i; j; kŠ, denoted SF0‰i; j; kŠ, form a
the chosen nodes are numbered i, j, k, and have global
coordinates Pixi; yi; ziÄ…, Pjxj; yj; zjÄ…, and Pkxk; yk; zkÄ…,
3 3 matrix, which is calculated as follows (see Fig. 7):
respectively. Let l1, l2, l3 be three integers where
0 1
~
Vi;b1
B C
l1 ˆxi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä… Scale ~
SF0‰i; j; kŠ ˆ @ Vi;b2 A
A 1; 3Ä…
~
l2 ˆxi xkÄ…2 ‡yi ykÄ…2 ‡zi zkÄ…2Ä… Scale : 2Ä… Vi;b3
l3 ˆxk xjÄ…2 ‡yk yjÄ…2 ‡zk zjÄ…2Ä… Scale
where
0 1
Here, Scale ˆ 10p is a multiplier. Intuitively, we round to the
~
Vi;j
B C:
nearest pth position following the decimal point (here p is the
~
A ˆ @ Vi;k A
4Ä…
last accurate position) and, then, multiply the numbers by 10p.
~ ~
Vi;j Vi;k
The reason for using the multiplier is that we want some
Thus, suppose the graph in Fig. 1 has identification number
digits following the decimal point to contribute to the
12. The hash bin entry for the three chosen nodes i, j, k is
distribution of the hash function values. We ignore the digits
12; 0; Lcode; SF0‰i; j; kŠÄ…, where Lcode is the label-triplet
after the position p because they are inaccurate. (The
appropriate value for the multiplier can be calculated once code. Since there are 6 nodes in the substructure Str0, we
data are given, as we will show in Section 3.) have
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 737
6
Recall that we choose the three nodes i, j, k based on the
ˆ 20 node-triplets
3
triangle formed by themÐthe first node chosen always
opposes the longest edge of the triangle and the third node
generated from the substructure and, therefore, 20 entries in
chosen opposes the shortest edge. Without loss of generality,
the hash table for the substructure.7
let us assume that the nodes i; j; k are chosen in that order.
Example 5. To illustrate how we hash and store patterns in
~ ~
Thus, Vi;j has the shortest length, Vi;k is the second shortest
the hash table, let us consider Table 1 again. The basis
~ ~
and Vj;k is the longest. We use node i as the origin, Vi;j as the X-
points of SF0 of Fig. 2a have the following global
~
axis and Vi;k as the Y-axis. Then, construct the local coordinate
coordinates:
~ ~ ~ ~
frame LF ‰i; j; kŠ using Vi;j, Vi;k, and Vi;j Vi;k as basis vectors.
~
Thus, we exclude the longest vector Vj;k when constructing
Pb11:0178; 1:0048; 2:5101Ä…;
LF‰i; j; kŠ. Here is why.
Pb22:0178; 1:0048; 2:5101Ä…;
The coordinates x; y; zÄ… of each node in a 3D graph have
Pb31:0178; 2:0048; 2:5101Ä…:
an error due to rounding. Thus, the real coordinates for the
The local coordinates, with respect to SF0, of the node should be x; y; zÄ…, where x ˆ x ‡ , y ˆ y ‡ , z ˆ
1 2
nodes numbered 0, 1, 2, 3, and 4 in substructure Str0 of
z ‡ for three small decimal fractions ; ; . After
3 1 2 3
Fig. 2a are as follows:
constructing LF ‰i; j; kŠ and when calculating SF0‰i; j; kŠ,
one may add or multiply the coordinates of the 3D vectors.
0
P00:0000; 0:0000; 0:0000Ä…;
We define the accumulative error induced by a calculation C,
0
P10:1843; 1:0362; 0:5081Ä…;
denoted CÄ…, as
0
P20:3782; 1:9816; 0:5095Ä…;
CÄ… ˆjf fj;
0
P3 0:3052; 1:0442; 0:6820Ä…;
0
where f is the result obtained from C with the real
P4 0:2568; 1:7077; 0:5023Ä…:
coordinates and f is the result obtained from C with
Now, suppose Scale, P rime1, P rime2, P rime3 are 10,
rounding errors.
1,009, 1,033, and 1,051, respectively, and Nrow is 31.
Recall that in calculating SF0‰i; j; kŠ, the three basis vectors
Thus, for example, for the nodes numbered 1, 2, and 3,
of LF‰i; j; kŠ all appear in the matrix A defined in (4). Let
the hash bin address is h‰25Š‰12Š‰21Š and,
~ ~ ~ ~
jVi;jj ˆjVi;jj ‡ ; jVi;kj ˆjVi;kj ‡ ;
0 1 1 2
1:056731 0:357816 0:173981
~ ~
@ A: and jVj;kj ˆjVj;kj ‡ . Let ˆ maxfj j; j j; j jg. Notice
SF0‰1; 2; 3Š ˆ 0:875868 0:071949 0:907227 5Ä… 3 1 2 3
0:035973 0:417517 0:024041
~ ~ ~ ~
jVi;j Vi;kj ˆjVi;jjjVi;kj sin ;
As another example, for the nodes numbered 1, 4, and 2,
~ ~
where jVi;jj is the length of Vi;j, and is the angle between
the hash bin address is h‰24Š‰0Š‰9Š and
~ ~
0 1
Vi;j and Vi;k. Thus,
0:369457 1:308266 0:242990
@ A:
SF0‰1; 4; 2Š ˆ 0:043584 0:857105 1:006793 6Ä…
~ ~
jVi;j Vi;kjÄ…
0:455285 0:343703 0:086982
~ ~ ~ ~
ˆjjVi;jjjVi;kj sin jVi;jjjVi;kj sin j
Similarly, for the substructure Str1, we attach a local
~ ~ ~ ~
ˆ jjVi;jj ‡ Ä…jVi;kj ‡ Ä… sin jVi;jjjVi;kj sin j
1 2
coordinate frame SF1 to the node numbered 6, as shown
~ ~ :
ˆ jjVi;jj ‡jVi;kj ‡ Ä…jj sin j
in Fig. 2b. There are 10 hash table entries for Str1, each
2 1 1 2
having the form 12; 1; Lcode; SF1‰l; m; nŠÄ… where l; m; n
~ ~
jVi;jjj j ‡jVi;kjj j ‡j jj jÄ…j sin j
2 1 1 2
are any three nodes in Str1.
~ ~
jVi;jj ‡jVi;kj ‡ Ä…
ˆ U1
7. One interesting question here is regarding the size limitation of
Likewise,
patterns (rigid substructures) found by our algorithm. Suppose the graph
identification number and the substructure identification number are both
~ ~ ~ ~
jVi;j Vj;kjÄ… jVi;jj ‡jVj;kj ‡ Ä… ˆ U2
short integers of 2 bytes. Lcode is a long integer of 4 bytes. SF0‰i; j; kŠ is a
and
3 3 matrix of floating point numbers, each requiring 4 bytes. Thus, each
node-triplet requires 44 bytes. For a set of M patterns (substructures), each
~ ~ ~ ~
jVi;k Vj;kjÄ… jVi;kj ‡jVj;kj ‡ Ä… ˆ U3:
T
having T nodes on average, the hash table requires M 44 bytes of
3
Among the three upperbounds U1, U2, U3, U1 is the smallest.
disk space. Suppose there are 10,000 patterns, i.e., M ˆ 10; 000. If T ˆ 20,
It's likely that the accumulative error induced by calculating
the hash table requires about 502 Megabytes. If T ˆ 100, the hash table
the length of the cross product of the two corresponding
requires over 71 Gigabytes. This gives an estimate of how large patterns can
~ ~
vectors is also the smallest. Therefore, we choose Vi;j, Vi;k,
be in practice. In our case, the pattern sizes are between 5 and 13, so the size
~
and exclude the longest vector Vj;k in constructing the local
demands are modest. Note that the time and memory space needed also
coordinate frame LF ‰i; j; kŠ, so as to minimize the accumu-
increase dramatically as patterns become large.
lative error induced by calculating SF0‰i; j; kŠ.
738 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
nodes u, v, and w geometrically match the nodes i, j, and k,
2.3.2 Subphase B of Phase 2
respectively.
Let H be the resulting hash table obtained in subphase A of
phase 2 of the pattern-finding algorithm. In subphase B, we Proof. (If) Let A be as defined in (4) and let
calculate the occurrence number of each candidate pattern
0 1
~
Vu;v
P by rehashing the node-triplets of P into H. This way, we
B C:
~
B ˆ @ Vu;w A
8Ä…
are able to match a node-triplet tri of P with a node-triplet
0
~ ~
tri0 of another substructure (candidate pattern) P stored in Vu;v Vu;w
subphase A where tri and tri0 have the same hash bin
Note that, if u, v, and w geometrically match i, j, and k,
address. By counting the node-triplet matches, one can infer
0 respectively, then jBj ˆjAj, where jBj (jAj, respectively)
whether P matches P and, therefore, whether P occurs in
is the determinant of the matrix B (matrix A, respec-
the graph from which P0 is generated.
tively). That is to say, jA 1jjBj ˆ1.
We associate each substructure with several counters,
From (3) and by the definition of the SFP in (7),
which are created and updated as illustrated by the
we have
following example. Suppose the two substructures (pat-
0 1
0 1
terns) of graph G with identification number 12 in Fig. 1
~
Vi;b1 Pu
B C
have already been stored in the hash table H in subphase A.
@
~
SFP ˆ @ Vi;b2 A
A 1 B ‡ Pu A: 9Ä…
Suppose i; j; k, are three nodes in the substructure Str0 of G.
~ Pu
Vi;b3
Thus, for this node-triplet, its entry in the hash table is
12; 0; Lcode; SF0‰i; j; kŠÄ…. Now, in subphase B, consider Thus, the SFP basically transforms Pb1, Pb2, and Pb3 via
another pattern P ; we hash the node-triplets of P using two translations and one rotation, where Pb1, Pb2, and Pb3
the same hash function. Let u; v; w, be three nodes in P that are the basis points of the Substructure Frame 0 (SF0).
~ ~ ~ ~
have the same hash bin address as i; j; k; that is, the node- Since Vb1;b2, Vb1;b3, and Vb1;b2 Vb1;b3 are orthonormal
triplet ‰u; v; wŠ ªmatchesº the node-triplet ‰i; j; kŠ. If the vectors, and translations and rotations do not change
~ ~ ~
nodes u; v; w, geometrically match the nodes i; j; k respec- this property [27], we know that Vc1;c2, Vc1;c3, and Vc1;c2
~
Vc1;c3 are orthonormal vectors.
tively, i.e., they have coinciding 3D coordinates after
(Only if) If u, v, and w do not match i, j and k
rotations and translations, we call the node-triplet match a
geometrically while having the same hash bin address,
true match; otherwise it is a false match. For a true match, let
then there would be distortion in the aforementioned
0 1
0 1
~ ~ ~
~
transformation. Consequently, Vc1;c2, Vc1;c3, and Vc1;c2
Vu;v Pu
B C
~
@
~ Vc1;c3 would no longer be orthonormal vectors. u
t
SFP ˆ SF0‰i; j; kŠ @ Vu;w A
‡ Pu A: 7Ä…
~ ~
Vu;v Vu;w Pu
Theorem 2. If two true node-triplet matches yield the same SFP
This SFP contains the coordinates of the three basis points
and the codes of the corresponding label-triplets are the same,
of the Substructure Frame 0 (SF0) with respect to the global
then the two node-triplet matches are augmentable, i.e., they
coordinate frame in which the pattern P is given. We
can be combined to form a larger substructure match between
compare the SFP with those already associated with the
P and Str0.
substructure Str0 (initially none is associated with Str0). If
Proof. Since three nodes are enough to set the SFP at a fixed
the SFP differs from the existing ones, a new counter is
position and direction, all the other nodes in P will have
created, whose value is initialized to 1, and the new counter
definite coordinates under this SFP . When another node-
is assigned to the SFP . If the SFP is the ªsameº as an
triplet match yielding the same SFP occurs, it means that
existing one with counter value Cnt,8 and the code of the
geometrically there is at least one more node match
label-triplet of nodes i, j, k; equals the code of the label-
between Str0 and P . If the codes of the corresponding
triplet of nodes u, v, w, then Cnt is incremented by one. In
label-triplets are the same, it means that the labels of the
general, a substructure may be associated with several
corresponding nodes are the same. Therefore, the two
different SFP s, each having a counter.
We now present the theory supporting this algorithm. node-triplet matches are augmentable. u
t
Below, Theorem 1 establishes a criterion based on which
Fig. 8 illustrates how two node-triplet matches are
one can detect and eliminate a false match. Below,
augmented. Suppose the node-triplet ‰3; 4; 2Š yields the
Theorem 2 justifies the procedure of incrementing the
SFP shown in the figure. Further, suppose that the node-
counter values.
triplet ‰1; 2; 3Š yields the same SFP , as shown in Fig. 8. Since
Theorem 1. Let Pc1, Pc2, and Pc3 be the three basis points forming
the labels of the corresponding nodes numbered 3 and 2 are
~ ~
the SFP defined in (7), where Pc1 is the origin. Vc1;c2, Vc1;c3,
the same, we can augment the two node-triplets to form a
~ ~
and Vc1;c2 Vc1;c3 are orthonormal vectors if and only if the
larger match containing nodes numbered 1, 2, 3, 4.
0 Thus, by incrementing the counter associated with the
8. By saying SFP is the same as an existing SFP , we mean that for each
entry ei;j; 1 i; j 3, at the ith row and the jth column in SFP and its
SFP , we record how many true node-triplet matches are
0
corresponding entry e0 in SFP , jei;j e0 j , where is an adjustable
i;j i;j
augmentable under this SFP . Notice that in cases where two
parameter depending on the data. In the examples presented in the paper,
ˆ 0:01. node-triplet matches occur due to reflections, the directions
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 739
Fig. 8. Illustration of two augmentable node-triplet matches.
of the corresponding local coordinate systems are different,
Fig. 9. A substructure (pattern) P .
so are the SFP s. As a result, these node-triplet matches are
not augmentable.
5
ˆ 10;
3
Example 6. To illustrate how we update counter values for
augmentable node-triplet matches, let us consider the
since all matching node-triplets have the same SFP as
pattern P in Fig. 9. In P , the nodes numbered 0, 1, 2, 3, 4
in (10) and the labels of the corresponding nodes are
match, after rotation, the nodes numbered 5, 4, 3, 1, 2 in
the same.
the substructure Str0 in Fig. 2a. The node numbered 0 in
Now, consider again the SFP defined in (7) and the three
Str0 does not appear in P (i.e., it is to be deleted). The
basis points Pc1, Pc2, Pc3 forming the SFP , where Pc1 is the
labels of the corresponding nodes are identical. Thus,
P matches Str0 with 1 mutation, i.e., one node is deleted. origin. We note that for any node i in the pattern P with
Now, suppose in P , the global coordinates of the
global coordinate Pixi; yi; ziÄ…, it has a local coordinate with
nodes numbered 1, 2, 3, and 4 are
respect to the SFP , denoted Pi0, where
P1 0:269000; 4:153153; 2:911494Ä…;
~
Pi0 ˆ Vc1;i E 1: 12Ä…
P2 0:317400; 4:749386; 3:253592Ä…;
Here, E is the base matrix for the SFP , defined as
P30:172100; 3:913515; 4:100777Ä…;
0 1
~
P40:366000; 3:244026; 3:433268Ä…:
Vc1;c2
B C
~
E ˆ @ Vc1;c3 A
13Ä…
Refer to Example 5. For the nodes numbered 3, 4, and
~ ~
Vc1;c2 Vc1;c3
2 of P , the hash bin address is h‰25Š‰12Š‰21Š, which is the
same as that of nodes numbered 1, 2, 3 of Str0, and
~
andVc1;i is the vector starting at Pc1 and ending at Pi.
0 1
~ ~ ~ ~
Remark. If Vc1;c2, Vc1;c3, and Vc1;c2 Vc1;c3 are orthonormal
0:012200 5:005500 4:474200
@ A: 10Ä… vectors, then jEj ˆ1. Thus, a practically useful criterion
SFP ˆ 0:987800 5:005500 4:474200
0:012200 4:298393 3:767093
for detecting false matches is to check whether or not
~ ~ ~ ~
jEj ˆ1. If jEj ˆ1, then Vc1;c2, Vc1;c3, and Vc1;c2 Vc1;c3 are
6
The three basis vectors forming this SFP are
not orthonormal vectors and, therefore, the nodes u, v,
~
Vc1;c2 ˆ1:000000; 0:000000; 0:000000Ä…;
and w do not match the nodes i, j and k geometrically
~ (see Theorem 1).
Vc1;c3 ˆ0:000000; 0:707107; 0:707107Ä…;
Intuitively, our scheme is to hash node-triplets and
~ ~
Vc1;c2 Vc1;c3 ˆ0:000000; 0:707107; 0:707107Ä…;
match the triplets. Only if one triplet tri matches another
tri0 do we see how the substructure containing tri matches
which are orthonormal.
the pattern containing tri0. Using Theorem 1, we detect and
For the nodes numbered 3, 1, and 4 of P , the hash bin
eliminate false node-triplet matches. Using Theorem 2, we
address is h‰24Š‰0Š‰9Š, which is the same as that of nodes
record in a counter the number of augmentable true node-
numbered 1, 4, 2 of Str0, and
triplet matches. The following theorem says that the
0 1
0:012200 5:005500 4:474200 counter value needs to be large (i.e., there are a sufficient
@ A: 11Ä… number of augmentable true node-triplet matches) in
SFP ˆ 0:987800 5:005500 4:474200
0:012200 4:298393 3:767093 order to infer that there is a match between the
corresponding pattern and substructure. The larger the
These two true node-triplet matches have the same
Mut, the fewer node-triplet matches are needed.
SFP and, therefore, the corresponding counter associated
Theorem 3. Let Str be a substructure in the hash table H and let
with the substructure Str0 of the graph 12 in Fig. 1 is
G be the graph from which Str is generated. Let P be a pattern
updated to 2. After hashing all node-triplets of P , the
counter value will be where jP j Mut ‡ 3. After rehashing the node-triplets of P ,
740 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
suppose there is an SFP associated with Str whose counter qualified patterns. In Section 3, we will show experimen-
value Cnt > , where tally that the missed patterns are few compared with those
P
found by exhaustive search.
N 1Ä… N 2Ä… N 3Ä…
N 1
ˆ ˆ 14Ä…
P
Theorem 4. Let the set S contain K graphs, each having at most
3
6
N nodes. The time complexity of the proposed pattern-finding
and N ˆjP j Mut. Then, P matches Str within Mut
algorithm is OKN3Ä….
mutations (i.e., P approximately occurs in G, or G
Proof. For each graph in S, phase 1 of the algorithm
approximately contains P , within Mut mutations).
requires ON2Ä… time to decompose the graph into
Proof. By Theorem 2, we increase the counter value only
substructures. Thus, the time needed for phase 1 is
when there are true node-triplet matches that are
OKN2Ä…. In subphase A of phase 2, we hash each
augmentable under the SFP . Thus, the counter value
candidate pattern P by considering the combinations of
shows currently how many node-triplets from the
any three nodes in P , which requires time
pattern P are found to match with the node-triplets
from the substructure Str in the hash table. Note that P
jP j
ˆ OjP j3Ä…:
represents the number of distinct node-triplets that can 3
be generated from a substructure of size N 1. If there
Thus, the time needed to hash all candidate patterns is
are N 1 node matches between P and Str, then
OKN3Ä…. In subphase B of phase 2, we rehash each
Cnt .9 Therefore, when Cnt > , there are at least
P P
candidate pattern, which requires the same time
N node matches between Str and P. This completes the
OKN3Ä…. u
t
proof. u
t
Notice that our algorithm cannot handle patterns with
size smaller than 3, because the basic processing unit here is
3 PERFORMANCE EVALUATION
a node-triplet of size 3. This is reflected in the condition
We carried out a series of experiments to evaluate the
jPj Mut ‡ 3 stated in Theorem 3. Since Mut can only be
performance and the speed of our approach. The programs
zero or greater than zero, this condition implies that the size
were written in the C programming language and run on a
of a pattern must be greater than or equal to 3.
SunSPARC 20 workstation under the Solaris operating
Example 7. To illustrate how Theorem 3 works, let us refer
system version 2.4. Parameters used in the experiments can
to Example 6. Suppose the user-specified mutation
be classified into two categories: those related to data and
number Mut is 1. The candidate pattern P in Fig. 9 has
those related to the pattern-finding algorithm. In the first
size jP j ˆ5. After rehashing the node-triplets of P , there
is only one counter associated with the substructure Str0 category, we considered the size (in number of nodes) of a
graph and the total number of graphs in a data set. In the
in Fig. 2a; this counter corresponds to the SFP in (10) and
second category, we considered all the parameters de-
the value of the counter, Cnt, is 10. Thus, Cnt is greater
scribed in Section 2, which are summarized in Table 3
than ˆ5 2Ä…5 3Ä…5 4Ä…=6 ˆ 1. By Theorem 3, P
P
together with the base values used in the experiments.
should match the substructure Str0 within 1 mutation.
Two files were maintained: One recording the hash bin
This means that there are at least four node matches
addresses and the other containing the entries stored in the
between P and Str0.
hash bins. To evaluate the performance of the pattern-
finding algorithm, we applied the algorithm to two sets of
Thus, after rehashing the node-triplets of each candidate
data: 1,000 synthetic graphs and 226 chemical compounds
pattern P into the 3D table H, we check the values of the
obtained from a drug database maintained in the National
counters associated with the substructures in H. By
Cancer Institute. When generating the artificial graphs, we
Theorem 3, P approximately occurs in a graph G within
randomly generated the 3D coordinates for each node. The
Mut mutations if G contains a substructure Str and there is
at least one counter associated with Str whose value node labels were drawn randomly from the range A to E.
Cnt > . If there are less than Occur graphs in which P The size of the rigid substructures in an artificial graph
P
approximately occurs within Mut mutations, then we
ranged from 4 to 10 and the size of the graphs ranged from
discard P . The remaining candidates are qualified patterns.
10 to 50. The size of the compounds ranged from 5 to 51.
Notice that Theorem 3 provides only the ªsufficientº (but
In this section, we present experimental results to answer
not the ªnecessaryº) condition for finding the qualified
questions concerning the performance of the pattern-
patterns. Due to the accumulative errors arising in the
finding algorithm. For example, are all approximate
calculations, some node-triplets may be hashed to a wrong
patterns found, i.e., is the recall high? Are any uninteresting
bin. As a result, the pattern-finding algorithm may miss
patterns found, i.e., is the precision high? In Section 4, we
some node-triplet matches and, therefore, miss some
study the applications of the algorithm and intend to
answer questions such as whether graphs having some
9. Notice that we cannot use Cnt ˆ here. The reason is that we
P
common phenomenological activity (e.g., they are proteins
augment node-triplet matches only when they yield the same SFP . In
practice, it's likely that two node-triplet matches could be augmented, but
with the same function) share structural patterns in
yield different SFP s due to the errors accumulated during the calculation of
common and whether these patterns can characterize the
the SFP s. As a consequence, our algorithm fails to detect those node-triplet
matches and update the counter values appropriately. graphs as a whole.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 741
TABLE 3
Parameters in the Pattern-Finding Algorithm and Their Base Values Used in the Experiments
UP F ound
3.1 Effect of Data-Related Parameters
PR ˆ 100%;
P F ound
To evaluate the performance of the proposed pattern-
where P F ound is the number of patterns found by the
finding algorithm, we compared it with exhaustive search.
proposed algorithm, UP F ound is the number of patterns
The exhaustive search procedure works by generating all
found that satisfy the user-specified parameter values, and
candidate patterns as in phase 1 of the pattern-finding
T otalP is the number of qualified patterns found by
algorithm. Then, the procedure examines if a pattern P
exhaustive search. One would like both RE and PR to be
approximately matches a substructure Str in a graph by
permuting the node labels of P and checking if they match as high as possible. Fig. 10 shows the running times of the
the node labels of Str. If so, the procedure performs algorithms as a function of the number of graphs and Fig. 11
translation and rotation on P and checks if P can shows the recall. The parameters used in the proposed
geometrically match Str. pattern-finding algorithm had the values shown in Table 3.
The speed of the algorithms was measured by the
As can be seen from the figures, the proposed algorithm is
running time. The performance was evaluated using three
10,000 times faster than the exhaustive search method when
measures: recall (RE), precision (PR), and the number of
the data set has more than 600 graphs while achieving a
false matches, Nfm, arising during the hashing process.
very high (> 97%) recall. Due to the accumulative errors
(Recall that a false match arises, if a node-triplet ‰u; v; wŠ
arising in the calculations, some node-triplets may be
from a pattern P has the same hash bin address as a node- hashed to a wrong bin. As a result, the proposed algorithm
triplet ‰i; j; kŠ from a substructure of graph G, though the may miss some node-triplet matches in subphase B of
nodes u; v; w do not match the nodes i; j; k geometrically, phase 2 and, therefore, cannot achieve a 100 percent recall.
see Section 2.3.2.) Recall is defined as In these experiments, precision was 100 percent.
Fig. 12 shows the number of false matches introduced by
UP F ound
RE ˆ 100%: the proposed algorithm as a function of the number of
T otalP
graphs. For the chemical compounds, Nfm is small. For the
Precision is defined as
synthetic graphs, Nfm increases as the number of graphs
Fig. 10. Running times as a function of the number of graphs. Fig. 11. Recall as a function of the number of graphs.
742 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
Fig. 12. Number of false matches as a function of the number of graphs. Fig. 14. Effect of Size.
becomes large. Similar results were obtained when testing the fewer false matches. On the other hand, when Nrow
the size of graphs for both types of data.
is too large, the running times increase substantially, since
one needs to spend a lot of time in reading the 3D table
3.2 Effect of Algorithm-Related Parameters
containing the hash bin addresses.
The purpose of this section is to analyze the effect of
Examining Fig. 13, we see how Scale affects Nfm. Taking
varying the algorithm-related parameter values on the
an extreme case, when Scale ˆ 1, a node-triplet with the
performance of the proposed pattern-finding algorithm.
squares of the lengths of the three edges connecting the
To avoid the mutual influence of parameters, the analysis
nodes being 12.4567 is hashed to the same bin as a node-
was carried out by fixing the parameter values related to
triplet with those values being 12.0000, although the two
data graphsÐthe 1,000 synthetic graphs and 226 com-
node-triplets do not match geometrically, see Section 2.3.1.
pounds described above were used, respectively. In each
On the other hand, when Scale is large (e.g., Scale = 10,000),
experiment, only one algorithm-related parameter value
the distribution of hash function values is less skewed,
was varied; the other parameters had the values shown in
which reduces the number of false matches. It was observed
Table 3.
that Nfm was largest when Scale was 100. This happens
We first examined the effect of varying the parameter
because with this Scale value, inaccuracy was being
values on generating false matches. Since few false
introduced in calculating hash bin addresses. A node-triplet
matches were found for chemical compounds, the
being hashed to a bin might generate k false matches where
experiments focused on synthetic graphs. It was observed
k is the total number of node-triplets already stored in that
that only Nrow and Scale affected the number of false
binÐk would be large if the distribution of hash function
matches. Fig. 13 shows Nfm as a function of Scale for
values is skewed.
Nrow = 101, 131, 167, 199, respectively. The larger the
Figs. 14, 15, and 16 show the recall as a function of Size,
Nrow, the fewer entries in a hash bin, and consequently
Mut, and Scale, respectively. In all three figures, precision
Fig. 13. Number of false matches as a function of Scale. Fig. 15. Effect of Mut.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 743
to the nth digit after the decimal point, i.e., to 10 n, the error
caused by eliminating the digits after the n ‡ 1Ä…th position is
no larger than 0:5 10 n. Namely, for any coordinates
x; y; zÄ…, we have 0:5 10 n, 0:5 10 n, and
x y
0:5 10 n. Thus,
z
2 jxi xjjj j ‡jyi yjjj j
l1 xi xj yi yj
‡jzi zjjj jÄ… Scale
zi zj
‡j j2 ‡j j2 ‡j j2Ä… Scale
xi xj yi yj zi zj
2 jxi xjjj j ‡j jÄ… ‡ jyi yjjj j
xi xj yi
‡j jÄ… ‡ jzi zjjj j ‡j jÄ…Ä… Scale
yj zi zj
‡ j j ‡j jÄ…2 ‡j j ‡j jÄ…2 ‡j j ‡j jÄ…2Ä… Scale
xi xj yi yj zi zj
2 jxi xjj ‡jyi yjj ‡jzi zjjÄ…
2 0:5 10 n Scale
Fig. 16. Impact of Scale.
‡ 3 2 0:5 10 nÄ…2 Scale
ˆjxi xjj ‡jyi yjj ‡jzi zjjÄ…
is 100 percent. From Fig. 14 and Fig. 15, we see that Size and
Mut affect recall slightly. It was also observed that the 2 Scale 10 n ‡ 3 Scale 10 2n:
number of interesting patterns drops (increases, respec-
Assume that the range of the coordinates is ‰ M; MŠ. We
tively) significantly as Size (Mut, respectively) becomes
have jxi xjj 2M, jyi yjj 2M, and jzi zjj 2M.
large. Fig. 16 shows that the pattern-finding algorithm
Thus,
yields a poor performance when Scale is large. With the
12 M Scale 10 n ‡ 3 Scale 10 2n:
l1
data we tested, we found that setting Scale to 10 is the best
overall for both recall and precision.
The second term is obviously negligible in comparison
In sum, the value of Scale has a significant impact on the
with the first term. In order to keep the calculation of hash
performance of our algorithm. The choice of Scale is
bin addresses accurate, the first term should be smaller than
domain and data dependent. In practice, how would one
1. That is, Scale should be chosen to be the largest possible
select a value of Scale for a particular database? Recall that
number such that
the coordinates x; y; zÄ… of each node in a 3D graph have an
12 M Scale 10 n < 1
error due to rounding, see Section 2.3.1. The real coordi-
nates for the node at xi; yi; ziÄ… should be xi; yi; ziÄ…, where
or
xi ˆ xi ‡ , yi ˆ yi ‡ , zi ˆ zi ‡ . Similarly, the real
xi yi zi
1
coordinates for the node at xj; yj; zjÄ… should be xj; yj; zjÄ…,
Scale < 10n:
where xj ˆ xj ‡ , yj ˆ yj ‡ , zj ˆ zj ‡ . The real 12 M
xj yj zj
value of l1 in (2) should be
In our case, the coordinates of the chemical compounds
used were accurate to the 4th digit after the decimal point and
l1 ˆxi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä… Scale
the range of the coordinates was [-10, 10]. Consequently,
ˆxi ‡ xj Ä…2 ‡yi ‡ yj Ä…2
xi xj yi yj
104 500
‡zi ‡ zj Ä…2Ä… Scale
Scale < ˆ
zi zj
12 10 6
ˆxi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä… Scale
Thus, choosing Scale ˆ 10 is good, as validated by our
‡ 2 xi xjÄ… Ä… ‡yi yjÄ… Ä…
xi xj yi yj
experimental results. Notice that Scale can be determined
‡zi zjÄ… Ä…Ä… Scale
zi zj once the data are given. All we need to know is how
accurate the data are, and what the ranges of their
‡ Ä…2 ‡ Ä…2 ‡ Ä…2Ä… Scale
xi xj yi yj zi zj
coordinates are, both of which are often clearly associated
ˆ l1 ‡ ;
l1
with the data.
Figs. 17 and 18 show the recall and precision as a
where is an accumulative error,
l1
function of . It can be seen that when is 0.01, precision is
ˆ 2 xi xjÄ… Ä… ‡yi yjÄ… Ä…
l1 xi xj yi yj
100 percent and recall is greater than 97 percent. When
‡zi zjÄ… Ä…Ä… Scale : becomes smaller (e.g., ˆ 0:0001 ), precision remains the
zi zj
same while recall drops. When becomes larger (e.g.,
‡ Ä…2 ‡ Ä…2 ‡ Ä…2Ä… Scale
xi xj yi yj zi zj
ˆ 10), recall increases slightly while precision drops. This
15Ä…
happens because some irrelevant node-triplet matches were
Since we used l1 to calculate the hash bin addresses, it is included, rendering unqualified patterns returned as an
critical that the accumulative error would not mislead us to a answer. We also tested different values for Occur, Nrow,
wrong hash bin. Assuming that the coordinates are accurate P rime1, Prime2, and P rime3. It was found that varying
744 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
Fig. 18. Precision as a function of .
Fig. 17. Recall as a function of .
0
the original substructure P . Supposing the pattern P has
these parameter values had little impact on the performance
N nodes, the total number of node-triplets that include the
of the proposed algorithm.
node v is
Finally, we examined the effect of sensor errors, which
are caused by the measuring device (e.g., X-ray crystal-
N 1
:
lography), and are proportional to the values being
2
measured. Suppose that the sensor error is 10 n of the
Thus, we would miss the same number of node-triplet
real value for any coordinates x; y; zÄ…, i.e., ˆ x 10 n,
x
matches. However, if all other node-triplets that do not
ˆ y 10 n, and ˆ z 10 n. We have
y z
include the node v are matched successfully, we would
x ˆ x 1 ‡ 10 nÄ…; y ˆ y 1 ‡ 10 nÄ…;
still have
and z ˆ z 1 ‡ 10 nÄ…. Let represent the sensor error
l1
N N 1 N 1
ˆ
induced in calculating l1 in (2). In a way similar to the
3 2 3
calculation of the accumulative error in (15), we obtain
l1
node-triplet matches. According to Theorem 3, there would
0
ˆxi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä… 2 10 n Scale
l1
be N 1 node matches between the pattern P and the
original substructure P (i.e., P0 matches P with 1 mutation).
‡xi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä… 10 2n Scale:
This example reveals that, in general, if we allow mutations
Let us focus on the first term since it is more significant.
to occur in searching for patterns and the number of nodes
Clearly, the proposed approach is not feasible in the
with sensor errors equals the allowed number of mutations,
environment with large sensor errors. Taking an example,
the recall will be the same as the case in which there are no
suppose that the sensor error is 10 2 of the real value.
sensor errors and we search for exactly matched patterns
Again, in order to keep the calculation of hash bin
without mutations. On the other hand, if no mutations are
addresses accurate, the accumulative error should be
allowed in searching for patterns and sensor errors occur,
smaller than 1. That is,
the recall will be zero.
xi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä… 2 10 2 Scale < 1
4 DATA MINING APPLICATIONS
or
One important application of pattern finding involves the
ability to perform classification and clustering as well as
xi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä… Scale < 50:
other data mining tasks. In this section, we present two data
Even when Scale ˆ 1, the Euclidean distance between
mining applications of the proposed algorithm in scientific
any two nodes must be smaller than 7, since the square of
domains: classifying proteins and clustering compounds.10
the distance, viz. xi xjÄ…2 ‡yi yjÄ…2 ‡zi zjÄ…2Ä…, must
4.1 Classifying Proteins
be less than 50. Most data sets, including all the chemical
Proteins are large molecules, comprising hundreds of
compounds we used, do not satisfy this restriction.
amino acids (residues) [18], [33]. In each residue the C ,
To illustrate how this would affect recall, consider a
C , and N atoms form a backbone of the residue [17].
substructure P already stored in the hash table. We add a
Following [28], we represent each residue by the three
10 2 sensor error to the coordinates of a node v in P . Call
atoms. Thus, if we consider a protein as a 3D graph, each
0 0
the resulting substructure P . Then, we use P as a pattern
node of the graph is an atom. Each node has a label, which
to match the original substructure P in the hash table. Those
0
node-triplets generated from the pattern P that include the
10. Another application of the proposed algorithm to flexible chemical
node v would not match the corresponding node-triplets in searching in 3D structural databases can be found in [34].
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 745
motifs. That is, there were 2,784 times in which a node-
triplet ‰u; v; wŠ from a pattern was found to have the same
hash bin address as a node-triplet ‰i; j; kŠ from a sub-
structure of a protein, though the nodes u; v; w did not
match the nodes i; j; k geometrically, see Section 2.3.2.
To evaluate the quality of the discovered motifs, we
applied them to classifying the proteins using the 10-way
cross-validation scheme. That is, each family was divided
into 10 groups of roughly equal size. Specifically, the
RNA-directed DNA Polymerase family, referred to as
family 1, contained five groups each having five proteins
and five groups each having four proteins. The Thymi-
dylate Synthase family, referred to as family 2, contained
seven groups each having four proteins and three groups
each having three proteins, and 10 tests were conducted.
In each test, a group was taken from a family and used
as test data; the other nine groups were used as training
data for that family. We applied our pattern-finding
algorithm to each training data set to find motifs (the
parameter values used were as shown in Table 3). Each
Fig. 19. (a) A 3D protein. (b) The three substructures of the protein in (a).
motif M found in family i was associated with a weight d
where
is the name of the atom and is not unique in the protein. We
assign a unique number to identify a node in the protein, d ˆ ri rj 1 i; j 2; i 6ˆ j:
where the order of numbering is obtained from the Protein
Here ri is M's occurrence number in the training data set
Data Bank (PDB) [1], [2], accessible at http://www.rcsb.org.
of family i. Intuitively, the more frequently a motif occurs in
In the experiments, we examined two families of
its own family and the less frequently it occurs in the other
proteins chosen from PDB pertaining to RNA-directed
family, the higher its weight is. In each family we collected
DNA Polymerase and Thymidylate Synthase. Each family
all the motifs having a weight greater than one and used
contains proteins having the same functionality in various
them as the characteristic motifs of that family.
organisms. We decompose each protein into consecutive
When classifying a test protein Q, we first decomposed Q
substructures, each substructure containing six nodes.
into consecutive substructures as described above. The
Two adjacent substructures overlap by sharing the two
result was a set of substructures, say, Q1; . . . ; Qp. Let
neighboring nodes on the boundary of the two substruc-
nk; 1 i 2; 1 k p, denote the number of characteristic
i
tures (see Fig. 19). Thus, each substructure is a portion of
motifs in family i that matched Qk within one mutation.
the polypeptide chain backbone of a protein where the
Each family i obtained a score Ni where
polypeptide chain is made up of residues linked together
Pp
by peptide bonds. The peptide bonds have strong covalent
nk
kˆ1 i
Ni ˆ
bonding forces that make the polypeptide chain rigid. As
mi
a consequence, the substructures used by our algorithm
are rigid. Notice that in the proteins there are other atoms and mi is the total number of characteristic motifs in family
such as O and H (not shown in Fig. 19) lying between two i. Intuitively, the score Ni here was determined by the
residues. Since these atoms are not as important as C , C , number of the characteristic motifs in family i that occurred
and N in determining the structure of a protein, we do in Q, divided by the total number of characteristic motifs in
not consider them here. Table 4 summarizes the number family i. The protein Q was classified into the family i with
of proteins in each family, their sizes and the frequently maximum Ni. If the scores were 0 for both families (i.e., the
occurring patterns (or motifs) discovered from the test protein did not have any substructure that matched any
proteins. The parameter values used were as shown in characteristic motif), then the ªno-opinionº verdict was
Table 3. In the experiments, 2,784 false matches were given. This algorithm is similar to those used in [30], [35] to
detected and eliminated during the process of finding the classify chemical compounds and sequences.
TABLE 4
Statistics Concerning the Proteins and Motifs Found in Them
746 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
As in Section 3, we use recall (REc) and precision (PRc) average-group method [13] to cluster the graphs in S, which
to evaluate the effectiveness of our classification algorithm. works as follows.
Recall is defined as Initially, every graph is a cluster. The algorithm merges
two nearest clusters to form a new cluster, until there are
P2
T otalNum NumLossi
iˆ1 c only K clusters left where K is a user-specified parameter.
REc ˆ 100%;
T otalNum
The distance between two clusters C1 and C2 is given by
where T otalNum is the total number of test proteins and
X
1
jdGx; GyÄ…j; 16Ä…
NumLossi is the number of test proteins that belong to
c
jC1jjC2j
Gx2C1;Gy2C2
family i but are not assigned to family i by our algorithm
(they are either assigned to family j, j 6ˆ i, or they receive
where jCij, i ˆ 1; 2, is the size of cluster Ci. The algorithm
the ªno-opinionº verdict). Precision is defined as
requires ON2Ä… distance calculations where N is the total
P2
number of graphs in S.
T otalNum NumGaini
iˆ1 c
PRc ˆ 100%; We applied this algorithm to clustering chemical
T otalNum
compounds. Ninety eight compounds were chosen from
where NumGaini is the number of test proteins that do not
the Merck Index that belonged to three groups pertaining to
c
belong to family i but are assigned by our algorithm to
aromatic, bicyclicalkanes and photosynthesis. The data was
family i. With the 10-way cross validation scheme, the
created by the CORINA program that converted 2D data
average REc over the 10 tests was 92.7 percent and the
(represented in SMILES string) to 3D data (represented in
average PRc was 96.4 percent. It was found that 3.7 percent
PDB format) [22]. Table 5 lists the number of compounds in
test proteins on average received the ªno-opinionº verdict
each group, their sizes and the patterns discovered from
during the classification. We repeated the same experiments
them. The parameter values used were Size ˆ 5, Occur ˆ 1,
using other parameter values and obtained similar results,
Mut ˆ 2; the other parameters had the values shown in
except that larger Mut values (e.g., 3) generally yielded
Table 3.
lower REc.
To evaluate the effectiveness of our clustering algorithm,
The binary classification problem studied here is con- we applied it to finding clusters in the compounds. The
cerned with assigning a protein to one of two families. This
parameter value K was set to 3, as there were three groups.
problem arises frequently in protein homology detection
As in the previous sections, we use recall (REr) and
[31]. The performance of our technique degrades when
precision (PRr) to evaluate the effectiveness of the cluster-
applied to the n-ary classification problem, which is
ing algorithm. Recall is defined as
concerned with assigning a protein to one of many families
PK
T otalNum NumLossi
[32]. For example, we applied the technique to classifying
iˆ1 r
REr ˆ 100%;
three families of proteins and the REc and PRc dropped to
T otalNum
80 percent.
where NumLossi is the number of compounds that belong
r
to group Gi, but are assigned by our algorithm to group Gj,
4.2 Clustering Compounds
i 6ˆ j, and T otalNum is the total number of compounds
In addition to classifying proteins, we have developed an
tested. Precision is defined as
algorithm for clustering 3D graphs based on the patterns
PK
occurring in the graphs and have applied the algorithm
T otalNum NumGaini
iˆ1 r
to grouping compounds. Given a collection S of 3D PRr ˆ 100%;
T otalNum
graphs, the algorithm first uses the procedure depicted in
Section 2.2 to decompose the graphs into rigid substruc- where NumGaini is the number of compounds that do
r
tures. Let fStrpjp ˆ 0; 1; :::; N 1g be the set of substruc- not belong to group Gi, but are assigned by our
algorithm to group Gi. Our experimental results indicated
tures found in the graphs in S where jStrpj Size.
Using the proposed pattern-finding algorithm, we exam- that REr ˆ PRr ˆ 99%. Out of the 98 compounds, only
one compound in the photosynthesis group was assigned
ine each graph Gq in S and determine whether each
incorrectly to the bicyclicalkanes group. We experimented
substructure Strp approximately occurs in Gq within
with other parameter values and obtained similar
Mut mutations. Each graph Gq is represented as a bit
results.11
string of length N, i.e., Gq ˆb0; b1; :::; bN 1Ä…, where
q q q
1 if Strp occurs in Gq within Mut mutations
bp ˆ
q 5 RELATED WORK
0 otherwise:
There are several groups working on pattern finding (or
For example, consider the two patterns P1 and P2 in
knowledge discovery) in molecules and graphs. Conklin et al.
Fig. 3 and the three graphs in Fig. 3a. Suppose the
[4], [5], [6], for example, represented a molecular structure as
allowed number of mutations is 0. Then, G1 is repre-
an image, which comprised a set of parts with their 3D
sented as 10, G2 as 01, and G3 as 10. On the other hand,
coordinates and a set of relations that were preserved for the
suppose the allowed number of mutations is 1. Then, G1
image. The authors used an incremental, divisive approach to
and G3 are represented as 11 and G2 as 01.
discover the ªknowledgeº from a data set, that is, to build a
The distance between two graphs Gx and Gy, denoted
dGx; GyÄ…, is defined as the Hamming distance [12] between
11. The Occur value was fixed at 1 in these experiments because of the
their bit strings. The algorithm then uses the well-known fact that all the compounds were represented as binary bit strings.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 747
TABLE 5
Statistics Concerning the Chemical Compounds and Patterns Found in Them
subsumption hierarchy that summarized and classified the
point or a specific point chosen based on chemical
data set. The algorithm relied on a measure of similarity
considerations. The 3D substructure can be rotated around
among molecular images that was defined in terms of their
the hinge point. The authors then developed schemes to
largest common subimages. In [8], Djoko et al. developed a
generate node-triplets and hash table indices. However,
system, called SUBDUE, that utilized the minimum descrip-
based on those schemes, false matches cannot be detected in
tion length principle to find repeatedly occurring substruc-
the hash table and must be processed in a subsequent,
tures in a graph. Once a substructure was found, the
additional verification phase. Thus, even if a pattern
substructure was used to simplify the graph by replacing
appears to match several substructures during the search-
instances of the substructure with a pointer to the
ing phase, one has to compare the pattern with those
discovered substructure. In [7], Dehaspe et al. used
substructures one by one to make sure they are true
DATALOG to represent compounds and applied data
matches during the verification phase. When mutations are
mining techniques to predicting chemical carcinogenicity.
allowed, which were not considered in [23], [24], [29], a
Their techniques were based on Mannila and Toivonen's
brute-force verification test would be very time-consuming.
algorithm [15] for finding interesting patterns from a class
For example, in comparing our approach with their
of sentences in a database.
approach in performing protein classification as described
In contrast to the above work, we use the geometric
in Section 4, we found that both approaches yield the same
hashing technique to find approximately common pat-
recall and precision, though our approach is 10 times faster
terns in a set of 3D graphs without prior knowledge of
than their approach.
their structures, positions, or occurrence frequency. The
geometric hashing technique used here originated from
the work of Lamdan and Wolfson for model based
6 CONCLUSION
recognition in computer vision [14]. Several researchers
In this paper, we have presented an algorithm for finding
attempted to parallelize the technique based on various
patterns in 3D graphs. A pattern here is a rigid substructure
architectures, such as the Hypercube and the Connection
that may occur in a graph after allowing for an arbitrary
Machine [3], [19], [20]. It was observed that the distribu-
number of rotations and translations as well as a small
tion of the hash table entries might be skewed. To balance
number of edit operations in the pattern or in the graph. We
the distribution of the hash function values, delicate
used the algorithm to find approximately common patterns
rehash functions were designed [20]. There were also
in a set of synthetic graphs, chemical compounds and
efforts exploring the uncertainty existing in the geometric
proteins. This yields a kind of alphabet of patterns. Our
hashing algorithms [11], [26].
experimental results demonstrated the good performance of
In 1996, Rigoutsos et al. employed geometric hashing
the proposed algorithm and its usefulness for pattern
and magic vectors for substructure matching in a database
discovery. We then developed classification and clustering
of chemical compounds [21]. The magic vectors were bonds
algorithms using the patterns found in the graphs, and
among atoms; the choice of them was domain dependent
applied them to classifying proteins and clustering com-
and was based on the type of each individual graph. We
pounds. Empirical study showed high recall and precision
extend the work in [21] by providing a framework for
rates for both classification and clustering, indicating the
discovering approximately common substructures in a set
significance of the patterns.
of 3D graphs, and applying our techniques to both
We have implemented the techniques presented in the
compounds and proteins. Our approach differs from [21]
paper and are combining them with the algorithms for
in that instead of using the magic vectors, we store a
acyclic graph matching [36] into a toolkit. We use the toolkit
coordinate system in a hash table entry. Furthermore, we
to find patterns in various types of graphs arising in
establish a theory for detecting and eliminating false
different domains and as part of our tree and graph search
matches occurring in the hashing.
engine project. The toolkit can be obtained from the authors
More recently, Wolfson and his colleagues also applied
and is also accessible at http://www.cis.njit.edu/~jason/
geometric hashing algorithms to protein docking and
sigmod.html on the Web.
recognition [23], [24], [29]. They attached a reference frame
In the future, we plan to study topological graph
to each substructure, where the reference frame is an
matching problems. One weakness of the proposed geo-
orthonormal coordinate frame of arbitrary orientations,
metric hashing approach is its sensitivity to errors as we
established at a ªhingeº point. This hinge point can be any discussed in Section 3.2. One has to tune the Scale
748 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 4, JULY/AUGUST 2002
[10] H.N. Gabow, Z. Galil, and T.H. Spencer, ªEfficient Implementa-
parameter when the accuracy of data changes. Another
tion of Graph Algorithms Using Contraction,º Proc. 25th Ann.
weakness is that the geometric hashing approach doesn't
IEEE Symp. Foundations of Computer Science, pp. 347-357, 1984.
take edges into consideration. In some domains, edges are
[11] W.E.L. Grimson, D.P. Huttenlocher, and D.W. Jacobs, ªAffine
Matching with Bounded Sensor Error: Study of Geometric
important despite the nodes matching. For example, in
Hashing and Alignment,º Technical Memo AIM-1250, Artificial
computer vision, edges may represent some kind of
Intelligence Laboratory, Massachusetts Inst. of Technology, 1991.
boundaries and have to be considered in object recognition.
[12] R.W. Hamming ªError Detecting and Error Correcting Codes,º
(Notice that when applying the proposed geometric The Bell System Technical J., vol. 29, no. 2, pp. 147-160, 1950,
Reprinted in E.E. Swartzlander, Computer Arithmetic. vol. 2, Los
hashing approach to this domain, the node label alphabet
Alamitos, Calif.: IEEE Computer Soc. Press Tutorial, 1990.
could be extended to contain semantic information, such as
[13] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An
color, shape, etc.) We have done preliminary work in
Introduction to Cluster Analysis. New York: John Wiley & Sons,
1990.
topological graph matching, in which our algorithm
[14] Y. Lamdan and H. Wolfson, ªGeometric Hashing: A General and
matches edge distances and the software is accessible at
Efficient Model-Based Recognition Scheme,º Proc. Int'l Conf.
http://cs.nyu.edu/cs/faculty/shasha/papers/agm.html.
Computer Vision, pp. 237-249, 1988.
We plan to combine geometric matching and topological [15] H. Mannila and H. Toivonen, ªLevelwise Search and Borders of
Theories in Knowledge Discovery,º Data Mining and Knowledge
matching approaches to finding patterns in more general
Discovery, vol. 1, no. 3, pp. 241-258, 1997.
graphs.
[16] J.A. McHugh, Algorithmic Graph Theory. Englewood Cliffs, N.J.:
Prentice Hall, 1990.
[17] X. Pennec and N. Ayache, ªAn On2Ä… Algorithm for 3D
ACKNOWLEDGMENTS
Substructure Matching of Proteins,º Proc. First Int'l Workshop
Shape and Pattern Matching in Computational Biology, pp. 25-40,
The authors would like to thank the anonymous reviewers
1994.
for their constructive suggestions that helped improve both
[18] C. Pu, K.P. Sheka, L. Chang, J. Ong, A. Chang, E. Alessio,
I.N. Shindyalov, W. Chang, and P.E. Bourne, ªPDBtool: A
the quality and the presentation of this paper. We also
Prototype Object Oriented Toolkit for Protein Structure
thank Dr. Carol Venanzi, Bo Chen, and Song Peng for useful
Verification,º Technical Report CUCS-048-92, Dept. Computer
discussions and for providing chemical compounds used in
Science, Columbia Univ., 1992.
[19] I. Rigoutsos and R. Hummel, ªScalable Parallel Geometric
the paper. This work was supported in part by US National
Hashing for Hypercube (SIMD) Architectures,º Technical Report
Science Foundation grants IRI-9531548, IRI-9531554, IIS-
TR-553, Dept. Computer Science, New York Univ., 1991.
9988345, IIS-9988636, and by the Natural Sciences and
[20] I. Rigoutsos and R. Hummel, ªOn a Parallel Implementation of
Engineering Research Council of Canada under Grant No. Geometric Hashing on the Connection Machine,º Technical
Report TR-554, Dept. Computer Science, New York Univ., 1991.
OGP0046373. Part of this work was done while X. Wang
[21] I. Rigoutsos, D. Platt, and A. Califano, ªFlexible Substructure
was with the Department of Computer and Information
Matching in Very Large Databases of 3D-Molecular Information,º
Science, New Jersey Institute of Technology. Part of this
research report, IBM T.J. Watson Research Center, Yorktown
Heights, N.Y., 1996.
work was done while J.T.L. Wang was visiting Courant
[22] J. Sadowski and J. Gasteiger, ªFrom Atoms and Bonds to Three-
Institute of Mathematical Sciences, New York University.
Dimensional Atomic Coordinates: Automatic Model Builders,º
Chemical Rev., pp. 2567-2581, vol. 93, 1993.
[23] B. Sandak, R. Nussinov, and H.J. Wolfson, ªA Method for
REFERENCES
Biomolecular Structural Recognition and Docking Allowing
[1] E.E. Abola, F.C. Bernstein, S.H. Bryant, T.F. Koetzle, and J. Weng, Conformational Flexibility,º J. Computational Biology, vol. 5, no. 4,
pp. 631-654, 1998.
ªProtein Data Bank,º Data Comm. Int'l Union of Crystallography,
[24] B. Sandak, H.J. Wolfson, and R. Nussinov, ªFlexible Docking
pp. 107-132, 1987.
Allowing Induced Fit in Proteins: Insights from an Open to Closed
[2] F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer, M.D.
Conformational Isomers,º Proteins, vol. 32, pp. 159-74, 1998.
Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi,
[25] Time Warps, String Edits, and Macromolecules: The Theory and
ªThe Protein Data Bank: A Computer-Based Archival File for
Practice of Sequence Comparison. D. Sankoff and J.B. Kruskal, eds.,
Macromolecular Structures,º J. Molecular Biology, vol. 112, 1977.
Reading, Mass.: Addison-Wesley, 1983.
[3] O. Bourdon and G. Medioni, ªObject Recognition Using Geo-
[26] K.B. Sarachik, ªLimitations of Geometric Hashing in the Presence
metric Hashing on the Connection Machine,º Proc. 10th Int'l Conf.
of Gaussian Noise,º Technical Memo AIM-1395, Artificial Intelli-
Pattern Recognition, pp. 596-600, 1990.
gence Laboratory, Massachusetts Inst. of Technology, 1992.
[4] D. Conklin, ªKnowledge Discovery in Molecular Structure
[27] R.R. Stoll, Linear Algebra and Matrix Theory, New York: McGraw-
Databases,º doctoral dissertation, Dept. Computing and Informa-
Hill, 1952.
tion Science, Queen's Univ., Canada, 1995.
[28] I.I. Vaisman, A. Tropsha, and W. Zheng, ªCompositional
[5] D. Conklin, ªMachine Discovery of Protein Motifs,º Machine
Preferences in Quadruplets of Nearest Neighbor Residues in
Learning, vol. 21, pp. 125-150, 1995. Protein Structures: Statistical Geometry Analysis,º Proc. IEEE Int'l
[6] D. Conklin, S. Fortier, and J. Glasgow, ªKnowledge Discovery in Join Symp. Intelligence and Systems, pp. 163-168, 1998.
Molecular Databases,º IEEE Trans. Knowledge and Data Eng., vol. 5,
[29] G. Verbitsky, R. Nussinov, and H.J. Wolfson, ªFlexible Structural
no. 6, pp. 985-987, 1993.
Comparison Allowing Hinge Bending and Swiveling Motions,º
[7] L. Dehaspe, H. Toivonen, and R.D. King, ªFinding Frequent
Proteins, vol. 34, pp. 232-254, 1999.
Substructures in Chemical Compounds,º Proc. Fourth Int'l Conf. [30] J.T.L. Wang, G.-W. Chirn, T.G. Marr, B.A. Shapiro, D. Shasha, and
K. Zhang, ªCombinatorial Pattern Discovery for Scientific Data:
Knowledge Discovery and Data Mining, pp. 30-36, 1998.
Some Preliminary Results,º Proc. 1994 ACM SIGMOD Int'l Conf.
[8] S. Djoko, D.J. Cook, and L.B. Holder, ªAn Empirical Study of
Management of Data, pp. 115-125, 1994.
Domain Knowledge and Its Benefits to Substructure Discovery,º
IEEE Trans. Knowledge and Data Eng., vol. 9, no. 4, pp. 575-586, [31] J.T.L. Wang, Q. Ma, D. Shasha, and C.H. Wu, ªApplication of
1997. Neural Networks to Biological Data Mining: A Case Study in
[9] D. Fischer, O. Bachar, R. Nussinov, and H.J. Wolfson, ªAn Protein Sequence Classification,º Proc. Sixth ACM SIGKDD Int'l
Efficient Automated Computer Vision Based Technique for Conf. Knowledge Discovery and Data Mining, pp. 305Ä…309, 2000.
Detection of Three Dimensional Structural Motifs in Proteins,º [32] J.T.L. Wang, T.G. Marr, D. Shasha, B. A. Shapiro, G.-W. Chirn, and
J. Biomolecular and Structural Dynamics, vol. 9, no. 4, pp. 769-789, T.Y. Lee, ªComplementary Classification Approaches for Protein
1992. Sequences,º Protein Eng., vol. 9, no. 5, pp. 381-386, 1996.
WANG ET AL.: FINDING PATTERNS IN THREE-DIMENSIONAL GRAPHS: ALGORITHMS AND APPLICATIONS TO SCIENTIFIC DATA MINING 749
[33] J.T.L. Wang, B.A. Shapiro, and D. Shasha, eds., Pattern Discovery in Bruce A. Shapiro received the BS degree from
Biomolecular Data: Tools, Techniques and Applications, New York: Brooklyn College studying mathematics and
Oxford Univ. Press, 1999. physics, the MS and PhD degrees in computer
[34] X. Wang and J.T.L. Wang, ªFast Similarity Search in Three- science from the University of Maryland, College
Dimensional Structure Databases,º J. Chemical Information and Park. He is a principal investigator and computer
Computer Sciences, vol. 40, no. 2, pp. 442-451, 2000. specialist in the Laboratory of Experimental and
[35] X. Wang, J.T.L. Wang, D. Shasha, B.A. Shapiro, S. Dikshitulu, Computational Biology in the National Cancer
I. Rigoutsos, and K. Zhang, ªAutomated Discovery of Active Institute, National Institutes of Health.
Motifs in Three Dimensional Molecules,º Proc. Third Int'l Conf. Dr. Shapiro directs research on computational
Knowledge Discovery and Data Mining, pp. 89-95, 1997. nucleic acid structure prediction and analysis
[36] K. Zhang, J.T.L. Wang, and D. Shasha, ªOn the Editing Distance and has developed several algorithms and computer systems related to
Between Undirected Acyclic Graphs,º Int'l J. Foundations of this area. His interests include massively parallel genetic algorithms,
Computer Science, special issue on Computational Biology, vol. 7, molecular dynamics, and data mining for molecular structure/function
no. 1, pp. 43-57, 1996. relationships. He is the author of numerous papers on nucleic acid
morphology, parallel computing, image processing, and graphics. He is
an editor and author of the book Pattern Discovery in Biomolecular Data,
published by Oxford University Press in 1999. In addition, he is a lecturer
Xiong Wang received the MS degree in
in the Computer Science Department at the University of Maryland,
computer science from Fudan University, China,
College Park.
and the PhD degree in computer and information
science from the New Jersey Institute of
Isidore Rigoutsos received the BS degree in physics from the
Technology, in 2000 (thesis advisor:
University of Athens, Greece, the MS and PhD degrees in computer
Jason T.L. Wang). He is currently an assistant
science from the Courant Institute of Mathematical Sciences of New
professor of computer science in California State
York University. He manages the Bioinformatics and Pattern Discovery
University, Fullerton. His research interests
group in the Computational Biology Center of the Thomas J. Watson
include databases, knowledge discovery and
Research Center. Currently, he is visiting lecturer in the Department of
data mining, pattern matching, and bioinfor-
Chemical Engineering of the Massachusetts Institute of Technology. He
matics. He is a member of ACM, ACM SIGMOD, IEEE, and IEEE
has been the recipient of a Fulbright Foundation Fellowship and has
Computer Society.
held an adjunct professor appointment in the Computer Science
Department of New York University. He is the author or co-author of
eight patents and numerous technical papers. Dr. Rigoutsos is a
Jason T.L. Wang received the BS degree in member of the IEEE, the IEEE Computer Society, the International
mathematics from National Taiwan University, Society for Computational Biology, and the American Association for the
and the PhD degree in computer science from Advancement of Science.
the Courant Institute of Mathematical Sciences,
New York University, in 1991 (thesis advisor:
Kaizhong Zhang received the MS degree in mathematics from Peking
Dennis Shasha). He is a full professor in the
University, Beijing, People's Republic of China, in 1981, and the MS and
Computer and Information Science Department
PhD degrees in computer science from the Courant Institute of
at the New Jersey Institute of Technology and
Mathematical Sciences, New York University, New York, in 1986 and
director of the University's Data and Knowledge
1989, respectively. Currently, he is an associate professor in the
Engineering Laboratory. Dr. Wang's research
Department of Computer Science, University of Western Ontario,
interests include data mining and databases, pattern recognition,
London, Ontario, Canada. His research interests include pattern
bioinformatics, and Web information retrieval. He has published more
recognition, computational biology, image processing, sequential, and
than 100 papers, including 30 journal articles, in these areas. He is an
parallel algorithms.
editor and author of the book Pattern Discovery in Biomolecular Data
(Oxford University Press), and co-author of the book Mining the World
Wide Web (Kluwer). In addition, he is program cochair of the 2001
Atlantic Symposium on Computational Biology, Genome Information
Systems & Technology held at Duke University. He is a member of the . For more information on this or any computing topic, please visit
IEEE. our Digital Library at http://computer.org/publications/dlib.
Dennis Shasha received the PhD degree from
Harvard in 1984 (thesis advisor: Nat Goodman).
He is a professor at Courant Institute, New York
University, where he does research on biological
pattern discovery, combinatorial pattern match-
ing on trees and graphs, software support for the
exploration of unfamiliar databases, database
tuning, and database design for time series.
After graduating from Yale in 1977, he worked
for IBM designing circuits and microcode for the
3090. He has written a professional reference book Database Tuning: A
Principled Approach, (1992, Prentice-Hall), two books about a mathe-
matical detective named Dr. Ecco entitled The Puzzling Adventures of
Dr. Ecco, (1988, Freeman, and republished in 1998 by Dover) and
Codes, Puzzles, and Conspiracy, (1992, Freeman) and a book of
biographies about great computer scientists called Out of Their Minds:
The Lives and Discoveries of 15 Great Computer Scientists, (1995,
Copernicus/Springer-Verlag). Finally, he is a co-author of Pattern
Discovery in Biomolecular Data: Tools, Techniques, and Applications,
published in 1999 by Oxford University Press. In addition, he has co-
authored more than 30 journal papers and 40 conference papers and
spends most of his time in building data mining and pattern recognition
software these days. He writes a monthly puzzle column for Scientific
American.


Wyszukiwarka

Podobne podstrony:
TRÓJCA trinity ONE GOD IN THREE PERSONS
Three Dimensional Nickel Ion Transport
using design patterns in game engines
Domhoff, G W (1996) Finding meaning in dreams A quantitative approach r 3
Programming Survey Of Genetic Algorithms And Genetic Programming
Północ w ogrodzie dobra i zła Midnight in the Garden of Good and Evil (1997) Napisy Pl
Dua a in Arabic with Englsih Translation and Transilitration
Hofstede G Dimensions do not exist A reply to Brendan McSweeney
Pressure Relaxation in a Hole Surrounded by a Porous and Permeable Rock
Religion in trauma care grand narratives and sacred rituals
Doping in Sport Landis Contador Armstrong and the Tour de Fran
A Comparison between Genetic Algorithms and Evolutionary Programming based on Cutting Stock Problem
Patterns of damage in genomic DNA sequences from a Neandertal
SIMPSONS 02x04 Two Cars in Every Garage and Three Eyes on Every Fish
Sensitization of two dimensional detonations in nitromethane by glass microballoons

więcej podobnych podstron