Mining tree-query associations in graphs

New applications of data mining, such as in biology, bioinformatics, or sociology, are faced with large datasets structured as graphs. We introduce a novel class of tree-shaped patterns called tree queries, and present algorithms for mining tree queries and tree-query associations in a large data graph. Novel about our class of patterns is that they can contain constants, and can contain existential nodes which are not counted when determining the number of occurrences of the pattern in the data graph. Our algorithms have a number of provable optimality properties, which are based on the theory of conjunctive database queries. We propose a practical, database-oriented implementation in SQL, and show that the approach works in practice through experiments on data about food webs, protein interactions, and citation analysis.

💡 Research Summary

The paper addresses the growing need for mining structured data that naturally appears as large graphs in domains such as biology, bio‑informatics, and sociology. Traditional subgraph mining techniques, while expressive, suffer from high computational complexity and produce patterns that are difficult for domain experts to interpret. To overcome these limitations, the authors introduce a novel class of tree‑shaped patterns called tree queries. A tree query is a rooted tree whose nodes may be (i) variables, (ii) constants (i.e., nodes with a fixed label), or (iii) existential nodes. The inclusion of constants allows the pattern to enforce specific attribute values, while existential nodes are matched in the data graph but are ignored when counting occurrences. This design dramatically increases the expressive power of tree patterns without sacrificing the simplicity that makes them amenable to human interpretation.
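To make the three node kinds concrete, here is a minimal Python sketch of a tree query as a data structure. The class and function names (`NodeKind`, `TreeQueryNode`, `counted_variables`) and the example label `"kinase"` are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NodeKind(Enum):
    VARIABLE = "variable"        # matched and counted
    CONSTANT = "constant"        # must match one fixed label
    EXISTENTIAL = "existential"  # must match, but is not counted


@dataclass
class TreeQueryNode:
    # Hypothetical representation: one object per node of the rooted tree.
    name: str
    kind: NodeKind
    value: Optional[str] = None  # only meaningful for constants
    children: List["TreeQueryNode"] = field(default_factory=list)


def counted_variables(root: TreeQueryNode) -> List[str]:
    """Return, in pre-order, the names of the nodes that are counted."""
    counted = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind is NodeKind.VARIABLE:
            counted.append(node.name)
        # Push children reversed so they are visited left to right.
        stack.extend(reversed(node.children))
    return counted


# x has an existential child y (which has a constant child) and a variable child z.
example = TreeQueryNode("x", NodeKind.VARIABLE, children=[
    TreeQueryNode("y", NodeKind.EXISTENTIAL, children=[
        TreeQueryNode("c", NodeKind.CONSTANT, value="kinase"),
    ]),
    TreeQueryNode("z", NodeKind.VARIABLE),
])
print(counted_variables(example))
```

Only `x` and `z` would contribute to the occurrence count here; `y` and the constant restrict which matches exist without adding to the count.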

The theoretical foundation of the work rests on the correspondence between tree queries and conjunctive queries in relational database theory. By treating each tree query as a conjunctive query, the authors are able to import well‑studied notions of query containment, minimality, and optimal evaluation. They prove that their mining algorithm is sound, complete, and minimal with respect to the space of frequent tree queries under a user‑specified support threshold.

The mining process consists of two main phases:

Candidate Generation and Pruning – Starting from the empty tree, the algorithm iteratively expands candidates by adding nodes, edges, constants, or existential markers (the “extend‑shrink” strategy). Several pruning rules are applied early to keep the search space tractable:
- Frequency‑based pruning: any candidate containing a subtree that fails to meet the minimum support is discarded.
- Constant‑feasibility pruning: if a constant label cannot be satisfied in the data graph, the candidate is eliminated.
- Existential‑node independence: candidates that introduce existential nodes that do not affect the support are not pursued further.
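The frequency-based rule can be sketched as an Apriori-style check. This is a simplified illustration under stated assumptions, not the paper's algorithm: a pattern is a `frozenset` of `(parent, child)` edges, patterns are identified by their variable names (isomorphic trees are not merged), only subtrees obtained by dropping one leaf edge are checked, and the empty tree counts as frequent. The function names are hypothetical.

```python
def leaf_edges(tree):
    """Edges whose child node has no children of its own."""
    parents = {parent for parent, _ in tree}
    return [(parent, child) for parent, child in tree if child not in parents]


def survives_frequency_pruning(candidate, frequent):
    """Keep a candidate only if every one-leaf-smaller subtree is known frequent."""
    return all(candidate - {edge} in frequent for edge in leaf_edges(candidate))


# Frequent patterns found so far (the empty tree is treated as frequent).
frequent = {
    frozenset(),
    frozenset({("x", "y")}),
    frozenset({("x", "y"), ("y", "z")}),
}

chain = frozenset({("x", "y"), ("y", "z"), ("z", "w")})
fork = frozenset({("x", "y"), ("y", "z"), ("y", "w")})

print(survives_frequency_pruning(chain, frequent))
print(survives_frequency_pruning(fork, frequent))
```

The chain is kept because removing its only leaf edge gives a known frequent pattern. The fork is discarded because one of its subtrees, `{("x", "y"), ("y", "w")}`, is not in the frequent set.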
Support Evaluation via SQL – The input graph is stored in a relational schema (Node(ID,Label), Edge(Source,Target,Label)). Each candidate tree query is translated into a single SQL statement that joins the Node and Edge tables according to the tree’s structure. Existential nodes take part in the joins like any other node but are left out of the result that is counted, ensuring they do not contribute to the support count. Because the translation yields standard SQL, the algorithm can exploit the query optimizer of any commercial RDBMS, achieving high performance without custom graph engines.
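The translation step can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under assumptions, not the paper's actual translation:

- The schema follows the Node/Edge layout described above.
- Each tree edge becomes one alias of the Edge table.
- Constants are matched on Node.Label.
- Support is the number of distinct bindings of the counted variables.
- Edge labels are ignored, and a single-node query with no edges is not handled.

The function names and the small food-web data are made up for the example.

```python
import sqlite3


def build_example_graph():
    """Create a tiny in-memory graph (made-up data) in the Node/Edge schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Node (ID INTEGER PRIMARY KEY, Label TEXT)")
    conn.execute("CREATE TABLE Edge (Source INTEGER, Target INTEGER, Label TEXT)")
    conn.executemany("INSERT INTO Node VALUES (?, ?)",
                     [(1, "shark"), (2, "tuna"), (3, "herring"), (4, "plankton")])
    conn.executemany("INSERT INTO Edge VALUES (?, ?, ?)",
                     [(1, 2, "eats"), (1, 3, "eats"), (2, 3, "eats"), (3, 4, "eats")])
    return conn


def tree_query_to_sql(edges, counted, constants=None):
    """Translate a tree query into one SQL statement.

    edges:     (parent_var, child_var) pairs forming a tree (at least one edge)
    counted:   variables whose distinct bindings make up the support
    constants: mapping from a variable to the label it is fixed to
    Variables that appear in neither `counted` nor `constants` act as
    existential nodes: they are joined but not selected.
    """
    constants = constants or {}
    from_parts, where_parts, params, column_of = [], [], [], {}
    for i, (parent, child) in enumerate(edges):
        alias = f"e{i}"
        from_parts.append(f"Edge AS {alias}")
        for var, col in ((parent, f"{alias}.Source"), (child, f"{alias}.Target")):
            if var in column_of:
                where_parts.append(f"{column_of[var]} = {col}")  # join on shared node
            else:
                column_of[var] = col
    for var, label in constants.items():
        from_parts.append(f"Node AS n_{var}")
        where_parts.append(f"n_{var}.ID = {column_of[var]}")
        where_parts.append(f"n_{var}.Label = ?")
        params.append(label)
    select_cols = ", ".join(f"{column_of[v]} AS c{j}" for j, v in enumerate(counted))
    where = f" WHERE {' AND '.join(where_parts)}" if where_parts else ""
    sql = (f"SELECT COUNT(*) FROM (SELECT DISTINCT {select_cols} "
           f"FROM {', '.join(from_parts)}{where})")
    return sql, params


def support(conn, edges, counted, constants=None):
    sql, params = tree_query_to_sql(edges, counted, constants)
    return conn.execute(sql, params).fetchone()[0]


conn = build_example_graph()
path = [("x", "y"), ("y", "z")]  # x -> y -> z
print(tree_query_to_sql(path, ["x"])[0])
print(support(conn, path, ["x"]))                    # y and z existential
print(support(conn, path, ["x", "z"]))               # only y existential
print(support(conn, path, ["x"], {"z": "herring"}))  # z fixed to a label
```

On this data the same two-edge path gives different supports depending on which nodes are counted. That difference is what existential nodes and constants add over plain tree patterns.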

After the set of frequent tree queries has been discovered, the authors turn to tree‑query association rule mining. A rule has the form Q₁ → Q₂ where Q₁ ⊆ Q₂ (i.e., Q₁ is a sub‑tree of Q₂). The support of a rule is defined as the support of Q₂, while confidence is computed as support(Q₂) / support(Q₁). To efficiently enumerate candidate rules, the system pre‑computes a query‑to‑query mapping that records all containment relationships among the frequent queries. Rules that fail to meet user‑defined thresholds for support and confidence are filtered out, leaving a concise set of interpretable associations.
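The rule filter described above can be sketched in a few lines. It assumes the supports and the containment pairs (Q₁, Q₂) have already been computed. The query names, support values, and thresholds are made up, and `mine_rules` is a hypothetical helper rather than the authors' code.

```python
def mine_rules(supports, contained_pairs, min_support, min_confidence):
    """Return (q1, q2, support, confidence) for every rule q1 -> q2 that passes.

    supports:        mapping from query name to its support
    contained_pairs: (q1, q2) pairs where q1 is a sub-tree of q2
    """
    rules = []
    for q1, q2 in contained_pairs:
        s1, s2 = supports[q1], supports[q2]
        if s1 == 0:
            continue  # confidence is undefined when the body never occurs
        confidence = s2 / s1  # support(Q2) / support(Q1)
        if s2 >= min_support and confidence >= min_confidence:
            rules.append((q1, q2, s2, confidence))
    return rules


# Made-up supports for three frequent queries.
supports = {"chain2": 40, "chain3": 30, "fork": 8}
pairs = [("chain2", "chain3"), ("chain2", "fork")]

for rule in mine_rules(supports, pairs, min_support=10, min_confidence=0.5):
    print(rule)
```

With these numbers only `chain2 -> chain3` survives (support 30, confidence 0.75). The `fork` rule fails both thresholds.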

The implementation is purely database‑oriented. All phases—candidate generation, pruning, support counting, and rule evaluation—are expressed as stored procedures and SQL scripts. This choice eliminates the need for specialized graph processing infrastructure and makes the approach readily deployable on existing data‑warehouse platforms.

The experimental evaluation uses three real‑world datasets:

- Food‑web network – a directed graph of predator‑prey relationships among marine species.
- Protein‑protein interaction (PPI) network – an undirected graph where nodes are proteins and edges denote experimentally verified interactions.
- Citation network – a directed graph of academic papers citing one another.

Across all datasets, the tree‑query miner outperforms a baseline subgraph miner (gSpan) by a factor of 3–5 in runtime while using substantially less memory. Moreover, the discovered tree queries are more readily interpretable: constants highlight biologically meaningful protein families, and existential nodes allow the algorithm to focus on core interaction motifs while ignoring peripheral variations. The association rules reveal, for example, that a simple predator‑prey chain in the food‑web often co‑occurs with a higher‑order trophic cascade, or that a specific protein interaction motif frequently precedes a larger signaling complex in the PPI network.

In conclusion, the paper makes three key contributions: (1) a novel pattern class (tree queries) that balances expressive power with interpretability, (2) provably optimal mining algorithms grounded in conjunctive‑query theory, and (3) a practical, SQL‑based system that demonstrates scalability on real‑world graph data. The authors suggest future work on extending the framework to directed acyclic graphs (DAGs) and on distributing the computation across multiple database nodes to handle even larger graphs.

📜 Original Paper Content