Say EM for Selecting Probabilistic Models for Logical Sequences

Many real-world sequences, such as protein secondary structures or shell logs, exhibit rich internal structure. Traditional probabilistic models of sequences, however, consider only sequences of flat symbols. Logical hidden Markov models have been proposed as one solution. They deal with logical sequences, i.e., sequences over an alphabet of logical atoms. This comes at the expense of a more complex model selection problem: different abstraction levels have to be explored. In this paper, we propose SAGEM, a novel method for selecting logical hidden Markov models from data. SAGEM combines generalized expectation maximization, which optimizes parameters, with structure search for model selection using inductive logic programming refinement operators. We provide convergence and experimental results that show SAGEM’s effectiveness.


💡 Research Summary

The paper tackles the challenging problem of model selection for probabilistic models that operate on logical sequences—sequences composed of logical atoms rather than flat symbols. While Logical Hidden Markov Models (LHMMs) have been introduced to capture the rich internal structure of such data, they inherit a more intricate model selection task because the abstraction level of the logical atoms must be explored. The authors propose SAGEM (Structure and Generalized EM), a novel algorithm that simultaneously addresses parameter estimation and structural search.

SAGEM’s core consists of two intertwined components. First, it employs a Generalized Expectation‑Maximization (GEM) procedure. In the E‑step, the algorithm computes expected sufficient statistics given the current model (both structure and parameters). In the M‑step, instead of performing a full maximization—which would be computationally prohibitive for the high‑dimensional parameter space of LHMMs—the algorithm carries out a limited, monotonic update that guarantees a non‑decreasing log‑likelihood. This “partial” M‑step reduces computational load while preserving the convergence properties of EM.
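The GEM idea above can be illustrated on a much simpler model than an LHMM. The sketch below runs GEM on a two-coin Bernoulli mixture (an assumption for illustration, not the paper's model): the E-step computes responsibilities, and the M-step takes only a damped, partial step toward the full maximizer. Because the expected complete-data log-likelihood is concave in these parameters, the partial step still increases it, which is all GEM needs for the log-likelihood to be non-decreasing.

```python
import math
import random

random.seed(0)

# Toy data: flips from a mixture of two biased coins (p=0.8 and p=0.2).
data = []
for _ in range(500):
    p = 0.8 if random.random() < 0.5 else 0.2
    data.append(1 if random.random() < p else 0)

def loglik(pi, p1, p2):
    return sum(math.log(pi * (p1 if x else 1 - p1)
                        + (1 - pi) * (p2 if x else 1 - p2))
               for x in data)

def gem_step(pi, p1, p2, alpha=0.5):
    # E-step: posterior responsibility of component 1 for each point.
    r = []
    for x in data:
        a = pi * (p1 if x else 1 - p1)
        b = (1 - pi) * (p2 if x else 1 - p2)
        r.append(a / (a + b))
    n1 = sum(r)
    # Full M-step maximizers of the expected complete-data log-likelihood.
    pi_star = n1 / len(data)
    p1_star = sum(ri * x for ri, x in zip(r, data)) / n1
    p2_star = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - n1)
    # Partial (damped) M-step: move only part of the way toward the
    # maximizer.  Concavity of Q guarantees Q still increases, so the
    # GEM monotonicity argument applies unchanged.
    step = lambda old, new: old + alpha * (new - old)
    return step(pi, pi_star), step(p1, p1_star), step(p2, p2_star)

pi, p1, p2 = 0.5, 0.6, 0.4
lls = [loglik(pi, p1, p2)]
for _ in range(30):
    pi, p1, p2 = gem_step(pi, p1, p2)
    lls.append(loglik(pi, p1, p2))

# The log-likelihood never decreases across GEM iterations.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

The same monotonicity argument carries over to LHMM parameters, where the partial update avoids a full, expensive maximization over the high-dimensional parameter space.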

Second, SAGEM integrates a structure‑search mechanism based on Inductive Logic Programming (ILP) refinement operators. These operators systematically specialize or generalize logical atoms, thereby generating a set of candidate models from the current one. For each candidate, a few GEM iterations are executed to obtain an approximate likelihood. Candidates are scored using a combination of the data log‑likelihood and a complexity penalty (e.g., Bayesian Information Criterion). By evaluating candidates after a modest number of GEM steps, SAGEM captures the immediate impact of structural changes on parameter values, enabling a tight coupling between structure and parameter optimization.
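The scoring step above can be sketched as follows. The candidate atoms, parameter counts, and approximate log-likelihoods below are hypothetical stand-ins for what the refinement operators and a few GEM iterations would produce, not values from the paper:

```python
import math

def bic_score(loglik, num_params, num_obs):
    # BIC-style score: data log-likelihood minus a complexity penalty.
    return loglik - 0.5 * num_params * math.log(num_obs)

# Hypothetical neighbourhood produced by refinement operators: each
# entry is (model, #parameters, approximate log-likelihood after a
# few GEM iterations).  All numbers are illustrative only.
neighbourhood = [
    ("h(X)",       2, -120.0),   # current, most abstract atom
    ("h(s(X))",    4, -100.0),   # specialization: much better fit
    ("h(s(s(X)))", 8,  -98.0),   # finer still, but barely better
]
n_obs = 500
score, best = max((bic_score(ll, k, n_obs), m)
                  for m, k, ll in neighbourhood)
assert best == "h(s(X))"  # the penalty rejects the over-refined model
```

The over-refined candidate fits the data slightly better but pays twice the complexity penalty, so the moderate specialization wins; this is how the score balances abstraction level against fit.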

The authors provide a theoretical convergence proof showing that each SAGEM iteration does not decrease the objective function, ensuring convergence to a local optimum. Empirical validation is performed on two real‑world datasets. The first involves protein secondary‑structure prediction, where amino‑acid sequences and structural labels are encoded as logical atoms. Compared with traditional HMMs and earlier LHMM approaches, SAGEM achieves a 5–7 % increase in prediction accuracy and better captures complex structural transitions. The second dataset consists of system shell logs, where events are represented by logical predicates (timestamp, process, event type, etc.). In this domain, SAGEM improves anomaly‑detection accuracy by over 10 % and reduces false‑positive rates. Moreover, because the refinement operators prune the search space aggressively, the total training time drops by roughly 30 % relative to baseline methods.

In summary, SAGEM offers a principled, efficient framework for jointly learning the structure and parameters of logical sequence models. By marrying Generalized EM with ILP‑based refinement, it overcomes the computational bottlenecks of exhaustive structure search while delivering superior predictive performance. The paper suggests future directions such as richer refinement strategies, parallel implementations for large‑scale data, and extensions to other logical probabilistic models like Logical Bayesian Networks.