A submodular-supermodular procedure with applications to discriminative structure learning

In this paper, we present an algorithm for minimizing the difference between two submodular functions using a variational framework that is based on (an extension of) the concave-convex procedure [17]. Because several commonly used metrics in machine learning, like mutual information and conditional mutual information, are submodular, the problem of minimizing the difference of two submodular functions arises naturally in many machine learning applications. Two such applications are learning discriminatively structured graphical models and feature selection under computational complexity constraints. A commonly used metric for measuring discriminative capacity is the EAR measure, which is the difference between two conditional mutual information terms. Feature selection that takes complexity considerations into account also falls into this framework, because both the information that a set of features provides and the cost of computing and using the features can be modeled as submodular functions. This problem is NP-hard, and we give a polynomial time heuristic for it. We also present results on synthetic data to show that classifiers based on discriminative graphical models using this algorithm can significantly outperform classifiers based on generative graphical models.

💡 Research Summary

The paper tackles the problem of minimizing the difference between two submodular functions, a formulation that naturally arises in many machine learning tasks where information-theoretic quantities such as mutual information (MI) and conditional mutual information (CMI) are involved. The authors introduce a “submodular‑supermodular procedure” (SSP), an adaptation of the concave‑convex procedure (CCCP) to the discrete domain of set functions.
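To make the formulation concrete, the following minimal sketch builds a toy instance of the problem and checks the relevant definitions by brute force. Everything in it is an illustrative assumption rather than something taken from the paper: the three-element ground set `V`, the `COVERS` table, the toy "cost" `f`, the toy "information" `g`, and the helper names. It only illustrates that a difference of two submodular functions need not be submodular or supermodular itself, which is why a dedicated procedure is needed; it does not implement SSP.

```python
from itertools import combinations

# Hypothetical toy instance: ground set and the items each element "covers".
V = (0, 1, 2)
COVERS = {0: {"a"}, 1: {"a"}, 2: {"b"}}


def f(S):
    """Toy submodular 'cost': a concave function of the cardinality of S."""
    return min(len(S), 2)


def g(S):
    """Toy submodular 'information': weighted set coverage."""
    covered = set().union(*(COVERS[e] for e in S))
    return 2 * len(covered)


def F(S):
    """Objective of interest: a difference of two submodular functions."""
    return f(S) - g(S)


def subsets(ground):
    """Yield every subset of the ground set as a frozenset."""
    for r in range(len(ground) + 1):
        for combo in combinations(ground, r):
            yield frozenset(combo)


def is_submodular(h, ground):
    """Brute-force check of h(A) + h(B) >= h(A | B) + h(A & B) for all A, B."""
    all_sets = list(subsets(ground))
    return all(
        h(A) + h(B) >= h(A | B) + h(A & B)
        for A in all_sets
        for B in all_sets
    )


# Exhaustive minimization is only feasible because the ground set is tiny.
best = min(subsets(V), key=F)

print("f submodular:", is_submodular(f, V))
print("g submodular:", is_submodular(g, V))
print("F submodular:", is_submodular(F, V))
print("F supermodular:", is_submodular(lambda S: -F(S), V))
print("minimum value of F:", F(best))
```

On this toy instance both `f` and `g` pass the check while `F` is neither submodular nor supermodular. Exhaustive search enumerates all 2^|V| subsets, so it serves only as a reference for very small ground sets.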

The core idea is to view the objective F(S)=f(S)−g(S), with f and g both submodular, as the sum of a submodular part f and a supermodular part −g (the negative of a submodular function). At each iteration a modular (linear) approximation m_t of g is obtained by computing a subgradient (i.e., the marginal contributions of each element at the current set S_t). The surrogate problem min_S