Online Multiple Kernel Learning for Structured Prediction
Despite the recent progress towards efficient multiple kernel learning (MKL), the structured output case remains an open research front. Current approaches involve repeatedly solving a batch learning problem, which makes them inadequate for large scale scenarios. We propose a new family of online proximal algorithms for MKL (as well as for group-lasso and variants thereof), which overcomes that drawback. We show regret, convergence, and generalization bounds for the proposed method. Experiments on handwriting recognition and dependency parsing attest to the success of the approach.
💡 Research Summary
The paper addresses the challenge of applying multiple kernel learning (MKL) to structured prediction tasks, where the output space consists of complex objects such as sequences, trees, or graphs. Traditional MKL methods rely on batch optimization, repeatedly solving a large‐scale convex problem over the entire training set. This approach becomes computationally prohibitive when the number of training examples or the dimensionality of the feature space grows, and it is ill‑suited for streaming or online scenarios.
To overcome these limitations, the authors propose a family of online proximal algorithms that integrate MKL (and, by extension, group‑lasso and related regularizers) into a single, efficient update rule. At each time step t, an incoming example (x_t, y_t) is processed by computing a structured sub‑gradient g_t of the loss with respect to the current model parameters w_t. The loss may be any convex structured loss (e.g., the structured hinge loss) that admits an efficient oracle for sub‑gradient computation. Sparsity across kernels is promoted by a mixed ℓ₁/ℓ₂ (group‑lasso) regularizer Ω(w)=∑_{k=1}^K‖w_k‖₂, where w_k denotes the weight vector associated with kernel k; crucially, Ω is handled not through its (sub‑)gradient but through its proximal operator.
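To make the sub‑gradient oracle concrete, the sketch below computes the structured hinge sub‑gradient via loss‑augmented decoding. It enumerates a toy candidate output space for clarity; in real structured problems that argmax would be delegated to a combinatorial oracle (Viterbi, CKY, maximum spanning tree). All names here are illustrative, not from the paper.

```python
import numpy as np

def structured_hinge_subgradient(w, phi_true, candidates):
    """Sub-gradient of the structured hinge loss at w for one example.

    candidates enumerates (phi(x, y), cost(y, y_true)) pairs over the
    output space; real structured tasks replace this enumeration with an
    efficient loss-augmented decoding oracle.
    """
    # Loss-augmented decoding: find the most violating output y_hat.
    scores = [w @ phi + cost for phi, cost in candidates]
    phi_hat, cost_hat = candidates[int(np.argmax(scores))]
    # The hinge is active only when the margin is violated.
    if w @ (phi_hat - phi_true) + cost_hat > 0:
        return phi_hat - phi_true          # sub-gradient g_t
    return np.zeros_like(w)                # zero loss: zero sub-gradient
```

With w = 0 the loss‑augmented decoder picks the candidate with the highest cost, and the returned difference of feature vectors is exactly the g_t fed into the proximal update described next.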
The core of the method is a proximal step that simultaneously performs a gradient descent update and a group‑wise soft‑thresholding operation:
- Intermediate update: w̃ = w_t – η_t g_t (a plain sub‑gradient step on the loss alone)
- Group‑wise shrinkage: w_{t+1,k} = max(0, 1 – η_t λ /‖w̃_k‖₂)·w̃_k
Because the proximal operator of the ℓ₁/ℓ₂ norm has a closed‑form solution, each iteration runs in O(d) time (where d is the total feature dimension across all K kernels), matching the cost of standard online sub‑gradient methods while automatically inducing sparsity at the kernel level.
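The group‑wise shrinkage above amounts to a few lines of code. The following is a minimal sketch of the closed‑form proximal operator (function and argument names are assumptions, not the authors' implementation):

```python
import numpy as np

def prox_group_lasso(w_tilde, groups, eta, lam):
    """Closed-form proximal operator of eta * lam * sum_k ||w_k||_2.

    groups: list of index arrays, one per kernel block w_k.
    Cost is linear in the total feature dimension.
    """
    w_new = w_tilde.copy()
    for idx in groups:
        norm = np.linalg.norm(w_tilde[idx])
        if norm <= eta * lam:
            w_new[idx] = 0.0                       # whole kernel switched off
        else:
            w_new[idx] *= 1.0 - eta * lam / norm   # group-wise shrinkage
    return w_new
```

Blocks whose norm falls below the threshold η·λ are zeroed out entirely, which is exactly how kernel‑level sparsity arises.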
The authors provide a thorough theoretical analysis. For convex losses they prove an O(√T) regret bound for the cumulative loss over T rounds, comparable to the best known bounds for online convex optimization with composite regularizers. They also establish convergence of the iterates to the optimal solution when the learning rate η_t decays appropriately (e.g., η_t = η₀/√t). For generalization, they derive Rademacher‑complexity‑based error bounds that incorporate the effect of the group‑lasso regularizer, showing that the online procedure yields models with controlled capacity and good out‑of‑sample performance.
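Putting the pieces together, the full online procedure with the decaying step size η_t = η₀/√t can be sketched as follows. A toy quadratic loss stands in for a structured loss, and all names are illustrative assumptions rather than the authors' code:

```python
import numpy as np

def online_proximal_mkl(examples, subgrad, groups, lam, eta0, dim):
    """Online proximal sketch with learning rate eta_t = eta0 / sqrt(t).

    subgrad(w, example) returns a sub-gradient of the loss at w;
    groups lists the feature indices of each kernel block w_k.
    """
    w = np.zeros(dim)
    for t, ex in enumerate(examples, start=1):
        eta = eta0 / np.sqrt(t)                  # decaying step size
        w_tilde = w - eta * subgrad(w, ex)       # gradient step on the loss only
        for idx in groups:                       # proximal step: group shrinkage
            norm = np.linalg.norm(w_tilde[idx])
            if norm <= eta * lam:
                w_tilde[idx] = 0.0               # kernel block zeroed out
            else:
                w_tilde[idx] *= 1.0 - eta * lam / norm
        w = w_tilde
    return w
```

Streaming examples whose informative features live in only some kernel blocks drives the uninformative blocks exactly to zero while the useful blocks converge to shrunken weights.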
Empirical evaluation is conducted on two representative structured prediction problems.
- Handwritten digit recognition: The authors construct a multi‑kernel representation of MNIST‑derived images using linear, polynomial, and RBF kernels applied to raw pixels and engineered features. The online MKL algorithm achieves an error rate of 1.8 % (accuracy 98.2 %), essentially matching a state‑of‑the‑art batch MKL baseline while being 5–10× faster in wall‑clock time.
- Dependency parsing: Using the Penn Treebank, the method combines ten kernels that encode word embeddings, part‑of‑speech tags, distance features, and other linguistic cues. The online approach reaches an unlabeled attachment score (UAS) of 93.1 %, comparable to the best batch parsers, and processes sentences at a rate several times higher than the batch counterpart. Importantly, the authors visualize the evolution of kernel weights as the data stream changes, demonstrating that the algorithm dynamically re‑weights kernels to adapt to distribution shifts.
Overall, the paper makes three key contributions: (1) a novel online proximal framework that unifies MKL and group‑lasso regularization for structured outputs; (2) rigorous regret, convergence, and generalization analyses that extend existing online learning theory to the composite‑regularizer setting; and (3) convincing experimental evidence that the method scales to large‑scale, real‑world structured prediction tasks while retaining or improving predictive accuracy.
The work opens several avenues for future research, including extensions to non‑convex structured losses, asynchronous distributed implementations, and automated kernel generation or selection mechanisms that could further enhance scalability and adaptability in ever‑larger streaming environments.