Model Building with Multiple Dependent Variables and Constraints
The most widely used method for finding relationships between several quantities is multiple regression. This, however, is restricted to a single dependent variable. We present a more general method which allows models to be constructed with multiple variables on both sides of an equation and which can be computed easily using a spreadsheet program. The underlying principle (originating from canonical correlation analysis) is that of maximising the correlation between the two sides of the model equation. This paper presents a fitting procedure which makes it possible to force the estimated model to satisfy constraint conditions it is required to possess; these may arise from theory, from prior knowledge, or be intuitively obvious. We also show that the least squares approach to the problem is inadequate, as it produces models which are not scale invariant.
💡 Research Summary
The paper addresses a fundamental limitation of traditional multiple regression, namely its focus on a single dependent variable, by introducing a method that simultaneously handles multiple dependent and independent variables within a single equation. The core idea is to form two composite variables—X as a weighted sum of the independent variables and Y as a weighted sum of the dependent variables—and then to maximise the Pearson correlation between X and Y. This “maximum correlation modelling” is conceptually rooted in canonical correlation analysis (CCA), but the author recasts the problem as a constrained optimisation that can be solved with ordinary spreadsheet software.
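The core construction described above can be sketched in a few lines of numpy. The data values and weights below are purely illustrative (not taken from the paper); the point is simply that each composite is a weighted sum of its set of variables, and the objective is their Pearson correlation:

```python
import numpy as np

# Hypothetical data for illustration: 4 observations,
# 2 independent variables and 2 dependent variables.
data_x = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0]])
data_y = np.array([[0.5, 1.0], [1.0, 1.2], [1.8, 2.0], [2.0, 2.5]])
a = np.array([0.6, 0.4])  # weights on the independent side
b = np.array([0.7, 0.3])  # weights on the dependent side

X = data_x @ a  # composite independent variable
Y = data_y @ b  # composite dependent variable

# Pearson correlation between the two composites,
# the analogue of a spreadsheet's CORREL(X, Y) cell.
r = np.corrcoef(X, Y)[0, 1]
print(r)
```

Maximum-correlation modelling then searches over the weight vectors `a` and `b` to make `r` as large as possible, subject to whatever constraints the modeller imposes.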
The methodological contribution consists of three parts. First, the author explains how CCA seeks linear combinations that maximise inter‑set correlation, yet standard CCA solutions often produce coefficients with undesirable signs or magnitudes and provide no straightforward way to impose external constraints. Second, the paper shows how to implement the optimisation in a spreadsheet (e.g., Microsoft Excel) using its built‑in Solver (or similar optimiser). The data are arranged with one variable per column; a dedicated row holds the decision‑variable weights. Two additional columns compute X and Y as the dot‑product of weights and data, and a separate cell evaluates CORREL(X,Y). The Solver is instructed to maximise this correlation, while the user supplies any number of linear equality or inequality constraints on the weights (e.g., non‑negativity, ordering, sum‑to‑one, or integer restrictions). The author stresses the importance of disabling the “assume linear problem” option, enabling automatic scaling, and adjusting the convergence tolerance to ensure that the optimality conditions are truly satisfied.
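The spreadsheet Solver workflow just described can be mirrored outside Excel with any general-purpose constrained optimiser. The sketch below uses `scipy.optimize.minimize` (SLSQP) on synthetic data; the variable counts, the ordering constraint on the dependent-side weights, and the normalisation fixing one weight to 1 are modelled on the paper's setup, but the data themselves are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data standing in for the spreadsheet columns:
# 100 observations, 3 dependent variables (ys), 4 independent variables (xs).
rng = np.random.default_rng(0)
n = 100
xs = rng.normal(size=(n, 4))
ys = xs @ rng.normal(size=(4, 3)) + rng.normal(scale=0.5, size=(n, 3))

def neg_corr(w):
    """Negative of CORREL(X, Y); minimising this maximises the correlation."""
    b, a = w[:3], w[3:]
    Y = ys @ b   # composite dependent variable
    X = xs @ a   # composite independent variable
    return -np.corrcoef(X, Y)[0, 1]

# Constraints in the style the paper describes for the Solver:
# an ordering b1 >= b2 >= b3, plus a normalisation fixing one weight.
cons = [
    {"type": "ineq", "fun": lambda w: w[0] - w[1]},  # b1 >= b2
    {"type": "ineq", "fun": lambda w: w[1] - w[2]},  # b2 >= b3
    {"type": "eq",   "fun": lambda w: w[2] - 1.0},   # fix b3 = 1
]

res = minimize(neg_corr, x0=np.ones(7), method="SLSQP", constraints=cons)
print("maximum correlation:", -res.fun)
print("dependent weights b:", res.x[:3])
print("independent weights a:", res.x[3:])
```

As in the spreadsheet version, the decision variables are the weights, the objective cell is the correlation, and the constraints are supplied directly to the optimiser rather than typed into Solver's constraint dialog.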
Third, the paper contrasts this approach with a naïve least‑squares formulation that minimises the sum of squared residuals of a linear equation of the form Σᵢ bᵢyᵢ − Σⱼ aⱼxⱼ − c = e. Because a normalisation (e.g., fixing one coefficient to 1) is required to avoid the trivial zero solution, the resulting model depends on which coefficient is normalised, leading to non‑equivalent regressions and a lack of scale invariance. In contrast, the maximum‑correlation method yields models that are invariant to the units of measurement: changing a variable’s scale merely rescales the associated coefficient, leaving the substantive relationship unchanged.
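The scale-invariance property is easy to verify numerically: because the Pearson correlation is unchanged by linear rescaling, re-expressing a variable in different units and adjusting its weight by the reciprocal factor leaves the objective value identical. A minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2 * x1 + x2 + rng.normal(scale=0.3, size=50)

# Correlation of the composite X = 2*x1 + 1*x2 with y.
a = np.array([2.0, 1.0])
r_original = np.corrcoef(np.column_stack([x1, x2]) @ a, y)[0, 1]

# Rescale x1 (e.g., metres -> centimetres) and shrink its weight
# by the same factor; the correlation is unchanged.
x1_cm = 100 * x1
a_rescaled = np.array([2.0 / 100, 1.0])
r_rescaled = np.corrcoef(np.column_stack([x1_cm, x2]) @ a_rescaled, y)[0, 1]

print(r_original, r_rescaled)
```

A least-squares fit with a fixed normalisation, by contrast, gives answers that depend on which coefficient was pinned to 1, which is the defect the paper identifies.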
An empirical illustration uses data from 96 English Local Education Authorities. Three outcome variables (percentages of pupils achieving various exam thresholds) constitute the dependent set, while six contextual variables (school expenditure, socioeconomic composition, housing quality, proportion of native‑born pupils, population density and its square) form the independent set. The author imposes the constraint b₁ ≥ b₂ ≥ b₃ on the outcome weights to reflect the increasing difficulty of the achievement levels. After running the Solver, the optimal composite variables are:
Y = 2.871 y₁ + 1 y₂ + 1 y₃
X = 0.0071 x₁ + 0.471 x₂ + 0.432 x₃ − 0.0083 x₄ + 0.1007 x₅ − 0.0025 x₆
with a correlation of 0.9023, indicating a very strong linear relationship. The sign and magnitude of the coefficients provide substantive insights—for example, a small negative weight on the proportion of pupils from non‑UK backgrounds suggests that a higher proportion of such pupils is associated with better performance, consistent with other studies.
The paper concludes that maximum‑correlation modelling offers a practical, transparent, and computationally inexpensive alternative to traditional regression for situations where multiple outcomes and predictors must be combined. By allowing arbitrary linear constraints, the method can incorporate theoretical knowledge, policy requirements, or interpretability considerations directly into the model‑building process. Its scale‑invariant property ensures robustness to unit changes, and its implementation in ubiquitous spreadsheet tools makes it accessible to researchers across disciplines without specialized statistical software.