Covariance and PCA for Categorical Variables

Covariances between categorical variables are defined using a regular simplex expression for categories. The method follows Gini's definition of variance, and it gives the covariance as the solution of simultaneous equations. The calculated results give reasonable values for test data. A method of principal component analysis (RS-PCA) is also proposed using regular simplex expressions, which allows easy interpretation of the principal components. The proposed methods are applied to a variable selection problem on the categorical USCensus1990 data. They give an appropriate criterion for the variable selection problem of categorical data.

💡 Research Summary

The paper introduces a novel framework for quantifying relationships among categorical variables by embedding categories into a regular simplex and then applying classical multivariate techniques such as covariance and principal component analysis (PCA). The authors begin by revisiting Gini’s variance definition for categorical data, which avoids assigning arbitrary numeric scores to categories. Building on this, each of the k categories of a variable is mapped to a vertex of a (k‑1)-dimensional regular simplex. In a regular simplex all vertices are equidistant, guaranteeing that the distance between any two categories is the same, which preserves symmetry and eliminates bias introduced by arbitrary coding schemes.
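
The simplex embedding described above can be sketched in a few lines of NumPy. This is an illustrative construction, not the authors' code: the function names `regular_simplex_vertices` and `pairwise_distances` are hypothetical, and the unit edge length is an assumption (the summary only requires that all vertices be equidistant). The sketch starts from one-hot vectors, which are already mutually equidistant in k dimensions, centers them, and uses an SVD to express them in k-1 coordinates.

```python
import numpy as np


def regular_simplex_vertices(k, edge=1.0):
    """Return a (k, k-1) array whose rows are the vertices of a regular simplex.

    Hypothetical helper: the edge length (default 1.0) is an assumed convention.
    """
    if k < 2:
        raise ValueError("need at least two categories")
    # One-hot rows are mutually equidistant (distance sqrt(2)) in R^k.
    one_hot = np.eye(k)
    # Centering puts the centroid at the origin; the points then span
    # only a (k-1)-dimensional subspace.
    centered = one_hot - one_hot.mean(axis=0)
    # The SVD gives coordinates of the centered points in that subspace.
    # The last singular value is (numerically) zero and is dropped.
    u, s, _ = np.linalg.svd(centered)
    coords = u[:, : k - 1] * s[: k - 1]
    # Rescale so that every pair of vertices is `edge` apart.
    return coords * (edge / np.sqrt(2.0))


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# A variable with k = 4 categories maps to 4 vertices in 3 dimensions.
vertices = regular_simplex_vertices(4)
dist = pairwise_distances(vertices)
```

Because the SVD basis is only determined up to rotation, the individual coordinates are not unique, but the pairwise distances are: every off-diagonal entry of `dist` equals the chosen edge length, which is the symmetry property the embedding relies on.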

With this geometric representation, a categorical variable X with n observations is transformed into an n × (k‑1) matrix \(\hat{X}\) whose rows are the simplex coordinates of the observed categories. For two variables X and Y, the covariance is defined as