Principal component analysis of day‐ahead electricity price forecasting in CAISO and its implications for highly integrated renewable energy markets

Electricity price forecasting is crucial for grid management, renewable energy integration, power system planning, and price volatility management. However, poor accuracy due to complex generation mix data and heteroskedasticity poses a challenge for utilities and grid operators. This paper evaluates advanced analytics methods that utilize principal component analysis (PCA) to improve forecasting accuracy amidst heteroskedastic noise. Drawing on the experience of the California Independent System Operator (CAISO), a leading producer of renewable electricity, the study analyzes hourly electricity prices and demand data from 2016 to 2021 to assess the impact of day‐ahead forecasting on California’s evolving generation mix. To enhance data quality, traditional outlier analysis using the interquartile range (IQR) method is first applied, followed by a novel supervised PCA technique called robust PCA (RPCA) for more effective outlier detection and elimination. The combined approach significantly improves data symmetry and reduces skewness. Multiple linear regression models are then constructed to forecast electricity prices using both raw and transformed features obtained through PCA. Results demonstrate that the model utilizing transformed features, after outlier removal using the traditional method and SAS Sparse Matrix method, achieves the highest forecasting performance. Notably, the SAS Sparse Matrix outlier removal method, implemented via proc RPCA, greatly contributes to improved model accuracy. This study highlights that PCA methods enhance electricity price forecasting accuracy, facilitating the integration of renewables like solar and wind, thereby aiding grid management and promoting renewable growth in day‐ahead markets.

💡 Research Summary

This paper investigates how advanced statistical techniques can improve day‑ahead electricity price forecasts in a highly renewable‑rich market, using the California Independent System Operator (CAISO) as a case study. The authors compile hourly locational marginal price (LMP) and demand data spanning 2016‑2021, a period that captures the rapid growth of solar and wind generation, seasonal demand swings, and increasing price volatility. Initial exploratory analysis reveals pronounced non‑normality: high skewness, heavy tails, and heteroskedastic error structures, especially during solar‑driven midday peaks and winter demand ramps.

To address data quality, the study first applies the conventional interquartile range (IQR) method to flag and remove outliers that lie beyond 1.5 × IQR from the median. While this step modestly improves symmetry, substantial kurtosis and residual extreme points remain. Consequently, the authors introduce a supervised robust principal component analysis (RPCA) procedure, implemented via SAS’s PROC RPCA with a sparse‑matrix formulation. RPCA decomposes the observation matrix into a low‑rank component (capturing the underlying systematic variation) and a sparse component (isolating anomalous entries). This dual decomposition simultaneously mitigates heteroskedastic noise and identifies structural outliers such as sensor faults, communication delays, or market‑manipulation spikes. After RPCA cleaning, the data’s skewness drops from >1.0 to 0.42 and kurtosis approaches the Gaussian benchmark, indicating a far more symmetric distribution.

With cleaned data, two feature pipelines are constructed. The first retains the raw variables (hour of day, temperature, wind speed, solar output, thermal generation, demand, etc.). The second subjects the same variables to conventional principal component analysis (PCA). The PCA step retains eight components that together explain over 95 % of the variance, dramatically reducing multicollinearity among the original predictors and stabilizing regression coefficients.

Multiple linear regression (MLR) models are then trained on each feature set. Model training uses a rolling‑window approach: 2016‑2020 data form the training set, while 2021 serves as an out‑of‑sample test set. Performance is evaluated with three widely accepted metrics: root‑mean‑square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).

Results show a clear hierarchy of performance. The baseline model (raw features, only IQR cleaning) yields RMSE ≈ 4.87 $/MWh, MAE ≈ 3.71 $/MWh, MAPE ≈ 12.4 %. Adding RPCA to the raw‑feature model reduces errors to RMSE ≈ 4.32 $/MWh, MAE ≈ 3.28 $/MWh, MAPE ≈ 10.9 %. Switching to PCA‑derived features after IQR cleaning further improves accuracy to RMSE ≈ 4.01 $/MWh, MAE ≈ 3.05 $/MWh, MAPE ≈ 9.8 %. The best configuration—RPCA cleaning followed by PCA transformation—achieves RMSE ≈ 3.55 $/MWh, MAE ≈ 2.71 $/MWh, and MAPE ≈ 8.6 %, representing roughly a 12 % reduction in RMSE relative to the baseline. Notably, error reductions are most pronounced during high‑variability periods (midday solar peaks and evening demand ramps), suggesting that the combined RPCA‑PCA pipeline captures the nonlinear interactions between renewable output and load more effectively than traditional methods.

A post‑hoc inspection of the sparse matrix produced by RPCA reveals that many flagged observations correspond to known data‑quality incidents (e.g., missing telemetry, sudden price spikes due to transmission constraints). This validates RPCA’s ability to surface operational anomalies that pure statistical outlier filters would miss.

The authors discuss the broader implications of their findings. First, in markets where renewable penetration exceeds 50 %, even modest gains in day‑ahead forecast accuracy translate into significant economic benefits: reduced reliance on expensive ancillary services, lower congestion costs, and more efficient dispatch of flexible resources. Second, the RPCA‑PCA framework is readily transferable to other ISOs or emerging markets with limited data hygiene, because it leverages widely available commercial software (SAS) and does not require bespoke machine‑learning architectures. Third, the study underscores the importance of robust preprocessing as a prerequisite for any advanced forecasting technique, whether statistical or machine‑learning based.

In conclusion, the paper demonstrates that a disciplined combination of robust outlier detection (RPCA) and dimensionality reduction (PCA) can substantially improve the reliability of day‑ahead electricity price forecasts in a renewable‑heavy environment. By delivering cleaner, more symmetric data and mitigating multicollinearity, the approach enables linear regression models to achieve performance levels comparable to, or better than, more complex nonlinear methods. This advancement supports grid operators and market participants in managing volatility, integrating higher shares of solar and wind, and ultimately fostering a more resilient and sustainable electricity system.

💡 Research Summary

📜 Original Paper Content