Improving the performance of Ripper in insurance risk classification: A comparative study using feature selection

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The Ripper algorithm is designed to generate rule sets for large datasets with many features. However, it has been shown that its classification performance suffers in the presence of missing data: as data quality deteriorates with increasing missingness, the algorithm struggles to classify instances correctly. In this paper, feature selection is used to improve the classification performance of the Ripper model. Two techniques are compared: principal component analysis and automatic relevance determination within the Bayesian evidence framework. Models were trained on datasets with completely observable data, and testing datasets with missing values were used to measure accuracy. The results show that principal component analysis is the more effective feature-selection technique for improving Ripper's classification performance.


💡 Research Summary

This paper addresses the well‑known weakness of the Repeated Incremental Pruning to Produce Error Reduction (Ripper) algorithm when applied to insurance risk classification tasks that involve substantial missing data. While Ripper is praised for its speed and ability to generate interpretable rule sets from large, high‑dimensional datasets, its greedy rule‑growing strategy tends to over‑fit the training data. Consequently, as the proportion of missing values in new instances rises, the algorithm’s classification accuracy deteriorates sharply. To mitigate this problem, the authors propose a preprocessing pipeline that incorporates feature selection before rule induction. Two distinct feature‑selection techniques are evaluated: Principal Component Analysis (PCA) and Automatic Relevance Determination (ARD) applied within a Bayesian neural‑network framework.

The experimental work uses two real‑world insurance datasets. The first is the well‑known UC Irvine “Car Insurance” dataset, containing 5,000 training instances and 4,000 test instances, each described by 86 attributes (5 categorical, 81 continuous). The second is a Texas state liability‑claims dataset, reduced to 5,446 instances (4,000 training, 1,446 testing) with 185 attributes after manual removal of identifiers and irrelevant fields. Both datasets are fully observed for training; for testing, missingness is artificially introduced at five levels (10%, 25%, 30%, 40%, 50%), with the missing values distributed randomly across half of the attributes, yielding five corrupted test sets per dataset.
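The test-set corruption described above can be sketched as follows. The exact masking scheme is an assumption here: the summary only states that missing values are spread randomly across half of the attributes, so this sketch picks that half at random and masks cells independently within it.

```python
import numpy as np

def inject_missingness(X, rate, rng=None):
    """Blank out a fraction `rate` of the values in half of the columns.

    A sketch of the corruption procedure described in the summary:
    missing values are confined to a randomly chosen half of the
    attributes, and within those columns each cell is masked
    independently with probability `rate`.  The paper's exact scheme
    may differ.
    """
    rng = np.random.default_rng(rng)
    X = X.astype(float)                       # astype returns a copy; NaN needs float
    n_rows, n_cols = X.shape
    # choose the half of the attributes that will carry the missing values
    cols = rng.choice(n_cols, size=n_cols // 2, replace=False)
    mask = rng.random((n_rows, len(cols))) < rate
    X[:, cols] = np.where(mask, np.nan, X[:, cols])
    return X
```

Note that with this scheme the overall fraction of missing cells is roughly `rate / 2`, since only half of the columns are eligible for masking.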

In the PCA‑Ripper pipeline, the authors first centre the data, compute the covariance matrix, extract eigenvalues and eigenvectors, and retain the components with the largest eigenvalues. The transformed lower‑dimensional representation is then fed directly into the standard Ripper algorithm. In the ARD‑Ripper pipeline, a two‑layer Bayesian neural network is trained; each input variable is associated with a hyper‑parameter α that controls the prior variance of its weight group. During evidence maximisation, variables with large α are effectively pruned because their weights shrink toward zero. The remaining variables are supplied to Ripper for rule induction.
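The PCA front end described above (centre, covariance, eigendecomposition, retain the leading components) can be sketched in a few lines. The number of retained components is a free parameter here, as the summary does not state how many components were kept:

```python
import numpy as np

def pca_transform(X, n_components):
    """Centre the data, eigendecompose the covariance matrix, and
    project onto the components with the largest eigenvalues -- the
    PCA front end described above."""
    Xc = X - X.mean(axis=0)                 # centre each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    W = eigvecs[:, order[:n_components]]    # retained principal axes
    return Xc @ W                           # lower-dimensional representation
```

The resulting lower-dimensional matrix would then be handed to a Ripper implementation for rule induction.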

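For intuition, the ARD mechanism above can be illustrated on a plain linear model rather than the paper's two-layer Bayesian network. The per-input precision update below is the standard MacKay evidence rule; the noise precision `beta` and the pruning threshold are assumptions of this sketch:

```python
import numpy as np

def ard_select(X, y, beta=1.0, n_iter=50, threshold=1e3, alpha_cap=1e12):
    """Evidence/ARD relevance estimation for a linear model y ~ X w.

    Each input i has a precision hyper-parameter alpha[i] on its weight;
    the evidence updates drive alpha towards infinity for irrelevant
    inputs, whose weights shrink to zero.  Returns the indices of the
    surviving inputs.  `beta` (noise precision) and `threshold` are
    assumptions of this simplified sketch.
    """
    n, d = X.shape
    alpha = np.ones(d)
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(beta * (X.T @ X) + A)  # posterior covariance of w
        m = beta * Sigma @ (X.T @ y)                 # posterior mean of w
        gamma = 1.0 - alpha * np.diag(Sigma)         # "well-determined" measure
        alpha = np.minimum(gamma / np.maximum(m**2, 1e-12), alpha_cap)
    return np.where(alpha < threshold)[0]            # large alpha => pruned
```

Inputs whose α grows without bound contribute nothing to the model and are pruned; the survivors would then be supplied to Ripper for rule induction, as in the ARD‑Ripper pipeline above.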
Results show that, with no missing data, all three models (plain Ripper, PCA‑Ripper, ARD‑Ripper) achieve comparable accuracies. However, as missingness increases, the plain Ripper’s performance drops dramatically, especially beyond 30 % missingness. PCA‑Ripper consistently outperforms the other two approaches across all missing‑data levels, delivering an average improvement of 8–12 percentage points at 50 % missingness. The advantage is most pronounced on the high‑dimensional Texas dataset, where PCA reduces the number of rules by more than 30 % while preserving predictive power. ARD‑Ripper provides modest gains over the baseline but lags behind PCA‑Ripper; its Bayesian pruning sometimes removes variables that, although weakly correlated individually, contribute important non‑linear interactions.

The study concludes that feature selection is essential for maintaining Ripper’s robustness in the presence of missing data. Linear dimensionality reduction via PCA proves more effective than the Bayesian ARD approach for the examined insurance problems, likely because PCA preserves the dominant variance structure while eliminating noisy, redundant dimensions that exacerbate over‑fitting. By integrating PCA before rule induction, practitioners can obtain more compact rule sets, improve generalisation, and achieve higher classification accuracy on incomplete insurance records—an outcome directly relevant to underwriting, claims management, and risk‑based pricing. The authors suggest future work to explore imputation strategies combined with feature selection, as well as non‑linear dimensionality‑reduction techniques (e.g., kernel PCA, t‑SNE) to further enhance Ripper’s performance on complex, missing‑data‑rich domains.

