Cause Identification from Aviation Safety Incident Reports via Weakly Supervised Semantic Lexicon Construction

Cause Identification from Aviation Safety Incident Reports via Weakly   Supervised Semantic Lexicon Construction

The Aviation Safety Reporting System collects voluntarily submitted reports on aviation safety incidents to facilitate research work aiming to reduce such incidents. To effectively reduce these incidents, it is vital to accurately identify why these incidents occurred. More precisely, given a set of possible causes, or shaping factors, this task of cause identification involves identifying all and only those shaping factors that are responsible for the incidents described in a report. We investigate two approaches to cause identification. Both approaches exploit information provided by a semantic lexicon, which is automatically constructed via Thelen and Riloffs Basilisk framework augmented with our linguistic and algorithmic modifications. The first approach labels a report using a simple heuristic, which looks for the words and phrases acquired during the semantic lexicon learning process in the report. The second approach recasts cause identification as a text classification problem, employing supervised and transductive text classification algorithms to learn models from incident reports labeled with shaping factors and using the models to label unseen reports. Our experiments show that both the heuristic-based approach and the learning-based approach (when given sufficient training data) outperform the baseline system significantly.


💡 Research Summary

The paper addresses the problem of automatically identifying the underlying causes—referred to as “shaping factors”—in aviation safety incident reports collected by the Aviation Safety Reporting System (ASRS). Accurate cause identification is essential for safety analysis and preventive measures, yet existing approaches rely heavily on manually crafted keyword lists or rule‑based systems that struggle with the informal, domain‑specific language found in ASRS narratives. To overcome these limitations, the authors propose a two‑pronged methodology that leverages a weakly supervised semantic lexicon automatically generated from the corpus and then applies two distinct labeling strategies.

Semantic Lexicon Construction
The authors adopt the Basilisk framework originally introduced by Thelen and Riloff, but they extend it with several linguistic and algorithmic enhancements tailored to the aviation domain. Starting from a small set of seed terms for each of the twelve predefined shaping factors (e.g., “pilot error”, “weather”, “communication failure”), the system extracts candidate noun phrases and verb phrases from the entire report collection. Candidate scoring combines traditional pattern‑based scores with a novel composite metric that integrates Pointwise Mutual Information (PMI) and TF‑IDF weighting. This hybrid score helps surface low‑frequency but semantically important terms that would otherwise be missed. Additionally, the authors incorporate synonym and abbreviation dictionaries to merge variant spellings (e.g., “ATC” vs. “air‑traffic control”) into single lexicon entries. The final lexicon contains roughly 1,200 lexical items linked to the twelve shaping‑factor categories.

Labeling Approaches
Two labeling pipelines are built on top of the lexicon:

  1. Heuristic Matching – A straightforward rule‑based method that scans each report for exact or stem‑matched occurrences of lexicon entries. When a match is found, the corresponding shaping factor is assigned to the report. This approach requires no labeled training data and can be deployed immediately.

  2. Learning‑Based Classification – A supervised (and transductive) text‑classification framework that treats cause identification as a multi‑label problem. The authors experiment with linear Support Vector Machines, Logistic Regression, and a graph‑based transductive SVM. Feature extraction includes tokenization, stop‑word removal, and lexicon‑driven feature selection. To mitigate class imbalance, they adjust class weights and test oversampling techniques such as SMOTE. The One‑vs‑Rest strategy is employed to handle the twelve overlapping labels.

Experimental Setup and Results
The study uses over 1,200 real ASRS reports, each manually annotated with the applicable shaping factors. A 10‑fold cross‑validation scheme evaluates both pipelines. Baselines consist of a naïve keyword‑list system (the original BASILISK lexicon without enhancements) and a random labeling baseline. Performance is measured using precision, recall, and F1 score.

  • The heuristic matcher achieves an average macro‑F1 of 0.62, excelling in frequent categories like “pilot error” and “equipment failure”.
  • The supervised classifiers surpass the heuristic when sufficient labeled data (>600 reports) are available, reaching a macro‑F1 of 0.78. Notably, transductive SVM shows a 5–8 % gain in recall for rare factors such as “fatigue” and “weather degradation”.
  • Error analysis reveals two main failure modes: (a) over‑general lexical entries that trigger false positives in unrelated contexts, and (b) implicit cause mentions that are not captured by any lexicon term (e.g., subtle references to fatigue without explicit wording).

Conclusions and Future Work
The research demonstrates that a weakly supervised lexicon, when coupled with either a simple heuristic or a more sophisticated classifier, can substantially improve cause identification in aviation safety narratives. The heuristic approach offers a low‑cost, immediate solution for organizations lacking annotated data, while the learning‑based method delivers higher accuracy once a moderate amount of labeled reports is collected. The authors suggest extending the lexicon with contextual embeddings (e.g., BERT) to better capture implicit mentions, and they plan to evaluate the framework on other safety‑critical corpora such as maritime incident reports. Overall, the work provides a scalable, domain‑adaptable pipeline that reduces manual labeling effort while delivering performance gains over traditional keyword‑based baselines.