Procedural Knowledge Extraction from Industrial Troubleshooting Guides Using Vision Language Models
Industrial troubleshooting guides encode diagnostic procedures in flowchart-like diagrams where spatial layout and technical language jointly convey meaning. To integrate this knowledge into operator support systems, which assist shop-floor personnel in diagnosing and resolving equipment issues, the information must first be extracted and structured for machine interpretation. However, when performed manually, this extraction is labor-intensive and error-prone. Vision Language Models (VLMs) offer the potential to automate this process by jointly interpreting visual and textual content, yet their performance on such guides remains underexplored. This paper evaluates two VLMs on extracting structured knowledge, comparing two prompting strategies: a standard instruction-guided prompt versus an augmented prompt that cues troubleshooting layout patterns. Results reveal model-specific trade-offs between layout sensitivity and semantic robustness, informing practical deployment decisions.
💡 Research Summary
This paper investigates the feasibility of using vision‑language models (VLMs) to automatically extract procedural knowledge from industrial troubleshooting guides, which are typically presented as flow‑chart‑style diagrams combining textual labels with visual symbols (rectangles for actions, diamonds for decisions, arrows indicating flow, and “ja/nee” yes/no markers). The authors evaluate two open‑weight VLMs—Pixtral‑12B, which employs a Vision Transformer with cross‑attention between image patches and text tokens, and Qwen2‑VL‑7B, which uses a dynamic‑resolution strategy that processes text‑dense regions at higher resolution. Both models are tested on a proprietary dataset of twelve Dutch‑language troubleshooting manuals (24 pages total), each page containing 30‑100 entities (conditions, decisions, actions) and 30‑60 directed relations.
A uniform extraction schema is defined, mapping the visual elements to three entity types (Condition, Decision, Action) and a single relation type (isPreceededBy). The extraction task is framed as a structured JSON generation problem: given a page‑level image, the model must output a list of entities and their directed relations. Two prompting strategies are compared. The “standard” prompt supplies only the schema and a JSON example, while the “augmented” prompt additionally describes the visual conventions (e.g., diamonds denote decisions, arrows indicate flow direction, “ja/nee” label meanings). No fine‑tuning is performed; models run on an Nvidia A100 GPU with default token limits.
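The schema and the two prompting strategies can be sketched as follows. This is an illustrative reconstruction, not the paper's actual prompt text: the entity and relation names follow the schema described above (including the `isPreceededBy` spelling as given), while the example texts and prompt wording are assumptions.

```python
import json

# Schema as described in the paper: three entity types, one relation type.
ENTITY_TYPES = ["Condition", "Decision", "Action"]
RELATION_TYPE = "isPreceededBy"  # spelling as given in the paper's schema

# Hypothetical JSON example of the expected structured output.
EXAMPLE_OUTPUT = {
    "entities": [
        {"id": "e1", "type": "Condition", "text": "Machine does not start"},
        {"id": "e2", "type": "Decision", "text": "Is the power switched on?"},
        {"id": "e3", "type": "Action", "text": "Switch on main power"},
    ],
    "relations": [
        {"source": "e2", "target": "e1", "type": RELATION_TYPE},
        {"source": "e3", "target": "e2", "type": RELATION_TYPE},
    ],
}

# "Standard" prompt: schema plus a JSON example only.
STANDARD_PROMPT = (
    "Extract all entities and relations from the diagram as JSON. "
    f"Entity types: {', '.join(ENTITY_TYPES)}. "
    f"Relation type: {RELATION_TYPE}. Example output:\n"
    + json.dumps(EXAMPLE_OUTPUT, indent=2)
)

# "Augmented" prompt: additionally spells out the visual conventions.
LAYOUT_CUES = (
    "Rectangles denote actions, diamonds denote decisions, arrows indicate "
    "flow direction, and 'ja'/'nee' labels mark yes/no branches."
)
AUGMENTED_PROMPT = STANDARD_PROMPT + "\n" + LAYOUT_CUES
```

The augmented prompt is a strict superset of the standard one, which is what makes the prompt-length sensitivity discussed below directly attributable to the added layout cues.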
Evaluation uses entity precision, recall, and F1, where a predicted entity counts as a match if its lemmatized text reaches a 0.9 similarity with a gold entity, and analogous metrics for relations (source, target, and type must all match). The gold standard comprises 548 entities and 536 relations manually annotated by domain experts.
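A minimal sketch of this matching scheme is shown below. It differs from the paper's exact setup in one labeled respect: similarity here is character-level (`difflib`) rather than computed over lemmatized tokens, and the greedy one-to-one matching strategy is an assumption.

```python
from difflib import SequenceMatcher


def entity_prf(predicted, gold, threshold=0.9):
    """Precision/recall/F1 over (type, text) entity pairs.

    Greedy one-to-one matching: a prediction matches an unused gold
    entity when types agree and text similarity >= threshold.
    NOTE: difflib ratio is a stand-in for the paper's lemmatized
    similarity; the greedy matching order is also an assumption.
    """
    unmatched = list(gold)
    tp = 0
    for p_type, p_text in predicted:
        for g_type, g_text in unmatched:
            sim = SequenceMatcher(None, p_text.lower(), g_text.lower()).ratio()
            if p_type == g_type and sim >= threshold:
                tp += 1
                unmatched.remove((g_type, g_text))
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Relation scoring would follow the same pattern but require source, target, and type to match simultaneously, which is why relation F1 is bounded above by entity performance: an edge cannot be correct if either endpoint was missed.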
Results show modest entity extraction performance (F1 ranging from 0.24 to 0.34) and very low relation extraction (F1 below 0.11) for both models. Qwen2‑VL‑7B achieves the highest entity F1 (0.340) under the standard prompt, while Pixtral‑12B is slightly more consistent across documents but never exceeds 0.30. Prompt engineering has divergent effects: for Qwen2‑VL‑7B, the augmented prompt improves relation F1 (0.061 → 0.107) at the cost of entity precision (0.305 → 0.203); for Pixtral‑12B, the augmented prompt degrades both entity and relation scores, suggesting that the longer prompt overwhelms its cross‑attention mechanism. Document‑level analysis reveals high variance for Qwen2‑VL‑7B (some pages achieve 0.78 entity F1, others 0.00), indicating sensitivity to layout complexity, scan quality, and possibly the presence of domain‑specific symbols.
The authors discuss that current VLMs can recognize basic shapes and extract textual labels but struggle with the structural reasoning required to reconstruct procedural flows. Architectural differences explain the observed trade‑offs: Qwen2‑VL‑7B’s dynamic resolution favors fine‑grained text but lacks explicit modeling of spatial relationships, whereas Pixtral‑12B’s cross‑attention can link visual regions to text but appears vulnerable to prompt length and token budget constraints.
Key take‑aways include: (1) relation extraction is the primary bottleneck for building usable procedural knowledge graphs; (2) explicit visual‑layout cues can steer models toward better graph connectivity but may reduce textual accuracy; (3) model performance is highly variable across documents, highlighting the need for robustness to heterogeneous industrial diagram styles.
Future work suggested by the authors involves (a) incorporating graph‑based post‑processing that leverages detected entity positions to infer missing edges, (b) domain‑specific fine‑tuning on a larger corpus of industrial diagrams to improve shape‑to‑function mapping, (c) designing multi‑scale attention mechanisms that balance global layout understanding with local text detail, and (d) augmenting training data with synthetic variations (noise, different fonts, degraded scans) to enhance generalization.
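The first of these directions, position-based edge inference, can be sketched with a simple heuristic. This is a hypothetical illustration of the idea, not a method from the paper: it assumes entities carry detected centre coordinates and exploits the top-to-bottom reading order of flowcharts to connect any entity that lacks a declared predecessor to the nearest entity above it.

```python
def infer_missing_edges(entities, relations):
    """Hypothetical post-processing heuristic (not from the paper).

    entities:  dict mapping entity id -> (x, y) centre, y grows downward
    relations: set of (source, target) pairs, read as
               "source isPreceededBy target"

    Any entity that never appears as a source (i.e. has no declared
    predecessor) is linked to the vertically nearest entity above it,
    approximating top-to-bottom flowchart order. The topmost entity
    correctly keeps no predecessor.
    """
    has_predecessor = {src for src, _ in relations}
    inferred = set(relations)
    for eid, (_, y) in entities.items():
        if eid in has_predecessor:
            continue
        above = [(oid, oy) for oid, (_, oy) in entities.items()
                 if oid != eid and oy < y]
        if above:
            nearest = min(above, key=lambda item: y - item[1])[0]
            inferred.add((eid, nearest))
    return inferred
```

A real implementation would also need to respect horizontal branching (the "ja/nee" splits below decisions), but even this one-dimensional version shows how cheap geometric priors could recover edges the VLM failed to emit.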
In conclusion, the study provides the first systematic assessment of open‑weight VLMs on industrial troubleshooting diagram extraction, revealing both promise and significant limitations. While VLMs can partially automate the extraction of procedural entities, reliable reconstruction of the underlying decision‑action flow remains elusive, necessitating further architectural innovations, targeted fine‑tuning, and hybrid approaches that combine VLM outputs with explicit graph reasoning.