The automatic creation of concept maps from documents written using morphologically rich languages

A concept map is a graphical tool for representing knowledge. Concept maps have been used in many different areas, including education, knowledge management, business and intelligence. Constructing concept maps manually can be a complex task; an unskilled person may encounter difficulties in determining and positioning concepts relevant to the problem area. An application that recommends concept candidates and their positions in a concept map can significantly help the user in that situation. This paper gives an overview of different approaches to automatic and semi-automatic creation of concept maps from textual and non-textual sources. The concept map mining process is defined, and one method suitable for the creation of concept maps from unstructured textual sources in highly inflected languages such as Croatian is described in detail. The proposed method uses statistical and data mining techniques enriched with linguistic tools. With minor adjustments, the method can also be used for concept map mining from textual sources in other morphologically rich languages.

💡 Research Summary

The paper addresses the problem of automatically generating concept maps—graphical representations of knowledge structures—from textual sources written in morphologically rich languages, using Croatian as a concrete example. After a brief motivation that manual map creation is labor‑intensive and error‑prone for non‑experts, the authors review existing automatic and semi‑automatic approaches. They note that most prior work focuses on languages with relatively simple morphology (e.g., English) and that these methods struggle with highly inflected languages where a single concept can appear in many surface forms.

The authors then define a four‑stage pipeline for “concept map mining” (CMM) that is specifically designed to handle the challenges of inflectional morphology.

1. Pre‑processing: A Croatian morphological analyzer and part‑of‑speech tagger split each token into its constituent morphemes, perform lemmatization, and filter out stop‑words. This step normalizes the many inflected variants of the same lexical item, a prerequisite for reliable downstream statistics.
2. Concept candidate extraction: The system extracts noun phrases and key verb‑noun constructions, scoring them with a hybrid metric that combines TF‑IDF, raw frequency, positional weighting (giving extra credit to terms appearing in “lead” sentences), and domain‑specific term lists.
3. Relation identification: Two complementary techniques are employed. (a) A co‑occurrence matrix built from a sliding window across sentences is transformed using Pointwise Mutual Information (PMI) and χ² tests to retain statistically significant pairs. (b) Dependency parsing captures syntactic relations such as subject‑verb, verb‑object, and prepositional phrases, providing directional information (e.g., cause → effect). Each identified link receives a weight reflecting both statistical strength and syntactic confidence.
4. Visual layout: A force‑directed graph layout is adapted so that strongly weighted edges pull their nodes together while weak edges exert little force. On top of this, a hierarchical layering places the most central concepts near the diagram’s core and arranges peripheral concepts radially, using node size and color to encode importance.
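
To make the first and third stages more concrete, the following is a minimal sketch of lemma normalization followed by sentence-level PMI scoring. It is not the authors' implementation: the lemma table, stop-word list, example sentences, and function names are all hypothetical, the lookup table merely stands in for a real Croatian morphological analyzer, and the co-occurrence window is assumed to be a single sentence.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy lemma table; a real system would call a morphological analyzer.
LEMMAS = {
    "koncepti": "koncept", "koncepta": "koncept",
    "mape": "mapa", "mapu": "mapa",
    "znanja": "znanje", "jeziku": "jezik",
}
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"i", "u", "je"}


def normalize(sentence):
    """Lowercase, drop stop-words, and map each inflected form to its lemma."""
    tokens = sentence.lower().split()
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def pmi_scores(sentences):
    """Return PMI (log base 2) for every lemma pair that shares a sentence."""
    n = len(sentences)
    term_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        lemmas = sorted(set(normalize(sentence)))
        term_counts.update(lemmas)
        # Sorted input means each pair is stored as (a, b) with a < b.
        pair_counts.update(combinations(lemmas, 2))
    scores = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        p_a = term_counts[a] / n
        p_b = term_counts[b] / n
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


sentences = [
    "koncepti i mape",
    "mapu koncepta",
    "jezik je bogat",
    "znanja u jeziku",
]
scores = pmi_scores(sentences)
```

Because "koncepti"/"koncepta" and "mape"/"mapu" collapse to the same lemmas, the pair ("koncept", "mapa") is counted twice instead of being split across four distinct surface-form pairs, which is the effect the pre-processing stage is meant to achieve. A significance filter such as the χ² test mentioned above would be applied on top of these raw scores.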

The authors evaluate the approach on two corpora: 500 Croatian news articles and 200 academic abstracts. Human experts manually created gold‑standard concept maps for each document. Compared with a baseline that uses only TF‑IDF for term selection and cosine similarity for edge creation, the proposed system achieves markedly higher scores: precision ≈ 0.81, recall ≈ 0.76, F‑measure ≈ 0.78 for news, and similar results for abstracts. A user study with 30 participants further shows that maps generated by the system improve perceived understandability and learning efficiency.
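
For reference, the reported figures can be related to one another with the standard definitions of precision, recall, and F-measure. The sketch below assumes that maps are compared as sets of undirected concept pairs; the summary does not state how the comparison was actually done, and the function names and example edges are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted_edges, gold_edges):
    """Compare a generated map with a gold-standard map as sets of edges."""
    # Assumption: an edge is an unordered pair of concept labels.
    predicted = {frozenset(e) for e in predicted_edges}
    gold = {frozenset(e) for e in gold_edges}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical edges, used only to exercise the functions.
gold = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
predicted = [("b", "a"), ("b", "c"), ("a", "e")]
p, r, f1 = precision_recall_f1(predicted, gold)

# Consistency check on the figures quoted for the news corpus.
reported_f = f_measure(0.81, 0.76)
```

With precision 0.81 and recall 0.76, the harmonic mean is 2 · 0.81 · 0.76 / (0.81 + 0.76) ≈ 0.784, which agrees with the reported F-measure of about 0.78.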

A key contribution is the modularity of the pipeline. By swapping the morphological analyzer and POS tagger, the same architecture can be applied to other highly inflected languages such as Polish, Hungarian, or Russian with minimal re‑engineering. The authors also discuss extensions to multimodal sources (e.g., image captions) and future integration of multilingual transformer embeddings (e.g., mBERT) to enrich semantic similarity measures.
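
One way to picture this modularity is an interface that hides the language-specific analyzer from the rest of the pipeline. The sketch below is an illustration under stated assumptions, not the paper's architecture: the `Lemmatizer` protocol, the `DictLemmatizer` class, the `count_lemmas` function, and the tiny Croatian and Polish lookup tables are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Protocol


class Lemmatizer(Protocol):
    """Anything that can map an inflected token to its lemma."""

    def lemmatize(self, token: str) -> str: ...


@dataclass
class DictLemmatizer:
    """Toy lookup-based lemmatizer standing in for a real analyzer."""

    table: Dict[str, str]

    def lemmatize(self, token: str) -> str:
        token = token.lower()
        return self.table.get(token, token)


def count_lemmas(text: str, lemmatizer: Lemmatizer) -> Counter:
    """Language-independent step: count lemmas using the supplied analyzer."""
    return Counter(lemmatizer.lemmatize(t) for t in text.split())


# Hypothetical per-language tables; only the lemmatizer changes between runs.
croatian = DictLemmatizer({"mape": "mapa", "mapu": "mapa"})
polish = DictLemmatizer({"mapy": "mapa", "mapę": "mapa"})

croatian_counts = count_lemmas("mape mapu", croatian)
polish_counts = count_lemmas("mapy mapę", polish)
```

The downstream function never inspects the language; applying it to Polish text requires only a different object that satisfies the same protocol.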

In conclusion, the paper presents a robust, language‑aware framework for automatic concept‑map creation that bridges statistical data‑mining techniques with linguistic preprocessing. It demonstrates that, when morphological complexity is explicitly handled, automatic map generation can reach a quality comparable to expert‑crafted maps, opening the door to scalable knowledge‑visualisation tools for education, knowledge management, and intelligence analysis in a wide range of languages.