Automatic Classification of Pedagogical Materials against CS Curriculum Guidelines

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Professional societies often publish curriculum guidelines to help programs align their content with international standards. In Computer Science, the primary standard, published jointly by ACM and IEEE, provides detailed guidelines for what should and could be included in a Computer Science program. While very helpful, it remains difficult for program administrators to assess how much of the guidelines a CS program actually covers. This is due in particular to the extensiveness of the guidelines, which contain thousands of individual items. As such, it is time-consuming and cognitively demanding to audit every course and confidently mark everything that is actually covered. Our preliminary work indicated that it takes about a day of work per course. In this work, we propose using Natural Language Processing techniques to accelerate the process. We explore two kinds of techniques: the first relies on traditional tools for parsing, tagging, and embeddings, while the second leverages the power of Large Language Models. We evaluate the application of these techniques to classify a corpus of pedagogical materials and show that we can meaningfully classify documents automatically.


💡 Research Summary

The paper tackles the labor‑intensive problem of mapping computer‑science course materials—lecture slides, assignments, videos, etc.—to the extensive ACM/IEEE curriculum guidelines. The guidelines, organized hierarchically into knowledge areas, units, topics, and learning outcomes, contain roughly 2,700 distinct items. Manually classifying a single course in the CS Materials system typically consumes an entire day, representing a major bottleneck for curriculum alignment and audit.

The authors propose two families of automated techniques. The first family relies on “classical” natural‑language processing (NLP). PDFs are parsed with PyPDF or PyMuPDF, ligatures are corrected, and the text is tokenized and POS‑tagged using NLTK. Base noun phrases (bNPs) are extracted via a simple regex pattern (adjectives followed by nouns). Two exact‑match strategies are implemented: count‑unweighted (raw frequency of each bNP in the document) and count‑weighted (frequency normalized by the number of guideline entries containing the bNP). To overcome the brittleness of exact matching, the authors also employ pre‑trained GloVe‑wiki‑giga‑300 word embeddings. Each phrase is represented by the average of its constituent word vectors; cosine similarity, converted to a distance in [0, 1], is compared against a fixed threshold (0.3) to decide a match. Four embedding‑based variants are explored, differing in whether they weight by frequency and whether they consider the best match or all matches. In evaluation, exact‑match methods achieve recall around 11‑14 %, while embedding‑based methods improve recall to roughly 15‑21 %, still leaving a large portion of the guideline uncovered. The main limitation is the context‑independence of static word vectors, which cannot disambiguate polysemous terms (e.g., “dynamic programming” vs. “dynamic partitioning”).
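The classical pipeline can be illustrated with a minimal sketch. This is not the paper's code: the hand-tagged tokens stand in for NLTK's POS tagger, the toy 3‑d vectors stand in for the 300‑d GloVe‑wiki‑giga‑300 embeddings, and the regex is just one plausible encoding of "adjectives followed by nouns".

```python
import math
import re

# Hand-tagged tokens; in the actual pipeline NLTK's pos_tag would produce these.
tagged = [("dynamic", "JJ"), ("programming", "NN"), ("solves", "VBZ"),
          ("overlapping", "JJ"), ("subproblems", "NNS")]

def extract_bnps(tagged_tokens):
    # Serialize tokens as word/TAG so one regex can scan the whole sequence.
    tag_string = " ".join(f"{w}/{t}" for w, t in tagged_tokens)
    # Zero or more adjectives (JJ, JJR, JJS) followed by one or more nouns
    # (NN, NNS, NNP, ...) -- an assumed encoding of the paper's bNP pattern.
    pattern = r"(?:\S+/JJ\S*\s+)*(?:\S+/NN\S*\s*)+"
    return [" ".join(tok.split("/")[0] for tok in m.split())
            for m in re.findall(pattern, tag_string)]

# Toy vectors; real phrase vectors average 300-d GloVe word vectors.
vecs = {"dynamic": [0.9, 0.1, 0.0], "programming": [0.1, 0.9, 0.2],
        "partitioning": [0.2, 0.8, 0.4]}

def phrase_vec(phrase):
    # A phrase is the average of its constituent word vectors.
    words = [vecs[w] for w in phrase.split() if w in vecs]
    return [sum(v[i] for v in words) / len(words) for i in range(len(words[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm  # distance in [0, 2]; small means similar

print(extract_bnps(tagged))  # ['dynamic programming', 'overlapping subproblems']
d = cosine_distance(phrase_vec("dynamic programming"),
                    phrase_vec("dynamic partitioning"))
print(d < 0.3)  # True: both phrases fall under the 0.3 match threshold
```

The last line also demonstrates the limitation noted above: because the vectors are context-independent, "dynamic programming" and "dynamic partitioning" end up nearly identical and both count as matches.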

The second family leverages large language models (LLMs). The authors use the open‑source Llama‑3.3 model, which supports up to 128 K input tokens; this is still insufficient for feeding the whole JSON representation of the guidelines (≈630 KB) in a single prompt. Consequently, they adopt a per‑category querying approach. An initial binary formulation (yes/no) proved overly permissive, so they switched to a 0‑to‑5 scoring scheme (llm‑5point) that allows ranking of categories by relevance. To reduce the massive number of queries (≈2,700 per document), they batch five categories per request (llm‑5point‑batch) and introduce contextual enrichment: each query can include the parent knowledge area and unit (llm‑5point‑context). A further pruning strategy (llm‑prune‑5point‑context) first asks the LLM to generate a short summary of each knowledge unit; only if the summary suggests relevance does the system query the individual categories within that unit. This pruning cuts the number of queries dramatically and, surprisingly, improves recall. Across the LLM variants, recall ranges from about 18 % (basic binary) to 22 % (pruned contextual), outperforming the classical NLP baselines.
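The pruned, contextual scoring loop can be sketched as follows. Everything here is an illustrative assumption: `query_llm` is a hypothetical stand-in for a real Llama‑3.3 call, and the prompt wording, threshold, and tiny guideline excerpt are not the paper's actual prompts or data.

```python
def query_llm(prompt: str) -> str:
    # Stub model: pretend only sorting-related material scores as relevant.
    # A real deployment would send the prompt to a Llama-3.3 endpoint.
    return "4" if "Sorting" in prompt else "0"

# Minimal stand-in for the hierarchy: (knowledge area, unit) -> topics.
guidelines = {
    ("Algorithms", "Sorting"): ["comparison sorts", "radix sort"],
    ("Systems", "Virtual Memory"): ["page tables", "TLBs"],
}

def classify(document_summary: str, threshold: int = 3):
    scores = {}
    for (area, unit), topics in guidelines.items():
        # Pruning step: score the whole knowledge unit first.
        unit_prompt = (f"Rate 0-5 how relevant the unit '{unit}' in area "
                       f"'{area}' is to this document: {document_summary}")
        if int(query_llm(unit_prompt)) < threshold:
            continue  # skip every topic inside an irrelevant unit
        # Contextual enrichment: each topic query names its parent area/unit.
        for topic in topics:
            topic_prompt = (f"Within area '{area}', unit '{unit}': rate 0-5 "
                            f"the relevance of '{topic}' to: {document_summary}")
            scores[(area, unit, topic)] = int(query_llm(topic_prompt))
    return scores

print(classify("Lecture slides on mergesort and quicksort"))
```

With the stub above, the Virtual Memory unit is pruned after a single query, so its two topics are never scored individually; this is the mechanism by which pruning cuts query volume from ≈2,700 per document to a much smaller number.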

Runtime considerations are discussed in depth. Even with asynchronous calls and batching, the per‑document cost remains high because each category still incurs a separate API interaction. On a local server this can degrade performance for other users; on a cloud platform it translates into non‑trivial monetary expense. The authors therefore present timing results (tens of seconds per document) but acknowledge that cost efficiency remains an open challenge.
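One way to realize the asynchronous calls mentioned above is to keep all per-category requests in flight concurrently, so wall-clock time approaches one round trip rather than one round trip per category. This is a generic `asyncio` sketch, not the authors' implementation; the sleep stands in for an illustrative API latency.

```python
import asyncio

async def score_category(category: str) -> tuple[str, int]:
    await asyncio.sleep(0.01)  # stand-in for one LLM API round trip
    return category, 3         # fixed dummy score for the sketch

async def score_all(categories):
    # All requests are dispatched at once; gather awaits them together.
    return dict(await asyncio.gather(*(score_category(c) for c in categories)))

scores = asyncio.run(score_all([f"category-{i}" for i in range(10)]))
print(len(scores))  # 10
```

Note that concurrency only hides latency on the client side; as the authors observe, each request still consumes server compute (or incurs cloud cost), so batching and pruning remain necessary to reduce the total query count.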

Overall, the study demonstrates that both NLP pipelines and LLM‑based classifiers can substantially reduce the manual effort required for the initial classification step, which dominates the overall workflow. Classical NLP offers a low‑cost, easily deployable baseline, while LLMs provide superior semantic understanding and higher recall, especially when contextual information and pruning are incorporated. The authors argue for a hybrid “machine‑suggested shortlist + human verification” workflow as the most practical deployment.

Future work is outlined: (1) developing domain‑specific, smaller LLMs to lower inference cost; (2) refining pruning and routing algorithms to further cut query volume; (3) experimenting with multi‑turn prompting or chain‑of‑thought techniques to improve LLM reasoning; and (4) fine‑tuning models on annotated CS‑Materials data to boost accuracy. By advancing these directions, the community could achieve near‑real‑time, cost‑effective curriculum alignment, facilitating broader adoption of systematic CS program audits.

