Measuring an Artificial Intelligence System's Performance on a Verbal IQ Test for Young Children

We administered the Verbal IQ (VIQ) part of the Wechsler Preschool and Primary Scale of Intelligence (WPPSI-III) to the ConceptNet 4 AI system. The test questions (e.g., “Why do we shake hands?”) were translated into ConceptNet 4 inputs using a combination of the simple natural language processing tools that come with ConceptNet together with short Python programs that we wrote. The question answering used a version of ConceptNet based on spectral methods. The ConceptNet system scored a WPPSI-III VIQ that is average for a four-year-old child, but below average for 5 to 7 year-olds. Large variations among subtests indicate potential areas of improvement. In particular, results were strongest for the Vocabulary and Similarities subtests, intermediate for the Information subtest, and lowest for the Comprehension and Word Reasoning subtests. Comprehension is the subtest most strongly associated with common sense. The large variations among subtests and ordinary common sense strongly suggest that the WPPSI-III VIQ results do not show that “ConceptNet has the verbal abilities of a four-year-old.” Rather, children’s IQ tests offer one objective metric for the evaluation and comparison of AI systems. Also, this work continues previous research on Psychometric AI.


💡 Research Summary

The paper presents a novel evaluation of an artificial intelligence system—ConceptNet 4—by administering the Verbal IQ (VIQ) component of the Wechsler Preschool and Primary Scale of Intelligence, third edition (WPPSI‑III), a standardized test originally designed for children aged 2½ to 7 years. The authors selected the five VIQ subtests (Vocabulary, Similarities, Information, Comprehension, and Word Reasoning) and directly mapped each test item into a format that ConceptNet can process. This mapping involved two stages: first, using the built‑in natural‑language processing utilities of ConceptNet for tokenization, lemmatization, and basic relation extraction; second, applying custom Python scripts to convert the question into a “concept‑edge‑concept” graph query that captures the semantic structure of the original prompt (e.g., “Why do we shake hands?” becomes a subgraph linking the concepts handshake, purpose, greeting, etc.).
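The two-stage mapping described above can be sketched in a few lines of Python. Everything here is an illustrative stand-in, not the authors' actual pipeline: the toy edge list, the stopword-based concept extraction, and the relation names (`UsedFor`, `MotivatedByGoal`, drawn from ConceptNet's relation vocabulary) are assumptions made for the example.

```python
# Illustrative sketch of mapping a WPPSI-style question onto a
# "concept-edge-concept" query against a toy ConceptNet-like edge list.
# The edges, stopword list, and helper functions are hypothetical.

# Tiny stand-in for ConceptNet edges: (start concept, relation, end concept).
EDGES = [
    ("shake_hands", "UsedFor", "greeting"),
    ("shake_hands", "MotivatedByGoal", "show_respect"),
    ("ocean", "HasProperty", "salty"),
    ("ocean", "Contains", "water"),
]

STOPWORDS = {"why", "do", "we", "what", "is", "a", "the", "of"}

def extract_concepts(question: str) -> list[str]:
    """Rough tokenization plus stopword removal, standing in for the
    simple NLP tools that ship with ConceptNet."""
    words = question.lower().strip("?").split()
    content = [w for w in words if w not in STOPWORDS]
    # Join the remaining words into one multi-word concept, since
    # ConceptNet often stores phrases with underscores.
    return ["_".join(content)] if content else []

def query_edges(concept: str, relations=("UsedFor", "MotivatedByGoal")):
    """Return candidate answers: end concepts reachable from the query
    concept via purpose-like relations."""
    return [(rel, end) for start, rel, end in EDGES
            if start == concept and rel in relations]

concepts = extract_concepts("Why do we shake hands?")
answers = query_edges(concepts[0])
```

A real pipeline would of course face much messier language, but the sketch shows the shape of the conversion: strip the question down to its content concepts, then pull the subgraph of purpose-like edges around them.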

Answer generation relied on a spectral method. The authors computed the Laplacian of the entire ConceptNet graph, derived transition probabilities between nodes, and identified the node(s) with the highest similarity to the query subgraph. A confidence score was assigned to each candidate answer, and the top‑ranked answer was compared against a pre‑compiled answer key derived from the WPPSI‑III scoring manual.
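Under the recipe just described (Laplacian, transition probabilities, similarity to the query), the spectral step can be sketched on a toy graph. The concepts, edge weights, and single-step random walk below are illustrative assumptions, not the paper's exact computation:

```python
# Minimal sketch of Laplacian-derived random-walk ranking on a toy
# concept graph. Nodes and edge weights are made up for illustration.
import numpy as np

nodes = ["shake_hands", "greeting", "respect", "ocean", "water"]
idx = {n: i for i, n in enumerate(nodes)}

# Weighted, symmetric adjacency matrix for an undirected concept graph.
A = np.zeros((5, 5))
for u, v, w in [("shake_hands", "greeting", 2.0),
                ("shake_hands", "respect", 1.0),
                ("ocean", "water", 1.0)]:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = w

D = np.diag(A.sum(axis=1))                        # degree matrix
L = D - A                                         # unnormalized graph Laplacian
P = np.eye(len(nodes)) - np.linalg.solve(D, L)    # P = I - D^-1 L = D^-1 A

# Represent the query subgraph as an indicator vector, then take one
# random-walk step: probability mass flows to the concepts most
# tightly linked to the query.
q = np.zeros(len(nodes))
q[idx["shake_hands"]] = 1.0
scores = q @ P

# Rank candidate answers by walk probability, excluding the query itself.
ranked = sorted(((scores[idx[n]], n) for n in nodes if n != "shake_hands"),
                reverse=True)
best = ranked[0][1]
```

The walk probabilities in `scores` play the role of the confidence values mentioned above: the top-ranked node would be compared against the answer key.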

Results show that ConceptNet 4 achieved an overall VIQ score of approximately 100, which corresponds to the average performance of a four‑year‑old child. However, performance varied markedly across subtests. Vocabulary and Similarities yielded the highest scaled scores (≈115 and ≈110 respectively), indicating that the system is adept at retrieving word meanings and judging relational similarity when the task aligns with the graph’s factual, taxonomic knowledge. The Information subtest produced a moderate score (≈90), reflecting decent recall of common‑sense facts such as “the ocean contains water.” In stark contrast, Comprehension and Word Reasoning received the lowest scores (≈75 and ≈70), revealing a weakness in handling questions that require understanding intent, purpose, or multi‑step reasoning—areas that are heavily dependent on contextual and causal common sense.

The authors interpret these findings in several ways. First, they argue that applying a human child’s psychometric instrument to AI offers an intuitive “age” metric, allowing researchers and the public to grasp AI language abilities in familiar terms. Second, the subtest disparity highlights a structural limitation of graph‑based AI: while static factual knowledge is well‑represented, dynamic inference that integrates situational context remains underdeveloped. Third, the paper suggests concrete pathways for improvement. Incorporating modern contextual embeddings (e.g., BERT, RoBERTa), multimodal grounding, and reinforcement‑learning‑driven reasoning could bolster performance on Comprehension and Word Reasoning. Additionally, augmenting the knowledge graph with explicit causal and purpose relations would directly address the types of “why” questions that currently trip up the system.

Beyond the specific results, the study contributes to the emerging field of “Psychometric AI,” which proposes using established psychological assessments as benchmarks for machine intelligence. By demonstrating that a standard IQ test can be repurposed for AI, the authors open the door to systematic, cross‑modal comparisons between human cognition and artificial systems. Future work could extend this methodology to other age‑appropriate batteries (e.g., the Stanford‑Binet, Raven’s Progressive Matrices) and to a broader array of AI architectures, including large language models and multimodal transformers. Such comparative studies would help map the landscape of AI strengths and deficits, guide research priorities, and ultimately inform the development of systems that possess not only factual knowledge but also the nuanced, common‑sense reasoning characteristic of human intelligence.

