Revealing economic facts: LLMs know more than they say

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We investigate whether the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models’ text outputs. This suggests that hidden states capture richer economic information than the responses of the LLMs reveal directly. A learning curve analysis indicates that only a few dozen labelled examples are sufficient for training. We also propose a transfer learning method that improves estimation accuracy without requiring any labelled data for the target variable. Finally, we demonstrate the practical utility of hidden-state representations in super-resolution and data imputation tasks.


💡 Research Summary

This paper, “Revealing economic facts: LLMs know more than they say,” presents a novel and practical methodology for leveraging the internal representations of Large Language Models (LLMs) to estimate and impute economic and financial statistics. The core premise is that while LLMs are trained on vast corpora containing relevant data, this knowledge is not always directly accessible or reliably produced through text generation due to factors like hallucination suppression or the absence of exact data points. The authors hypothesize that richer, more generalized knowledge about entities (like firms or regions) is encoded within the models’ hidden states (embeddings).

The research is built on a technique called linear probing. The primary method, termed Linear Model on Embeddings (LME), involves:

  1. Prompting: Using a simple completion prompt (e.g., “The population in Orange County, California in 2019 was”) for a given entity and year.
  2. Extraction: Taking the hidden state vector corresponding to the last token of this prompt from a specific layer (layer 25 was found optimal in their experiments) of an open-source LLM.
  3. Modeling: Training a regularized linear regression model (Ridge regression) on these extracted embeddings to predict the target numeric variable (e.g., unemployment rate, total assets).
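The three steps above can be sketched roughly as follows. This is a minimal illustration assuming the Hugging Face `transformers` API and a generic open-source causal LM; the model handle, layer index (25, per the paper's experiments), and Ridge penalty are illustrative choices, not a reproduction of the authors' exact code.

```python
# Sketch of the LME pipeline: prompt -> last-token hidden state -> Ridge probe.
import numpy as np
from sklearn.linear_model import Ridge

def extract_embedding(model, tokenizer, prompt, layer=25):
    """Hidden state of the prompt's last token at the given layer."""
    import torch  # assumes a transformers-style causal LM
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors,
    # each of shape (batch, seq_len, hidden_dim).
    return out.hidden_states[layer][0, -1, :].float().numpy()

def fit_lme(embeddings, targets, alpha=1.0):
    """Regularized linear probe: Ridge regression on stacked embeddings."""
    X = np.vstack(embeddings)
    probe = Ridge(alpha=alpha)
    probe.fit(X, np.asarray(targets))
    return probe
```

In use, one would build a completion prompt per entity (e.g. "The population in Orange County, California in 2019 was"), extract one embedding each, and fit the probe on the small labeled set, which the paper reports can be as few as a few dozen examples.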

The authors rigorously test this approach across five public datasets: US counties, EU NUTS-2 regions, UK and German districts, and US listed firms, covering variables such as population, GDP per capita, unemployment, total assets, and market capitalization.

Key Findings:

  1. Superiority of LME over Text Output: Across almost all variables and datasets, the LME significantly outperformed the model’s own text outputs in estimating the ground-truth statistics, as measured by Spearman rank correlation. This robustly confirms that hidden states contain more accurate and richer economic information than what is revealed in direct text generation.

  2. Data Efficiency: A learning curve analysis demonstrated that LME can achieve high performance with very few labeled examples—often just a few dozen samples are sufficient. This makes the approach highly practical for domains where labeled data is scarce or expensive to obtain.

  3. Transfer Learning Without Target Labels: The paper proposes an innovative transfer learning method for scenarios where no labeled data exists for the target variable. It combines two sources: (a) predictions from LME models trained on other labeled variables for the same entities, and (b) the LLM’s text output used as a noisy label. By employing techniques like early stopping to prevent overfitting to the noise, this method can achieve accuracy that sometimes surpasses the quality of the noisy labels themselves.

  4. Computational Efficiency vs. Reasoning Models: The authors compared LME against state-of-the-art reasoning models (specifically Qwen QwQ). While the reasoning model improved upon standard text generation, it was still outperformed by the simple LME approach. Crucially, LME is orders of magnitude more computationally efficient, requiring only a single forward pass to extract embeddings and the cost of training a small linear model.

  5. Practical Applications Demonstrated:

    • Data Imputation: Incorporating LLM embeddings as additional features consistently improved the accuracy of standard imputation methods (like K-Nearest Neighbors) for filling in missing values in economic datasets.
    • Super-Resolution: The embeddings enabled the successful “downscaling” of economic statistics, such as inferring county-level data from only state-level information, by leveraging the nuanced entity knowledge within the LLM.
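The imputation application can be sketched as follows: append the entity embeddings to the partially observed table as side features before running a standard KNN imputer. This is a hedged sketch, not the authors' implementation; the standardization step is my assumption to keep the high-dimensional embeddings from dominating the NaN-aware distances.

```python
# Sketch: LLM embeddings as side features for KNN-based data imputation.
import numpy as np
from sklearn.impute import KNNImputer

def impute_with_embeddings(table, embeddings, n_neighbors=5):
    """Fill missing cells in `table` using embedding-augmented neighbors."""
    table = np.asarray(table, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    # Standardize embeddings so they balance, not dominate, the distances
    # (an assumption of this sketch, not a detail from the paper).
    emb = (emb - emb.mean(axis=0)) / (emb.std(axis=0) + 1e-8)
    augmented = np.hstack([table, emb])
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(augmented)
    # Return only the original columns, now imputed.
    return imputed[:, : table.shape[1]]
```

The embedding columns contain no missing values, so they steer the neighbor search toward economically similar entities even when many table columns are absent for a row.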

Implications and Conclusion: This work demonstrates that the hidden states of LLMs serve as a powerful, compressed knowledge base that can be efficiently tapped for structured prediction tasks. It shifts the perspective on LLMs from text generators alone to versatile feature extractors and knowledge repositories. The proposed methods offer a cost-effective, data-efficient, and accurate alternative to relying on LLM text generation or on traditional data sourcing and cleaning pipelines for economic and financial data tasks. The findings are relevant to researchers and practitioners in economics, finance, and data science, opening new avenues for using pretrained LLMs in analytical workflows.

