Open FinLLM Leaderboard: Towards Financial AI Readiness
Financial large language models (FinLLMs) with multimodal capabilities are envisioned to revolutionize applications across business, finance, accounting, and auditing. However, real-world adoption requires robust benchmarks of FinLLMs’ and FinAgents’ performance. Maintaining an open leaderboard is crucial for encouraging innovative adoption and improving model effectiveness. In collaboration with the Linux Foundation and Hugging Face, we create an open FinLLM leaderboard, which serves as an open platform for assessing and comparing AI models’ performance on a wide spectrum of financial tasks. By democratizing access to advances in financial knowledge and intelligence, a chatbot or agent may raise the analytical capabilities of the general public to a professional level within a few months of usage. This open leaderboard welcomes contributions from academia, the open-source community, industry, and stakeholders. In particular, we encourage contributions of new datasets, tasks, and models for continual updates. By fostering a collaborative and open ecosystem, we seek to promote financial AI readiness.
💡 Research Summary
The paper presents the Open FinLLM Leaderboard, an open‑source platform designed to benchmark and compare financial large language models (FinLLMs) and financial agents (FinAgents) across a broad set of multimodal financial tasks. Recognizing that existing static benchmarks such as FinBen and FinanceBench do not keep pace with the rapid evolution of models and new use‑cases, the authors collaborate with the Linux Foundation and Hugging Face to create a dynamic, community‑driven leaderboard that can be continuously updated with new datasets, tasks, and models.
The leaderboard organizes 42 publicly available financial datasets into seven task categories: Information Extraction (IE), Textual Analysis (TA), Question Answering (QA), Text Generation (TG), Risk Management (RM), Forecasting (FO), and Decision‑Making (DM). Each category contains concrete sub‑tasks (e.g., NER, relation extraction, sentiment analysis, stock movement prediction) that reflect real‑world financial workflows such as XBRL filing processing, market sentiment monitoring, credit scoring, fraud detection, and M&A decision support. All datasets are curated and validated by domain experts to ensure relevance and realism.
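One way to picture this seven-category organization is a simple registry mapping each category code to its sub-tasks. This is only an illustrative sketch; the sub-task names below echo examples from the paper, but the registry structure and lookup helper are hypothetical, not the leaderboard's actual code:

```python
# Illustrative registry: category code -> example sub-tasks.
TASK_CATEGORIES = {
    "IE": ["named entity recognition", "relation extraction"],
    "TA": ["sentiment analysis", "news classification"],
    "QA": ["financial question answering"],
    "TG": ["report summarization"],
    "RM": ["credit scoring", "fraud detection"],
    "FO": ["stock movement prediction"],
    "DM": ["M&A decision support"],
}

def tasks_for(category: str) -> list[str]:
    """Look up the sub-tasks registered under a category code."""
    return TASK_CATEGORIES.get(category.upper(), [])
```

A registry like this makes it easy to route each of the 42 datasets to the evaluation logic and metrics appropriate for its category.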
A standardized testing pipeline automates model acquisition (via Hugging Face or APIs), preprocessing, tokenization, zero‑shot inference, metric computation, min‑max normalization to a 0‑100 scale, and ranking. The pipeline supports multiple evaluation metrics appropriate to each task: accuracy, F1, ROUGE, BERTScore, and Matthews correlation coefficient (MCC). By default, models are evaluated in a zero‑shot setting to assess their out‑of‑the‑box generalization to financial contexts, but the architecture allows future extensions for fine‑tuned evaluations.
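The final normalization and ranking steps can be sketched in a few lines. This is a minimal illustration of min-max scaling to a 0-100 range, not the leaderboard's actual pipeline code; the model names and raw scores are hypothetical:

```python
def minmax_scale(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize raw metric scores to a 0-100 scale."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all models tied: map everything to 100
        return {m: 100.0 for m in scores}
    return {m: 100.0 * (s - lo) / (hi - lo) for m, s in scores.items()}

def rank(normalized: dict[str, float]) -> list[str]:
    """Order model names by normalized score, best first."""
    return sorted(normalized, key=normalized.get, reverse=True)

# Hypothetical raw F1 scores on a single task
raw = {"model-a": 0.62, "model-b": 0.55, "model-c": 0.71}
norm = minmax_scale(raw)
print(rank(norm))  # model-c ranks first, model-b last
```

Scaling per task keeps metrics with different native ranges (accuracy, ROUGE, MCC) comparable before they are aggregated into a leaderboard ranking.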
The initial benchmark includes several state‑of‑the‑art models: GPT‑4 (standard version), LLaMA 3.1 (8B and 70B), Gemini, Qwen2 (72B and 7B‑Instruct), and Xuanyuan‑70B. Early results show GPT‑4 achieving the highest aggregate score, while LLaMA 3.1 demonstrates competitive performance on specific table‑based extraction tasks, illustrating that strengths vary across model families.
Beyond raw scores, the platform offers interactive demos. The FinGPT Search Agent demo showcases Retrieval‑Augmented Generation (RAG) by integrating real‑time data from sources such as Yahoo Finance, Bloomberg, PDFs, and Excel sheets into model responses. A side‑by‑side comparison UI lets users explore how different models answer the same query, facilitating transparent model selection. Additional child leaderboards map to business functions—Search Agent, AI Tutor, Compliance, Auditing—allowing enterprises to locate benchmarks most relevant to their operational needs.
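The RAG pattern behind the Search Agent demo can be sketched as "retrieve, then augment the prompt." The retriever below is a deliberate stand-in (naive keyword overlap over an in-memory corpus) rather than the demo's actual retrieval stack over Yahoo Finance, Bloomberg, PDFs, or Excel; the corpus snippets are invented for illustration:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus snippets by naive keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Augment the user query with retrieved context before calling an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Answer using only the context above."
    )

corpus = [
    "AAPL closed at 178.10, up 1.2% on earnings news.",  # hypothetical data
    "Treasury yields fell after the Fed statement.",
]
print(build_prompt("What did AAPL close at today?", corpus))
```

A production agent would swap the toy retriever for live data connectors and send the assembled prompt to the model under comparison, which is what makes the side-by-side UI informative: every model answers from the same retrieved context.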
Open governance is a core principle. The leaderboard, its code, and the evaluation pipeline are released under the Model Openness Framework and OpenMDW license, hosted on GitHub and Hugging Face. A CI/CD workflow automatically incorporates community submissions of new datasets, tasks, or model checkpoints, ensuring the leaderboard remains up‑to‑date and avoids the stagnation typical of static benchmarks.
The authors acknowledge limitations and future work: (1) extending the framework to evaluate fine‑tuned, domain‑adapted models; (2) addressing privacy and regulatory constraints by exploring synthetic data or encrypted evaluation methods; (3) developing dedicated metrics for multimodal quantitative reasoning, especially for time‑series and tabular forecasting tasks.
In summary, the Open FinLLM Leaderboard provides a transparent, extensible, and community‑driven infrastructure for assessing financial AI systems. By standardizing multimodal evaluation, supporting real‑world demos, and encouraging open contributions, it aims to accelerate the readiness of financial AI, helping practitioners, researchers, and industry stakeholders identify trustworthy, high‑performing models for a wide range of financial applications.