Coordinate Matrix Machine A Human-level Concept Learning to Classify Very Similar Documents
Original Paper Info
- Title: Coordinate Matrix Machine A Human-level Concept Learning to Classify Very Similar Documents
- ArXiv ID: 2512.23749
- Date: 2025-12-26
- Authors: Amin Sadri, M Maruf Hossain
Abstract
Human-level concept learning argues that humans typically learn new concepts from a single example, whereas machine learning algorithms typically require hundreds of samples to learn a single concept. Our brain subconsciously identifies important features and learns more effectively. Contribution: In this paper, we present the Coordinate Matrix Machine (CM$^2$). This purpose-built small model augments human intelligence by learning document structures and using this information to classify documents. While modern "Red AI" trends rely on massive pre-training and energy-intensive GPU infrastructure, CM$^2$ is designed as a Green AI solution. It achieves human-level concept learning by identifying only the structural "important features" a human would consider, allowing it to classify very similar documents using only one sample per class. Advantage: Our algorithm outperforms traditional vectorizers and complex deep learning models that require larger datasets and significant compute. By focusing on structural coordinates rather than exhaustive semantic vectors, CM$^2$ offers:

1. High accuracy with minimal data (one-shot learning)
2. Geometric and structural intelligence
3. Green AI and environmental sustainability
4. Optimized for CPU-only environments
5. Inherent explainability (glass-box model)
6. Faster computation and low latency
7. Robustness against unbalanced classes
8. Economic viability
9. Generic, expandable, and extendable

Summary & Analysis
This paper addresses the challenge of document classification by applying a human-level concept learning approach. The Coordinate Matrix Machine (CM2) is designed to achieve high accuracy with minimal data, using only one sample per class, thus outperforming traditional machine learning and advanced deep learning models that require large amounts of labeled examples. CM2 focuses on capturing the geometric and structural intelligence of documents by treating them as coordinate matrices rather than linear sequences, which makes it more effective for formal documents like bank statements or invoices where spatial positioning is crucial.

Full Paper Content (ArXiv Source)
Human-level concept learning is a relatively less explored area of research. The goal is to provide a solution to a problem that is easy for humans to do, yet still challenging for machines despite their computational power. People can learn new concepts from only one or a very few samples, whereas machine learning methods require many examples to discover correlations and understand features. The reason is that people subconsciously select the important features and then generalize to other distinct areas.
In this paper, we deploy a human-level concept learning approach for document classification. Document classification is a common task in machine learning and Natural Language Processing (NLP), in which a document is assigned to one or more classes. Current approaches rely heavily on the document context to classify documents. Identifying the relevant topic is the primary motivator for such document classification. These models assume we have access to ample labelled data and that the document context is sufficiently informative to distinguish between classes. Some other methods use document image information, which makes the labeling process even more burdensome. Since images are high-dimensional data, labeling so many training samples is not feasible in most real-world problems. While most approaches treat text in isolation, our solution rests on the premise that the meaning of a structured document is derived from both the text and its coordinates. This dual modality allows the model to leverage spatial grounding that purely text-based models overlook.
A practical scenario is receiving bank statements from other financial institutions along with loan applications; the task is to identify the statement templates so that existing functions can be applied to extract information from them. In practice, when we ask someone to classify these statements, we do not need to show them hundreds of samples per class. A single sample suffices to represent a class. This fact indicates that a single sample is sufficient for classification if the algorithm is well-designed. This is our motivation for developing our algorithm.
Our aim is to develop a coordinate matrix-based framework (Coordinate Matrix Machine, CM$`^2`$) as a purpose-built small model that achieves human-level concept learning without the prohibitive costs of Large Language Models (LLMs). Unlike "Red AI" trends that rely on massive pre-training and energy-intensive GPU infrastructure, CM$`^2`$ is designed as a Green AI solution. It prioritizes computational efficiency and environmental sustainability by mimicking human structural identification rather than processing exhaustive semantic vectors. This enables a high-precision, explainable alternative that is both economically viable and runs on standard CPU hardware.
Challenges
When there is substantial variation in structured documents whose structure must be classified, machine learning techniques often perform poorly. We have identified the following challenges when dealing with bank statements in particular:
- A large number of documents need to be labelled to train a model with sufficient accuracy, which is both time- and resource-consuming.
- There are far too many class labels at the template level. For the five Australian banks included in this experiment, 53 templates (classes) were available, and there are not enough representative samples for each class.
- All bank statements are very similar in context, as they include many transaction descriptions, dates, and amounts.
- The majority of the words in bank statements are personal and highly contextual, such as the account holder's name, address, or transaction items, which produces a lot of noise when training a classification model.
- The words that are not noise are very similar across a range of bank statements, e.g., "Name", "Account", "Balance", "Date".
Due to these challenges, traditional machine learning models and more recent deep learning models perform poorly.
When classifying statements using the Term Frequency Vectorizer with Logistic Regression, the model with the best parameters achieved an $`F`$-measure of 79% over the 53 templates in an 80-20 training-test split.
This lower performance can discourage users from using simplistic models and incline them toward more complex models such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), which are not explainable by default and rely on additional techniques to provide explanations, especially to governing bodies for compliance purposes.
After spending hours building CNN and Long Short-Term Memory (LSTM) RNN architectures, we found that the CNN achieved an $`F`$-measure of 93%, whereas the LSTM failed to classify the 53 templates.
Our Contribution
In this paper, we apply a human-level concept learning approach to address this problem. We choose this approach because, as humans, we do not need hundreds of training samples to learn a single class, and we can classify documents even when the content is similar, sometimes even without reading the texts. Only one sample per template is sufficient, and we should consider not only the content but also the locations of specific words.
The work closest to ours in approach is that of Lake et al. They have argued that in most cases, people can learn a new concept from just one or a handful of examples, while typical machine learning algorithms require hundreds of examples to perform similarly. They aim to address a different problem: recognising handwritten characters from image data. In this paper, we position CM$`^{2}`$ as a specialized solution optimized for entity extraction (not covered in this paper) and document classification in complex structural domains. By treating the document as a coordinate matrix rather than a linear sequence, CM$`^{2}`$ provides the "structural intelligence" necessary for high-stakes industrial processing. Although the problems differ, both works use human-level concept learning by examining how humans solve the problem. Both approaches use only one sample per class and learn from that sample.
Given documents of similar structure, as humans, we examine the document structure and/or the positions of specific keywords (e.g., Account Name, Date, Bank Name). Similarly, in our approach, we propose a hybrid lazy learning algorithm, the Coordinate Matrix Machine (CM$`^2`$), that constructs an input matrix (in contrast to traditional vector-based input strategies) containing the coordinates of the keywords and uses this matrix to classify documents.
The rationale behind the technique is as follows:
- With structured documents, the position of a term carries more importance than its occurrence or the order of occurrence. Furthermore, by augmenting the subject-matter experts' understanding, we can avoid building a large corpus, thereby reducing noise and improving classification performance.
- This technique allows us to avoid labeling a large number of documents, which is often required to train machine learning and deep learning models.
Advantages
The advantages of our algorithm are:
- High Accuracy with Minimal Data (One-Shot Learning): The algorithm achieves human-level concept learning, successfully classifying documents using only a single sample per class. It outperforms traditional machine learning and advanced deep learning models that typically require hundreds or thousands of labeled examples to achieve comparable accuracy. This makes the framework particularly effective for low-data, layout-sensitive tasks where obtaining a large, diverse training corpus is infeasible.
- Geometric and Structural Intelligence: Unlike text-only transformer-based models (like BERT or GPT) that process text as a linear sequence of tokens, CM$`^2`$ utilizes a coordinate matrix to capture the physical geometry of the document. This makes it significantly more effective for formal documents (bank statements, invoices) where the spatial position of elements is more informative than semantic flow. As a structure-aware document intelligence model, CM$`^{2}`$ ensures that even minor spatial deviations are captured, providing a level of layout-sensitivity that standard transformers lack.
- Green AI & Environmental Sustainability: Aligned with the Green AI initiative, CM$`^2`$ prioritizes computational efficiency. By avoiding the energy-intensive pre-training and massive carbon footprint associated with "Red AI" (large-scale LLMs), it offers an environmentally sustainable alternative for high-volume document processing.
- Optimized for CPU-Only Environments: While modern NLP trends often rely on expensive, high-end GPU clusters, CM$`^2`$ is purpose-built to run on standard consumer-grade CPUs. The use of static embeddings, such as GloVe, with $`O(1)`$ lookup latency rather than complex transformer passes ensures near-zero inference time on basic hardware.
- Inherent Explainability (Glass-Box Model): Unlike the "black-box" nature of deep learning and LLMs, CM$`^2`$ is fully interpretable. Every classification decision can be traced back to specific coordinate markers in the source document. This transparency is mandatory in audit-compliant industries such as finance, banking, and law.
- Faster Computation & Low Latency: The algorithm is designed for high-speed industrial throughput. By processing only "important features" rather than every word in a document, and by avoiding the heavy clock-cycle requirements of neural network layers, it provides a much faster classification pipeline than modern transformer benchmarks.
- Robustness Against Unbalanced Classes: Because the model relies on structural identification rather than statistical word frequency, it is inherently robust against unbalanced datasets. It does not suffer from the "majority class bias" that often plagues traditional machine learning classifiers.
- Economic Viability: By eliminating the need for expensive GPU infrastructure, high-cost LLM API subscriptions, and the labor-intensive process of large-scale data labeling, CM$`^2`$ provides a significantly cheaper solution for enterprise-level deployment.
- Generic, Expandable, and Extendable: The coordinate matrix framework is highly versatile. It can be easily adapted to incorporate any structured document templates or extended to include additional feature markers without requiring the model to be "re-trained" on a massive corpus.
Organization
The remainder of the paper is organized as follows. In Section 9, we briefly discuss the related work in document classification. Our algorithm is formally described in Section 10. Section 11 presents the design of the experimental investigation. We present and discuss the results in Sections 12 and 13. Finally, in Section 14, we suggest future improvements and conclude the paper.
Related Work
As document classification has become an emerging area in text mining research, a large amount of prior work has focused on it. A typical document classification process has several pre-processing steps, and researchers have focused on each step to improve performance, such as stopword removal, tokenization, part-of-speech tagging, and stemming.
The second step usually focuses on feature extraction and selection. For example, Yang et al. used titles and other tag data to label the text features. Shih and Karger used the geometry of the rendered HTML page to construct tree models based on the incoming-link structure. Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used method for feature selection. Power et al. focused on web page classification and filtered the output of the TF-IDF algorithm to improve the performance.
Once the feature vector is identified, the third step is to apply a machine learning model for classification. Naïve Bayes is one of the probability-based classifiers. Decision tree-based approaches have also been used for various purposes, such as blocking inappropriate web content. Support Vector Machines (SVM) are also widely used for document classification. There are multiple approaches under artificial/deep neural networks. Hassan and Mahmood applied Convolutional Neural Networks (CNN) for document classification. Unlike most machine learning models, Recurrent Neural Networks (RNNs) consider the sequence of word occurrences; therefore, documents with similar words can yield different outputs because of word order. The critical distinction in our methodology is that, whereas most approaches consider text and semantic proximity, we treat both text and coordinates as primary features. This enables CM$`^{2}`$ to excel in environments where the physical location of a token is as meaningful as the token itself.
While the landscape of document classification is currently dominated by Large Language Models (LLMs) and transformer-based architectures, these "Red AI" approaches come with significant trade-offs in computational power, high-end GPU requirements, and interpretability. Recent research on Small Language Models (SLMs), such as DistilBERT, and on efficient few-shot frameworks, such as SetFit, aims to mitigate these costs by reducing model size; however, they remain fundamentally sequence-based. These models process text as a linear flow of tokens, often overlooking the rigid geometric structure and precise spatial coordinates that define formal documents such as bank statements or invoices.
A fundamental gap remains: standard transformers infer context from semantic proximity, whereas CM$`^2`$ directly leverages coordinate-based structural intelligence. By treating a document as a matrix of coordinates rather than a linear sequence, we provide a solution that is purpose-built for high-stakes industrial applications. This alignment with the Green AI philosophy ensures that the framework is not only hardware-agnostic and economically viable for CPU-only deployment, but also inherently interpretable, addressing the strict transparency and structural requirements mandated in the financial and legal sectors.
Methodology
We now describe in more detail the steps of our algorithm for constructing the coordinate matrix and inducing the classifier from it.
Pre-processing the Documents
Each training data point is a PDF file containing either a scanned image or a digital document. We have ensured that each page of the document is set to 300 dpi before running it through an Optical Character Recognition (OCR) engine to obtain the words and their corresponding coordinates. The OCR output is stored as eXtensible Markup Language (XML) files, which are used as input to train our model.
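For illustration, a minimal sketch of this pre-processing step is shown below. It assumes a simplified OCR XML schema in which each recognised word is a `<word>` element carrying `top` and `left` attributes; real OCR engines emit formats such as hOCR or ALTO, so the element and attribute names here are assumptions rather than the exact format used in this work.

```python
# Minimal sketch: read OCR output into (word, top, left) tuples.
# Assumes a simplified XML schema with <word top="..." left="...">text</word>
# elements; adjust the tag/attribute names for the actual OCR engine used.
import xml.etree.ElementTree as ET
from typing import List, Tuple

def parse_ocr_xml(path: str) -> List[Tuple[str, int, int]]:
    """Return a list of (word, top, left) tuples for one OCR'd document."""
    tree = ET.parse(path)
    words = []
    for elem in tree.getroot().iter("word"):
        text = (elem.text or "").strip()
        if text:
            words.append((text, int(elem.get("top")), int(elem.get("left"))))
    return words
```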
Building the Coordinate Matrix
For each class, only one sample is required for training. We accompany each XML file with a Comma-Separated Value (CSV) file that contains a key-value pair for each keyword in the document.
Statement A contains the term "Account No." followed by the account number "061234-12345678", the term "Account Holder" with the value "John Doe", and the term "Account Type" with the value "Savings Account". In statement B, the term "Account No." is followed by the number "064321-87654321", and the term "Account Name" with the value "Jane Smith" is recorded. So in the CSV file for statement A, we have the entries "Account No.,061234-12345678", "Account Holder,John Doe", and "Account Type,Savings Account"; and for statement B, we have "Account No.,064321-87654321" and "Account Name,Jane Smith". We then search for the keywords (e.g., "Account No.", "Account Name") in the XML file and construct a matrix of top and left positions for each keyword in each document, as shown in Tab. 1.
| Document ID | Keyword | Top | Left |
|---|---|---|---|
| Statement A | Account No. | 254 | 1231 |
| Statement A | Account Holder | 261 | 1231 |
| Statement A | Account Type | 269 | 1231 |
| Statement B | Account No. | 1123 | 231 |
| Statement B | Account Name | 100 | 359 |
Table 1: Example of the Coordinate Matrix
The first column comes from the training data, the second column comes from the CSV file, and the last two columns contain the positions found by searching for the keywords in the corresponding XML file. Algorithm [ccm] lists the steps; a sketch is given below.
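The following sketch shows one way this construction could be implemented, reusing the hypothetical `parse_ocr_xml` helper from the pre-processing sketch; the exact-match keyword search and the data structures are our assumptions, not the authors' code.

```python
# Sketch: build the coordinate matrix from one (XML, CSV) pair per class.
import csv
from typing import Dict, List, Optional, Tuple

Coordinate = Tuple[int, int]                          # (top, left) in pixels
CoordinateMatrix = Dict[str, Dict[str, Coordinate]]   # doc_id -> keyword -> position

def find_keyword(words: List[Tuple[str, int, int]], keyword: str) -> Optional[Coordinate]:
    """Locate a (possibly multi-word) keyword in the OCR word list and return
    the (top, left) position of its first word, or None if it is absent."""
    tokens = keyword.split()
    for i in range(len(words) - len(tokens) + 1):
        if [w[0] for w in words[i:i + len(tokens)]] == tokens:
            return (words[i][1], words[i][2])
    return None

def build_coordinate_matrix(training: Dict[str, Tuple[str, str]]) -> CoordinateMatrix:
    """training maps doc_id -> (xml_path, csv_path); each CSV row is keyword,value."""
    matrix: CoordinateMatrix = {}
    for doc_id, (xml_path, csv_path) in training.items():
        words = parse_ocr_xml(xml_path)               # from the pre-processing sketch
        matrix[doc_id] = {}
        with open(csv_path, newline="") as fh:
            for row in csv.reader(fh):
                keyword = row[0]
                pos = find_keyword(words, keyword)
                if pos is not None:
                    matrix[doc_id][keyword] = pos
    return matrix
```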
Classifying New Documents
When a new document arrives, we run OCR on it. In the XML produced by the OCR, we identify all words in the test document. We do this to ensure the approach remains robust to rotations or shifts in the coordinates that can occur during document scanning.
We match the extracted words against all the keywords in the training data. Once we extract the matched keywords, we then form a coordinate matrix for the test cases, comprising the top and left positions.
The test case contains the term "Account No." followed by the account number "061111-11111111", and the term "Account Name" with the value "Jane Doe". Thus, the coordinate matrix for the test case is shown in Tab. 2. The second column of the table lists all keywords present in the training coordinate matrix.
| Document ID | Keyword | Top | Left |
|---|---|---|---|
| Test case | Account No. | 1120 | 230 |
| Test case | Account Holder | – | – |
| Test case | Account Type | – | – |
| Test case | Account Name | 101 | 360 |
Table 2: Coordinate Matrix for the Test Case
We then compute the distance between each keyword in the test data and every training sample. To calculate the distance between keywords, the Manhattan distance was used. We chose the Manhattan distance because, relative to the Euclidean distance, it penalizes diagonal shifts more heavily than purely horizontal or vertical shifts, and horizontal or vertical shifts are the ones most likely to occur in a document due to an extra empty line or space.
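Concretely, for two keyword positions recorded as (top, left) pairs, the Manhattan distance is:

```math
d\big((t_1, l_1), (t_2, l_2)\big) = \lvert t_1 - t_2 \rvert + \lvert l_1 - l_2 \rvert
```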
We have introduced only one parameter in this algorithm.
Maximum Penalty is the maximum distance allowed between two keywords. If the distance between the same keyword in two documents exceeds this threshold, the actual distance is replaced by this value. In addition, when a keyword is not found in a document, the distance for that keyword is set to this value. In other words, if the distance between a keyword and its counterpart is more than the maximum penalty, we assume that the keyword is not found in the document.
We defined this parameter for two reasons:
- There should be a distance value even when we cannot locate a keyword in a document; otherwise, we cannot apply the mean function in the next step.
- We aim to ensure the algorithm's robustness by limiting the impact of a single keyword. Otherwise, if a keyword in two documents is very far apart, either due to poor OCR extraction quality or because the document is a variant of the training data, the calculated distance would be too large.
Finally, we compute the mean distance over all keywords in each training sample to obtain a similarity score between the test case and that sample. The test case is assigned to the class of the training sample with the minimum distance. Algorithm [classify] lists the steps, and Example [example] demonstrates the calculation; a sketch of the classification step follows.
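A minimal sketch of this classification step, matching the description above and reusing the hypothetical `find_keyword` helper and `CoordinateMatrix` type from the earlier sketch (an illustration, not the authors' implementation):

```python
from typing import List, Tuple

def manhattan(a: Coordinate, b: Coordinate) -> int:
    """Manhattan distance between two (top, left) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def classify(test_words: List[Tuple[str, int, int]],
             matrix: CoordinateMatrix,
             max_penalty: int = 200) -> Tuple[str, float]:
    """Assign a test document to the training sample with the minimum mean
    keyword distance, applying the maximum-penalty substitution."""
    all_keywords = {kw for sample in matrix.values() for kw in sample}
    test_positions = {kw: find_keyword(test_words, kw) for kw in all_keywords}

    best_doc, best_score = None, float("inf")
    for doc_id, keywords in matrix.items():
        distances = []
        for kw, train_pos in keywords.items():
            test_pos = test_positions[kw]
            if test_pos is None:                     # keyword not found in test case
                distances.append(max_penalty)
            else:                                    # cap the distance at the penalty
                distances.append(min(manhattan(train_pos, test_pos), max_penalty))
        score = sum(distances) / len(distances)      # mean distance to this sample
        if score < best_score:
            best_doc, best_score = doc_id, score
    return best_doc, best_score
```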
Let us assume that the maximum_penalty is 200. Table 3 shows the distance calculation between the training data and the test case for each keyword.
| Document ID | Keyword | Calculation | Distance |
|---|---|---|---|
| Statement A | Account No. | $`\lvert 254 - 1120\rvert + \lvert 1231 - 230\rvert = 1867 > 200`$ | 200 |
| Statement A | Account Holder | Not found | 200 |
| Statement A | Account Type | Not found | 200 |
| Statement B | Account No. | $`\lvert 1123 - 1120\rvert + \lvert 231 - 230\rvert = 4`$ | 4 |
| Statement B | Account Name | $`\lvert 101 - 100\rvert + \lvert 360 - 359\rvert = 2`$ | 2 |

Table 3: The Distance between the Training Data and the Test Case for Each Keyword
The distance for "Account No." in Statement A is 200 because the Manhattan distance exceeds 200 pixels. For "Account Holder" and "Account Type", the distances are 200 because these fields are not present in the test case. As a result, the mean distance between the test case and Statement A is 200, while for Statement B it is the average of 4 and 2. Therefore, the test case belongs to the Statement B class with a score of 3.
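As a sanity check, feeding the coordinates from Tables 1 and 2 into the `classify` sketch above reproduces this example (the test-case word tuples below are illustrative):

```python
# Coordinate matrices taken from Tables 1-2; word tuples are illustrative.
matrix = {
    "Statement A": {"Account No.": (254, 1231), "Account Holder": (261, 1231),
                    "Account Type": (269, 1231)},
    "Statement B": {"Account No.": (1123, 231), "Account Name": (100, 359)},
}
test_words = [("Account", 1120, 230), ("No.", 1120, 290),
              ("Account", 101, 360), ("Name", 101, 420)]
print(classify(test_words, matrix, max_penalty=200))
# -> ('Statement B', 3.0): mean of 4 and 2, versus 200 for Statement A
```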
Complexity of the Algorithm
Let us consider a training dataset $`\mathcal{T}`$ of $`N`$ samples, where each sample comprises several keywords and $`M`$ is the total number of keywords across all $`N`$ samples. The time complexity of classifying one test case containing $`L`$ words is $`O(LM)`$, because each of the $`M`$ available keywords, which may consist of several words, must be searched for in the test document. The computational complexity of our algorithm is therefore linear in the size of the test data. Assuming that the number of keywords per document is the same or within a fixed range, $`M`$ is proportional to $`N`$ (i.e., $`M \sim N`$), so the complexity is $`O(LN)`$. This means that the computational complexity of our algorithm is also linear in the number of training samples.
Experiments
In our experimental analysis, we compare six vectorization and nine classification techniques. To ensure our research remains grounded in Green AI principles (prioritizing environmental sustainability and hardware accessibility), we intentionally exclude high-parameter LLMs that require GPU-intensive inference. Instead, we evaluate purpose-built small models and efficient benchmarks.
Our selection includes three Bag of Words vectorizer techniques (Term Frequency Vectorizer, TF-IDF Vectorizer, Hashing Vectorizer), two word embedding techniques (Global Vectors (GloVe) 6B pre-trained tokens, and Google's pre-trained Word2Vec), and one paragraph embedding technique (Doc2Vec). We specifically chose GloVe (Global Vectors for Word Representation) over modern transformer-based benchmarks like DistilBERT. Although DistilBERT is a "distilled" and smaller model, it still requires multiple transformer-layer passes, which consume significant CPU clock cycles. In contrast, GloVe provides static embeddings that function as a high-speed lookup table ($`O(1)`$ inference complexity). This choice allows us to maintain near-zero latency and a minimal carbon footprint, ensuring that our results are directly applicable to resource-constrained, CPU-only environments common in on-premises industrial data processing, where infrastructure costs and processing speed are mandatory constraints.
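For context, the sketch below shows how a static-embedding document vector can be assembled as a plain dictionary lookup followed by an (optionally TF-IDF-weighted) average. The GloVe text-file format and the weighting scheme are assumptions about how a "GloVe with TF-IDF Vectorizer" baseline is commonly built, not a description of the authors' exact pipeline.

```python
import numpy as np

def load_glove(path: str) -> dict:
    """Load GloVe vectors into a dict: one O(1) lookup per token at inference.
    Assumes the standard text format of 'token v1 v2 ... vD' per line."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            token, *values = line.split()
            vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors

def embed_document(tokens, vectors, tfidf_weights=None, dim=300):
    """Average (optionally TF-IDF weighted) static embeddings for one document."""
    acc, total = np.zeros(dim, dtype=np.float32), 0.0
    for tok in tokens:
        vec = vectors.get(tok.lower())
        if vec is None:
            continue                                  # out-of-vocabulary token
        w = tfidf_weights.get(tok.lower(), 1.0) if tfidf_weights else 1.0
        acc += w * vec
        total += w
    return acc / total if total else acc
```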
The classification algorithms used are: Logistic Regression, Decision Tree (with CART and C4.5), SVMs (with linear, polynomial, RBF, and Gaussian kernels), Random Forest, Naïve Bayes, and k-Nearest Neighbour. We also compared against several deep learning classifiers: an Artificial Neural Network, a CNN, and an LSTM-based RNN.
Data Set
We have randomly selected 475 statements from five banks that were crowdsourced for this research. The distribution of the banks' statements is shown in Fig. 1. These statements fall under 53 templates, 16 of which have only 1-2 samples.
Figure 1: Number of statements per bank (Commonwealth Bank, Westpac Bank, National Australia Bank, ING Bank, Suncorp Bank).
Experiment Design
Each of the vectorizers is applied with each classifier, except Naïve Bayes, which we could not use with the Hashing, Google's Word2Vec, and Doc2Vec vectorizers (because it is not compatible with them), making a total of 51 combinations to compare against our approach.
We have set aside 127 samples as test data while optimising the parameters using grid search, and used the remaining 348 to train the models. We have ensured that at least one sample of each minority class remains in the training set. We employed 10-fold cross-validation (CV) for the machine learning models and 3-fold CV for the deep learning models to select the best parameters on the training set. Table 4 shows all the parameter values used during the grid search. Once the best parameters are identified, we train the final model on the entire training set and evaluate it on the 127 test data points.
| Algorithm | Parameters |
|---|---|
| Frequency Vectorizer, TF-IDF Vectorizer, Hashing Vectorizer | ngram_range: [1, 4); max_features: {20, 40, 60, …, 6800} |
| GloVe, Google's Word2Vec, Doc2Vec | max_features: {20, 40, 60, …, 6800} |
| Logistic Regression | penalty: {none, $`\ell_2`$}; C: {0.000001, 0.009, 0.001, 0.09, 0.01, 1, 5, 10, 25}; max_iter: {100, 120, 130, 140, 150} |
| Decision Tree | criterion: {gini, entropy}; min_samples_split: {2, 4, 5, 6}; max_features: {auto, sqrt, log2, none} |
| Support Vector Machine | C: {0.001, 0.01, 0.1, 1, 10, 100}; kernel: {linear, poly, rbf, sigmoid}; degree: {1, 2, 3}; tol: {0.0001, 0.001, 0.1}; decision_function_shape: {ovo, ovr} |
| Random Forest | n_estimators: {10, 11, 13, 15, 100, 115, 120, 125, 150, 200}; criterion: {gini, entropy}; min_samples_split: {2, 4, 5, 6}; max_features: {auto, sqrt, log2, None} |
| Naïve Bayes | alpha: {0., 0.0001, 0.001, 0.01, 0.1, 1, 10}; fit_prior: {True, False} |
| $`k`$-Nearest Neighbour | n_neighbors: {3, 5, 7, 9, 11, 13}; algorithm: {ball_tree, kd_tree, brute}; leaf_size: {30, 35, 30, 45, 50, 55} |
| Artificial Neural Network, CNN, LSTM-based RNN | activation: {relu, tanh}; optimizer: {SGD, Adam, Adamax, Adagrad, Adadelta, Nadam, RMSprop}; epochs: {10, 50, 100}; learn_rate: {0.01, 0.1, 0.2} |

Table 4: Parameters Used during Grid Search for Each Algorithm from Scikit-learn and Keras
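For illustration, the grid-search protocol for one vectorizer/classifier pair (TF-IDF with Logistic Regression) could look like the sketch below. The toy corpus, the reduced parameter grid, and the macro-averaged F1 scorer are our assumptions; the real experiment uses the 348 training and 127 test statements and the full ranges from Table 4.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder corpus so the sketch runs end to end; replace with the statements.
train_texts = ["account name balance date"] * 30 + ["transaction amount description"] * 30
train_labels = ["template_a"] * 30 + ["template_b"] * 30

pipeline = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=150)),
])
param_grid = {                       # reduced grid for illustration (see Table 4)
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__C": [0.01, 1, 5, 10, 25],
}
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="f1_macro", n_jobs=-1)
search.fit(train_texts, train_labels)
print(search.best_params_, search.best_score_)
```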
To assess the effect of training set size, we first trained each model on 53 samples and evaluated its performance on the remaining data. We then augmented the training data with 59 additional samples and re-evaluated the model on the remaining data. In total, we trained each model on 53, 112, 171, 230, 289, and 348 training data points and tested its performance on the remaining data.
Results
Figure 2 shows the results for the different algorithms used with the different vectorization methods as the size of the training data grows.
Figure 2: F-measure (%) versus training set size (53-348 samples) for each classifier (Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, Naïve Bayes, k-Nearest Neighbour, Artificial Neural Network, Convolutional Neural Network, LSTM-based Recurrent Neural Network); lines denote the vectorizers (Term Frequency, TF-IDF, Hashing, GloVe with TF-IDF, Google's Word2Vec, Doc2Vec), with CM$`^2`$ shown as a reference line.
Each chart corresponds to a single machine learning approach, and the colours denote the vectorization method. As expected, the line charts show an upward trend, indicating that performance improves with increasing numbers of training samples. This upward trend also demonstrates the trade-off between labelled data and performance.
On average, the Hashing Vectorizer outperforms other vectorization methods for most classifiers, particularly with smaller training datasets. Google's Word2Vec yields the worst performance across all selected classifiers. This is primarily because banks sometimes use non-standard abbreviations in transaction descriptions to fit longer words into a fixed-width text field, and Google's pre-trained model is not designed for this.
The ensemble classifier Random Forest outperforms all other classification techniques, closely followed by a simple ANN. In contrast, LSTM performed very poorly because it requires large amounts of training data and long word sequences. In our experiment, the LSTM achieved training accuracy over 98% on 348 documents, but it struggled to identify more than two samples correctly during testing. Furthermore, bank statements do not contain any sentences that could benefit from techniques such as LSTMs.
For this experiment, we ran our algorithm only once, because it uses only 53 samples (one per template); the result is shown as a grey dashed line in the charts. As we can see, our method outperforms all machine learning methods, regardless of classifier, vectorizer type, and training data size. Note that all other models are run with the best parameters obtained from the grid searches. Although we have reported the best performance of these models using the best parameters, our algorithm still yields better performance with a simple setting and the smallest training data set. This shows that selecting appropriate features and designing an appropriate approach are more important than switching or ensembling models.
Figure 3 shows the parameter sensitivity analysis for the maximum penalty, which is the only parameter required by our algorithm.
Figure 3: Sensitivity of the F-measure (%) to the maximum penalty parameter (100-500).
We observe that at the smallest maximum penalty the performance is lower, because our approach is sensitive to the coordinates of the keywords. In other words, if a keyword is shifted only slightly from its expected position, we assume the keyword is not found. In practice, the locations of keywords may change due to scanning or the presence of extra lines or spaces. At maximum_penalty = 200, the accuracy is 100%. Consider an input image with a resolution of 2480 $`\times`$ 3500: if a keyword is around 10% of the width away from its original location, it is considered not found.
Increasing the maximum penalty beyond 250, we also observe a drop in accuracy because keywords that are relatively far apart are now considered matched. Beyond 400, we do not observe a significant difference in performance. The reason is that when we set a very high threshold, only a few distances exceed it, so the penalty is rarely applied. Theoretically, all the Manhattan distances in a 2480 $`\times`$ 3500 document are less than 2480 + 3500. Therefore, setting a threshold above 2480 + 3500 is unnecessary and has no effect.
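A sketch of such a sensitivity sweep, reusing the hypothetical `classify` and `parse_ocr_xml` helpers from the earlier sketches (the test paths, labels, and trained coordinate matrix are placeholders supplied by the caller):

```python
from sklearn.metrics import f1_score

def sweep_max_penalty(test_xml_paths, test_labels, matrix,
                      values=range(100, 501, 50)):
    """Re-run classification for several maximum-penalty values and report
    the macro-averaged F-measure on the held-out documents."""
    for max_penalty in values:
        predicted = [classify(parse_ocr_xml(path), matrix, max_penalty)[0]
                     for path in test_xml_paths]
        print(max_penalty, f1_score(test_labels, predicted, average="macro"))
```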
Discussion & Sustainability
We have demonstrated that, when appropriately designed, human-level concept learning is far superior to machine learning methods, especially when learning from limited data. In contrast to machine learning methods, which require hundreds of samples per class to learn the concept, our approach requires only one sample and leverages the concept more richly. As with humans, our approach attempts to match the test case to each known sample, estimate similarities, and select the most similar class. Upon closer inspection, our approach identifies keywords and matches them across documents. However, if the matched keywords are too far apart, the match is discarded, and the keyword is assumed to have no match. Similarly, as humans, we would not match keywords that are too far apart to detect a document type.
By avoiding the "black-box" nature of large-scale transformer architectures, CM$`^2`$ provides inherent interpretability and a significantly lower carbon footprint. This aligns our work with modern requirements for audit-compliant and sustainable AI, proving that specialized structural intelligence can outperform general-purpose models in niche industrial domains.
Another advantage of human-level concept learning is the way we set the parameters. In human-level concept learning, our understanding of the problem helps us define the values of the algorithm's parameters. In contrast, in machine learning approaches, a parameter optimization technique (e.g., grid search) is required to find the best values. This can be very time-consuming, depending on the number of parameters to tune, the number of values considered for each parameter, and the algorithm's computational complexity. Furthermore, the algorithm must be run for each combination of parameter settings. Still, there is no guarantee that the best parameter value is among the considered values.
When we want to set a value for the maximum penalty, we should ask ourselves: "What is the maximum distance that we want to allow when matching keywords?" or "How far can a keyword shift due to an extra line or extra space in a document?" Based on sample documents, we estimate that $`200`$ is an appropriate value for the maximum penalty. It is certainly better to check other values, but we expect the best value to be around $`200`$. On the other hand, assume we are using a deep learning model and want to set the values of $`\ell_1`$ and $`\ell_2`$ regularization. It is difficult to interpret these values in the context of our problem. Therefore, we must choose them using trial and error or a grid search. In most cases, grid search with more parameters tends to overfit.
Designing human-level concept learning begins by considering how humans perform the task. First, we examine the data, and then we should ask ourselves, as humans, how we solve the challenge. How do we recognize a template for bank statements whose contents are almost identical? The answer to the question identifies which feature to use and how to design the approach.
Conclusion
In this paper, we introduced the Coordinate Matrix Machine (CM$`^2`$), a novel framework for one-shot document classification that mimics human-level concept learning. By identifying the important structural features a human would consider, rather than processing exhaustive semantic data, CM$`^2`$ successfully addresses the challenge of classifying highly similar documents with only a single sample per class.
Our experimental results demonstrate that CM$`^2`$ outperforms traditional vectorization techniques and complex deep learning models in accuracy, speed, and robustness. Beyond performance metrics, this research makes a significant contribution to the Green AI initiative. We have demonstrated that purpose-built small models can achieve superior results in specialized domains without the immense carbon footprint, energy consumption, and hardware costs associated with Large Language Models (LLMs).
Furthermore, CM$`^2`$ addresses the âblack-boxâ limitations of modern neural network-based architectures. By prioritizing coordinate-based structural intelligence over linear sequences, our model provides inherent explainability, a mandatory requirement for audit-compliant industrial applications. We intentionally used static embeddings, such as GloVe, to maintain near-zero inference latency on standard CPUs, demonstrating that hardware-agnostic, low-cost AI is a viable and effective alternative to resource-heavy infrastructure.
Future work will explore the application of CM$`^2`$ in broader multi-modal contexts and its integration into real-time, high-throughput financial pipelines, continuing our commitment to sustainable, transparent, and efficient machine learning.
Acknowledgement
The authors thank the AI4Convery team for crowdsourcing 500 bank statements, anonymising them by removing personally identifiable information (PII), providing the statements in PDF format, and running OCR to produce XML for this research.