Invoice GL Code Classification
Explore 5,566 real invoices and predict GL codes using Aito predictive queries. The dataset comes from a public invoice classification benchmark with 9 GL codes, 1,253 vendors, and 36 product categories.
Database: aito-datasets
About This Dataset
Overview
Invoice GL Code Classification Dataset
This dataset contains 5,566 invoices from a public invoice classification benchmark. Each invoice has a natural-language description and must be classified into one of 9 GL (General Ledger) codes.
Columns
- Inv_Id — Invoice identifier
- Vendor_Code — Vendor identifier (1,253 unique vendors)
- GL_Code — Target GL code for classification (9 codes)
- Inv_Amt — Invoice amount
- Item_Description — Natural language description of the invoice item
- Product_Category — Product category code (36 categories)
Key Properties
- 60 vendors map to multiple GL codes — real classification ambiguity
- 3 product categories are shared across GL codes
- Descriptions are messy natural language with shuffled word order
- Used in the Predictive Databases vs LLMs comparison experiment
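The vendor ambiguity listed above can be checked on any export of the table. The following is a minimal sketch, assuming the rows are available locally as dictionaries keyed by the column names; the helper name and the sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def ambiguous_vendors(rows):
    """Return {vendor: sorted GL codes} for vendors seen with more than one GL code."""
    codes_by_vendor = defaultdict(set)
    for row in rows:
        codes_by_vendor[row["Vendor_Code"]].add(row["GL_Code"])
    return {
        vendor: sorted(codes)
        for vendor, codes in codes_by_vendor.items()
        if len(codes) > 1
    }


# Made-up sample rows for illustration only.
sample_rows = [
    {"Vendor_Code": "VENDOR-A", "GL_Code": "CLASS-1"},
    {"Vendor_Code": "VENDOR-A", "GL_Code": "CLASS-2"},
    {"Vendor_Code": "VENDOR-B", "GL_Code": "CLASS-1"},
]

print(ambiguous_vendors(sample_rows))
# {'VENDOR-A': ['CLASS-1', 'CLASS-2']}
```

Run against the full table, the length of the returned dictionary would be expected to match the 60 ambiguous vendors reported above.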
Explore Data
Browse invoices
Interactive query — results displayed as table.
GL code distribution
Interactive query — results displayed as table.
Predict GL Code
Try predicting GL codes from invoice features
How to predict
Predicting GL Codes
Use the predict endpoint to classify an invoice into a GL code based on its features. The predictive database uses Bayesian inference over the full dataset to produce calibrated probability scores.
Try modifying the where clause with different vendor codes, descriptions, and amounts to see how predictions change.
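As a rough sketch, a predict query body can be assembled like this. The from/where/predict shape follows Aito's query format as commonly documented; the table name `invoices` is hypothetical (this page names only the database), and the vendor code and description are made-up example values.

```python
import json

TABLE = "invoices"  # hypothetical table name, not stated on this page


def build_predict_query(where, target="GL_Code", limit=3):
    """Build a predict query: known fields go in `where`, the unknown field in `predict`."""
    return {
        "from": TABLE,
        "where": dict(where),
        "predict": target,
        "limit": limit,
    }


query = build_predict_query(
    {
        "Vendor_Code": "VENDOR-1676",  # made-up example value
        "Item_Description": "store management real estate lease",  # made-up text
    }
)

print(json.dumps(query, indent=2))
```

The body would be sent as JSON in a POST request to the instance's predict endpoint. Changing or removing entries in the where dictionary is the programmatic equivalent of editing the where clause in the interactive query.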
Predict from description
Interactive predict — results displayed as table.
Predict with explainability
Interactive predict — results displayed as json.
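A sketch of how an explanation might be requested programmatically: in Aito's query format, a select list containing `$why` asks for the reasoning behind each score alongside the probability (`$p`) and the predicted value (`feature`). The table name and vendor code below are hypothetical, and the structure of the returned explanation is not reproduced here.

```python
def with_explanation(predict_query):
    """Return a copy of a predict query that also requests the `$why` explanation."""
    explained = dict(predict_query)
    explained["select"] = ["$p", "feature", "$why"]
    return explained


# Hypothetical table name and made-up vendor code.
base_query = {
    "from": "invoices",
    "where": {"Vendor_Code": "VENDOR-1676"},
    "predict": "GL_Code",
}

explained_query = with_explanation(base_query)
print(explained_query["select"])
# ['$p', 'feature', '$why']
```

The helper copies the query rather than mutating it, so the same base query can be reused with and without explanations.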
Statistical Relationships
What relates to GL code?
Interactive relate — results displayed as table.
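A relate query can be sketched in the same way. Here the where clause fixes a condition and the relate field names the column whose values should be compared against it. The table name and the category value are assumptions made for illustration.

```python
def build_relate_query(table, condition, target="GL_Code"):
    """Build a relate query: which values of `target` are associated with `condition`?"""
    return {
        "from": table,
        "where": dict(condition),
        "relate": target,
    }


# Hypothetical table name and made-up category value.
relate_query = build_relate_query("invoices", {"Product_Category": "CLASS-1274"})
print(relate_query)
```

Swapping the condition for a vendor code instead of a product category shows which GL codes are statistically tied to that vendor.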
Evaluate Accuracy
About evaluation
Cross-Validation
The evaluate endpoint measures prediction accuracy on a held-out 20% of the data (every 5th row). Because the held-out rows are not used for training, the result estimates how well the predictive database classifies unseen invoices.
The query below uses Item_Description, Vendor_Code, and Product_Category as input features to predict GL_Code, achieving 99.5% accuracy on 1,114 test samples.
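The setup described above might be expressed as follows. This is a sketch under assumptions: the table name is hypothetical, the test selector assumes a zero-based row index where every row with index divisible by 5 is held out, and `$get` is used to read each feature from the test row being evaluated.

```python
TOTAL_ROWS = 5566
FEATURES = ["Item_Description", "Vendor_Code", "Product_Category"]


def build_evaluate_query(table, features, target, every_nth=5):
    """Build an evaluate query that holds out every n-th row as test data."""
    return {
        # Assumed selector: rows whose index modulo `every_nth` equals 0.
        "test": {"$index": {"$mod": [every_nth, 0]}},
        "evaluate": {
            "from": table,
            # Each feature value is taken from the held-out row itself.
            "where": {name: {"$get": name} for name in features},
            "predict": target,
        },
    }


evaluate_query = build_evaluate_query("invoices", FEATURES, "GL_Code")

# Indices 0, 5, 10, ..., 5565 fall in the test set.
test_rows = len(range(0, TOTAL_ROWS, 5))
print(test_rows)
# 1114
```

The row count works out to 1,114 because 5,566 is not a multiple of 5, so the held-out share is slightly above 20%.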
GL Code prediction accuracy
Interactive evaluate — results displayed as json.