LLM-Efficient Adaptive Notation (LEAN)
A token-efficient serialization format that cuts payload size nearly in half while preserving full data fidelity.
- Arrays of objects with shared keys are collapsed into a compact header + rows table format.
- Nested objects are flattened into dot-separated key paths, eliminating structural nesting overhead.
- String values that contain no special characters are written without quotes, saving two tokens each.
- Mixed arrays with partially shared keys use a hybrid format that adapts to the data shape.
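The first two rules above can be sketched in a few lines. This is an illustrative encoder, not the reference library: the function names are mine, and the header/row syntax follows the `employees` example shown later in this post (`#[count](col|col|…)` header, `|`-delimited cells, rows separated by ` ^`).

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-separated key paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def encode_table(name, rows):
    """Collapse a list of uniform objects into a header + rows table."""
    flat_rows = [flatten(r) for r in rows]
    cols = sorted(flat_rows[0])  # shared column names, declared once
    header = f"{name}: #[{len(rows)}]({'|'.join(cols)})"
    def cell(v):
        # Booleans are lowercased to match the examples in this post.
        return str(v).lower() if isinstance(v, bool) else str(v)
    body = " ^".join("|".join(cell(r[c]) for c in cols) for r in flat_rows)
    return f"{header} {body}"
```

Because the keys are emitted once in the header instead of once per object, the per-row cost drops to the values plus one delimiter per cell.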
We ran a comprehensive benchmark comparing three data serialization formats when used as LLM context: JSON (pretty-printed), LEAN (compact tabular encoding), and YAML. The goal was to answer two questions. How many tokens does each format burn to represent the same data? And can LLMs actually understand compressed formats as well as JSON?
Short answer: LEAN uses 47% fewer tokens and scores higher accuracy than JSON. Not equal. Higher.
Chart: accuracy per 1K tokens spent (higher is better).
| Format | Accuracy | Avg Tokens | Savings vs JSON |
|---|---|---|---|
| LEAN | 87.9% | 3,939 | -46.8% |
| YAML | 87.4% | 5,647 | -23.7% |
| JSON | 86.2% | 7,401 | baseline |
Token counts measured with the GPT-5 o200k_base tokenizer.
JSON with 2-space indentation is the baseline.
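The "Savings vs JSON" column is just the relative token count against the baseline; a quick check against the averages in the table:

```python
# Average token counts from the results table above.
json_tok, lean_tok, yaml_tok = 7401, 3939, 5647

lean_savings = 1 - lean_tok / json_tok   # fraction of tokens saved
yaml_savings = 1 - yaml_tok / json_tok

print(f"LEAN: -{lean_savings:.1%}, YAML: -{yaml_savings:.1%}")
# prints "LEAN: -46.8%, YAML: -23.7%"
```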
LEAN shines most on uniform tabular structures, and it holds its lead on nested and semi-uniform datasets as well.
On Claude Haiku, LEAN outperforms JSON by +2.6 percentage points while using half the tokens.
Accuracy by question type:

| Question Type | JSON | LEAN | YAML |
|---|---|---|---|
| Field Retrieval | 78.0% | 81.1% | 79.5% |
| Aggregation | 82.7% | 83.6% | 82.7% |
| Filtering | 100.0% | 100.0% | 100.0% |
| Structure Awareness | 93.3% | 96.7% | 98.3% |
| Structural Validation | 80.0% | 80.0% | 80.0% |
Accuracy and token count by dataset (accuracy / tokens):

| Dataset | JSON | LEAN | YAML |
|---|---|---|---|
| Employee records (100, flat) | 82.5% / 6,150 tok | 83.8% / 2,361 tok | 82.5% / 4,777 tok |
| E-commerce orders (50, nested) | 97.4% / 10,731 tok | 98.7% / 6,521 tok | 98.7% / 7,765 tok |
| Time-series (60, flat) | 73.2% / 3,609 tok | 76.8% / 1,461 tok | 75.0% / 2,882 tok |
| GitHub repos (100, flat) | 67.9% / 13,810 tok | 69.6% / 7,434 tok | 69.6% / 11,667 tok |
| Event logs (75, semi-uniform) | 94.4% / 6,252 tok | 98.1% / 5,028 tok | 98.1% / 5,078 tok |
| Nested config (deep) | 100% / 710 tok | 100% / 460 tok | 100% / 505 tok |
LEAN matches or beats JSON on every single dataset, while using 20-62% fewer tokens.
Employee records. Same data, three formats.

JSON:

```json
{
  "employees": [
    {
      "id": 1,
      "name": "Paul Garcia",
      "email": "paul.garcia@company.com",
      "department": "Engineering",
      "salary": 92000,
      "yearsExperience": 19,
      "active": true
    },
    {
      "id": 2,
      "name": "Aaron Davis",
      "email": "aaron.davis@company.com",
      "department": "Finance",
      "salary": 149000,
      "yearsExperience": 18,
      "active": false
    }
  ]
}
```
LEAN:

```
employees: #[100](active|department|email|id|name|salary|yearsExperience) true|Engineering|paul.garcia@company.com|1|Paul Garcia|92000|19 ^false|Finance|aaron.davis@company.com|2|Aaron Davis|149000|18
```
The #[100] header declares the row count and column names once.
Each row is pipe-delimited, rows separated by ^.
No repeated keys, no braces, no quotes. Just data.
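The layout described above can be parsed back just as mechanically. A decoder sketch, assuming exactly the syntax shown (illustrative, not the reference implementation); note that values come back as strings, since a real decoder would restore types from heuristics or a schema:

```python
import re

def decode_table(line):
    """Parse a LEAN table line into (name, declared_count, rows)."""
    m = re.match(r"(\w+): #\[(\d+)\]\(([^)]*)\) (.*)", line)
    name, count = m.group(1), int(m.group(2))
    cols = m.group(3).split("|")          # column names from the header
    rows = [dict(zip(cols, raw.split("|")))
            for raw in m.group(4).split(" ^")]  # ' ^' separates rows
    return name, count, rows
```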
YAML:

```yaml
employees:
  - active: true
    department: Engineering
    email: paul.garcia@company.com
    id: 1
    name: Paul Garcia
    salary: 92000
    yearsExperience: 19
  - active: false
    department: Finance
    email: aaron.davis@company.com
    id: 2
    name: Aaron Davis
    salary: 149000
    yearsExperience: 18
```
YAML removes braces and quotes but still repeats every key per row.
195 questions generated dynamically across five categories.
| Category | Share | Example |
|---|---|---|
| Field retrieval | 34% | "What is Paul Garcia's salary?" → 92000 |
| Aggregation | 28% | "How many employees in Engineering?" → 17 |
| Filtering | 20% | "Active Sales employees with > 5 years?" → 8 |
| Structure awareness | 15% | "How many employees in the dataset?" → 100 |
| Structural validation | 3% | "Is this data complete and valid?" → NO |
Benchmark datasets (the last five probe structural validation):

| Dataset | Rows | Structure | Questions |
|---|---|---|---|
| Uniform employee records | 100 | uniform | 40 |
| E-commerce orders | 50 | nested | 38 |
| Time-series analytics | 60 | uniform | 28 |
| Top 100 GitHub repos | 100 | uniform | 28 |
| Semi-uniform event logs | 75 | semi-uniform | 27 |
| Deeply nested config | 11 | deep | 29 |
| Valid complete (control) | 20 | uniform | 1 |
| Truncated array | 17 | uniform | 1 |
| Extra rows | 23 | uniform | 1 |
| Width mismatch | 20 | uniform | 1 |
| Missing fields | 20 | uniform | 1 |
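The validation datasets exercise exactly the checks the `#[count]` header makes cheap: declared row count vs. actual rows, and per-row column width. A minimal sketch of those checks, assuming the header layout shown earlier (hypothetical helper, not part of the library):

```python
import re

def validate(line):
    """Check a LEAN table line for truncation, extra rows, and width drift."""
    m = re.match(r"\w+: #\[(\d+)\]\(([^)]*)\) (.*)", line)
    declared = int(m.group(1))
    cols = m.group(2).split("|")
    rows = m.group(3).split(" ^")
    if len(rows) != declared:  # catches truncated arrays and extra rows
        return f"row count mismatch: declared {declared}, found {len(rows)}"
    for i, row in enumerate(rows):
        if len(row.split("|")) != len(cols):  # catches missing fields
            return f"width mismatch in row {i}"
    return "valid"
```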
LEAN saves ~47% tokens per LLM call compared to JSON, which directly translates to lower API costs.
Accuracy doesn't suffer. LEAN actually scored 1.7 percentage points higher than JSON (87.9% vs 86.2%).
On flat tabular data, LEAN saves 51-62%. If your data is arrays of uniform objects, the savings are massive.
YAML is a solid middle ground: roughly 24% token savings over JSON with comparable accuracy.
Both models showed the same pattern. This isn't model-specific; compressed formats work across providers.
Benchmark code and full results available in the repo. All data generated deterministically with a seeded PRNG for reproducibility.
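Deterministic generation as described is a one-liner with a seeded PRNG. A sketch in which the field names mirror the employee example and the seed and distributions are illustrative:

```python
import random

def make_employees(seed, n=100):
    """Regenerate the same n records for a given seed, every run."""
    rng = random.Random(seed)
    return [
        {"id": i + 1,
         "department": rng.choice(["Engineering", "Finance", "Sales"]),
         "salary": rng.randrange(60_000, 160_000, 1_000),
         "active": rng.random() < 0.8}
        for i in range(n)
    ]
```

Because the records are a pure function of the seed, the benchmark questions and their ground-truth answers can be regenerated exactly.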
Install the TOON formatting plugin for Claude Code to get compact data output in your terminal.
```shell
claude plugin add fiialkod/toon-formatting-plugin
```
Connect any MCP client to the TOON/LEAN encoding server for automated format conversion.
```json
{
  "mcpServers": {
    "toon": {
      "command": "npx",
      "args": ["@fiialkod/toon-mcp-server"]
    }
  }
}
```