LEAN

LLM-Efficient Adaptive Notation

A token-efficient serialization format that cuts token count by about 47% versus pretty-printed JSON while preserving full data fidelity.

-47% vs JSON
100% Lossless
0 Dependencies

Features

Tabular Arrays

Arrays of objects with shared keys are collapsed into a compact header + rows table format.

Dot-Flattening

Nested objects are flattened into dot-separated key paths, eliminating structural nesting overhead.
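
As a rough illustration of dot-flattening, here is a minimal Python sketch. The function name `flatten` and the choice to keep empty objects as leaf values are assumptions made for this example, not LEAN's actual implementation, and the sketch assumes keys do not themselves contain dots.

```python
from typing import Any, Dict


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dot-separated key paths (illustrative only)."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested objects, extending the key path.
            flat.update(flatten(value, path))
        else:
            # Scalars, lists, and empty dicts are kept as leaf values.
            flat[path] = value
    return flat


config = {"server": {"host": "localhost", "tls": {"enabled": True}}, "debug": False}
print(flatten(config))
# {'server.host': 'localhost', 'server.tls.enabled': True, 'debug': False}
```

Each nested level becomes part of the key instead of a separate indented block, which is where the structural savings come from.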

Bare Strings

String values that contain no special characters are written without quotes, saving up to two tokens each.
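
The quoting decision can be sketched as follows. The set of "special" characters used here (`|`, `^`, `:`, `#`, quotes, backslashes, newlines) and the reserved words are assumptions inferred from the delimiters visible in the examples on this page; the real LEAN rules may differ, and `encode_string` is a hypothetical helper name.

```python
import re

# Hypothetical rule set: characters that would collide with delimiters seen in LEAN examples.
SPECIAL = re.compile(r'[|^:#"\\\n]')
# Hypothetical reserved words that would otherwise be read back as non-strings.
RESERVED = {"true", "false", "null"}


def looks_numeric(text: str) -> bool:
    """Return True if the text would parse as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def encode_string(value: str) -> str:
    """Write a string bare when that is unambiguous, quoted otherwise."""
    needs_quotes = (
        value == ""
        or value != value.strip()          # leading/trailing whitespace would be lost
        or value in RESERVED
        or looks_numeric(value)            # "92000" must stay distinguishable from 92000
        or SPECIAL.search(value) is not None
    )
    if not needs_quotes:
        return value
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


print(encode_string("Paul Garcia"))  # Paul Garcia
print(encode_string("a|b"))          # "a|b"
print(encode_string("92000"))        # "92000"
```

Quoting numeric-looking and reserved strings is what keeps the round trip lossless in this sketch: a bare `92000` can then always be read back as a number.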

Semi-Tabular

Mixed arrays with partially shared keys use a hybrid format that adapts to the data shape.

Benchmarks

We ran a comprehensive benchmark comparing three data serialization formats when used as LLM context: JSON (pretty-printed), LEAN (compact tabular encoding), and YAML. The goal was to answer two questions. How many tokens does each format burn to represent the same data? And can LLMs actually understand compressed formats as well as JSON?

Short answer: LEAN uses 47% fewer tokens and scores higher accuracy than JSON. Not equal. Higher.

195 questions
11 datasets
2 models
3 formats
1,170 LLM calls

Efficiency Ranking

Accuracy per 1K tokens spent. Higher is better.

LEAN    22.3 acc%/1K tok
YAML    15.5 acc%/1K tok
JSON    11.6 acc%/1K tok

Format   Accuracy   Avg Tokens   Savings vs JSON
LEAN     87.9%      3,939        -46.8%
YAML     87.4%      5,647        -23.7%
JSON     86.2%      7,401        baseline

LEAN achieves 87.9% accuracy (vs JSON's 86.2%) while using 46.8% fewer tokens.
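
The efficiency figures can be reproduced from the accuracy and average-token columns. The sketch below assumes the metric is simply accuracy (in percent) divided by average tokens in thousands, which matches the published numbers; the function name is made up for this example.

```python
# (accuracy %, average tokens per call), taken from the results table.
results = {
    "LEAN": (87.9, 3939),
    "YAML": (87.4, 5647),
    "JSON": (86.2, 7401),
}


def accuracy_per_1k_tokens(accuracy_pct: float, avg_tokens: int) -> float:
    """Accuracy percentage points obtained per 1,000 tokens of context."""
    return accuracy_pct / (avg_tokens / 1000)


for name, (accuracy, tokens) in results.items():
    print(f"{name}: {accuracy_per_1k_tokens(accuracy, tokens):.1f} acc%/1K tok")
# LEAN: 22.3 acc%/1K tok
# YAML: 15.5 acc%/1K tok
# JSON: 11.6 acc%/1K tok
```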

Token Efficiency

Token counts measured with the GPT-5 o200k_base tokenizer. JSON with 2-space indentation is the baseline.

Grand Total (all 11 datasets)

Format   Tokens   vs JSON
JSON     47,345   baseline
LEAN     26,521   -44.0%
YAML     37,369   -21.1%
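
The savings percentages are relative differences against the JSON baseline. A minimal sketch of that arithmetic, using the grand-total token counts (the helper name is hypothetical):

```python
def savings_vs_baseline(tokens: int, baseline: int) -> float:
    """Relative change in token count versus the baseline, in percent."""
    return (tokens - baseline) / baseline * 100


json_total = 47345  # JSON grand total across all 11 datasets
for name, tokens in {"LEAN": 26521, "YAML": 37369}.items():
    print(f"{name}: {savings_vs_baseline(tokens, json_total):+.1f}%")
# LEAN: -44.0%
# YAML: -21.1%
```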

Flat-Only Track

Uniform tabular structures. This is where LEAN really shines.

Dataset                                  JSON      LEAN               YAML
👥 Uniform employee records (100 rows)   6,150     2,361 (-61.6%)     4,777 (-22.3%)
📈 Time-series analytics (60 days)       3,609     1,461 (-59.5%)     2,882 (-20.1%)
⭐ Top 100 GitHub repositories           13,810    7,434 (-46.2%)     11,667 (-15.5%)
Flat Track Total                         29,652    14,512 (-51.1%)    24,021 (-19.0%)

The Flat Track Total is larger than the sum of the three rows above because it also covers the five small uniform validation datasets listed in the Dataset Catalog.

Mixed-Structure Track

Datasets with nested or semi-uniform structures.

Dataset                                  JSON      LEAN               YAML
🛒 E-commerce orders (50 orders, nested) 10,731    6,521 (-39.2%)     7,765 (-27.6%)
🧾 Semi-uniform event logs (75 logs)     6,252     5,028 (-19.6%)     5,078 (-18.8%)
🧩 Deeply nested configuration           710       460 (-35.2%)       505 (-28.9%)
Mixed Track Total                        17,693    12,009 (-32.1%)    13,348 (-24.6%)

Retrieval Accuracy

Per-Model Accuracy

Model              Format   Accuracy   Correct
gpt-4o-mini        YAML     88.7%      173/195
gpt-4o-mini        LEAN     88.2%      172/195
gpt-4o-mini        JSON     87.2%      170/195
claude-haiku-4-5   LEAN     87.7%      171/195
claude-haiku-4-5   YAML     86.2%      168/195
claude-haiku-4-5   JSON     85.1%      166/195

On claude-haiku-4-5, LEAN outperforms JSON by 2.6 percentage points while using roughly half the tokens.

By Question Type

Question Type           JSON     LEAN     YAML
Field Retrieval         78.0%    81.1%    79.5%
Aggregation             82.7%    83.6%    82.7%
Filtering               100.0%   100.0%   100.0%
Structure Awareness     93.3%    96.7%    98.3%
Structural Validation   80.0%    80.0%    80.0%

By Dataset

Dataset                          JSON                  LEAN                 YAML
Employee records (100, flat)     82.5% / 6,150 tok     83.8% / 2,361 tok    82.5% / 4,777 tok
E-commerce orders (50, nested)   97.4% / 10,731 tok    98.7% / 6,521 tok    98.7% / 7,765 tok
Time-series (60, flat)           73.2% / 3,609 tok     76.8% / 1,461 tok    75.0% / 2,882 tok
GitHub repos (100, flat)         67.9% / 13,810 tok    69.6% / 7,434 tok    69.6% / 11,667 tok
Event logs (75, semi-uniform)    94.4% / 6,252 tok     98.1% / 5,028 tok    98.1% / 5,078 tok
Nested config (deep)             100% / 710 tok        100% / 460 tok       100% / 505 tok

LEAN matches or beats JSON on every single dataset, while using 20-62% fewer tokens.

What the Formats Look Like

Employee records. Same data, three formats.

JSON 6,150 tokens for 100 rows
{
  "employees": [
    {
      "id": 1,
      "name": "Paul Garcia",
      "email": "paul.garcia@company.com",
      "department": "Engineering",
      "salary": 92000,
      "yearsExperience": 19,
      "active": true
    },
    {
      "id": 2,
      "name": "Aaron Davis",
      "email": "aaron.davis@company.com",
      "department": "Finance",
      "salary": 149000,
      "yearsExperience": 18,
      "active": false
    }
  ]
}
LEAN 2,361 tokens (-61.6%)
employees:
  #[100](active|department|email|id|name|salary|yearsExperience)
  true|Engineering|paul.garcia@company.com|1|Paul Garcia|92000|19
  ^false|Finance|aaron.davis@company.com|2|Aaron Davis|149000|18

The #[100] header declares the row count and column names once. Each row is pipe-delimited, rows separated by ^. No repeated keys, no braces, no quotes. Just data.
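
To make the layout concrete, here is a small Python sketch that produces the tabular form shown above from a list of uniform records. It is an illustration rather than the LEAN encoder itself: the function names are invented, sorting the column names alphabetically is inferred from the example, writing `None` as `null` is an assumption, and cell quoting is left out.

```python
from typing import Any, Dict, List


def format_cell(value: Any) -> str:
    """Render a scalar the way the example above shows it."""
    if isinstance(value, bool):  # check bool first: True is also an int in Python
        return "true" if value else "false"
    if value is None:
        return "null"  # assumption: the page does not show how nulls are written
    return str(value)


def encode_table(key: str, rows: List[Dict[str, Any]]) -> str:
    """Encode uniform records as a header line plus pipe-delimited rows."""
    if not rows:
        raise ValueError("need at least one row")
    columns = sorted(rows[0])  # alphabetical order, as in the example
    if any(sorted(row) != columns for row in rows):
        raise ValueError("rows do not share the same keys")
    lines = [f"{key}:", f"  #[{len(rows)}]({'|'.join(columns)})"]
    for index, row in enumerate(rows):
        cells = "|".join(format_cell(row[column]) for column in columns)
        # Every row after the first is prefixed with the ^ row separator.
        lines.append(f"  {'^' if index else ''}{cells}")
    return "\n".join(lines)


employees = [
    {"id": 1, "name": "Paul Garcia", "email": "paul.garcia@company.com",
     "department": "Engineering", "salary": 92000, "yearsExperience": 19, "active": True},
    {"id": 2, "name": "Aaron Davis", "email": "aaron.davis@company.com",
     "department": "Finance", "salary": 149000, "yearsExperience": 18, "active": False},
]
print(encode_table("employees", employees))
```

With these two records the output matches the LEAN example above, except that the header reads `#[2]` because only two rows are encoded.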

YAML 4,777 tokens (-22.3%)
employees:
  - active: true
    department: Engineering
    email: paul.garcia@company.com
    id: 1
    name: Paul Garcia
    salary: 92000
    yearsExperience: 19
  - active: false
    department: Finance
    email: aaron.davis@company.com
    id: 2
    name: Aaron Davis
    salary: 149000
    yearsExperience: 18

YAML removes braces and quotes but still repeats every key per row.

How We Tested

1. Format conversion: each dataset is converted to all 3 formats.
2. Query LLM: the model receives the formatted data plus a question and extracts the answer.
3. Deterministic validation: type-aware comparison (92000 matches $92,000, case-insensitive). No LLM judge.

Models: gpt-4o-mini, claude-haiku-4-5-20251001
Tokenizer: gpt-tokenizer with o200k_base (GPT-5)
Temperature: Default (not set)
Evaluation: Deterministic string/number matching
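
A type-aware comparison of the kind described in the validation step might look like the sketch below. The normalization rules (strip `$`, commas and whitespace, compare numbers numerically, compare everything else case-insensitively) are an assumption based on the one example given; the benchmark's actual matcher is not shown on this page, and the function names are hypothetical.

```python
import re
from typing import Any, Union


def normalize(value: Any) -> Union[float, str]:
    """Reduce an answer to a number when possible, else to lowercase text."""
    text = str(value).strip().lower()
    # Drop currency signs, thousands separators, and inner whitespace before parsing.
    numeric = re.sub(r"[$,\s]", "", text)
    try:
        return float(numeric)
    except ValueError:
        return text


def answers_match(expected: Any, actual: Any) -> bool:
    """Deterministic comparison: numeric when both sides are numbers, textual otherwise."""
    left, right = normalize(expected), normalize(actual)
    if isinstance(left, float) and isinstance(right, float):
        return abs(left - right) < 1e-9
    return left == right


print(answers_match(92000, "$92,000"))              # True
print(answers_match("Engineering", "engineering"))  # True
print(answers_match(17, "18"))                      # False
```

Because the check is a pure function of the two strings, rerunning the benchmark gives the same verdict for the same model output, which is the point of avoiding an LLM judge.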

Question Types

195 questions generated dynamically across five categories.

Category                Share   Example
Field retrieval         34%     "What is Paul Garcia's salary?" → 92000
Aggregation             28%     "How many employees in Engineering?" → 17
Filtering               20%     "Active Sales employees with > 5 years?" → 8
Structure awareness     15%     "How many employees in the dataset?" → 100
Structural validation   3%      "Is this data complete and valid?" → NO

Dataset Catalog

Dataset                    Rows   Structure      Questions
Uniform employee records   100    uniform        40
E-commerce orders          50     nested         38
Time-series analytics      60     uniform        28
Top 100 GitHub repos       100    uniform        28
Semi-uniform event logs    75     semi-uniform   27
Deeply nested config       11     deep           29
Valid complete (control)   20     uniform        1
Truncated array            17     uniform        1
Extra rows                 23     uniform        1
Width mismatch             20     uniform        1
Missing fields             20     uniform        1
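
The last five datasets probe whether a model notices structural damage. Because a LEAN table header declares its row count and column names, the same kinds of damage can also be detected programmatically. The sketch below is a hypothetical checker, not part of the benchmark: it assumes one row per line, no quoted cells containing `|`, and input that starts at the `#[N](...)` header line.

```python
import re
from typing import List

# Matches a header such as "#[100](id|name)" and captures the count and the column list.
HEADER = re.compile(r"^#\[(\d+)\]\((.*)\)$")


def validate_table(block: str) -> List[str]:
    """Return a list of structural problems; an empty list means the table looks complete."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    if not lines:
        return ["empty input"]
    match = HEADER.match(lines[0])
    if not match:
        return ["missing or malformed #[N](...) header"]
    declared = int(match.group(1))
    columns = match.group(2).split("|")
    # Strip the leading ^ row separator before splitting each row into cells.
    rows = [line.lstrip("^").split("|") for line in lines[1:]]
    problems = []
    if len(rows) != declared:
        problems.append(f"header declares {declared} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        if len(row) != len(columns):
            problems.append(f"row {number} has {len(row)} cells, expected {len(columns)}")
    return problems


print(validate_table("#[2](id|name)\n1|Ann\n^2|Bob"))  # []
print(validate_table("#[3](id|name)\n1|Ann\n^2|Bob"))  # ['header declares 3 rows, found 2']
print(validate_table("#[2](id|name)\n1|Ann\n^2"))      # ['row 2 has 1 cells, expected 2']
```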

Key Takeaways

1. LEAN saves ~47% tokens per LLM call compared to JSON, which directly translates to lower API costs.

2. Accuracy doesn't suffer. LEAN actually scored 1.7 percentage points higher than JSON (87.9% vs 86.2%).

3. On flat tabular data, LEAN saves 46-62% per dataset (51% across the whole flat track). If your data is arrays of uniform objects, the savings are massive.

4. YAML is a solid middle ground. 21-24% token savings over JSON with comparable accuracy.

5. Both models showed the same pattern. This isn't model-specific; compressed formats work across providers.

If you're stuffing structured data into LLM prompts, you're probably wasting close to half your tokens on JSON syntax. LEAN gives you the same (or better) accuracy for roughly half the cost.

Ecosystem

Claude Code Plugin

Install the TOON formatting plugin for Claude Code to get compact data output in your terminal.

claude plugin add fiialkod/toon-formatting-plugin

MCP Server

Connect any MCP client to the TOON/LEAN encoding server for automated format conversion.

{
  "mcpServers": {
    "toon": {
      "command": "npx",
      "args": ["@fiialkod/toon-mcp-server"]
    }
  }
}