VertiPaq for humans: how Power BI compresses your data

· 5 min read

TL;DR

  • VertiPaq uses 3 techniques: Value Encoding (reduce range), Dictionary Encoding (indices), RLE (sequences)
  • Less cardinality = better compression (round decimals, clean duplicates)
  • A single outlier can ruin the entire column
  • DAX Studio + VertiPaq Analyzer show you which columns take up the most space

VertiPaq is Power BI’s compression engine. It’s what lets you load millions of rows into memory and have queries fly. But nobody explains how it actually works.

Today we fix that.

The problem: wasted bits

The number 22 in binary is 10110. Five bits.

But if you store it as a 32-bit integer, you have 27 leading zeros doing nothing. Wasted space.

Multiply that by millions of rows and you see why your model takes up so much memory.
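
You can check this with plain Python:

print(bin(22))             # 0b10110
print((22).bit_length())   # 5 bits actually used; a fixed 32-bit integer reserves 32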

Technique 1: Value Encoding

Imagine a column with values between 194 and 216. To store 216 you need 8 bits (2^8 = 256).

The trick: subtract the minimum from all values.

  • Original: 194, 200, 216 → needs 8 bits
  • With offset of 194: 0, 6, 22 → needs 5 bits

Three fewer bits per row. With 10 million rows, that’s a lot of RAM.

The problem: a single outlier ruins the whole column. If you have values from 1 to 100 but someone entered 999999, the entire column needs more bits because of that one value.
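
A rough sketch of the arithmetic in Python (not VertiPaq’s actual code, just the idea):

def bits_needed(values):
    # Minimum bits to store the largest value (non-negative ints only)
    return max(values).bit_length()

column = [194, 200, 216]
offset = min(column)
print(bits_needed(column))                         # 8 bits
print(bits_needed([v - offset for v in column]))   # 5 bits (stores 0, 6, 22)

# One outlier drags the whole column up
clean = list(range(1, 101))          # 1..100
dirty = clean + [999_999]            # same column plus one bad value
print(bits_needed(clean), bits_needed(dirty))      # 7 vs 20 bits per row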

Technique 2: Dictionary Encoding

You have a “Color” column with values: Red, Red, White, Black, Blue, Red, Blue, Black, Black.

Instead of storing repeated text, VertiPaq creates a dictionary:

ID | Color
0  | Red
1  | White
2  | Black
3  | Blue

And the column only stores: 0, 0, 1, 2, 3, 0, 3, 2, 2

Four unique values = 2 bits per row (2² = 4). Instead of storing “Black” (40+ bits as a string), you store the ID 2 in just 2 bits.
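
In plain Python, the idea looks roughly like this (a sketch, not the engine’s internals):

colors = ["Red", "Red", "White", "Black", "Blue", "Red", "Blue", "Black", "Black"]

# Dictionary in order of first appearance
dictionary = list(dict.fromkeys(colors))      # ['Red', 'White', 'Black', 'Blue']
ids = {value: i for i, value in enumerate(dictionary)}

encoded = [ids[c] for c in colors]            # [0, 0, 1, 2, 3, 0, 3, 2, 2]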

Cardinality matters: fewer unique values, better compression. A column with 4 colors compresses beautifully. A column with 1 million unique IDs doesn’t compress at all.

Technique 3: Run Length Encoding (RLE)

If your data is sorted:

Q1, Q1, Q1, Q1, Q1 (310 times)
Q2, Q2, Q2, Q2, Q2 (290 times)

Instead of 600 rows, VertiPaq stores: (Q1, 310), (Q2, 290). Two entries.

The trick: only works if equal values are consecutive. If the column is unsorted (Q1, Q2, Q1, Q2…), RLE compresses nothing.
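
The same idea in a few lines of Python with itertools.groupby (again a sketch, not VertiPaq’s implementation):

from itertools import groupby

sorted_col   = ["Q1"] * 310 + ["Q2"] * 290
unsorted_col = ["Q1", "Q2"] * 300

def rle(values):
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

print(rle(sorted_col))         # [('Q1', 310), ('Q2', 290)] -> 2 entries
print(len(rle(unsorted_col)))  # 600 entries: nothing consecutive to merge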

VertiPaq decides internally how to sort to maximize RLE. You don’t control this directly.

What you CAN do

You don’t control the algorithm, but you control the material you feed it (there’s a short code sketch after these lists):

Reduce cardinality:

  • Round unnecessary decimals (3.14159 → 3.14)
  • Use date instead of datetime if you don’t need the time
  • Clean duplicates (“Madrid”, “madrid”, “MADRID” → “Madrid”)
  • Don’t import columns you don’t use

Reduce range:

  • Handle outliers before importing
  • Consider splitting problematic columns

Use correct types:

  • Numbers as numbers, not as text
  • Numeric keys compress better than text keys

Model well:

  • Dimensions compress better than repeating values in the fact table
  • Star schema exists for a reason
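
These clean-ups normally live in Power Query or your ETL, but the same ideas in pandas look like this (file and column names are made up for the example):

import pandas as pd

df = pd.read_csv("sales.csv")   # hypothetical source

# Reduce cardinality
df["amount"] = df["amount"].round(2)                           # drop noise decimals
df["order_date"] = pd.to_datetime(df["order_date"]).dt.date    # date, not datetime
df["city"] = df["city"].str.strip().str.title()                # "MADRID" / "madrid" -> "Madrid"
df = df.drop(columns=["notes"])                                # don't import what you don't use

# Use correct types
df["customer_id"] = pd.to_numeric(df["customer_id"])           # numeric key, not text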

Calculated columns vs measures

Calculated column: runs at refresh time, gets physically stored, takes up space.

Measure: only stores the formula, runs when someone looks at the visual.

That’s why “convert calculated columns to measures when you can” is an optimization rule. Same result, less storage.

Diagnosis

When your model is too heavy:

  1. DAX Studio’s VertiPaq Analyzer tells you exactly how much space each column takes up
  2. Look for high-cardinality columns that shouldn’t be
  3. Look for text columns that could be numeric IDs
  4. Check if there are columns you don’t use but imported “just in case”

Summary

VertiPaq uses three main techniques: Value Encoding (reduce range), Dictionary Encoding (indices instead of values), and RLE (compress sequences).

You don’t control the algorithm, but you control the quality of the material you give it. Less cardinality, less range, correct types, clean model.

The difference between a 2GB model and a 200MB model is usually in decisions made before writing a single line of DAX.


Bonus: VertiPaq and LLM Quantization

Working with Power BI, I wondered: can you apply these techniques to the weights of a language model?

Why it doesn’t work directly:

VertiPaq assumes sortable data with small differences, repeated values (RLE), and discrete columns. LLM weights are floats distributed ~N(0, σ), almost all unique, with no natural order.
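
A quick (illustrative) numpy check shows why dictionary encoding gets you nowhere here:

import numpy as np

weights = np.random.normal(0, 0.02, size=1_000_000).astype(np.float32)
print(np.unique(weights).size)   # ~1,000,000: nearly every weight is distinct
# The "dictionary" would be about as big as the column itself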

But there are analogous techniques:

VertiPaq            | ML equivalent
Dictionary encoding | Weight clustering (K-means on weights, store indices)
Bit-packing         | Quantization (float32 → int8/int4)
RLE                 | Pruning (remove weights ~0)
Value encoding      | Delta encoding for federated learning updates

Weight clustering is closest to dictionary encoding:

# Conceptually:
original_weights = [0.0012, 0.0015, 0.0011, 0.8234, 0.8229, 0.8241]

# Clustering (k=2):
centroids = [0.00127, 0.82347]  # "dictionary"
indices = [0, 0, 0, 1, 1, 1]    # references

# Compression: 6 floats → 2 floats + 6 small indices

Quantization (int4/int8) won because it’s simpler and modern hardware has optimized instructions. Weight clustering requires a lookup step that adds latency.
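
For contrast, a minimal sketch of symmetric int8 quantization (the simplest variant; real quantizers are per-channel, calibrated, and more careful):

import numpy as np

weights = np.random.normal(0, 0.02, size=1024).astype(np.float32)

scale = np.abs(weights).max() / 127                                 # one scale per tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)   # 4x smaller than float32
dequantized = q.astype(np.float32) * scale                          # used at inference time

print(np.abs(weights - dequantized).max())                          # small reconstruction error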

The connection: Both worlds (BI and ML) solve the same problem: representing information with fewer bits without losing what matters. VertiPaq for fast analytical queries, quantization for inference on GPUs with limited VRAM.

Ideas travel between domains. Understanding compression in one context gives you an advantage in another.

To clean your data before it hits VertiPaq, master Power Query—the transformations you do there directly impact compression ratios. And if you’re seeing weird numbers after loading, the problem might be in your ETL, not the model itself.
