Understanding Elias-Fano representation

Last updated 2023-09-05. Written by Magnus Holm (judofyr@gmail.com).

Elias-Fano representation is a powerful technique that appears to be somewhat forgotten. It’s a way to store a sorted sequence of numbers efficiently. In some circles it’s taken for granted (e.g. computer science papers on succinct data structures or inverted indexes), but on the whole you’ll find very few resources about it online. There’s not even a Wikipedia article about the topic! Even more confusingly, you’ll find Wikipedia articles about everything else related to Elias: Elias gamma coding, Elias delta coding, Elias omega coding, Shannon-Fano-Elias coding. None of these have anything to do with Elias-Fano representation.

You might find some implementations of Elias-Fano representation, and when you look at the actual code it’s deceptively simple. This might make it easy to use, but it doesn’t help in any way to understand how and why it works.

In this article we’ll slowly build up our intuition. Elias-Fano representation is based on splitting up our bits and using two different representations for the lower and higher bits. We’ll start by describing each of these representations separately to understand why we need both.

Use case: Inverted indexes

A typical use case for Elias-Fano is inverted indexes. An inverted index is a way of enabling full-text search, commonly used in search engines like ElasticSearch. The concept is quite simple: We give each of our documents a unique document ID (typically a 32-bit integer). We then loop over all of the words in all of the documents and build an index as follows:

  • The word "elias" is mentioned in documents 1, 3, 9, 12, 14, 15.

  • The word "fano" is mentioned in documents 1, 5, 9, 10, 15.

  • The word "representation" is mentioned in documents 1, 2, 14, 15.

Once we have such an inverted index we can efficiently find all documents which match a query string. To find all documents which contain “elias fano representation” we go through the inverted index and determine that documents 1 and 15 are the only ones that contain all of these words. The search engine would then retrieve the full content of these documents and do further scoring to find the best possible match.
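If you want to play with this, here’s a minimal sketch of answering such a query, assuming the index is just a Python dict of sorted posting lists (the names here are illustrative, not from any particular search engine):

index = {
    "elias": [1, 3, 9, 12, 14, 15],
    "fano": [1, 5, 9, 10, 15],
    "representation": [1, 2, 14, 15],
}

def matching_documents(words):
    # A document matches only if it appears in every word's posting list.
    result = set(index[words[0]])
    for word in words[1:]:
        result &= set(index[word])
    return sorted(result)

print(matching_documents(["elias", "fano", "representation"]))  # [1, 15]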

The inverted index will end up containing a lot of numbers, and storing these efficiently is critical for good performance. We want it to take as little space as possible, but we also want the ability to quickly "jump" around in it.

Representation 1: Fixed-size compact encoding

Let’s say we have the numbers 1, 3, 9, 12, 14, 15 and we want to store them using as few bits as possible. A simple approach would be to store them in an array of fixed-size numbers, choosing the smallest bit size that fits the largest number. For our example we see that every number fits in 4 bits:

[Figure: the numbers 1, 3, 9, 12, 14, 15 stored as six 4-bit fields]

This way we use 4 bits × 6 elements = 24 bits in total. That’s not too bad.
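As a quick sketch, here’s one way to implement this in Python (packing everything into a single big integer is my own shortcut here, not the only option):

def pack_fixed(nums):
    # The smallest width that fits the largest number: 4 bits for our example.
    width = max(nums).bit_length()
    packed = 0
    for idx, num in enumerate(nums):
        packed |= num << (idx * width)
    return packed, width

def get_fixed(packed, width, i):
    # Constant-time access: shift to the i-th field and mask it out.
    return (packed >> (i * width)) & ((1 << width) - 1)

packed, width = pack_fixed([1, 3, 9, 12, 14, 15])
print(get_fixed(packed, width, 2))  # 9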

Representation 2: Unary Delta Encoding

There’s another way of storing the numbers 1, 3, 9, 12, 14, 15 in fewer bits:

[Figure: Unary Delta Encoding of 1, 3, 9, 12, 14, 15: fifteen "0" bits with a "1" bit placed in each number's gap]

We start by laying out fifteen "0" bits. The gaps between these bits represent the numbers from 0 to 15 (labelled in gray). Then for each number in our sequence we place a "1" bit in the gap representing that number.

The total number of bits we need is 15 zeros (the highest number) + 6 ones (the number of elements) = 21 bits. Woo-hoo, we’ve saved three bits compared to the fixed-size representation! It may not sound like much, but once you have hundreds of thousands of numbers it definitely makes a difference.

I don’t think this representation has a name in the literature, but I’m sticking with "Unary Delta Encoding" for this article.

Retrieving the number at a specific position

Despite the ones being scattered all around the zeros, it turns out that there is a way of efficiently (i.e. in O(1) time) finding the n-th number in the sequence. Let’s call this function access(i), and for our sequence of 1, 3, 9, …​ we have access(0) = 1, access(1) = 3, access(2) = 9 and so on.

def access(self, i):
    # select(i) is the position of the i-th "1" bit; explained below.
    return self.select(i) - i

Wow, that was very simple, no? But why does it work? And you might wonder what select means here.

select is a quite confusingly named function since it has so many meanings in different settings, but here we’re referring to the well-defined helper from the world of succinct data structures. There we use the helpers select and rank (which are inverses of each other):

  • select(i) returns the position of the i-th occurrence (zero-indexed) of "1" in a bit array.

  • rank(i) returns the number of "1" up until the position i.

Going back to our diagram, we have the following bit array: 010010000001000100101, and select works as follows:

  • select(0) is 1 since the first "1" is in position 1.

  • select(1) is 4 since the second "1" is in position 4.

  • select(2) is 11 since the third "1" is in position 11.

Another way of thinking of select is that it returns the number of bits before the i-th "1" bit: The third "1" has 11 bits in front of it. And going back to the way we constructed the bit array we see that …

  • The third "1" will have 9 zero bits in front of it. This is because the third number in our sequence is "9" and we therefore placed it in the gap between the 9th and the 10th zero bit.

  • The third "1" will have 2 one bits in front of it. This is because we’ve already placed "1" bits for the first and the second number.

If we take the total number of bits in front (select(i)) and subtract the number of ones in front (i) we find the number of zero bits in front (select(i) - i). And the number of zero bits in front of the i-th "1" bit is indeed the value in our sequence!

There exist plenty of algorithms for select which make it efficient. The only downside is that these algorithms typically depend on an additional index over the bits, so our whole data structure will take a bit more space. Luckily this index is much, much smaller than the actual bits and won’t matter much when we’re dealing with huge sequences.
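To keep this article self-contained, here’s a deliberately naive O(n) version of both helpers over a plain bit string (I’m treating rank(i) as counting the ones strictly before position i, which is one of several conventions):

def select(bits, i):
    # Scan for the position of the i-th (zero-indexed) "1" bit.
    count = -1
    for pos, bit in enumerate(bits):
        if bit == "1":
            count += 1
            if count == i:
                return pos
    raise IndexError("fewer than i + 1 one-bits")

def rank(bits, i):
    # Count the "1" bits before position i.
    return bits[:i].count("1")

bits = "010010000001000100101"
print(select(bits, 2))  # 11
print(rank(bits, 11))   # 2

And indeed, applying access(i) = select(i) - i recovers the whole sequence:

print([select(bits, i) - i for i in range(6)])  # [1, 3, 9, 12, 14, 15]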

Iterating over the numbers

An alternative way of thinking about this representation is as follows: For every number we place d "0" bits followed by a single "1" bit, where d is the difference/delta between the current number and the previous one (for the first number, d is the number itself). Indeed, if you look at the diagram above you can see that …

  • … our first number is 1 and we start out with one "0" bit followed by a single "1" bit.

  • Then we have two more "0" bits before the next "1" bit, and our next number in the sequence is two more than one: 3.

  • Next there are six more "0" bits, and adding three and six gives the next number: 9.

This observation makes it trivial to iterate over the numbers as well. If you’re already in the middle of the sequence and you know the current number you just need to find the next "1" bit and add the number of zeros you had to jump over.
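As a sketch over a plain bit string: keep a running count of the zeros seen so far (which, by the invariant above, is the current value) and emit it at every "1" bit:

def iterate(bits):
    value = 0
    for bit in bits:
        if bit == "0":
            value += 1  # skipping a zero increases the current value by one
        else:
            yield value  # a "1" bit marks a number in the sequence

print(list(iterate("010010000001000100101")))  # [1, 3, 9, 12, 14, 15]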

Building the bit array

Our original "algorithm" for building these bits was based on "placing ones between the gaps". In practice, the building algorithm is much simpler in code:

def build(nums):
    bits = BitArray(size=len(nums) + max(nums))
    for idx, num in enumerate(nums):
        # Each "1" bit lands after num zeros plus the idx ones placed before it.
        bits.set(num + idx)
    return bits

The main reason why I didn’t start with this algorithm is that it’s not so obvious what it does: it ensures that the "1" bit for each number n has n zero bits in front of it:

  • The first number (num=1, idx=0) causes the 2nd bit to be set to "1": 01000000….

  • The second number (num=3, idx=1) causes the 5th bit to be set to "1": 01001000…. This is because we’d like there to be three "0" bits in front, but we’ve also already placed one "1".

  • The third number (num=9, idx=2) causes the 12th bit to be set to "1": 01001000000100…. Now we’ve already placed two "1" bits so we need to adjust for this to get nine "0" bits in front.
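To see it run end to end, here’s a throwaway stand-in for BitArray (the article doesn’t pin down its implementation, so this is just an assumption) together with a quick check:

class BitArray:
    def __init__(self, size):
        self.bits = ["0"] * size

    def set(self, pos):
        self.bits[pos] = "1"

    def __str__(self):
        return "".join(self.bits)

print(build([1, 3, 9, 12, 14, 15]))  # 010010000001000100101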

When Unary Delta Encoding goes wrong

Unary Delta Encoding isn’t always going to improve space. For instance, let’s try to store the numbers 5, 7, 15 in both representations and compare the number of bits:

  • Using fixed-size representation we need 4 bits × 3 elements = 12 bits.

  • Using Unary Delta Encoding we need 15 zeros + 3 ones = 18 bits.

Ouch! This is worse! How can we then take advantage of Unary Delta Encoding?

Elias-Fano: Best of both worlds

Elias-Fano is quite literally the best of both worlds of compact encoding and Unary Delta Encoding: we split the bit representation of each number into two parts. The lower bits are stored using the compact representation while the higher bits are stored using Unary Delta Encoding.

Once again, for the sequence 1, 3, 9, 12, 14, 15:

[Figure: Elias-Fano representation of 1, 3, 9, 12, 14, 15: the lowest bit of each number stored compactly, the higher bits in Unary Delta Encoding]

A lot is happening in this diagram, so take your time. In the middle we have our original sequence of numbers. We’ve decided to take out the lowest bit of each number and store it separately (at the top). This causes the remaining bits to form a new sequence with smaller numbers: 0, 1, 4, 6, 7, 7. We’ve then stored these numbers using Unary Delta Encoding. In total we have 6 bits (at the top) + 7 zeros + 6 ones = 19 bits. We managed to save 2 more bits!

The reason is that the Unary Delta Encoding now has 7 as its highest number where before it was 15. This means that we need 8 fewer zeros in its representation (since 15 - 7 = 8). On the other hand, we also have 6 more bits to store compactly, but since 6 is fewer than 8 we’ve used fewer bits on the whole.
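In code the split is just a mask and a shift; a quick sketch:

nums = [1, 3, 9, 12, 14, 15]
low = [n & 1 for n in nums]    # lowest bit of each number: 1, 1, 1, 0, 0, 1
high = [n >> 1 for n in nums]  # the remaining bits: 0, 1, 4, 6, 7, 7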

We could have chosen to store two bits of each number in the compact representation, but this wouldn’t actually help. The highest number in the Unary Delta Encoding would then be 3, so we’d go from storing 15 zeros down to 3 zeros and save 12 bits there. However, we’d now need 2 bits × 6 elements = 12 bits for the lower bits, so we’d merely break even with plain Unary Delta Encoding (3 zeros + 6 ones + 12 lower bits = 21 bits), which is worse than the 19 bits we got with a single lower bit.

In general, it turns out that it’s optimal to use log2(u / n) bits (rounded down) for the lower parts, where n is the number of elements and u is the biggest number. For our example, log2(15 / 6) ≈ 1.32, so we use 1 lower bit.

Interestingly, you’ll see that in the Unary Delta Encoding of the higher bits we have roughly the same amount of zeros and ones. This is actually key to seeing that we have found the optimal point: If we had way more zeros than ones (or the other way around) for our bits then intuitively there should be another representation which is more space efficient.
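Putting the pieces together, here’s a compact end-to-end sketch of the whole representation (using a naive O(n) select where a real implementation would use an O(1) one, and the lower-bit count we just derived):

import math

class EliasFano:
    def __init__(self, nums):  # nums must be sorted
        n, u = len(nums), nums[-1]
        # Optimal number of lower bits: log2(u / n), rounded down.
        self.lo_bits = max(0, int(math.log2(u / n)))
        mask = (1 << self.lo_bits) - 1
        self.low = [num & mask for num in nums]        # compact part
        self.high = ["0"] * ((u >> self.lo_bits) + n)  # unary part
        for idx, num in enumerate(nums):
            self.high[(num >> self.lo_bits) + idx] = "1"

    def select(self, i):
        # Naive O(n) scan; real implementations answer this in O(1).
        count = -1
        for pos, bit in enumerate(self.high):
            if bit == "1":
                count += 1
                if count == i:
                    return pos

    def access(self, i):
        # Recombine: higher bits from the unary part, lower bits as stored.
        return ((self.select(i) - i) << self.lo_bits) | self.low[i]

ef = EliasFano([1, 3, 9, 12, 14, 15])
print([ef.access(i) for i in range(6)])  # [1, 3, 9, 12, 14, 15]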

Why not just delta encoding?

Going back to our sequence of 1, 3, 9, 12, 14, 15 you might wonder why we can’t just store the delta encoding: Instead of storing the numbers themselves we can store the difference/delta to the previous number in the sequence:

  • The first element is just 1.

  • Next we have 3 - 1 = 2.

  • Next we have 9 - 3 = 6.

  • Next we have 12 - 9 = 3.

This gives us the sequence 1, 2, 6, 3, 2, 1 and we see that each of the numbers can be stored in 3 bits. Using the simple fixed-size representation we need 3 bits × 6 elements = 18 bits, which is even better than our Elias-Fano representation.

There’s however one big disadvantage to delta encoding: In order to find the n-th number in the sequence you need to sum up the first n deltas. If you want to find the millionth number you actually need to sum up a million integers. If you’re only iterating over the sequence from the beginning then there’s no problem, but the moment you need to be able to "jump" around in the sequence it’s not really feasible to use delta encoding.
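A tiny sketch makes the cost visible: producing the deltas is trivial, but random access degenerates into a prefix sum:

def delta_encode(nums):
    # Store the first number, then each difference to its predecessor.
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

def delta_access(deltas, i):
    # O(i): we have to sum every delta up to and including position i.
    return sum(deltas[:i + 1])

deltas = delta_encode([1, 3, 9, 12, 14, 15])
print(deltas)                   # [1, 2, 6, 3, 2, 1]
print(delta_access(deltas, 3))  # 12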

Show me the code!

While exploring Elias-Fano representation I have of course implemented this from scratch. It’s the best way of fully learning something. I’ve chosen to implement it in Zig (because I’ve been learning that as well):