Efficient natural sorting using magnitude prefixing

Last updated 2024-02-04. Written by Magnus Holm (judofyr@gmail.com).

Everyone prefers a nicely sorted list, no?

test1.txt
test10.txt
test11.txt
test2.txt
test3.txt
test4.txt
test5.txt
test6.txt
test7.txt
test8.txt
test9.txt

This list is what you get if you use a "default" string sorting function in any programming language. It’s technically correctly/lexically sorted, but humans typically prefer if the numbers inside the strings are sorted numerically (also known as "natural sorting").

There’s a rather elegant technique for implementing natural sorting which I call "magnitude prefixing". I can’t remember where I first read about this, but it doesn’t seem to be so well-known.

Prelude: Left-padding numbers
Better: Magnitude prefixed numbers!

Prelude: Left-padding numbers

Let’s start by talking about a suboptimal way of implementing natural sorting:

Natural sorting by left-padding numbers

Find every number in the string.
Replace them by a zero-padded versions (e.g. turn test10.txt into test00010.txt).
Store the replaced version separately (e.g. title and title_sort).

Storing the "sort key" as a separate field turns out to be very convenient: Your database doesn’t need to support custom sort ordering and a regular index makes querying efficient. It also makes it possible to add other types of normalization (e.g. convert to lower case).

There’s however two problems with this approach: First of all, you need to decide how many digits you want to support in advance. You want to naturally sort numbers up with 9 digits? Well, then you need to pad all numbers with up to 9 zeros, and any number with more digits will not be corrected sorted. In addition, the sorting key becomes extremely long when the string contains many smaller numbers. A timestamp such as 2024-02-04 19:26:34.867641 +0100 would be stored as 000002024-000000002-000000004 000000019:000000026:000000034.000867641 +000000100. It’s maybe not a big problem, but definitely adds a lot of unnecessary overhead and suddenly you’ll reach a limit in a surprising way.

Better: Magnitude prefixed numbers!

Turns out there’s a slight variant which is easy to implement and much better:

Natural sorting by magnitude prefixed numbers

Find every number in the string.
Replace them by the number of digits, followed by the original digits themselves.
Store the replaced version separately.

A few examples:

In test105.txt we have a number with three digits so we end up with test3105.txt.
In test20.txt we have a number of two digits so we end up with test220.txt.
In test2024.txt we have a number of four digits so we end up with test42024.txt.

This works because it ensures that all 1 digit numbers become "grouped" together before any 2 digit numbers and so forth.

Notice that we’re only using a single extra byte per number. The worst case is for single digit numbers and these strings become 100% longer. Therefore we know that the sorting string is maximum 2x larger than our initial string. This is way better than the 10x increase we saw when we used left-padded strings.

Handling more than 9 digits

What if we have a number with, let’s say, 14 digits? We clearly can’t prefix it with 14 since this makes to ordered together with the 1 digit numbers.

Inspired by the UTF-8 encoding, our solution to this problem is to let "9" act as a "continuation" byte: We first place a "9" and then we add another byte representing how many additional digits we have. This might go on multiple times if it’s a really huge number.

A few examples:

1 digit numbers are prefixed with "1"
5 digit numbers are prefixed with "5"
8 digit numbers are prefixed with "8"
9 digit numbers are prefixed with "90"
10 digit numbers are prefixed with "91"
11 digit numbers are prefixed with "92"
17 digit numbers are prefixed with "98"
18 digit numbers are prefixed with "990"
19 digit numbers are prefixed with "991"

This works for any number of digits: We can just keep adding "9"s.

Bonus: There’s no 0 digit numbers

There’s one property which lets us use one less byte for some numbers: Since every number has at least one digit we can start counting at 0 instead of 1. This means that 9 digit numbers are prefixed with "8" instead of "90" (and 18 digit numbers are prefixed with "98" instead of "990"). It doesn’t make a huge difference, but there’s no reason to not use 0.

This gives us the following rules:

1 digit numbers are prefixed with "0"
5 digit numbers are prefixed with "4"
8 digit numbers are prefixed with "7"
9 digit numbers are prefixed with "8"
10 digit numbers are prefixed with "90"
11 digit numbers are prefixed with "91"
17 digit numbers are prefixed with "97"
18 digit numbers are prefixed with "98"
19 digit numbers are prefixed with "990"

A JavaScript implementation

Okay, okay, I know you’re probably here for copy-pasting code. Here you go. This should be easily adjustable to other languages, but note that the integer division needs to round down.

const NINES = "999999999999999999999999999"

function magnitudePrefixed(s) {
  return s.replace(/\d+/g, (num) => {
    const digits = num.length - 1
    return NINES.slice(0, digits / 9) + (digits % 9) + num
  })
}

Bonus: It’s reversible!

Another cool thing about this approach is that it’s trivial to recover the original string: Just remove the prefix from any number:

function unmagnitudePrefixed(s) {
  return s.replace(/9*\d(\d+)/g, '$1')
}