Efficient natural sorting using magnitude prefixing
Everyone prefers a nicely sorted list, no?
-
test1.txt
-
test10.txt
-
test11.txt
-
test2.txt
-
test3.txt
-
test4.txt
-
test5.txt
-
test6.txt
-
test7.txt
-
test8.txt
-
test9.txt
This list is what you get if you use a "default" string sorting function in any programming language. It’s technically correctly/lexically sorted, but humans typically prefer if the numbers inside the strings are sorted numerically (also known as "natural sorting").
There’s a rather elegant technique for implementing natural sorting which I call "magnitude prefixing". I can’t remember where I first read about this, but it doesn’t seem to be so well-known.
Prelude: Left-padding numbers
Let’s start by talking about a suboptimal way of implementing natural sorting:
Storing the "sort key" as a separate field turns out to be very convenient: Your database doesn’t need to support custom sort ordering and a regular index makes querying efficient. It also makes it possible to add other types of normalization (e.g. convert to lower case).
There’s however two problems with this approach:
First of all, you need to decide how many digits you want to support in advance.
You want to naturally sort numbers up with 9 digits?
Well, then you need to pad all numbers with up to 9 zeros, and any number with more digits will not be corrected sorted.
In addition, the sorting key becomes extremely long when the string contains many smaller numbers.
A timestamp such as 2024-02-04 19:26:34.867641 +0100
would be stored as 000002024-000000002-000000004 000000019:000000026:000000034.000867641 +000000100
.
It’s maybe not a big problem, but definitely adds a lot of unnecessary overhead and suddenly you’ll reach a limit in a surprising way.
Better: Magnitude prefixed numbers!
Turns out there’s a slight variant which is easy to implement and much better:
A few examples:
-
In
test105.txt
we have a number with three digits so we end up withtest3105.txt
. -
In
test20.txt
we have a number of two digits so we end up withtest220.txt
. -
In
test2024.txt
we have a number of four digits so we end up withtest42024.txt
.
This works because it ensures that all 1 digit numbers become "grouped" together before any 2 digit numbers and so forth.
Notice that we’re only using a single extra byte per number. The worst case is for single digit numbers and these strings become 100% longer. Therefore we know that the sorting string is maximum 2x larger than our initial string. This is way better than the 10x increase we saw when we used left-padded strings.
Handling more than 9 digits
What if we have a number with, let’s say, 14 digits?
We clearly can’t prefix it with 14
since this makes to ordered together with the 1 digit numbers.
Inspired by the UTF-8 encoding, our solution to this problem is to let "9" act as a "continuation" byte: We first place a "9" and then we add another byte representing how many additional digits we have. This might go on multiple times if it’s a really huge number.
A few examples:
-
1 digit numbers are prefixed with "1"
-
5 digit numbers are prefixed with "5"
-
8 digit numbers are prefixed with "8"
-
9 digit numbers are prefixed with "90"
-
10 digit numbers are prefixed with "91"
-
11 digit numbers are prefixed with "92"
-
17 digit numbers are prefixed with "98"
-
18 digit numbers are prefixed with "990"
-
19 digit numbers are prefixed with "991"
This works for any number of digits: We can just keep adding "9"s.
Bonus: There’s no 0 digit numbers
There’s one property which lets us use one less byte for some numbers: Since every number has at least one digit we can start counting at 0 instead of 1. This means that 9 digit numbers are prefixed with "8" instead of "90" (and 18 digit numbers are prefixed with "98" instead of "990"). It doesn’t make a huge difference, but there’s no reason to not use 0.
This gives us the following rules:
-
1 digit numbers are prefixed with "0"
-
5 digit numbers are prefixed with "4"
-
8 digit numbers are prefixed with "7"
-
9 digit numbers are prefixed with "8"
-
10 digit numbers are prefixed with "90"
-
11 digit numbers are prefixed with "91"
-
17 digit numbers are prefixed with "97"
-
18 digit numbers are prefixed with "98"
-
19 digit numbers are prefixed with "990"
A JavaScript implementation
Okay, okay, I know you’re probably here for copy-pasting code. Here you go. This should be easily adjustable to other languages, but note that the integer division needs to round down.
const NINES = "999999999999999999999999999"
function magnitudePrefixed(s) {
return s.replace(/\d+/g, (num) => {
const digits = num.length - 1
return NINES.slice(0, digits / 9) + (digits % 9) + num
})
}
Bonus: It’s reversible!
Another cool thing about this approach is that it’s trivial to recover the original string: Just remove the prefix from any number:
function unmagnitudePrefixed(s) {
return s.replace(/9*\d(\d+)/g, '$1')
}