The Art of Comparing Apples with Oranges (or: DragonflyDB vs Redis)

Last updated 2022-06-02. Written by Magnus Holm (judofyr@gmail.com).

Unbeknownst to many, there is a productive way of comparing apples with oranges:

Step 1: Turn your orange into an apple.
Step 2: Compare your apple with other apples.
Step 3: Show how your apple can turn into an orange.

Does this sound a bit confusing? Let’s look at a specific case.

The case of DragonflyDB

DragonflyDB is a recently released in-memory store which claims to be "probably the fastest one in universe". It’s still only in v0.1 and probably not quite ready for production, but they have some pretty interesting benchmarks to show:

Dragonfly is a modern in-memory datastore, fully compatible with Redis and Memcached APIs. Dragonfly implements novel algorithms and data structures on top of a multi-threaded, shared-nothing architecture. As a result, Dragonfly reaches x25 performance compared to Redis and supports millions of QPS on a single instance.

This was recently discussed on Hacker News, and many people pointed out that the benchmark is a bit meaningless. Redis is famously single-threaded: Everything happens on a single thread and you gain nothing from running it on a machine with many cores. The benchmark was run on a c6gn.16xlarge instance in AWS. This is a virtual machine which has 64 virtual CPUs and costs $2.7648 per hour. The Redis server might be x25 slower, but it also leaves 63 virtual CPUs idle which you can use for other types of work. Alternatively, if you wanted a dedicated instance for Redis you would never provision one of these. This is a completely apples-to-oranges comparison.

Or is it? This benchmark demonstrates DragonflyDB’s ability to scale on a multi-core machine. That is remarkable in itself! This is a capability which Redis completely lacks. Doesn’t DragonflyDB deserve to show off its great multi-threaded throughput?

By following the recipe above we can make a productive comparison between these systems.

Step 1: Run DragonflyDB in single-core mode

At the moment it seems like DragonflyDB always takes advantage of all the cores available on the machine. It should, however, be trivial to add a command-line flag which makes it run on only a fixed number of cores. This might actually be a useful feature in itself if you’re running it on a machine which is used for multiple tasks.
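
Until such a flag exists, one possible stand-in on Linux is to restrict the process from the outside with CPU affinity. The sketch below only builds a command line for taskset (from util-linux); the ./dragonfly path is a hypothetical placeholder, not a documented way of launching the server. Note that affinity only limits where threads are scheduled: the server may still start one thread per core it detects, so this is an approximation of a real single-core mode, not a substitute for it.

```python
import shlex


def pinned_command(server_cmd, cores):
    """Prefix a command with `taskset -c` so it may only run on the given CPU ids."""
    if not cores:
        raise ValueError("need at least one core")
    # De-duplicate and sort so the CPU list is stable, e.g. [3, 1, 1] -> "1,3".
    cpu_list = ",".join(str(c) for c in sorted(set(cores)))
    return ["taskset", "-c", cpu_list] + list(server_cmd)


# "./dragonfly" is a hypothetical binary path used only for illustration.
cmd = pinned_command(["./dragonfly"], [0])
print(shlex.join(cmd))
```

The same helper can later be reused with cores 0 to N-1 to produce the 1, 2, 3, …, 64 core runs.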

Now we’ve been able to turn our orange (multi-core database/cache) into an apple (single-core database/cache).

Step 2: Benchmark single-core DragonflyDB with Redis

Now we can do a completely fair apples-to-apples benchmark with Redis. Run DragonflyDB on a single core and compare its performance with Redis running "as normal". Running this "baseline" comparison will actually teach us a lot. DragonflyDB claims that part of their performance is due to (1) io_uring and (2) the fast hash table. If this is actually true, we would expect it to beat Redis on this benchmark. And if DragonflyDB doesn’t beat Redis then their "x25 performance" number is even more impressive, as it means that they’re able to catch up thanks to their multi-core architecture.

Whichever database is doing worse here is the one which has the most opportunity to improve. Since this actually is an apples-to-apples comparison it would probably be easy for the "loser" to use the same approach. The "winner" has proved that it works; now the rest can use these results. It might be a lot of work, but it shouldn’t need any big architectural changes.

Step 3: Show how DragonflyDB can scale across multiple cores

Next step: Run DragonflyDB using 1, 2, 3, …, 64 cores (on the same machine!) and for each of these run the benchmarks. Yes, it’s a lot of benchmarks. Yes, we need them all. Don’t skimp on this. Plot it on a graph. Look at it.

The best scenario we can expect is "perfect linear scaling": Double the number of cores, double the throughput. (If you’re able to find an approach where you get more than double the throughput, please reach out to me, dear wizard.) Typically, we’re not able to achieve perfect linear scaling. There’s always some overhead in multi-core execution. This is fine! This is expected! However, we want to know how it behaves. Does it scale linearly up to a certain point and then flatten out? If so, there’s no point in buying 64-core machines if we get the same performance from a 32-core machine.
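
Here is a minimal sketch of how the results of such a sweep could be summarized. It computes speedup (throughput relative to the single-core run) and efficiency (speedup divided by core count; 1.0 means perfect linear scaling). The numbers in placeholder_qps are invented purely to exercise the code; they are not measurements of DragonflyDB or anything else, and the 0.75 threshold is an arbitrary choice.

```python
def scaling_table(qps_by_cores):
    """Return (cores, speedup, efficiency) rows relative to the 1-core run."""
    if 1 not in qps_by_cores:
        raise ValueError("need a single-core measurement as the baseline")
    base = qps_by_cores[1]
    rows = []
    for cores in sorted(qps_by_cores):
        speedup = qps_by_cores[cores] / base
        rows.append((cores, speedup, speedup / cores))
    return rows


def last_efficient_core_count(rows, threshold):
    """Largest core count whose efficiency is still at or above the threshold."""
    return max(cores for cores, _, eff in rows if eff >= threshold)


# Placeholder numbers, NOT real benchmark results.
placeholder_qps = {
    1: 100_000, 2: 190_000, 4: 360_000, 8: 640_000,
    16: 960_000, 32: 1_280_000, 64: 1_300_000,
}

rows = scaling_table(placeholder_qps)
for cores, speedup, eff in rows:
    print(f"{cores:>2} cores: speedup x{speedup:.1f}, efficiency {eff:.0%}")
print("efficient up to", last_efficient_core_count(rows, 0.75), "cores")
```

With real measurements in place of the placeholders, the last line answers the "where does it flatten out?" question directly.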

You may now immediately sense that there is something off with DragonflyDB’s "x25 performance" claim. They ran this on a machine with 64 virtual CPUs, and they compared against Redis which only used one. Considering we didn’t get a 64-fold throughput improvement, we are far away from perfect linear scaling here. (Some of this may be related to the way "virtual CPUs" work in AWS. I’m not an expert on this.) What’s going on here? And remember: There’s nothing wrong with not achieving perfect linear scaling. This is just how life is.
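
To put a rough number on "far away", here is the arithmetic as a small sketch. It rests on two loud assumptions that the published benchmark does not confirm: that single-core DragonflyDB would perform about the same as Redis (so x25 can be read as a speedup over one core), and that Amdahl’s law, speedup = 1 / (s + (1 - s) / N) with serial fraction s, is a reasonable model for the overhead.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of perfect linear scaling actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup, cores):
    """Solve speedup = 1 / (s + (1 - s) / cores) for the serial fraction s."""
    return (1 / speedup - 1 / cores) / (1 - 1 / cores)


# x25 on 64 virtual CPUs, ASSUMING single-core DragonflyDB is roughly equal to Redis.
eff = parallel_efficiency(25, 64)
s = amdahl_serial_fraction(25, 64)
print(f"efficiency: {eff:.1%}")                # 39.1%
print(f"implied serial fraction: {s:.1%}")     # 2.5%
```

In other words, under these assumptions a serial portion of only a few percent would be enough to explain the gap between x25 and x64. Only the per-core-count benchmarks can tell whether that is what is actually happening.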

Also notice that here we’re not competing against Redis. There’s no reason to compete against Redis here. Redis can’t do any of this! We’re only comparing DragonflyDB against itself. This is an orange-to-orange comparison. A very tasty orange!

Step 4 (bonus): Redis can show how they scale across multiple machines

And now we’ve also made it possible for Redis to demonstrate their strength: Redis has a cluster mode where data is stored on different machines. This has some limitations on what you can do (notably: you can only do transactional multi-key operations inside a single machine), but has its own set of strengths. Maybe they can show how they’re able to scale much further since there’s no limit to the number of machines you can spin up (while there is a limit to the number of cores you can get).

There is no single number

I think part of the reason why we keep seeing these apples-to-oranges comparisons is that people desperately want to quantify everything and boil it down to a single number, or a single graph. "Aha, my product is X times better/faster/smaller than your product!" It’s very understandable: You want to convey the advantages in the quickest possible way.

Unfortunately, it’s rarely actually true. Different products have different qualities in different dimensions. For a specific use case (e.g. "run on a single machine with 64 virtual CPUs") you can come up with a number, but practically this number isn’t very useful since we all have different use cases.

By running multiple benchmarks, only varying one dimension at a time, we can understand how the system behaves under a much larger set of use cases. Yes, it’s a lot of work. Yes, it’s worth it.
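
As a sketch of what "only varying one dimension at a time" could look like in practice, the helper below expands a baseline configuration into a list of runs that each differ from the baseline in exactly one setting. The dimension names (cores, value_bytes, clients) and their values are hypothetical examples, not a recommended benchmark plan.

```python
def one_at_a_time(baseline, variations):
    """Return the baseline plus configs that change exactly one dimension each."""
    configs = [dict(baseline)]
    for name, values in variations.items():
        for value in values:
            if value == baseline[name]:
                continue  # already covered by the baseline run
            config = dict(baseline)
            config[name] = value
            configs.append(config)
    return configs


# Hypothetical dimensions and values, for illustration only.
baseline = {"cores": 1, "value_bytes": 64, "clients": 50}
variations = {
    "cores": [1, 2, 4, 8],
    "value_bytes": [64, 1024],
    "clients": [50, 200],
}

configs = one_at_a_time(baseline, variations)
for config in configs:
    print(config)
```

This keeps the number of runs at the sum of the variations rather than their product, and every result can be compared against the same baseline.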