Too Many Afterthoughts The greatest ideas are often far too late

Databases & UUIDs

Choose the right primary key to save a large amount of disk I/O


Imagine you’re working in a large book warehouse and in charge of new arrivals. There’s a separate, digital system for metadata like authors, categories, etc., so the only information you’ll use during storage is the inventory number. Each book is identified by a unique number with many digits and all books must be findable by their number. To make handling quicker, books are packed in boxes, ordered by the inventory number. When looking for a book, the box must be identified first. Therefore, each box is labeled by the lowest inventory number it contains and the first number that’s in the next box.

Your job in the arrivals room is to pick up books-to-be-stored one by one, assign them a new inventory number in the metadata system, label them by number, and put them in a box as mentioned before. Now, the room is quite small and if you run out of space, you’ll need to move the filled boxes into the basement, which might be two floors down.
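
To make the box labeling concrete, here is a minimal sketch of the lookup it enables. The box contents and the helper names `find_box` and `has_book` are made up for illustration; the only idea taken from the story is that each box is labeled with the lowest inventory number it contains, so the right box can be found with a binary search over the labels.

```python
from bisect import bisect_right

# Hypothetical boxes: each holds a sorted run of inventory numbers.
boxes = [
    [1001, 1004, 1010],
    [1015, 1020, 1031],
    [1040, 1042, 1050],
]

# The label of each box is the lowest inventory number it contains.
# The upper bound of a box is simply the label of the next box.
labels = [box[0] for box in boxes]


def find_box(inventory_number):
    """Return the index of the box whose label range covers the number, or None."""
    i = bisect_right(labels, inventory_number) - 1
    return i if i >= 0 else None


def has_book(inventory_number):
    """Open only the one candidate box and check whether the book is inside."""
    i = find_box(inventory_number)
    return i is not None and inventory_number in boxes[i]
```

Only one box has to be opened per lookup, no matter how many boxes are in the warehouse.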


UUIDs Are Bad for Database Index Performance, enter UUID7!

UUIDs, Universally Unique Identifiers, are a specific form of identifier designed to be unique even when generated on multiple machines. Compared to the autoincremented sequential identifiers commonly used in relational databases, generating them does not require centralized storage of the current state, i.e., the identifiers that have already been allocated. This is useful when a centralized system would pose a performance bottleneck or a single point of failure. UUIDs are designed to support very high allocation rates, up to 10 million per second per machine.
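
As a rough illustration, the sketch below generates identifiers without any shared counter. `uuid.uuid4()` is Python's standard random UUID. The function `uuid7_sketch` is a hypothetical helper, not a library API: it assumes the version 7 layout from RFC 9562 (a 48-bit Unix timestamp in milliseconds followed by random bits) and leaves out refinements such as a monotonic counter within the same millisecond.

```python
import os
import time
import uuid

# A random (version 4) UUID: no coordination with other machines is needed.
random_id = uuid.uuid4()


def uuid7_sketch(unix_ms=None):
    """Build a version 7 style UUID: 48-bit millisecond timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand
    # Overwrite the 4 version bits with 0b0111 (version 7).
    value &= ~(0xF << 76)
    value |= 0x7 << 76
    # Overwrite the 2 variant bits with 0b10 (the RFC variant).
    value &= ~(0x3 << 62)
    value |= 0x2 << 62
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, identifiers created in later milliseconds sort after earlier ones, while the random tail keeps them unique across machines.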


Partitioning InnoDB tables by time-based pseudo-sequential UUIDs


Partitioning has multiple uses – spreading load onto multiple disks, cold storage of older data on cheaper disks, and probably others. Most importantly though, partitions are not for performance.

The main use case I’m going to explain is time-based partitioning as a tool for limiting the scope of stored data. This could be required to comply with a data retention policy or simply to save money on disk space.
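
A minimal sketch of the retention idea, assuming one partition per calendar month and a made-up naming scheme (`pYYYYMM`). The helpers `partition_name` and `expired_partitions` are hypothetical and only compute which partitions fall outside the retention window; actually removing them would be a separate step in the database.

```python
from datetime import date


def partition_name(day):
    """Hypothetical naming scheme: one partition per calendar month."""
    return f"p{day.year:04d}{day.month:02d}"


def expired_partitions(partitions, today, keep_months):
    """Return partitions older than the current month plus keep_months before it."""
    # Index of the oldest month that must still be kept.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    cutoff_name = f"p{cutoff // 12:04d}{cutoff % 12 + 1:02d}"
    # Fixed-width names sort chronologically, so string comparison is enough.
    return sorted(p for p in partitions if p < cutoff_name)
```

For example, with partitions from October 2023 through March 2024, a date in March 2024, and `keep_months=3`, the partitions for October and November 2023 are reported as expired.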
