Recently I developed a small software that generated random IDs to identify data. These IDs lived for a maximum of 30 days.
At first, I generated UUIDs (Universally Unique Identifiers), which gives me 122 random bits (a uuid is 128bit, but only 122 are random). However, I didn’t like the length of the UUIDs as they were used in URLs that users had to share.
At first, I just encoded the IDs in base64, which reduced the length to 22 characters, but it still felt too long for URLs.
RFC4122: a9e053a7-b643-4c7d-817a-84d1c3cc1708
base64: qeBTp7ZDTH2BeoTRw8wXCA==
b64 no padding: qeBTp7ZDTH2BeoTRw8wXCApackage main
import (
"encoding/base64"
"fmt"
"github.com/google/uuid"
)
func main() {
id := uuid.New()
fmt.Printf("RFC4122: %s\n", id.String())
fmt.Printf("base64: %s\n", base64.URLEncoding.EncodeToString(id[:]))
fmt.Printf("b64 no padding: %s\n", base64.RawURLEncoding.EncodeToString(id[:]))
}When generating random IDs, it’s all about the probability of collisions. The more bits you have, the lower the probability of two IDs being the same. However, more bits also means longer IDs.
The number of IDs you need to generate to have a probability of >50% for at least one collision can be calculated using the birthday paradox formula:
$$ n(c) = \frac{1}{2} + \sqrt{\frac{1}{4} + 2 \times \ln(2) \times 2^c} $$
With n being the number of IDs you need to generate and c being the number of bits.
Base64 encodes 6 bits per character, so the bit-count should be a multiple of 6. I will generate random bytes, so the bit-count should also be a multiple of 8. That gives me the least common multiple of 24 bits, which is 4 base64 characters.
This gives us the following number of IDs for different lengths:
The examples are generated with the following command:
for ((i=1; i <= 5; i++)); do
dd if=/dev/urandom bs=3 count=$i status=none | base64 | tr "+/" "-_"
done| Bits | IDs for 50% collision probability | Example |
|---|---|---|
| 24 | \(4823\) | X-4T |
| 48 | \(1.98 \times 10^{7}\) | bGwp9vYf |
| 72 | \(8.09 \times 10^{10}\) | ZMTwnEFrVs94 |
| 96 | \(3.31 \times 10^{14}\) | fA_lIJOEO2-DgVMf |
| 120 | \(1.36 \times 10^{18}\) | rNlrUjNy-AjrOfJqB1-Y |
Since I was only generating a few hundred IDs per day, I decided to go with 48 bits, which gives me a very low probability of collisions. If one happens, I can always generate a new one.