A MongoDB ObjectId is a globally unique ID that can be generated on any machine, with a high probability of being distinct from any other ObjectId generated elsewhere. Its contents are not random, but consist of a few pieces of data. While you should never rely on an ObjectId value staying secret for security, there are cases where it’s important to understand how difficult a generated ObjectId is to guess.
Let’s consider a simpler case first: IPv4 addresses. For privacy reasons, you might not want to store real IP addresses, but you still want to count the number of distinct IPs used. The trivial solution is to store a SHA-1 hash of each IP address instead. The problem is that this is not really a one-way hash, because the source range is so limited: a modern GPU can hash every possible IPv4 address in a fraction of a second.
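To make the weakness concrete, here is a minimal Python sketch of the attack. The search is restricted to a single /16 block so it runs in under a second in pure Python; a GPU covers the full 2^32 address space about as quickly. The function names and the prefix are illustrative only:

```python
import hashlib

def sha1_hex(ip):
    """SHA-1 of a dotted-quad IP string, as the naive scheme would store it."""
    return hashlib.sha1(ip.encode()).hexdigest()

def crack(target, prefix="192.168."):
    """Recover an IP from its bare SHA-1 hash by exhaustive search."""
    for c in range(256):
        for d in range(256):
            candidate = f"{prefix}{c}.{d}"
            if sha1_hex(candidate) == target:
                return candidate
    return None
```

Here `crack(sha1_hex("192.168.1.42"))` returns `"192.168.1.42"` almost instantly — the hash hid nothing.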
In this case the source range of IPv4 contains only 32 bits of entropy. So how many bits of entropy does a MongoDB ObjectId contain?
Contents of an ObjectId
A MongoDB ObjectId consists of the following four pieces of data:
- a 4-byte value representing the seconds since the Unix epoch,
- a 3-byte machine identifier,
- a 2-byte process id, and
- a 3-byte counter, starting with a random value.
In total, there are 96 bits of data in an ObjectId. However, many of the bits can be guessed.
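Assuming the layout above (the classic format; note that newer MongoDB drivers replace the machine and process fields with a single 5-byte random value), the fields can be pulled apart in a few lines of Python. The function name and the example ID are my own:

```python
import datetime

def parse_object_id(oid_hex):
    """Split a 24-hex-digit ObjectId into its four fields."""
    raw = bytes.fromhex(oid_hex)
    if len(raw) != 12:
        raise ValueError("ObjectId must be exactly 12 bytes")
    return {
        "timestamp": datetime.datetime.fromtimestamp(
            int.from_bytes(raw[0:4], "big"), datetime.timezone.utc),
        "machine": raw[4:7].hex(),                    # 3-byte machine identifier
        "pid": int.from_bytes(raw[7:9], "big"),       # 2-byte process id
        "counter": int.from_bytes(raw[9:12], "big"),  # 3-byte counter
    }

fields = parse_object_id("507f1f77bcf86cd799439011")
```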
Below we’ll consider specifically the case of guessing a specific ObjectId – i.e. assuming you have a hashed value, how many tries do you need to find out the original ObjectId value.
If you know one ObjectId, finding other ObjectIds that may have been created is much simpler, as you only need to iterate the timestamp and counter. This means that given a large set of hashed IDs, it may be possible to recover some original values easily, even when finding the value for one specific hash is difficult.
The timestamp is one of the easiest values to guess in an ObjectId. The question is how accurately you can guess when the ObjectId was created. Depending on the timespan you can narrow it down to, the timestamp contributes the following amounts of entropy:
- year – 25 bits
- month – 21 bits
- day – 16 bits
- hour – 12 bits
- minute – 6 bits
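These figures are simply log2 of the number of candidate seconds in each window, which is easy to verify:

```python
import math

# Entropy of the timestamp = log2(number of candidate seconds in the window)
windows = {
    "year": 365 * 86400,
    "month": 30 * 86400,
    "day": 86400,
    "hour": 3600,
    "minute": 60,
}
entropy = {name: math.log2(seconds) for name, seconds in windows.items()}
for name, bits in entropy.items():
    print(f"{name}: {bits:.1f} bits")
```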
There are also other ways the time range may be limited. For example, if you know the ID was generated during office hours, the entropy is reduced by a further 2 bits.
The machine identifier is the first three bytes of the MD5 hash of the machine’s hostname, MAC address, or virtual machine ID.
If you don’t know anything about the architectural setup or other IDs, you end up with a full 24 bits of entropy. However, if you have an ObjectId which you know was generated on the same machine, then you know the exact machine ID. Knowledge of the internal setup (for example host naming conventions) can also reduce the entropy to just a few bits.
The default range of process ID numbers on most Linux systems is 0 to 32768, providing 15 bits of entropy with no additional knowledge.
However, process IDs are normally consecutive numbers. Dedicated database or application servers typically don’t have a lot of processes running, so the process ID is probably below 1000 (possibly below 10 for containers) and typically pretty deterministic, varying only a few numbers between reboots.
Therefore I’d say you can assume at most 6–10 bits of entropy for the process ID; with a scripted setup and knowledge of another ObjectId, it can be as low as 1–2 bits.
The 3-byte counter provides at most 24 bits of entropy, and since the counter starts from a random value, you typically get all 24 bits. However, knowing an ObjectId created by the same process around the same time can drastically reduce the entropy.
Assuming on average one insert per second, knowing an ObjectId created within one hour of the unknown ObjectId reduces the entropy to about 12 bits (log2 of 3600 candidate counter values). The total entropy is further reduced because the counter value strongly correlates with the timestamp.
Choosing the hash algorithm
To create a truly one-way hash from a limited source range, you need to choose a hashing algorithm that is sufficiently slow compared to the source range size. Hash algorithms specifically designed to be slow include bcrypt and scrypt, both of which allow a configurable complexity.
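bcrypt and scrypt live in third-party packages, but Python’s standard library exposes scrypt through `hashlib`, which is enough for a sketch. The function name and the cost parameters below are my assumptions; tune them so a single hash takes tens of milliseconds on your own hardware:

```python
import hashlib
import os

def slow_hash(object_id_hex, salt):
    """One-way hash of an ObjectId using scrypt.
    n (CPU/memory cost), r (block size) and p (parallelism) are
    illustrative values, not a recommendation."""
    return hashlib.scrypt(bytes.fromhex(object_id_hex), salt=salt,
                          n=2**14, r=8, p=1)

salt = os.urandom(16)  # per-dataset salt, stored alongside the hashes
digest = slow_hash("507f1f77bcf86cd799439011", salt)
```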
The table below shows ballpark entropy figures for a MongoDB ObjectId under various assumptions about prior knowledge, along with the average time it would take to brute-force a bcrypt hash (work factor 10) on a modern GPU computing 3300 hashes per second.
| Prior knowledge | Entropy | Time to crack |
|---|---|---|
| Creation day, no other knowledge | 16+24+15+24 = 79 bits | ~10^12 years |
| Creation day, educated guess of machine and process IDs | 16+7+4+24 = 51 bits | ~11,000 years |
| ObjectId created on same day on different machine, educated guess of machine and process IDs | 16+4+3+24 = 47 bits | ~680 years |
| ObjectId created on same day on same process, ~1 insert per second | 16+0+0+12 = 28 bits | ~11 hours |
| ObjectId created on same hour on same process, ~1 insert per second | 12+0+0+8 = 20 bits | ~3 minutes |
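The time-to-crack figures follow from a simple average-case calculation: on average, half the keyspace must be searched at the assumed hash rate. A sketch of the arithmetic:

```python
RATE = 3300  # assumed bcrypt (work factor 10) hashes per second on a GPU

def average_crack_seconds(bits):
    """Expected brute-force time: half of 2**bits hashes at RATE per second."""
    return 2 ** (bits - 1) / RATE

for bits in (79, 51, 47, 28, 20):
    print(f"{bits} bits: {average_crack_seconds(bits):.3g} seconds")
```

At 20 bits this comes out to roughly three minutes; at 79 bits it is on the order of 10^12 years.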
This demonstrates that knowing an ObjectId that was created on the same machine/process as the unknown ObjectId radically reduces the amount of entropy, by providing knowledge of the machine and process IDs and of the counter.
The above assumes that the cracker knows the algorithm and any possible salt value you use in the hash – something you should assume anyway.