Preface
In business development, a large number of scenarios require unique IDs for identification: users need unique identity identifiers; products need unique identifiers; messages need unique identifiers; events need unique identifiers... etc., all need globally unique IDs, especially in distributed scenarios. Resource IDs are usually associated with horizontal scaling, and one solution to horizontal scaling is to generate unordered and unpredictable IDs.
This article will take a look at the common methods of generating unique IDs and their characteristics.
Characteristics of unique ID
The unique ID should have the following characteristics:
1) Uniqueness: The generated ID is globally unique, with an extremely low probability of conflict in a specific range
2) Availability: It can ensure high availability under high concurrency
3) Security: The generated ID cannot be predicted and will not expose system and business information
Common methods
Currently commonly used methods for generating unique IDs include: UUID, database auto-increment ID, Snowflake algorithm, and hash algorithm
Solution | Uniqueness | Orderliness | Availability |
Snowflake algorithm | Strong uniqueness | Approximately ordered | High availability |
UUID | Strong uniqueness | Unordered | High availability |
Database auto-increment ID | Strong uniqueness | Ordered | High availability |
Hash algorithms | Strong uniqueness | Not required | High availability |
Snowflake algorithm
The ID generated by the Snowflake algorithm can be predicted to a certain extent through its components.
The ID generated by the Snowflake algorithm consists of the following parts:
1) Sign bit: Always 0, indicating a positive number.
2) Timestamp: It occupies 41 digits, representing the time difference from a fixed point in time (such as 00:00:00 on January 1, 2010).
3) Machine Identifier: It includes the data center ID and machine ID, each occupying 5 digits, used to distinguish different server nodes.
4) Sequence number: occupies 12 bits, used for different IDs generated within the same millisecond
Is the Snowflake algorithm guessable?
During the testing process, we may encounter some resource IDs generated by the Snowflake algorithm. Can we guess other resource IDs?
Theoretically, it is possible, but it is麻烦.According to the algorithm, the timestamp uses milliseconds. Assuming that there are 5 machine IDs in total, one resource ID is generated every millisecond, then there are 60*60*1000*5=18,000,000 in an hour.
When the amount of data is too large, it is difficult to obtain information through brute force, and it is also easy to be discovered. When the cost of the attack exceeds the benefits, there is no point. Therefore, the Snowflake algorithm also guarantees security to some extent.
Snowflake algorithm demo
import time import threading class SnowflakeGenerator: def __init__(self, datacenter_id, worker_id): self.datacenter_id = datacenter_id self.worker_id = worker_id self.sequence = 0 self.last_timestamp = -1 # Bit lengths for different parts self.datacenter_id_bits = 5 self.worker_id_bits = 5 self.sequence_bits = 12 # Maximum values self.max_datacenter_id = -1 ^ (-1 << self.datacenter_id_bits) self.max_worker_id = -1 ^ (-1 << self.worker_id_bits) self.max_sequence = -1 ^ (-1 << self.sequence_bits) # Shift amounts self.worker_id_shift = self.sequence_bits self.datacenter_id_shift = self.sequence_bits + self.worker_id_bits self.timestamp_shift = self.sequence_bits + self.worker_id_bits + self.datacenter_id_bits self.lock = threading.Lock() def _current_milliseconds(self): return int(time.time() * 1000) def _til_next_millis(self, last_timestamp): timestamp = self._current_milliseconds() while timestamp <= last_timestamp: timestamp = self._current_milliseconds() return timestamp def generate_id(self): with self.lock: timestamp = self._current_milliseconds() if timestamp < self.last_timestamp: raise ValueError("Clock moved backwards. Refusing to generate id.") if timestamp == self.last_timestamp: self.sequence = (self.sequence + 1) & self.max_sequence if self.sequence == 0: timestamp = self._til_next_millis(self.last_timestamp) else: self.sequence = 0 self.last_timestamp = timestamp return ((timestamp - 1288834974657) << self.timestamp_shift) | \ (self.datacenter_id << self.datacenter_id_shift) | \ (self.worker_id << self.worker_id_shift) | \ self.sequence def generate_unique_id(prefix: str, datacenter_id: int, worker_id: int) -> str: generator = SnowflakeGenerator(datacenter_id, worker_id) snowflake_id = generator.generate_id() return f"{prefix}{snowflake_id}" # Create a generator, specifying the data center ID and worker machine ID datacenter_id = 0 worker_id = 1 # Generate user ID user_id = generate_unique_id("USER_", datacenter_id, worker_id) print(f"Generated user ID: {user_id}")
Database auto-increment
The database auto-increment ID may be the most familiar way to generate unique IDs for everyone. It has the advantages of simple use, meeting basic needs, and being naturally ordered. However, there are also defects:
1) Poor concurrency
2) The database write pressure is high
3) The database cannot be used after a failure
4) There is a risk of quantity leakage
The database auto-increment ID is the most common way to generate IDs. By using the database itself for settings, it can maintain uniqueness within the entire database. The advantages are simple to use, meet basic business needs, and are naturally ordered; the disadvantages are heavily dependent on the DB, and there may be single-point failures, data consistency issues, etc. due to some characteristics of database deployment. In view of the defects of the database auto-increment ID introduced above, there will be the following two optimization solutions:
1. Horizontal splitting of the database, setting different initial values and the same step size
As shown in the figure, it can ensure that the IDs generated by each database are not conflicting, but this fixed step method will also bring expansion problems. It is easy to think that when expanding, there will be a dilemma of no initial ID value to be divided. Solutions include:
- Determine the step size based on expansion considerations
- Increase other bit markers to distinguish expansion
This is actually a balance between demand and solution, and the most suitable method is chosen according to the demand.
2. Generate a batch of IDs
If you want to use a single machine to generate IDs, to avoid the expansion problem caused by fixed steps, you can generate a batch of IDs each time and give them to different machines to consume slowly. This will also reduce the database pressure to one-Nth, and can withstand for a period of time after a failure.
As shown in the figure, but the disadvantage of this approach is that server restarts and single-point failures can cause ID discontinuity. As always, there is no best solution, only the most suitable one.
UUID
UUID is a set of standards for generating globally unique identifiers, also known as GUID (Globally Unique Identifier). By using UUID, unique IDs can be generated in distributed systems. There are many ways to generate UUIDs.
UUID Format
The standard format of UUID is a string composed of 32 hexadecimal numbers, separated into five parts, such as:
467e8542-2275-4163-95d6-7adc205580a9
The number of digits in each part is: 8-4-4-4-12
When used in the system, the hyphen (-) is generally removed. Therefore, the format is as follows: 467e85422275416395d67adc205580a9
Characteristics
Uniqueness: Based on the combination of random numbers and timestamps, a globally unique ID is generated
Unordered: UUID is randomly generated and does not have time sorting
Performance: The generation speed of UUID is fast, suitable for high-concurrency environments
Can UUID resist unauthorized access?
UUID is completely unordered and cannot be guessed, which can prevent horizontal unauthorized access attacks
Hash algorithms
Hash algorithms can be used to generate unique IDs, but it is necessary to understand the characteristics and limitations of hash algorithms. Hash algorithms convert input data of any length into fixed-length hash values, which are usually fixed-length binary strings and are typically represented in hexadecimal.
Hash principle
- Hash algorithms convert input data into hash values deterministically. This means that the same input data will always generate the same hash value.
- Hash algorithms are typically one-way, meaning that the original data cannot be还原 from the hash value. This is because the hash value is usually shorter than the original data, resulting in the loss of some information.
Uniqueness
The goal of hash algorithms is to ensure that different input data generates different hash values. However, since the length of hash values is finite, hash collisions are inevitable. Hash collisions refer to different input data mapping to the same hash value.
Well-designed hash algorithms should minimize the probability of hash collisions, but they may not be able to eliminate them completely.
Methods to resolve hash collisions
For the purpose of generating unique IDs, it is necessary to consider how to handle hash collisions. One method is to add randomness to reduce the probability of collisions. Another method is to use a uniqueness index to verify whether the generated ID already exists.
Common hash algorithms
- Hash algorithms are widely used in fields such as cryptography, data integrity verification, data retrieval, data comparison, digital signatures, data structures (such as hash tables), and information security.
- In the context of generating unique IDs, hash algorithms can be used to create identifiers to ensure their uniqueness on given data.
Summary
To test for improper authorization, it is necessary to fully understand the generation method of resource IDs before determining whether it is possible to perform unauthorized test. The article introduces several common methods of unique ID generation, and it is necessary to flexibly judge the algorithm used by resource IDs during the testing process.

评论已关闭