leecode/thinkings/bloom-filter-en.md

66 lines
2.8 KiB
Markdown
Raw Permalink Normal View History

2020-05-22 18:17:19 +08:00
# Bloom-filter
## Scenes
Assume that you have a website which has so many audiences, what will you do if you want to know whether an IP address of a visitor is the first time to access your server or not.
### Can hashtable do this?
Of course yes.
Obviously, a hashtable with storing all IP addresses can tell us whether we have known one IP already. But, imagine that, if there are more than 1 billion IP addresses have been recorded, at least `4 Byte * 1,000,000,000 = 4,000,000,000 Byte = 4 GB` RAM is required. If it is not IP, but URL, the RAM required will be much larger.
### Bit
Another solution is using 1 bit to represent the access status of 1 IP, accessed or not.
For the same 1 billion IP addresses, now only `1 * 1,000,000,000 = 128 MB` RAM is required. If it is URL, this method uses much less spaces.
With this method, only two operations are needed: `set(ip)` and `has(IP)`.
However, this method has two fatal weakness:
1. If elements are not distributed uniformly, a lot of spaces will not be used which is inefficient in space.
> A good Hash function can be used to overcome this weakness.
2. If the elements are not integer (e.g. URL), `BitSet` is inapplicable.
> one or more Hash functions can also solve this.
### Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
Actually, Bloom Filter is the second method with multiple hash functions.
Here are four interesting properties of Bloom filter:
- Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
- Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
- Bloom filters never generate false negative result, i.e., telling you that a username doesnt exist when it actually exists.
- Deleting elements from filter is not possible because, if we delete a single element by clearing bits at indices generated by k hash functions, it might cause deletion of few other elements.
![bloom-filter-url](../assets/thinkings/bloom-filter-url.png)
### Application of Bloom Filter
1. Network crawler
whether an URL has been crawled or not.
2. `key-value` database
Whether a Key exists in this database or not.
Example:
Each region of HBase has a Bloom Filter, which can be used to find whether a key exists in this region or not quickly.
3. Phishing websites detection
Sometimes browsers may alert that a website you are accessing is a phishing website.
Browsers use Bloom Filter to find wheter URL of one website exists in the phishing website database.
> Hope this Algorithm can help you have a better understanding to `TRADEOFF`.