ブルームフィルタ - 確率的データ構造による高速存在判定 | 仕組み解説

この記事の要点

• ブルームフィルタ: 「要素が含まれない」を確実に判定し、「含まれる」は偽陽性の可能性あり
• 通常の Set と比べてメモリ使用量が 1/10 以下になることも
• Cassandra・BigTable・Chrome の Safe Browsing など広範囲で採用

ブルームフィルタ（Bloom Filter）は、要素が集合に含まれるかを確率的に高速判定するデータ構造です。偽陽性（False Positive）は発生しますが偽陰性（False Negative）は発生せず、メモリ効率が極めて高いため、分散システムやデータベースで広く使われています。本記事では、原理、実装、チューニング、応用例を体系的に解説します。

概要

ブルームフィルタとは

1970年にBurton Howard Bloomが考案した確率的データ構造で、ビット配列と複数のハッシュ関数を使って集合のメンバーシップを判定します。

flowchart TB
    Element["要素 'hello'"]
    Hash1["hash1('hello') = 3"]
    Hash2["hash2('hello') = 7"]
    Hash3["hash3('hello') = 11"]

    BitArray["ビット配列<br/>[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0]"]

    Element --> Hash1 --> BitArray
    Element --> Hash2 --> BitArray
    Element --> Hash3 --> BitArray

    Query["検索 'hello'"]
    Check["3,7,11 番目が全て 1?"]
    Result["YES: 含まれる可能性あり"]

    Query --> Check --> Result

注意: ブルームフィルタは削除操作ができません（削除すると他の要素のビットを誤って消す可能性）。削除が必要なら Counting Bloom Filter を使います。

偽陽性と偽陰性

判定結果	実際の状態	結果
含まれる	含まれる	正解（True Positive）
含まれる	含まれない	偽陽性（False Positive）
含まれない	含まれない	正解（True Negative）
含まれない	含まれる	偽陰性（発生しない）

ブルームフィルタは偽陰性は絶対に発生しないため、「含まれない」判定は100%正確です。

原則・定義

基本パラメータ

パラメータ	記号	説明
ビット配列のサイズ	m	フィルタのメモリサイズを決定
ハッシュ関数の数	k	偽陽性率に影響
挿入要素数	n	実際に格納する要素の数
偽陽性率	p	誤って「含まれる」と判定する確率

偽陽性率の計算式

ポイント: 偽陽性率 p は次の式で計算できます: p ≈ (1 - e^(-kn/m))^k

最適なハッシュ関数の数 k は k = (m/n) * ln(2) ≈ 0.693 * (m/n) で求まります。

必要なビット配列サイズ m は m = - (n * ln(p)) / (ln(2))^2 ≈ -1.44 * n * log2(p) で求まります。

計算例

100万要素を格納し、偽陽性率 1% を達成する場合:

n = 1,000,000
p = 0.01
m = - (1,000,000 * ln(0.01)) / (ln(2))^2 ≈ 9,585,058 bits ≈ 1.14 MB
k = (9,585,058 / 1,000,000) * ln(2) ≈ 6.6 ≈ 7 個のハッシュ関数

通常の Set（8バイト/要素と仮定）なら約 7.6MB 必要なので、約 1/7 のメモリで済みます。

構成要素

動作フロー

1. 初期化

// ビット配列をすべて 0 で初期化
const bitArray = new Uint8Array(Math.ceil(m / 8)).fill(0);

2. 要素の追加（Add）

function add(element: string): void {
  for (let i = 0; i < k; i++) {
    const hash = hashFunctions[i](element);
    const index = hash % m;
    setBit(bitArray, index, 1);
  }
}

3. 要素の検索（Contains）

function contains(element: string): boolean {
  for (let i = 0; i < k; i++) {
    const hash = hashFunctions[i](element);
    const index = hash % m;
    if (getBit(bitArray, index) === 0) {
      return false; // 確実に含まれない
    }
  }
  return true; // おそらく含まれる（偽陽性の可能性あり）
}

実装例

1. 基本的なブルームフィルタ（TypeScript）

import crypto from "crypto";

class BloomFilter {
  private bitArray: Uint8Array;
  private size: number; // m: ビット配列のサイズ
  private numHashes: number; // k: ハッシュ関数の数

  constructor(expectedElements: number, falsePositiveRate: number) {
    // m を計算
    this.size = Math.ceil(
      (-expectedElements * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2),
    );
    // k を計算
    this.numHashes = Math.ceil((this.size / expectedElements) * Math.LN2);
    // ビット配列を確保
    this.bitArray = new Uint8Array(Math.ceil(this.size / 8)).fill(0);

    console.log(`Bloom Filter: m=${this.size}, k=${this.numHashes}`);
  }

  private hash(value: string, seed: number): number {
    const hash = crypto.createHash("sha256").update(`${value}:${seed}`).digest();
    // 最初の4バイトを数値に変換
    return (
      (hash[0] << 24) |
      (hash[1] << 16) |
      (hash[2] << 8) |
      hash[3]
    ) >>> 0; // 符号なし32bit整数
  }

  private setBit(index: number): void {
    const byteIndex = Math.floor(index / 8);
    const bitIndex = index % 8;
    this.bitArray[byteIndex] |= (1 << bitIndex);
  }

  private getBit(index: number): number {
    const byteIndex = Math.floor(index / 8);
    const bitIndex = index % 8;
    return (this.bitArray[byteIndex] >> bitIndex) & 1;
  }

  add(value: string): void {
    for (let i = 0; i < this.numHashes; i++) {
      const hash = this.hash(value, i);
      const index = hash % this.size;
      this.setBit(index);
    }
  }

  contains(value: string): boolean {
    for (let i = 0; i < this.numHashes; i++) {
      const hash = this.hash(value, i);
      const index = hash % this.size;
      if (this.getBit(index) === 0) {
        return false; // 確実に含まれない
      }
    }
    return true; // おそらく含まれる
  }

  // 現在の偽陽性率を推定
  estimateFalsePositiveRate(insertedCount: number): number {
    const ratio = insertedCount / this.size;
    return Math.pow(1 - Math.exp(-this.numHashes * ratio), this.numHashes);
  }
}

// 使用例
const bf = new BloomFilter(1000000, 0.01); // 100万要素、1%偽陽性

bf.add("user:12345");
bf.add("user:67890");

console.log(bf.contains("user:12345")); // true
console.log(bf.contains("user:99999")); // false (高確率)
console.log(bf.contains("user:00000")); // false または true (偽陽性の可能性)

2. Counting Bloom Filter（削除対応）

class CountingBloomFilter {
  private counters: Uint8Array; // ビットの代わりにカウンター
  private size: number;
  private numHashes: number;

  constructor(expectedElements: number, falsePositiveRate: number) {
    this.size = Math.ceil(
      (-expectedElements * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2),
    );
    this.numHashes = Math.ceil((this.size / expectedElements) * Math.LN2);
    this.counters = new Uint8Array(this.size).fill(0); // 各セルがカウンター
  }

  private hash(value: string, seed: number): number {
    // 同上（省略）
    return 0;
  }

  add(value: string): void {
    for (let i = 0; i < this.numHashes; i++) {
      const index = this.hash(value, i) % this.size;
      if (this.counters[index] < 255) {
        this.counters[index]++;
      }
    }
  }

  remove(value: string): void {
    for (let i = 0; i < this.numHashes; i++) {
      const index = this.hash(value, i) % this.size;
      if (this.counters[index] > 0) {
        this.counters[index]--;
      }
    }
  }

  contains(value: string): boolean {
    for (let i = 0; i < this.numHashes; i++) {
      const index = this.hash(value, i) % this.size;
      if (this.counters[index] === 0) {
        return false;
      }
    }
    return true;
  }
}

実践メモ: Counting Bloom Filter は通常のブルームフィルタの4〜8倍のメモリを使います（1ビット→4〜8ビットのカウンター）。削除が不要なら通常版を使いましょう。

3. Python 実装例（mmh3 使用）

import mmh3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size: int, num_hashes: int):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def add(self, item: str):
        for seed in range(self.num_hashes):
            index = mmh3.hash(item, seed) % self.size
            self.bit_array[index] = 1

    def contains(self, item: str) -> bool:
        for seed in range(self.num_hashes):
            index = mmh3.hash(item, seed) % self.size
            if self.bit_array[index] == 0:
                return False
        return True

# 使用例
bf = BloomFilter(size=10000, num_hashes=7)
bf.add("apple")
bf.add("banana")

print(bf.contains("apple"))   # True
print(bf.contains("orange"))  # False (高確率)

メリット・デメリット

メリット

メモリ効率: 通常の Set の 1/10 〜 1/100 以下
高速: O(k) の定数時間（ハッシュ関数の数に依存）
偽陰性なし: 「含まれない」は100%正確
キャッシュフレンドリ: ビット配列が連続メモリ

デメリット

偽陽性: 確率的に誤判定が発生
削除不可: 通常版は削除操作をサポートしない
サイズ固定: 要素数を事前に見積もる必要
列挙不可: 含まれる要素の一覧は取得できない

ユースケース

1. Apache Cassandra（SSTable フィルタ）

Cassandra は各 SSTable（ディスク上のデータファイル）にブルームフィルタを持ち、存在しないキーへのディスクアクセスを回避します。

# Cassandra の nodetool でブルームフィルタ統計を確認
nodetool cfstats keyspace.table | grep "Bloom filter"
# Bloom filter false positives: 0
# Bloom filter false ratio: 0.00000
# Bloom filter space used: 1234567

2. Google Chrome（Safe Browsing）

Chrome は悪意のあるサイトのURLをブルームフィルタで高速判定し、該当する場合のみサーバーに問い合わせます。

3. HBase（StoreFile フィルタ）

HBase は Cassandra と同様、HFile（ディスクファイル）ごとにブルームフィルタを持ち、Get 操作を高速化します。

4. BitTorrent（ピア重複除外）

BitTorrent クライアントは、既に接続したピアをブルームフィルタで記録し、再接続を避けます。

5. CDN（キャッシュ判定）

Akamai 等の CDN は、キャッシュに存在しないオブジェクトへのルックアップをブルームフィルタで回避します。

落とし穴

1. 要素数の見積もりミス

実際の要素数が予想を超えると、偽陽性率が急上昇します。余裕を持って設計しましょう。

// 悪い例: 要素数の見積もりが甘い
const bf = new BloomFilter(1000, 0.01); // 1000個のつもり
for (let i = 0; i < 10000; i++) {
  bf.add(`item-${i}`); // 実際は1万個!
}
// → 偽陽性率が 50% 以上になる可能性

2. ハッシュ関数の品質

暗号学的ハッシュ（SHA-256）は遅いため、MurmurHash3 や xxHash を使いましょう。

3. Counting Bloom Filter のオーバーフロー

カウンターが 255（8bit）を超えるとオーバーフローします。4bit カウンター（最大15）でも実用上は十分です。

4. 偽陽性率の誤理解

「1%の偽陽性率」は、フィルタに含まれない要素を検索した際に1%の確率で「含まれる」と誤判定することです。含まれる要素の判定は100%正確です。

5. ビット配列のサイズ制限

JavaScript の TypedArray は最大 2GB 程度です。それ以上のフィルタが必要なら、外部ライブラリ（Redis Bloom）や複数フィルタの併用を検討します。

比較表

ブルームフィルタの種類

種類	メモリ	削除	偽陽性率	用途
Standard Bloom Filter	1x	✗	p	静的集合
Counting Bloom Filter	4〜8x	◯	p	動的集合
Scalable Bloom Filter	可変	✗	p	要素数不明
Cuckoo Filter	1〜2x	◯	p	削除が頻繁

代替データ構造との比較

データ構造	メモリ（100万要素）	検索	挿入	削除	偽陽性
Hash Set	約 8MB	O(1)	O(1)	O(1)	なし
Bloom Filter	約 1.14MB (1%FP)	O(k)	O(k)	✗	あり
Cuckoo Filter	約 1.5MB (1%FP)	O(1)	O(1)	O(1)	あり
Skip List	約 12MB	O(log n)	O(log n)	O(log n)	なし

ベストプラクティス

要素数を余裕を持って見積もる: 実測値の 1.5〜2倍で設計
偽陽性率は 1〜5% が実用的: 0.1% 以下はメモリ効率が悪い
高速ハッシュ関数を使う: MurmurHash3、xxHash、CityHash
削除が必要なら Cuckoo Filter を検討: メモリ効率も良い
定期的にリビルド: 要素数が増えたら新しいフィルタを作成
偽陽性後の処理を実装: フィルタが「含まれる」と返しても、実データで再確認
監視: 実際の偽陽性率をメトリクスで追跡
ビット配列の圧縮: RLE（Run-Length Encoding）で圧縮可能

まとめ

ブルームフィルタは、メモリ効率と引き換えに偽陽性を許容する、実用的な確率的データ構造です。

原理: ビット配列 + 複数ハッシュ関数
保証: 偽陰性なし、偽陽性あり
メモリ: 通常の Set の 1/10 以下
採用例: Cassandra、HBase、Chrome、BitTorrent、CDN
代替: Cuckoo Filter（削除対応、同等のメモリ効率）

「存在しない」を高速に判定する必要がある場面では、ブルームフィルタが最適解です。

応用トピック

Cuckoo Filter

ブルームフィルタの代替で、削除操作をサポートし、メモリ効率もほぼ同等です。Fingerprint（要素のハッシュ値の一部）を Cuckoo Hashing で格納します。

class CuckooFilter {
  // Fingerprint ベースの実装（詳細省略）
  // - 削除可能
  // - メモリ効率は Bloom Filter と同等
  // - 検索が O(1) (Bloom は O(k))
}

Scalable Bloom Filter

要素数が事前に分からない場合、フィルタを動的に追加する Scalable Bloom Filter を使います。

class ScalableBloomFilter {
  private filters: BloomFilter[] = [];
  private currentFilter: BloomFilter;
  private growthFactor = 2;

  add(value: string): void {
    if (this.currentFilter.isFull()) {
      // 新しいフィルタを追加（サイズを2倍に）
      this.filters.push(this.currentFilter);
      this.currentFilter = new BloomFilter(
        this.currentFilter.size * this.growthFactor,
        this.currentFilter.numHashes,
      );
    }
    this.currentFilter.add(value);
  }

  contains(value: string): boolean {
    // 全フィルタをチェック
    return this.filters.some((f) => f.contains(value)) || this.currentFilter.contains(value);
  }
}

Quotient Filter

ブルームフィルタの改良版で、キャッシュ効率とマージ操作（2つのフィルタを結合）をサポートします。

Redis Bloom Module

Redis の公式モジュールで、ブルームフィルタ・Cuckoo Filter・Count-Min Sketch 等を提供します。

# Redis Bloom のインストール
redis-server --loadmodule /path/to/redisbloom.so

# ブルームフィルタの作成
redis-cli BF.RESERVE myfilter 0.01 1000000

# 要素追加
redis-cli BF.ADD myfilter "user:12345"

# 検索
redis-cli BF.EXISTS myfilter "user:12345"  # 1 (存在)
redis-cli BF.EXISTS myfilter "user:99999"  # 0 (非存在の可能性大)

参考リソース

データベースインデックス最適化 - ブルームフィルタを使ったディスクアクセス削減
キャッシング戦略 - キャッシュヒット判定でのブルームフィルタ活用

この技術を体系的に学びたいですか？

未来学では東証プライム上場企業のITエンジニアが24時間サポート。月額24,800円から、退会金0円のオンラインIT塾です。

メールで無料相談する

← 一覧に戻る

概要

ブルームフィルタとは

偽陽性と偽陰性

原則・定義

基本パラメータ

偽陽性率の計算式

計算例

構成要素

動作フロー

1. 初期化

2. 要素の追加（Add）

3. 要素の検索（Contains）

実装例

1. 基本的なブルームフィルタ（TypeScript）

2. Counting Bloom Filter（削除対応）

3. Python 実装例（mmh3 使用）

メリット・デメリット

メリット

デメリット

ユースケース

1. Apache Cassandra（SSTable フィルタ）

2. Google Chrome（Safe Browsing）

3. HBase（StoreFile フィルタ）

4. BitTorrent（ピア重複除外）

5. CDN（キャッシュ判定）

落とし穴

1. 要素数の見積もりミス

2. ハッシュ関数の品質

3. Counting Bloom Filter のオーバーフロー

4. 偽陽性率の誤理解

5. ビット配列のサイズ制限

比較表

ブルームフィルタの種類

代替データ構造との比較

ベストプラクティス

まとめ

応用トピック

Cuckoo Filter

Scalable Bloom Filter

Quotient Filter

Redis Bloom Module

参考リソース

関連記事

おすすめ記事

バックプレッシャー - 過負荷制御とフロー制御のパターン

Redisキャッシュ実践ガイド - 高速なデータアクセスの設計と実装

Edge Functions - エッジコンピューティングの実践

合意アルゴリズム - Raft・Paxos・ZABの原理と実装

コンシステントハッシュ法 - 分散キャッシュ・データベースの負荷分散

Write-Ahead Log - WAL による耐障害性とクラッシュリカバリ