97久久香蕉国产线看观看,日批日出水久久亚洲精品tv,亚洲狠狠婷婷综合久久久久

使用 Node.js 和 Redis 探索 Bloom Filter 的魅力

在正確的用例中，布隆過濾器看起來就像魔法。這是一個大膽的說法，但在本教程中，我們將探討這種奇怪的數(shù)據(jù)結(jié)構(gòu)、如何最好地使用它，以及一些使用 Redis 和 Node.js 的實際示例。

布隆過濾器是一種概率性、單向數(shù)據(jù)結(jié)構(gòu)。在這種情況下，“過濾器”一詞可能會令人困惑；過濾器意味著它是一個活躍的事物，一個動詞，但將它視為存儲，一個名詞可能更容易。使用簡單的布隆過濾器，您可以做兩件事：

添加一個項目。
檢查之前是否未添加過某個項目。

這些是需要理解的重要限制 – 您無法刪除項目，也無法在布隆過濾器中列出項目。此外，您無法確定過去是否已將某個項目添加到過濾器中。這就是布隆過濾器的概率性質(zhì)發(fā)揮作用的地方——誤報是可能的，但誤報則不然。如果過濾器設(shè)置正確，誤報的可能性會非常小。

存在布隆過濾器的變體，它們添加了其他功能，例如刪除或縮放，但它們也增加了復(fù)雜性和限制。在繼續(xù)了解變體之前，首先了解簡單的布隆過濾器非常重要。本文僅介紹簡單的布隆過濾器。

有了這些限制，您可以獲得許多好處：固定大小、基于哈希的加密和快速查找。

當(dāng)您設(shè)置布隆過濾器時，您需要為其指定一個大小。此大小是固定的，因此如果過濾器中有一項或十億項，它永遠(yuǎn)不會增長超過指定的大小。當(dāng)您向過濾器添加更多項目時，誤報的可能性就會增加。如果您指定了較小的過濾器，則與使用較大的過濾器相比，誤報率會增加得更快。

布隆過濾器建立在單向散列的概念之上。與正確存儲密碼非常相似，布隆過濾器使用哈希算法來確定傳入其中的項目的唯一標(biāo)識符。哈希本質(zhì)上是無法逆轉(zhuǎn)的，并且由看似隨機的字符串表示。因此，如果有人獲得了布隆過濾器的訪問權(quán)限，它不會直接泄露任何內(nèi)容。

最后，布隆過濾器速度很快。與其他方法相比，該操作涉及的比較次數(shù)要少得多，并且可以輕松存儲在內(nèi)存中，從而防止影響性能的數(shù)據(jù)庫命中。

現(xiàn)在您已經(jīng)了解了布隆過濾器的限制和優(yōu)點，讓我們來看看可以使用它們的一些情況。

設(shè)置

我們將使用 Redis 和 Node.js 來說明 Bloom 過濾器。 Redis 是 Bloom 過濾器的存儲介質(zhì)；它速度快，位于內(nèi)存中，并且有一些特定的命令（GETBIT、SETBIT），可以提高實施效率。我假設(shè)您的系統(tǒng)上安裝了 Node.js、npm 和 Redis。您的 Redis 服務(wù)器應(yīng)該在 localhost 上的默認(rèn)端口上運行，以便我們的示例正常工作。

在本教程中，我們不會從頭開始實現(xiàn)過濾器；而是從頭開始實現(xiàn)過濾器。相反，我們將重點關(guān)注 npm 中預(yù)構(gòu)建模塊的實際用途：bloom-redis。 bloom-redis 有一組非常簡潔的方法：add、contains 和 clear。

如前所述，布隆過濾器需要哈希算法來生成項目的唯一標(biāo)識符。 bloom-redis 使用眾所周知的 MD5 算法，盡管它可能不適合 Bloom 過濾器（有點慢，有點過大），但可以正常工作。

獨特的用戶名

用戶名，尤其是在 URL 中標(biāo)識用戶的用戶名，需要是唯一的。如果您構(gòu)建一個允許用戶更改用戶名的應(yīng)用程序，那么您可能需要一個從未使用過的用戶名，以避免用戶名混淆和被攻擊。

如果沒有布隆過濾器，您需要引用一個包含曾經(jīng)使用過的每個用戶名的表，而大規(guī)模時這可能會非常昂貴。布隆過濾器允許您在用戶每次采用新名稱時添加一個項目。當(dāng)用戶檢查用戶名是否被占用時，您所需要做的就是檢查布隆過濾器。它將能夠絕對確定地告訴您所請求的用戶名是否先前已添加。過濾器可能會錯誤地返回用戶名已被使用，而實際上用戶名尚未被使用，但這只是為了謹(jǐn)慎起見，不會造成任何真正的傷害（除了用戶可能無法聲明“k3w1d00d47”） .

為了說明這一點，讓我們使用 Express 構(gòu)建一個快速的 REST 服務(wù)器。首先，創(chuàng)建 package.json 文件，然后運行以下終端命令。

npm 安裝bloom-redis –save

npm install express –save

npm install redis –save

bloom-redis 的默認(rèn)選項大小設(shè)置為 2 MB。出于謹(jǐn)慎考慮，這是錯誤的，但它相當(dāng)大。設(shè)置布隆過濾器的大小至關(guān)重要：太大會浪費內(nèi)存，太小則誤報率會太高。確定大小所涉及的數(shù)學(xué)非常復(fù)雜，超出了本教程的范圍，但幸運的是，有一個布隆過濾器大小計算器可以完成這項工作，而無需破解教科書。

現(xiàn)在，創(chuàng)建 app.js 如下：

 var   Bloom         =   require('bloom-redis'),   express       =   require('express'),   redis         =   require('redis'),      app,   client,   filter;  //setup our Express server app = express();  //create the connection to Redis client = redis.createClient();   filter = new Bloom.BloomFilter({    client    : client, //make sure the Bloom module uses our newly created connection to Redis   key       : 'username-bloom-filter', //the Redis key      //calculated size of the Bloom filter.   //This is where your size / probability trade-offs are made   //http://hur.st/bloomfilter?n=100000&amp;p=1.0E-6   size      : 2875518, // ~350kb   numHashes : 20 });  app.get('/check', function(req,res,next) {   //check to make sure the query string has 'username'   if (typeof req.query.username === 'undefined') {     //skip this route, go to the next one - will result in a 404 / not found     next('route');   } else {    filter.contains(      req.query.username, // the username from the query string      function(err, result) {        if (err) {          next(err); //if an error is encountered, send it to the client         } else {           res.send({              username : req.query.username,              //if the result is false, then we know the item has *not* been used             //if the result is true, then we can assume that the item has been used             status : result ? 'used' : 'free'            });         }       }     );   } });   app.get('/save',function(req,res,next) {   if (typeof req.query.username === 'undefined') {     next('route');   } else {     //first, we need to make sure that it's not yet in the filter     filter.contains(req.query.username, function(err, result) {       if (err) { next(err); } else {         if (result) {           //true result means it already exists, so tell the user           res.send({ username : req.query.username, status : 'not-created' });         } else {           //we'll add the username passed in the query string to the filter           filter.add(             req.query.username,              function(err) {               //The callback arguments to `add` provides no useful information, so we'll just check to make sure that no error was passed               if (err) { next(err); } else {                 res.send({                    username : req.query.username, status : 'created'                  });               }             }           );         }       }     });   } });  app.listen(8010);

要運行此服務(wù)器：node app.js。轉(zhuǎn)到瀏覽器并將其指向：https://localhost:8010/check?username=kyle。響應(yīng)應(yīng)該是：{“username”:”kyle”,”status”:”free”}。

現(xiàn)在，讓我們通過將瀏覽器指向 http://localhost:8010/save?username=kyle 來保存該用戶名。響應(yīng)將是：{“username”:”kyle”,”status”:”created”}。如果返回地址 http://localhost:8010/check?username=kyle，響應(yīng)將是 {“username”:”kyle”,”status “:”已使用”}.同樣，返回 http://localhost:8010/save?username=kyle 將導(dǎo)致 {“username”:”kyle”,”status”:”not -創(chuàng)建“}。

從終端中，您可以看到過濾器的大小： redis-cli strlen 用戶名-bloom-filter。

現(xiàn)在，對于一項，它應(yīng)該顯示 338622。

現(xiàn)在，繼續(xù)嘗試使用 /save 路由添加更多用戶名。您想嘗試多少就嘗試多少。

如果您再次檢查尺寸，您可能會發(fā)現(xiàn)尺寸略有增加，但并非每次添加都會增加。很好奇，對吧？在內(nèi)部，布隆過濾器在保存在 username-bloom 的字符串中的不同位置設(shè)置各個位（1/0）。然而，這些并不是連續(xù)的，因此如果您在索引 0 處設(shè)置一位，然后在索引 10,000 處設(shè)置一位，則兩者之間的所有內(nèi)容都將為 0。對于實際用途，一開始了解每個操作的精確機制并不重要，只需知道這一點即可這是正常的，您在 Redis 中的存儲永遠(yuǎn)不會超過您指定的值。

新鮮內(nèi)容

網(wǎng)站上的新鮮內(nèi)容可以吸引用戶回頭客，那么如何每次都向用戶展示新的內(nèi)容呢？使用傳統(tǒng)的數(shù)據(jù)庫方法，您可以向表中添加一個包含用戶標(biāo)識符和故事標(biāo)識符的新行，然后在決定顯示一段內(nèi)容時查詢該表。正如您可能想象的那樣，您的數(shù)據(jù)庫將增長得非常快，尤其是隨著用戶和內(nèi)容的增長。

在這種情況下，漏報（例如不顯示看不見的內(nèi)容）的后果非常小，這使得布隆過濾器成為一個可行的選擇。乍一看，您可能認(rèn)為每個用戶都需要一個布隆過濾器，但我們將使用用戶標(biāo)識符和內(nèi)容標(biāo)識符的簡單串聯(lián)，然后將該字符串插入到我們的過濾器中。這樣我們就可以對所有用戶使用單個過濾器。

在此示例中，讓我們構(gòu)建另一個顯示內(nèi)容的基本 Express 服務(wù)器。每次您訪問路由 /show-content/any-username （any-username 是任何 URL 安全值）時，都會顯示一條新內(nèi)容，直到該網(wǎng)站沒有內(nèi)容。在示例中，內(nèi)容是古騰堡計劃前十名書籍的第一行。

我們需要再安裝一個 npm 模塊。從終端運行： npm install async –save

您的新 app.js 文件：

 var   async         =   require('async'),   Bloom         =   require('bloom-redis'),   express       =   require('express'),   redis         =   require('redis'),      app,   client,   filter,      // From Project Gutenberg - opening lines of the top 10 public domain ebooks   // https://www.gutenberg.org/browse/scores/top   openingLines = {     'pride-and-prejudice' :        'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.',     'alices-adventures-in-wonderland' :        'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it' }

如果您仔細(xì)注意開發(fā)工具中的往返時間，您會發(fā)現(xiàn)使用用戶名請求單個路徑的次數(shù)越多，所需的時間就越長。雖然檢查過濾器需要固定的時間，但在本例中，我們正在檢查是否存在更多項目。布隆過濾器能夠告訴您的信息有限，因此您正在測試每個項目是否存在。當(dāng)然，在我們的示例中，它相當(dāng)簡單，但測試數(shù)百個項目效率很低。

過時數(shù)據(jù)

在此示例中，我們將構(gòu)建一個小型 Express 服務(wù)器，它將執(zhí)行兩件事：通過 POST 接受新數(shù)據(jù)，并顯示當(dāng)前數(shù)據(jù)（使用 GET 請求）。當(dāng)新數(shù)據(jù)被 POST 到服務(wù)器時，應(yīng)用程序?qū)z查它是否存在于過濾器中。如果它不存在，我們會將其添加到 Redis 中的集合中，否則我們將返回 null。 GET 請求將從 Redis 獲取它并將其發(fā)送到客戶端。

這與前兩種情況不同，誤報是不行的。我們將使用布隆過濾器作為第一道防線。考慮到布隆過濾器的屬性，我們只能確定某些東西不在過濾器中，因此在這種情況下，我們可以繼續(xù)讓數(shù)據(jù)進入。如果布隆過濾器返回的數(shù)據(jù)可能在過濾器中，我們將根據(jù)實際數(shù)據(jù)源進行檢查。

那么，我們得到了什么？我們獲得了不必每次都檢查實際來源的速度。在數(shù)據(jù)源速度較慢的情況下（外部 API、小型數(shù)據(jù)庫、平面文件的中間），確實需要提高速度。為了演示速度，我們在示例中添加 150 毫秒的實際延遲。我們還將使用 console.time / console.timeEnd 來記錄 Bloom 過濾器檢查和非 Bloom 過濾器檢查之間的差異。

在此示例中，我們還將使用極其有限的位數(shù)：僅 1024。它很快就會填滿。當(dāng)它填滿時，它將顯示越來越多的誤報 – 您會看到響應(yīng)時間隨著誤報率的填滿而增加。

該服務(wù)器使用與之前相同的模塊，因此將 app.js 文件設(shè)置為：

 var   async           =   require('async'),   Bloom           =   require('bloom-redis'),   bodyParser      =   require('body-parser'),   express         =   require('express'),   redis           =   require('redis'),      app,   client,   filter,      currentDataKey  = 'current-data',   usedDataKey     = 'used-data';    app = express(); client = redis.createClient();  filter = new Bloom.BloomFilter({    client    : client,   key       : 'stale-bloom-filter',   //for illustration purposes, this is a super small filter. It should fill up at around 500 items, so for a production load, you'd need something much larger!   size      : 1024,   numHashes : 20 });  app.post(   '/',   bodyParser.text(),   function(req,res,next) {     var       used;            console.log('POST -', req.body); //log the current data being posted     console.time('post'); //start measuring the time it takes to complete our filter and conditional verification process          //async.series is used to manage multiple asynchronous function calls.     async.series([       function(cb) {         filter.contains(req.body, function(err,filterStatus) {           if (err) { cb(err); } else {             used = filterStatus;             cb(err);           }         });       },       function(cb) {         if (used === false) {           //Bloom filters do not have false negatives, so we need no further verification           cb(null);         } else {           //it *may* be in the filter, so we need to do a follow up check           //for the purposes of the tutorial, we'll add a 150ms delay in here since Redis can be fast enough to make it difficult to measure and the delay will simulate a slow database or API call           setTimeout(function() {             console.log('possible false positive');             client.sismember(usedDataKey, req.body, function(err, membership) {               if (err) { cb(err); } else {                 //sismember returns 0 if an member is not part of the set and 1 if it is.                 //This transforms those results into booleans for consistent logic comparison                 used = membership === 0 ? false : true;                 cb(err);               }             });           }, 150);         }       },       function(cb) {         if (used === false) {           console.log('Adding to filter');           filter.add(req.body,cb);         } else {           console.log('Skipped filter addition, [false] positive');           cb(null);         }       },       function(cb) {         if (used === false) {           client.multi()             .set(currentDataKey,req.body) //unused data is set for easy access to the 'current-data' key             .sadd(usedDataKey,req.body) //and added to a set for easy verification later             .exec(cb);          } else {           cb(null);         }       }       ],       function(err, cb) {         if (err) { next(err); } else {           console.timeEnd('post'); //logs the amount of time since the console.time call above           res.send({ saved : !used }); //returns if the item was saved, true for fresh data, false for stale data.         }       }     ); });  app.get('/',function(req,res,next) {   //just return the fresh data   client.get(currentDataKey, function(err,data) {     if (err) { next(err); } else {       res.send(data);     }   }); });  app.listen(8012);

由于使用瀏覽器 POST 到服務(wù)器可能會很棘手，所以讓我們使用curl 來測試。

curl –data“您的數(shù)據(jù)放在這里”–header“內(nèi)容類型：text/plain”http://localhost:8012/

可以使用快速 bash 腳本來顯示填充整個過濾器的外觀：

 #!/bin/bash for i in `seq 1 500`; do   curl --data “data $i" --header "Content-Type: text/plain" http://localhost:8012/ done

觀察填充或完整的過濾器很有趣。由于這個很小，你可以使用 redis-cli 輕松查看。通過在添加項目之間從終端運行 redis-cli get stale-filter ，您將看到各個字節(jié)增加。完整的過濾器將為每個字節(jié) xff 。此時，過濾器將始終返回正值。

結(jié)論

布隆過濾器并不是萬能的解決方案，但在適當(dāng)?shù)那闆r下，布隆過濾器可以為其他數(shù)據(jù)結(jié)構(gòu)提供快速、有效的補充。

? 版權(quán)聲明

文章版權(quán)歸作者所有，未經(jīng)允許請勿轉(zhuǎn)載。

THE END

模板教程
# 數(shù)據(jù)結(jié)構(gòu)