Bloom Filter
Data Structure and Algorithms
Bloom Filter composed by an n bits bit-vector which initialized all bits to 0 and k different hash functions. Bloom Filter support two operation, add and query. Add operation is used to add new element in to Bloom Filter. Add operation first compute k hash functions of the element and then set correspond bit 1.In the same way, query operation compute k hash functions of the element and then check if correspond bit is set. If all the correspond bits is set to 1, then we can claim the membership of this element. If one of the correspond bits is clear, this element is not belong to the set.
Time and Space Complexity
The time to add and query both have time complexity O(k) , where k is the constant we can set. So we can estimate the two operations complete in constant time. And Bloom Filter is also space-saving because it take only n bits to store Bloom Filter.Split Screen
Architecture
Split Screen has three components : Malware Database, Server and Client Agent.Fig 1 System Architecture |
System Flow
In my system , server will check malware database for newly update malware and add new malware to Bloom Filter. Then send a command to client for updating BloomFilter.Implementation
In my implementation, source code structure is as below:[server]
FFBFServer.py
[client]
FFBFCli.py
client.py
[common]
BF_Operation.py
FeedForwardBloomFilter.py
DirectoryList.pyHash.py
First we look into Hash.py, Although Bloom Filter need k different hash , in practice we can use linear combination of two hash functions instead. So Hash.py only compute two hash value about a file.
With Hash.py's help, FeedForwardBloomFilter.py implement add and query operation.
class FFBloomFilter(object): def add(self,fname): self.hashClass.hash(fname) val1 = int(self.hashClass.getHash(),16) val2 = int(self.hashClass.getHash2(),16) hashValue =val1%self.size for _ in range(self.hashNum): self.bf1[hashValue]+=1 hashValue+=val2%self.size hashValue%=self.size def query(self,fname): self.hashClass.hash(fname) val1 = int(self.hashClass.getHash(),16) val2 = int(self.hashClass.getHash2(),16) hashValue =val1%self.size for _ in range(self.hashNum): if self.bf1[hashValue]==0 : return False hashValue+=val2%self.size hashValue%=self.size hashValue =val1%self.size for _ in range(self.hashNum): self.bf2[hashValue]+=1 hashValue+=val2%self.size hashValue%=self.size return True
BF_Operation.py implement algorithms in SplitScreen as Fig 2
Fig 2 SplitScreen Architecture |
class BF_Operation(object): def __init__(self): self.hashNum = 10 self.bitSize = 1024*1024 def FFBF_INIT(self,path,outBF): ffbf = FeedForwardBloomFilter.FFBloomFilter(self.bitSize,self.hashNum) Dir = DirectoryList.DirList() Dir.ListAdd(path,ffbf) ffbf.serialize(outBF) def FFBF_SCAN(self,path,inBF,outBF): ffbf = FeedForwardBloomFilter.FFBloomFilter(self.bitSize,self.hashNum) ffbf.parsing(inBF) Dir = DirectoryList.DirList() ffbf.clear() matchdata = [] Dir.ListCheck(path,ffbf,matchdata) ffbf.serialize(outBF) return matchdataWhere ListAdd() and ListCheck() treat a path of directory as input and add(check) all files in directory. In the paper, there are two more algorithms, FFBF_HIT() and FFBF_verify(), which are not needed while we only need to test membership of files.
沒有留言:
張貼留言