Performance Analysis of Sketching Methods
DOI:
https://doi.org/10.3126/nccsrj.v2i1.60082Keywords:
MinHash, Sketch, Jaccard Similarity, DensificationAbstract
One of the problems in data mining or machine learning is the high-dimensional dataset. MinHash can generate sketches of sparse datasets efficiently reducing the dimension to a few thousand. These sketches, then, can be used for different machine-learning applications. It takes O(kd) computations to generate k hash values for a data point with d non-zeros. The Sketch of a data point is the vector of k hash values. Weighted Minwise Hashing is another method to generate a sketch of size k in O(kp/d) computations. Here, p is the size of the universal set. Optimal Densification is the most efficient and accurate method as it can generate k hash values in mere O(d + k) computations. In this paper, we investigate the performance of Optimal Densification, Weighted Minwise Hashing, and Vanilla MinHash by performing two experiments. Firstly, we investigate Jaccard similarity estimation accuracy on six different synthetic datasets. Then, we perform one nearest neighbor classification (1NN) of four real datasets. Optimal Densification outperforms both Weighted Minwise Hashing and Vanilla MinHash in terms of accuracy and time taken to generate the sketches.