Learning Supervised Binary Hashing without Binary Code Optimization
Miguel Carreira-Perpinan; Ramin Raziperchikolaei
Exact nearest neighbor search over large image datasets, to find images similar to a query, is slow. The goal of supervised binary hashing is to speed up the search by learning a hash function that maps patterns such as images to bit codes whose Hamming distances preserve semantic similarity. Most approaches to learning hash functions must deal with a difficult, slow and inexact optimization over the binary codes. Recently, it has been shown that this optimization can be simplified by learning the single-bit hash functions independently and making them differ through ensemble learning techniques. We show that it is possible to dispense with the binary code optimization completely by assigning binary codes to the points based on their similarity to a set of seed points. Our method is very simple, scalable, and competitive with state-of-the-art methods in retrieval metrics.
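The abstract does not spell out the code-assignment rule, so the following is only an illustrative sketch of hashing against seed points, under assumptions of our own: each bit records whether a point is closer (in Euclidean distance) to one seed of a randomly chosen pair than to the other, and the code length is arbitrary. This is not the paper's method, just a minimal picture of seed-based code assignment.

```python
import numpy as np

def fit_seed_hash(X, n_bits=16, seed=0):
    """Pick n_bits random pairs of seed points from the training set X (assumption)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=(n_bits, 2), replace=False)
    return X[idx[:, 0]], X[idx[:, 1]]          # seeds_a, seeds_b: (n_bits, d) each

def hash_codes(X, seeds_a, seeds_b):
    """Bit j = 1 iff a point is closer to seed a_j than to seed b_j (illustrative rule)."""
    da = np.linalg.norm(X[:, None, :] - seeds_a[None, :, :], axis=2)
    db = np.linalg.norm(X[:, None, :] - seeds_b[None, :, :], axis=2)
    return (da < db).astype(np.uint8)          # shape (n_points, n_bits)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 32))            # synthetic "database"
    q = rng.normal(size=(5, 32))               # synthetic queries
    sa, sb = fit_seed_hash(X)
    codes_db, codes_q = hash_codes(X, sa, sb), hash_codes(q, sa, sb)
    # Hamming distances between query codes and database codes
    ham = (codes_q[:, None, :] != codes_db[None, :, :]).sum(axis=2)
    print(ham.shape)                           # (5, 1000)
```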
Sub-linear Privacy-preserving Search with Unsecured Server and Semi-honest Parties
Beidi Chen
In Near-Neighbor Search (NNS), a client queries a database (held by a server) for the most similar data (near-neighbors) under a given similarity metric. The privacy-preserving variant (PP-NNS) requires that neither the server nor the client learns anything about the other party's data except what can be inferred from the outcome of NNS. The overwhelming growth in the size of current datasets and the lack of a truly secure server in the online world render existing solutions impractical, either because of their high computational requirements or because of unrealistic assumptions that potentially compromise privacy. PP-NNS with query time sub-linear in the size of the database has been suggested as an open research direction. In this paper, we provide the first such algorithm, called Secure Locality Sensitive Indexing (SLSI), which has sub-linear query time and the ability to handle honest-but-curious parties. At the heart of our proposal lies a secure binary embedding scheme generated from a novel probabilistic transformation over a locality-sensitive hashing family. We provide information-theoretic bounds for the privacy guarantees and support our theoretical claims with substantial empirical evidence on real-world datasets.
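The SLSI transformation itself is not described in the abstract; the sketch below only shows the standard locality-sensitive ingredient such binary embeddings build on, a random-hyperplane (signed random projection) embedding whose Hamming distance tracks angular similarity. The Gaussian projections and bit length are assumptions, and nothing here reflects the secure or probabilistic transformation of the paper.

```python
import numpy as np

def random_hyperplane_embedding(X, n_bits=64, seed=0):
    """Signed random projections: bit j = 1 iff <w_j, x> >= 0.
    This is the classic LSH family for cosine/angular similarity."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_bits))   # random hyperplane normals
    return (X @ W >= 0).astype(np.uint8)

def hamming(a, b):
    """Pairwise Hamming distances between two code matrices."""
    return (a[:, None, :] != b[None, :, :]).sum(axis=2)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    db = rng.normal(size=(500, 128))
    q = rng.normal(size=(3, 128))
    codes_db = random_hyperplane_embedding(db)
    codes_q = random_hyperplane_embedding(q)
    nearest = hamming(codes_q, codes_db).argmin(axis=1)
    print(nearest)   # indices of approximate near neighbours in the embedded space
```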
On Nearest Neighbors in Non Local Means Denoising
Iuri Frosio; Jan Kautz
To denoise a reference patch, the Non-Local Means denoising filter processes a set of neighbor patches. A few Nearest Neighbors (NN) are used to limit the computational burden of the algorithm. Here we show analytically that the NN approach introduces a bias in the denoised patch, and we propose a different criterion for collecting neighbors, named Statistical NN (SNN), to alleviate this issue. Our approach outperforms the traditional one in the case of both white and colored noise: fewer SNNs generate images of higher quality, at a lower computational cost.
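As context, here is a minimal sketch of the standard NN-restricted Non-Local Means step the paper starts from: a reference patch is denoised as an exponentially weighted average of its k most similar candidate patches. The patch size, bandwidth h and k below are illustrative choices, and the proposed SNN selection rule is not reproduced here.

```python
import numpy as np

def nlm_denoise_patch(ref, candidates, k=16, h=0.1):
    """Denoise one reference patch with NN-restricted Non-Local Means.

    ref        : (p,) flattened reference patch
    candidates : (n, p) flattened candidate patches from the search window
    k          : number of nearest-neighbour patches used
    h          : filtering bandwidth
    """
    d2 = ((candidates - ref) ** 2).mean(axis=1)        # squared patch distances
    nn = np.argsort(d2)[:k]                            # indices of the k nearest patches
    w = np.exp(-d2[nn] / (h ** 2))                     # exponential similarity weights
    return (w[:, None] * candidates[nn]).sum(axis=0) / w.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.ones(25)                                # a flat 5x5 patch
    ref = clean + 0.1 * rng.normal(size=25)            # noisy reference
    cands = clean + 0.1 * rng.normal(size=(200, 25))   # noisy candidates
    print(np.abs(nlm_denoise_patch(ref, cands) - clean).mean())
```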
Fast k-Nearest Neighbour Search via Prioritized DCI
Ke Li; Jitendra Malik
Most exact methods for k-nearest neighbour search suffer from the curse of dimensionality; that is, their query times exhibit exponential dependence on either the ambient or the intrinsic dimensionality. Dynamic Continuous Indexing (DCI) offers a promising way of circumventing the curse and successfully reduces the dependence of query time on intrinsic dimensionality from exponential to sublinear. In this paper, we propose a variant of DCI, which we call Prioritized DCI, and show a remarkable improvement in the dependence of query time on intrinsic dimensionality. In particular, a linear increase in intrinsic dimensionality, or equivalently, an exponential increase in the number of points near a query, can be mostly counteracted with just a linear increase in space. We also demonstrate empirically that Prioritized DCI significantly outperforms prior methods. In particular, relative to Locality-Sensitive Hashing (LSH), Prioritized DCI reduces the number of distance evaluations by a factor of 14 to 116 and the memory consumption by a factor of 21.
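The following is a much-simplified sketch of the projection-indexing idea underlying DCI, not the full prioritized scheme: points are indexed by their projections onto a few random directions, a query retrieves the points whose projections lie closest to its own along each direction, and the resulting candidate set is re-ranked by true distance. The number of directions and candidates per direction are illustrative parameters of this sketch only.

```python
import numpy as np

def build_index(X, n_dirs=10, seed=0):
    """Project the data onto a few random unit directions (the 'index')."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)    # unit-norm directions
    return U, X @ U.T                                # projections: (n_points, n_dirs)

def query(q, X, U, proj, k=5, cands_per_dir=50):
    """Gather points whose projections are closest to the query's along each
    direction, then re-rank this small candidate set by exact distance.
    (The actual algorithm uses sorted indices and a priority queue instead of
    scanning all projections.)"""
    qp = U @ q
    cand = set()
    for j in range(U.shape[0]):
        near = np.argsort(np.abs(proj[:, j] - qp[j]))[:cands_per_dir]
        cand.update(near.tolist())
    cand = np.fromiter(cand, dtype=int)
    d = np.linalg.norm(X[cand] - q, axis=1)          # exact distances on candidates only
    return cand[np.argsort(d)[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 64))
    q = rng.normal(size=64)
    U, proj = build_index(X)
    print(query(q, X, U, proj, k=3))
```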
Generative Local Metric Learning for Nearest Neighbor Methods
Yung-Kyun Noh; Masashi Sugiyama; Daniel D Lee
The theoretical study of nearest neighbors goes back to T. Cover and P. Hart's work in 1967, which analyzed the asymptotic behavior of nearest neighbor classification with many data. Their best-known contributions are an upper bound on the asymptotic error, which is tighter than twice the Bayes error, and the idea of connecting nearest neighbor information to the underlying probability density functions. More recently, studies on nearest neighbors have built on Cover and Hart's theoretical perspective to develop useful techniques for better exploiting nearest neighbor information in various machine learning applications.
In this work, we explain how metric learning can be used together with information from generative models. A rough generative model can capture the local gradient and curvature of the underlying density function from the global shape of the data. We derive equations that use this gradient and curvature information to remove the ambient noise of high-dimensional space for nearest neighbor methods.
The applications include nearest neighbor classification, information-theoretic measure estimation, and regression using Nadaraya-Watson estimators. All of the work is based on an analysis that assumes a large amount of data, and the presented results often appear only in tasks with large numbers of data points. The results in this work help explain the behavior of nearest neighbors on large-scale data.
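For reference, the Nadaraya-Watson estimator mentioned above is the kernel-weighted average sketched below; the Gaussian kernel and bandwidth are the usual textbook choices, not the metric-learned variant developed in the work.

```python
import numpy as np

def nadaraya_watson(x_query, X, y, h=0.5):
    """Nadaraya-Watson regression: y_hat(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i),
    with a Gaussian kernel K(x, x_i) = exp(-||x - x_i||^2 / (2 h^2))."""
    d2 = ((X - x_query) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * h ** 2))
    return (w * y).sum() / w.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
    print(nadaraya_watson(np.array([1.0]), X, y))   # roughly sin(1) ~ 0.84
```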
Private Document Classification in Federated Databases
Phillipp Schoppmann; Adria Gascon; Borja Balle
We present two protocols for private document classification in distributed databases. Our protocols implement a nearest neighbours classifier allowing a client to classify a private document against a collection of labelled documents held by multiple mutually distrusting servers. The protocols are based on secure multi-party computation and differential privacy and provide formal privacy guarantees for all the parties involved in the computation. We evaluate our protocols in terms of speed and accuracy on synthetic and real datasets.
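To make the functionality concrete, here is a non-private sketch of the classifier the protocols compute securely: a k-nearest-neighbours majority vote over vectorized documents using cosine similarity. The vector representation is an assumption (stand-ins for TF-IDF vectors), and nothing here captures the multi-party computation or differential-privacy machinery.

```python
import numpy as np
from collections import Counter

def knn_classify(query_vec, doc_vecs, labels, k=5):
    """Classify a document by a majority vote of its k most similar labelled documents,
    where similarity is cosine similarity between already-vectorized documents."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    top = np.argsort(-sims)[:k]
    return Counter(labels[i] for i in top).most_common(1)[0][0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.random(size=(100, 300))              # stand-ins for TF-IDF vectors
    labels = ["sports" if i < 50 else "politics" for i in range(100)]
    print(knn_classify(rng.random(300), docs, labels, k=7))
```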
Optimizing Revenue Over Data-Driven Assortments
Deeksha Sinha; Theja Tulabandula
We revisit the problem of assortment optimization under the multinomial logit (MNL) choice model with general constraints and propose new efficient optimization algorithms. Our algorithms make no assumptions on the structure of the feasible sets and in turn do not require a compact representation of the constraints describing them. For the case of cardinality constraints, we specialize our algorithms and achieve time complexity sub-quadratic in the number of products in the assortment (existing methods are quadratic or worse). Empirical validations using the Billion Prices dataset and several retail transaction datasets show that our algorithms are competitive even when the number of items is ~ 10^5 and beyond (100x larger instances than previously studied), supporting their practicality in data-driven revenue management applications.
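For concreteness, the expected revenue of an assortment S under the MNL model is R(S) = sum_{i in S} r_i v_i / (v_0 + sum_{i in S} v_i), where r_i are prices, v_i the MNL preference weights and v_0 the no-purchase weight. The sketch below only evaluates this objective on a toy instance; the paper's constrained optimization algorithms are not reproduced.

```python
import numpy as np
from itertools import combinations

def mnl_revenue(assortment, prices, weights, v0=1.0):
    """Expected revenue of an assortment under the multinomial logit (MNL) model:
    R(S) = sum_{i in S} r_i * v_i / (v0 + sum_{i in S} v_i)."""
    S = np.asarray(assortment, dtype=int)
    return (prices[S] * weights[S]).sum() / (v0 + weights[S].sum())

if __name__ == "__main__":
    prices = np.array([10.0, 8.0, 6.0, 4.0])         # toy item prices
    weights = np.array([0.5, 1.0, 1.5, 2.0])         # toy MNL preference weights
    # brute-force search over assortments of cardinality <= 2 (tiny example only)
    best = max((c for r in range(1, 3) for c in combinations(range(4), r)),
               key=lambda c: mnl_revenue(c, prices, weights))
    print(best, mnl_revenue(best, prices, weights))
```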
Fast Distance Metric Learning for Multi-Task Large Margin Nearest Neighbor Classification
Adams Yu
Good metric spaces are key to accurate nearest neighbor classification, but the complexity of finding such spaces usually poses great computational challenges. In this paper, we propose an efficient algorithm for multi-task distance metric learning, which can be formulated as a convex optimization problem over multiple positive semidefinite matrices that define the metrics. Existing approaches use all data instances to update all decision matrices in each iteration, which involves multiple eigenvalue decompositions. This leads to a high computational cost when the data size is large or the dimension of the matrix variables is high. To address this issue, we reformulate the original problem as a bilinear saddle-point problem and propose a primal-dual randomized coordinate method. It samples only a subset of data points and matrix variables for the update in each iteration, and hence requires fewer eigenvalue decompositions, while retaining a linear convergence rate.
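As background for the cost discussed above, the sketch below shows the Mahalanobis distance induced by a positive semidefinite matrix M and the eigenvalue-decomposition step commonly used to project an updated matrix back onto the PSD cone, which is the expensive per-iteration operation the proposed method seeks to perform less often. The specific multi-task formulation and the primal-dual updates are not reproduced.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Distance induced by a PSD matrix M: d_M(x, y) = sqrt((x-y)^T M (x-y))."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues.
    This eigenvalue decomposition is the costly per-iteration step."""
    M = (M + M.T) / 2
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0, None)) @ V.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 5))
    M = project_psd(A)                        # a valid metric matrix
    x, y = rng.normal(size=5), rng.normal(size=5)
    print(mahalanobis(x, y, M))
```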