Similarity Search – The Metric Space Approach

Pavel Zezula
Giuseppe Amato


Michal Batko

Series: Advances in Database Systems,
Vol. 32, 2006, XVIII, 220 p. Hardcover

ISBN: 0-387-29146-6



Download table of content

See the publisher’s page



In the Information Society, information holds the master key
to economic influence. Similarity Search-The Metric Space Approach focuses on efficient ways to locate user-relevant information in collections of
objects, the similarity of which is quantified using a pairwise distance
measure. This book is a direct response to recent advances in computing,
communications and storage which have led to the current flood of digital
libraries, data warehouses and the limitless heterogeneity of internet

Similarity Search-The Metric Space Approach
introduces state-of-the-art in developing index structures for searching
complex data modeled as instances of a metric space.

This book consists of two parts:

Part I presents the metric search approach in a nutshell by defining the
problem, describes major theoretical principles, and provides an extensive
survey of specific techniques for a large range of applications.

Part II concentrates on approaches particularly designed for searching in
large collections of data. After describing the most popular centralized
disk-based metric indexes, we present approximation techniques are presented as a way to
significantly speed up search time at the expense of some imprecision in query
results, and finally we discuss parallel and distributed disk-based metric

Similarity Search-The Metric Space Approach

is designed for a professional audience, composed of academic researchers as
well as practitioners in industry. This book is also suitable as introductory
material for graduate-level students in computer science.


I: Metric searching in a Nutshell

Similarity searching has become a fundamental computational task in a variety of
application areas, including multimedia information retrieval, data mining,
pattern recognition, machine learning, computer vision, biomedical databases,
data compression and statistical data analysis. In such environments, an exact
match has little meaning, and proximity/distance concepts
(similarity/dissimilarity) are typically much more fruitful for searching. In
this part of the book we explain how the metric space approach can be
effectively and efficiently used to deal with this problem.

Part I contains the following two chapters:

Chapter 1: Foundation of metric spaces

this chapter we introduce the problem of metric searching and justify its
importance with respect to other approaches. After defining a metric space we
show examples of several distance measures which are used for searching in
diverse data collections. We survey the paradigms used to express queries and we
discuss the basic partitioning principles, which are at the basis of query
execution. The reminder of this chapter is devoted to performance related
issues. Specifically, we discuss techniques aimed at reducing the number of
distance computations, we present useful metric space transformations, and we
explain concepts of approximate similarity search. Finally, we provide a
collection of analytic tools and approaches especially developed for metric
index structures. Go back

Chapter 2: Survey of existing approaches

this chapter, we give an overview of existing indexes for metric spaces. In
the interests of a systematic presentation, we have divided the individual
techniques into four groups. In addition we also present some techniques for
approximate similarity search. Specifically, we first explain techniques which
make use of ball partitioning, then we describe indexing approaches based on
generalized hyperplane partitioning. A significant group of indexing methods
that we also report are those that compute distances to characteristic objects
and then use these results to organize the data. In order to maximize
performance, many approaches combine several of the basic
principles into a single index. In this chapter we also discuss the most
important of these hybrid approaches. Finally, we treat the important topic of
approximate similarity search, which trades some precision in search results for
significant improvements in performance. Go back

Part II:
Metric Searching in Large
Collections of Data

Database scalability is a topic which has been well-explored and
much-debated, but there are still no easy answers. There are several ways to
achieve higher scalability of a database, but which of them to choose depends
greatly upon the unique needs of individual users. Creating a search index
structure which scales to very large dimensions presents many challenges, and
the task is becoming increasingly difficult as the amount of data grows. In
this part of the book, we assume that an index for processing large data stores
and accesses indexed features on a secondary memory, that is on a disk.
Go back

Part II contains the following three chapters:

Chapter 3: Centralized index structures

Most existing search structures have been designed to run on a single computer.
They are built with different
assumptions about type of distance function (discrete vs. continuous), form of
query (range, nearest neighbor, etc.), and temporal properties (static or
dynamic) of the data to be organized. Although many index structures have been
proposed as main memory structures, there are several indexes which organize
data using disk storage to allow the processing of a large volume of data. We focus on two basic approaches which store objects in secondary
memories. Specifically, we discuss tree-based structures and methods which
employ hashing (i.e., the key-to-address transformation) principles.
Go back

Chapter 4: Approximate similarity search

Similarity search in metric spaces is generally expensive and state-of-the art
access methods still do not provide an acceptable response time for highly
interactive applications. Fortunately, in many applications it is sufficient to
perform an approximate similarity search where an inaccurate result-set is
obtained. The attractiveness of this approach is emphasized by the fact that the
approximate search is typically performed much faster. In this chapter, we
present some of the most relevant approximate similarity search strategies.
Finally, the pros and cons of approximate similarity search are treated, and
evidence provided by tests conducted on real datasets is discussed.
Go back

Chapter 5: Parallel and distributed indexes

Centralized metric indexes achieve a significant speedup when compared to the
sequential scan. However, experience with centralized methods reveals that query
execution cost increases linearly with the growth of the dataset. In this
chapter, we present methods which solve this problem by exploiting parallel
computing power. The idea behind is easy in principle: as the dataset grows
in size, more independent computation and storage resources are added (CPUs,
disks, etc.), keeping the query response time low. Go back

eXTReMe Tracker