
Introduction to Information Retrieval for Search Engines

This article aims to provide readers with an overview of the very basics of information retrieval. Understanding these principles can help you optimise your website content for search engines and analyse search engine algorithm changes. Note, however, that this article does not describe how modern search engines work in full, as they use many additional factors, including link analysis.

Information retrieval (IR) is the science of searching for documents and for information within documents. Information retrieval techniques form some of the most fundamental elements of web search engine technology. This article will discuss information retrieval in the context of search engines.

Indexes

It is unrealistic to remotely access documents in real time when performing a search, as it would be exceptionally slow and unreliable. Therefore, a local index is created; for search engines this is done by a crawler (also known as a spider). Thus, when you perform a search you are not actually searching the web, but a version of the web as seen and stored by the crawler at some point in the past.

The index does not usually contain the whole document (which may, however, be stored in a separate document cache); instead it stores a representation of the terms relevant to the document that can be searched quickly and easily. There are several stages to this process (not all systems include all of them):

1. Document

This is the document in its raw format with all text, structure and formatting.

2. Structure Analysis

Recognising headings, paragraphs, titles, bold text, lists and so on.

3. Lexical Analysis

Converting the characters in the document into a list of words. This process may include analysing digits, hyphens, punctuation and the case of letters. Proper Noun Analysis can use the case and format of words/phrases to identify important information such as names, places, dates and organisations.
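A minimal sketch of lexical analysis in Python (the regular expression and its handling of hyphens, apostrophes and digits are illustrative choices, not a description of any particular engine's tokeniser):

```python
import re

def tokenise(text):
    """Split raw text into lowercase word tokens.
    Hyphens and apostrophes inside words are kept, so
    "re-crawled" and "O'Reilly's" survive as single tokens;
    other punctuation acts as a separator."""
    return [t.lower()
            for t in re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)]

tokens = tokenise("The spider re-crawled O'Reilly's site in 2024.")
# ['the', 'spider', 're-crawled', "o'reilly's", 'site', 'in', '2024']
```

A proper-noun analyser would run before lowercasing, using the original case and context of tokens such as "O'Reilly" to spot names and organisations.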

4. Stopwords Removal

The removal of words that occur very often and so provide no ability to discriminate between documents, for example "the", "it" and "is". Some search engines, however, leave these words in the index and remove them at query time instead; this allows "+word" queries to be performed.
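Stopword removal is a simple set-membership filter. The stopword list below is a tiny illustrative one; production systems use much larger lists:

```python
# A tiny illustrative stopword list; real systems use larger ones.
STOPWORDS = {"the", "it", "is", "a", "an", "of", "and", "to", "in"}

def remove_stopwords(tokens):
    """Drop tokens that carry no discriminating power."""
    return [t for t in tokens if t not in STOPWORDS]

remove_stopwords(["the", "crawler", "is", "indexing", "the", "web"])
# ['crawler', 'indexing', 'web']
```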

5. Stemming

This is a conflation procedure which reduces variations of a word into a single root. For example, both "worked" and "working" may be reduced to "work". The Porter Stemming Algorithm can be used to perform stemming.
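As a toy illustration of suffix stripping (far simpler than the real Porter algorithm, which applies ordered rule phases with measure-based conditions):

```python
def crude_stem(word):
    """Strip a common suffix if enough of a root remains.
    A toy sketch only -- the Porter algorithm is considerably
    more careful about which suffixes to remove and when."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

crude_stem("working")  # 'work'
crude_stem("worked")   # 'work'
```

In practice you would use a tested implementation of Porter's rules rather than writing your own.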

After these processes have been performed we have a list of index terms for this particular document.

Index Term Weighting

We now need to calculate to what degree a term is relevant to a particular document. The following is an example of a weighting scheme:

* Index Term Frequency

This is the frequency of a term inside a document. The frequency is usually normalised within the particular document:

TermFrequency(term, document) = (no. of occurrences of term in document) / (no. of occurrences of the most frequent term in document)

* Inverse Document Frequency

The inverse of the frequency of a term across all the documents in the collection. Terms that appear in many documents are not very useful, as they do not allow us to discriminate between documents.

IDF(term) = log([no. of documents in collection] / [no. of documents containing term])

* Weight

This is the actual index term weight for a particular term in a particular document:

Weight(term, document) = TermFrequency(term, document) * IDF(term)

Other factors may also influence the weight, such as the term's position in the document and whether it appeared in the title, in bold text or in a list.
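The weighting scheme above can be sketched end to end as follows. The three-document collection is invented for illustration; a real engine would compute these values over an inverted index of millions of documents:

```python
import math
from collections import Counter

def term_frequency(term, doc):
    """doc is a list of index terms; the count is normalised by
    the count of the most frequent term in the document."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, collection):
    """collection is a list of documents, each a list of terms.
    Assumes the term occurs in at least one document."""
    containing = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / containing)

def weight(term, doc, collection):
    return term_frequency(term, doc) * idf(term, collection)

docs = [
    ["web", "search", "engine", "web"],
    ["web", "crawler", "index"],
    ["information", "retrieval", "index"],
]
# "web" appears in 2 of the 3 documents, so it discriminates
# less well than "search", which appears in only 1:
weight("search", docs[0], docs)  # 0.5 * log(3/1) ~= 0.549
weight("web", docs[0], docs)     # 1.0 * log(3/2) ~= 0.405
```

Note how "web", despite occurring twice in the first document, ends up with a lower weight there than "search": its high document frequency pulls its IDF down.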