Information retrieval is the science of searching for information. All search engine algorithms are based on information retrieval theories and methods. The algorithms themselves are closely guarded secrets, but knowing the fundamentals of information retrieval is the key to unlocking the secrets of search engine optimisation.
Using these fundamentals, I will explain how you can build a site which will be readily understood by the search engines and therefore easily visible to the visitors you want to attract.
Anyone who uses the web knows the frustration of typing a phrase into a search engine only to find the results return pages about a similar – but unrelated – item. The companies running the search engines are also aware of this, and to make their results more accurate they are turning to increasingly sophisticated information retrieval techniques.
Gone are the days when improving your ranking on the search engines was a simple case of repeating a number of keywords and phrases on your pages. In the past, to achieve page one ranking, all you needed to do was to have more keywords on your pages than any of your competitors. Now the search engines regard this as spamming.
Google recommends that web pages should be written to be easily read by humans, as opposed to search engines. (Quality guidelines: www.google.com/support/webmasters)
Information retrieval uses a mathematical approach to determine the weighting of specific phrases in a website. Once the weighting is calculated, it can be compared with other websites to determine which is the most relevant. This is a far more accurate method of ranking web pages because the meaning, or semantics, of the page text is captured, making the system far less open to abuse than simply measuring the density of relevant keywords in the text.
The exact calculation of keyword weighting varies from one search engine to another, but the underlying methods are broadly the same.
The average web page contains more than the text you see. There is also code embedded in the content which tells your visitors' browsers how to display the page. The first thing a search engine does when it reads a page on your site is remove this code, a process called linearization.
How linearization affects you: The more code you have on each web page, the more difficult it is for the search engine to perform linearization with any meaningful result. For example, if your page displays tables defined in HTML, that is, with the definition embedded in the page content, the search engine will remove the code defining the table and read just the text. Some search engines may read the text column by column, others row by row; in other words, the search engine may not read the text in your table the way you intended. The best way around this is to use cascading style sheets, which allow you to put much of the formatting information in a separate document, leaving just a handful of codes in your content. This, in turn, makes the meaning of each page clearer to the search engines.
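To make this concrete, here is a minimal sketch of linearization using Python's standard html.parser module: the markup is discarded and only the visible text survives, in source order. The page snippet and the Linearizer class are invented for the example; real crawlers are considerably more sophisticated.

    from html.parser import HTMLParser

    class Linearizer(HTMLParser):
        # Collects the visible text of a page, discarding all markup.
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            text = data.strip()
            if text:
                self.chunks.append(text)

    # A two-row table laid out in HTML: the engine keeps only the cell
    # text, read in source order (here, row by row), whatever the page
    # looked like on screen.
    page = ("<table><tr><td>torches</td><td>lamps</td></tr>"
            "<tr><td>for cars</td><td>for vans</td></tr></table>")
    parser = Linearizer()
    parser.feed(page)
    print(" ".join(parser.chunks))  # -> torches lamps for cars for vans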
Once the search engine has performed linearization, the next step it takes is to remove stop words from the text. These are words which appear frequently but carry little meaning on their own: conjunctions, pronouns and prepositions, words like "if", "but", "and" or "to".
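As a toy illustration, stop-word removal might look like the following; the stop list here is a tiny hand-picked sample, where real engines use much larger, language-specific lists.

    # A hand-picked stop list for illustration only.
    STOP_WORDS = {"if", "but", "and", "or", "to", "the", "a", "of"}

    def remove_stop_words(text):
        # Keep only the words that are not on the stop list.
        return [word for word in text.lower().split() if word not in STOP_WORDS]

    print(remove_stop_words("Torches and lamps to fit most cars"))
    # -> ['torches', 'lamps', 'fit', 'most', 'cars']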
How the removal of stop words affects you: If you are using keyword density techniques to optimise your site, that is, if you have typed in your key phrase again and again, there is a very real danger that the search engine will see it as a stop word, too, and remove it. This would mean that when your target audience searched for those phrases or words on the web, your site would be invisible to them.
Next, the search engine aims to establish the context of the subjects within the page. Every sentence has a subject and a predicate. The subject usually comes at the beginning of the sentence and refers to what the sentence is about. The predicate is the rest of the sentence, which gives information about the subject; it usually contains a verb and an object, the object normally being the thing affected by the verb.
Local context analysis attempts to determine the subject, verb and object of each sentence in each paragraph or page. For each subject found, it collects all the associated objects and builds a two-tier hierarchy. Then, for each object, it collects all the associated verbs, synonyms and other words to build the third tier of information. This three-tier hierarchy of information is known as a 'lexicographical tree'.
For any given subject, the richer your description, the bigger the tree will be. Local context analysis will result in a relevancy score for each subject and object based on the size of the tree. This relevancy score will be used later to determine the overall weighting of your web page.
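As a sketch only, assuming the (subject, verb, object) triples have already been extracted from the sentences, which in practice needs a full grammatical parser, building and scoring such a tree might look like this. The triples and the scoring rule are invented for the example.

    from collections import defaultdict

    # Hypothetical triples extracted from a page about torches.
    triples = [
        ("torch", "fits", "car"),
        ("torch", "lights", "road"),
        ("torch", "illuminates", "road"),
    ]

    # Tier one: subjects; tier two: their objects; tier three: the verbs
    # (and, in a real system, synonyms and other associated words).
    tree = defaultdict(lambda: defaultdict(set))
    for subject, verb, obj in triples:
        tree[subject][obj].add(verb)

    def relevancy(subject):
        # A naive score: the bigger the tree under a subject, the higher it scores.
        return sum(1 + len(verbs) for verbs in tree[subject].values())

    print({obj: verbs for obj, verbs in tree["torch"].items()})
    # -> {'car': {'fits'}, 'road': {'lights', 'illuminates'}}
    print(relevancy("torch"))  # -> 5

A page that says more about its subject grows a bigger tree and, under a scheme like this, earns a higher relevancy score.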
How building the lexicographical tree affects you: If the sentences in your text are not properly constructed, the search engine may not pick up on the interrelationship of the nouns and verbs in your text, and this may cause it to catalogue your site incorrectly. This is why Google advises good grammar and encourages readable text: the search engine picks out subjects, objects and verbs from each sentence.
Having established the relevancy of the subjects and objects in each web page, the next step is to look for other pages on the website that appear similar or have equally high relevancy scores for the same keywords. Part of this process investigates the use of synonyms, testing how often the same or semantically similar words are used. This builds an index of semantics, as the use of synonyms helps give the subject or object more meaning.
This can result in a higher ranking for a page that does not even contain an exact match for the search keywords in question, because the latent semantics discovered on the page may be more relevant.
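As a hedged illustration of the idea, not any engine's actual algorithm, here is a toy latent semantic indexing sketch built on the singular value decomposition from NumPy. The terms, counts and choice of two latent dimensions are all invented for the example.

    import numpy as np

    # Toy term-document counts: doc 0 and doc 1 use different words
    # ("torch" versus its synonym "flashlight") for the same subject.
    #                   doc0  doc1  doc2
    counts = np.array([[2,    0,    0],    # torch
                       [0,    2,    0],    # flashlight
                       [1,    1,    0],    # car
                       [0,    0,    3]],   # cake
                      dtype=float)

    # Keep only the two strongest latent dimensions.
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    docs = (np.diag(s[:2]) @ Vt[:2]).T   # document coordinates in "semantic" space

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Docs 0 and 1 share no keyword at all, yet land almost on top of
    # each other in the reduced space, while the "cake" page stays far away.
    print(round(float(cosine(docs[0], docs[1])), 2))  # roughly 1.0
    print(round(float(cosine(docs[0], docs[2])), 2))  # roughly 0.0

Pages about torches and flashlights end up as neighbours because they keep company with the same surrounding words, which is exactly the effect described above.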
How Latent Semantic Indexing affects you: The better your description of the subjects and objects within each sentence, the more accurately the search engine can pin down what your site is about and hence determine the most relevant page. This will vastly increase the number of appropriate hits you will receive from the visitors you aim to attract.
However, while varying your vocabulary is good, beware of using words in an unusual context, even if it is grammatically correct to do so, as it may skew your results. It is often useful to relate the object being written about to senses or visual images – especially in direct copywriting – but pick your words carefully.
For example, a page describing a children’s ABC poster recently stated that ordering through the company’s online shop was “a piece of cake”. Shortly afterwards the site statistics started showing visitors who had been searching for information about cake decorating. An alternative way of putting it, without losing the informal tone or diluting the relevancy of the page, might have been “ordering is as easy as ABC”.
Having carried out Local Context Analysis and Latent Semantic Indexing, the search engine uses a mathematical algorithm on the results, called Term Vector Analysis, to give each page a score for its total relevance, or weighting, to the search query. Term vector analysis is a mathematical method of determining the relevancy of multiple terms or keywords. Each keyword is assigned its own axis on a graph, and the relevancy score for each keyword for a given page is marked along it. The result is a vector with an angle and a magnitude, which gives a method of comparing different pages for given combinations of keywords. The vector with the largest magnitude and the closest angle to the search query will have the largest weighting.
See below for an example of the Term Vector Analysis Graph for the search terms "torches for cars":
Compare the weighting for 'SITE A', with a relevancy score of 0.7 for torches and 0.2 for cars, with 'SITE B', which scores 0.6 for torches and 0.5 for cars. The graph clearly shows that 'SITE B' is the closer match to the ideal weighting, and therefore 'SITE B' will be ranked higher by the search engines.
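Working those same numbers through in a short sketch, and assuming the ideal query vector for "torches for cars" is (1, 1), that is, full relevance for both keywords:

    import math

    def weighting(torches, cars, query=(1.0, 1.0)):
        # Magnitude of the page's term vector, and its angle to the query vector.
        magnitude = math.hypot(torches, cars)
        dot = torches * query[0] + cars * query[1]
        angle = math.degrees(math.acos(dot / (magnitude * math.hypot(*query))))
        return magnitude, angle

    for name, scores in {"SITE A": (0.7, 0.2), "SITE B": (0.6, 0.5)}.items():
        magnitude, angle = weighting(*scores)
        print(f"{name}: magnitude {magnitude:.2f}, {angle:.1f} degrees off the query")
    # SITE A: magnitude 0.73, 29.1 degrees off the query
    # SITE B: magnitude 0.78, 5.2 degrees off the query

'SITE B' wins on both counts: a larger magnitude and a smaller angle to the query.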
How Term Vector Analysis affects you: You need to build a high relevancy score for your important keywords. The calculated weighting for combinations of those keywords on your pages will consequently be higher, giving your web pages a greater chance of being found on the search engines.
Now you know how the search engines work. Here are seven steps you can take to make sure your pages are search engine-friendly.
In summary, for your customers to find you, you need to keep in mind how the search engines work, not only when you conceive the design and page structure of your site but also when you write the content. If you don't do this, all the time and effort you expend on making a website may well be for nothing.
However, the good news is that it doesn't cost anything to optimise your pages yourself, and even by trying these simple steps you can achieve a marked increase in your site's visibility on the search engines.