Learn how keyword-based ranking techniques have evolved – from the ‘little old ladies’ to the vector space model, to today.
The search engine would match keywords in a query against keywords in its index, which in turn corresponded to the keywords that appeared on a webpage.
Pages with the highest relevancy scores would then be ranked using one of the three most popular retrieval techniques:
- Boolean Model
- Probabilistic Model
- Vector Space Model
The vector space model became the most relevant for search engines.
In this article, I’m going to revisit the basic, somewhat simplified explanation of the classic model I used back in the day (because it is still relevant in the search engine mix).
Along the way, we’ll dispel a myth or two – such as the notion of “keyword density” of a webpage. Let’s put that one to bed once and for all.
The keyword: One of the most commonly used words in information science; to marketers – a shrouded mystery
“What’s a keyword?”
You have no idea how many times I heard that question when the SEO industry was emerging. And after I’d given a nutshell explanation, the follow-up question would be: “So, what are my keywords, Mike?”
Honestly, it was quite difficult to explain to marketers that the specific keywords used in a query were what triggered the corresponding webpages in search engine results.
And yes, that would almost certainly raise another question: “What’s a query, Mike?”
Today, terms like keyword, query, index, ranking and all the rest are commonplace in the digital marketing lexicon.
However, as an SEO, I believe it’s eminently useful to understand where they’re drawn from and why and how those terms still apply as much now as they did back in the day.
The science of information retrieval (IR) is a subset under the umbrella term “artificial intelligence.” But IR itself also comprises several subsets, including library and information science.
And that’s our starting point for this second part of my wander down SEO memory lane. (My first, in case you missed it, was: We’ve crawled the web for 32 years: What’s changed?)
This ongoing series of articles is based on what I wrote in a book about SEO 20 years ago, making observations about the state-of-the-art over the years and comparing it to where we are today.
The little old lady in the library
So, having highlighted that there are elements of library science under the Information Retrieval banner, let me relate where they fit into web search.
Stereotypically, librarians are often pictured as little old ladies. It certainly appeared that way when I interviewed several leading scientists in the emerging new field of “web” information retrieval (IR) all those years ago.
Brian Pinkerton, inventor of WebCrawler; Andrei Broder, vice president of technology and chief scientist at Alta Vista, the number one search engine before Google; and Craig Silverstein, director of technology at Google (and, notably, Google employee number one) all described their work in this new field as trying to get a search engine to emulate “the little old lady in the library.”
Libraries are based on the concept of the index card – the original purpose of which was to attempt to organize and classify every known animal, plant, and mineral in the world.
Index cards formed the backbone of the entire library system, indexing vast and varied amounts of information.
Apart from the name of the author, the title of the book, the subject matter and notable “index terms” (a.k.a., keywords), the index card would also record the location of the book. And so, after a while, when you asked “the little old lady librarian” about a particular book, she would intuitively be able to point not just to the section of the library, but probably even to the shelf the book was on, providing a personalized rapid retrieval method.
However, when I explained, all those years back, how search engines use a similar type of indexing system, I had to add a caveat that’s still important to grasp:
“The largest search engines are index based in a similar manner to that of a library. Having stored a large fraction of the web in massive indices, they then need to quickly return relevant documents against a given keyword or phrase. But the variation of web pages, in terms of composition, quality, and content, is even greater than the scale of the raw data itself. The web as a whole has no unifying structure, with an enormous variant in the style of authoring and content far wider and more complex than in traditional collections of text documents. This makes it almost impossible for a search engine to apply strictly conventional techniques used in libraries, database management systems, and information retrieval.”
Inevitably, what then occurred with keywords and the way we write for the web was the emergence of a new field of communication.
As I explained in the book, HTML could be viewed as a new linguistic genre and should be treated as such in future linguistic studies. There’s much more to a hypertext document than there is to a “flat text” document. And that gives a stronger indication of what a particular web page is about, both when it is being read by humans and when its text is being analyzed, classified, and categorized through text mining and information extraction by search engines.
Sometimes I still hear SEOs referring to search engines “machine reading” web pages, but that term belongs much more to the relatively recent introduction of “structured data” systems.
As I frequently still have to explain, a human reading a web page while search engines text mine and extract information “about” that page is not the same thing as a human reading a web page while search engines are “fed” structured data.
The best tangible example I’ve found is to make a comparison between a modern HTML web page with inserted “machine readable” structured data and a modern passport. Take a look at the picture page on your passport and you’ll see one main section with your picture and text for humans to read and a separate section at the bottom of the page, which is created specifically for machine reading by swiping or scanning.
Essentially, a modern web page is structured rather like a modern passport. Interestingly, 20 years ago I referenced the man/machine combination with this little factoid:
“In 1747 the French physician and philosopher Julien Offroy de la Mettrie published one of the most seminal works in the history of ideas. He entitled it L’HOMME MACHINE, which is best translated as “man, a machine.” Often, you will hear the phrase ‘of men and machines’ and this is the root idea of artificial intelligence.”
I emphasized the importance of structured data in my previous article, and I hope to write something for you that will be hugely helpful in understanding the balance between human reading and machine reading. I simplified it this way back in 2002 to provide a basic rationalization:
- Data: a representation of facts or ideas in a formalized manner, capable of being communicated or manipulated by some process.
- Information: the meaning that a human assigns to data by means of the known conventions used in its representation.
Therefore:
- Data is related to facts and machines.
- Information is related to meaning and humans.
Let’s talk about the characteristics of text for a minute and then I’ll cover how text can be represented as data in something “somewhat misunderstood” (shall we say) in the SEO industry called the vector space model.
The most important keywords in a search engine index vs. the most popular words
Ever heard of Zipf’s Law?
Named after Harvard linguistics professor George Kingsley Zipf, it describes the phenomenon that, as we write, we use familiar words with high frequency.
Zipf said his law is based on the main predictor of human behavior: striving to minimize effort. Therefore, Zipf’s law applies to almost any field involving human production.
This means there is also an inverse relationship between rank and frequency in natural language: roughly, the second most frequent word occurs half as often as the first, the third a third as often, and so on.
Most large collections of text documents have similar statistical characteristics. Knowing about these statistics is helpful because they influence the effectiveness and efficiency of data structures used to index documents. Many retrieval models rely on them.
There are patterns of occurrences in the way we write – we generally look for the easiest, shortest, least involved, quickest method possible. So, the truth is, we just use the same simple words over and over.
As an example, all those years back, I came across some statistics from an experiment where scientists took a 131MB collection (that was big data back then) of 46,500 newspaper articles (19 million term occurrences).
Here is the data for the top 10 words and how many times they were used within this corpus. You’ll get the point pretty quickly, I think:
- the: 1,130,021
- of: 547,311
- to: 516,635
- a: 464,736
- in: 390,819
- and: 387,703
- that: 204,351
- for: 199,340
- is: 152,483
- said: 148,302
Remember, all the articles included in the corpus were written by professional journalists. But if you look at the top ten most frequently used words, you could hardly make a single sensible sentence out of them.
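Using the frequencies listed above, we can run a quick sanity check of Zipf’s law: frequency is roughly proportional to 1/rank, so rank times frequency should stay in the same general ballpark (in practice the fit is loose, especially beyond the first few ranks). A minimal Python sketch:

```python
# Word frequencies from the 46,500-article newspaper corpus cited above,
# listed in rank order (the, of, to, a, in, and, that, for, is, said)
freqs = [1130021, 547311, 516635, 464736, 390819,
         387703, 204351, 199340, 152483, 148302]

# Zipf's law predicts rank * frequency is roughly constant
products = [rank * f for rank, f in enumerate(freqs, start=1)]
```

Comparing the products for ranks 1 and 2 (about 1.13 million and 1.09 million) shows how closely the top two words track the prediction, while lower ranks drift further away.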
Because these common words occur so frequently in the English language, search engines will ignore them as “stop words.” If the most popular words we use don’t provide much value to an automated indexing system, which words do?
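Stop-word filtering is simple to sketch. Here the stop list is just the corpus’s most frequent function words from the table above; real engines used much larger lists:

```python
# A tiny stop-word list drawn from the top words in the corpus above
STOP_WORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, and drop any stop words
    return [t for t in text.lower().split() if t not in STOP_WORDS]

tokens = remove_stop_words("The hemoglobin in the blood carries oxygen")
# tokens -> ["hemoglobin", "blood", "carries", "oxygen"]
```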
As already noted, there has been much work in the field of information retrieval (IR) systems. Statistical approaches have been widely applied because of the poor fit of text to data models based on formal logics (e.g., relational databases).
So rather than requiring users to anticipate the exact words and combinations of words that may appear in documents of interest, statistical IR lets users simply enter a string of words that are likely to appear in a document.
The system then takes into account the frequency of these words in a collection of text, and in individual documents, to determine which words are likely to be the best clues of relevance. A score is computed for each document based on the words it contains and the highest scoring documents are retrieved.
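The scoring process just described can be sketched with simple tf-idf weighting (term frequency times inverse document frequency). The documents, query, and function name below are hypothetical, and real engines used far more refined weighting and normalization:

```python
import math
from collections import Counter

# Hypothetical three-document corpus for illustration
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "hemoglobin carries oxygen in the blood",
}

def rank_documents(query, docs):
    n = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    # Document frequency: in how many documents each term appears
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)  # term frequency within this document
        scores[d] = sum(
            tf[t] * math.log(n / df[t])  # tf * idf
            for t in query.split()
            if t in df
        )
    # Highest-scoring documents first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_documents("hemoglobin blood", docs)
# Doc 3 scores highest; a word like "the" contributes nothing,
# since it appears in every document and so has an idf of zero
```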
I was fortunate enough to interview a leading researcher in the field of IR while researching the book back in 2001. At that time, Andrei Broder was chief scientist with Alta Vista (he is currently a distinguished engineer at Google), and we were discussing the topic of “term vectors” when I asked if he could give me a simple explanation of what they are.
He explained how, when “weighting” terms for importance in the index, he might note the occurrence of the word “of” millions of times in the corpus. That is a word which is going to get no “weight” at all, he said. But if he sees something like the word “hemoglobin”, a much rarer word in the corpus, then that one will get some weight.
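Broder’s point can be illustrated with the classic inverse document frequency (idf) calculation. The corpus statistics below are made up purely for illustration:

```python
import math

# Hypothetical corpus statistics for illustration
n_docs = 1_000_000             # documents in the corpus
doc_freq = {"of": 1_000_000,   # "of" appears in every document
            "hemoglobin": 50}  # a rare, topical term

# idf = log(N / document frequency)
weights = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
# "of" -> log(1) = 0.0 (no weight at all)
# "hemoglobin" -> log(20000), roughly 9.9 (significant weight)
```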
I want to take a quick step back here before I explain how the index is created, and dispel another myth that has lingered over the years. And that’s the one where many people believe that Google (and other search engines) are actually downloading your web pages and storing them on a hard drive.
Nope, not at all. We already have a place to do that, it’s called the world wide web.
Yes, Google maintains a “cached” snapshot of the page for rapid retrieval. But when that page content changes, the next time the page is crawled the cached version changes as well.
That’s why you can never find copies of your old web pages at Google. For that, your only real resource is the Internet Archive (a.k.a., The Wayback Machine).
In fact, when your page is crawled it’s basically dismantled. The text is parsed (extracted) from the document.
Each document is given its own identifier, along with details of its location (URL), and the “raw data” is forwarded to the indexer module. The words/terms are saved with the ID of the document in which they appeared.
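That indexer step can be sketched as a minimal inverted index, mapping each term to the set of document IDs containing it. The documents and IDs below are hypothetical:

```python
# Hypothetical documents; the integer keys stand in for document IDs
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
}

# Build an inverted index: term -> set of document IDs
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(doc_id)
# "brown" -> {1, 2}; "fox" -> {1}
```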
Here’s a very simple example, which I created 20 years ago, using two docs and the text they contain.