Jinfo: Powering up with predictive search

Jinfo Blog

3rd October 2011

Abstract

One of the most difficult aspects of searching for information on an unfamiliar subject is knowing the appropriate vocabulary to use. Each sector and industry has its own terminology, which can create a search barrier for the uninitiated. Tasha Bergson-Michelson shows how to overcome this barrier through predictive search, which involves visualising what the answer might look like.

One of the most crucial skills in searching for electronic information, whether in free or proprietary sources, is simply to stop and think. I was no more familiar with the fishing industry than the client, but I have developed an almost automatic habit, when beginning a search, of visualising the source I expect to house my answer. I “read” this image to anticipate the vocabulary my ideal source might use (or to determine when my vocabulary suffers a deficit). I also envision the likely author/publisher/poster of the information, and the format or medium in which I expect it to appear. I then use these clues to determine which operators and filters are available to help me match the picture in my head with a source on the web. The process of discerning what your answer will look like in order to strategise how to find it is called “predictive search”.

First, I tried a series of searches using common words to see what I could find. Since no answer presented itself, I paid careful attention to terms that appeared in my results – particularly words that I did not already know – and discovered that the term for the amount of fish caught and brought to shore was “landings”. “Landings” has a technical sound to it, and indeed, using it as a keyword retrieves primarily documents that are technical or commercial in nature.

Beyond developing a sense for language, however, predictive search is about drawing on what one knows about how and where different types of information are commonly communicated.

Say, for example, that I want to know how much the Kenyan government collects in taxes on royalties. A simple search for [taxes on royalties kenya] brings back little of immediate value.

So instead, I take a moment to consider what the ideal source might look like.

In my mind, this source has several characteristics, and search engines like Google offer a number of solutions for defining them in my query:

It has the words “taxes” and “royalties” on the page, but not necessarily the word “Kenya”.
It is a table, executed as:
- a spreadsheet – probably Excel or
- a table in an academic document or an organisational report – with a caption that reads either “Table n” or “Figure n”.
Some of the columns probably have years as headers.

Any or all of these instincts could prove untrue, but since a simple search did not work, they allow me to develop a strategy. I decide to try for a government source without restricting myself to one ministry or department, using the string [site:go.ke] to limit results to sites from the Kenyan government; to try for an Excel spreadsheet with [filetype:xls]; and to look for a document that has data on a specified year or years within the past five years with [2006..2011].
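The strategy above amounts to joining plain keywords with operator strings. A minimal sketch of that assembly in Python follows; the function name `build_query` and its parameters are hypothetical, chosen only for illustration, and the sketch assumes the search engine still honours the `site:`, `filetype:` and number range operators as described here.

```python
def build_query(terms, site=None, filetype=None, number_range=None):
    """Join plain keywords with optional operator strings into one query.

    Hypothetical helper: it only assembles text, it does not run a search.
    """
    parts = list(terms)
    if site:
        # Limit results to a domain, e.g. go.ke for Kenyan government sites.
        parts.append(f"site:{site}")
    if filetype:
        # Limit results to a file format, e.g. xls for Excel spreadsheets.
        parts.append(f"filetype:{filetype}")
    if number_range:
        # Two dots between two numbers ask for any number in that span.
        low, high = number_range
        parts.append(f"{low}..{high}")
    return " ".join(parts)


query = build_query(
    ["taxes", "on", "royalties"],
    site="go.ke",
    filetype="xls",
    number_range=(2006, 2011),
)
print(query)  # taxes on royalties site:go.ke filetype:xls 2006..2011
```

Keeping each prediction (publisher, format, years) as a separate argument makes it easy to drop one when an instinct proves untrue and try again.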

My new search, [taxes on royalties site:go.ke filetype:xls 2006..2011], found precisely what I wanted.

I did not use image searching this time – I thought this table was unlikely to be in someone’s document as an image – but for an increasing number of searchers it is the first step towards finding statistics that are likely to reside in a chart, table or graph. Consider a search like [world population growth] or [european union aquaculture production figure] to see why.

When I do use images, I need to consider my search terms carefully. Since popular image search tools are text-based, they match the terms on the page where the image was found rather than any content appearing in the image itself. I therefore need to predict what terms are likely to appear near the image I imagine. For example, “figure” and “table” frequently appear in captions of formal or scholarly writing, whereas “pie chart” does not, even when the author has used a pie chart to communicate specific data. In fact, just as I can use the term “landings” to filter for commercial information because it is a convention of the industry, I would generally be suspicious of any page whose text referred to a pie chart, since that language runs counter to conventions of formal document presentation. On the other hand, infographics are so popular right now that people tend to refer to them directly by that term: in the text of the page (“as part of our study of wedding insurance statistics … we produced the following infographic”), in page titles (“20 Most Expensive AdWords on Google (Infographic)”), or in tags. That makes “infographic” an excellent potential search term.

As I navigate the web, I try to pay attention to how people are using language, but also to the formats in which different kinds of information tend to be communicated and what is searchable about those various formats.

For example, I often use Google’s number range operator, and periodically look to see if other people are using it in ways I’ve not yet considered. (For those unfamiliar with this operator, placing two dots between two numbers, e.g., 2006..2011, causes Google to find pages that contain any number between and including the ones specified.) In this case, I wanted to find any deeper conversations in which users were sharing, interacting and developing their ideas together on the subject. When I thought about this problem, I first visualised the likely format of the page that had the answer I wanted.

My experience had trained me that conversations like the ones I hoped to find often have a distinctive format: a column of nested response boxes, which is typical of both forums and blogs. Having identified this kind of discussion area of a site as my target, I started by searching [google operator “number range” OR “two dots”] and clicking on the Discussions corpus in the top half of Google’s left-hand panel. This move allowed me to view results from discussion groups/forums.

Next, I wanted to find “blog-formatted” sources, which may or may not actually be blogs. Everything from traditional blogs to recipe sites to newspapers now uses blogging platforms to deliver content, so a blog search tool might not deliver all the results I desire. However, one thing I know about blog-formatted sources is that they commonly show the number of comments the post has received. Since I want posts with lots of discussion, it makes particular sense to identify blogs by searching for the term “comments”: [google “number range” OR “two dots” “comments 5..” OR “5.. comments”].
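The comment-count trick can be sketched the same way. In this illustrative Python snippet the helper names (`any_of`, `discussion_query`) are hypothetical, and the open-ended range `5..` is assumed to mean “five or more”, as the query above implies.

```python
def any_of(phrases):
    """Quote each phrase and join the alternatives with OR."""
    return " OR ".join(f'"{phrase}"' for phrase in phrases)


def discussion_query(topic, phrases, min_comments):
    """Build a query aimed at blog-formatted pages with many comments.

    Hypothetical helper: it covers both word orders a page might use,
    "comments 5.." and "5.. comments".
    """
    comment_variants = [
        f"comments {min_comments}..",
        f"{min_comments}.. comments",
    ]
    return f"{topic} {any_of(phrases)} {any_of(comment_variants)}"


query = discussion_query("google", ["number range", "two dots"], 5)
print(query)
# google "number range" OR "two dots" "comments 5.." OR "5.. comments"
```

Raising `min_comments` is a quick way to narrow the results to posts with longer discussions.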

Finally, I sometimes need to work around the vagaries of language. We have all dealt with the fact that many electronic retrieval systems cannot “see” symbols, so we search for data that comes with a percent sign (e.g., 20%) using strings like [percent OR percentage OR study OR survey OR report]. Another example arises when looking for discussions of forecasts or growth.

Luckily, our language has some “boilerplate” ways of reporting on these ideas. For example, when Vice President Biden recently visited Mongolia, reportedly interested in its coal reserves, a request arose for projections regarding the future of the Mongolian coal industry. Since projections can also be forecasts, estimates, or any number of other synonyms, it is messy to look directly for that concept. However, the structure of the sentence that tends to refer to such a projection looks something like: “In terms of coal production, we forecast an annual average growth rate of 17.0%, reaching 27.0mntpa (million tonnes per annum) by 2015” (Mongolia Mining Report Q3 2011). This kind of convention is the searcher’s best friend because, while it might be messy to search for all those synonyms, it is a relatively simple thing to search for [mongolia coal production “by 2015..2050” OR “by the year 2015..2050”]. I think of these searches as leveraging associated search terms rather than the keywords themselves.
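As a rough illustration, the “by <year>” boilerplate can be generated from a list of lead-in phrases and a year span. The function name `projection_query` and the default phrase list are assumptions made for this sketch; whether a number range works inside a quoted phrase depends on the search engine.

```python
def projection_query(topic, first_year, last_year, lead_ins=("by", "by the year")):
    """Pair each boilerplate lead-in with a year range instead of
    listing synonyms for "forecast" (hypothetical helper)."""
    year_span = f"{first_year}..{last_year}"
    phrases = [f'"{lead_in} {year_span}"' for lead_in in lead_ins]
    return f"{topic} " + " OR ".join(phrases)


query = projection_query("mongolia coal production", 2015, 2050)
print(query)
# mongolia coal production "by 2015..2050" OR "by the year 2015..2050"
```

Other conventional lead-ins could be passed through `lead_ins` if the first attempt returns too little.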

Visualisation is the secret weapon of an experienced searcher, cutting through the thicket of typical keyword matches to find the precise information you need to make evidence-based business decisions.

For more examples of how to use predictive search, and for information on the operators used here, visit Google’s Search Education team website and Google’s Help Centre.
