Software Requirements

  3. The suggestions should appear on a ‘no hits page’ as well as on a search results page where records were successfully retrieved but other factors indicate that an alternate suggestion is warranted. See the Factors for Identifying Suggestions section below for more details on relevant factors.
  2. The suggestions should be based on terms indexed in the Evergreen database, not just on terms listed in a standard dictionary.
  3. The “Did you mean” suggestion should never lead to empty results. If a filter was applied to the original search or if the search was scoped to a specific organizational unit, the suggestion should only be offered if at least one record would be retrieved with those limits and scoping applied.
  4. Suggestions should be pulled from the search class that the user is searching (e.g., if the user is searching by subject, suggestions should be pulled from indexed terms in the subject class; if the user is searching by author, suggestions should be pulled from indexed terms in the author class).
    1. With a general keyword search, users are likely to enter search terms from multiple search classes that are not near each other in the keyword blob. The Did You Mean? feature should provide a meaningful suggestion for searches that cross search classes in this way.
  5. The catalog should display up to three suggestions for a search query.
  6. The catalog should display the number of hits expected with each suggestion to give users further guidance on whether the suggestion might be useful to them.
  7. A local list should be available to allow sites to add (or remove) Did You Mean? suggestions that are missed (or are inappropriately picked up) by the algorithm.
  8. Variant terms from stemming, synonym list, etc.
    1. The suggested term(s) should be the exact terms used in the record, not a stemmed variation of the term or a variation from a synonym list. For example, if the user enters the search terms angelo’s ash in the search box, the Did You Mean? suggestion should not be angela’s ash, even though, through stemming, this search would successfully retrieve a set of results that includes Angela’s Ashes. The suggestion should be for Angela’s Ashes. Note also the inclusion of an apostrophe in the suggestion.
    2. If multiple suggested terms are offered, they should lead to a different set of results. For example, a search for home cookd could lead to suggestions for home cook or home cooked. However, since both suggestions would lead to the same set of results on a system that employs stemming, only one of these suggestions should be offered.
  9. The catalog should not offer a suggestion that is for the exact same search terms that were entered as part of the original query.
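Requirement 8b above can be sketched in a few lines. This is illustrative Python, not Evergreen code: the crude_stem helper is a hypothetical stand-in for whatever stemming dictionary the search index actually uses (e.g., a PostgreSQL Snowball dictionary).

```python
def crude_stem(word):
    # Hypothetical toy stemmer for illustration only; a real
    # implementation must use the same stemmer the search index uses.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def dedupe_by_stem(suggestions):
    """Offer only one suggestion per stemmed form, since suggestions
    that stem identically retrieve the same result set (req. 8b)."""
    seen, kept = set(), []
    for suggestion in suggestions:
        key = tuple(crude_stem(w) for w in suggestion.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(suggestion)
    return kept

# 'home cook' and 'home cooked' stem to the same form, so only the
# first survives.
print(dedupe_by_stem(["home cook", "home cooked"]))  # → ['home cook']
```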

Factors for Identifying Suggestions

There are several factors that can be used to determine a) whether a suggestion should be provided for a specific search query, b) whether the system should provide one, two, or three suggestions, and c) which suggestions should be provided and in what order.

The system might consider the following factors to determine whether a suggestion should be offered at all and which suggestions should be offered.

MassLNC/ECDI has supplied ranked examples of the behavior they would like to see for a “Did You Mean?” search to provide assistance to the developer in determining which factors should be used. MassLNC/ECDI is also willing to create more ranked examples if it is useful to the developers.

Whenever possible, developers should provide the ability for Evergreen sites to configure at what threshold a “Did You Mean?” suggestion should be offered. This threshold may differ depending on whether the original query was a one-word query or a multi-word query. The threshold may also differ for searches that lead to a no-hits page and ones that successfully retrieved results.
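A sketch of what such threshold configuration might look like, in Python. All names and numbers here are invented for illustration; in practice these values would presumably live in library settings rather than code.

```python
# Hypothetical thresholds: the maximum hit count at or below which a
# "Did You Mean?" suggestion is offered. A hit count of 0 always maps
# to the no-hits row, so a suggestion is always attempted there.
SUGGESTION_THRESHOLDS = {
    ("one_word", "no_hits"): 0,
    ("one_word", "results"): 50,
    ("multi_word", "no_hits"): 0,
    ("multi_word", "results"): 200,
}

def should_offer_suggestion(term_count, hit_count):
    # One-word and multi-word queries, and no-hits vs. results pages,
    # can each carry their own threshold, as described above.
    query_kind = "one_word" if term_count == 1 else "multi_word"
    page_kind = "no_hits" if hit_count == 0 else "results"
    return hit_count <= SUGGESTION_THRESHOLDS[(query_kind, page_kind)]
```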

Recommendations from the developer should be based on whether a given approach will support the expected behavior illustrated in the search examples, the ease with which the feature can be added to PostgreSQL full-text search, the potential complexity of developing the feature, and the impact it will have on search performance.

Potential Factors:

  1. String proximity – if the search string is close to a string in the bibliographic record, the suggestion might receive more weight.
    1. These suggestions should be based on the entire search string, not on individual words that make up the string.
  2. Keyboard distance – Close string matches where the differing characters are close to each other on a keyboard are more relevant than ones that are further apart on the keyboard. Preferably, Evergreen admins should have the ability to configure the keyboard layout that is used, with QWERTY keyboard support being a minimum requirement.
  3. Phonetic matching – If the entered search string sounds like a string in the bibliographic record, the suggestion might receive more weight.
  4. First letter / sound – If the first letter or sound of the words in the entered search query differ from the first letter or sound of the words in the bibliographic record, the suggestion might receive less weight.
  5. Number of results – If the number of results retrieved with the suggested search is high in relation to the number retrieved with the original search terms, the suggestion may receive more weight.
  6. Popularity – If the number of popularity badges attached to records in the suggested search’s result set is high in relation to the number in the original search’s result set, the suggestion might receive more weight.
  7. Presence of search term in standard dictionary – For one-word searches, where we might see an overwhelming number of potential suggestions based on string proximity, searches where the entered search term is misspelled might require a lower threshold for offering a suggestion.
  8. Local list – As mentioned in requirement 7 above, a local list might determine if a suggestion should be offered, regardless of other factors in this list, or if a suggestion the system might otherwise pick should not be offered.
  9. Authority cross-references – the system should also have the ability to generate suggestions based on See from tracings available in MARC authority records for unauthorized entries:
    1. To make the suggestions as meaningful as possible, these suggestions should not be generated from related, broader, or narrower terms.
    2. The system should only offer a suggestion if it changes a word(s) in the search. See from tracings will sometimes direct users to the same search terms that were entered, but in a different order, to get the user to the correct location in an alphabetical list. Since word order doesn’t matter in keyword searching, users shouldn’t receive a did you mean? suggestion for a different order of the same search terms.
    3. Evergreen sites should have the ability to turn off suggestions for particular 4xx entries. For example, a site may want to generate suggestions for personal names (400) and geographic names (451), but not for any other types of See from tracings.
    4. Because end users may not understand why a suggestion may be offered when it comes from an authority record, the language that provides the suggestion should be different than other “Did You Mean?” suggestions (see example searches 6 and 8).
    5. If the user is searching the subject field, authority-based suggestions should only be generated for subject authorities and name authorities that are used in subject headings. If the user is searching the author field, authority-based suggestions should only be generated for name authorities that are used in author fields. For keyword searches, authority-based suggestions can be generated from any type of See from tracing.
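To make factors 1, 2, and 4 concrete, here is a minimal Python sketch. The trigram similarity is a simplification of what PostgreSQL’s pg_trgm extension computes (pg_trgm additionally splits multi-word strings into words before extracting trigrams), and the scoring formula and penalty weights are invented for illustration, not a recommendation.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def key_position(ch):
    for row, keys in enumerate(QWERTY_ROWS):
        if ch in keys:
            return (row, keys.index(ch))
    return None

def keyboard_distance(a, b):
    """Chebyshev distance between two keys; adjacent keys score 1."""
    pa, pb = key_position(a), key_position(b)
    if pa is None or pb is None:
        return 99  # non-letter keys: treat as far apart
    return max(abs(pa[0] - pb[0]), abs(pa[1] - pb[1]))

def trigrams(s):
    s = "  " + s.lower() + " "  # pad the ends, roughly as pg_trgm does
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Trigram similarity: shared trigrams over total trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def score(query, candidate):
    s = similarity(query, candidate)
    # Factors 2 and 4: penalize a differing first letter, more so the
    # farther apart the two letters sit on a QWERTY keyboard.
    if query[:1].lower() != candidate[:1].lower():
        s -= 0.1 * min(keyboard_distance(query[0].lower(),
                                         candidate[0].lower()), 3)
    return s
```

Under this toy scoring, a query of xats ranks cats above oats: x and c are adjacent on the keyboard, while o is several keys away — the behavior Example 12 below asks for.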
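Factor 9b can similarly be reduced to a normalized-token comparison. A sketch (the function name is invented for illustration):

```python
def same_terms_reordered(original, suggestion):
    """True when the suggestion merely reorders the words the user
    already entered, e.g. a See-from tracing that flips a name into
    'Surname, Forename' form. Per factor 9b, such a suggestion should
    be suppressed for keyword searches, where word order is irrelevant."""
    def normalize(query):
        return sorted(query.lower().replace(",", " ").split())
    return normalize(original) == normalize(suggestion)
```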

Examples

These are examples showing ideal behavior for Did You Mean? functionality. Each example also points to specific requirements that are demonstrated through the example. Almost all examples demonstrate factor 1 (string proximity), since this is a core requirement for getting Did You Mean? to work well.

The examples are shown in ranked order with the top-most examples being ones we see as being more critical to a successful “Did You Mean?” feature. The examples at the bottom of the list are nice to have, but not as critical.

Note: All examples show a count of records with popularity badges based on searches done on a production Evergreen system. This information should not display in the final development project. The information is included to provide guidance on which factors might be worthwhile to include in the final project.

Example 1

Author search: virginia wolf
9 results, 1 record with popularity badge in result set

Did you mean virginia woolf (262 results, 13 popularity badges)?

Explanation:
This one gets missed by a lot of “Did You Mean?” implementations, possibly because the words are spelled correctly or because we successfully retrieve several results for another author with the name virginia wolf. However, the high number of results and popularity badges for the suggested term is an indicator that the user may have typed incorrect search terms.

Example 2

Keyword search: stephanie meyers
59 results, 10 records with popularity badges in result set

Did you mean stephenie meyer (189 results, 44 badges)? stephanie myers (15 results, 1 badge)?

Suggestions that do not appear:
stephanie peters (320 results, 48 badges)
stephen morris (281 results, 35 badges)

Explanation:
Stephanie myers is probably closer in string proximity than stephenie meyer, but people are most likely searching for the author of the Twilight series (stephenie meyer). The factors that could be used to push this suggestion higher might be the number of search results or the number of records earning popularity badges in the result set. The two suggestions are close in string proximity, but the other two factors point to stephenie meyer as something more people might be looking for.

stephanie peters and stephen morris were Did You Mean? suggestions offered in different test systems. Stephanie Peters might be close in terms of string proximity, but the factor that ultimately excludes it as a suggestion is that the second name starts with a completely different letter/sound, and that different letter is nowhere near the entered letter on the keyboard. However, it was generally seen as a more acceptable suggestion than stephen morris, which seemed too far off to be a good suggestion.

The component of this example where the system suggests stephenie meyer and stephanie myers was seen as the most critical piece of a functioning Did You Mean? feature. The other components of this example were not considered to be as important.

Example 3

Title search: girl on a train
127 results, 16 records with popularity badges

Did you mean girl on the train (158 results, 26 records with popularity badges)?

Explanation:
We do have a title called “girl on a train,” but it’s not the bestselling novel most people are looking for. The number of results for the bestselling novel isn’t much higher, but the number of popularity badges associated with the suggested search is a bit higher.

This suggestion is very close in string proximity, but if we ask the system to totally discard suggestions based on the first letter (or sound) of a word being different, it could create problems for this example.

Example 4

Title search: pet cemetery
2 results, 0 records with popularity badges

Did you mean pet sematary (27 results, 3 records with popularity badge)?

Explanation:
If we were to rely on a standard aspell dictionary, this suggestion would never be offered. By relying on indexed data instead, we do get the title. However, we have records with typos and unintentionally misspelled words, and we want to be careful about offering suggestions based on those typos. What makes this suggestion stand out as a good one compared to suggestions based on typos is the large number of results that are retrieved with the suggested search terms, especially when compared to the number of results retrieved in the original search.

Also note that if we were to totally exclude suggestions where the first letter differs, this suggestion would not be offered. However, it would score well if relying on phonetic first sound instead.

Example 5

Keyword search: twilite meg cabot
No hits page

Did you mean twilight meg cabot (4 results, 0 records with a badge)? twilite (8 results, 1 record with a badge)?

Suggestions that do not appear:

tillit, l.b. (3 records) (0 records with badges)
william b. cabot (37 records) (6 records with badges)

Explanation:
The user knows they want the meg cabot twilight, not the stephenie meyer one, but doesn’t know how to spell twilight. Although we’re looking at two different search classes, the system is able to pull the right suggestion. Another suggestion for records that intentionally misspell twilight is also acceptable, but we didn’t see this component of the example as critical behavior.

tillit, l.b. and william b. cabot were suggestions offered in other tested catalogs, but, in our opinion, are not close enough in string proximity to the entered search terms to be valid suggestions.

Example 6

Author search: ursula archer
No hits page

For your information:
Archer, Ursula is not used as an author. Poznanski, Ursula is used instead. (4 results).

Explanation:
The user enters the author name as it appears on the book cover. This is not the authorized name for the author, which limits the number of results retrieved (or, in this case, leads to no results). However, because 1) $aArcher, Ursula,$d1968- is in the 400 field of the name authority record for Poznanski, Ursula, and 2) this name authority is linked to author headings, the system determines that a suggestion should be generated for the authorized name. In the case of suggestions derived from authority records, a user may be confused as to why the suggestion is being offered. Therefore, we adjust the language on these suggestions to make it clear why we are offering the suggestion.

Example 7

Title search: sematary
27 results, 3 records with popularity badges

Did you mean cemetery (2,358 results, 32 records with badges)?

Suggestion that does not appear:
cemetary (55 results, 5 records with popularity badge)

Explanation:
Although this search retrieves a healthy set of results, we want to direct users to the correctly spelled word. Because this is a one-word search, we may have different thresholds for determining what the suggestion should be. Unlike the search in Example 4, there is nothing else in the search terms to indicate that the user may actually be looking for a title with a misspelled word. This suggestion might be triggered by the fact that the search term is not properly spelled, or by the fact that the result set for ‘cemetery’ is so much larger.

Example 8

Keyword search: anime cinematography
Results 1-10 of about 440

For your information:
Anime (Cinematography) is not used as a subject heading. Animation (Cinematography) — Japan is used instead (65 results).

Explanation:
The user enters the more well-known term anime, but this is not an authorized term for LC subject headings. However, because Anime (Cinematography) is listed in the 450 field of the authority record for ‡a Animation (Cinematography) ‡z Japan, the system determines that a suggestion should be generated for this authorized heading.

Note: this is one of the few examples where the Did You Mean? suggestion leads to fewer results instead of more results. This is because the words anime and animation share the same stem, meaning that the original anime search is very imprecise. The suggestion leads to a better set of results, even if there are fewer results. On a system that does not employ stemming, the original search yields far fewer results, and the Did You Mean? suggestion broadens the search because it makes use of the proper subject heading.

Example 9

Keyword search: cats
18,518 results, 2,490 records with badges

No suggestions offered

Suggestion that does not appear:

cast (12,668 results) (2,114 records with badges)?

Explanation:
This word is in a standard aspell dictionary with a very large set of results with many popularity badges. There is no reason to believe the user was searching for something other than what they entered. NOTE: If the system did reach a threshold that determined a suggestion was required, we saw ‘cast’ as a very good suggestion.

Example 10

Title search: cemetery
2,358 results, 32 records with badges

No suggestions offered

Suggestions that do not appear:

sematary (27 results, 3 records with popularity badge)
cemetary (55 results, 5 records with popularity badge)

Explanation:
The user may have been looking for Pet Sematary or Accomodating Brocolli in the Cemetary: Or Why Can’t Anybody Spell. However, without more search terms, there isn’t enough evidence that they are looking for these titles. For this one-word search, since the search word is in a standard aspell dictionary, or maybe because it retrieves many results, there is no need to offer suggestions.

Example 11

Title search: girl on the train
158 results, 26 records with popularity badges

Did you mean girl on a train (127 results, 16 records with popularity badges)?

Explanation:
In this case, the system may be leading the user away from the bestselling novel, but since we do have a title called girl on a train with a healthy number of results, this suggestion is still reasonable. Overall, we found there was more harm in not offering a suggestion than there is in offering a suggestion that isn’t what the user is looking for.

Example 12

Keyword search: xats
No hits page

Did you mean cats (18,518 results, 2,940 records with badges)? axat (2 results, 1 record with a badge)?

Suggestions that do not appear:

oats (1,337 results) (143 records with badges)
tats (486 results) (14 records with badges)
xtc (69 results) (6 records with badges)
at (search times out before I can get a count)

Explanation:
With a no hits page, this search term is obviously a typo. However, it’s difficult to tell what the user is searching for. Because we reached a no hits page, the system should try as much as possible to offer some type of suggestion, even if it’s difficult to determine exactly what the suggestion should be.

In this case, the first choice went to a suggestion where the first letter of the suggested term is just 1 QWERTY keyboard key away from the entered first letter. Although the second suggestion (axat) has few hits and is not much closer in string proximity than some of the rejected suggestions (oats, tats, etc.), it seemed more likely to be a possible match.

xtc, at and axat were all suggested terms we saw on other systems. However, in our opinion, axat was the only one of those three that seemed like a plausible suggestion.

Example 13

Keyword search: vats
137 results, 10 records with badges

No suggestions offered

Suggestions that do not appear:

vast (2,891 results, 400 records with badges)
cats (18,518 results, 2,940 records with badges)
bats (3,213 results) (336 records with badges)?

Explanation:
This word is in a standard aspell dictionary, but it has a small set of results for a one-word search. However, there is no reason to believe the user was searching for something other than what they entered. NOTE: The suggestions that do not appear were generally considered to be pretty good suggestions if the system did hit a threshold where it was determined that a suggestion was required. The first choice goes to a suggestion close in string proximity that also starts with the same first letter as the entered search term. The next two choices are words that differ only in the first letter; in both cases, those letters are adjacent to ‘v’ on a QWERTY keyboard.

Example 14

Keyword search: qats
4 results, 0 records with badges

Did you mean qatsi (4 results, 0 records with badges)? cats (18,518 results, 2,940 records with badges)?

Suggestions that do not appear:

oats (1,337 results) (143 records with badges)
tats (486 results) (14 records with badges)
squats (120 results) (21 records with badges)

Explanation:
This word is not in a standard aspell dictionary (even though it is, indeed, an actual word), but it retrieves some results. The absence from the dictionary or the retrieval of such a small number of results for a one-word search might be the trigger for providing some Did You Mean? suggestions. Suggesting a search term with the same starting letter was the first choice.