Statistically Improbable Phrases in Technology-Assisted Review



Amazon.com computes and displays “Statistically Improbable Phrases” for its indexed books. It defines a Statistically Improbable Phrase as “a phrase that occurs a large number of times in a particular book relative to all [indexed] books.” You can use similar statistics to help you improve your technology-assisted review.

First, you compute which 3-word phrases appear more densely in a set of relevant documents than in a set of irrelevant documents (“key phrases”). Then, you view promising-looking key phrases in their original context. Finally, you select some to use as search terms to help you prioritize review of the larger universe of documents. You can repeat this process throughout the review. As more documents are reviewed, you can apply the process to successively more granular issues.
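If you want to see the idea in miniature before opening any tool, here is a rough Python sketch of the first step. It is only an illustration under my own assumptions: the function names (`trigram_counts`, `denser_phrases`), the simple tokenizer, and the `min_count` cutoff are all hypothetical, and phrases are allowed to run across sentence boundaries.

```python
import re
from collections import Counter


def trigram_counts(text):
    """Count overlapping 3-word phrases in lowercased text.

    Tokenizing on letters and apostrophes is an assumption of this sketch;
    punctuation is ignored, so phrases can span sentence boundaries.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:], words[2:]))


def denser_phrases(relevant_text, irrelevant_text, min_count=2):
    """Return phrases that are denser in the relevant text than in the irrelevant text.

    Each result is (phrase, count_in_relevant, count_in_irrelevant), ordered by
    the gap in density (count divided by total phrases), largest gap first.
    """
    rel = trigram_counts(relevant_text)
    irr = trigram_counts(irrelevant_text)
    rel_total = sum(rel.values()) or 1
    irr_total = sum(irr.values()) or 1

    scored = []
    for phrase, count in rel.items():
        if count < min_count:  # hypothetical cutoff to skip one-off phrases
            continue
        gap = count / rel_total - irr.get(phrase, 0) / irr_total
        if gap > 0:
            scored.append((gap, " ".join(phrase), count, irr.get(phrase, 0)))

    # Largest density gap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda row: (-row[0], row[1]))
    return [(phrase, rel_n, irr_n) for _, phrase, rel_n, irr_n in scored]


if __name__ == "__main__":
    relevant = ("The side letter agreement was signed. "
                "The side letter agreement was hidden. Please see attached.")
    irrelevant = "Please see attached. Please see attached. Lunch is at noon."
    for row in denser_phrases(relevant, irrelevant):
        print(row)
```

A plain density gap is the crudest possible measure. It is enough to show why a phrase like "please see attached" drops out while an issue-specific phrase stays in.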

A great tool for finding key phrases is WordSmith 6 by Mike Scott and Lexical Analysis Software Ltd. (USD $88.73 or EUR €67.57 from lexically.net). (Note: I’m not affiliated with Mike or Lexical.)

[Image: The WordSmith interface]

The process is simple:

1. Create a plain text file containing the text extracted from a set of relevant documents and another containing the text extracted from a set of irrelevant documents.

2. Use WordSmith’s WordList module to create an index for each file.

3. Use the Compute Clusters function (in the WordList menu bar) to list, for each indexed text file, how many times each phrase appears and what proportion of the file’s total text each phrase represents. I’ve found it best to use the default settings except for limiting the analysis to 3-word phrases and omitting numbers.

4. Use WordSmith’s KeyWords module to compare the two phrase lists, using the list of phrases from the irrelevant text as the “reference corpus wordlist.” You will get a list of key phrases in order of descending “keyness” (which is based on the relative frequency of each phrase in the two text files).
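To make “keyness” a little more concrete, here is a minimal Python sketch of one statistic commonly used for it in corpus linguistics, the log-likelihood (G2) score. I am assuming this statistic purely for illustration; which statistic WordSmith actually applies depends on its settings, and the function names and sample counts below are hypothetical.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) score for one phrase.

    a: occurrences in the relevant (study) text
    b: occurrences in the irrelevant (reference) text
    c: total phrases in the relevant text
    d: total phrases in the irrelevant text
    """
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


def rank_by_keyness(study_counts, ref_counts, study_total, ref_total):
    """Return (phrase, score) pairs, highest score first.

    Only phrases that are relatively more frequent in the study text are kept.
    """
    ranked = []
    for phrase, a in study_counts.items():
        b = ref_counts.get(phrase, 0)
        if a / study_total > b / ref_total:
            ranked.append((phrase, log_likelihood(a, b, study_total, ref_total)))
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


if __name__ == "__main__":
    study = {"side letter agreement": 20, "please see attached": 10}
    reference = {"please see attached": 10}
    for phrase, score in rank_by_keyness(study, reference, 1000, 1000):
        print(f"{score:8.2f}  {phrase}")
```

The score grows both with how lopsided the two frequencies are and with how often the phrase occurs, so a phrase that appears equally often in both files scores zero no matter how common it is.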

Most of the phrases in this list will be substantively empty, but several will be highly relevant to the issues and therefore very useful for creating review sets from the document universe.

If you want to see how a key phrase is used in context in the full text of the relevant documents, you can use WordSmith’s Concord function. This will give you a list of each instance of the phrase as it appears in context in the original documents. The context is initially only one line of text (which is usually plenty, and you can stretch the display across two monitors). If you want to see more context, double-click the line of text to open the text file with the selected phrase highlighted.
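For readers curious what a concordance view does under the hood, here is a small keyword-in-context sketch in Python. It is not how Concord is implemented; the `concordance` function, its `width` parameter, and the sample text are my own hypothetical stand-ins.

```python
import re


def concordance(text, phrase, width=40):
    """Return (left, match, right) context tuples for each occurrence of phrase.

    Matching is case-insensitive and tolerates any whitespace (including line
    breaks) between the words of the phrase. `width` is the number of
    characters of context taken on each side before whitespace is collapsed.
    """
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Collapse line breaks and repeated spaces so each hit fits on one line.
        hits.append(tuple(" ".join(part.split()) for part in (left, m.group(0), right)))
    return hits


if __name__ == "__main__":
    sample = ("We signed the side letter\nagreement in May. "
              "Later the Side Letter Agreement was amended.")
    for left, match, right in concordance(sample, "side letter agreement", width=20):
        print(f"{left:>20} [{match}] {right}")
```

Lining the matches up in a column, as the print loop does, is what makes recurring patterns of usage easy to spot when you scan down the list.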

Many variations of this process are possible. For example, you can set the Compute Clusters function to look for phrases of a set number of words or a range (for example, 3 to 5 words). If you have reviewed enough documents, you can also perform this kind of analysis on text that is relevant only to a particular narrow issue.
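As a rough illustration of the phrase-length variation, the Python sketch below counts phrases across a range of lengths. The `phrase_counts` function and its parameters are hypothetical, and treating “omitting numbers” as “skip any phrase containing an all-digit token” is my own assumption rather than a description of WordSmith’s setting.

```python
import re
from collections import Counter


def phrase_counts(text, min_words=3, max_words=5, skip_numbers=True):
    """Count every phrase of min_words to max_words consecutive words.

    With skip_numbers=True, phrases containing an all-digit token are dropped
    (an assumed reading of "omitting numbers").
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(min_words, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if skip_numbers and any(w.isdigit() for w in gram):
                continue
            counts[" ".join(gram)] += 1
    return counts


if __name__ == "__main__":
    sample = "Invoice 42 was paid under the side letter"
    for phrase, n in sorted(phrase_counts(sample, 3, 4).items()):
        print(n, phrase)
```

Longer ranges produce many overlapping phrases (a 5-word phrase contains three 3-word phrases), so expect the resulting lists to need more pruning.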

WordSmith can also be used in other ways to help review and analyze ediscovery documents.
