Förderverein Informationstechnik und Gesellschaft

Word Spotting

5 August 1999. Added DC's note on n-gram analysis.

3 August 1999. Word "NOT" added to paragraph 6 by DC.

2 August 1999. Thanks to Duncan Campbell.

Date: Thu, 05 Aug 1999 01:33:07 +0100
To:
From: Duncan Campbell <>
Subject: Re: Question for Duncan Campbell re: Word-Spotting Capabilities

The topic spotting methods that NSA is working on are based on n-gram analysis, which in my crude way I understand to be based on a comparison of n-dimensional matrices setting out the relative probability of any text string of length n in two corpuses (corpora?) of texts. One is the surveillance data, which can be massive; the second is the seed corpus, a chosen set of documents about the topic of interest.
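As a rough illustration of the idea of a table of relative n-gram probabilities, here is a minimal Python sketch. It is not the patented method, whose details are not given in this message. The function name ngram_profile, the choice of character trigrams (n = 3), and the sample seed sentence are all assumptions made for the example.

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Return the relative frequency of every character n-gram in text."""
    # Normalize case and collapse whitespace so layout does not affect counts.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical one-sentence "seed corpus"; a real one would be far larger.
seed = ngram_profile("public key encryption and key escrow policy")

# Show the five most frequent trigrams and their relative frequencies.
for gram, freq in sorted(seed.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(gram), round(freq, 3))
```

One such profile would be built for the seed corpus and one for each item of the data being searched. Because the profile is built from raw character strings, nothing in it depends on a dictionary or on a particular language.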

In other words, you could show the computer the last six months of uk-crypto and then say, "Find me anybody else talking about this stuff in the world's communications." The topic spotting system then rank-orders the target communications by how closely their topics match the uk-crypto corpus.
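The ranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it scores each target by the cosine similarity of its character-trigram counts against the pooled seed corpus, which is one common way to compare n-gram profiles and not necessarily what the patented system does. The names profile, cosine and rank_by_topic, and the sample texts, are invented for the example.

```python
import math
from collections import Counter

def profile(text, n=3):
    """Count character n-grams after normalizing case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count tables (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_by_topic(seed_docs, targets, n=3):
    """Return (score, text) pairs, best match to the seed corpus first."""
    seed = Counter()
    for doc in seed_docs:
        seed.update(profile(doc, n))  # pool all seed documents into one table
    scored = [(cosine(seed, profile(t, n)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical stand-ins for the seed corpus and the intercepted traffic.
seed_docs = ["key escrow and strong encryption policy",
             "export controls on public key cryptography"]
targets = ["tomato soup with fresh basil and roasted garlic",
           "new rules on encryption export and key escrow"]

ranked = rank_by_topic(seed_docs, targets)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

No search terms are ever specified here; the seed documents alone define the topic, and every target receives a graded score, not a yes/no match.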

NSA has patented this method, and claims that it is completely language independent (true, if each corpus is in the same language) and highly effective despite high error rates (which seems very plausible). It is this latter claim that makes me suspect that it may work when they apply it to phoneme strings in the speech recognition problem. If you can do that, you don't need to go through the actual transcription phase.
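The claim about tolerance to errors can be illustrated with a toy experiment in Python. This sketch makes several assumptions that are not in the message: it works on ordinary characters in place of phoneme strings, it simulates recognition errors by overwriting every fifth symbol, and it compares trigram counts by cosine similarity. The names profile, cosine and corrupt and all sample texts are hypothetical.

```python
import math
from collections import Counter

def profile(text, n=3):
    """Count character n-grams after normalizing case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count tables (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def corrupt(text, every=5, junk="#"):
    """Overwrite every `every`-th symbol with junk to mimic recognition errors."""
    return "".join(junk if i % every == every - 1 else ch
                   for i, ch in enumerate(text))

seed = "key escrow and strong encryption policy for public key cryptography"
on_topic = "strong encryption and key escrow policy debate"
off_topic = "tomato soup with fresh basil and roasted garlic"

noisy = corrupt(on_topic)  # 20% of the symbols are now wrong
noisy_score = cosine(profile(seed), profile(noisy))
off_score = cosine(profile(seed), profile(off_topic))
print("noisy on-topic:", round(noisy_score, 3))
print("clean off-topic:", round(off_score, 3))
```

In this toy case the damaged on-topic text still scores above the clean off-topic text, because the trigrams that survive between errors continue to overlap with the seed. It says nothing about how well the approach works on real speech.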

I find the method elegant, as it neatly sidesteps all the well-understood problems of Boolean-based searches.

Duncan Campbell