[FYI] Word Spotting


---------------------------- CUT ------------------------------

5 August 1999. Add DC on n-gram analysis. 

3 August 1999. Word "NOT" added to paragraph 6 by DC. 

2 August 1999. Thanks to Duncan Campbell. 

Date: Thu, 05 Aug 1999 01:33:07 +0100
To: ukcrypto@maillist.ox.ac.uk
From: Duncan Campbell <duncan@gn.apc.org>
Subject: Re: Question for Duncan Campbell re: Word-Spotting

The topic spotting methods that NSA is working on are based on n-gram
analysis, which in my crude way I understand to be based on a
comparison of n-dimensional matrices setting out the relative
probability of any text string of length n in two corpora of
texts.   One is the surveillance data, which can be massive; the
second is the seed corpus, a chosen set of documents which are about
the topic of interest. 

In other words, you could show the computer the last six months of
uk-crypto, and then say, find me anybody else talking about this stuff
in the world's communications.   The topic spotting system then
rank-orders the target communications by how closely their topics
match the uk-crypto corpus. 
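
As a rough illustration of the idea only (this is not the patented NSA
method, whose details are not given here), the ranking step might be
sketched as below.   Everything in it is an assumption made for the
example: character trigrams (n=3), add-one smoothing, a per-n-gram
log-likelihood ratio as the score, and the function names themselves.

```python
import math
from collections import Counter


def ngram_counts(text, n=3):
    """Count overlapping character n-grams in lower-cased, space-normalised text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def log_prob_table(counts, vocab, alpha=1.0):
    """Smoothed log-probability of every n-gram in a shared vocabulary.

    alpha is an assumed add-one style smoothing constant.
    """
    total = sum(counts.values()) + alpha * len(vocab)
    return {g: math.log((counts.get(g, 0) + alpha) / total) for g in vocab}


def topic_score(doc, seed_lp, background_lp, n=3):
    """Mean log-likelihood ratio (seed vs. background) per n-gram of doc."""
    grams = ngram_counts(doc, n)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    ratio_sum = sum(
        count * (seed_lp[g] - background_lp[g])
        for g, count in grams.items()
        if g in seed_lp
    )
    return ratio_sum / total


def rank_by_topic(seed_docs, traffic, n=3):
    """Return (score, document) pairs, most seed-like document first."""
    seed = Counter()
    for doc in seed_docs:
        seed.update(ngram_counts(doc, n))
    background = Counter()
    for doc in traffic:
        background.update(ngram_counts(doc, n))

    # Both probability tables share one vocabulary so ratios are comparable.
    vocab = set(seed) | set(background)
    seed_lp = log_prob_table(seed, vocab)
    background_lp = log_prob_table(background, vocab)

    scored = [(topic_score(doc, seed_lp, background_lp, n), doc) for doc in traffic]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling rank_by_topic(seed_docs, traffic) with a handful of on-topic
seed messages and a list of intercepted texts returns the texts ordered
from most to least seed-like.   Because the score is built from short
overlapping strings rather than whole words, a few wrong characters
change only a few n-grams, which is one plausible reading of why such
methods tolerate noisy input.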

NSA has patented this method, and claims that it is completely
language independent (true, if each corpus is in the same language)
and highly effective despite high error rates (which seems very
plausible).   It is this latter claim that makes me suspect that it
may work when they apply it to phoneme strings in the speech
recognition problem.   If you can do that, you don't need to go
through the actual transcription phase. 

I find the method elegant, as it neatly sidesteps all the
well-understood problems of Boolean-based searches. 

Duncan Campbell 


---------------------------- CUT ------------------------------