The Natural Language Toolkit (developerWorks)
Posted Jun 25, 2004 22:58 UTC (Fri) by
iabervon (subscriber, #722)
In reply to:
The Natural Language Toolkit (developerWorks) by cpeterso
Parent article:
The Natural Language Toolkit (developerWorks)
Right. You first look at all of your emails (so far), and count how many
of them each word appears in. Then you look at the new email, and count
how many times each word appears in it. You divide the latter by the
former, and then show the words which score highest. Note that you stem
the words beforehand, so "frequent" and "frequency" are counted together.
Also note that, when looking at your whole corpus, you count the documents
that a word appears in, not the total number of times the word appears. So
you wind up with the words which are used frequently in the new email but
appear in only a small portion of your email overall.
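A rough sketch of that scoring in Python, in case it helps. This is not the Remembrance Agent's code: crude_stem is a made-up stand-in for a real stemmer, the function names are mine, and adding 1 to the document count (so that a word never seen before doesn't divide by zero) is my own assumption.

```python
import re
from collections import Counter


def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ency", "ent", "ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # Lowercase, split into alphabetic words, and stem each one.
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def document_frequency(corpus):
    # Count how many documents each word appears in (not total occurrences).
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def top_words(new_email, corpus, n=5):
    df = document_frequency(corpus)
    tf = Counter(tokenize(new_email))  # occurrences within the new email
    # Assumption: add 1 to the document count so unseen words don't divide by zero.
    scores = {word: count / (df[word] + 1) for word, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]
```

Calling top_words(new_email, old_emails) gives back up to five (word, score) pairs, which is about what the display was showing.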
It's not actually my idea. I encountered it when using the Remembrance
Agent's engine, which got it out of the document comparison literature
from around '96 (as well as how to optimize computation of these vectors
for a corpus and how to compare vectors efficiently). But as far as I know,
nobody's done software where you actually see the top words, as opposed
to having the software use the information internally (and I don't know
if anyone's noted in the literature that the top five words are often
helpful; people generally use all of the words). It's the sort of thing
where the debugging code turns out to be really interesting. Our engine
for clustering documents was showing its results and the info about
individual documents; I was using emails, and realized that I could tell
what the emails were about from just the display.