The Natural Language Toolkit (developerWorks)

Posted Jun 25, 2004 22:15 UTC (Fri) by cpeterso (guest, #305)
In reply to: The Natural Language Toolkit (developerWorks) by iabervon
Parent article: The Natural Language Toolkit (developerWorks)

Pardon my ignorance <:) but what does "top 5 porter-stemmed words by term-frequency-inverse-document-frequency" mean? Does this mean the most common words in a particular email that are least common among ALL emails? i.e. what words make this email special compared to all other emails?

wow, that is a very interesting idea you had! :D


The Natural Language Toolkit (developerWorks)

Posted Jun 25, 2004 22:58 UTC (Fri) by iabervon (subscriber, #722)

Right. You first look at all of your emails (so far), and count how many
of them each word appears in. Then you look at the new email, and count
how many times each word appears. You divide the latter by the former,
and then show the words which score highest. Note that you stem the words
beforehand, so "frequent" and "frequency" are counted together. Also note
that, when looking at your whole corpus, you count the documents that a
word appears in, not the total number of times the word appears. So you
wind up with the words which are used frequently that are in a small
portion of your email.
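
A minimal sketch of the scoring described above, in Python. It follows the
plain ratio from the description (count in the new email divided by the
number of earlier emails containing the word); the usual textbook TF-IDF
takes a logarithm of the inverse document frequency instead. The names
crude_stem, stems, document_frequency and top_words are made up for this
illustration, and crude_stem is only a stand-in for a real Porter stemmer.

```python
import re
from collections import Counter


def crude_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strip a few common suffixes."""
    for suffix in ("ency", "ent", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Lowercase the text, split it into alphabetic words, and stem each one."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def document_frequency(corpus):
    """Count how many documents each stem appears in (not total occurrences)."""
    df = Counter()
    for doc in corpus:
        df.update(set(stems(doc)))
    return df


def top_words(email, df, n=5):
    """Rank the stems of one email by (count in email) / (documents containing it)."""
    tf = Counter(stems(email))
    # Assumption: a stem never seen before is treated as if it appeared in one
    # document, which avoids dividing by zero.
    scores = {w: count / max(df[w], 1) for w, count in tf.items()}
    # Highest score first; ties are broken alphabetically to keep output stable.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]
```

For example, build df = document_frequency(old_emails) once, then call
top_words(new_email, df) for each incoming message. With this stand-in
stemmer, "frequent" and "frequency" both reduce to the same stem and are
counted together.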

It's not actually my idea. I encountered it when using the Remembrance
Agent's engine, which got it out of the document comparison literature
from around '96 (as well as how to optimize computation of these vectors
for a corpus and how to compare vectors efficiently), but as far as I know,
nobody's done software where you actually see the top words, as opposed
to having the software use the information internally (and I don't know
if anyone's noted in the literature that the top five words are often
helpful; people generally use all of the words). It's the sort of thing
where the debugging code turns out to be really interesting. Our engine
for clustering documents was showing its results and the info about
individual documents; I was using emails, and realized that I could tell
what the emails were about from just the display.

The Natural Language Toolkit (developerWorks)

Posted Jun 25, 2004 23:22 UTC (Fri) by dang (subscriber, #310)

You might be able to do something like this pretty easily with OpenFTS.

The Natural Language Toolkit (developerWorks)

Posted Jun 25, 2004 23:41 UTC (Fri) by cpeterso (guest, #305)

Thanks for the explanation! Do you remember an example of any particular email's top 5 words? This is a cool idea, but it sounds like a lot of work (programming and compute-time) to create email summaries. I bet most people don't spend much time reading or relying on their emails' subject lines. :\

The Natural Language Toolkit (developerWorks)

Posted Jun 26, 2004 0:29 UTC (Sat) by iabervon (subscriber, #722)

I don't remember a good example, but most of the programming work is
already done in the remembrance agent, available (GPL) from
http://www.remem.org/, which indexes a corpus and then can efficiently
query it. The interface only really supports finding the documents which
match best, but the engine has the words and their scores internally.
It's not a lot of computation provided you do an initial scan and then
update it incrementally; using the stored info is too fast to tell on my
machine. (Note that what it displays for a match are what matched, not
the top words for the document in general.)
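
The "initial scan, then update incrementally" approach might look roughly
like the Python sketch below. This is not the Remembrance Agent's code or
interface; WordIndex and its methods are hypothetical names, and the
documents are assumed to arrive as lists of already-stemmed words.

```python
from collections import Counter


class WordIndex:
    """Hypothetical index that keeps per-word document counts up to date."""

    def __init__(self):
        # Number of indexed documents that contain each word.
        self.doc_freq = Counter()

    def add(self, words):
        """Index one more document; the cost depends only on that document."""
        self.doc_freq.update(set(words))

    def top_words(self, words, n=5):
        """Score a document against the stored counts without rescanning the corpus."""
        tf = Counter(words)
        # Assumption: unseen words count as appearing in one document.
        scores = {w: c / max(self.doc_freq[w], 1) for w, c in tf.items()}
        return sorted(scores, key=lambda w: (-scores[w], w))[:n]
```

The initial scan is just a loop calling add() on every existing email.
After that, each new email costs one top_words() call and one add(), so
the whole corpus is never re-read.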

Querying for one of the emails in the list gives, for its match with
itself: "window, throwing, knife, pizza, stuck". I'm not sure whether
this is only completely clear to me due to remembering the email, or
whether other people could also figure out what the email was about from
that information.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds