Discovering Important Features Using TF-IDF
Ever since I discovered NLTK (the Natural Language Toolkit) for Python, I’ve become somewhat of a closet natural language junkie. Why? Because language is one of those areas of science for which you can’t write simple rules. Still, after browsing the NLTK documentation and online book, I’m amazed at how far you can get.
One technique I’ve come across in document parsing and analysis is known as TF-IDF (term frequency-inverse document frequency). TF-IDF is a measure of how important a feature is for one object relative to all the other objects in a collection. For example, all bikes in the Tour de France have two wheels (an unimportant feature) while only a few have electronic shifting (an important and differentiating feature).
TF-IDF is often used to explore what the most important features of a document are. One could create a naïve algorithm that ranks term importance by term frequency (TF) alone. In this post, “the”, “and”, and “a” are likely to be the most common words. Should we then assume that “the”, “and”, and “a” are what this post is about?
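To see the problem with the naïve approach, here is a minimal sketch that ranks terms by raw count. The function name `top_terms_by_frequency`, the simple regex tokenizer, and the sample text are all my own illustrative assumptions, not part of any library.

```python
import re
from collections import Counter

def top_terms_by_frequency(text, k=3):
    """Rank terms by raw count, the naive TF-only approach."""
    # Crude tokenizer (an assumption): lowercase runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(k)

# Hypothetical sample text, made up for illustration.
sample = (
    "The cat sat on the mat and the dog sat on the rug. "
    "A cat and a dog and a bird shared the house."
)

# The top-ranked terms are function words, not what the text is about.
print(top_terms_by_frequency(sample))
```

Even in this tiny sample, the function words crowd out “cat” and “dog”.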
Someone clever out there figured out that you could penalize these terms with inverse document frequency (IDF), that is, the inverse of how frequently a term occurs across all documents. “The”, “and”, and “a” occur in *all* my posts, so we’d want to penalize these terms. “TF-IDF”, however, occurs quite frequently in this post and doesn’t occur anywhere else on my blog (yet). We would then correctly conclude that this post has an important feature called “TF-IDF”.
TF is simply the count (n) of word i in document j divided by the total number of words in document j:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
IDF is a bit trickier. It’s the log of the inverse of the fraction of documents (objects) that include word (feature) i, that is, the log of the total number of documents divided by the number of documents that include the word. If 14 of 23 documents include the word “apple”, IDF is log(23/14). TF-IDF is then the product of TF and IDF. You’ll notice that if all documents include the word, IDF = log(1) = 0, which means TF-IDF = 0. Conversely, if only one of 1,000 documents includes the word “oxymoronic”, its IDF is log(1000), so the word gets a high TF-IDF in the one document that uses it.
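$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

Here |D| is the total number of documents and the denominator is the number of documents that include word i.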
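The definitions above translate almost directly into code. What follows is a minimal sketch, not a reference implementation: the function names `tf`, `idf`, and `tf_idf` and the three-document corpus are assumptions made up for illustration, and I’ve assumed the natural log since no base is specified (with it, the “apple” example, log(23/14), comes out to roughly 0.50).

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Count of the term in the document divided by the document's length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Log of (total documents / documents containing the term).

    Assumes the term appears in at least one document; otherwise the
    division below raises ZeroDivisionError.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus, already split into tokens.
corpus = [
    "the bike has two wheels".split(),
    "the bike has electronic shifting".split(),
    "the race has many hills".split(),
]

doc = corpus[1]
scores = {term: tf_idf(term, doc, corpus) for term in set(doc)}

# Print terms from most to least important for this document.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:12s} {score:.3f}")
```

Words that appear in every document (“the”, “has”) score exactly zero, while “electronic” and “shifting”, which appear in only one document, rise to the top.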
So what can you use TF-IDF for? Well, it’s not just for documents anymore. Imagine a few more use cases:
- Uncovering someone’s expertise on Twitter (Dan Cederholm uses the word CSS frequently, and it doesn’t occur much in others’ tweets)
- Uncovering salient characteristics of a wine (many wines are said to be fruity, but few are described with the term ‘pencil lead’)
- Discovering unique terms in user reviews for a business (‘Foie Gras’ is a bigram used many times in reviews for SF’s Chez Spencer but not for other restaurants)
- Finding someone’s ‘haunts’ on Foursquare (I visit Ritual Roasters often, but so does everyone else, whereas I visit Sightglass Roasters every day and very few other people currently do)
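None of these use cases involve documents in the usual sense, but the same arithmetic applies once you decide what plays the role of the “document” and what plays the role of the “word”. Here is a rough sketch of the check-in case, where each user is a document and each venue is a word. The check-in counts, the user names, and the `haunts` function are entirely hypothetical.

```python
import math
from collections import Counter

# Hypothetical check-in logs: each user acts as a "document" and each
# venue they visit acts as a "word". All counts are made up.
checkins = {
    "me":     ["Ritual Roasters"] * 3 + ["Sightglass"] * 7,
    "user_a": ["Ritual Roasters"] * 5 + ["Cafe A"] * 5,
    "user_b": ["Ritual Roasters"] * 4 + ["Cafe B"] * 6,
}

def haunts(user, checkins):
    """Score each venue a user visits by TF-IDF across all users."""
    visits = Counter(checkins[user])
    total_visits = sum(visits.values())
    n_users = len(checkins)
    scores = {}
    for venue, count in visits.items():
        # Number of users who have visited this venue at least once.
        users_with_venue = sum(1 for v in checkins.values() if venue in v)
        scores[venue] = (count / total_visits) * math.log(n_users / users_with_venue)
    # Highest-scoring (most distinctive) venues first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(haunts("me", checkins))
```

Because everyone in this made-up data set visits Ritual Roasters, its score drops to zero no matter how often I go, and Sightglass comes out as my distinctive haunt.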
In the few cases where I’ve gotten to use TF-IDF on large data sets, I’ve been surprised by how well it works. Try it for yourself and see if you can discover hidden features in your own data.