Movie Word Cloud

I’ve calculated word frequencies across the subtitles of about 40k English movies. I thought I’d get something useful, but I can’t see it yet. Listing the highest-ranking words proved almost entirely useless, so I excluded the most common stop-words. Here are the 500 highest-ranking words that remain.
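A minimal sketch of that kind of count, assuming the subtitle text is already loaded into a string. The `STOP_WORDS` list and the `word_frequencies` helper are illustrative names, not the code behind the numbers above, and a real stop-word list would be much longer.

```ruby
require 'set'

# Tiny illustrative stop-word list; a real one would have a few hundred entries.
STOP_WORDS = Set.new(%w[a an and are i in is it of that the to you])

# Count how often each word occurs in `text`, skipping stop words.
# Returns [word, count] pairs sorted by descending count, then alphabetically.
def word_frequencies(text, limit = 500)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z']+/) do |word|
    counts[word] += 1 unless STOP_WORDS.include?(word)
  end
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

sample = "I rule. You are the loser. I rule the roses, and the roses rule."
word_frequencies(sample, 3).each { |word, count| puts "#{word}: #{count}" }
```

For the sample sentence this ranks "rule" first, then "roses", then "loser". Across a whole subtitle collection the same hash would simply keep accumulating counts file by file.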

An interesting question here would be: is the relative frequency any different from the relative frequency in the English language in general? Which words have a higher frequency in movies than in everyday language? Can we detect movie speak by the lack of certain words?
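One way to approach that comparison, sketched under the assumption that word counts are available both for the movie subtitles and for some general-English reference corpus. The helper names and the toy counts below are made up for illustration; no reference corpus has been chosen here.

```ruby
# Relative frequency of each word: its count divided by the corpus total.
def relative_frequencies(counts)
  total = counts.values.sum.to_f
  counts.transform_values { |count| count / total }
end

# Ratio of each word's relative frequency in movies to its relative frequency
# in the reference corpus, highest ratio first. Words missing from the
# reference are skipped to avoid dividing by zero.
def overrepresented(movie_counts, reference_counts)
  movies = relative_frequencies(movie_counts)
  reference = relative_frequencies(reference_counts)
  ratios = movies.map do |word, freq|
    [word, freq / reference[word]] if reference.key?(word)
  end
  ratios.compact.sort_by { |word, ratio| [-ratio, word] }
end

# Toy counts, invented purely to show the call shape.
movie_counts     = { 'honey' => 30, 'dad' => 50, 'time' => 20 }
reference_counts = { 'honey' => 5,  'dad' => 25, 'time' => 70 }

overrepresented(movie_counts, reference_counts).each do |word, ratio|
  puts format('%-6s %.2f', word, ratio)
end
```

A ratio above 1 means the word is more frequent in movies than in the reference. Ratios well below 1 are the candidates for words whose absence marks movie speak.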

[Image: word cloud]

But what seems more promising than a simple word count is Yahoo’s term extractor, which, surprisingly, does a pretty good job. For example, American Beauty yields these terms:

  • lester burnham
  • neighbor jim
  • pruning shears
  • product launch
  • lover jim
  • miracle gro
  • typical teenager
  • geek boy
  • daughter jane
  • wife carolyn
  • eggshells
  • role model
  • high point
  • clogs
  • misery
  • loser
  • girlfriend
  • honey
  • roses
  • dad

In fact it’s almost as good as watching the movie :) . That’s why the next step will be listing the most common Yahoo-extracted terms for all the movies. Gotta do it while Yahoo’s still around :) .
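That aggregation step could look roughly like this, assuming the extracted terms have already been collected into a hash keyed by movie title. The `most_common_terms` helper and the placeholder input are hypothetical. Each term is counted once per movie, so a term repeated within one movie does not dominate the ranking.

```ruby
# Count in how many movies each extracted term appears.
# `terms_by_movie` maps a movie title to the list of terms extracted for it.
def most_common_terms(terms_by_movie, limit = 20)
  counts = Hash.new(0)
  terms_by_movie.each_value do |terms|
    terms.uniq.each { |term| counts[term] += 1 }
  end
  counts.sort_by { |term, count| [-count, term] }.first(limit)
end

# Placeholder input to show the call shape; these are not real results.
terms_by_movie = {
  'Movie A' => ['role model', 'dad', 'girlfriend'],
  'Movie B' => ['dad', 'girlfriend', 'dad'],
  'Movie C' => ['dad']
}

most_common_terms(terms_by_movie, 2).each { |term, count| puts "#{term}: #{count}" }
```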

This bit of Ruby code queries the Yahoo term extraction API (you need an API key first):

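require 'net/http'
require 'uri'
require 'rexml/document'

app_id = '***************'  # your Yahoo application ID
yahoo_uri = URI.parse('http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction')

# Text to analyze, e.g. one movie's subtitles. Reading it from a file argument
# or stdin is an assumption; the original snippet left `text` undefined.
text = ARGF.read

# POST the text to the service; the response body is an XML document
resp = Net::HTTP.post_form(yahoo_uri, { 'appid' => app_id, 'context' => text })

terms = REXML::Document.new(resp.body)

# Each extracted term arrives in its own <Result> element; print just its text
terms.each_element('//Result') do |term|
  puts term.text
end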