Mozaik.

22 May, 2009

Movie Word Cloud

Posted by: Urban In: miscellaneous

I’ve calculated the word frequency in the subtitles of about 40k English movies. I thought I’d get something useful, but can’t see it yet. Listing the highest-ranking words proved almost entirely useless, so I excluded the most common stop-words. Here I’ve listed the 500 highest-ranking words.
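A minimal sketch of that kind of count, in Ruby. The stop-word list here is a tiny hypothetical one, and the names `word_frequencies` and `top_words` are made up for illustration; the real run over 40k subtitle files would need a much longer list and file handling on top of this.

```ruby
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = %w[the a an and or of to in is it you i that this].freeze

# Count how often each non-stop-word occurs in a piece of subtitle text.
def word_frequencies(text, stop_words = STOP_WORDS)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z']+/) do |word|
    next if stop_words.include?(word)
    counts[word] += 1
  end
  counts
end

# Return the n most frequent words as [word, count] pairs, highest first.
# Ties are broken alphabetically so the order is stable.
def top_words(counts, n = 500)
  counts.sort_by { |word, count| [-count, word] }.first(n)
end

sample = "I love you. You love roses, and the roses love the sun."
top_words(word_frequencies(sample), 3).each do |word, count|
  puts "#{word}: #{count}"
end
```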

An interesting question here would be: is the relative frequency any different from the relative frequency in the English language in general? Which words have a higher frequency in movies than in everyday language? Can we detect movie-speak by the lack of certain words?
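One way to approach that comparison, sketched in Ruby: turn both sets of counts into relative frequencies and look at the ratio per word. The function names and the counts below are made up for illustration, and the small `floor` value for words missing from the reference corpus is just one possible choice.

```ruby
# Convert raw counts to relative frequencies (each word's share of all tokens).
def relative_frequencies(counts)
  total = counts.values.sum.to_f
  counts.transform_values { |c| c / total }
end

# Ratio of a word's relative frequency in movies to its relative frequency
# in a reference corpus. Values above 1 mean the word is over-represented
# in movies. Words missing from the reference get a tiny floor frequency
# (an arbitrary assumption) to avoid dividing by zero.
def overrepresentation(movie_counts, reference_counts, floor = 1e-9)
  movie = relative_frequencies(movie_counts)
  reference = relative_frequencies(reference_counts)
  movie.map { |word, freq| [word, freq / reference.fetch(word, floor)] }
       .sort_by { |_, ratio| -ratio }
end

# Made-up counts purely for illustration.
movie_counts = { "gun" => 30, "honey" => 20, "time" => 50 }
reference_counts = { "gun" => 5, "honey" => 5, "time" => 90 }

overrepresentation(movie_counts, reference_counts).each do |word, ratio|
  puts format("%-6s %.2f", word, ratio)
end
```

Words with a ratio well below 1 would be the candidates for "missing" words in movie-speak.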

[Image: wordcloud]

But what seems more promising than a simple word count is the Yahoo term extractor, which, surprisingly, does a pretty good job. For example, American Beauty yields these terms:

  • lester burnham
  • neighbor jim
  • pruning shears
  • product launch
  • lover jim
  • miracle gro
  • typical teenager
  • geek boy
  • daughter jane
  • wife carolyn
  • eggshells
  • role model
  • high point
  • clogs
  • misery
  • loser
  • girlfriend
  • honey
  • roses
  • dad

In fact, it’s almost as good as watching the movie :) . That’s why the next step will be listing the most common Yahoo-extracted terms across all the movies. Gotta do it while Yahoo’s still around :) .

This bit of Ruby code queries the Yahoo term extraction API (you need an API key first):

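require 'net/http'
require 'uri'
require 'rexml/document'

app_id = '***************'  # your Yahoo application ID
yahoo_uri = URI.parse('http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction')

# Text to analyze, e.g. the subtitles of one movie (placeholder value)
text = 'subtitle text goes here'

# POST the text to the term extraction service
resp = Net::HTTP.post_form(yahoo_uri, { 'appid' => app_id, 'context' => text })

# Parse the XML response and print the text of each <Result> element
terms = REXML::Document.new(resp.body)

terms.each_element('//Result') do |term|
  puts term.text
end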
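
For the next step mentioned above (the most common extracted terms across all movies), a rough sketch could count how many movies each term shows up in. The `terms_by_movie` hash and the `term_document_counts` name are hypothetical, and the sample data is made up; in practice the arrays would come from the API responses.

```ruby
# Count in how many movies each extracted term appears.
# terms_by_movie is a hypothetical Hash of movie title => Array of terms.
# Terms are lowercased and de-duplicated per movie, so each movie
# contributes at most 1 to a term's count.
def term_document_counts(terms_by_movie)
  counts = Hash.new(0)
  terms_by_movie.each_value do |terms|
    terms.map(&:downcase).uniq.each { |term| counts[term] += 1 }
  end
  counts
end

# Made-up input purely for illustration.
terms_by_movie = {
  "Movie A" => ["role model", "girlfriend", "dad"],
  "Movie B" => ["Girlfriend", "dad", "loser"],
  "Movie C" => ["dad", "dad", "honey"]
}

term_document_counts(terms_by_movie)
  .sort_by { |term, count| [-count, term] }
  .first(3)
  .each { |term, count| puts "#{term}: #{count}" }
```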

3 Responses to "Movie Word Cloud"

1 | dare

May 22nd, 2009 at 17:22

hacking & sharing, commendable.

nice visual redesign of the blog!

2 | dare

May 22nd, 2009 at 17:23

and for a better picture you'd need to merge hear/heard, thing/things, etc.

3 | Urban

May 24th, 2009 at 17:57

you're right.. stemming and lemmatization would help a lot. I didn't even think of it.

