Mozaik.

22 May, 2009

Movie Word Cloud

Posted by: Urban In: razno

I’ve calculated the word frequencies in the subtitles of about 40k English-language movies. I thought I’d get something useful, but I can’t see it yet. Listing the highest-ranking words proved almost entirely useless, so I excluded the most common stop words. Here I listed the 500 highest-ranking words.
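The counting step can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the 40k movies: the stop-word list is a tiny placeholder, and the names word_frequencies and top_words are made up for this example.

```ruby
# Tiny illustrative stop-word list (an assumption); a real one would be much longer.
STOP_WORDS = %w[the a an and or of to in is it i you that this was for on].freeze

# Count how often each non-stop-word occurs in a piece of text.
def word_frequencies(text)
  counts = Hash.new(0)
  # Lowercase, then pull out runs of letters (keeping apostrophes, as in "don't").
  text.downcase.scan(/[a-z']+/) do |word|
    counts[word] += 1 unless STOP_WORDS.include?(word)
  end
  counts
end

# Return the n most frequent words as [word, count] pairs, highest count first.
# Ties are broken alphabetically so the order is stable.
def top_words(counts, n)
  counts.sort_by { |word, count| [-count, word] }.first(n)
end

sample = "I hear you. You hear the roses, and the roses hear you."
top_words(word_frequencies(sample), 2).each do |word, count|
  puts "#{word}: #{count}"
end
# prints:
# hear: 3
# roses: 2
```

For the full ranking you would feed in the text of every subtitle file and call top_words(counts, 500).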

An interesting question here would be: is the relative frequency any different from the relative frequency in the English language in general? Which words have a higher frequency in movies than in everyday language? Can we detect movie-speak by the lack of certain words?
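One way to approach these questions is to compare each word's share of the movie corpus with its share of a general-English reference corpus. The sketch below assumes you already have both sets of counts. The numbers are made up purely for illustration, and the function names are hypothetical.

```ruby
# Convert raw counts into relative frequencies (each word's share of all words).
def relative_frequencies(counts)
  total = counts.values.sum.to_f
  counts.transform_values { |count| count / total }
end

# Ratio of a word's relative frequency in movies to that in a reference corpus.
# Words missing from the reference are skipped, since the ratio is undefined.
# Returns [word, ratio] pairs, highest ratio first.
def overrepresentation(movie_counts, reference_counts)
  movie = relative_frequencies(movie_counts)
  reference = relative_frequencies(reference_counts)
  ratios = {}
  movie.each do |word, freq|
    ratios[word] = freq / reference[word] if reference.key?(word)
  end
  ratios.sort_by { |_word, ratio| -ratio }
end

# Made-up counts, not real measurements.
movie_counts     = { 'gun' => 30, 'honey' => 20, 'report' => 50 }
reference_counts = { 'gun' => 10, 'honey' => 10, 'report' => 80 }

overrepresentation(movie_counts, reference_counts).each do |word, ratio|
  puts format('%-8s %.2f', word, ratio)
end
```

A ratio well above 1 marks a word that is more common in movies than in the reference. A ratio well below 1 marks a word that movies tend to lack.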

[Image: word cloud]

But what seems more promising than a simple word count is the Yahoo term extractor, which, surprisingly, does a pretty good job. For example, American Beauty yields these terms:

  • lester burnham
  • neighbor jim
  • pruning shears
  • product launch
  • lover jim
  • miracle gro
  • typical teenager
  • geek boy
  • daughter jane
  • wife carolyn
  • eggshells
  • role model
  • high point
  • clogs
  • misery
  • loser
  • girlfriend
  • honey
  • roses
  • dad

In fact, it’s almost as good as watching the movie :). That’s why the next step will be listing the most common Yahoo-extracted terms across all the movies. Gotta do it while Yahoo’s still around :).
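Once the terms for each movie are collected, that tally could look something like this. It is only a sketch: the movie names and the terms_by_movie structure are hypothetical placeholders, and each movie counts at most once per term.

```ruby
# Hypothetical input: movie title => terms returned by the extractor for it.
terms_by_movie = {
  'Movie A' => ['role model', 'girlfriend', 'dad'],
  'Movie B' => ['girlfriend', 'dad', 'high point'],
  'Movie C' => ['dad', 'loser']
}

# Count in how many movies each term appears (each movie counts once per term).
def term_counts(terms_by_movie)
  counts = Hash.new(0)
  terms_by_movie.each_value do |terms|
    terms.uniq.each { |term| counts[term] += 1 }
  end
  counts
end

# Print the most common terms first; ties are ordered alphabetically.
term_counts(terms_by_movie).sort_by { |term, count| [-count, term] }.each do |term, count|
  puts "#{term}: #{count}"
end
```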

This bit of Ruby code queries the Yahoo term extraction API (you need an API key first):

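require 'net/http'
require 'rexml/document'

app_id = '***************'  # your Yahoo application ID
yahoo_uri = URI.parse('http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction')

# The text to analyze, e.g. one movie's subtitles (the file name is a placeholder)
text = File.read('subtitles.txt')

# POST the text to the service; the response body is XML
resp = Net::HTTP.post_form(yahoo_uri, { 'appid' => app_id, 'context' => text })

terms = REXML::Document.new(resp.body)

# Each extracted term comes back in its own <Result> element
terms.each_element('//Result') do |term|
  puts term.text  # print just the term, without the surrounding XML tags
end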

3 Responses to "Movie Word Cloud"

1 | dare

May 22nd, 2009 at 17:22

hacking & sharing, commendable.

nice visual redesign of the blog!

2 | dare

May 22nd, 2009 at 17:23

and for a better picture you'd need to merge hear/heard, thing/things, etc.

3 | Urban

May 24th, 2009 at 17:57

you're right.. stemming and lemmatization would help a lot. it hadn't even occurred to me.

