In my last article I showed an analysis of 617 movie scripts, identifying the most frequently spoken words in those movies as well as trends in positive and negative words. That analysis drew on different data sets, which means I had to do some data cleaning and blending. Today I’ll show you exactly what I did to clean and prepare the final data set using Pentaho Data Integration, a.k.a. Kettle.

The original data set for the movie scripts comes from a study by Cornell University. It consists of several text files, each with its own specific set of information. For my analysis I only used the movie titles metadata and the movie lines. For the positive/negative lexicon, I used yet another data set, from a study by the University of Illinois at Chicago.

Understanding the data cleaning process

The first step of my transformation was to divide the movie titles metadata in two different sets. The original file contains information about each movie, including the year of release, IMDB rating, movie ID and the genres for each movie. I wanted to separate the genres from the movies and create a “genre only” table that would be matched with the movies later. My goal was to let people filter the movies by a genre.

This is what the original file looks like:

And my goal was to make two different files.

One with the movies and the relevant data:

And another one with the matching between movie and genre:

So let’s explore how this was done.

Cleaning the genres and dividing one file in two

The first step of the transformation is to load the text file into Pentaho Data Integration. It’s very important to always review the options while setting up a step. It’s very common to drop a step into a transformation with almost zero configuration, which will often cause errors and performance issues.

In this case, the file doesn’t have a header and the separator isn’t a common one, so you’ll need to setup the step accordingly and define the field names by yourself.

The second step is a “Replace in string”, which I used to remove undesired characters from the genres and the year. Some of the years had a “/l” at the end (e.g. 1999/l), and the genres are enclosed in square brackets and apostrophes. For the latter I used a regular expression, and for the years a direct search.
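In Python terms, the same cleanup could be sketched roughly like this (the sample values are illustrative, not taken from the actual file):

```python
import re

# Hypothetical sample values: a year with a stray "/l" suffix,
# and genres wrapped in square brackets and apostrophes.
year = "1999/l"
genres = "['comedy', 'drama']"

# Direct search-and-replace for the year suffix.
clean_year = year.replace("/l", "")

# Regular expression stripping brackets and apostrophes from the genres.
clean_genres = re.sub(r"[\[\]']", "", genres)

print(clean_year)    # 1999
print(clean_genres)  # comedy, drama
```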

After that, the stream is split into two. When you split it, Pentaho Data Integration will ask whether you want to distribute the data between both streams or copy all of it into each one. In this case we want to copy, since each stream will use all of the data.

On the left I used the step “Split field to rows” so that each genre for each movie would be one unique line, instead of having one line for a combination of genres per movie. Then I remove all the unused fields and write a CSV file containing the matching table only.
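The effect of “Split field to rows” can be sketched in Python as exploding a comma-separated genre list into one row per movie/genre pair (the movie IDs and genres below are made up for illustration):

```python
# One row per movie, with a comma-separated genre list (illustrative data).
movies = [
    ("m42", "comedy, drama"),
    ("m43", "thriller"),
]

# Equivalent of "Split field to rows": one (movie_id, genre) pair per row.
matching = [
    (movie_id, genre.strip())
    for movie_id, genre_list in movies
    for genre in genre_list.split(",")
]

print(matching)
# [('m42', 'comedy'), ('m42', 'drama'), ('m43', 'thriller')]
```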

On the right I just had to remove the genres and write the useful data to a CSV file. I did use the JavaScript step to transform the movie names into “Title Case” by capitalizing each word, so that a title like “leaving las vegas” becomes “Leaving Las Vegas”. This was done purely for aesthetic purposes and has no impact on the information.
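The title-case logic from that JavaScript step amounts to something like this (a minimal sketch, not the actual step script):

```python
# Capitalize the first letter of each word in a movie title.
def to_title_case(title):
    return " ".join(word.capitalize() for word in title.split())

print(to_title_case("leaving las vegas"))  # Leaving Las Vegas
```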

Cleaning the movie lines and blending the opinion lexicon

The second and final part is to load the actual movie lines and blend in the opinion lexicon. I start by removing all the punctuation, again using the “Replace in string” step with a regular expression. Then I split the movie sentences into single words using the “Split field to rows” step with a blank space as the delimiter.
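The same two operations, punctuation removal followed by a split on spaces, could be sketched like this (the sample line is invented):

```python
import re

line = "They do not! They do to!"

# Strip punctuation with a regular expression, then split on blank spaces.
no_punct = re.sub(r"[^\w\s]", "", line)
words = no_punct.split(" ")

print(words)  # ['They', 'do', 'not', 'They', 'do', 'to']
```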

Now I have a huge set of words, but it is not yet ready to receive the opinion lexicon. To make sure all words are counted equally, I use the “String operations” step to lowercase every word, and then I filter out the stop words (in natural language processing, stop words are words that carry very little meaning, such as “and”, “the”, “a”, “I”, and are filtered out before or after processing).
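Lowercasing and stop-word filtering together look roughly like this (the word list and the tiny stop-word set are illustrative; a real stop-word list is much longer):

```python
words = ["The", "movie", "was", "a", "great", "movie"]

# Small illustrative stop-word list; a real one would be much longer.
stop_words = {"a", "and", "the", "i", "was"}

# Lowercase everything so words are counted equally, then drop stop words.
filtered = [w.lower() for w in words if w.lower() not in stop_words]

print(filtered)  # ['movie', 'great', 'movie']
```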

The blend is done by using the “Merge join” step twice: once for the positive words and once for the negative words. I used a left join in this case because there are words that are neither positive nor negative, and I want to keep those in my final data set as well.
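Conceptually, the two left joins attach a positive flag and a negative flag to each word, leaving both unset for neutral words. A simplified sketch with invented data:

```python
# Illustrative word stream and opinion lexicon.
rows = [("m42", "great"), ("m42", "movie"), ("m42", "awful")]
positive = {"great", "good"}
negative = {"awful", "bad"}

# Equivalent of two left "Merge join" steps: words matching neither
# lexicon are kept, with both flags left False.
result = [
    (movie_id, word, word in positive, word in negative)
    for movie_id, word in rows
]

print(result)
# [('m42', 'great', True, False), ('m42', 'movie', False, False),
#  ('m42', 'awful', False, True)]
```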

The result is a file with four fields: the movie ID, the word analyzed, and the information on whether it is a positive or a negative word (or neither, if it’s neutral).

Using the data sets to create a visualization

In my case, I used Tableau Public to create the visualization from the three data sets: one containing the information about the movies, another with the genres for each movie, and the last one with all the words (and their classification) for each movie. Tableau lets you easily build a visualization from the three files at the same time. If you haven’t seen the final result, I suggest you go back to the previous blog post and take a look.

If you want to see the transformation working, you can download it here. But you will also need to download the raw data sets to make it work. I won’t be making those available here because they’re big files (text files over 30 MB), but you can easily find them at the sources I mentioned earlier.

This was a brief example of data cleaning and blending that can be done with Pentaho Data Integration. Let me know if you have any doubts or questions.
