Data Cleaning & Blending with Pentaho Data Integration

In my last article I showed an analysis of 617 movie scripts, identifying the most frequently spoken words in those movies as well as the trend of positive and negative words. That analysis drew on several different data sets, which means I had to do some data cleaning and blending. Today I’ll show you exactly what I did to clean and prepare the final data set using Pentaho Data Integration, a.k.a. Kettle.


What is Pentaho Lazy Conversion and why should you use it?

Whether you’re using the CSV Input step or the Table Input step, you might have noticed the lazy conversion checkbox and wondered what it means. Or perhaps you have already run into an error because lazy conversion was enabled, such as:

There was a data type error: the data type of java.lang.String object [Hello, world!] does not correspond to value meta [String<binary-string>]

Often the reaction is simply to turn lazy conversion off, but that might (and probably will) hurt the overall performance of your transformation, all the more so if the input has thousands or hundreds of thousands of rows.
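To make the general idea concrete, here is a minimal Python sketch of lazy conversion as a concept: each field is kept as the raw bytes read from the input and is only decoded when something actually asks for its value. This is not Kettle's code; the names `LazyField` and `parse_line_lazily`, the `;` delimiter, and the sample row are all assumptions made up for illustration.

```python
class LazyField:
    """Hypothetical field wrapper: keeps raw bytes, decodes only on first access."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self.raw = raw
        self.encoding = encoding
        self._decoded = None
        self.conversions = 0  # how many times decoding was actually paid for

    def value(self) -> str:
        # Decode lazily and cache the result for later calls.
        if self._decoded is None:
            self._decoded = self.raw.decode(self.encoding)
            self.conversions += 1
        return self._decoded


def parse_line_lazily(line: bytes, delimiter: bytes = b";"):
    """Split one input line into fields without converting any of them."""
    return [LazyField(part) for part in line.rstrip(b"\r\n").split(delimiter)]


# Illustrative row with three fields; only the first one is ever needed.
row = parse_line_lazily(b"Hello, world!;42;2015-01-01\n")
greeting = row[0].value()
total_conversions = sum(field.conversions for field in row)
```

In this sketch only one of the three fields is ever decoded. Fields that simply pass through untouched never pay the conversion cost, which is roughly why switching the option off can slow down a transformation that reads many rows.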
