In my last article I showed an analysis of 617 movie scripts, identifying the most said words in those movies and also the trending of positive and negative words. That was done using different data sets, which means I had to do some data cleaning and blending. Today I’ll show you exactly what I did to clean and prepare the final data set using Pentaho Data Integration, a.k.a. Kettle.
Whether you’re using the CSV Input step or the Table Input step, you might have noticed the lazy conversion checkbox and wondered what it means. Or perhaps you’ve already faced an error caused by lazy conversion being enabled, such as:
There was a data type error: the data type of java.lang.String object [Hello, world!] does not correspond to value meta [String<binary-string>]
Often the reaction is simply to turn off lazy conversion, but that might (and probably will) hurt the overall performance of your transformation, all the more so if the input has thousands or hundreds of thousands of rows.
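To see why lazy conversion matters for performance, here is a minimal sketch (not Kettle's actual implementation; the `LazyValue` class and method names are hypothetical) of the idea behind it: the input step keeps the raw bytes it read from the file and defers decoding them into a `String` until some step actually needs the value. If no step ever inspects the field, the decoding cost is never paid.

```java
import java.nio.charset.StandardCharsets;

public class LazyConversionDemo {

    // Eager approach: decode every field to a String for every row,
    // whether or not any downstream step uses it.
    static String eager(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Lazy approach (hypothetical sketch): carry the raw bytes along the
    // stream and decode only on first access, caching the result.
    static class LazyValue {
        private final byte[] raw;
        private String decoded; // null until first access

        LazyValue(byte[] raw) {
            this.raw = raw;
        }

        String get() {
            if (decoded == null) {
                decoded = new String(raw, StandardCharsets.UTF_8);
            }
            return decoded;
        }
    }

    public static void main(String[] args) {
        byte[] row = "Hello, world!".getBytes(StandardCharsets.UTF_8);
        LazyValue value = new LazyValue(row);
        // Nothing has been decoded yet; a transformation that only copies
        // the field to an output file could skip decoding entirely.
        System.out.println(value.get()); // decoded here, on first use
    }
}
```

This also illustrates the error message above: a step that expects the lazy, binary-string representation (`String<binary-string>`) will complain when it receives an already-decoded `java.lang.String` instead.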
Sometimes you need to break your data stream into multiple flows, do some kind of manipulation, and then merge them back into the single stream they came from. Today I’ll help you consolidate the replicated columns that are created after joining these flows in Pentaho Data Integration, a.k.a. Kettle.
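As a rough sketch of the consolidation we'll be doing, consider a joined row where a field appears twice. Kettle disambiguates colliding field names with a numeric suffix (e.g. `city` and `city_1`); the sketch below (assuming that `_1` suffix convention, with a row modeled as a simple `Map`) keeps the first non-null value and drops the duplicate:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConsolidateColumns {

    // Collapse replicated columns created by a join: for each "<name>_1"
    // field, fill the base "<name>" field if it is null, then drop the copy.
    static Map<String, Object> consolidate(Map<String, Object> row) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            String key = e.getKey();
            if (key.endsWith("_1")) {
                String base = key.substring(0, key.length() - 2);
                // Use the replicated value only when the original is missing.
                if (out.get(base) == null) {
                    out.put(base, e.getValue());
                }
            } else {
                out.put(key, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("title", "Alien");
        row.put("city", null);        // missing in the left flow
        row.put("city_1", "Lisbon");  // present in the right flow
        System.out.println(consolidate(row)); // {title=Alien, city=Lisbon}
    }
}
```

In the actual transformation this logic would live in something like a Select values or User Defined Java Class step; the code is just to make the intent of "consolidating replicated columns" concrete.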