Whether you’re using the CSV Input step or the Table Input step, you might have noticed the lazy conversion checkbox and wondered what it means. Or perhaps you have already run into an error because lazy conversion was enabled, such as:
There was a data type error: the data type of java.lang.String object [Hello, world!] does not correspond to value meta [String<binary-string>]
Often the reaction is simply to turn off lazy conversion, but that might (and probably will) hurt the overall performance of your transformation, all the more so if the input has thousands or hundreds of thousands of rows.
Lazy Conversion – what does it do?
Consider a big CSV file with 30 columns and hundreds of thousands of rows as input. Pentaho Data Integration will try to identify all the columns and convert them into objects of their respective data types, such as numbers, dates or strings. But of these 30 columns, maybe only 5 matter for your transformation and will receive some kind of treatment. The other 25 may just be “passed through”.
That’s where lazy conversion comes in. It holds off any conversion for as long as it can, preventing Pentaho Data Integration from doing unnecessary work on data that won’t be manipulated, which makes your transformation run faster.
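To make the idea concrete, here is a minimal Java sketch of deferred conversion. It is not Pentaho Data Integration's actual code: the LazyField and LazyConversionSketch classes are hypothetical names made up for illustration, and the sketch assumes UTF-8 text. Each field keeps its raw bytes and only builds a String the first time a step asks for it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; these classes are not part of Pentaho Data Integration.
final class LazyField {
    private final byte[] raw;      // bytes as they would come from the input file
    private String converted;      // filled in on first access only
    static int conversions = 0;    // counts how many fields were actually converted

    LazyField(String text) {
        // Simulate reading raw bytes from a file.
        this.raw = text.getBytes(StandardCharsets.UTF_8);
    }

    // Convert only when a step really needs the value.
    String asString() {
        if (converted == null) {
            converted = new String(raw, StandardCharsets.UTF_8);
            conversions++;
        }
        return converted;
    }

    // Pass-through: hand the raw bytes on without converting them.
    byte[] rawBytes() {
        return raw;
    }
}

class LazyConversionSketch {
    // Builds a 30-column row, touches only the first `used` columns,
    // and returns how many conversions took place.
    static int convertedCount(int used) {
        LazyField.conversions = 0;
        LazyField[] row = new LazyField[30];
        for (int i = 0; i < row.length; i++) {
            row[i] = new LazyField("value" + i);
        }
        for (int i = 0; i < used; i++) {
            row[i].asString();
        }
        return LazyField.conversions;
    }

    public static void main(String[] args) {
        System.out.println("Converted fields: " + convertedCount(5) + " of 30");
    }
}
```

With 5 of the 30 columns in use, only those 5 pay the conversion cost. The other 25 travel through the row as untouched bytes.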
How to avoid the errors?
If you’re getting an error like the one described at the beginning, consider adding a Select Values step and converting the meta-data of your field in the Meta-data tab. This forces the conversion from binary to the actual data type, so the rest of your transformation knows what type your field is.
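The mismatch behind the error can be pictured with a small Java sketch. This is only a guess at the general mechanism, not PDI's real classes: Storage, FieldMeta and SelectValuesSketch are hypothetical names. The field's metadata promises binary storage, a step writes a plain String into it, and the check fails. Once the metadata is converted to normal storage, the same String is accepted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Pentaho Data Integration's actual classes.
enum Storage { NORMAL, BINARY_STRING }

final class FieldMeta {
    final String name;
    final Storage storage;

    FieldMeta(String name, Storage storage) {
        this.name = name;
        this.storage = storage;
    }

    // A value is valid only if its Java type matches what the metadata promises.
    void check(Object value) {
        boolean ok = (storage == Storage.BINARY_STRING)
                ? value instanceof byte[]
                : value instanceof String;
        if (!ok) {
            String type = (value == null) ? "null" : value.getClass().getName();
            throw new IllegalStateException(
                    "data type error on field " + name + ": " + type + " does not match " + storage);
        }
    }
}

class SelectValuesSketch {
    // Mimics the metadata conversion: the field is now described as a normal String.
    static FieldMeta toNormal(FieldMeta lazy) {
        return new FieldMeta(lazy.name, Storage.NORMAL);
    }

    // Decodes the lazily held bytes into a real String (UTF-8 assumed).
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Returns true if the metadata rejects the given value.
    static boolean rejects(FieldMeta meta, Object value) {
        try {
            meta.check(value);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        FieldMeta lazyZip = new FieldMeta("zip", Storage.BINARY_STRING);
        System.out.println("Lazy field rejects a String: " + rejects(lazyZip, "00000"));
        System.out.println("Converted field rejects a String: " + rejects(toNormal(lazyZip), "00000"));
    }
}
```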
A little experiment
To verify this, I ran a test based on a 115 MB CSV file with 1 million rows (a slightly modified version of the file Matt Casters made available on his blog). The file has a field called zip, which might contain no data (null). My transformation simply replaces the nulls with a default value using the If field value is null step.
The image below shows the transformation running with the lazy conversion option turned off (with it on, you would get the data type error).
And this other image shows the same transformation, but with lazy conversion enabled and a Select Values step converting the data type.
From 3.3 to 1.4 seconds: the transformation now runs about 2.4 times as fast, which cuts the run time by roughly 58%. And that’s why you should think twice before turning off the lazy conversion option in your input steps!