Whether you’re using the CSV Input step or the Table Input step, you might have noticed the lazy conversion checkbox and wondered what it means. Or you may already have faced an error caused by lazy conversion being enabled, such as:

There was a data type error: the data type of java.lang.String object [Hello, world!] does not correspond to value meta [String<binary-string>]

A common reaction is to simply turn off lazy conversion, but that might (and probably will) hurt the overall performance of your transformation, especially if the input has thousands or hundreds of thousands of rows.

Lazy Conversion – what does it do?

Suppose your input is a big CSV file with 30 columns and hundreds of thousands of rows. Pentaho Data Integration will try to identify all the columns and convert them into objects of their respective data types, such as numbers, dates or strings. But of those 30 columns, maybe only 5 are important for your transformation and will receive some kind of treatment. The other 25 may just be passed through.

That’s where lazy conversion comes in. It holds off any conversion for as long as it can, preventing Pentaho Data Integration from doing unnecessary work on data that won’t be manipulated, and thus making your transformation run faster.
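The core idea can be sketched in a few lines of Java. This is not Kettle’s actual internal API, just an illustration of the pattern: keep the raw bytes from the file and only pay the parsing cost if and when a field is actually used.

```java
import java.nio.charset.StandardCharsets;

// A field that keeps the raw CSV bytes and parses them only on demand.
// Illustrative sketch of the lazy-conversion idea, NOT Kettle's real API.
class LazyField {
    private final byte[] raw; // bytes read straight from the file
    private Long parsed;      // cached result of the conversion

    LazyField(byte[] raw) { this.raw = raw; }

    // The conversion cost is only paid if (and when) the value is needed.
    long asLong() {
        if (parsed == null) {
            parsed = Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
        }
        return parsed;
    }

    // Pass-through fields can be written back out without ever being parsed.
    byte[] asBytes() { return raw; }
}
```

A field that is only passed through to the output calls `asBytes()` and never parses anything; only the 5 columns you actually manipulate ever trigger a conversion.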

How to avoid the errors?

If you’re getting an error like the one described at the beginning, consider adding a Select Values step and converting the meta-data of your field in the respective tab. This way the binary-to-data-type conversion happens there, and the rest of your transformation will know what type your field actually is.
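Conceptually, what that Select Values step does for a lazy field is force the conversion once, so every downstream step sees an ordinary typed value instead of raw bytes. A minimal sketch of that idea (the row layout and method names here are assumptions for illustration, not Kettle code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Force the binary-string -> typed-value conversion for selected columns,
// leaving the pass-through columns as raw bytes. Illustrative only.
class MaterializeExample {
    // Convert just the columns that downstream steps will manipulate.
    static Object[] materialize(Object[] row, int... columns) {
        Object[] out = Arrays.copyOf(row, row.length);
        for (int c : columns) {
            if (out[c] instanceof byte[]) {
                out[c] = new String((byte[]) out[c], StandardCharsets.UTF_8);
            }
        }
        return out;
    }
}
```

The point is that the conversion happens exactly once, and only for the columns you name, which is why this beats disabling lazy conversion for the whole input.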

A little experiment

To verify this, I ran a test based on a 115 MB CSV file with 1 million rows (a slightly modified version of the file Matt Casters made available on his blog). The file has a field called zip, which may contain no data (null). My transformation simply replaces the nulls with a default value using the If field value is null step.
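The null-replacement logic in the experiment can be sketched for a lazily-converted field: an empty byte array means no zip code was present, so a default is substituted. The method name and default value here are assumptions for illustration, not part of the actual step.

```java
import java.nio.charset.StandardCharsets;

// Sketch of "If field value is null" applied to a lazily-converted field:
// null/empty raw bytes are replaced by a default. Illustrative only.
class ZipDefault {
    static String zipOrDefault(byte[] rawZip, String dflt) {
        if (rawZip == null || rawZip.length == 0) {
            return dflt; // no data -> substitute the default value
        }
        return new String(rawZip, StandardCharsets.UTF_8);
    }
}
```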

The image below shows the transformation running without the lazy conversion option (otherwise it would hit the data type error).

Without Lazy Conversion

And this other image shows the same transformation, but with lazy conversion and a Select Values step converting the data-type.

With Lazy Conversion

From 3.3 to 1.4 seconds – the transformation ran roughly 2.4 times faster. And that’s why you should think twice before turning off the lazy conversion option in your input steps!

2 thoughts on “What is Pentaho Lazy Conversion and why you should use it?”

  1. I have a similar issue, but in my case some columns are not present in the CSV, though those columns do exist in the DB table. So when I tried to load the CSV into the table, it threw an error saying the columns are invalid. So I unchecked the ‘Lazy conversion’ check box and it worked. Is that fine? Or can I use Select Values in my case as well (even though I don’t have all those columns in the CSV)?

    Thanks,
    Souvik.

    1. Hi Souvik. Could you share the exact error message? I believe your case is different because the input has fewer columns than the output. If you’re getting the same error as in the blog post, you could always try the Select Values step so Pentaho Data Integration can understand the extra columns (even if it won’t do anything with them), and see whether you gain any performance.
