Sometimes you need to break your data stream into multiple flows, do some kind of manipulation and then get them all back into the single stream it was before. Today I’ll be helping you consolidate the replicated columns that are created after joining these flows in Pentaho Data Integration, a.k.a Kettle.

There are many ways to achieve this goal. One of them is using the Modified JavaScript Value step. Now, download the transformation and let’s see in a step by step how I did it.

Let’s say I have a database from a company that exports fruits. They want to compare United States with all the other countries, aggregated. They want this report delivered every week by email, in an Excel file. Everything can be done in Pentaho Data Integration.

Basically, this company wants me to “group by” all the sales, of all fruits, and aggregate all countries that are not United States together, like this:

consolidating-replicated-columns-1

There are easier ways of achieving my end result with a simple and small data set like this one. But the goal is to show a very useful trick for a problem that happens more often than you can imagine.

The transformation

This is what the transformation looks like:

consolidating-replicated-columns

Let’s do a quick overview of the main steps:

  1. Starts by filtering out United States;
  2. Aggregates both streams by Fruits, adding up all sales (Quantity);
  3. Creates a Country constant called “Others” for the second stream;
  4. Does an outer join, because we want all the rows from both sources;
  5. Consolidates the columns with the Modified JavaScript Value step;
  6. Sort and removes the unused columns from the stream.

The important step here is the JavaScript, so before taking a closer look into it, let’s see how our stream will be right before it.

consolidating-replicated-columns-2

Because of the outer join, we found ourselves with two sets of our main columns, the first for United States and the second for the others. Sometimes we have null values in one side (when United States never imported a specific fruit), sometimes we have null values in the other side (when the other countries never imported that fruit), and, in some cases, we will have both columns filled.

That’s half the job done. Now let’s see how the consolidation works

JavaScript

The script starts by verifying if both columns have values. If true, it will create a new row and add the value of the others in it.
If only one of the columns has value, then it will grab that one.

These are the fields the Modified JavaScript step should be expecting:

consolidating-replicated-columns-4

Remember to set them all to String type! Even if they’re a number or integer. This is needed because of the way the scripts create the Row object. You can easily change this in the Select values step later.

After the execution of the Modified JavaScript Step, you will have the following result:

consolidating-replicated-columns-3

Now you only need to use the Select values step to remove the old columns, rename the consolidated ones and change the metadata as needed.

Leave a Reply