Filter for Unique Values

The generation rule "Unique" filters source values so that only unique values are produced. It can also ensure uniqueness across multiple runs (several files, or the same file generated several times) by maintaining a file of the unique values already produced.

For example, a sequence of source values such as "A,B,B,C,A,D,B" will be transformed into "A,B,C,D".
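The behavior described above can be illustrated with a minimal Python sketch. It is not the rule's actual implementation: the function name `unique_filter`, the `seen_path` parameter, and the one-value-per-line file format are assumptions made for the example.

```python
import os


def unique_filter(source, seen_path=None):
    """Return the source values in order, keeping only first occurrences.

    Hypothetical helper: if seen_path is given, values produced by earlier
    runs are reloaded from that file (one value per line, an assumed format)
    and the newly produced values are appended to it.
    """
    seen = set()
    if seen_path and os.path.exists(seen_path):
        with open(seen_path, encoding="utf-8") as f:
            seen.update(line.rstrip("\n") for line in f)

    produced = []
    for value in source:
        if value not in seen:
            seen.add(value)
            produced.append(value)

    # Remember the new values so that a later run does not produce them again.
    if seen_path and produced:
        with open(seen_path, "a", encoding="utf-8") as f:
            f.writelines(value + "\n" for value in produced)
    return produced
```

Called without a file, `unique_filter("A,B,B,C,A,D,B".split(","))` keeps only the first occurrence of each value. Called twice with the same `seen_path`, the second call skips every value the first call already produced.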


The principle of the Unique generation rule is very simple, but its configuration must be adjusted to the number of different values the source may produce and to the frequency at which new unique values are provided.

Prevent Infinite Loops and Deadlocks

The parameter "maximum number of attempts" prevents the rule from falling into an infinite loop when it keeps asking the source for a value that it has not provided yet. Because the source may produce only a limited number of different values, the threshold must match the length of the longest sequence of source values that brings no new unique value.

For example, consider a source that starts by producing values from a subset of 4 different values such as "A,B,C,D" two thousand times, then continues by producing values from another subset such as "E,F,G,H" four thousand times, and so on.

For the Unique generation rule to produce the unique values from the second sequence, it has to be configured with a "max number of attempts" greater than 2000 (the length of the first sequence of source values). Then, to produce the unique values coming after the second sequence, it has to be configured with a "max number of attempts" greater than 4000 (the length of the second sequence of source values).

Setting a very high value for "max number of attempts" does not consume extra resources during generation. However, the higher the threshold, the longer the rule takes to detect that the source has no more unique values to provide.
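The effect of the threshold can be sketched in Python as follows. This is an illustration only, not the rule's internal algorithm: the names `next_unique`, `make_source` and `generate` are hypothetical, and the source is assumed to deliver 2000 values in total from "A,B,C,D" followed by 4000 values in total from "E,F,G,H".

```python
from itertools import chain, cycle, islice


def next_unique(source_iter, seen, max_attempts):
    """Return the next unseen value, or None once max_attempts draws fail."""
    for _ in range(max_attempts):
        try:
            value = next(source_iter)
        except StopIteration:
            return None  # the source itself is exhausted
        if value not in seen:
            seen.add(value)
            return value
    return None  # give up instead of looping forever


def make_source():
    # Assumed source: 2000 draws from "ABCD", then 4000 draws from "EFGH".
    first = islice(cycle("ABCD"), 2000)
    second = islice(cycle("EFGH"), 4000)
    return chain(first, second)


def generate(max_attempts):
    """Collect unique values until next_unique gives up."""
    source, seen, produced = make_source(), set(), []
    while True:
        value = next_unique(source, seen, max_attempts)
        if value is None:
            return produced
        produced.append(value)
```

With a threshold of 100 the sketch stops after the first four values, because it gives up before the source moves on to the second subset. With a threshold above 2000 it also reaches the values of the second subset, and then needs up to that many failed draws before it concludes that nothing new is coming.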

Cache Control

You can control the cache mechanism to optimize the internal algorithm of the rule. There are two levels of cache. The first is in memory, and you can select how many of the most frequent values provided by the source are kept in this "in memory" cache.

The second cache is in a file. When a value is not found in the memory cache, it is searched for in the file cache.

At the end of the generation, the content of the "in memory" cache is saved into the "in file" cache, which then contains all the unique values provided by the rule.
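The two cache levels can be pictured with the following Python sketch. It only illustrates the lookup order described above and is not the rule's actual implementation: the class `TwoLevelCache`, its methods, the eviction of the least frequently seen value, and the one-value-per-line file format are all assumptions.

```python
import os
from collections import Counter


class TwoLevelCache:
    """Hypothetical two-level store of the values already produced."""

    def __init__(self, path, memory_size):
        if memory_size < 1:
            raise ValueError("memory_size must be at least 1")
        self.path = path                # "in file" cache, one value per line
        self.memory_size = memory_size  # capacity of the "in memory" cache
        self.hits = Counter()           # "in memory" cache: value -> times seen

    def _in_file(self, value):
        if not os.path.exists(self.path):
            return False
        with open(self.path, encoding="utf-8") as f:
            return any(line.rstrip("\n") == value for line in f)

    def _append(self, value):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(value + "\n")

    def check_and_add(self, value):
        """Return True if value is new (and record it), else False."""
        if value in self.hits:          # level 1: memory
            self.hits[value] += 1
            return False
        if self._in_file(value):        # level 2: file
            return False
        if len(self.hits) >= self.memory_size:
            # Memory is full: move the least frequently seen value to the file.
            victim = min(self.hits, key=self.hits.get)
            del self.hits[victim]
            self._append(victim)
        self.hits[value] = 1
        return True

    def close(self):
        """End of generation: save the memory cache into the file cache."""
        for value in self.hits:
            self._append(value)
        self.hits.clear()
```

In this sketch a larger `memory_size` means fewer file scans for frequently repeated values, at the cost of more memory. After `close()`, the file holds every unique value that was produced.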
