Data Source Sampling

The primary goal of a sampler is to provide only a subset of the data from its source. Values are selected from the source according to a sampling frame and a sampling method applied inside that frame.

You can visit the Wikipedia page on data sampling if you want to learn more about sampling from a statistical point of view.

Why Use a Sampler?

A sampler is often used to introduce randomness into a finite sequence of ordered values. In front of the generation rule that produces the ordered sequence, you can place both a loop-back, which prevents generation from stopping at the end of the source sequence, and a sampler, which jumps from one value to the next with random steps.

For example, suppose you have a data file containing alphabetically ordered strings that represent city names. If you use it as-is to generate addresses, the locations will come out sorted alphabetically.

If you put a sampler in front of the generator providing the city names, the selected names will still be sorted, but some values will be skipped by the sampler and never appear. When the end of the list of city names is reached, the source becomes exhausted, and so does the sampler.

To avoid stopping at the end of the list and instead start again from the beginning, you can use a loop-back in front of the sampler. A new sampling sequence then starts and yields different values, since the steps between two successive values are unlikely to be the same as in the previous sampling sequence.
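
The city-name scenario can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tool's implementation: the names `random_step_sampler` and `max_step` are hypothetical, the city list is made up, `itertools.cycle` stands in for the loop-back, and a uniform random step between 1 and `max_step` stands in for whatever step generator you configure.

```python
import random
from itertools import cycle, islice

# Hypothetical, alphabetically ordered source data.
cities = ["Amsterdam", "Berlin", "Copenhagen", "Dublin", "Helsinki", "Lisbon",
          "Madrid", "Oslo", "Paris", "Rome", "Vienna", "Warsaw"]


def random_step_sampler(values, max_step, rng):
    """Yield values from `values`, advancing by a random step in 1..max_step."""
    it = iter(values)
    while True:
        try:
            value = next(it)
        except StopIteration:
            return  # the source is exhausted, so the sampler is too
        yield value
        # Discard step - 1 values so the next one yielded is `step` positions on.
        for _ in islice(it, rng.randint(1, max_step) - 1):
            pass


rng = random.Random(42)  # seeded only to make the sketch reproducible

# Without a loop-back: names stay sorted, some are skipped, and the
# sequence ends when the list ends.
single_pass = list(random_step_sampler(cities, 3, rng))

# With a loop-back (assumed here to behave like itertools.cycle) in front of
# the sampler: generation continues past the end of the list, so we simply
# take as many names as we need.
looped = list(islice(random_step_sampler(cycle(cities), 3, rng), 20))
```

Because the steps are random, a second pass over the list is unlikely to land on the same names as the first one, which is the effect described above.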

Configuring the Sampler

To configure a sampler you need to specify the source of the sampled data, the sampling frame and the sampling steps generator.


The sample frame identifies the values in the source data that are candidates for sampling. To define the frame, you must provide the index of the first value to use when starting a new sampled sequence and the number of values to provide from the source before becoming exhausted.

If you use -1 to configure the sample size then the sample will not end before the source itself ends.

You also have to provide the generation rules for the source data and for the sampling steps. The first can provide values of any kind, and those values are the ones that get sampled. The second should provide numerical values that represent the step between successive values taken from the source.


Consider a source to be sampled that is simply a sequence of integers starting at 0 and incrementing by 1. The source values will be 0, 1, 2, ...

A sampler configured with

  • a start index of 2,
  • a sample size of 10,
  • a constant step between sampled values of 3

will provide the following sequence: 2, 5, 8, 11, 14, 17, 20, 23, 26, 29
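
The example above can be reproduced with a small Python sketch. The function `sample` and its parameters `start`, `size` and `steps` are hypothetical names chosen for illustration; they mirror the start index, sample size and step generator described in this section but are not the tool's actual API. The sketch also assumes that steps are whole numbers and treats any step below 1 as 1.

```python
from itertools import count, repeat


def sample(source, start, size, steps):
    """Yield `size` values from `source`, beginning at index `start` and
    advancing by the amounts drawn from `steps` (size == -1: no limit)."""
    it = iter(source)
    step_values = iter(steps)
    emitted = 0
    skip = start  # number of source values to discard before the next one
    while size < 0 or emitted < size:
        try:
            for _ in range(skip):
                next(it)
            value = next(it)
        except StopIteration:
            return  # the source ended before the sample did
        yield value
        emitted += 1
        step = next(step_values, None)
        if step is None:
            return  # the step generator is exhausted
        # Assumption: steps below 1 are treated as 1.
        skip = max(int(step), 1) - 1


# Source 0, 1, 2, ... with start index 2, sample size 10, constant step 3.
result = list(sample(count(0), start=2, size=10, steps=repeat(3)))
print(result)  # [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]

# A sample size of -1 runs until the source itself ends.
unbounded = list(sample(range(10), start=0, size=-1, steps=repeat(4)))
print(unbounded)  # [0, 4, 8]
```

Passing any other iterable of numbers as `steps`, for instance random integers, gives the randomized behaviour described earlier.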
