An Optimized Data Filtering Method

Necmi Kılıç
2 min read · Mar 4, 2020

Rule-based data filtering is a common step in most applications and ETL jobs. When the data is huge (big data may be the better term), filtering carries not only a memory cost but also a performance/time cost.

Assume you have a data set of 100K records and the service/API applies 3 rule-based filters when building a response. The number of records eliminated by each rule is:

Rule A: 3000

Rule B: 15000

Rule C: 80000

In this case, the number of rule evaluations is roughly: 100K (for Rule A) + 97K (for Rule B) + 82K (for Rule C) = 279K.

What happens if we run the rules in descending order of eliminated records (Rule C, Rule B, Rule A)?

In this case, the number of evaluations drops to:

100K (for Rule C) + 20K (for Rule B) + 5K (for Rule A) = 125K.

Note that this assumes the rules are applied sequentially, processing the data row by row.

You can see how important it is to start filtering with the rule that eliminates the most records.
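Here is a minimal Python sketch of the idea. The rules and thresholds are made up purely to reproduce the numbers above (3K, 15K and 80K eliminations on 100K records), and the eliminated sets are assumed to be disjoint, so each rule removes the same number of records regardless of order:

```python
# Hypothetical rules chosen so that they reproduce the article's numbers.
data = list(range(100_000))

# Each rule returns True if the record should be KEPT.
rule_a = lambda x: not (0 <= x < 3_000)        # eliminates  3,000 records
rule_b = lambda x: not (3_000 <= x < 18_000)   # eliminates 15,000 records
rule_c = lambda x: not (18_000 <= x < 98_000)  # eliminates 80,000 records

def filter_and_count(records, rules):
    """Apply rules sequentially, row by row, counting predicate evaluations."""
    evaluations = 0
    for rule in rules:
        evaluations += len(records)            # every surviving record is checked
        records = [r for r in records if rule(r)]
    return records, evaluations

_, cost_original  = filter_and_count(data, [rule_a, rule_b, rule_c])
_, cost_reordered = filter_and_count(data, [rule_c, rule_b, rule_a])

print(cost_original)   # 279000
print(cost_reordered)  # 125000
```

The surviving records are identical in both runs; only the amount of work differs.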

To accomplish this, you can use your own algorithm or implementation. The easiest way is to reorder the rules manually, provided you have comprehensive knowledge of how selective each rule is.

The other way is to automate it: create an observer (listener) that collects elimination statistics, and have the service/API read those statistics at initialization to re-order the rule set.
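A rough sketch of that automated approach is below. The class name and structure are illustrative, not from any particular library; the usage lines at the bottom reuse rule_a, rule_b and rule_c from the earlier sketch. The observer simply records how many records each rule eliminates, and the chain is re-sorted by those counts at startup or periodically:

```python
from collections import defaultdict

class AdaptiveFilterChain:
    def __init__(self, rules):
        self.rules = list(rules)              # (name, predicate) pairs
        self.eliminated = defaultdict(int)    # observed eliminations per rule

    def apply(self, records):
        for name, rule in self.rules:
            before = len(records)
            records = [r for r in records if rule(r)]
            self.eliminated[name] += before - len(records)   # observer step
        return records

    def reorder(self):
        """Put the most selective rules first, based on collected statistics."""
        self.rules.sort(key=lambda nr: self.eliminated[nr[0]], reverse=True)

chain = AdaptiveFilterChain([("A", rule_a), ("B", rule_b), ("C", rule_c)])
chain.apply(list(range(100_000)))   # first pass collects statistics
chain.reorder()                     # subsequent passes run C, B, A
```

In a real service you would persist the statistics (a file, a table, a metrics store) so the reordered rule set is available the next time the service starts.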
