5 min read

Data pipeline: Strategies to manage invalid records

Rédigé par Vincent Doba

Vincent Doba

Data validation is an important step of any data pipeline. It usually comes just after raw ingestion and before application of business rules. Usually, Validation are either technical or business. A technical validation can be checking that a field containing a date has the right date format. A business validation can be checking that some captors measure are not aberrant.

A basic validation job treats all input records, labeling each of them either as valid or invalid. You pass valid records to the next step of your data pipeline but what to do with invalid records ?

We have three strategies: drop them, save them in a quarantine table or process them along with valid records.

Note: in this article, we will not treat logging strategy to report validation failure to data pipeline managers. However, logging records validation failures is important and we encourage you to do so. Especially if you choose to drop your invalid records.

Drop invalid records

The first strategy is to drop the invalid records altogether: you filter out every records that does not pass validation.

Drop invalid records in your data pipeline

Advantages and drawbacks

This is the simplest strategy, as you don’t care about what to do with invalid records. You don’t have to think about what to do of invalid records.

However, dropping invalid records means you can’t analyze why the records didn’t match validation criteria. Also, you can’t reprocess only the invalid records, you have to read all your input records. Moreover, if the input source is a stream of records, you will lose invalid records permanently.

Use cases

You should use this strategy when in your data pipeline there is no need to analyze, report or reprocess invalid data. It is seldom but it may exist.

It can be the case when filtering out invalid data is part of the process: you have records coming from all part of the world, but you consider as valid only the records coming from a specific country. In this case, your validation rule makes it useless to analyze as you know why you reject invalid records. You will not reprocess as there is no reason that you will accept another country than the one you already keep. And you will not report records that fail your specific business validation rules to upstream source.

Another case is when you’re validating records from an external source that has a fixed data format/structure. For instance all the open data sources available on the web. As it is useless to report invalid records as external source will not change its format/structure to comply to your validation rules, you can drop invalid records.

However, be aware that if your upstream source changes, you may drop lots of records without noticing what’s wrong. So ensure that you have rules that are compatible with an upstream change.

Save invalid records in quarantine table

The second strategy is to store invalid records into its own quarantine database

Save failed records in quarantine table in your data pipeline

Advantages and drawbacks

You now don’t lose any record, you either process them or store them in a quarantine database. You can request this quarantine table to analyze why a record is invalid or do statistics over invalid records. It is now possible to reprocess invalid records or report them to upstream source.

However, it means creating and managing a new database in your data pipeline, along with the one with valid records. For instance if there is a schema change in your input data, you have to make sure to apply those changes not only on main data pipeline, but also on quarantine table.

It can be particularly tricky when you want to validate again records from this quarantine table: valid records after second pass have to be deleted from quarantine table, and records still invalid after second pass have either to be dropped or to erase their previous version in quarantine table. So you have to do some kind of join between your quarantine table and your reprocessed records in order to clean your quarantine table.

Use cases

It is a very common strategy. You will use this strategy when you want to keep a trace of your invalid records, but you want to clearly split invalid and valid records and to process only valid records for your downstream data pipeline.

For instance, you can use this strategy when you’re developing your data pipeline. By having a quarantine table, you can look at which invalid records you encounter and tune the validation rules or setup transformation to transform invalid records to valid records

As you don’t mix invalid and valid records, downstream only processes valid data records. So it is also to use when you don’t control your downstream or you want to have clean data for downstream. For instance if you send your data to an external business analyzing tool that is expecting only valid records.

Save invalid records with valid records

The third strategy to manage invalid records in your data pipeline by tagging them as invalid but saving them along with valid records

keep invalid records in your data pipeline

Advantages and Drawbacks

As for the quarantine table strategy, you keep all records, valid or invalid. But you don’t need to manage new database where to store invalid records, you pass records to downstream. It means that for instance if the data schema changes, you apply it only one time and you don’t need to remember to apply it to another table.

Moreover, if you reprocess your invalid records, it will be easier to remove now valid records. Indeed, it becomes a simple removing duplicates work: remove duplicate records and keep in priority the valid ones.

However, mixing invalid records with valid records means that your downstream processes have to take into account invalid records.

Use cases

Use this strategy when your validation are actually more warnings than errors. So there is no reason to not process invalid records.

You can also use this strategy when you have several different processing downstream of your data pipeline, each processing focusing only on parts of your records data. If you add good information metadata to your invalid records, and your downstream jobs or users can choose to process invalid records according to their needs. Your validation job doesn’t decide for them, it simply informs them.

For instance, imagine you have records representing orders, with customer name and order’s value. In your validation job you have a rule that check a given name have to be less than 16 characters. If you choose this method, a downstream job computing the total order values can ignore the name validation rule. Another one, extracting the top ten customers having the most orders can enforce the name validation rule.

Thus, this strategy is well suited when you feed a BI tool, with data analysts that will play with your data. By doing so, you’re giving them the more transparent view of the data. It’s up to them to choose if a validation rule should be applied for their analyze or not.

However, to do so you need detailed information about invalid fields in your records. You have to enrich your records with error metadata listing all invalid fields and explaining for each of them why they are invalid.

Conclusion

In this post, we saw three strategies to manage invalid records in a data pipeline. First you can drop them. Second you can save them in a quarantine table. And third you can tag them and send them downstream along with valid records.

Each strategy has its advantages and drawbacks. You should use them for different use cases:

  • Drop invalid records if you don’t care about analyzing them.
  • Save them in quarantine table if you don’t want to mix them with valid records.
  • Keep them with valid records if you want to provide all records to your downstream processing.

Of course, you can use one strategy for a part of your data pipeline and another for another part. For instance, imagine a data pipeline with two validation steps. You can keep invalid records in a quarantine table for one validation step and drop invalid records for the other.

One subject is pretty critical when you want to keep invalid records, either in quarantine table or with your valid record, is what information to save to help analyzing invalid records. It will be the subject of a future article.

If you need to design a data pipeline, feel free to contact us !

Cet article à été écrit par

Vincent Doba

Vincent Doba

Suivre toutes nos actualités

Data migration: Thinking about using AWS Data Pipeline? Think twice

4 min read

Machine learning metrics are as essential as your model

4 min read

Fundamentals of NLP with multi-choice question generation

6 min read