Data validation is an activity that verifies whether or not a combination of values is a member of a set of acceptable combinations (ESS Handbook on Methodology for Data Validation, page 8).
The set of ‘acceptable values’ may be a set of possible values for a single field, but under this definition it may also be a set of valid value combinations for a record, a column, or a larger collection of data. We emphasise that the set of acceptable values need not be defined extensionally, that is, by explicitly listing all acceptable combinations; it can also be specified implicitly, for instance through rules. This broad definition of ‘data’ is introduced so that data validation refers both to micro data and to macro (aggregated) data.
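To make the distinction concrete, the following sketch (in Python, with hypothetical field names, domains and rules) checks a single-field domain explicitly and a record-level combination implicitly through a rule:

```python
# Hypothetical record used only for illustration.
record = {"age": 34, "marital_status": "married"}

# Field-level domain: the acceptable values for a single field, listed explicitly.
MARITAL_DOMAIN = {"single", "married", "divorced", "widowed"}
field_ok = record["marital_status"] in MARITAL_DOMAIN

# Record-level rule: the acceptable combinations are defined implicitly,
# e.g. "a married person must be at least 16 years old" (illustrative rule).
record_ok = not (record["marital_status"] == "married" and record["age"] < 16)

print(field_ok, record_ok)  # True True
```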
Data validation assesses the plausibility of data: a positive outcome will not guarantee that the data are correct, but a negative outcome will guarantee that the data are incorrect.
Data validation is a decisional procedure ending with the acceptance or refusal of data. The decisional procedure is generally based on rules expressing the acceptable combinations of values. Rules are applied to data: if the data satisfy the rules, meaning that the constraints expressed by the rules are not violated, the data are considered valid for their intended final use. There is of course the possibility of using the complementary approach, in which rules are expressed in ‘negative form’: in this case data are validated by verifying that predefined non-acceptable combinations of values do not occur.
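As a minimal, purely illustrative sketch (the record, field names, and rules below are hypothetical), the same record can be checked against rules in positive form and against the complementary negative form:

```python
record = {"turnover": 1200.0, "costs": 800.0, "profit": 400.0}

# Positive form: the record is accepted if every rule evaluates to True.
positive_rules = [
    lambda r: r["turnover"] >= 0,
    lambda r: abs(r["turnover"] - r["costs"] - r["profit"]) < 1e-9,
]
valid_positive = all(rule(record) for rule in positive_rules)

# Negative form: the record is accepted if no predefined non-acceptable
# combination of values occurs.
forbidden_combinations = [
    lambda r: r["turnover"] < 0,
    lambda r: r["costs"] > r["turnover"] and r["profit"] > 0,
]
valid_negative = not any(check(record) for check in forbidden_combinations)

print(valid_positive, valid_negative)  # True True
```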
Sometimes the rules used in a validation procedure are split into hard/fatal edits and soft/query edits, and non-acceptable values are classified as either ‘erroneous’ or ‘suspicious’ depending on whether they fail hard edits or soft edits. Hard edits are generally rules that must necessarily be satisfied for logical or mathematical reasons (e.g., children cannot be older than their parents). An example of a query edit taken from the UNECE glossary on statistical data editing is ‘a value that, compared to historical data, seems suspiciously high’, while an example of a fatal edit is ‘a geographic code for a country province that does not exist in a table of acceptable geographic codes’. This distinction is important information for the related ‘editing’ phase. In addition to this information, a data validation procedure may assign a degree of failure (severity) that is important for the data editing phase and for the tuning of data validation. Taking the soft-edit example mentioned above, severity can be evaluated by measuring the distance of the actual value from the historical one.
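A minimal sketch of how the two kinds of edits and a severity measure might be combined is given below; the code table, threshold, and field names are hypothetical assumptions, and severity is taken here as the relative distance from the historical value:

```python
VALID_PROVINCE_CODES = {"AG", "MI", "RM", "TO"}  # hypothetical table of acceptable codes

def validate(record, historical_value, threshold=0.5):
    findings = []
    # Hard/fatal edit: the geographic code must exist in the table of acceptable codes.
    if record["province"] not in VALID_PROVINCE_CODES:
        findings.append(("erroneous", "unknown province code", None))
    # Soft/query edit: a value that looks suspiciously high compared to historical data;
    # severity is measured as the relative distance from the historical value.
    severity = abs(record["value"] - historical_value) / abs(historical_value)
    if severity > threshold:
        findings.append(("suspicious", "large change w.r.t. historical value", severity))
    return findings

print(validate({"province": "XX", "value": 950.0}, historical_value=500.0))
# [('erroneous', 'unknown province code', None),
#  ('suspicious', 'large change w.r.t. historical value', 0.9)]
```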
When a rule fails, the data are exported from the data validation procedure or flagged accordingly and handled by the editing staff, who either correct the values so that the rules are satisfied, or judge the data acceptable, in which case the rules of the data validation procedure are updated. The data validation process is thus an iterative procedure based on the tuning of rules, converging to a set of rules regarded as the minimal set of relations that must necessarily be satisfied.
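The control flow of one tuning iteration could be sketched as follows; this is purely illustrative, and the `review` function is a hypothetical stand-in for the editing staff's decision to correct a record or to accept it and drop the rule:

```python
def tune_rules(data, rules, review):
    """One tuning iteration: for every rule failure, `review(record, rule)`
    returns a corrected record, or None if the data are judged acceptable."""
    kept_rules = []
    for rule in rules:
        keep = True
        for i, record in enumerate(data):
            if not rule(record):
                corrected = review(record, rule)
                if corrected is not None:
                    data[i] = corrected  # values corrected so the rule is satisfied
                else:
                    keep = False         # data accepted: the rule is updated/dropped
                    break
        if keep:
            kept_rules.append(rule)
    return data, kept_rules              # repeat until the rule set stabilises
```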
The UNECE glossary on statistical data editing gives a different definition of data validation:
An activity aimed at verifying whether the value of a data item comes from the given (finite or infinite) set of acceptable values.
In this definition, the validation activity refers to a single data item, without any explicit mention of verifying consistency among different data items. If the definition is interpreted as stating that validation verifies that the values of single variables belong to predefined sets of values (domains), it is too strict, since important activities generally considered part of data validation are left out. On the other hand, if the acceptance or rejection of a data item were intended as the final action resulting from some complex procedure of error localization, the definition would be too inclusive, since it would also involve phases of the editing process not strictly related to validation.