Data validation

Data graphing begins with the assumption that all data is imperfect and its quality must be validated.

Data that is hand-collected (e.g. transcriptions of survey responses or numbers copied from published reports) is almost surely flawed, because hand collection is a tiring, repetitive process that is susceptible to human error.

Archival datasets from commercial data providers are also known to be plagued by nonsense values and duplicate records; see, for example, the Kellogg School of Management Research Computing unit's list of articles on Database Biases and Errors for some widely used financial databases (such as Compustat, CRSP, IBES and OptionMetrics).

Borrowing terminology from computer science, all data validation checks are underpinned by six fundamental queries:

(1) Validation of existence

Does the data entry exist? If yes, should it be set to missing? If not, should I use interpolation or multiple imputation to recover the missing value?

Sometimes, the validation of existence uncovers unreasonable observations that cannot comply with the data generating process, and should therefore be encoded as missing, e.g. impossible body temperatures, impossible girths of trees, impossible BMI values, a zero number of employees, negative risk factors, and so on. In this case, data graphing can serve as a validation check as part of an exploration exercise.

Other times, data may appear missing but should not. For example, consider the calculation of a variable called ‘Common Shareholders’ Equity (CSE)’ using data from the Compustat database. To calculate CSE we need access to the following items: CSE = common equity (item ceq) plus preferred treasury stock (item tstkp) minus preferred dividends in arrears (item dvpa). For all non-financial entities listed on the NYSE and AMEX stock exchanges during 1963-2011, there are 297,323 Compustat records with non-missing observations of ceq, but only 219,068 for tstkp and 238,023 for dvpa. This means that a blind calculation of CSE would reduce the sample to the lowest common denominator and lose about 25% of the sample. But are these values truly missing? Not really: Compustat records unreported values as missing, whereas the preparer of financial statements simply omits flow information that is zero (as is the case with treasury stock and preferred dividends, which not all companies have). In this case, the validation of existence should replace missing values with the value of zero.
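In pandas, this zero-replacement step might be sketched as follows, using hypothetical column names that match the Compustat item mnemonics:

```python
import pandas as pd
import numpy as np

# Hypothetical extract of Compustat items; NaN marks unreported values
df = pd.DataFrame({
    "ceq":   [100.0, 250.0, 80.0],
    "tstkp": [5.0, np.nan, np.nan],   # preferred treasury stock
    "dvpa":  [np.nan, 10.0, np.nan],  # preferred dividends in arrears
})

# Unreported flow items are zero, not missing: replace NaN with 0
df[["tstkp", "dvpa"]] = df[["tstkp", "dvpa"]].fillna(0)

# CSE = ceq + tstkp - dvpa, now computable for the full sample
df["cse"] = df["ceq"] + df["tstkp"] - df["dvpa"]
print(df["cse"].tolist())  # [105.0, 240.0, 80.0]
```

Note that the replacement is applied only to the flow items whose absence means zero; ceq itself is left untouched, since a genuinely missing stock value should not be silently zeroed.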

The recovery of missing data is known as imputation. Assigning a fixed value, such as zero, to a missing value is the simplest form of imputation. Alternatively, we may need to assume a functional form and use the data that is available to recover the missing value, as for example in linear interpolation. The idea is very simple. For two sets of known coordinates, (x0, y0) and (x1, y1), the linear interpolant for the unknown y is just the straight line that goes through these two points:

\Large \frac{y-y_0}{x-x_0} = \frac{y_1-y_0}{x_1-x_0}

Given a value of x, then we solve for y:

\Large y = y_0 +(x-x_0)  \frac{y_1-y_0}{x_1-x_0}

If the function is of quadratic form then you can consider a quadratic interpolant. The only difference is that you need 3 points on the curve in order to find the 4th.
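The linear interpolant above translates directly into code. A minimal sketch, with the function name chosen for illustration:

```python
# Linear interpolation between two known points (x0, y0) and (x1, y1),
# solving the interpolant equation for the unknown y
def linear_interpolate(x, x0, y0, x1, y1):
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Example: recover a missing value at x = 5 between (0, 10) and (10, 20)
y = linear_interpolate(5, 0, 10, 10, 20)
print(y)  # 15.0
```

For interpolating many points against a table of known coordinates, `numpy.interp` implements the same idea.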

Alternatively, one could simply use regression analysis to recover missing values. This is a form of multiple imputation. For instance, the relation of age with height is well understood. We know that for men, height increases up to the age of about 18 and then stabilises. For women, height increases faster than for men and stabilises at about 16 years of age. We can use this knowledge to fit an appropriate regression function, and then use the prediction of that regression function to recover some missing values.
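A minimal sketch of regression-based imputation, using hypothetical age–height data and a simple linear fit (a real application would choose a functional form that respects the stabilisation of height after adolescence):

```python
import numpy as np

# Hypothetical sample: height (cm) observed at some ages, missing at age 12
ages    = np.array([8, 10, 14, 16])
heights = np.array([130.0, 140.0, 160.0, 170.0])

# Fit a simple linear regression (degree-1 polynomial) of height on age
slope, intercept = np.polyfit(ages, heights, 1)

# Impute the missing height at age 12 from the fitted line
imputed = slope * 12 + intercept
print(round(imputed, 1))  # 150.0
```

The fitted line plays the role of the assumed functional form; the prediction at the missing age is the imputed value.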

(2) Validation of data type

Should this be a number or a string? If a number, should it be an integer or a float? If a float, then how many decimal points should it contain? If a string, then how long should it be?

The key distinction between data types is that numerical data can be employed in computations, whereas string data cannot be used in computations but can be analysed as text.

Numerical data can be in integer form or float form (i.e. have decimal precision), and then it becomes a question of how precise the number should be and how precisely it should be presented. For example, it would not make sense to present the price of a product with 4 decimal places (e.g. $25.3542), or the height of a person rounded to the nearest first decimal (e.g. 1.8 metres).

There are certain special numerical data types, including dates and times that are stored as elapsed datetime steps from a specified datum, binary variables that represent Boolean arguments, and geographical role variables such as latitude, longitude and geometric boundaries.

For strings, there are two types: ASCII and Unicode. ASCII stands for American Standard Code for Information Interchange, and describes the characters that an English-language keyboard can encode. Unicode is an encoding system for typing non-English characters and script-based languages.
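These type questions can be checked programmatically. A sketch that classifies raw string entries (as typically read from a hand-collected file) and tests whether a string is ASCII-encodable; the entries and helper name are illustrative:

```python
# Hypothetical raw survey entries, read in as strings
entries = ["42", "3.14", "café", "N/A"]

def classify(value):
    """Classify a raw string entry as int, float or text."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"

print([classify(v) for v in entries])  # ['int', 'float', 'text', 'text']

# ASCII vs Unicode: 'café' requires Unicode because of the accented character
print("café".isascii())  # False
```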

(3) Validation of uniqueness

Is the data entry unique? If not, is it a false duplicate, an unidentified observation or an intentional copy?

The uniqueness of observations is a paramount requirement for any type of analysis. Uniqueness is defined as the single record of an observation as identified by a unique combination of its cross-sectional generating unit and the point in time when it was generated.

The validation of uniqueness is performed on every data generating process and its identifiers. For instance, in a survey, uniqueness in responses is ensured by assigning personal identifiers, plus a time identifier if there are repeated interviews. For a sale transaction record, uniqueness is ensured by assigning an invoice or receipt number.

When validating uniqueness it is useful to think of datasets as being of one of following three types:

  • Time series datasets hold observations on one entity that are repeated over time. They are identified by a single time identifier given their time frequency. For example, quarterly data is identified by the year and the quarter within the year, e.g. Q3, 2018. Second frequency data is identified by the year, the day of the year and the second within that day, e.g. 12/June/2017 18:12:45.
  • Cross-sectional datasets hold observations on several entities identified at a fixed point in time. They are identified by one or more identifiers of the cross-sectional units. For example, we may need to identify the branch sales of competing supermarkets and unique identification requires the supermarket name (e.g. Woolworths, Coles, IGA, ALDI) and the branch location (e.g. Newtown, Bondi, Parramatta).
  • Panel data or longitudinal datasets hold repeated time observations over several individual units. These are identified by a combination of a time identifier and cross-sectional identifiers, e.g. a marketing survey may be repeatedly conducted on the same people every year in order to observe shifts in consumption habits over time, so we need a personal identifier (on the cross-section) plus a year identifier (on the time dimension) to ensure unique identification.
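For a panel dataset of the kind described above, a minimal uniqueness check in pandas might look like this (column names hypothetical):

```python
import pandas as pd

# Hypothetical panel: person identifier plus survey year should be unique
panel = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2],
    "year":      [2020, 2021, 2020, 2021, 2021],  # (2, 2021) appears twice
    "spend":     [300, 320, 250, 260, 260],
})

# Flag every record whose identifier combination is not unique
dupes = panel.duplicated(subset=["person_id", "year"], keep=False)
print(panel[dupes].index.tolist())  # rows [3, 4] fail unique identification
```

The same check applies to time series (subset is the time identifier alone) and cross-sections (subset is the cross-sectional identifiers alone).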

If unique identification fails, then there are duplicate observations that must be removed, otherwise they would bias the analysis. It is up to you to determine whether an observation is a true duplicate or not.

Observations that lack identification should be treated as missing and excluded from analysis, unless the dataset is a cross-section that contains no identifiers at all, perhaps due to confidentiality restrictions, in which case the presence of apparent duplicates may be reasonable. This is why intimate knowledge of the data generating process is critical.

(4) Validation of composition

Is there an expected structure to the value? Should the string of alphanumeric characters be in a certain order? Could the composite data be separated into individually meaningful values?

Composition describes the order in which alphanumeric characters are organised within a value. For instance, a variable holding addresses could include street name, suburb, postcode and city, and each piece of information may be recorded as a separate variable to enable meaningful analysis. Numerical variables may also contain compositional structure, e.g. Australian postcodes contain four digits, and those beginning with the number 5 are in South Australia, those beginning with 6 are in Western Australia, and so on.

Composition is also relevant for qualifying data using identifiers. For instance, CUSIP is an alphanumeric identifier for North American securities that can be obtained via several renowned financial database providers. A CUSIP usually takes 8 characters, but sometimes a 9th digit is added as a checksum to ensure unique identification. The first six characters identify the issuer (e.g. company) and may contain numbers or alphabetical characters. The next two characters indicate the type of issue. Equity issues are assigned two numeric characters, but fixed income issues are assigned one or two alphabetic characters. 01 is reserved for option securities on equity. IPOs are identified as 00 or 10. SEOs are assigned the values 20, 30, 40, 50, 60, 70 and 80. Values from 11 to 69 identify rights and warrants.
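A structural check of this composition can be sketched with a regular expression. The pattern below is an assumption based on the description above (6 alphanumeric issuer characters, a 2-character issue code, an optional 9th check character), not the official CUSIP specification, and the example codes are made up:

```python
import re

# Assumed structure: 6 issuer characters + 2 issue characters + optional checksum
cusip_pattern = re.compile(r"^[0-9A-Z]{6}[0-9A-Z]{2}[0-9A-Z]?$")

def validate_cusip(code):
    """Check only the compositional structure, not the checksum arithmetic."""
    return bool(cusip_pattern.match(code))

print(validate_cusip("ABCDEF10"))  # True  (hypothetical 8-character CUSIP)
print(validate_cusip("ABCDE10"))   # False (only 7 characters)
```

A full validation would also verify the check digit, which requires the official checksum algorithm.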

(5) Validation of range

Is there a permissible support to the range of variation? If yes, then what are the bounds of the distribution? If not, then how should extreme values be defined?

For example, Body-Mass-Index (BMI) is an important measure for several applications in public health, clinical practice, insurance underwriting, even in qualifying fashion show models (minimum BMI of 18). BMI is calculated as mass in kg divided by height in metres squared, e.g. 80 kg / 1.80^2 = 24.69 BMI. Consider a BMI value of 10 for an adult of 18+ years. Could this be an admissible value? This would amount to someone 1.70m tall weighing 29kg, or 1.50m tall weighing 26kg. Even if this were possible, would you pool this observation with the rest for analysis? Can you say with confidence that this observation comes from the same population?

The validation of permissible range is straightforward if the variable is known to follow a one-sided bounded distribution (e.g. Log-Normal) or a two-sided bounded distribution (e.g. Beta). However, the validation of range is far more difficult if the variable follows an unbounded distribution that can extend to infinity (e.g. Normal, Student’s t), in which case it requires a definition of extreme values (outliers).
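A minimal range check on the BMI example might be sketched as follows; the bounds are illustrative assumptions, not clinical standards:

```python
# Assumed admissible BMI range for adults; purely illustrative bounds
LOWER, UPPER = 12.0, 60.0

def bmi(mass_kg, height_m):
    return mass_kg / height_m ** 2

# (mass kg, height m) triples from the discussion above
observations = [(80, 1.80), (29, 1.70), (26, 1.50)]

# Flag observations whose BMI falls outside the assumed range
flags = [not (LOWER <= bmi(m, h) <= UPPER) for m, h in observations]
print(flags)  # [False, True, True]: the last two fall below the lower bound
```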

(6) Validation of relation

Is there a deterministic relation or an expected stochastic relation with other variables?

For example, in the financial world it holds that equity plus liabilities equals assets; a change in market price implies non-zero volume, unless the price has been extrapolated as a bid-ask mean; the maturity dates of options must be equal to or greater than the exercise date; the dates of earnings per share forecasts must be on or before the announcement dates of actual earnings; and so on.
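A deterministic relation such as the accounting identity can be validated directly. A sketch with hypothetical balance-sheet figures:

```python
import numpy as np

# Hypothetical balance-sheet extract (in millions)
assets      = np.array([500.0, 320.0, 150.0])
equity      = np.array([200.0, 120.0,  90.0])
liabilities = np.array([300.0, 200.0,  55.0])  # third record breaks the identity

# Assets = equity + liabilities must hold; allow a small tolerance
# for rounding in reported figures
violations = ~np.isclose(assets, equity + liabilities, atol=0.01)
print(violations.tolist())  # [False, False, True]
```

Records flagged as violations should be traced back to the source before any analysis, since a broken identity signals a recording or extraction error rather than genuine variation.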

An easy way to validate a relation is by looking at the data through a scatterplot matrix. For example, consider the following relation between birthweight and gestation period:

[Scatterplot of birthweight against gestation period, with two suspect observations highlighted in red]

I highlight in red two observations with a very low gestation period (148 days and 181 days), yet with about average birthweight (116 and 110 ounces). The validation of range for gestation period alone does not raise a red flag about the validity of these two values, because premature babies are plausible.

However, it is the validation of the relation between birthweight and gestation period that raises serious concerns. It is inconceivable for such premature babies to have such high birthweights. It is only when these values are judged as a relation that they are flagged as problematic, likely indicating a human error in recording the data.
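A joint check of this kind can be coded as a conjunction of conditions on both variables at once. The data and thresholds below are illustrative, echoing the example above:

```python
import pandas as pd

# Hypothetical records echoing the example: two very short gestations
# paired with roughly average birthweights
babies = pd.DataFrame({
    "gestation_days": [280, 270, 148, 181],
    "birthweight_oz": [120, 115, 116, 110],
})

# Neither variable alone is out of range, but the combination is implausible:
# flag short gestations paired with near-average weight (thresholds assumed)
suspect = (babies["gestation_days"] < 200) & (babies["birthweight_oz"] > 90)
print(babies.index[suspect].tolist())  # rows [2, 3] flagged for review
```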


Demetris Christodoulou