Data generating process

The Data Generating Process (DGP) describes the rules by which the data has been generated. Data graphing assumes good knowledge of the data generating process, as it guides the selection of appropriate encoding tools.

This is a critical assumption, because data graphing takes a data science approach. That is, we rely on the data to extract valid signals. If the data is not well understood then we are doomed to fail.

If you are employed by the same organisation that generates the data then you are expected to have intimate knowledge of the DGP. Often we need to work with data that has been generated by others, but thankfully reputable data-minded organisations issue detailed records of their data protocols. Excellent examples include the World Bank, the European Social Survey, the Australian Bureau of Statistics, and more.

If there is no knowledge of the DGP then proceed with caution.

Survey DGP

Extensive surveys are never easy to analyse. Many factors go into the design of a survey, most notably the selection probabilities. Consider the Australian Bureau of Statistics (ABS) national surveys. The ABS data protocols for the DGP of every survey are extensive and warrant careful examination before engaging with the data. By reading the ABS’s data protocols you learn how the data was collected, the scale of each measure, the coverage of the survey, the precision and estimation error, the sample employed and more.

For example, the 2017-2018 ABS National Health Survey data protocol explains that the survey covers 16,384 households with 21,315 individuals at a 76.1% response rate. The sample can be disaggregated at best to the level of Statistical Area 4, which classifies Australia into 107 geographical areas. This is important information for data graphing because it means that a geospatial visualisation can show at most 107 areas in Australia. It also means that, if respondents were equally distributed across the 107 areas, each area would contain about 199 individuals (i.e. 21,315/107 ≈ 199). An equal distribution is unlikely, since urban areas are far more likely to be over-represented than rural areas, which means that certain areas would have much higher standard errors in the estimated prevalence of health risk factors.
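The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the prevalence of 20% and the three area sample sizes are hypothetical, and the standard error formula assumes simple random sampling, which is a simplification of the complex survey design that the ABS actually uses.

```python
import math

# Figures quoted from the 2017-2018 ABS National Health Survey data protocol
individuals = 21_315
sa4_areas = 107

# Average number of respondents per area if the sample were spread evenly
avg_per_area = individuals / sa4_areas


def proportion_se(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical prevalence and hypothetical area sample sizes (illustration only)
assumed_prevalence = 0.20
area_sizes = {
    "large urban area": 600,
    "average area": round(avg_per_area),
    "small rural area": 40,
}

standard_errors = {
    name: proportion_se(assumed_prevalence, n) for name, n in area_sizes.items()
}

for name, n in area_sizes.items():
    print(f"{name}: n={n}, SE={standard_errors[name]:.3f}")
```

The smaller the area sample, the larger the standard error, which is why sparsely sampled areas deserve a visual cue for uncertainty in any map of prevalence.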

Financial statement DGP

As another example of a DGP, consider financial statement data (e.g. sales, assets, equity). Such data is so ubiquitous that many do not stop to think about the ramifications of using it. In fact, financial statements are governed by a unique and rather complex data structure.

First, all transactions are recorded twice in the books, using the so-called ‘double-entry bookkeeping’ system, which means that there is a duplicate record of variation. This structure creates a network of deterministic identities, which means that the data matrix from the articulated financial statements is a rank-deficient matrix of order one. For example, it would make little sense to compare the Assets/Equity ratio with the Debt/Equity ratio, because Assets − Debt ≡ Equity, which means that we can rewrite Debt/Equity as (Assets/Equity) − 1. The two ratios therefore carry the same information.
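A minimal sketch of this identity, using hypothetical balance-sheet figures for five firms. The `matrix_rank` helper is a small hand-rolled Gaussian elimination written for this illustration (a numerical library would normally be used instead); it shows that a data matrix with three columns (Assets, Debt, Equity) has only two independent ones.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below position `rank` with a non-zero entry in this column
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical (debt, equity) pairs for five firms, illustration only
firms = [(60, 40), (120, 80), (35, 65), (400, 100), (80, 20)]

# Assets follow from the accounting identity: Assets - Debt = Equity
data = [[debt + equity, debt, equity] for debt, equity in firms]

rank = matrix_rank(data)

assets_to_equity = [Fraction(a, e) for a, _, e in data]
debt_to_equity = [Fraction(d, e) for _, d, e in data]
ratios_are_redundant = all(
    de == ae - 1 for de, ae in zip(debt_to_equity, assets_to_equity)
)

print(rank, ratios_are_redundant)
```

Plotting both ratios against each other would produce a perfect straight line, which encodes the accounting identity rather than any empirical signal.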

Second, all initial records in the accounting books are positively distributed. Negative constructs are only created for reporting purposes. For instance, net profit is simply the difference between a positively distributed income variable and a positively distributed expense variable. Knowing this key information helps greatly with choosing the right form of variable transformation.
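As a hedged illustration, take the logarithm as one common transformation for positive, skewed variables (the choice of the logarithm is an assumption made here for the example, and the figures are hypothetical). It is defined for the positive components but not for the derived net profit whenever a loss occurs.

```python
import math

# Hypothetical income and expense figures for three firms, illustration only
income = [120.0, 80.0, 200.0]
expenses = [100.0, 95.0, 150.0]

# Net profit is a derived construct and can be negative
net_profit = [i - e for i, e in zip(income, expenses)]

# The logarithm is defined for every positive component
log_income = [math.log(v) for v in income]
log_expenses = [math.log(v) for v in expenses]


def safe_log(value):
    """Return log(value), or None when the logarithm is undefined."""
    return math.log(value) if value > 0 else None


# The second firm reports a loss, so its log net profit is undefined
log_net_profit = [safe_log(v) for v in net_profit]

print(net_profit)
print(log_net_profit)
```

Transforming the positive components separately, rather than the derived difference, avoids losing the loss-making observations.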

Third, at year-end there are accounting adjustments that act as schedules for allocating revenue and expenses to the present and future periods. This means that there is a pre-determined relation between a purchased asset or an issued liability and the future allocation of this stock variable into periodic flows.

Fourth, there are two types of variation: stocks that accumulate over time (reported in the Statement of Financial Position), and flows that reset every year and measure the change in stocks (reported in the Statement of Cash Flows and the Income Statement). Knowing this structure of the DGP helps specify the statistical context. For examples of stock-and-flow contexts, see the analysis on BHP’s real capital stock-and-flow and BHP’s waterfall of real capital.
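The stock-and-flow relation can be sketched as follows, assuming a single stock variable with a hypothetical opening balance and hypothetical annual net flows. The figures are not taken from any company's statements.

```python
from itertools import accumulate

# Hypothetical opening balance and annual net flows, illustration only
opening_stock = 1000
net_flows = [150, -40, 220]

# Stocks accumulate: each closing balance is the previous balance plus the flow
balances = list(accumulate(net_flows, initial=opening_stock))
closing_stocks = balances[1:]

# Flows measure the change in stocks: first differences recover them
recovered_flows = [later - earlier for earlier, later in zip(balances, balances[1:])]

print(closing_stocks)
print(recovered_flows)
```

Because the flows are the first differences of the stocks, the two series should not be treated as independent sources of variation.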



Demetris Christodoulou