Exploratory data analysis (EDA) is the exploration of data characteristics with the aim of unveiling patterns and suggestive relationships that eventually inform improved modelling and updated expectations. EDA provides the foundations for Visual Data Analytics (VDA).
“An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop parsimonious models and determine optimal factor settings.” This is an accurate description of EDA in its purest form.
The EDA philosophy
The father of EDA is John Tukey, who officially coined the term in his 1977 masterpiece, Exploratory Data Analysis. Lyle Jones, the editor of the multi-volume “The collected works of John W. Tukey: Philosophy and principles of data analysis”, describes EDA as “an attitude towards flexibility that is absent of prejudice”.
The key frame of mind when engaging with EDA, and thus VDA, is to approach the dataset with little to no expectation and not to be influenced by rigid parameterisations. EDA asks the analyst to let the data speak for themselves. To use the words of Tukey (1977, preface):
“It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it… Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone – as the first step.”
The importance of John Tukey’s contribution to the development of EDA is aptly captured in Howard Wainer’s (1977) book review: “Trying to review Tukey’s Exploratory Data Analysis is very much like reviewing Gutenberg’s Bible. Everyone knows what’s in it and that it is very important, but the crucial aspect to report is that it has been printed… EDA is where the action is. This Tukey feels is detective work, finding clues here and there, trying to pick one’s path carefully amid the false trails and spoors which can lead us astray” (p.635).
Since its inception as a unifying class of methods, EDA has influenced several other major developments in statistics, including non-parametric statistics, robust analysis, data mining, and visual data analytics. These classes of methods are motivated by the need to stop relying on rigid, assumption-driven mathematical formulations that often fail to be confirmed by observation. Instead, EDA lets the data suggest the appropriate specification.
The caveat to EDA
The antipode to EDA is to ignore data altogether when laying the foundations of a normative model. If the model then fails to be statistically confirmed, it may be because one has observed the wrong data or has not observed enough data.
This is also EDA’s caveat: it relies entirely on data to discover the truth. If one does not have good knowledge of the data generating process, or has failed to perform data validation, then EDA is doomed to fail. The very first step of EDA is therefore learning about the data itself, starting from the very first step of the Graph Workflow, the data management step.
The EDA edifice
EDA begins by understanding the distribution of a variable and how it could be transformed in order to describe a more meaningful source of variation. Transformations lie at the heart of EDA.
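As a first look at the distribution of a single variable, one might compute a few order-based summaries together with a measure of asymmetry. The following is only a minimal sketch using the Python standard library: the simulated lognormal sample, the helper names `five_number_summary` and `sample_skewness`, and the choice of the "inclusive" quartile rule are all assumptions made for illustration, and the quartiles computed here are not necessarily identical to Tukey's hinges.

```python
import random
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation over cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Hypothetical right-skewed variable on (0, +inf), simulated for illustration.
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

summary = five_number_summary(sample)
skew = sample_skewness(sample)
sample_mean = statistics.fmean(sample)

print("five-number summary:", [round(v, 3) for v in summary])
print("mean:", round(sample_mean, 3), "skewness:", round(skew, 3))
```

A median well below the mean, an upper quartile much further from the median than the lower quartile, and a clearly positive skewness would all point to a long right tail, which is the kind of distributional form that suggests trying a transformation.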
If the aim is to analyse a single variable, then a transformation can enhance inference by reducing skewness and containing variation. If the aim is to analyse a relation, then transformations can help express the relation in additive terms and enable more straightforward linear inferences. The ultimate prize is to transform a variable so that it is sufficiently close to normal.
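Both uses of a transformation can be illustrated with the logarithm, which is one common choice for strictly positive variables. The sketch below is hedged in several ways: the data are simulated, the multiplicative relation y = 2·x^1.5 with multiplicative noise is a hypothetical example, and the helpers `skewness` and `ols_slope_intercept` are written here for illustration rather than taken from any particular library.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness of a sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(42)

# Single variable: a log transformation pulls in the long right tail.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]
skew_raw = skewness(raw)
skew_log = skewness(logged)

# Relation: a multiplicative (power-law) relation becomes additive in logs,
# since log(y) = log(2) + 1.5 * log(x) + noise.
x = [rng.uniform(1.0, 100.0) for _ in range(2000)]
y = [2.0 * xi ** 1.5 * rng.lognormvariate(0.0, 0.1) for xi in x]
slope, intercept = ols_slope_intercept(
    [math.log(v) for v in x], [math.log(v) for v in y]
)

print("skewness before and after log:", round(skew_raw, 3), round(skew_log, 3))
print("log-log slope and intercept:", round(slope, 3), round(intercept, 3))
```

Under these assumptions, the skewness of the logged variable should sit close to zero, and the fitted slope and intercept on the log-log scale should be close to the exponent 1.5 and to log(2) respectively, so a straight line summarises what was a curved relation on the original scale.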
EDA comprises a class of methods for exploring data and extracting signals from the data. Follow the links in the order provided to learn more about some of the key methods:
- Judging distributional form
- Assessing normality
- Transformations for [0,+∞) variables
- Linearising relations for [0,+∞) variables
- Transformations for (–∞,+∞) variables
- Transformations for [0,1] variables
- Visual mining
- Robust analysis