Data shouldn’t be allowed to speak for itself

11/6/2019 ☼ Analysis, Big data

The most important failure mode for management research that uses digital data is mistaking abundant quantitative data for complete data when analyzing it and drawing conclusions from it. Patterns discovered in data can prompt theorizing, but patterning in data should not be mistaken for theory. In other words, data shouldn’t be allowed to speak for itself.

It’s important to think clearly about this because more and more human action and interaction relevant to management and organizations happens digitally, and data from digital action and interaction is becoming easier to store and analyze.

Both trends have implications for how researchers think about the process of doing research and drawing conclusions from it that are intended to inform practice. These implications are conditioned by the affordances and constraints of digital information in organizational settings.

Affordances:

  1. Large quantities of information can be collected, stored, and distributed relatively easily
  2. Routine and ambient data is relatively easy to collect prospectively (e.g. sociometers) or mine from systems retrospectively (e.g. data from corporate email servers).

Constraints:

  1. Hard to understand what digital information about a focal entity (individual or organization) remains unobserved or inaccessible
  2. Non-digital aspects of the focal entity remain hard to observe
  3. Identification of focal entities becomes easier (and robust anonymization thus becomes harder) as datasets grow in size and are combined
  4. Murky ethics of using data acquired retrospectively and opportunistically from routine user activity (such as cellphone location data)
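Point 3 above can be made concrete with a short sketch (all names, fields, and records here are invented for illustration): a dataset released without names can often be re-identified by joining it to a second, public dataset on shared quasi-identifiers such as postal code and birth year.

```python
# Illustrative sketch with invented data: combining two datasets on
# shared quasi-identifiers can re-identify an "anonymized" dataset.

# An HR dataset released without names.
hr_records = [
    {"zip": "10115", "birth_year": 1980, "role": "manager", "salary": 95000},
    {"zip": "10117", "birth_year": 1991, "role": "analyst", "salary": 60000},
]

# A separate public dataset that does contain names.
public_records = [
    {"name": "A. Schmidt", "zip": "10115", "birth_year": 1980},
    {"name": "B. Meier", "zip": "10117", "birth_year": 1991},
]

def reidentify(anon, public, keys=("zip", "birth_year")):
    """Join the two datasets on quasi-identifier keys."""
    # Index the named dataset by its quasi-identifier combination.
    index = {tuple(p[k] for k in keys): p["name"] for p in public}
    matches = []
    for row in anon:
        name = index.get(tuple(row[k] for k in keys))
        if name is not None:  # the combination uniquely picks out a person
            matches.append({**row, "name": name})
    return matches

for match in reidentify(hr_records, public_records):
    print(match["name"], "->", match["salary"])
```

The larger and more varied the datasets, the more quasi-identifier combinations become unique, which is why robust anonymization gets harder as datasets grow and are combined.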

Implications:

  1. Conceptualizing research permission becomes more complex. The ethical challenge lies in understanding both the legal (liability-oriented) and the moral (ethics-oriented) imperatives for permission to use digital data in research. This is especially important for data collected opportunistically from the activity of respondents who are not the legal owners of the data they produce in the course of routine activity.
  2. Analytic bias becomes harder to see. Most importantly, the size of large datasets invites the implicit assumption that they are complete datasets that “speak for themselves.” While large quantities of data can be collected relatively easily, it remains difficult to understand what types of data (both digital and non-digital) are missing. Analyses of large digital datasets may thus be subject to unknown biases, and these biases may become less apparent as datasets grow larger.
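The second implication can be illustrated with a toy simulation (the population, variables, and numbers are invented for illustration): a dataset mined from a digital system can be very large and still systematically biased, because it contains only the people who leave digital traces, and collecting more of the same data does not reduce that bias.

```python
import random

random.seed(0)

# Invented population of 100,000 employees. Engagement differs by
# whether someone uses the internal chat tool (the digital trace we see).
population = []
for _ in range(100_000):
    uses_chat = random.random() < 0.4          # 40% leave digital traces
    engagement = random.gauss(70 if uses_chat else 50, 10)
    population.append((uses_chat, engagement))

true_mean = sum(e for _, e in population) / len(population)

# A "big" dataset mined from the chat server: large, but chat users only.
digital_sample = [e for uses, e in population if uses]
sample_mean = sum(digital_sample) / len(digital_sample)

print(f"true mean engagement: {true_mean:.1f}")
print(f"big-data sample mean: {sample_mean:.1f} (n={len(digital_sample):,})")
# The sample has tens of thousands of rows yet overstates engagement,
# and adding more chat data would not shrink the gap.
```

The sample mean stays well above the population mean no matter how large the mined dataset gets, because what is missing (non-users) is systematic rather than random.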

[Originally written for a digital ethics forum run by the GovTech Lab in May 2019.]