A primer on the Developers Trust Alliance project from Developers Alliance President/CEO Bruce Gustafson

Part 1: Because Data Is Useful

Why do people collect data, anyway?

I think we can thank bureaucrats and scientists, for a start. In both cases, a desire to understand complex systems led to the brilliant idea of creating models: abstract approximations of the systems under study. And data is the stuff models are built of. Whether it’s census information, weather reports, tensile stress measurements, or opinion polls, data is what brings models to life. It shapes how we see the world around us, and allows us to predict the future and understand the past.

Several things quickly become obvious with just a little reflection. First, while the scale of data collection has become truly impressive, we’ve been measuring and recording our observations for centuries, for essentially the same purposes as today. Sure, we now have new sensors and vastly more storage capabilities, but the reasons we collect information are the same: to understand the world, to predict the future, to manage a complex environment. Our purposes aren’t new.

Second, our ability to integrate and process vast amounts of information from diverse datasets has opened entirely new doors to understanding. We can model the weather, not just observe it. We can decipher the genome. We can predict a baseball team’s performance from on-base percentage. We can find patterns - and act on them - in data that describes systems as large as a universe or as small as a bacterium.

Third, there is something deeply embedded in our species that drives us towards knowledge. There is simply something satisfying in learning a new thing, or gaining a new insight.

Lastly, and perhaps the most important catalyst for the project behind these words, is the recognition that most of us have no idea how much of the information out there is about us, how it’s being used, who has it, and what their motivations are. And as human beings, we fear the unknown. For centuries, the unknown has been where dragons live.

Luckily, the more we know, the fewer places there are for dragons to hide.

Part 2: But Is Data Valuable?

While it is common to talk about the “value” of data, the true value of data is in its usefulness - its utility. Data is potential; useful in the hands of someone capable of extracting insights, but valueless in isolated pockets or in the hands of someone without the skill to spot the patterns and understand what they might mean. To put it another way, I’ve always known where I was, what I was spending my money on day by day, whether it was raining, or how hard it was to find a parking spot outside my office on a particular morning. This information wasn’t really valuable - I didn’t even remember it a few hours after it happened. It became useful when my credit card company collected a year’s worth of transactions and graphed my habits. It became even more useful when some developer began to plot the fastest route from here to there, based on where I was and where the other drivers were. It took someone else’s work, and often more data, to make my data useful.

The other challenge of thinking “data has value” is that it locks you into an ownership mindset. If I own my data, then I can trade it for something I want from someone that values it more than I do. The one with lots of data is rich. While appealing in its simplicity, this way of thinking has problems.

First off, since the usefulness of data comes from being able to combine it with other data and build insights from that, I could argue the actual value is in the “output” data, not the various inputs. And often the one who values the outputs is actually one of those who provided the inputs. Maybe we should be paying ourselves for our own data!

Second, the same data point can be used in many different ways, alongside many other sets of data, to create many, many useful outcomes. Put another way, how do we value a data point that will be used to find a cure for cancer by one analyst, but to forecast the price of softwood lumber by another, or to choose the hair color of actors in a commercial by a third? All that we can say is that the value is greatest when the data is widely shared. Allocating value at the source becomes intractable because the cost of replicating the data point is essentially zero, while potential uses are theoretically infinite. Data is never "used up", and unlike property, one person's use does nothing to limit using the same data elsewhere simultaneously.

Third, the usefulness of data changes in unpredictable ways. Data that appears worthless today can become valuable tomorrow, and vice versa. Random snapshots that happened to contain a stop sign in the background became more valuable once we started training AI for autonomous vehicles, for example. At the other extreme, a security breach at a major data aggregator can drive the value of data negative - actually making it a liability to "own." Making matters worse, similar data from different sources can have radically different values that only become apparent in hindsight: take, for example, a DNA sequence from someone that turns out to have a unique mutation. These are all non-trivial problems that only arise if we use the “data as value” way of thinking.

Two final thoughts on models of data value, specifically the “data as labor” concept and the “data is market power” idea. The challenge with data-as-labor is whether people actually produce data in the same sense we produce labor, or whether it is more akin to “digital exhaust” - the unintentional byproduct of us doing other things. For the moment, nothing in the discussion persuades me to look at things any differently. Data is potential, and it only becomes useful when combined with other data and analyzed by those with the means and motivation. On the data-is-market-power idea, the challenging assumption is that aggregating data somehow creates barriers, as opposed to advantages, for market participants. While a large, and more importantly diverse, dataset is more useful than a small or one-dimensional dataset in isolation, there is nothing in the collection of data that actually removes data points from circulation. The fact that my mapping application knows where I am doesn’t prevent my network provider from knowing the exact same thing, nor does it prevent my smartphone’s operating system, my weather application, or any other service with access to this information from having it too. The advantage goes to the one that first aggregates enough data to solve a complex and valuable problem, and does it better than those that follow. I call this “competition.”

Part 3: So, Is Knowledge Power?

I think, at least, that shared knowledge can be powerful. And powerful things can be scary, especially to those that are kept in the dark.

There is a real and honest fear held by many people today that unknown forces are using data in ways that threaten them, personally. But while part of this fear is well founded, much of it stems from a lack of understanding and awareness of how data is collected and used.

There are four things driving most of the fear and misunderstanding about data. First, people want to know what data exists and who has it. Second, and more importantly, people want to know what data holders are doing with this data. Third, people feel they are unable to influence how data that relates to them is used. And finally, people worry that data they would deem personal is insecure.

A major focus of this project is to provide an easily understood overview of what data is out there, how it is collected, how it is used and how it is secured. As part of this explanation we’ll touch on how data can impact individuals and how it is used as part of automated decision making. We’ll also spend time talking about how market forces work in our favor to encourage good data stewardship in those we choose to trust.