Open Data: The end of corruption or the beginning of "Big Brother"?

Thought experiments are an often overlooked and underutilized tool in the toolbox of science, policy, art, and basically any other field about which one can think. Einstein famously conducted thought experiments, which he credited with aiding some of his greatest achievements, such as the description of light as photons and the theory of relativity. Schrödinger’s cat, Zeno’s paradox, and the prisoner’s dilemma are other well-known thought experiments. The power of a thought experiment lies not in conducting the experiment but in the deliberate visualization of the consequences of the question itself.

I find that thought experiments are most impactful when confronting issues muddied by mixed messages. For example, we’ve all had bosses, parents, sports coaches, or other people in our lives say something like “It’s important to take risks.” Often, these same people almost simultaneously say, “But don’t fail.” Obviously this is not useful and doesn’t allow for a “win” in either direction, as failure often, although not always, comes hand in hand with risk-taking. If a person takes risks but fails, that’s a bad outcome. If that same person does not take risks and does not fail, that too is a bad outcome, as valuable growth opportunities are missed because of the mixed message. The tendency for most people is to be conservative. Thus, the only way forward is to rely solely on external chance rather than internal skill and intuition.

The case is much the same in the current debate on the value of open data. Seemingly everyone has opinions on how “open” data should be, perhaps because personal security is a topic that matters to everyone from corporations to individuals. Within those opinions, however, two main camps of thought are emerging: ‘open data is important and the future’ and ‘open data is dangerous’. Let’s first take a look at the ‘important’ camp. The argument for the usefulness of open data, by and large, goes something like this: 1) we have long-standing issues around the world such as hunger, poverty, and food insecurity. 2) We have the tools to combat these issues, but it takes collaboration and sharing of data to provide new and innovative insights into how that data can be used. 3) Through that collaboration and sharing, coupled with enabling policies from government and supportive actions from the private sector, we can solve these issues. Implicit in those three steps is the idea that we can solve these issues with less capital, time, and resources than under the current, more protective paradigm around data. The theory has sound backing: research does show that proper collaboration increases productivity. Whether a government works for or against the people, increasing the size of the pie will increase the size of each piece, even if some pieces should be larger than they are. This argument sounds compelling and reasonable. Without proper protection, however, data gained through collaboration can have punitive effects and unintended consequences.

Take the lawsuit against AgriStats as an example. AgriStats is a company that collected data on poultry farms in order to provide comparative data about chicken production. The value AgriStats provided to its clients was identifying targeted areas for improvement so that operators could increase efficiency and maximize profit. The data was anonymised, so farmers believed their data was secure and were willing to exchange it for access to the service. Larger companies further up the supply chain then bought this anonymous, aggregated data. So far, nothing is out of the ordinary. However, those companies were allegedly able to reverse engineer the identifying information, pressure farmers, and engage in what amounted to price fixing. Again, this is all alleged, but the case shows how much care must be taken to use open data responsibly.
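To make the re-identification risk concrete, here is a minimal sketch of a linkage attack. The farm names and numbers are entirely hypothetical, and this is not AgriStats’ actual data or method; the point is only that quasi-identifiers surviving anonymisation (say, region and flock size) can be joined against outside knowledge to put names back on “anonymous” rows.

```python
# Hypothetical "anonymised" benchmarking records: names removed, but
# quasi-identifiers (region, flock size) left in for comparison purposes.
anonymised = [
    {"id": "A1", "region": "NW Ohio", "flock_size": 30000, "cost_per_bird": 2.10},
    {"id": "A2", "region": "NW Ohio", "flock_size": 85000, "cost_per_bird": 1.85},
    {"id": "A3", "region": "SE Iowa", "flock_size": 30000, "cost_per_bird": 2.40},
]

# Outside knowledge a buyer further up the supply chain might plausibly hold.
known_farms = [
    {"name": "Smith Family Farms", "region": "NW Ohio", "flock_size": 85000},
    {"name": "Jones Poultry",      "region": "SE Iowa", "flock_size": 30000},
]

def reidentify(anon_rows, known_rows):
    """Match anonymised rows to named farms on quasi-identifiers alone."""
    matches = {}
    for anon in anon_rows:
        candidates = [k["name"] for k in known_rows
                      if k["region"] == anon["region"]
                      and k["flock_size"] == anon["flock_size"]]
        if len(candidates) == 1:  # a unique match means the row is re-identified
            matches[anon["id"]] = candidates[0]
    return matches

print(reidentify(anonymised, known_farms))
# prints {'A2': 'Smith Family Farms', 'A3': 'Jones Poultry'}
```

Even this toy example shows why stripping names is not the same as protecting identity: two columns of context were enough to recover two of the three farms.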

This is where the second camp, ‘open data is dangerous’, also has a compelling argument that goes something like this: 1) value is created by knowing something about a field or business sector before my competitors know it. 2) Obtaining that knowledge requires data for testing theories and proving a concept. 3) Therefore, data is my valuable resource and competitive advantage. What comes out of this line of thinking is that only data in a post-competitive state can be shared. Although collaboration is recognized as valuable, it must be protected with non-compete clauses, non-disclosure agreements, terms of use for personally identifiable information, and so on. This is the way most business is conducted, the way most university work is completed, and the way countries are viewing the future. For example, the European Union has adopted the General Data Protection Regulation (GDPR), which all companies doing business in the EU will need to comply with by May 2018 to protect personal data. The theory is that consumers will be able to control when and how companies can use their data, and that incentives such as discounts on popular products will be offered in exchange for this new data currency. How valuable is this data? Think about how valuable oil was in the early 20th century, and the large antitrust lawsuits brought against John D. Rockefeller and Standard Oil. If you haven’t heard of Standard Oil, I am sure you have heard of the companies this monopoly was broken up into (allowing for mergers and acquisitions over time). Companies that can trace their roots directly to Standard Oil include ExxonMobil, Chevron, Marathon Petroleum, and BP (formerly British Petroleum). Fast forward to today, and the digital oil that is worth over $25 billion in profit (according to The Economist) is controlled largely by five companies: Alphabet (parent company of Google), Amazon, Apple, Facebook, and Microsoft. But are the giant data companies of today really a monopoly?
Does Google really control search, Amazon retail, Apple the smartphone, Facebook social data, and Microsoft web browsing? Tech companies are still entering these fields and can compete with the giants. An argument can also be made that society is benefitting from these large companies, even if this data makes them incredibly powerful.

So, at first glance, our thought experiment suggests that sharing data increases productivity but will almost certainly lead to at least some level of unacceptable abuse with crippling side effects. Conversely, protecting data will drive down productivity, force more regulation, and leave giant tech guardians in charge of every aspect of our digital lives. In the first case, the benefits of sharing are weighed against the potential for abuse. In the second case, the benefits of protection are weighed against the consequences of centralized power in the new data economy.

Two wrenches can also be thrown into this experiment. First, the sheer amount of data being created now is staggering. More data has been produced in the last two years than in the entire history of humanity leading up to that point. New oil reserves were never found at this rate, meaning that opportunities are supposedly emerging faster than their business and societal value can be fully realized. With the Internet of Things (basically all devices connected and talking to each other) growing exponentially, the amount of data is projected to increase tenfold by 2020. Second, millennials are changing the data economy. A recent research project conducted by Mintel found that 60% of millennials are willing to share personal data, compared to 30% of baby boomers. Of the millennials who said they would not provide data at all, 30% would change their mind if offered something as small as a $10 coupon.

So how do these wrenches affect the thought experiment? In the first case, where open data is shared but unintended consequences increase, those consequences are mitigated if people don’t actually view them as consequences. Admittedly, this seems like a weak argument, given that millennials still expect a certain level of trust and stewardship of their data. However, if the value of that data is only a $10 coupon, just how much trust and stewardship are millennials really expecting? And if data explodes tenfold in the next two and a half years, is it really possible to analyze and interpret much meaning from it when only 0.5% of all data is analyzed today? Collaboration can make better use of this data and provide societal, business, and research benefits.

In the second case, where data is protected, data itself could become a barrier to entry for competition. Standard Oil was used earlier as an example of a monopoly. When previous antitrust lawsuits were heard, a company’s size and market share relative to its competitors were the main points of argument. Size alone, especially today, is not an inherently bad thing in antitrust terms. In today’s economy, a company’s digital assets could be equally important. Facebook recently bought the tech startup WhatsApp for $22 billion, despite WhatsApp having little to no revenue to speak of. Did Facebook find some valuable market share it wanted to tap into, or was it an early warning of data hurting competition? Put another way, are these powerful, yet beneficial, tech companies going to remain beneficial, or will they create new barriers to entry? I’m not sure, but to date the benefits far outweigh the consequences. With more and more big-data analysts, deep learning techniques, and artificial intelligence, can we really expect anonymous data to remain anonymous no matter what regulations are put in place? In other words, data will become open eventually, so we can either choose how that happens or have it forced upon us.

Now our thought experiment is getting good: we’re painting a future of exploding data, a generation more open about privacy, and an economy whose barriers to entry depend on more than just physical size and market capital. All the while, we have to consider the consequences of today. Either path has benefits to be promoted and consequences to be mitigated. From a moral standpoint, both arguments help and hurt people, so how can we choose what is best?

I wrote recently about the case for the philosopher, which you can find in my blog history. This is a good case for a philosopher, and a good philosopher to pattern our argument after is John Rawls. An American political and moral philosopher, Rawls received the National Humanities Medal in 1999 from then-President Bill Clinton, who remarked that Rawls’ work “helped a whole generation of learned Americans revive their faith in democracy itself”. He is a good choice for a philosophical model, not least because he used thought experiments like his famed “veil of ignorance”. In this thought experiment, Rawls positions everyone as impartial equals who must deliberate about justice without knowing anything about themselves (race, gender, religion, height, etc.). The idea is that people who know nothing about their own positions cannot argue from self-interest, and the principles they choose will produce a just and moral society.

So let’s assume nothing about ourselves for the moment. We know everything about how data is used and collected now, we know how the amount of data will increase in the future, and we know what problems we need to solve. We don’t, however, know which role we play in this world. We could be the CEO of a major data company, or we could be a food-insecure person with no access to data as a valuable commodity. What would we want a data-rich society to look like? 1) We might want to extract as much information as possible from that data, to ensure the greatest possible societal benefit. 2) We might want to use that data to build products and businesses that can compete fairly with other businesses. 3) We would want some form of reciprocity. 4) We would want protection from unintended consequences to the greatest extent possible. 5) We would want some form of recourse if we are wronged.

For point 1), open data and collaboration are the best way to achieve this, as all of society can participate in and benefit from the information created from data. For point 2), open data is also the best choice, as it increases the overall pace of innovation and profitability for businesses. For point 3), under the present paradigm closed data works best for reciprocity, as agreements are needed ahead of time. For point 4), it’s not entirely clear that either system protects data better than the other; both open data advocates and opponents try to put policies in place to protect against unintended consequences of data misuse. Lastly, for point 5), open data appears most beneficial, provided terms of use are upheld and companies and individuals continue to be held accountable for data mismanagement.

It would seem that an open data future, by a slight margin, is the way to go. However, just as what constitutes an antitrust case may need rethinking in the data economy, so too does how we live and work. Assuming we are eventually headed toward an open data world, let’s consider three roadblocks that many struggle to overcome. First, individual data protection. No one wants to be the victim of identity theft, just as no farmer wants to be the victim of alleged price fixing in the example above. Whether data is open or closed, data security firms will still work to anonymise and protect the most personal data, and other actors will try to uncover that data and, purposefully or accidentally, release information that should not be released and cause harm. Knowing that, it is important that punishments deter these courses of action. Some evidence from the U.S. indicates that governments may be thinking along these lines, as punishments for identity theft have been increased and can carry as much as 15 years in prison. That’s a good deterrent, although stealing money from a corporation, by comparison, is a class B felony and can carry up to 20 years in prison. Additionally, as we record our lives online more and more, we build a profile of activities. The drawback of such a profile is perhaps being exposed to only a subset of the advertisements and products available, but the benefit is that identity theft could be caught and stopped much sooner. Identity theft is hard to prove and recover from, but more, and more open, data could lead to algorithms that predict a problem, monitor the situation, and lead authorities to the criminal. If you think that’s far-fetched, consider the sports world. The point-shaving scandal in the Toledo college football program was initially caught by Las Vegas sports gamblers, who reported the program’s suspicious performance in 2005. How did they do it? With lots of data and predictive algorithms. The more open the data, the better the algorithms that catch unwanted actions can be.
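As a toy illustration of the kind of statistical check bettors might run (with made-up numbers, not the actual Toledo betting data), one can flag games whose final margin deviates wildly from the point spread:

```python
import statistics

# Hypothetical (point_spread, actual_margin) pairs for one season.
# A team favored by `spread` that wins by far less than expected
# produces a large negative residual -- the classic point-shaving pattern.
games = [(7, 9), (3, 4), (10, 12), (6, 5), (14, 2), (4, 6), (8, 10), (13, 1)]

residuals = [margin - spread for spread, margin in games]
mu = statistics.mean(residuals)
sigma = statistics.stdev(residuals)

# Flag games whose residual is an extreme outlier; the 1.5-sigma
# threshold is arbitrary here, chosen only for illustration.
flagged = [i for i, r in enumerate(residuals)
           if abs((r - mu) / sigma) > 1.5]

print(flagged)  # prints [4, 7] -- the two games that badly missed the spread
```

Real wagering models are far more sophisticated, but the principle is the same: the more (open) historical data the model sees, the tighter its baseline and the easier anomalies are to spot.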

A second roadblock is in the research sector. Large granting agencies are already calling for more, and larger, collaborations on projects. They recognize the value of the “wisdom of crowds”. But these grants still require a primary set of investigators, and where there used to be room for a junior faculty investigator seeking tenure, there is now only room for tenured, well-known faculty members as primary investigators. Additionally, the chance of becoming a first or corresponding author on these projects is slim to none for a junior faculty member. This matters because tenure, especially at major research universities, is driven almost entirely by how much a faculty member publishes as a first or corresponding author. Being a second, third, or lower-order author is of essentially zero value to these junior faculty members. So, let’s assume a junior faculty member has a small research budget cobbled together from a variety of smaller grants, has done some incredibly novel work, and wants to publish that data. That sounds great, until the researcher catches the eye of senior faculty members who are encouraging collaboration with major international teams on a broad topic of major importance to society. It’s a great opportunity and a wonderful way to build a powerful network and increase collaborative efforts. Unfortunately, the buy-in is access to the researcher’s data and relegation to a lower-order authorship. It’s a tough situation in the publish-or-perish world of academia, and most will choose to forego the partnership to ensure promotion and tenure first. Does this mean that open data is inherently bad, or that the process by which this work is evaluated is out of date? In a world where systems thinking rather than single-discipline thinking is becoming the norm, how can we integrate this into the tenure process?
If we can agree that collaboration and sharing of data and resources are the way forward, then we need to ensure our future academic leaders are not punished for adhering to this approach and driven out of academia entirely.

The last roadblock is the treatment of data as a competitive advantage in business. Joint ventures, collaborations, and transactions between businesses require all sorts of non-disclosure agreements and data protections. This ensures that whatever individual competitive advantage (or intellectual property) a business believes it has is only given away knowingly, in exchange for immediate value or greater potential value over time. I have given this topic a lot of thought over the years, and I do believe that protecting data usually comes at the expense of innovation. I also find that companies are more likely to share data “because they can” rather than “because they should”. Small and medium-sized companies, however, realize they are not in a fair fight with large, established corporations, and many countries have become renters’ economies due to market imperfections. Even if the playing field is level, the court systems are not. A large corporate partner can sue a small or medium-sized enterprise, and even if the large partner is wrong (often knowingly so), the cost of tying the smaller business up in prolonged court battles puts it under duress, often leading to a sale under market value. Given the risk and work that founders put into businesses, it seems sensible to protect data and be very careful about partnerships. That said, with the new, more socially responsible consumer (again armed with more data than ever) and the new ways a business owner can raise money to fund work (think crowdfunding and other access points), the marketplace seems to be catching up to these concerns. The world is a more open marketplace, and distribution channels are more available everywhere. Furthermore, there is not much strong evidence that the data misuse, price fixing, collusion, and other abuses consumers fear from big business are occurring at the largest tech companies.
By and large, the big five tech companies (Alphabet, Amazon, Apple, Facebook, and Microsoft) have made our lives significantly better. Beyond their core businesses, Alphabet and Amazon run cutting-edge ventures that push the boundaries of innovation, Apple continues to revolutionize personal technology, Facebook’s founder recently advocated for a universal basic income (perhaps a blog topic for another day), and Microsoft co-founder Bill Gates is using his fortune to tackle the world’s greatest humanitarian crises. In fact, Bill and Melinda Gates are arguably the most effective large-scale philanthropists in history: Rockefeller made more money (adjusted for inflation) and gave more away over time, but in terms of return on dollars invested, the Gateses are the best. And what does the Gates Foundation want? Collaboration, sharing, and openness.

Increasing productivity can be good for both society and business, raising living standards and increasing profits. The question shouldn’t be “should we be more open with data or more protective of it?” but rather “how do we subtly alter the way we live and work so that we set ourselves up, both individually and collectively, for success?”