May 2007 Early Indications: A Data Economy?
The following is based on the opening talk presented at the Center for Digital Transformation's Spring 2007 Research Forum.
I. Roughly 20 years ago, Citibank CEO Walter Wriston said, "information about money has become almost as important as money itself." Since that time, complex secondary and tertiary risk markets have grown into a massive global financial information-processing mechanism. Stocks and bonds, traded on primary markets, are hedged by futures, options, and derivatives, as well as a variety of arcane (to the public) devices such as Enron's famous special purpose entities. These instruments are nothing more than information about money, and their growth helps prove the truth and wisdom of Wriston's comment.
Data, which Stan Davis once called "information exhaust" (the byproduct of traditional business transactions), has become a means of exchange and a store of value in its own right. Hundreds or even thousands of business plans are circulating, each promising to "monetize data." While Google is the obvious poster child for this trend, many other, often less obvious, business models are premised on Wriston's core insight: information about stuff is often more valuable, or more profitable, than the stuff itself.
Internet businesses are the first that come to mind. Both Linux and eBay have captured reputational currency and developed communities premised on members' skills, trustworthiness, and other attributes. In eBay's case, these attributes are highly codified and make the business much more than a glorified classified-ad section. 7-Eleven Japan uses information about retail goods to drive new-product hypotheses in much the same way that analytical credit-card operations such as Capital One develop offers "in silico." An astounding 70 percent of the SKUs in a 7-Eleven store are new in a given year, and such innovation in a seemingly constrained market is possible only because of effective use of data.
Amazon's use of purchase and browsing data remains unsurpassed. I recently compared a generic public page -- "welcome guest!" -- to my home page, and at least eighteen different elements were customized for me. These included both "more of the same" suggestions, continuing a trend begun with a previous author or recording-artist purchase, and "we thought you might like" recommendations based on the behavior of other customers deemed similar to me. Each of the eighteen elements had a valid reason for inclusion and was a plausible purchase.
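The "customers similar to me" idea can be sketched in a few lines. This is a toy illustration only -- the purchase data is invented and Amazon's actual algorithms are proprietary and far more sophisticated -- but it shows the core mechanism: measure overlap between customers' histories, then surface items that similar customers bought.

```python
# Toy sketch of similarity-based recommendation. All purchase data here
# is hypothetical; Amazon's real systems are proprietary.

# purchase history: user -> set of item ids
purchases = {
    "alice": {"book_a", "book_b", "cd_x"},
    "bob":   {"book_a", "book_b", "cd_y"},
    "carol": {"cd_x", "cd_y"},
}

def similarity(u, v):
    """Jaccard similarity between two users' purchase sets."""
    a, b = purchases[u], purchases[v]
    return len(a & b) / len(a | b)

def recommend(user):
    """Score items the user has not bought by the similarity of their buyers."""
    scores = {}
    for other in purchases:
        if other == user:
            continue
        sim = similarity(user, other)
        for item in purchases[other] - purchases[user]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # -> ['cd_y']: both similar users own it
```

Alice shares two of four distinct items with Bob (similarity 0.5) and one with Carol (0.25), so cd_y, owned by both, tops her list.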
Another, less visible, example of this trend is the Pantone system. Information about color is almost certainly more profitable than paint or ink. Pantone has a monopoly on the precise definition of colors used in commerce, whether in advertising and branding - Barbie pink and Gap blue are omnipresent - or in production processes: every brownie baked for use in Ben & Jerry's ice cream is compared to two Pantone browns to ensure consistency. Pantone is also global: Gap blue is the same in Japan as in New Jersey, and the same on shopping bags, neon signs, and printed materials. The private company does not disclose revenues, but it is now branching out into prediction businesses, selling briefings that tell fashion, furniture, and other companies whether olive green will be popular next year.
II. A second trend crossing business, science, and other fields can colloquially be called "big data." We are seeing the growth of truly enormous data stores, which can facilitate both business decisions and analytic insights for other purposes. Some examples:
-The Netflix Prize invites members of the machine-learning community to improve the algorithms behind Netflix's "if you liked X, you might like Y" recommendations. While it is not clear that the performance benchmark needed to win the $1 million top prize can be reached incrementally, one major attraction for computer scientists is the size and richness of Netflix's test data set, the likes of which are scarce in the public domain: it consists of more than 100 million ratings from over 480,000 randomly chosen, anonymized customers on nearly 18,000 movie titles.
-Earlier this month a new effort, the Encyclopedia of Life, was launched to provide an online catalog of every species on earth. Meanwhile, geneticist Craig Venter has spent the past several years sailing around the world on a boat equipped with gene-sequencing gear. The wealth of the results is staggering: at least six million new genes have been discovered.
-The data available on a Bloomberg terminal allows complex inquiries across asset classes, financial markets, and time to be completed instantaneously. Before this tool, imagine answering a simple question using spreadsheets, paper records, multiple currencies, and optimization: "What basket of six currencies - three short and three long - delivered the best performance over the past two years?"
-The Church of Jesus Christ of Latter-day Saints has gathered genealogical records into an online repository. The International Genealogical Index database contains approximately 600 million names of deceased individuals, while the addendum to the index contains an additional 125 million names. Access is free to the public.
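A word on how the Netflix Prize above keeps score: entries are judged by root-mean-squared error (RMSE) against held-out ratings, and the grand prize requires a 10 percent improvement over Netflix's own Cinematch system. The metric itself is simple; the ratings below are made up for illustration.

```python
# RMSE, the Netflix Prize's scoring metric, on a handful of hypothetical
# predicted vs. actual 1-5 star ratings. The real test set has millions.
from math import sqrt

actual    = [4, 3, 5, 2, 4]
predicted = [3.8, 3.4, 4.2, 2.5, 4.1]

rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(round(rmse, 3))  # -> 0.469
```

Because the errors are squared before averaging, a few badly missed predictions hurt far more than many small misses -- which is part of why the last fractions of improvement are so hard to win.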
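The currency-basket question posed above is, computationally, just an exhaustive search; the hard part Bloomberg solves is having clean, aligned data across markets and time. A minimal sketch, using invented two-year returns for eight currencies:

```python
# Brute-force answer to "which three-long/three-short basket did best?"
# The returns below are hypothetical; real data is the hard part.
from itertools import combinations

returns = {  # hypothetical % return vs. USD over two years
    "EUR": 8.0, "JPY": -3.0, "GBP": 5.5, "CHF": 2.0,
    "AUD": 11.0, "CAD": 6.5, "SEK": 1.0, "NZD": 9.0,
}

def net(basket):
    """Long legs earn their return; short legs earn the negative of theirs."""
    longs, shorts = basket
    return sum(returns[c] for c in longs) - sum(returns[c] for c in shorts)

best = max(
    ((longs, shorts)
     for longs in combinations(returns, 3)
     for shorts in combinations(returns, 3)
     if not set(longs) & set(shorts)),
    key=net,
)
print(best)  # long the three best performers, short the three worst
```

With eight currencies there are only a few thousand candidate baskets; the point of the example is that the query is trivial once the data exists in one queryable place.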
In the presence of such significant data sets, various academic disciplines are debating how their fields should progress. Quantitative versus qualitative methods continue to stir spirited discussion in fields ranging from sociology to computer science. The continuing relevance of essays such as C.P. Snow's The Two Cultures and David Hollinger's "The Knower and the Artificer" testifies to the divide between competing visions of inquiry, and indeed of truth.
A fascinating question, courtesy of my colleague Steve Sawyer, concerns the nature of errors in data-rich versus data-poor disciplines. Some contend that data-rich disciplines tend to be wary of type I errors (false positives) and thus miss many opportunities by committing less visible type II errors (false negatives). Data-poor communities, meanwhile, may be unduly wedded to theories given that evidence is sparse and relatively static: in contrast to Venter's marine discoveries, historians are unlikely to get much new evidence of either Roman politics or Thomas Jefferson's.
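The type I / type II trade-off can be made concrete with a toy detector: flag "signal" whenever a noisy measurement exceeds a threshold. The numbers below are simulated, not drawn from any discipline's data, but they show the structural point -- tightening the threshold to avoid false positives necessarily produces more missed detections.

```python
# Simulated illustration of the type I / type II trade-off. A stricter
# (higher) threshold means fewer false alarms but more missed signals.
import random

random.seed(7)
noise  = [random.gauss(0.0, 1.0) for _ in range(1000)]  # no signal present
signal = [random.gauss(1.5, 1.0) for _ in range(1000)]  # signal present

def error_rates(threshold):
    false_pos = sum(x > threshold for x in noise) / len(noise)     # type I
    false_neg = sum(x <= threshold for x in signal) / len(signal)  # type II
    return false_pos, false_neg

for t in (0.5, 1.0, 2.0):
    fp, fn = error_rates(t)
    print(f"threshold {t}: type I rate {fp:.2f}, type II rate {fn:.2f}")
```

A data-rich field can afford a strict threshold and live with the misses; a data-poor field, seeing each rate poorly estimated, may not even know where its threshold sits.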
III. Given that data is clearly valuable, bad guys are finding ways to get and use it. Privacy is becoming a concern that is both widely shared and variously defined. Indeed, our commentator Lawrence Baxter, formerly a law professor at Duke, noted that defining what privacy is has proven effectively impossible. What can be defined are the violations, which leaves both law and policy in a problematic state.
Data breaches are growing both in number and in size: in the past year and a half, there have been roughly 50 episodes involving the loss of more than 100,000 records each. The mechanisms for loss range from lost backup tapes (that were not encrypted) to human error (government officials opening or publishing databases containing personally identifiable information) to unauthorized network access. In the latter category, retailer TJX lost over 45 million credit- and debit-card numbers, with the thieves, thought to be connected to Russian organized crime, gaining access through an improperly configured wireless network at a Marshalls store in Minnesota. Bad policies, architecture, and procedures compounded the network problem, to the point where TJX cannot even decrypt the files the intruders created inside its own headquarters transaction system.
Part of data's attractiveness is its scale. If an intruder wanted to steal paper records of 26 million names, as were lost by the Veterans Administration last year after a single laptop was stolen, he or she would need time, energy, and a big truck: counting filing cabinets, the records would weigh an estimated 11,000 pounds. A USB drive holding 120 gigabytes of data, meanwhile, can be as small as a 3" x 5" card and a half-inch thick.
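The scale comparison above is worth a moment of back-of-the-envelope arithmetic. Spreading a 120-gigabyte drive across 26 million records shows how absurdly roomy the digital container is; the figures below simply restate the article's numbers, with the per-record budget derived from them.

```python
# Back-of-the-envelope arithmetic for the VA-laptop scale comparison.
records = 26_000_000            # names lost by the Veterans Administration
drive_bytes = 120 * 10**9       # a 120 GB pocket-sized drive

bytes_per_record = drive_bytes // records
print(bytes_per_record)         # -> 4615: roughly 4.6 KB per record
```

A few hundred bytes suffice for a name, address, and Social Security number, so a single pocketable drive could hold that breach many times over -- which is precisely why physical intuitions about theft no longer apply.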
Redefining risk management in a data economy is proving difficult, in part because IT workers have been slow to see the value of data and to treat it accordingly. To take one notable example, the Boston Globe printed green-bar reports containing personal data on 240,000 subscribers, then recycled the office paper by using it to wrap Sunday Globes for distribution. Not surprisingly, an arms race is emerging between the bad guys, armed with tools such as phishing generators and network sniffers, and the good guys, who all too often secure the barn after the horse has run away.
IV. Who will be the winners? That is, what companies, agencies, or associations will use data most effectively? Acxiom, Amazon, American Express, and your college alumni office might come to mind, but it is so early in the game that a lot can happen. Some criteria for a potential winner, and there will of course be many, might include the following:
-Who is trusted?
-Who has the best algorithms?
-Who has, or can create, the cleanest data?
-Who stands closest to real transactions?
-Who controls the chain of custody?
-Who can scale?
-Who has the clearest value proposition?
-Who understands the noise in a given system?
-Who can exploit network externalities?
Whoever emerges at the front of the pack, the next few years are sure to be a wild ride.