Note to readers: I originally wrote this piece in May of 2014. I was reminded of it today when a colleague from work suggested I look at the Spurious Correlations website, which I hadn’t checked out in a while. While we don’t often call it “Big Data” anymore, having moved on to the new jargon of “AI”, the article reads pretty much as if four years haven’t passed. So I thought I’d bring back the post, with very minor edits (including this hilarious CB Insights chart about how mentions of “AI” have dwarfed those of “big data”), and remind you why you should be worried about bananas.
While I was wasting time looking at random Facebook posts the other day, I came across what might possibly be the most entertaining tool of all time, a website called Spurious Correlations. On this website you can pick from a myriad of variables and see how they are correlated with other random variables. The idea, of course, is to point out the ridiculous. For instance, I learned that there is an almost perfect correlation between 2000-2009 US spending on science, space and technology and spending on pets. My friend Bill played with the program and noted that it reinforced his hunch that cutting down on margarine consumption significantly improved your chances of staying married in Maine, to which my friend Calvin responded that, more likely, there are lots of very sad Mainers consuming full tubs of margarine and, well, that is not good for marriage.
And therein lies the rub, of course. Correlation is not the same as information and having a lot of data doesn’t necessarily tell you anything worthwhile without a lot of further analysis. Reaching back to grad school statistics (with a generous helping of Wikipedia refresher), the term “spurious correlation” was actually coined by mathematician and data scientist Karl Pearson.
Good old Pearson, born 1857, was considered one of the first biometricians. If he were alive today and saw the mass amount of biometric data being collected through FitBits and Basis Watches and weird wearables and Withings scales and other tech gizmos, he would probably get a head start on digging a grave in which to start spinning. The health and fitness industry has given a whole new life to the term spurious correlation by making every person who can walk and chew gum at the same time an armchair bioinformaticist. Note, walking and chewing gum at the same time may be correlated with increased calorie burn. Or better sleeping. Or the ability to wear two entirely uncorrelated wearables at the same time.
And yet every healthcare startup that enters my office, stands on a pitch day stage or markets their idea to a customer has a (now) obligatory slide in their deck about the massive amount of data which their product will generate and thus the promise that said product will generate a wealth of knowledge (can you hear the angels singing?) for which people will line up to pay. And what no one ever says is that data is not the same as information, much less knowledge, and it will take a lot of patient volume, detailed analytics, hard work and, god help me, math that is above my pay grade, to make that data of any use to anyone. Assuming it is of use at all, to anyone, that is.
I am concerned that much of the data collected in these products and the manner in which they are used could actually lead to some really bad decision-making. To illustrate my point, I offer you this from my jaunt through the Spurious Correlations website: the high correlation between deaths from heart catheterization (e.g., angioplasty) and the amount of US crude oil imports from Saudi Arabia.
From looking at the chart below, one might think that reducing crude oil imports should be a high priority for FDA regulators, as it might just lead to fewer people coding on the table during critical heart procedures. While cutting crude oil imports and giving everyone free Teslas is probably a great idea from the standpoint of reducing global warming, I am guessing that the negative side effects of heart catheterization would actually continue at their normal rate.
Another example from my new friend, Spurious Correlations: there is a .83 correlation between the cost of bananas and the number of people who died by becoming tangled in their bed sheets. Does this mean we need a new wearable that detects banana pricing when you enter Safeway and, when such pricing peaks, gives you an alarm at night when your sheets become untucked? Perhaps Apple has already built this into the new health product yet to launch? Maybe a cobranding deal with Chiquita Banana and Bed Bath & Beyond is in the works? The possibilities are endless, but the useful information is not.
FYI, and while we are at it, it turns out that the cost of bananas is also highly correlated (.80) with the number of people killed by immunosuppressive agents. Someone, alert the FDA. Bananas’ predictive ability is obviously something that should qualify them for 510(k) oversight and integration into Meaningful Use 3.
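For anyone who wants to see how little it takes to manufacture a banana-grade correlation, here is a minimal Python sketch. It is not how the Spurious Correlations site works; it simply assumes 40 made-up "annual statistics" that drift at random over 10 years (the names `random_walk` and `strongest_pair`, the seed, and the series counts are all my own illustrative choices) and then searches every pair for the highest Pearson correlation.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, length):
    """A made-up 'annual statistic': each year drifts randomly from the last."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        series.append(value)
    return series


def strongest_pair(num_series=40, years=10, seed=2014):
    """Return (|r|, i, j) for the most correlated pair of unrelated series."""
    rng = random.Random(seed)
    # Every series is generated independently, so any correlation is pure chance.
    data = [random_walk(rng, years) for _ in range(num_series)]
    return max(
        (abs(pearson(data[i], data[j])), i, j)
        for i, j in itertools.combinations(range(num_series), 2)
    )


if __name__ == "__main__":
    r, i, j = strongest_pair()
    print(f"Most correlated unrelated pair: series {i} vs {j}, |r| = {r:.2f}")
```

With only ten data points per series and hundreds of pairs to choose from, the winning pair will typically look at least as impressive as bananas and bedsheets, even though nothing in the data is related to anything else. That is the whole trick: look at enough short, drifting series and something will always line up.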
I was enjoying the endless possibilities of Spurious Correlations when I coincidentally happened to read an article titled “VC Names Robot to Board of Directors.” I think it was just coincidence or correlation and not causation, anyway. The article’s headline made me think that a reporter must have wandered into a board room and mistaken for robots the 10 people who undoubtedly all looked alike in their khakis and blue shirts and brown hair with side parts, sort of like Austin Powers’ Fembots but the board room (aka male) version. But what I found out upon reading the article is this: “Aging Analytics UK, a company that conducts research on biotechnology and regenerative medicine, made two announcements this morning: first, that they’ve launched a new A.I. tool called VITAL (Validating Investment Tool for Advancing Life Sciences); and second, that they’ve licensed VITAL to Hong Kong V.C. firm Deep Knowledge Ventures, where the tool will become an ‘equal member of its Board of Directors.’”
So yikes. First of all, here is a big data engine for biotechnology discovery that will hopefully do more than correlate banana pricing with avoidable methods of inducing death. But what is particularly interesting (scary?) is that a VC firm is going all in on the big data thing to predict positive outcome on investments in biotechnology. A partner at the firm was quoted in the article as saying, “We were attracted to a software tool that could in large part automate due diligence and use historical data-sets to uncover trends that are not immediately obvious to humans that are surveying top-line data.”
Now, Deep Knowledge Ventures’ spokesman does go on to say that humans will still participate in the investment decisions (fyi, for those confused, he is using the word “humans” as a synonym for “VCs”) and that the best decisions will be based on a combination of data and intuition, but I wonder. It is very difficult for people to say yes in the face of large reams of data that say no, even when the data is of dubious provenance or tells a story that might be like the one that equates banana pricing and risk of bedsheet strangulation. It takes a lot of intestinal fortitude for humans to disregard data, which is, in many ways, exactly what VCs are supposed to be doing. It is hard to make transformational investments when you spend your time looking at historical data, as it often points you in the direction of either history or, worse, incrementalism. To disrupt history you have to decide that facts are somewhat inconsequential and take a leap of faith. Overreliance on data, big or otherwise, makes faith very inconvenient.
Humans are humans (even when they’re VCs), and often they can’t help but follow the data, even when the data leads them down the wrong-colored brick road. When asked whether their robot partner would be incorporated into board of director meetings, the folks at Deep Knowledge Ventures said, “… investors will firstly discuss the analytical reviews made by VITAL (aka the big data robot engine). All the decisions on investing will be made strictly after VITAL provides its data. We say that VITAL has been acknowledged as an equal member of the board of directors, because its opinion (actually, the analysis) will be considered as probably the most important one. So basically yes, it will be incorporated into meetings.”
And thus, according to me, there is a huge risk that intuition is left by the wayside.
In medicine in particular it is essential to keep a balance between data and intuition (the patient’s and the provider’s). Without data demonstrating that it is probably not effective, we would still be bleeding people to treat them. Clearly there has been a vast world of data created that, when properly analyzed and viewed in proper context, tells us better ways to take care of people. This is the whole premise behind IBM’s Watson program. We tend to call the results of this analysis “evidence-based medicine” and try to codify guidelines to ensure that everyone is using the highest, best level of knowledge to treat certain common conditions, such as heart disease and asthma. When we get to even more complicated diseases, such as cancer in its many forms, it seems that there can never be enough data to help us determine the best course of treatment or the optimal mix of drugs and chemo and radiation. On the other hand, all of the data in the world can’t tell you what the patient thinks is in his or her own best interest; that is where the human touch is still needed. The data may tell you to use a particular chemo agent to treat a particular cancer, and the patient may tell you that they would rather not bear the miserable side effects of that intervention and will take their chances with something else.
So we now sit at an interesting crossroads where we are strapping people to sensors and collecting all the data in the world and, at the same time, still trying to figure out what the heck to do with all that data while being stared at by people who think that all data should be monetized. And by “people” I mean those who are asking where is the revenue that should have come in today from all that data that is clogging up the building, or at least the clouds overhead.
And yet we know for a fact (from the data) that most of the sensors are only sort of accurate at this stage of the game and only a few are highly accurate. We know that context for the data is essential but rarely available. We know that we need a mountain of clean data to establish good evidence of cause and effect and to help us discern which way to send a patient through the healthcare maze, and that the only way to get that is to manually clean the data. And, because of this, we know that much of the data now being collected has the same predictive value as banana pricing.
We are going to have to do a lot better than the evidence-based guideline of “beware of your bedsheets” if we are going to improve medicine in a meaningful way through mass data aggregation and analysis. Let’s hope our new robot overlords will provide enough venture backing to let entrepreneurs find the knowledge in the data and overcome the scourge of inflationary banana pricing.