I often feel that we are starting to live a life that was pure science fiction just a few years ago. What better example than Amazon Echo, which just learned to wake up to “computer”, like in Star Trek? The six degrees of separation concept was likewise described in a 1929 short story by Frigyes Karinthy, a couple of years before Moreno started to study social networks. On top of that, my favourite science fiction novelist, Isaac Asimov, predicted the explosion of data science in his book, Foundation.
The entire book is built around predicting the future using data and statistics. This branch of science is called “psychohistory”, which is basically about projecting the fate of humanity. The book is full of hints and principles on how this science can and should be used. Frankly, most of these principles apply to data science today, so let’s dig in.
You need a huge amount of data for reliable predictions.
Even the definition of psychohistory reinforces one of the key points of data science: you need lots of data. According to the book, the first theorem of Hari Seldon, the inventor of psychohistory, even describes how to properly size the data required.
Gaal Dornick, using nonmathematical concepts, has defined psychohistory to be that branch of mathematics which deals with the reactions of human conglomerates to fixed social and economic stimuli…
Implicit in all these definitions is the assumption that the human conglomerate being dealt with is sufficiently large for valid statistical treatment. The necessary size of such a conglomerate may be determined by Seldon’s First Theorem which…
Psychohistory aims at predicting the behaviour of crowds, and its analysis is valid only for the masses. The same is true in data science. Just imagine creating a churn score: you need a pretty large sample to get any prediction working.
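The effect of sample size can be illustrated with a small simulation. This is only a sketch under assumed numbers: the 10% churn rate, the sample sizes and the function name `estimate_spread` are hypothetical and come from neither the book nor any real dataset. It repeatedly draws samples of customers and measures how much the estimated churn rate wobbles from one sample to the next.

```python
import random
import statistics


def estimate_spread(true_rate: float, sample_size: int,
                    repeats: int = 200, seed: int = 42) -> float:
    """Standard deviation of the churn-rate estimate across repeated samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    estimates = []
    for _ in range(repeats):
        # Each simulated customer churns with probability true_rate.
        churned = sum(rng.random() < true_rate for _ in range(sample_size))
        estimates.append(churned / sample_size)
    return statistics.stdev(estimates)


# Assumed 10% churn rate (hypothetical); larger samples give steadier estimates.
for n in (50, 500, 5000):
    print(n, round(estimate_spread(0.10, n), 4))
```

The spread should shrink roughly with the square root of the sample size, which is why a churn score built on a handful of customers is hard to trust.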
The amount of data mandates the use of a computer. Manual calculation is impractical at best.
We all know that data mining and data science require computers. The sheer amount of data to be processed makes manual calculation impractical. If we add other factors from the 4Vs, like variety and velocity, this requirement becomes even more important.
Seldon removed his calculator pad from the pouch at his belt. Men said he kept one beneath his pillow for use in moments of wakefulness. Its gray, glossy finish was slightly worn by use. Seldon’s nimble fingers, spotted now with age, played along the files and rows of buttons that filled its surface. Red symbols glowed out from the upper tier.
The use of computers for these kinds of calculations seems obvious today. At the time the book was published, though, the situation was quite different: there were no PCs, only a few industrial computers, so you can consider this a pretty bold prediction.
Simple models can be improved by adding more fields.
The simplest way to improve the performance of a model is usually to add more variables. At least for low-complexity models, this tends to hold true. It is also common in data science to create merged or derived variables that further support the analysis.
He said, “That represents the condition of the Empire at present.”
Gaal said finally, “Surely that is not a complete representation.”
“No, not complete,” said Seldon. “I am glad you do not accept my word blindly. However, this is an approximation which will serve to demonstrate the proposition. Will you accept that?”
“Subject to my later verification of the derivation of the function, yes.” Gaal was carefully avoiding a possible trap.
“Good. Add to this the known probability of Imperial assassination, viceregal revolt, the contemporary recurrence of periods of economic depression, the declining rate of planetary explorations, the. . .”
He proceeded. As each item was mentioned, new symbols sprang to life at his touch, and melted into the basic function which expanded and changed.
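Seldon’s way of folding new terms into the basic function has a modest everyday counterpart. The sketch below uses made-up numbers and hypothetical names (`tenure`, `spend`, `fit_line`); it compares a model with no variables, which always predicts the average, against one with a single added variable, using plain least squares.

```python
def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_line(x, y):
    """Least-squares slope and intercept for one explanatory variable."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Hypothetical data: monthly spend against months as a customer.
tenure = [1, 2, 3, 4, 5, 6, 7, 8]
spend = [12.0, 15.5, 17.0, 22.5, 24.0, 29.5, 31.0, 36.5]

# Model with no variables: always predict the average spend.
baseline = [sum(spend) / len(spend)] * len(spend)

# Model with one added variable: tenure.
slope, intercept = fit_line(tenure, spend)
with_tenure = [intercept + slope * t for t in tenure]

mse_baseline = mse(spend, baseline)
mse_with_tenure = mse(spend, with_tenure)
print(mse_baseline, mse_with_tenure)
```

The error here is measured on the same data the line was fitted to, so it can only go down when a variable is added. A fair comparison would check the error on data the model has not seen.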
Results of predictions are not yes or no, but given in percentages.
You will never hear a data scientist say that something will or will not happen. The answer is always a probability. The results of even the best predictions are never definite, just highly probable.
“It will end well; almost certainly so for the project; and with reasonable probability for you.”
“What are the figures?” demanded Gaal.
“For the project, over 99.9%.”
“And for myself?”
“I am instructed that this probability is 77.2%.”
“Then I’ve got better than one chance in five of being sentenced to prison or to death.”
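Gaal’s quick arithmetic is easy to check. The snippet below is a trivial sketch; the function name `chance_of_failure` is made up for illustration, and only the 77.2% figure comes from the quote.

```python
def chance_of_failure(p_success: float) -> float:
    """Probability of the unfavourable outcome, given the favourable one."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 1.0 - p_success


p_bad = chance_of_failure(0.772)  # 77.2% is the figure quoted to Gaal
print(f"{p_bad:.1%}")             # 22.8%
print(p_bad > 1 / 5)              # True: better than one chance in five
```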
Use confidence intervals.
There is always uncertainty involved in predictions. In statistics this is expressed through the confidence interval. While it is already common language in science, I’m curious when it will find its way into casual discussions, like in the book.
Within another half year he would have been here and the odds would have been stupendously against us – 96.3 plus or minus 0.05% to be exact. We have spent considerable time analyzing the forces that stopped him.
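For a feel of how such a “plus or minus” figure arises in practice, here is a minimal sketch of one common method, the Wilson score interval for a proportion. It is not how the book’s figure was derived, and the counts (963 favourable outcomes out of 1000) are invented purely for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% confidence."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 963 favourable outcomes in 1000 observations.
low, high = wilson_interval(963, 1000)
print(f"{low:.3f} to {high:.3f}")
```

The interval narrows as the number of observations grows, which ties back to the first principle: more data, tighter estimates.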
Predictions on single individuals are much less reliable.
This basically goes back to principle number one: if you do not have enough data, it is very hard to predict. Getting a lot of data, in statistical terms, on a single individual is almost impossible. The best approach is to fit known patterns to the person’s behaviour, but even then you have not accounted for the will of the individual. Simply put, you will not really have enough (or diverse enough) data to predict at the level of a single individual.
Seldon said, “I’ll be honest. I don’t know. It depends on the Chief Commissioner. I have studied him for years. I have tried to analyze his workings, but you know how risky it is to introduce the vagaries of an individual in the psychohistoric equations. Yet I have hopes.”
The near future is more predictable than longer time horizons.
The longer the time horizon, the higher the uncertainty. It is common sense, but it holds in data science too: predictions far into the future tend to be further off. Something comes up, and other variables (or people) start unexpectedly influencing the outcome of events. Long-term predictions soon become a combination of probabilities, which limits the accuracy of models. Oh, and yes, this can even prove the mighty Hari Seldon wrong.
I am Hari Seldon! I do not know if anyone is here at all by mere sense-perception but that is unimportant. I have few fears as yet of a breakdown in the Plan. For the first three centuries the percentage probability of nondeviation is nine-four point two.
Seldon is off his rocker. He’s got the wrong crisis. […] Then the Mule is an added feature, unprepared for in Seldon’s psychohistory.
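The way uncertainty compounds over a long horizon can be sketched with simple arithmetic. The example assumes, purely for illustration, that each step of a forecast independently stays on track with the same probability; real forecasts are rarely that tidy, and the 0.98 figure and the function name `on_track_probability` are made up.

```python
def on_track_probability(per_step: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps goes as predicted."""
    if not 0.0 <= per_step <= 1.0 or steps < 0:
        raise ValueError("need 0 <= per_step <= 1 and steps >= 0")
    return per_step ** steps


# Assumed 98% reliability per step (hypothetical), over growing horizons.
for steps in (1, 10, 50):
    print(steps, round(on_track_probability(0.98, steps), 3))
```

Even a very reliable single step leaves little certainty once enough of them are chained together, and that is before anything like the Mule shows up outside the model.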
Despite so many data science principles being written down in Foundation, we are not yet where Asimov dreamt us to be. We are still unable to predict history on the scale Hari Seldon did, but who knows, maybe some day we will! Until then, Foundation remains one of my favourite books, and not only because it foresaw the data science “age”.