Data Science – Till Date

This post presents the key points (as I see them) from the article “50 Years of Data Science” by David Donoho (version 1.00, September 18, 2015).

Research in data science is moving to a meta-level, where we can ask a question like:
If we use such-and-such a method across all of science, how much will the global science-wide result improve?


Improving the science done with data has become a bigger challenge than improving the data itself. We are heading toward a generation in which all of science itself will become data that can be mined. Data science has become another approach to scientific discovery, alongside experimentation, modeling, and computation.


Tukey identified four driving forces in the new science:

  • The formal theories of statistics
  • Accelerating developments in computers and display devices
  • The challenge, in many fields, of more and ever larger bodies of data
  • The emphasis on quantification in an ever wider variety of disciplines


“To sell suckers, one uses deceit and offers glamour,” wrote John Pierce of Bell Labs in 1969 in a discussion of speech recognition. It was a warning against “mad inventors or untrustworthy engineers” who weren’t using a scientific approach.


The fields where machine learning has scored successes are essentially those where the Common Task Framework (CTF) has been applied systematically. Machine learning is a combination of the predictive modeling culture and the CTF.
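
To make the CTF idea concrete, here is a minimal sketch in Python. It assumes the usual three ingredients: a shared training set, competitors who each submit a prediction rule, and a referee who scores every rule on test data the competitors never see. The synthetic data, the two toy competitors, and all names (make_data, referee_score, fit_threshold, leaderboard) are hypothetical and only there for illustration.

```python
import random

def make_data(n, seed):
    """Synthetic (x, label) pairs: label is 1 when x > 0.5, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:
            label = 1 - label  # flip the label to simulate noise
        data.append((x, label))
    return data

TRAIN = make_data(200, seed=0)         # shared with every competitor
_HIDDEN_TEST = make_data(200, seed=1)  # held by the referee only

def referee_score(rule):
    """Accuracy of a submitted prediction rule on the hidden test set."""
    correct = sum(rule(x) == y for x, y in _HIDDEN_TEST)
    return correct / len(_HIDDEN_TEST)

# Competitor 1: ignores the training data and always predicts 1.
def always_one(x):
    return 1

# Competitor 2: picks the threshold with the best training accuracy.
def fit_threshold(train):
    candidates = [i / 20 for i in range(21)]

    def train_accuracy(t):
        return sum(int(x > t) == y for x, y in train) / len(train)

    best = max(candidates, key=train_accuracy)
    return lambda x: int(x > best)

leaderboard = {
    "always_one": referee_score(always_one),
    "threshold": referee_score(fit_threshold(TRAIN)),
}

for name, score in sorted(leaderboard.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

The point of the design is that competitors only ever see TRAIN; the single number returned by the referee is what makes different methods directly comparable.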


Data scientist: a person who is better at statistics than any software engineer and better at software engineering than any statistician.


Data scientists need to develop packages that abstract commonly used pieces of workflow and make them available for use in future projects. Research in this area is producing new workflows almost every day.


Just as metadata is data about data, we have reached a generation of research that is working toward a science of science: metascience. Data has been mined for many years already. The science built on it has grown so large that we now have enough science to mine it in turn and find the hidden science in it.


Data scientists can’t blindly churn out methodology without showing concern for the results achieved in practice.


This becomes possible as science becomes increasingly mineable. Instead of deriving optimal procedures under idealized assumptions within mathematical models, we will rigorously measure performance by empirical methods on the entire scientific literature or a relevant subset of it.


The generative modeling culture seeks to develop stochastic models that fit the data and then make inferences about the data-generating mechanism based on the structure of those models.
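
As a rough illustration of the generative approach, the Python sketch below assumes a simple stochastic model, y = intercept + slope * x + Gaussian noise, fits it by least squares, and then makes an inference about the mechanism itself (an approximate 95% interval for the slope). The synthetic data, the chosen model, and the name fit_line are assumptions made for this example, not anything taken from the article.

```python
import math
import random

rng = random.Random(42)

# Synthetic data from an assumed mechanism: y = 2.0 + 0.5 * x + Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus the slope's standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)  # noise variance estimate
    se_slope = math.sqrt(sigma2 / sxx)
    return intercept, slope, se_slope

intercept, slope, se_slope = fit_line(xs, ys)

# Inference about the mechanism: an approximate 95% interval for the slope.
low, high = slope - 1.96 * se_slope, slope + 1.96 * se_slope
print(f"slope = {slope:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

Note that the output of interest is a statement about the model's parameters, not a prediction score.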

The predictive modeling culture prioritizes prediction. It is effectively silent about the underlying mechanism generating the data and allows for many different predictive algorithms, preferring to discuss only the accuracy of predictions made by different algorithms on various data sets.
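
A small Python sketch of the predictive style, under stated assumptions: two toy algorithms (a mean predictor and a 1-nearest-neighbour predictor) are compared purely by held-out mean squared error on two synthetic data sets. The data sets, the algorithms, and every name here (make_dataset, mean_predictor, nearest_neighbor, results) are hypothetical choices for illustration.

```python
import math
import random

def make_dataset(f, n, seed):
    """Synthetic data y = f(x) + Gaussian noise, split into train and test halves."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        points.append((x, f(x) + rng.gauss(0, 0.1)))
    return points[: n // 2], points[n // 2:]

def mean_predictor(train):
    """Always predict the average training response."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def nearest_neighbor(train):
    """Predict the response of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict, test):
    """Mean squared prediction error on held-out data."""
    return sum((predict(x) - y) ** 2 for x, y in test) / len(test)

DATASETS = {
    "linear": make_dataset(lambda x: 2 * x, 200, seed=0),
    "sine": make_dataset(lambda x: math.sin(6 * x), 200, seed=1),
}
ALGORITHMS = {"mean": mean_predictor, "nearest_neighbor": nearest_neighbor}

# One held-out error per (data set, algorithm) pair; nothing else is reported.
results = {
    name: {algo: mse(fit(train), test) for algo, fit in ALGORITHMS.items()}
    for name, (train, test) in DATASETS.items()
}

for name, scores in results.items():
    print(name, {algo: round(score, 3) for algo, score in scores.items()})
```

Neither algorithm says anything about how the data were generated; the table of errors is the whole comparison.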
