At the risk of stating the obvious, the biggest weakness of a data scientist is that they can’t practice their craft without high-quality data, and creating a high-quality dataset isn’t exactly trivial. This is the most obvious blocker to adding any kind of value via this discipline. Unlike engineering, where you can roll up your sleeves and start building on day one, a data scientist can’t do much without first having the data.
In a medium to large organization, this problem is typically addressed by investing in data engineering first and getting the data flowing, so that data scientists can then work on top of it and bring their skills to bear. An important feature of these datasets is that they are not static but animate: as the business churns, data keeps flowing in and the datasets keep evolving. The data science products built on top of them can then evolve too. This becomes a positive feedback loop: once people see the value the data science products bring, it drives further investment in data engineering and the collection of even richer data, which in turn enables more powerful data science applications, and so on.
While this story repeats many times over behind the closed doors of various organizations, I haven’t seen it unfold in the realm of open source, even though many excellent and widely used open source software projects exist. In a sense, the world of open source is lagging behind the corporate world in this dimension of data science maturity.
I’m not saying that no open source datasets exist, of course. There are many, like MNIST (for handwritten digit recognition). But these were always meant to be static, to be used for benchmarking machine learning models. They are like statues, frozen in time. Beautiful statues, but still statues.
What I have in mind are animate, living and breathing open datasets. As a hypothetical example, imagine there were an open database where, every time anyone went grocery shopping, an entry was logged for each item they purchased, with its price, the grocery outlet and its location, the date of purchase, etc. A data science application on top of this could be a recommender system that tells people where to shop given their grocery list under…