An Open Source Database for Eclipse Chasers | by Rohit Pandey | Apr, 2024


At the risk of stating the obvious, the biggest weakness of data scientists is that they can’t practice their craft without high-quality data, and creating a high-quality dataset is far from trivial. This is the most obvious blocker to adding any kind of value through the discipline. Unlike engineering, where you can roll up your sleeves and start building on day one, a data scientist can’t do much without first having the data.

In a medium to large organization, this problem is typically addressed by investing in data engineering first: getting the data flowing so that data scientists can work on top of it and bring their skills to bear. An important feature of these datasets is that they are not static but animate. As the business churns, new data keeps flowing in, so the datasets keep evolving, and the data science products built on top of them can evolve too. This becomes a positive feedback loop: once people see the value the data science products bring, it drives further investment in data engineering and in collecting even richer data, which in turn enables more powerful data science applications, and so on.

While this story repeats many times over behind the closed doors of various organizations, I haven’t seen it unfold in the realm of open source, even though many excellent and widely used open source software projects exist. In a sense, the world of open source is lagging behind the corporate world in this dimension of data science maturity.

I’m not saying that no open source datasets exist, of course. There are many, such as MNIST (for handwritten digit recognition). But these were always meant to be static, to be used for benchmarking machine learning models. They are like statues, frozen in time. Beautiful statues, but still statues.

What I have in mind are animate, living and breathing open datasets. As a hypothetical example, imagine there were an open database where, every time anyone went grocery shopping, an entry was logged with each item they purchased, its price, the grocery outlet and its location, the date of purchase, etc. A data science application on top of this could be a recommender system that tells people where to shop given their grocery list under…
