Data science projects often involve developing machine learning (ML) models to solve business problems. While this may seem commonplace in business today, it still comes with several risks.
Namely, developing ML models is inherently uncertain, technically demanding, expensive, and time-consuming. These risks motivate project management frameworks designed specifically with data science projects in mind.
Here, I will describe one such approach and break down the key contributions of a project manager in this context.
The approach I like to use for data science projects is outlined by the 5-step framework illustrated below.
Digging deeper, here are a few key activities for each phase.
- Phase 0: Problem Definition & Scoping — Formulate the business problem. Design the data science solution. Define project milestones, tasks, and success metrics. Key role: Project Manager
- Phase 1: Data Acquisition, Exploration, & Preparation — Evaluate available data. Acquire and explore data. Develop data pipelines. Key roles: Data Engineer, Data Scientist
- Phase 2: Solution Development — Develop ML solution. Evaluate solution validity and value. Iterate with stakeholders and revisit past phases as needed. Key role: Data Scientist
- Phase 3: Solution Deployment — Integrate solution into real-world business context. Develop solution monitoring pipeline. Key roles: ML Engineer, Data Scientist
- Phase 4: Evaluation & Documentation — Evaluate project outcomes. Deliver technical documentation and user guides. Reflect on lessons learned and future work. Key role: Project Manager
An important point here is that data science projects often do not progress linearly through each of these phases. Rather, some amount of iteration is required through key feedback loops. Here are a few examples of what this might look like.
- Phase 1 → Phase 0: When exploring the available data, it becomes clear that key information is not available, and the project plan must be revisited.
- Phase 2 → Phase 1: After training a handful of models, it is discovered that an exception was not properly handled in data preparation.
- Phase 2 → Phase 0: Preliminary models do not demonstrate strong predictive performance, which requires reevaluating the project’s value.
- Phase 4 → Phase 0: Every project has its opportunities for improvement. Upon completion, teams can evaluate these opportunities and kick off another project, starting with Phase 0.
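For readers who think in code, here is a minimal sketch of the phases and feedback loops above as a small transition map. The names `Phase`, `FEEDBACK_LOOPS`, and `next_phases` are my own illustrative choices, not part of any library or of the framework itself, and the map only covers the four example loops listed above.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The five phases of the framework (hypothetical names for illustration)."""
    PROBLEM_DEFINITION = 0
    DATA_PREPARATION = 1
    SOLUTION_DEVELOPMENT = 2
    SOLUTION_DEPLOYMENT = 3
    EVALUATION = 4


# The example feedback loops from the list above, keyed by (from_phase, to_phase).
FEEDBACK_LOOPS = {
    (Phase.DATA_PREPARATION, Phase.PROBLEM_DEFINITION):
        "Key information is not available; revisit the project plan",
    (Phase.SOLUTION_DEVELOPMENT, Phase.DATA_PREPARATION):
        "An exception was not properly handled in data preparation",
    (Phase.SOLUTION_DEVELOPMENT, Phase.PROBLEM_DEFINITION):
        "Weak predictive performance; reevaluate the project's value",
    (Phase.EVALUATION, Phase.PROBLEM_DEFINITION):
        "Kick off a follow-up project based on opportunities for improvement",
}


def next_phases(current: Phase) -> list[Phase]:
    """Return the phases reachable from `current`.

    The forward step (if any) comes first, followed by any feedback loops
    that start at `current`.
    """
    options = []
    if current < Phase.EVALUATION:
        options.append(Phase(current + 1))
    options.extend(dst for (src, dst) in FEEDBACK_LOOPS if src == current)
    return options


# From Phase 2, the team can move forward to Phase 3 or loop back to Phase 1 or 0.
print([p.name for p in next_phases(Phase.SOLUTION_DEVELOPMENT)])
```

The point of the sketch is simply that the process is a graph rather than a straight line: most phases have a forward step and at least one way back.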
The project manager (PM) is ultimately responsible for a project’s success. If the project is late, it’s on the PM. If costs exceed estimates, it’s on the PM. If the value doesn’t meet expectations, it’s on the PM.
While this responsibility involves a diverse range of tasks from multiple contributors, one key determinant of a project’s success is the PM’s execution of Phase 0 (as described above).
Phase 0 sets the foundation of a data science project. Just as a poorly constructed foundation will result in a difficult construction project, a poorly executed Phase 0 will result in a difficult data science project.
The 3 key elements of Phase 0 are Problem Diagnosis, Solution Design, and Implementation Plan [1].
1) Problem Diagnosis
Of the 3 elements, this is the most critical because if you get it wrong, you can spend a lot of time and money solving the wrong problem (i.e., little value is generated). Despite its importance, many teams gloss over (if not skip entirely) taking the time to stop and think about the business problem.
Just as a doctor interviews a patient to produce a diagnosis, a PM interviews stakeholders to better understand the business problem and identify the root cause. Although there are many ways to do this, I like to keep things simple and focus on asking two key questions.
- What problem are you trying to solve? — this is always the best starting point for these conversations [1]
- Why is that important to the business? — this can kick off a series of "why" questions that get to the problem’s root cause (see Toyota’s 5 Whys approach) [2]
One of the PM’s most important skills is effectively collaborating with stakeholders to understand their problems. I discuss this further in a past article.
2) Solution Design
Once the business problem is clearly understood, the next step is to define how to solve it. Various solutions at various levels of complexity can address any given problem.
For instance, if customer churn is high due to a slow onboarding process, some potential solutions could be removing unnecessary onboarding steps, analyzing where drop-off occurs and reworking that step, customizing onboarding based on customer information, etc. Notice that these solutions may not require machine learning (and that’s okay).
Suppose, after extensive back-and-forth, the stakeholder wants to move forward with developing a personalized onboarding experience based on customer profiles. While this narrows things down, this solution can still be implemented in many ways. Therefore, the PM must use their judgment to propose a solution based on stakeholder conversations, similar industry projects, and available resources.
3) Implementation Plan
The final element of Phase 0 is translating the proposed solution into a concrete project implementation plan. This plan consists of two key pieces: a project roadmap and the project requirements.
A project roadmap consists of key project milestones. I like to base these milestones on Phases 1–4, as described above. Each phase consists of tasks, and each task has an assigned role (e.g., data engineer, data scientist, or ML engineer) and a due date [1].
Project requirements specify all the necessary resources for implementation, including data requirements, key roles, software tools, and compute infrastructure.
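To make the two pieces of the plan concrete, here is a minimal sketch of how a roadmap and its requirements might be captured as plain data structures. Everything in it is an assumption for illustration: the class names (`Task`, `Milestone`, `Requirements`, `ImplementationPlan`), the `unstaffed_roles` check, and the placeholder tasks and dates are hypothetical and do not come from a real project.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Task:
    description: str
    role: str   # e.g. "Data Engineer", "Data Scientist", "ML Engineer"
    due: date


@dataclass
class Milestone:
    phase: int  # 1 through 4, matching the phases described above
    name: str
    tasks: list[Task] = field(default_factory=list)

    def due_date(self) -> Optional[date]:
        """A milestone is due when its last task is due (None if no tasks)."""
        return max((t.due for t in self.tasks), default=None)


@dataclass
class Requirements:
    data: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    software: list[str] = field(default_factory=list)
    compute: list[str] = field(default_factory=list)


@dataclass
class ImplementationPlan:
    roadmap: list[Milestone]
    requirements: Requirements

    def unstaffed_roles(self) -> set[str]:
        """Roles that tasks are assigned to but the requirements do not list."""
        needed = {t.role for m in self.roadmap for t in m.tasks}
        return needed - set(self.requirements.roles)


# Placeholder plan: the tasks and dates below are made up for illustration.
plan = ImplementationPlan(
    roadmap=[
        Milestone(
            phase=1,
            name="Data Acquisition, Exploration, & Preparation",
            tasks=[
                Task("Evaluate available data", "Data Scientist", date(2030, 1, 15)),
                Task("Develop data pipelines", "Data Engineer", date(2030, 1, 31)),
            ],
        ),
    ],
    requirements=Requirements(roles=["Data Scientist"]),
)

print(plan.roadmap[0].due_date())  # latest task due date in the milestone
print(plan.unstaffed_roles())      # roles the plan needs but has not secured
```

A check like `unstaffed_roles` is one simple way to catch a mismatch between the roadmap and the requirements before the project starts, which is exactly the kind of gap Phase 0 is meant to surface.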
To solidify these ideas, I will walk through Phase 0 for an example case study. While this is meant to be instructive, it is also a real project that I will implement (and document) in future articles of this series.