The Soft Skills You Need to Succeed as a Data Scientist | by Eirik Berge | Jun, 2023

Think back on previous projects that involved a team effort. Think about the projects that failed to meet deadlines or went over budget. What is the common denominator? Is it too little hyperparameter tuning? Poor logging of model artifacts?

Probably not, right? One of the most common reasons for project failures is bad project management. Project management is responsible for breaking a project down into manageable phases. The amount of work left in each phase should then be continuously estimated.

There is a lot more than this that a dedicated project manager is responsible for, ranging from sprint execution to retrospectives. But I don’t want to focus on project management as a role; I want to focus on project management as a skill. In the same way that anyone on a team can display leadership as a skill, anyone on a team can also display project management as a skill. And boy, is this a useful skill for a data scientist.

For concreteness, let’s focus on estimating a single phase. The fact of the matter is that much data science work is very difficult to estimate:

  • How long will a data cleaning phase take? Completely depends on the data you are working with.
  • How long will an exploratory data analysis phase take? Completely depends on what you find out along the way.

You get my point. This has led many to think that estimating the duration of the phases in a data science project is pointless.

I think this is the wrong conclusion. What is more accurate is that the duration of a data science phase is difficult to estimate accurately before the phase starts. But project management works with continuous estimation. Or, at least, that is what good project management is supposed to do 😁

Imagine that, instead of estimating a data cleaning job in advance, you are one week into the task of cleaning the data. You now know that there are three data sources stored in different databases. Two of the databases lack proper documentation, while the third lacks data models but is pretty well documented. Some data is missing in all three sources, but not as much as you feared. What can you say at this point?

Certainly, you don’t have zero information. You know that you won’t finish the data cleaning job tomorrow. On the other hand, you are quite sure that three months is way too long for this job. Hence you have a kind of distribution giving the probability of when the phase will be finished. This distribution has a “mean” (a guess for the duration of the phase) and a “standard deviation” (the amount of uncertainty in that guess).

The important point is that this conceptual distribution changes every day. You get more and more information about the work that needs to be done. Naturally, the “standard deviation” will shrink over time as you become more and more certain of when the phase will be finished. It is your job to quantify this information for stakeholders. And don’t use the distribution language I’ve used when explaining this to stakeholders; that can stay between us.
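
To make this concrete, here is a minimal Python sketch. The normal distribution, the 90% interval, and every number in it are assumptions I’ve made up for illustration; the point is only to show how a (mean, standard deviation) pair translates into the kind of range you can report, and how that range tightens week by week:

    # z-score covering roughly 90% of a normal distribution
    Z90 = 1.645

    def estimate_range(mean_weeks: float, std_weeks: float, z: float = Z90) -> tuple[float, float]:
        """Turn a (mean, std) duration estimate into a low/high range in weeks."""
        low = max(mean_weeks - z * std_weeks, 0.0)  # a phase can't take negative time
        high = mean_weeks + z * std_weeks
        return low, high

    # Hypothetical week-by-week updates for a data cleaning phase:
    # the mean shifts a little as you learn more, and the "standard
    # deviation" shrinks as the remaining work becomes clearer.
    weekly_estimates = [
        (6.0, 2.5),  # before starting: a wide guess
        (4.5, 1.0),  # after week 1: three sources found, docs partly missing
        (4.5, 0.5),  # after week 2: the scope is much clearer
    ]

    for week, (mean, std) in enumerate(weekly_estimates):
        low, high = estimate_range(mean, std)
        print(f"Week {week}: likely between {low:.0f} and {high:.0f} weeks in total")

Running this, the week 1 estimate comes out as roughly 3 to 6 weeks, which is exactly the kind of range in the statement below.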

Having a data scientist who can say something like this is super valuable:

“I think this phase will take between 3 and 6 weeks. I can give you an updated estimate in a week that will be more accurate.”
