From MSc thesis to Lead Data Scientist

4 years of mistakes, learnings, challenges and successes

Finally, after some months of reflection, it seems I have managed to organize my thoughts about writing online. Encouraged by a friend (who nailed some of his research work before finishing his PhD) and inspired by other AI tech writers such as Chip Huyen, Charity Majors, and my former colleague Sofian Hamiti, I believe it is about time to add my two cents to the community, as becoming a lead data scientist is not something you usually achieve entirely on your own.

What will you find below?

Today my plan is to give you a 360° view of how I became a lead data scientist. At least, that's what people inside and outside my company started to call me after my 3rd year at Siemens Energy. You might find this blog a bit redundant, as rivers of digital ink have already been spilled on the Internet in similar posts. There are even Reddit discussions, so your idea of this post could be similar to this:

If that is the case for you, feel free to leave. However, if you are still curious about how someone can go from an MSc thesis to a technical leading position in such a short time, it might be worth staying. My main goal is to offer an insider's perspective and to show that there is nothing magical about it. Besides, it could help you to know that this is not only about hard work, hard self-studying hours, and some TED-talk guru advice. Most of the success I had was due to a sequence of opportunities, sometimes the result of detailed planning, sometimes just luck. In short, the team I was working with managed to squeeze far more out of those opportunities than anyone could reasonably expect. However, for some months it felt like people were looking at us as the Fellowship of the Ring in the last film:

(GIF: Gimli celebrating, from The Lord of the Rings)

Fortunately, we never stopped believing in ourselves, and we always knew that despite having challenging tasks, all of them were manageable. We were shooting for Mars, knowing that our objective was the Moon. On top of that, I had the extreme luck of having the greatest mentor you could ask for in your first industry job.

To narrate my journey, I could not find a better analogy than the Hype Curve. Without further introduction let’s start from the beginning.

The Hype Curve – a chart popularized by Gartner

The beginning (technology trigger)

I started my journey at my previous company in February 2018. I was full of joy because I had finally managed to get thesis work outside my study area (Energy Systems). Since 2017 I had known that a traditional energy engineering job was not something I was going to enjoy. Who would have guessed, after studying the same laws more than 10 times and being forced to use Excel for everything ;). Throughout my MSc studies I found the Coursera Data Science Specialization much more interesting than most of my subjects. The world of what they called “data science” was revealed to me, and I found it astonishing.

My first week was great and the onboarding was fairly smooth. My thesis supervisor at the company filled one of those whiteboards you would expect to see in an HR campaign for a company trying to attract new talent. The possibilities for the project I could develop were endless. In fact, over the next years this project became one of the most important AI-driven products of my division, after the successful contributions of many clever people [1][2][3].

Time flew, and after 6 months I finished my thesis and got a contract extension to work on diverse topics. I didn't do ML for a while, but I got exposed to some basic cloud technologies (EC2, S3, and Redshift). The previous data scientist there, as well as a data engineer colleague, supported me very well. I built some ETL scripts and started to become a bit better at coding (using loggers, parametrizing scripts, writing CLIs, connecting to DBs, etc.). It was a bit stressful not to have many people to ask around, especially after both of the data scientists in my team left, but I managed to embrace it. This whole tech world was amazing, and I was able to do things I couldn't have imagined one year before.

Besides, I had some “fun” managing dependencies on EC2; lucky me that the big whale (Docker) saved me some months later.

Life as data scientist (peak of inflated expectations)

Around March–May 2019 I had my first customer-facing meeting and my first business trip. I was a bit nervous because I did not know what was expected of me. It was also my first time going on a project with people I had never met before. We spent 1 week in total, with 2 days at the customer site. The customer's head of analytics was probably one of the smartest guys I have ever met (the guy had 30+ patents and a PhD), and despite being an engineer, he was training himself with an MSc in Data Science. Moreover, the customer had its own DS team, and they were setting up their own cloud environment. Their level of maturity was something I did not expect at the time. My colleagues, who had been in other customer-facing meetings, had the same impression, as our customers were usually at the beginning of their ML journey.

I will always remember my noisy laptop burning out (at that time we didn't have SEER) in the hotel room while the models were being trained. The week was very stressful: I drank a lot of coffee and slept very little. Nevertheless, after seeing the front-end PoC that a colleague pulled together in 2 days, as well as the cloud infra and the predictions uploaded to the DB by another colleague, I was very relieved.

As you might expect, we managed to succeed. We developed a prototype in 1 week to detect anomalies in the vibration signals of the customer's equipment using a combination of unsupervised and supervised learning. The customer was happy, the team was happy, everyone was happy!

After that, my permanent contract approval that had been pending (or blocked, you never know) was sorted out, and I got permanently hired!

This happened at the same time that some students started in my team. I had the task of supervising their MSc thesis projects in collaboration with my mentor (who was my former thesis supervisor). By that time I realized how important it is to take care of new people. If I had had a different mentor when I started, there would have been no chance of reaching that far. The fact that someone much more senior than you in the business spends time explaining the context to you, the vision of the company, how the company makes money, what the current pain points are, what the potential use cases for your skills are, etc. is absolutely critical. If you think that a great mentor is the one who reviews your code from A to Z, or tells you which model to use, I will have to disagree. I am not saying this is not important (because it is), but it is just a bonus.

In any case, I tried to take care of these 3 students in the best way I could. Two of them got an A as their thesis grade, so I think we managed well (sorry Dhruv if you are not in the pictures, but I did not manage to get yours).

Summer arrived and my mentor got the chance to present at our most important internal DS conference. He was kind enough to let me join him on stage. I realized that he and his projects were well known in the company, and I also discovered that people were doing cool things with AI outside my division. Thanks to this I was able to start expanding my network, get fresh ideas, and realize that we were quite advanced for our team size.

MLOps needs (valley of delusion)

Around November 2019 we started our first full-scale machine learning project. After having consolidated 3 big data sources, we had sufficient data to predict some critical KPIs in order to plan the maintenance schedule of our gas turbines. The problem was that we had more than 4,000 time series, and because of the characteristics of the data we did not manage to build a global model. In all of the benchmarks I performed, individual time series models performed much better than a global one (disclaimer: now I am confident I would manage to outperform my previous experiments).

The solution seemed pretty obvious, though. After watching a couple of online presentations, looking at the fable package, and seeing what other companies were doing, 4,000 time series didn't look like that much. Adding the hyperparameter tuning jobs and the different algorithms, we were ending up with about 100k training jobs.
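To give a feel for where a number like 100k comes from, here is a tiny sketch of the job fan-out. The algorithm names and trial counts are made up for illustration; the real experiment used different candidates, but the arithmetic is the same Cartesian product:

```python
from itertools import product

# Hypothetical numbers: 4,000 individual time series, a few candidate
# algorithms, and a small hyperparameter search per algorithm.
n_series = 4000
algorithms = ["ets", "arima", "gbm"]   # assumed candidates, not the real list
trials_per_algorithm = 8               # assumed tuning trials per algorithm

# One training job per (series, algorithm, trial) combination.
jobs = list(product(range(n_series), algorithms, range(trials_per_algorithm)))

print(len(jobs))  # 4000 * 3 * 8 = 96000 — on the order of 100k jobs
```

With per-series models, the job count scales linearly in the number of series, which is exactly why on-demand elastic compute (rather than a fixed EC2 box) becomes a necessity.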

Everything sounded very reasonable until I said that I needed a certain amount of compute to run this experiment. Of course, the classical EC2 setup we were using was not an option, as we needed elastic compute on demand. By that time I was also absorbed in the MLOps trend, and I was familiar with some tooling to build strong MLOps foundations (MLflow, the new Metaflow, Kubeflow, TFX, etc.). I had a good picture of what we needed in order to log all of these experiments, orchestrate the workflows, and persist our models. However, after stating my needs, I received something very different from what I needed.

I remember this time as a very painful one. A colleague and I had to spend around 2–3 months testing something that I knew from day one had absolutely zero chance of working. I saw it so clearly that I was astonished other people did not. It was as if you were telling people that the clear sky was blue and they insisted it was yellow. Fortunately, while we were wasting our time, we were also running plans B and C thanks to the support of our data platform team and AWS.

Unless you have a dedicated MLOps platform team, staying multi-cloud and open source is a must. My suggestion would be to use a managed solution. If you really need open source, please do not reinvent the wheel: Kubeflow, Flyte, Metaflow, etc. are there for you.

We tried to run a couple of cases on the tooling we had been asked to use and, well… absolutely nothing worked (surprise!). We ended up with a combination of AKS (Kubernetes), RStudio Teams (for launching VS Code, RStudio, and JupyterLab in Kubernetes sessions), RStudio Connect (for lightweight dashboards), and SageMaker + Step Functions for the production code (spoiler: it worked like a charm!).

A lot of political fights and business shenanigans came during and after this period, but finally we were able to work more or less properly.

Becoming lead data scientist (the slope of enlightenment)

By Q2 of 2020 we more or less had our data science platform (a.k.a. SEER). We started to collaborate more and more with AWS to identify the knowledge gaps in the team and figure out how to organize ourselves better. Finally, our non-technical work environment (managers and project managers especially) understood what an MLOps Engineer was and why most data scientists could not gain full expertise in building models, deploying containers, making serverless workflows, etc.

However, there was no one in the team who could bridge the gap at the time, and I accidentally became this person. Yes, you got it! I became a unicorn data scientist!

I spent most Sundays studying AWS infrastructure as well as reading thousands of articles about MLOps. My time during working hours was getting more and more squeezed, and our PM at the time was not ramping up his technical knowledge despite the increasing complexity of our projects. I had to take over most of the scrum process, and I spent most of my time breaking work down into tasks for people. Little by little, my coding time became very limited, but fortunately the team was so great that they needed minimal instructions to get the job done. Some quick draw.io charts, some code skeletons, and some Paint drafts were more than enough to organize a 1–2 week sprint.

By summer 2020 I was spending around 20% of my time making code examples, 60% organizing the teams across different projects, and another 20% making slide decks about what we were doing and about the technical roadmap for our ML platform for the next quarter. A couple of months earlier, people had started to call me “Lead Data Scientist” in internal e-mails, and it started to make sense in my head. I became one of the go-to people for machine learning technical topics in our division (apart from my mentor, who was the analytics topic lead). I even made an appearance on the company intranet once!

The efforts of my team were recognized by different partners, and we got the opportunity to present at several conferences. We were offered the chance to present a new feature of AWS SageMaker at re:Invent 2020. However, the feature did not make it in time, and we ended up presenting our first ML-at-scale use case.

Rolling out projects (plateau of productivity)

After very hard work, we managed to have our first version of the MLOps tooling around 2021 Q1. We were writing some light unit tests, testing the individual components of our workflow, and we even wrote our own custom integration test tooling. This amazing tool was able to test the workflows end to end, bypassing the job timeout limitations in GitLab. Unfortunately, the mechanism of this tool is covered by my still-active NDA, so I cannot reveal much. In short, you could think of it as some sort of predecessor (in terms of functionality) of what you can do now with SageMaker Pipelines or Kubeflow Pipelines.

We made tremendous progress during the summer, and we even implemented an active model monitoring system based on model performance drift (that is, comparing the previous predictions with the actuals once they arrived). Our implementation was an early version of what today is called the “feedback endpoint”, as implemented by Gantry or SeldonCore. I was very proud of this, since most of the monitoring capabilities of ML systems at the time were centered around data drift and latency metrics. We did not manage to implement incremental training and checkpointing, but still, I think it was a tremendous success:

Our beautiful error datamart, built from our model metadata and model metrics (this later became a production dashboard, but at the time ggplot2 was sufficient). The dashboard showed how retraining was triggered every time the error rose above a certain level.
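The core of that feedback loop fits in a few lines. The sketch below is illustrative, not our production code: the function names, the MAPE metric, and the 10% threshold are all assumptions I am using to convey the idea of comparing stored predictions against newly arrived actuals and triggering retraining on performance drift.

```python
# Illustrative performance-drift monitor: when actuals arrive, compare them
# with the predictions made earlier and decide whether to retrain.
# The 10% MAPE threshold is a made-up value, not the real production setting.
ERROR_THRESHOLD = 0.10


def mape(predictions, actuals):
    """Mean absolute percentage error between predictions and actuals."""
    return sum(abs(p - a) / abs(a) for p, a in zip(predictions, actuals)) / len(actuals)


def feedback_endpoint(predictions, actuals):
    """Evaluate a batch of newly arrived actuals and flag retraining."""
    error = mape(predictions, actuals)
    return {"error": error, "retrain": error > ERROR_THRESHOLD}


# Example: actuals arrive for last week's predictions; the last one is
# far off, so the average error (~13%) exceeds the threshold.
result = feedback_endpoint([100, 110, 95], [102, 108, 70])
print(result["retrain"])  # True
```

In production, the `retrain` flag would kick off a training workflow (in our case via Step Functions) instead of just being printed, and the error values would be written to the datamart behind the dashboard above.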

We reached a minimum in technical debt: despite still having plenty of work to do, it was fairly easy to make code changes and deploy them. We also developed very good observability capabilities, and we could always identify when a production pipeline was failing and at which step. The nightmare of having to deploy code manually by connecting to an EC2 instance in a VPC was gone, and despite some small legacy pipelines, everything was looking bright. Most of the MLOps tooling we built was reusable, and we were rolling it out step by step in other projects.

At the beginning of 2021, 3 new MSc thesis students joined, and they were running on-demand training jobs remotely within just 3–4 months! This would have been impossible two years earlier. All the hours burning out my local computer or running hard-to-debug nohup Python commands on a 24/7 EC2 instance were now in the past. We only needed people willing to learn, and they could build whatever they wanted.

How I would have felt in 2019 with the right ML platform

In addition, around March–April I got an additional position at the company-wide level, and I felt quite rewarded by the recognition and trust of some of our global management team. Unfortunately, I was already burnt out: I felt I was a technical lead, but I struggled with the little time I had to develop my technical skills. Some of my thoughts are very well summarized in this blog post by Charity Majors: “Questionable Advice: The Trap of the Premature Senior”.

The worst part was having to explain certain things again and again to the same people. I was astonished at how some people were either so lazy or so unprepared for the positions they were holding. The second is acceptable; the first is very hard to swallow. Our team did pretty much the impossible and beyond to deliver on our projects; we were learning constantly and pushing the boundaries of our cloud provider day by day. We also went far beyond our job scope as data scientists or software developers. Still, there were some people who didn't even bother to learn the basics. Unfortunately, my work ethic did not allow me to stay in a place that maintained this status quo, and that's why I decided to leave.

Now, a bit more than one year later, I can look back and remember more good things than bad. I still miss some of the technology I built there, but I have to confess that my current team has managed to get pretty close to what I had in the past. In fact, in some areas I feel I have a better setup now. Besides, taking into account that it only took one year to reach the same status, it feels to me that my current team is somehow magical.

My path at Siemens Energy ended with handing over the torch to one of the students I had supervised, and this made me really happy. I knew he was going to have a tremendous amount of work after I left, but I was confident he was going to make it. We presented at the Data Innovation Summit 2021, and a new era for me started at B2Holding ASA.

The lessons learnt

After all this text, you probably have more or less an idea of how I became a lead data scientist. But this post is about the lessons learnt, and I think I can summarize them as follows:

  1. Try to get as much exposure to technology as possible at the beginning of your career. The better you understand your technological context, the better you will become at collaborating with people who are not data scientists but hold other essential technical roles.
  2. Try to work on projects that do not involve pure ML model building. It is a good way to get your hands dirty outside your knowledge area. The time I worked on ETL, logging, interaction with object storage, etc. has paid off.
  3. Find a mentor, and a good one (this is the most difficult one, but maybe luck is on your side).
  4. Try to expand your network, via internal conferences and the official internal communication channels of your company, or externally by presenting at or attending conferences. This will give you an overview of what is going on in the sector. In addition, it is a great source of inspiration for new projects.
  5. MLOps and cloud computing are important. If you think that in 2023 you can survive with Jupyter notebooks alone, just making your models as accurate as possible… well… best of luck!
  6. When I started to write this post it was 2022, so to add to the previous point (MLOps), I think it is important to consider that ML is really going real time (whether you like it or not). Especially in the last six months, I have seen multiple cloud providers making this easier and easier day by day. And then we have Max Halford with River, Debezium, and other open source tooling.
  7. Learn software engineering. I'm not saying you need to write unit tests all the time as a data scientist, but it is useful if you produce proper code and know the basics of testing software. Also, writing unit tests yourself instead of delegating this to MLOps Engineers is a good way of checking whether your functions are too complex. Finally, it puts you in the shoes of others, which will improve cooperation with your peers (see point 1).
  8. Write your docstrings or you will find me at your doorstep with a sharp knife (it wasn’t mentioned above but more posts will follow about this).
  9. Do not overcomplicate your algorithms unless you need to. Of course, in areas like NLP and CV, neglecting the performance of CNNs, Transformers, etc. would be foolish. However, consider that if you are working with tabular data, most of the non-DS people dealing with those data do not go much beyond linear regression. Yes, MQN-HiTS (multi-quantile neural hierarchical interpolation for time series) looks promising, but good luck explaining it to your stakeholders (and good luck beating trees for time series forecasting).
  10. Create outcome, not output. Skill level is rarely measured by output: output is what we produce as individual contributors, while outcome, on the other hand, is the impact of what we produce. I encourage you to read this blog post about Staff Engineer archetypes. When I started SEER I was acting as an Architect; when I created the first E2E ML project I was acting as a Tech Lead and Solver. Finding your archetypes is the basis for understanding which outcome you can create.
  11. If it is cheap and it doesn't affect project deadlines, break as many things as you like. Sometimes it is useful to learn by iteration.
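To make points 7 and 8 concrete, here is the kind of small, documented, testable function I have in mind. The function itself is a made-up example, but it shows the bar: a docstring stating the contract, and a unit test simple enough to write in one line.

```python
def rolling_mean(values: list[float], window: int) -> list[float]:
    """Compute the rolling mean of `values` over a fixed `window`.

    Args:
        values: Ordered numeric observations.
        window: Number of observations per average; must be positive.

    Returns:
        One mean per full window, i.e. len(values) - window + 1 items.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# If a test this simple is hard to write, the function is probably
# doing too much — which is exactly the complexity check from point 7.
def test_rolling_mean():
    assert rolling_mean([1.0, 2.0, 3.0, 4.0], window=2) == [1.5, 2.5, 3.5]
```

Docstring plus test together form the contract: anyone (including an MLOps Engineer moving your code to production) can read what the function promises and verify it without asking you.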

Thank you for reading this post, and feel free to follow me on LinkedIn to learn more about me!