(yet another) two kinds of data scientists
I'm in a bit of a conundrum over tradeoffs as I try to hire more data scientists for Babbage
We’re hiring data scientists. I mean, we hired one last week, but we need more data scientists here at Babbage.
And when I think about it, I realise that there are two kinds of data scientists that I like:
1. Those who are good at analysing data and understanding business, who are good at statistics, can reason well logically and can avoid cognitive biases.
2. Those who can write production-worthy code and put their models easily into production. This also includes the ability to hack, and now that it’s 2025, to use LLMs effectively in their workflows.
Every data science team needs an optimal balance of these two kinds of skills, and this balance can vary from team to team, depending upon its objectives.
I’m Type 1 according to this classification, for example. In a previous organisation, I ended up hiring lots more Type 1s. That worked well until I started reporting into a technical person who demanded that all our analyses “be put into production”. Suddenly the lack of Type 2s in my team became a massive liability.
If you look at a conventional data science team in any company, you’ll see that they overindex on Type 2. This leads to impeccable code and great software engineering practices but suboptimal models and logic.
My current problem
For the uninitiated, I’m building Babbage Insight, a “team of AI data analysts” (yes, we need to work on our branding and positioning). We don’t analyse data. We write code that can do that for us. We want to write code that analyses data “properly”, not just through “following processes”. This means that the code that we write needs to produce data scientists of Type 1.
Now, what kind of employees do I need to write code that can produce data scientists of Type 1?
For starters, they need to be able to code, and put things in production. For example, I’ve realised that I’m way too much of a Type 1, and the more code I put into production, the worse our technical debt gets (I wrote all of the data science in our inaugural model, which is now largely scrapped). So I’ve, for all practical purposes, stopped writing production code.
We cannot afford more “pure Type 1s” like me, given how much code we need to write.
The counterpoint is that in order to build models that can function like “Type 1 data scientists”, whoever writes the model needs to know well how Type 1 data scientists would think and solve the problem. And if they have not done any Type 1 data science work, it is highly unlikely that they will be able to write code (however “agentic”) that can behave that way.
People who are good at both Type 1 and Type 2 data science work are unicorns - rare and not easily available (and when available, pricey). If I sit around waiting for them, I will lose valuable time.
This is like cricket - you have two axes that you can call “batting” and “bowling”. Allrounders are not easy to come by (admittedly, India is still trying to replace Kapil Dev, 31 years after he retired). You need at least a basic level of skill in both batting and bowling.
Do you then go with “bits and pieces players” (who are adequate, but only just about, at all skills)?
Or do you look for “genuine allrounders” - people who will get into the team based on one skill alone?
How I’m setting up
Given this background, this is how I’ve set up my hiring so far (and yes, I’m still hiring!). Let me know if this sounds good.
Inevitably, when you are hiring for data science, the funnel can get really narrow really quickly. You can get a lot of very random profiles. You need to be prepared for that.
There is one kind of profile that I know won’t fit for me, and is easy to weed out - people who have *only* worked on machine learning, and have done no other kind of data science work.
I ask for code samples. Whether they are a type 1 or type 2, their skillsets become clear with this. There are people who have said “all the work is proprietary, and I have no samples to share”. I’m ignoring them.
Through CVs, it is easy to see if someone is a Type 1 or Type 2. Then we can interview in a way to cover their “other base”.
To cut risk, we’re hiring people on an initial 3-month contract, and then making them full time (India has a sufficient number of people willing to sign on to a 3-month contract). This way we can make decisions based on a single 30-minute interview.
Any other tips? Anything else to look out for? Anyone you think I should hire? Oh, and some hilarious stuff is happening on the cover letter front.
Copypasting from my own LinkedIn:
Is AI turning cover letters obsolete?
I'm hiring after a gap of nearly two years, during which LLMs have become mainstream. And as I glance through the cover letters, there is a sort of sameness and unnecessary "formality" to them, suggesting a lot of them have been written using LLMs.
No harm in that, except that the cover letter now completely loses its signal - unless you've taken sufficient care to ensure that it looks "human" and authentic.
So I'm doing the obviously logical thing - simply ignoring cover letters and judging candidates on the merit of their CVs and work samples. At least I like to believe that those present a more authentic picture of the candidates.
And then this:
The AI cover letter story gets even more freaky.
Yesterday I replied to one of our data science job applicants (who had sent in their CV with an elaborate cover letter starting "Dear Manu") with a clarifying question.
In short order I got the reply "can you please share JD, company details, location, etc.?"
My first reaction was "you applied for this job, so I assume you should know what you have applied for!"
On second thoughts, I suspect there exist bots that trawl LinkedIn for job openings, and upon finding a good match, craft a cover letter and take whatever action is suggested in the posting (in our case, it was to send CV to a certain email ID).
Who is going to take this story and the presence of automated résumé screeners, and write some dystopian fiction?
Here's my take: To build an agent that behaves like a Type 1 data scientist, you'll have to distill and abstract your own thought process and approach (since you would be in the top 1% of Type 1 data scientists) into some sort of algo or mental model or framework that then needs to be "agentized". You can call it "Agent SK" :) Then that spec can go to the Type 2 data scientist to build out.