Should I stay, or should I go? My Google Advanced Data Analytics Capstone

Hey there! It’s me again, doing more data stuff. 15,000 rows’ worth, to be exact. I’m sure you’d rather see it than read any more fluff, so heeeeeere we go.

Let’s Cut To The Chase

This time around for my Google Advanced Data Analytics certificate we’ve got a 15,000-row HR dataset for examination. The “Company” in this scenario has asked us to analyze the data to see what factors might influence employee satisfaction and turnover.

Eager to learn, I got out my handy dandy Jupyter notebook and dug in. If you’d rather see the guts than the gab, you can check out the notebook itself on my GitHub (link below). The source of the dataset on Kaggle is also listed below.

The tale of the data was that of 10 columns: satisfaction level, last evaluation, number of projects, average monthly hours, tenure, work accidents, promotion in the last 5 years, department, salary, and whether the employee left.

“Clean up, clean up, everybody everywhere.”

To kick things off, I had some tidying up to do. For this I relied on my trusty friends Pandas and Matplotlib. The full extent was:

  • Converting all column headers to a uniform snake_case
  • Removing 3008 duplicate rows
  • Removing outliers from the ‘tenure’ field, after careful consideration and investigation (more on that just below).

The choice to remove all rows where ‘tenure’ was over 6 years came after some tinkering in the ol’ noodle, and some careful examination of the data. The final determining factors were as follows:

  • There were no employees who left the company after 6 years, so those rows aren’t much help if we’re trying to predict who will leave. They also padded out the number of employees who didn’t leave, a category that already made up 83% of the dataset.
  • There were only 824 rows that contained ‘tenure’ outliers of 7 years or more.
  • After adjusting the models later in the analysis to exclude these outliers, the models performed slightly better.
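
For the curious, the cleanup boils down to a few pandas calls. Here’s a minimal sketch; the file name and the original column names are my best guesses, so see the notebook for the real thing:

```python
import pandas as pd

# Load the raw Kaggle HR dataset (file name is an assumption)
df = pd.read_csv("HR_capstone_dataset.csv")

# Convert all column headers to a uniform snake_case
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.rename(columns={
    "time_spend_company": "tenure",                    # assumed original name
    "average_montly_hours": "average_monthly_hours",   # fix the source's typo
})

# Remove the 3,008 duplicate rows
df = df.drop_duplicates()

# Remove tenure outliers: keep employees with 6 or fewer years
df = df[df["tenure"] <= 6]
```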

Let’s Get Curious!

There were several questions that immediately came to mind when looking at this dataset and thinking about the task at hand.

  • Is there a strong connection between employee satisfaction and leaving the company? If so, how strong?
  • How impactful will things like their promotion status or last evaluation be?
  • Do different departments and salary ranges have differing satisfaction and turnover rates?
  • How impactful is tenure? For me, tenure seems like a result of contentment with one’s workplace. If employees are content they will probably stay, which makes tenure a very interesting variable.

Let’s start with the first question. How linear is the relationship between satisfaction and leaving?

The answer seems to be … sort of?

There are several key things that come from this graph:

  • Even an abysmal satisfaction score is not a super clear indicator that somebody will leave
  • It does become quite a bit more likely that somebody will leave as their satisfaction score goes down
  • The graph gets steeper around the 0.5 mark for satisfaction level, so the company might want to put more emphasis on employees who rate their satisfaction below 0.5
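
If you want to eyeball that curve yourself, a rough sketch of the plot looks like this (assuming the cleaned dataframe from the snippet above):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Bin satisfaction into 20 buckets and compute the share of leavers per bucket
bins = pd.cut(df["satisfaction_level"], bins=20)
leave_rate = df.groupby(bins, observed=True)["left"].mean()

# Plot the turnover rate against each bucket's midpoint
midpoints = [interval.mid for interval in leave_rate.index]
plt.plot(midpoints, leave_rate.values, marker="o")
plt.xlabel("Satisfaction level")
plt.ylabel("Share of employees who left")
plt.title("Turnover rate by satisfaction level")
plt.show()
```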

Let’s dig further.

This graph sheds a little more light on our satisfaction score question. The majority of the folks who left had a satisfaction score of less than 0.5; however, there was still a decent number of people who left with a score over 0.7.

Also, the vast majority of people who stayed had a satisfaction score of over 0.5. This part of the discovery prompted me to think about tenure again. I wondered what satisfaction levels were like over time for those who left.

Ah … Very Interesting

The distribution of satisfaction levels across tenure for those who decided to leave the company was a gold mine. It was perhaps the most interesting graph of them all. Among its secrets were:

  • The satisfaction of those who left after 2 years was all over the map, with a mean near 0.6. This could suggest that there are many reasons a newer employee might leave. Perhaps the company has positions or programs that are only 2 years long?
  • Years 3 and 4 get interesting. This is where it genuinely seems that people are leaving because they are unhappy. The collective satisfaction centers tightly around 0.4 for those who left during their 3rd year, and centers even more tightly around 0.1 for those in their 4th year.
  • Then it gets even more interesting! After 4 years the satisfaction levels skyrocket up to a 0.8 average, and stay pretty tightly grouped there.
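
A graph like that takes one seaborn call, give or take (column names assume the earlier cleanup):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Satisfaction distribution per year of tenure, leavers only
leavers = df[df["left"] == 1]
sns.boxplot(data=leavers, x="tenure", y="satisfaction_level")
plt.title("Satisfaction by tenure (employees who left)")
plt.show()
```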

But wait. There’s more we need for this story.

Poetry In Motion

The percentage of employees who leave and the percentage who stay converge almost perfectly at the 5-year mark. Then, at year 6, the distribution returns to something similar to year 3’s. We also know that nobody in the dataset left the company after year 6.

This also makes the “Number of Employees by Tenure” graph even more interesting. The peak is at 3 years, but the count tapers off quickly for years 4 and 5.

At this point I was a puzzled guy. Something happens after year 2 that begins a substantial decrease not just in the number of employees who stay, but in the average satisfaction of all employees.

The peculiar nature of years 3-5 prompted me to check a number of areas for any other possible correlations with this trend. You can check my handy dandy Jupyter notebook for all the graphs, but here are the notable nuggets:

  • Turnover across departments or salary ranges was not notably different; no department had dramatically higher turnover than the others.
  • Satisfaction across departments or salary ranges was not notably different either.
  • The average evaluation scores and hours worked did not change dramatically by tenure, department, or salary range.
  • Work accidents were also evenly distributed across department and tenure.
  • Satisfaction did drop dramatically when any employee was given more than 6 projects, and no employees with 7 projects stayed with the company; however, there were very few employees given 6 or 7 projects.
  • The vast majority of employees work more than 40 hours per week, with almost no difference in average work hours between departments or salary ranges.
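
Most of those nuggets fall out of a handful of groupby calls, roughly like this (a sketch, not the notebook verbatim):

```python
# Turnover and satisfaction by department and salary range
print(df.groupby("department")["left"].mean().sort_values())
print(df.groupby("salary")["left"].mean())
print(df.groupby("department")["satisfaction_level"].mean())

# Evaluation scores, hours, and project counts across other slices
print(df.groupby("tenure")["last_evaluation"].mean())
print(df.groupby("number_project")["satisfaction_level"].mean())
print(df.groupby("salary")["average_monthly_hours"].mean())
```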

The uniformity of so many of these distributions began to make me think the data was synthetically generated. It would be highly unusual for a sales department to have the same turnover as every other department; sales nearly always runs higher than most. And then, it showed itself in one graph.

When a dataset presents perfectly bordered boxes in a pair plot, the jury is in. This data is fake, but the analysis doesn’t have to be. Let’s take it for what it is. We’re presented with a few distinct categories of employees who left.

  • Those with extremely low satisfaction and extremely high monthly hours
  • Those with somewhat high satisfaction and equally crazy hours (13 hours per day for some!).
  • Those with somewhat low satisfaction and low monthly hours.
  • Those with extremely high monthly hours and a high evaluation.
  • Those with low monthly hours and a poor evaluation.
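
For reference, the pair plot that gave the game away is a single seaborn call (columns assumed from the cleanup above):

```python
import seaborn as sns

# Pair plot of the numeric drivers, colored by who left; the suspiciously
# crisp, box-shaped clusters show up in the hours-vs-satisfaction panels
cols = ["satisfaction_level", "last_evaluation", "average_monthly_hours", "left"]
sns.pairplot(df[cols], hue="left", plot_kws={"alpha": 0.3})
```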

The distribution of the satisfaction scores of those who left is pretty absurd by itself. Many of the people who left with low satisfaction were pulling nearly 15 hours per day on a 5-day work week. It’s unlikely a real employee would ever reach that combination; they would probably have left far before hitting that level of either dissatisfaction or work hours.

Let’s Pause

All jokes aside, I want to make it clear at this point that this scenario doesn’t resemble a real one in many ways. If it were real I would be seeking further information about the different groups to see why they might be separated the way they are. I’d also be super curious about the year 3–5 dip in satisfaction across the entire company, and especially among those who left.

In a real scenario there would also be opportunities to mine for more data, collect new data, or gather valuable context and feedback from subject matter experts. The goal of the certification was to demonstrate the use of some more advanced Python and machine learning techniques to assess data. With the realization that the data is synthetic, my options for further evaluation start to become limited. So the time has come to…

Feed It To The Machine!

With the relationships between variables being so uniform, I had my doubts about the effectiveness of a linear regression or logistic regression model, but the assignment was to try them anyway. As a dutiful learner, I did just that. I decided to plot the correlation coefficients to see which variables had the strongest relationship to employee turnover.
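
Plotting those coefficients is essentially one heatmap call, something like:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation of every numeric field, turnover ("left") included
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation coefficients")
plt.show()
```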

With this in mind I ran several linear regression and logistic regression models to see if any of them could predict satisfaction level or turnover from different variables. I used one-hot encoding and tried oversampling the minority class of the target, but …
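
A minimal sketch of that pipeline might look like the following; it isn’t my exact notebook code, and class_weight stands in here for the oversampling (imblearn’s RandomOverSampler is the other common route):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# One-hot encode the categorical fields (names assume the earlier cleanup)
X = pd.get_dummies(df.drop(columns="left"), columns=["department", "salary"])
y = df["left"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Balance the minority "left" class during fitting
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```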

They did not succeed …

You can see all of my model building in the Jupyter notebook HERE.

On the surface, an accuracy of 81% and a precision of 84% for the ‘0’ class may seem good. But there’s more than meets the eye. The recall of 14% for the ‘1’ class is the troublesome thorn in the side of this here model.

This means that our goodest boy can only catch 14% of the employees who actually leave. Less than stellar.

So what now?

Again, if this were a real company I would have a lot of questions to ask and further data to collect to break down why and when an employee might leave, but since the data is synthetic, let’s just “cheat”.

Now, I don’t mean cheat in the usual sense. I just mean let’s dump a ton of data points through my computer’s processor and get the answer, even if we might not be able to explain it super well afterwards.

LET’S GET RANDOM!

… Random forest that is.

Boom … Through the power of letting a machine do all the work we’ve … uh … I mean I’ve been able to predict with 91% accuracy who will likely leave the company.
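
In sketch form, reusing the split from the logistic example (the hyperparameters here are placeholders, not my tuned values):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

# An ensemble of decision trees; each tree votes on whether an employee leaves
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
print("accuracy:", round(accuracy_score(y_test, preds), 3))
print("recall (leavers):", round(recall_score(y_test, preds), 3))
```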

Now, I’m fully aware that a Random Forest model probably isn’t the best fit for this scenario at this stage. This is mostly because it’s hard to know what’s going on under the hood. In a real scenario I would also have the ability to ask a lot more questions about the data, and possibly collect more to help answer the question in a more “readable” way.

In a real scenario it’s likely that more variables would show a linear correlation, and we could get a better picture through linear and logistic regression, communicating with stakeholders about further data collection and better model fitting along the way.

But alas

if all else fails

just let the machines do it