Organisation Diversity check in Python with DeepFace

Machine learning is a double-edged sword

 

Introduction

There has been a lot of controversy about how bias in machine learning reflects social inequalities and how this might affect outcomes for minority groups (see, for example, the article “AI is sending people to jail”). To be blunt, I don’t agree. Looking past the headlines that aim at the emotional part of our brain will show you that machine learning is a double-edged sword. This post was written to demonstrate the other side of that sword. I will describe a short experiment in which I try to show how machine learning can be used to identify, and possibly also counteract, a lack of diversity within certain organisations.

The idea came to mind during a search for a covid pocketbook, when I ran into a medical book publishing firm. It’s a company started a couple of years ago by two medical students who have since become successful publishers and medical doctors. To demonstrate the medical community’s involvement in creating content for their books, they posted a collection of pictures, with names, of medical professionals engaged in medical education who contributed to the pocketbooks. Looking at this contributors page, I had the same shock reaction as on my first day at university: there were almost no people of other ethnicities. I had expected this imbalance to have improved in the years since I completed my studies more than half a decade ago; that turned out to be an erroneous assumption. It looked like there had been little change.

Nowadays, automation, data science and machine learning with Python give us the opportunity to look at data on a large scale with relatively little effort. They might also be used to assess and improve the level of diversity within organisations. I therefore posed the following question:

Question

Can machine learning be used to evaluate and monitor diversity within an organisation?

Method

1. Choosing the first target

The website discussed above looked like a good first test target to do a diversity check.

2. Scraping like a crazy man

I started by scraping the links to all the subpages of the homepage. I explained this scraping algorithm in detail in this tutorial. The subpage links were saved in a JSON file.
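Since the original scraping code isn’t reproduced here, below is a minimal sketch of that step. The function names (`extract_subpage_links`, `save_links`) and the regex-based link extraction are my own simplifications; the tutorial linked above describes the full algorithm.

```python
import json
import re
from urllib.parse import urljoin


def extract_subpage_links(html: str, base_url: str) -> list[str]:
    """Pull all href targets out of raw HTML and keep only links
    that stay on the same site as base_url."""
    hrefs = re.findall(r'href="([^"#]+)"', html)
    links = {urljoin(base_url, h) for h in hrefs}
    return sorted(link for link in links if link.startswith(base_url))


def save_links(links: list[str], path: str = "subpage_links.json") -> None:
    """Persist the subpage links so the download step can reuse them."""
    with open(path, "w") as f:
        json.dump(links, f, indent=2)
```

Fetching the homepage HTML itself (e.g. with `requests.get`) is left out; any HTTP client will do.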

3. Browse through web pages and download all images

We open the JSON file in which the links are stored, create a function for downloading the images on a webpage, and use that function in a loop over all the links from the JSON file. Within the loop we use the rand_sleep_int function to create random time intervals between download requests, and we use regex to search for jpg, jpeg, gif and png images.

Image download links are reachable in different ways. In the final code, I used the tags “img”, “srcset” and “avatar”. “rand_sleep_int” creates a random time interval to avoid an overload of requests being made to the server.
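The original download loop isn’t shown here, so this is a hedged sketch of the idea: a regex over `src`/`srcset` attributes for the listed image extensions, plus a random sleep between requests. The helper names and the 1–5 second range are my assumptions.

```python
import json
import random
import re
import time
from urllib.parse import urljoin

# Match src/srcset attributes pointing at jpg, jpeg, gif or png files.
IMG_PATTERN = re.compile(
    r'(?:src|srcset)="([^"\s]+\.(?:jpe?g|gif|png))"', re.IGNORECASE
)


def rand_sleep_int(low: int = 1, high: int = 5) -> int:
    """Sleep a random whole number of seconds so we don't overload the server."""
    seconds = random.randint(low, high)
    time.sleep(seconds)
    return seconds


def find_image_urls(html: str, page_url: str) -> list[str]:
    """Extract absolute image URLs from a page's HTML."""
    return [urljoin(page_url, match) for match in IMG_PATTERN.findall(html)]


def download_all_images(links_path: str = "subpage_links.json") -> None:
    """Loop over the saved subpage links and download every image found."""
    import requests  # third-party; imported lazily

    with open(links_path) as f:
        links = json.load(f)
    for url in links:
        html = requests.get(url, timeout=30).text
        for img_url in find_image_urls(html, url):
            data = requests.get(img_url, timeout=30).content
            with open(img_url.rsplit("/", 1)[-1], "wb") as out:
                out.write(data)
        rand_sleep_int()  # random pause between pages
```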

4. Filter images in which the number of faces is equal to one

Use an if statement to check whether the length of the list of face locations is equal to one:
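The filtering code itself isn’t reproduced here; this is a minimal sketch using the face_recognition library, whose `face_locations` call returns exactly the list of face locations described above. The helper names are my own.

```python
import os


def is_image_file(name: str) -> bool:
    """Cheap filename check before running the (slow) face detection."""
    return name.lower().endswith((".jpg", ".jpeg", ".gif", ".png"))


def has_single_face(image_path: str) -> bool:
    """Detect faces and keep only images containing exactly one face."""
    import face_recognition  # third-party; imported lazily

    image = face_recognition.load_image_file(image_path)
    face_locations = face_recognition.face_locations(image)
    return len(face_locations) == 1


def single_face_images(folder: str) -> list[str]:
    """Paths of all images in `folder` that show exactly one face."""
    paths = (
        os.path.join(folder, f)
        for f in sorted(os.listdir(folder))
        if is_image_file(f)
    )
    return [p for p in paths if has_single_face(p)]
```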

5.  Give an identity number to every face

We want to avoid letting our machine learning algorithm assess the same face multiple times, so we give each face an identification number; if the same face is encountered again, it gets the same number. Each identification number is saved in a dictionary as a key, with the corresponding image file name as its value. In this way we prevent the same person from being counted more than once:
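A sketch of that bookkeeping, assuming we already have a numeric face encoding per image (e.g. from face_recognition’s `face_encodings`); the `assign_identity` function and the 0.6 distance tolerance are my assumptions, not the original code.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    """Distance between two face encodings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_identity(
    encoding: list[float],
    known_encodings: list[list[float]],
    id_to_file: dict[int, str],
    filename: str,
    tolerance: float = 0.6,
) -> int:
    """Reuse the id of the closest known face within `tolerance`,
    otherwise register a new id keyed to the image file name."""
    for face_id, known in enumerate(known_encodings):
        if euclidean(encoding, known) <= tolerance:
            return face_id  # same person seen before
    known_encodings.append(encoding)
    new_id = len(known_encodings) - 1
    id_to_file[new_id] = filename  # id -> first image of this person
    return new_id
```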

6.  The actual machine learning ethnicity check work

As Andrew Ng said in his interview with Lex Fridman about ML:

“In a software system the machine learning model is maybe five percent
or even fewer relative to the entire software system”

The guru’s expression is reflected in this step: together with one line in an earlier step, it contains the only code so far that actually involves ML!

After doing the ML work, we save the data in a CSV file for further analysis.
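The original code block isn’t reproduced here, so below is a hedged sketch of this step: the single ML line is a call to DeepFace’s `analyze`, followed by a plain CSV dump. The helper and column names are my own; note that newer DeepFace versions return a list of result dicts, and the gender key differs between versions.

```python
import csv


def analyze_face(image_path: str) -> dict:
    """Run DeepFace's ethnicity and gender prediction on one image."""
    from deepface import DeepFace  # third-party; the one line of actual ML

    result = DeepFace.analyze(
        img_path=image_path, actions=["race", "gender"], enforce_detection=False
    )
    if isinstance(result, list):  # newer DeepFace versions return a list
        result = result[0]
    return {
        "Ethnicity": result["dominant_race"],
        "Gender": result.get("dominant_gender", result.get("gender")),
    }


def save_results(rows: list[dict], path: str = "diversity_results.csv") -> None:
    """Write one row per unique face for later analysis with pandas."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Ethnicity", "Gender"])
        writer.writeheader()
        writer.writerows(rows)
```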

7. First results

Okeeeey, time for some results. We count ethnicity with pandas:
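The counting boils down to a single pandas `value_counts` call; a minimal sketch, where the `count_column` helper and the CSV filename are my assumptions:

```python
import pandas as pd


def count_column(df: pd.DataFrame, column: str = "Ethnicity") -> pd.Series:
    """Count faces per predicted category, most frequent first."""
    return df[column].value_counts()


if __name__ == "__main__":
    results = pd.read_csv("diversity_results.csv")
    print(count_column(results, "Ethnicity"))
```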

The Output

Ethnicity
white 102
asian 6
latino hispanic 3
middle eastern 2
indian 1
Name: Ethnicity, dtype: int64

We find 114 faces with an identified ethnicity, of which 102 are white according to the DeepFace module, which amounts to around 89%. This is an organisation in Amsterdam, a city in which around 51% of the people have a migration background…

 

8. Accuracy of ethnicity

Now let’s check what happens if you add a semi-random sample of 10 pictures with black people: 7 black female models, my paranymphs/friends during my PhD (bottom right) and me.

 

7 black females and two black males used in the sanity check. One of them seemingly eating some good Surinamese food :-).

The result:

Ethnicity
white 102
asian 7
black 7
latino hispanic 4
middle eastern 2
indian 1
Name: Ethnicity, dtype: int64
Gender
Man 76
Woman 47
Name: Gender, dtype: int64

9. Assumptions for a feasible organisation diversity checker

 

10. Identifying an industry for analysis

11. Performing the analysis

Results

1. DeepFace machine learning performance

2. Manual check

Conclusion