In this project I applied my knowledge of pandas, dimensionality reduction, and Bokeh, along with concepts from natural language processing and word embedding.
When trying new cosmetic products, issues can arise due to ingredients contained in these items. The ingredient lists are long and difficult to understand. In this project, we used data science to create an ingredient-based recommendation system. The task is to process the ingredient lists for 1472 Sephora products via word embedding.
Using t-SNE (t-distributed stochastic neighbor embedding) to reduce dimensionality and group similar ingredients, we then created an interactive visualization in Bokeh.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
# Load the data
df = pd.read_csv('datasets/cosmetics.csv')
# Check a random sample of five rows
display(df.sample(n=5))
# Inspect the types of products
print(df['Label'].value_counts())
There are six categories of product in our data (including moisturizers, cleansers, face masks, eye creams, and sun protection) and five different skin types (combination, dry, normal, oily, and sensitive).
In this project, we are focusing on moisturizers for people with dry skin.
# Filter for moisturizers
moisturizers = df[df['Label'] == 'Moisturizer']
# Filter for dry skin as well
moisturizers_dry = moisturizers[moisturizers['Dry'] == 1]
# Reset index
moisturizers_dry = moisturizers_dry.reset_index(drop=True)
To get to our end goal of comparing ingredients in each product, we first need to do some preprocessing tasks and bookkeeping of the actual words in each product's ingredients list. The first step will be tokenizing the list of ingredients in the Ingredients column. After splitting them into tokens, we'll make a binary bag of words. Then we will create a dictionary with the tokens, ingredient_idx, which will have the following format:
{ "ingredient": index value, ... }
# Initialize dictionary, list, and initial index
ingredient_idx = {}
corpus = []
idx = 0
# For loop for tokenization
for i in range(len(moisturizers_dry)):
    ingredients = moisturizers_dry['Ingredients'][i]
    ingredients_lower = ingredients.lower()
    tokens = ingredients_lower.split(', ')
    corpus.append(tokens)
    for ingredient in tokens:
        if ingredient not in ingredient_idx:
            ingredient_idx[ingredient] = idx
            idx += 1
# Check the result
print("The index for decyl oleate is", ingredient_idx['decyl oleate'])
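The split on ', ' above assumes that every ingredient is separated by exactly one comma and one space. As a minimal sketch, a slightly more defensive tokenizer could also strip stray whitespace and a trailing period. The helper name tokenize_ingredients and the sample string are made up for illustration, and whether the Sephora data actually needs this extra cleaning is an assumption.

```python
def tokenize_ingredients(text):
    """Split a comma-separated ingredient string into clean, lower-case tokens.

    Hypothetical helper, not part of the original notebook.
    """
    tokens = []
    for raw in text.lower().split(','):
        # Remove surrounding whitespace and a trailing period, if any
        token = raw.strip().rstrip('.').strip()
        if token:  # skip empty pieces such as those left by a doubled comma
            tokens.append(token)
    return tokens


# Made-up example string, not taken from the dataset
sample = "Water,  Glycerin, Decyl Oleate ,Niacinamide."
print(tokenize_ingredients(sample))
# ['water', 'glycerin', 'decyl oleate', 'niacinamide']
```

Swapping this in for the plain split would change which tokens count as distinct ingredients, so the size of ingredient_idx could differ from the numbers quoted in this project.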
The next step is making a document-term matrix (DTM). Each cosmetic product will correspond to a document, and each ingredient will correspond to a term. This means we can think of the matrix as a “cosmetic-ingredient” matrix, with one row per product and one column per ingredient.
To create this matrix, we'll first make an empty matrix filled with zeros. The number of rows is the total number of cosmetic products in the data. The number of columns is the total number of distinct ingredients.
# Get the number of items and tokens
M = len(moisturizers_dry)
N = len(ingredient_idx)
# Initialize a matrix of zeros
A = np.zeros((M,N))
Before we can fill the matrix, let's create a function that encodes the tokens (i.e., the ingredients list) of each row. Our end goal is to fill the matrix with 1 or 0: if an ingredient is in a cosmetic, the value is 1. If not, it remains 0. The name of this function, oh_encoder (a one-hot encoder function), will become clear next.
# Defining the oh_encoder function
def oh_encoder(tokens):
    x = np.zeros(N)
    for ingredient in tokens:
        # Look up the column index of each ingredient.
        # Direct indexing raises a KeyError for an unknown ingredient,
        # whereas .get() would return None and x[None] = 1 would set every entry.
        idx = ingredient_idx[ingredient]
        # Put 1 at the corresponding index
        x[idx] = 1
    return x
Now we'll apply the oh_encoder() function to the tokens in corpus and set the values at each row of this matrix. The result will tell us which ingredients each item is composed of. For example, if a cosmetic item contains water, niacin, decyl oleate and sh-polypeptide-1, its row will have a 1 in the columns for those four ingredients and a 0 everywhere else.
This is what we call one-hot encoding. By encoding each ingredient in the items, the cosmetic-ingredient matrix will be filled with binary values (1 or 0).
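To make the example concrete, here is a small sketch of the same encoding on a toy vocabulary of six ingredients. The vocabulary, its index values, and the name toy_oh_encoder are made up for illustration; the real ingredient_idx is built from the data and is far larger.

```python
import numpy as np

# Toy vocabulary with made-up indices (stand-in for the real ingredient_idx)
toy_ingredient_idx = {
    'water': 0,
    'glycerin': 1,
    'niacin': 2,
    'decyl oleate': 3,
    'squalane': 4,
    'sh-polypeptide-1': 5,
}


def toy_oh_encoder(tokens, vocab):
    """Return a binary vector with 1 at the index of every ingredient in tokens."""
    x = np.zeros(len(vocab))
    for ingredient in tokens:
        x[vocab[ingredient]] = 1
    return x


# The example item from the text above
item = ['water', 'niacin', 'decyl oleate', 'sh-polypeptide-1']
encoded = toy_oh_encoder(item, toy_ingredient_idx)
print(encoded)  # [1. 0. 1. 1. 0. 1.]
```

The two zeros correspond to glycerin and squalane, which the example item does not contain.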
# Make a document-term matrix
for i, tokens in enumerate(corpus):
    A[i, :] = oh_encoder(tokens)
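As an aside, scikit-learn ships a MultiLabelBinarizer that builds the same kind of binary matrix from a list of token lists. The sketch below uses a made-up three-item corpus rather than the project data, and it is an alternative to the hand-written loop, not the method used in this project.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tokenized ingredient lists standing in for corpus
toy_corpus = [
    ['water', 'glycerin', 'niacin'],
    ['water', 'decyl oleate'],
    ['glycerin', 'squalane', 'water'],
]

mlb = MultiLabelBinarizer()
toy_dtm = mlb.fit_transform(toy_corpus)  # one row per item, one column per ingredient

print(mlb.classes_)   # ingredient for each column, in sorted order
print(toy_dtm.shape)  # (3, 5)
print(toy_dtm)
```

Note that the columns here follow sorted ingredient names, whereas ingredient_idx assigns indices in order of first appearance, so the two matrices would differ by a column permutation.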
The dimensions of the existing matrix are (190, 2233), which means there are 2233 features in our data. For visualization, we should reduce this to two dimensions. We'll use t-SNE to reduce the dimension of the data here.
T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm that embeds high-dimensional data into a low-dimensional space of 2 or 3 dimensions. t-SNE can reduce the dimension of data while keeping the similarities between the instances, allowing us to visualize our data on a coordinate plane. All items in our data will be mapped to 2-D coordinates, and the distances between the points will indicate the similarities between the items.
# Dimension reduction with t-SNE
model = TSNE(n_components = 2, learning_rate = 200, random_state = 42)
tsne_features = model.fit_transform(A)
# Make X, Y columns
moisturizers_dry['X'] = tsne_features[:,0]
moisturizers_dry['Y'] = tsne_features[:,1]
We are now ready to start creating our plot. With the t-SNE values, we can plot all our items on the coordinate plane. And the coolest part here is that it will also show us the name, the brand, the price and the rank of each item. Let's make a scatter plot using Bokeh and add a hover tool to show that information. Note that we won't display the plot yet as we will make some more additions to it.
from bokeh.io import show, output_notebook, push_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
output_notebook()
# Make a source and a scatter plot
source = ColumnDataSource(moisturizers_dry)
plot = figure(x_axis_label = 't-SNE 1',
              y_axis_label = 't-SNE 2',
              width = 500, height = 400)
plot.circle(x = 'X', y = 'Y', source = source, size = 10,
            color = '#FF7373', alpha = .8)
Why don't we add a hover tool? Adding a hover tool allows us to check the information of each item whenever the cursor is directly over a glyph. We'll add tooltips with each product's name, brand, price, and rank (i.e., rating).
# Create a HoverTool object
hover = HoverTool(tooltips = [('Item', '@Name'),
                              ('Brand', '@Brand'),
                              ('Price', '$@Price'),
                              ('Rank', '@Rank')])
plot.add_tools(hover)
Finally, it's show time! Let's see what the map we've made looks like. Each point on the plot corresponds to a cosmetic item. Then what do the axes mean here? The axes of a t-SNE plot aren't easily interpretable in terms of the original data. As mentioned above, t-SNE is a visualization technique for plotting high-dimensional data in a low-dimensional space. Therefore, it's not desirable to interpret a t-SNE plot quantitatively.
Instead, what we can get from this map is the distance between the points (which items are close and which are far apart). The closer two items are, the more similar their compositions. This enables us to compare the items without having any chemistry background.
# Plot the map
show(plot)
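Reading neighbors off the plot can also be done numerically. The sketch below ranks items by Euclidean distance in the 2-D plane. The coordinates and product names are made up and stand in for the X and Y columns, and nearest_items is a hypothetical helper, not part of this project. Since t-SNE distances are only a rough guide to similarity, the ranking should be read as a hint rather than a measurement.

```python
import numpy as np
import pandas as pd

# Made-up coordinates standing in for the 'X' and 'Y' columns
toy = pd.DataFrame({
    'Name': ['Product A', 'Product B', 'Product C', 'Product D'],
    'X': [0.0, 0.5, 10.0, -8.0],
    'Y': [0.0, 0.2, 10.0, 3.0],
})


def nearest_items(df, name, k=2):
    """Return the k items closest to `name` in the (X, Y) plane."""
    target = df.loc[df['Name'] == name, ['X', 'Y']].to_numpy()[0]
    coords = df[['X', 'Y']].to_numpy()
    # Euclidean distance from every item to the target item
    dist = np.sqrt(((coords - target) ** 2).sum(axis=1))
    result = df.assign(Distance=dist)
    result = result[result['Name'] != name]  # drop the target itself
    return result.sort_values('Distance').head(k)


print(nearest_items(toy, 'Product A'))
```

With the toy data, Product B comes first and Product D second, because they lie closest to Product A.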
Since there are so many cosmetics and so many ingredients, the plot doesn't have the obvious patterns that simpler t-SNE plots can have. Our plot requires some digging to find insights, but that's okay!
If we enjoyed a specific product, there's an increased chance we'd enjoy another product that is similar in chemical composition. Say we enjoyed AmorePacific's Color Control Cushion Compact Broad Spectrum SPF 50+. We could find this product on the plot and see if any similar products exist. And it turns out one does! If we look at the points furthest left on the plot, we see that LANEIGE's BB Cushion Hydra Radiance SPF 50 essentially overlaps with the AmorePacific product. By looking at the ingredients, we can confirm that the compositions of the products are similar (though it is difficult to do, which is why we did this analysis in the first place!). Plus, LANEIGE's version is $22 cheaper and actually has higher ratings.
It's not perfect, but it's useful. In real life, we can actually use our little ingredient-based recommendation engine to help us make educated cosmetic purchase choices.
# Print the ingredients of two similar cosmetics
cosmetic_1 = moisturizers_dry[moisturizers_dry['Name'] == "Color Control Cushion Compact Broad Spectrum SPF 50+"]
cosmetic_2 = moisturizers_dry[moisturizers_dry['Name'] == "BB Cushion Hydra Radiance SPF 50"]
# Display each item's data and ingredients
display(cosmetic_1)
print(cosmetic_1.Ingredients.values)
display(cosmetic_2)
print(cosmetic_2.Ingredients.values)
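Comparing two long ingredient lists by eye is hard, so a simple overlap score can help. The sketch below computes the Jaccard similarity (shared ingredients divided by all distinct ingredients) of two token lists. The function name and the two short lists are made up for illustration and are not the real ingredients of the products above.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Share of distinct ingredients that two token lists have in common."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists: define the similarity as 0
    return len(set_a & set_b) / len(union)


# Made-up ingredient lists, not taken from the dataset
first = ['water', 'glycerin', 'niacinamide', 'titanium dioxide']
second = ['water', 'glycerin', 'zinc oxide', 'titanium dioxide']
print(jaccard_similarity(first, second))  # 0.6
```

A score of 1 means identical ingredient sets and 0 means no overlap. On the real data, the inputs would be the tokenized Ingredients strings of the two products.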