Natural Language Processing (NLP) has fascinated me since I first read about the Turing test while studying rhetorical theory and technical communication in college. The complexities and subtleties of our communication always seemed like a defining factor in what makes us a distinct and intelligent species, so training a machine to understand language transforms communication from something that can be so ambiguous, persuasive, and soulful into something that seems mechanical, ordered, and predictable. Once I started coding, it wasn’t long before my curiosity drove me to better understand how we can use machine learning to gain new insight into natural language and derive nuances we might have missed. For example, a recently published paper discusses how NLP was used to make new discoveries in materials science.
One of the NLP tools I’ve been playing with is the Universal Sentence Encoder (USE) hosted on TensorFlow Hub. USE is a pre-trained model that encodes text into a 512-dimensional vector. It is optimized for greater-than-word-length text and is trained on a variety of data sources. There are a few different versions of USE. I chose the model that was trained using a Deep Averaging Network (DAN) since it is lighter on resources than the Transformer-based model. My first project using the tool was to generate wine recommendations based on the semantic similarity between wine descriptions and my search query.

The Data
The wine data encoded by the model comes from a wine review dataset found on kaggle.com. It contains around 130,000 rows of data and includes columns like country, description, title, variety, winery, price, and rating. After I put the data into a dataframe, I dropped rows that contained duplicate descriptions and rows that had null price. I also limited the data to wine varieties that had more than 200 reviews.
# import dependencies
import numpy as np
import pandas as pd
import sqlite3
from sqlite3 import Error

# create a connection to the SQLite database (a forward slash avoids the invalid '\w' escape)
conn = sqlite3.connect('db/wine_data.sqlite')
c = conn.cursor()
# read the table in the database
wine_df = pd.read_sql('Select * from wine_data', conn)
# drop the duplicate descriptions
wine_df = wine_df.drop_duplicates('description')
# drop null prices
wine_df = wine_df.dropna(subset=['price'])
# filter the dataframe to include only varieties with more than 200 reviews
wine_df = wine_df.groupby('variety').filter(lambda x: len(x) > 200)
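To make the three cleaning steps easier to follow, here is a minimal sketch that runs them on a tiny made-up table. The rows, the `toy_df` name, and the `MIN_REVIEWS` threshold of 2 are assumptions for illustration only; the real data uses a threshold of 200.

```python
import pandas as pd

# Toy stand-in for the wine review table (all values are made up).
toy_df = pd.DataFrame({
    "description": ["bright cherry", "bright cherry", "crisp apple",
                    "smoky oak", "ripe plum", "soft tannins"],
    "price": [20.0, 20.0, None, 35.0, 18.0, 25.0],
    "variety": ["Pinot Noir", "Pinot Noir", "Riesling",
                "Pinot Noir", "Pinot Noir", "Merlot"],
})

MIN_REVIEWS = 2  # hypothetical threshold; the article uses 200

# Step 1: keep only the first row for each repeated description.
cleaned = toy_df.drop_duplicates("description")
# Step 2: drop rows with no price (the Riesling row here).
cleaned = cleaned.dropna(subset=["price"])
# Step 3: keep only varieties that still have more than MIN_REVIEWS rows.
cleaned = cleaned.groupby("variety").filter(lambda g: len(g) > MIN_REVIEWS)

print(cleaned)
```

Note that the variety filter counts rows after the duplicates and null prices are gone, so the order of the steps affects which varieties survive.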
Reducing the data by excluding varieties with 200 or fewer reviews left me with 54 varieties of wine. By googling the remaining varieties, I was able to add a color column so the user can limit their search by desired wine color.
# create a column named color
wine_df["color"] = ""
# used to update the database with the wine color; manually updated each wine variety
c.execute("update wine_data set color = 'red' where variety = 'Aglianico' ")
# commit the update to the database so it saves
conn.commit()
# remove all the records without a color
wine_df = pd.read_sql("select country, description, rating, price, province, title, variety, winery, color from wine_data where color in ('red', 'white', 'other')", conn)
wine_df.to_sql('wine_data', conn, if_exists="replace")
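Running one update statement per variety by hand works, but the same idea could also be expressed as a single parameterized statement driven by a lookup table. The sketch below is only an illustration on an in-memory database: the `variety_colors` mapping, the `demo_conn` name, and every entry other than Aglianico being red are assumptions, not the mapping I actually used.

```python
import sqlite3

# Hypothetical variety-to-color mapping; only 'Aglianico' -> 'red' comes from the article.
variety_colors = {
    "Aglianico": "red",
    "Chardonnay": "white",
    "Rosé": "other",
}

# In-memory database with a toy wine_data table standing in for the real one.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table wine_data (variety text, color text)")
demo_conn.executemany(
    "insert into wine_data (variety, color) values (?, '')",
    [("Aglianico",), ("Chardonnay",), ("Rosé",), ("Unknown Blend",)],
)

# One parameterized statement, executed once per (color, variety) pair.
demo_conn.executemany(
    "update wine_data set color = ? where variety = ?",
    [(color, variety) for variety, color in variety_colors.items()],
)
demo_conn.commit()

rows = demo_conn.execute(
    "select variety, color from wine_data order by variety"
).fetchall()
print(rows)
```

Varieties missing from the mapping keep an empty color, so the later `where color in ('red', 'white', 'other')` filter would drop them.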
After cleaning the data, I was left with 100,228 rows.
Setting up the Universal Sentence Encoder
The DAN-based model is around 800 MB, so I felt it was important to host it locally. Using the os module, I set where the model gets cached so I can load it from a local directory instead of downloading it each time.
import os
import tensorflow as tf
import tensorflow_hub as tfhub
# set the directory in which to cache the TensorFlow Universal Sentence Encoder
os.environ["TFHUB_CACHE_DIR"] = 'C:/Users/Admin/Downloads'
download = tfhub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
After downloading the model, you will see a folder appear in the directory named something like 1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47.
Creating the Functions
Even with the model downloaded, the first few iterations of the app were resource intensive and annoyingly slow. After a bit of research and revision, I decided to wrap the model in a function that builds the TensorFlow graph and session once and returns a callable that reuses them, which reduces the overhead and time of rebuilding the graph on every call.
def embed_useT():
    with tf.Graph().as_default():
        text_input = tf.compat.v1.placeholder(dtype=tf.string, shape=[None])
        embed = tfhub.Module('C:/Users/Admin/Downloads/1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47')
        em_txt = embed(text_input)
        session = tf.compat.v1.train.MonitoredSession()
    return lambda x: session.run(em_txt, feed_dict={text_input: list(x)})

# run the model
embed_fn = embed_useT()
# encode the wine descriptions
result = embed_fn(wine_df.description)
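If encoding every description in one call is too heavy for your machine, one option that might help is feeding the encoder fixed-size batches and stacking the results. This is a sketch I have not benchmarked: `embed_in_batches`, `fake_embed`, and the batch size are hypothetical names and values, and `fake_embed` is only a stand-in so the example runs without TensorFlow. In the real app you would pass `embed_fn` instead.

```python
import numpy as np

def embed_in_batches(texts, embed, batch_size=1000):
    """Encode texts in fixed-size chunks and stack the results into one array.

    Assumes texts is non-empty and embed returns a 2-D array with one row per text.
    """
    texts = list(texts)
    chunks = [
        embed(texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
    return np.vstack(chunks)

def fake_embed(batch):
    """Stand-in encoder for illustration only: one 512-wide row per input text."""
    return np.full((len(batch), 512), float(len(batch)), dtype=np.float32)

vectors = embed_in_batches(["a", "b", "c", "d", "e"], fake_embed, batch_size=2)
print(vectors.shape)
```

The final array is the same size either way, so batching would only limit how much text goes through the model in a single `session.run` call.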
Encoding all of the descriptions eats away at system resources and takes up two or more gigabytes of RAM. If you have limited access to memory in your environment, I recommend you save the numpy array of encoded values to the SQLite database. Calling the array from the database instead of encoding it on the fly consumes more hard drive space, but it used about half of the RAM in my testing. You can save the numpy array to the database using this solution I found on Stack Overflow:
import io

def adapt_array(arr):
    '''
    http://stackoverflow.com/a/31312102/190597 (SoulNibbler)
    '''
    out = io.BytesIO()
    np.save(out, arr)
    out.seek(0)
    return sqlite3.Binary(out.read())

def convert_array(text):
    out = io.BytesIO(text)
    out.seek(0)
    return np.load(out)

# Converts np.array to TEXT when inserting.
sqlite3.register_adapter(np.ndarray, adapt_array)
# Converts TEXT to np.array when selecting.
sqlite3.register_converter("array", convert_array)

# reopen the connection with PARSE_DECLTYPES so the "array" converter is applied on select
conn = sqlite3.connect('db/wine_data.sqlite', detect_types=sqlite3.PARSE_DECLTYPES)
c = conn.cursor()
c.execute("create table embeddings (arr array)")
conn.commit()
c.execute("insert into embeddings (arr) values (?)", (result, ))
conn.commit()
# return the array
c.execute("select * from embeddings")
data = c.fetchone()[0]
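One detail worth checking with this approach: the sqlite3 module only runs a registered converter when the connection is opened with `detect_types=sqlite3.PARSE_DECLTYPES`. The sketch below shows the difference on in-memory databases. The `plain_conn`, `typed_conn`, and `toy` names are made up for the demonstration, and the small array stands in for the real embeddings.

```python
import io
import sqlite3
import numpy as np

def adapt_array(arr):
    """Serialize a numpy array to bytes in .npy format."""
    out = io.BytesIO()
    np.save(out, arr)
    return sqlite3.Binary(out.getvalue())

def convert_array(blob):
    """Rebuild a numpy array from .npy bytes."""
    return np.load(io.BytesIO(blob))

sqlite3.register_adapter(np.ndarray, adapt_array)
sqlite3.register_converter("array", convert_array)

toy = np.arange(6, dtype=np.float32).reshape(2, 3)

# Without detect_types the converter is never called, so raw bytes come back.
plain_conn = sqlite3.connect(":memory:")
plain_conn.execute("create table embeddings (arr array)")
plain_conn.execute("insert into embeddings (arr) values (?)", (toy,))
raw = plain_conn.execute("select arr from embeddings").fetchone()[0]

# With PARSE_DECLTYPES the declared column type "array" triggers the converter.
typed_conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
typed_conn.execute("create table embeddings (arr array)")
typed_conn.execute("insert into embeddings (arr) values (?)", (toy,))
restored = typed_conn.execute("select arr from embeddings").fetchone()[0]

print(type(raw).__name__, type(restored).__name__)
```

If you ever get bytes back instead of an array, the connection flags are the first thing I would check.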
After encoding the wine descriptions, I created a function that outputs wine recommendations by encoding a user’s query and finding the dot product of the two arrays:
def recommend_engine(query, color, embedding_table=result):
    wine_df = pd.read_sql('Select * from wine_data', conn)
    embedding = embed_fn([query])
    # calculate similarity with all reviews
    similarity_score = np.dot(embedding, embedding_table.T)
    recommendations = wine_df.copy()
    recommendations['recommendation'] = similarity_score.T
    recommendations = recommendations.sort_values('recommendation', ascending=False)
    # keep only the requested wine color; any other value leaves all colors in place
    if color in ('red', 'white', 'other'):
        recommendations = recommendations.loc[recommendations.color == color]
    recommendations = recommendations[['variety', 'title', 'price', 'description', 'recommendation', 'rating', 'color']]
    return recommendations.head(3).T
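The scoring step inside the function can be looked at in isolation. Here is a minimal sketch with toy three-dimensional vectors instead of real 512-dimensional embeddings; `top_k_by_dot`, `table`, and `query_vec` are hypothetical names. Every vector in the example has unit length, in which case the dot product equals cosine similarity. If your vectors are not unit length, you may want to normalize them first.

```python
import numpy as np

def top_k_by_dot(query_vec, embedding_table, k=3):
    """Return the indices and scores of the k rows with the largest dot product."""
    scores = embedding_table @ query_vec      # one score per row, shape (n_rows,)
    order = np.argsort(scores)[::-1][:k]      # row indices, highest score first
    return order, scores[order]

# Toy unit-length "embeddings", one row per wine description.
table = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([0.8, 0.6, 0.0])

indices, scores = top_k_by_dot(query_vec, table, k=2)
print(indices, scores)
```

Row 2 points in nearly the same direction as the query, so it ranks first, followed by row 0.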
Test the function:
query = "fruity, rich, easy to drink, sweet"
color = 'red'
recommendation = recommend_engine(query, color)
print(query)
recommendation.head(3).T

It was fun exploring all of the wine data and coming up with a somewhat lightweight way to generate recommendations based on a search query. I plan on continuing to explore the Universal Sentence Encoder and think of new projects to challenge myself and improve my code. Check out the code on my GitHub here: