Fine-Tuning Word2Vec with Gensim
Note: There are some issues with this fine-tuning approach, so it is not recommended; see: https://stackoverflow.com/questions/68298289/fine-tuning-pre-trained-word2vec-model-with-gensim-4-0.
Given a set of documents, I want to be able to get word embeddings. I could train a Word2Vec model from scratch, but I found that starting from a pre-trained model and then fine-tuning it gave results that made more sense: when I plot the word embeddings in 2D and look at which words neighbour each other, the fine-tuned model produced better results than the model trained from scratch.
This page will show how to generate those word embeddings by fine-tuning a pre-trained model using Gensim in Python.
Note: This uses gensim 3.x; the code won't work as-is with gensim 4+.
Computing the Word Embeddings
In this context, word embeddings are a representation of words in space such that words with similar meanings are plotted closer together, while words with different meanings are plotted further apart. Meaning is determined by the co-occurrence of words; here we are looking at a window of 5 words. Words that co-occur are then moved closer together in space. For instance, machine learning and artificial intelligence might co-occur in a sentence and hence would be represented closer together in space.
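For comparison, training from scratch takes only a few lines. Here is a minimal sketch using the gensim 3.x API, with made-up toy sentences purely for illustration; the window parameter is the co-occurrence window mentioned above.
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (made up for illustration)
toy_sents = [
    ["machine", "learning", "and", "artificial", "intelligence"],
    ["artificial", "intelligence", "relies", "on", "machine", "learning"],
]

# window=5 means co-occurrence is counted within 5 words on either side
toy_model = Word2Vec(toy_sents, size=100, window=5, min_count=1)
print(toy_model.wv.most_similar("learning", topn=3))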
Often it’s beneficial to make use of a pre-trained model since such models are trained on much larger datasets with many more words and hence might be better at capturing the underlying meaning of a word. We can then build on this prior model and fine-tune the word embeddings to match with our current dataset.
Pre-Trained Model
I downloaded the GloVe model from the Stanford page. I chose the wiki-gigaword-100 model.
You then need to convert the file from the GloVe format to the word2vec format (see this stackoverflow post for details). Depending on your version of gensim you have two options:
- If you have a gensim version below 4 (as I do), you need to add a header line that indicates the vector count and dimensions. There is a function for doing this: https://radimrehurek.com/gensim/scripts/glove2word2vec.html.
- If you have gensim 4+, you can simply set no_header=True when calling load_word2vec_format (a short sketch of this follows the conversion snippet below).
from gensim.scripts.glove2word2vec import glove2word2vec

# Add the word2vec header (vector count and dimensions) to the GloVe file
glove2word2vec("/path/to/glove/glove.6B.100d.txt", "/path/to/glove/glove_model2.txt")
Fine-Tuning the Model
This is borrowed from the reply in https://datascience.stackexchange.com/questions/10695/how-to-initialize-a-new-word2vec-model-with-pre-trained-model-weights.
We first set up our Word2Vec model with 100 dimensions and a minimum word frequency of 1. Then we build up the vocabulary from my list of lists. sents here is a list of sentences where each sentence is a list of tokens, so something like [['I', 'am', 'a', 'sentence', '.'], ['Another', 'sentence', 'here']]. You can use something like nltk.sent_tokenize to get each sentence, and then nltk.word_tokenize to get the tokens within each sentence, as in the sketch below.
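As an illustration, here is a rough sketch of building sents from raw documents with NLTK (the sample text is made up; you may need to run nltk.download('punkt') once beforehand):
from nltk import sent_tokenize, word_tokenize

# Made-up documents purely for illustration
docs = ["I am a sentence. Another sentence here."]

# Split each document into sentences, then each sentence into tokens
sents = [word_tokenize(sent) for doc in docs for sent in sent_tokenize(doc)]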
from gensim.models import Word2Vec
model = Word2Vec(size=100, min_count=1)
model.build_vocab(sents)
total_examples = model.corpus_count
# Save the vocab of your dataset
vocab = list(model.wv.vocab.keys())
We can load the pre-trained model as follows:
from gensim.models import KeyedVectors
pretrained_path = "/path/to/glove/glove_model2.txt"
model_2 = KeyedVectors.load_word2vec_format(pretrained_path, binary=False)
# Add the pre-trained model vocabulary
model.build_vocab([list(model_2.vocab.keys())], update=True)
# Load the pre-trained models embeddings
# note: if a word doesn't exist in the pre-trained vocabulary then it is left as is in the original model
model.intersect_word2vec_format(pretrained_path, binary=False, lockf=1.0)
Finally, we can train (fine-tune) with our new data.
model.train(sents, total_examples=total_examples, epochs=model.epochs)
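As a quick sanity check on the fine-tuned embeddings, you can look at the nearest neighbours of a word; the query word below is just an example, so use anything that appears in your corpus.
# Inspect nearest neighbours of an example word in the fine-tuned model
print(model.wv.most_similar("learning", topn=5))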
Plot Embeddings
Extract Embeddings
You might consider extracting the embeddings only for the words in your actual corpus and excluding the extra words from the GloVe model.
import numpy as np

# One row per word in vocab; any word missing from the model gets a zero vector
word_embeddings = np.array([model.wv[k] if k in model.wv else np.zeros(100) for k in vocab])
word_embeddings.shape # Should be len(vocab) by 100
The rows in word_embeddings will now correspond to the words in vocab.
Dimensionality Reduction
You can project the 100-dimensional word embeddings into 2D for visualization. Here I'm using UMAP for dimensionality reduction. You might want to standardize each dimension so that no one dimension contributes too much to the final dimensionality reduction; I've not done that here, hence the commented-out code.
import umap.umap_ as umap

reducer = umap.UMAP()
embedding2d = reducer.fit_transform(word_embeddings)

# To standardize each dimension first instead:
#from sklearn.preprocessing import StandardScaler
#scaled_we = StandardScaler().fit_transform(word_embeddings)
#embedding2d = reducer.fit_transform(scaled_we)
Plot
import matplotlib.pyplot as plt
plt.scatter(
embedding2d[:, 0],
embedding2d[:, 1])
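To actually see which words neighbour which, you can label some of the points. The sketch below annotates only the first 20 words of the vocabulary (an arbitrary choice) to keep the plot readable.
# Label a subset of points with their words (first 20 here, purely as an example)
for word, (x, y) in zip(vocab[:20], embedding2d[:20]):
    plt.annotate(word, (x, y), fontsize=8)
plt.show()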