Monday, 10 February 2025

Find the semantic similarity of words using the gensim library

Python  Text Analysis 

Word similarity in computing refers to how closely two words are related based on their meaning, spelling, or usage. It is commonly used in natural language processing (NLP) for applications like search engines, chatbots, machine translation, and text analysis.

Types of Word Similarity

  1. Semantic Similarity (Meaning-based)

    • Measures how similar the meanings of two words are.
    • Example: "car" and "automobile" are highly similar, while "car" and "banana" are not similar.
    • Common Methods:
      • Word Embeddings (Word2Vec, GloVe, FastText) – Represent words as vectors based on context.
      • Ontology-Based (WordNet, ConceptNet) – Use human-defined word relationships.
  2. Lexical Similarity (Spelling-based)

    • Measures how similar two words look or sound.
    • Example: "color" and "colour" are lexically similar, but "car" and "automobile" are not.
    • Common Methods:
      • Edit Distance (Levenshtein Distance) – Counts the number of changes needed to turn one word into another.
      • Jaccard Similarity – Compares overlapping characters or bigrams. (A short sketch of both lexical methods follows this list.)
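
To make these lexical methods concrete, here is a minimal sketch of Levenshtein distance and bigram-based Jaccard similarity using only the Python standard library (an unoptimized illustration, not a production implementation; for real projects a dedicated library such as python-Levenshtein is faster):

def levenshtein(a: str, b: str) -> int:
    """Count the minimum number of single-character edits
    (insertions, deletions, substitutions) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard_bigrams(a: str, b: str) -> float:
    """Compare the overlap of character bigrams between two words."""
    bigrams = lambda w: {w[i:i + 2] for i in range(len(w) - 1)}
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(levenshtein("color", "colour"))        # 1 edit apart
print(jaccard_bigrams("color", "colour"))    # 0.5 (half the bigrams shared)
print(jaccard_bigrams("car", "automobile"))  # 0.0 (no shared bigrams)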

Find Words' Semantic Similarity

The following code uses the gensim library to load a pre-trained Word2Vec model and find words that are semantically similar to the word "king". Here is a step-by-step explanation of the code:

  1. Importing the Required Library
from gensim.models import KeyedVectors

The KeyedVectors class from the gensim.models module is used to load and work with pre-trained word embeddings (like Word2Vec).

  2. Loading the Pre-trained Word2Vec Model
# Load pre-trained Word2Vec model (download from https://code.google.com/archive/p/word2vec/)
model_path = "GoogleNews-vectors-negative300.bin"
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

model_path specifies the path to the pre-trained Word2Vec model file (GoogleNews-vectors-negative300.bin). This model was trained on Google News data and contains 300-dimensional vectors for 3 million words and phrases.

KeyedVectors.load_word2vec_format() loads the model from the specified file. The binary=True argument indicates that the file is in binary format.
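
Note that the model file is large (the download is over 1.5 GB). As an alternative to downloading it manually, gensim's downloader API can fetch the same embeddings by name and cache them locally; a minimal sketch:

import gensim.downloader as api

# Downloads on first use, then loads from the local cache.
# Returns the same kind of KeyedVectors object as above.
model = api.load("word2vec-google-news-300")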

  3. Finding Similar Words
similar_words = model.most_similar("king", topn=5)

The most_similar() method is used to find words that are semantically similar to the input word ("king" in this case).

The topn=5 argument specifies that the method should return the top 5 most similar words.
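
Besides ranking neighbours, KeyedVectors can also score an arbitrary pair of words directly, which mirrors the "car"/"automobile" example from earlier. A quick sketch:

# Cosine similarity of a single word pair.
print(model.similarity("car", "automobile"))  # relatively high
print(model.similarity("car", "banana"))      # much lower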

  4. Printing the Results
print("Words similar to 'king':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.2f}")

The results are stored in similar_words as a list of tuples, where each tuple contains a word and its similarity score. The score is the cosine similarity between the two word vectors: a float between -1 and 1, where values closer to 1 mean more similar.

The for loop iterates through the list and prints each word along with its similarity score, formatted to two decimal places.
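
One practical caveat: most_similar() raises a KeyError for words that are not in the model's vocabulary. A small sketch of a defensive lookup, using the key_to_index mapping available in gensim 4.x:

word = "king"
if word in model.key_to_index:  # vocabulary check (gensim 4.x)
    for w, score in model.most_similar(word, topn=5):
        print(f"{w}: {score:.2f}")
else:
    print(f"'{word}' is not in the model's vocabulary")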

  5. Example Output

If you run this code, you might get an output like:

Words similar to 'king':
kings: 0.71
queen: 0.65
monarch: 0.64
crown_prince: 0.62
prince: 0.62

This output shows that the words "kings", "queen", "monarch", etc., are semantically similar to "king", with "kings" being the most similar.

Key Concepts

  1. Word2Vec: A neural network-based model that learns word embeddings (vector representations of words) by predicting words in context. Words with similar meanings tend to have similar vectors.

  2. Semantic Similarity: The measure of how similar two words are in meaning, based on the cosine similarity of their vector representations (see the sketch after this list).

  3. Pre-trained Models: Models like GoogleNews-vectors-negative300.bin are trained on large datasets and can be reused for various NLP tasks without needing to train from scratch.
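
The score gensim reports is the cosine similarity between the two word vectors. To make the definition concrete, here is a short sketch that reproduces it by hand with NumPy:

import numpy as np

v1, v2 = model["king"], model["queen"]  # look up the 300-dimensional vectors

# Cosine similarity: dot product divided by the product of vector lengths.
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"{cosine:.2f}")  # should match model.similarity("king", "queen")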

This code is useful for tasks like finding synonyms, understanding word relationships, or performing analogies.
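
For example, the classic analogy "king - man + woman ≈ queen" can be expressed with the same most_similar() method by combining positive and negative words:

# Vector arithmetic: king - man + woman
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # the top result is typically ('queen', ...)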


