How to Implement Multilingual Search with Sentence Transformers


After following this guide, you will be able to use a pre-trained model to populate a Qdrant database with text and to search it for relevant results.

Creating a searchable database is straightforward; Postgres' full-text search already covers basic keyword matching. What a pre-trained model adds, with Qdrant storing its vectors, is the ability to ingest and search text in any language that the model supports. The model used in this guide supports over 50 languages.

Let's consider a scenario where you have a collection of English and French song lyrics. The model can convert each song's lyrics into a vector, which can then be stored in Qdrant. Once this is done, you can run a search query in English and find semantically similar lyrics in both English and French.

This guide assumes that you are using Linux (I use Ubuntu) and that you are comfortable following the linked pages and examples.

Requirements

Code


from functools import lru_cache
from uuid import uuid4

from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer
from torch.cuda import is_available


@lru_cache(maxsize=1)
def load_model():
    """
    Loads the pre-trained model once and reuses it on later calls,
    instead of reloading it for every piece of text.

    :return: The loaded SentenceTransformer model.
    """
    return SentenceTransformer(
        device="cuda" if is_available() else "cpu",
        model_name_or_path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        trust_remote_code=True
    )


def text_to_vector(text):
    """
    Encodes the provided text into a vector, using a pre-trained model.

    :param text: Text to convert.
    :return: Vector representation of the text, as a list of floats.
    """
    return load_model().encode(text).tolist()


def ingest(qdrant_client, text):
    """
    Ingests the provided text into Qdrant.

    :param qdrant_client: Client connected to the Qdrant instance.
    :param text: Text to ingest.
    """
    qdrant_client.upsert(
        collection_name="example",
        points=[
            models.PointStruct(
                id=str(uuid4()),
                payload={"text": text},
                vector=text_to_vector(text)
            )
        ]
    )


def search(qdrant_client, query):
    """
    Searches Qdrant for text similar to the provided query, and prints the results.

    :param qdrant_client: Client connected to the Qdrant instance.
    :param query: Query to search with.
    :return: None
    """
    results = qdrant_client.search(
        collection_name="example",
        limit=10,
        query_vector=text_to_vector(query),
        with_payload=True
    )

    print(f"Search Results for '{query}':")
    for result in results:
        score = round(result.score * 100, 2)
        print(f"\tScore: {score}\tText: '{result.payload['text']}'")


# Connect to Qdrant, then delete and recreate the collection.
# The vector size must match the model's 768-element output.
qdrant_client = QdrantClient(host="host.docker.internal", port=6333)
qdrant_client.recreate_collection(
    collection_name="example",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE)
)


# Ingest some example text.
ingest(qdrant_client, "Hello, World!")
ingest(qdrant_client, "Olá Mundo!")
ingest(qdrant_client, "こんにちは世界")
ingest(qdrant_client, "हैलो वर्ल्ड!")


# Search for similar text.
search(qdrant_client, "Hello, World!")
search(qdrant_client, "Olá Mundo!")
search(qdrant_client, "こんにちは世界")
search(qdrant_client, "हैलो वर्ल्ड!")

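text_to_vector uses the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 model, which is downloaded on first use, to convert any input text into a 768-element vector. See the Sentence Transformers documentation for a list of available pre-trained models and their tradeoffs.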
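
Because the collection is created with size=768, every vector sent to Qdrant needs to have exactly that many elements. A minimal sketch of a guard that could run before upserting is shown here; check_dimension and EXPECTED_SIZE are hypothetical names introduced for illustration and are not part of the original code or of any library.

```python
EXPECTED_SIZE = 768  # assumed to match VectorParams(size=768) in the code above


def check_dimension(vector, expected_size=EXPECTED_SIZE):
    """Hypothetical helper: return the vector unchanged if its length matches."""
    if len(vector) != expected_size:
        raise ValueError(
            f"Expected a vector of {expected_size} elements, got {len(vector)}."
        )
    return vector
```

A check like this matters mainly if you swap in a different model, since other models may produce vectors of a different length.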

ingest creates a point and inserts it into the example collection.
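
ingest gives every point a random uuid4 ID, so ingesting the same text twice stores two separate points. If that is not what you want, one option is to derive the ID from the text itself, so that a repeated upsert overwrites the existing point. The sketch below is an assumption rather than part of the original code: point_id_for is a hypothetical helper, and the choice of NAMESPACE_URL as the namespace is arbitrary.

```python
from uuid import NAMESPACE_URL, UUID, uuid5


def point_id_for(text):
    """Hypothetical helper: derive a stable UUID string from the text."""
    return str(uuid5(NAMESPACE_URL, text))


first = point_id_for("Hello, World!")
second = point_id_for("Hello, World!")
other = point_id_for("Olá Mundo!")

print(first == second)  # compares IDs derived from the same text
print(first == other)   # compares IDs derived from different texts
```

To try it, id=str(uuid4()) in ingest would become id=point_id_for(text).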

search runs a Qdrant search to find and display results that are semantically similar to the input query.
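
The collection is configured with Distance.COSINE, so each score is the cosine similarity between the query vector and a stored vector, which search then multiplies by 100 for display. As an illustration only, with made-up three-element vectors standing in for the 768-element ones, the calculation can be sketched in plain Python:

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not real model output.
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]
orthogonal = [3.0, 0.0, -1.0]

# Same scaling and rounding as the search function above.
print(round(cosine_similarity(query, same_direction) * 100, 2))
print(round(cosine_similarity(query, orthogonal) * 100, 2))
```

Cosine similarity is symmetric, which is why each pair of texts has the same score in both directions in the output below (for example, 88.44 between 'Hello, World!' and 'Olá Mundo!').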

Code Output


Search Results for 'Hello, World!':
        Score: 100.0    Text: 'Hello, World!'
        Score: 99.09    Text: 'हैलो वर्ल्ड!'
        Score: 96.71    Text: 'こんにちは世界'
        Score: 88.44    Text: 'Olá Mundo!'

Search Results for 'Olá Mundo!':
        Score: 100.0    Text: 'Olá Mundo!'
        Score: 89.48    Text: 'हैलो वर्ल्ड!'
        Score: 88.58    Text: 'こんにちは世界'
        Score: 88.44    Text: 'Hello, World!'

Search Results for 'こんにちは世界':
        Score: 100.0    Text: 'こんにちは世界'
        Score: 96.71    Text: 'Hello, World!'
        Score: 95.36    Text: 'हैलो वर्ल्ड!'
        Score: 88.58    Text: 'Olá Mundo!'

Search Results for 'हैलो वर्ल्ड!':
        Score: 100.0    Text: 'हैलो वर्ल्ड!'
        Score: 99.09    Text: 'Hello, World!'
        Score: 95.36    Text: 'こんにちは世界'
        Score: 89.48    Text: 'Olá Mundo!'

Notes