These are my notes on the FAISS vector-indexing library, based on a blog post. The idea is simple, really: FAISS is an index for storing and searching vector embeddings. AFAIK it's built for large-scale applications, so maybe that's an advantage over LlamaIndex's vector index storage. Maybe.
First, we create vector embeddings using SentenceTransformer; in this example, the model is paraphrase-mpnet-base-v2.
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
vectors = encoder.encode(text)
Then we create a FAISS index and store these vectors in it. Note that normalize_L2 normalizes the vectors in place, so it has to run before index.add.
vector_dimension = vectors.shape[1]
index = faiss.IndexFlatL2(vector_dimension)
faiss.normalize_L2(vectors)
index.add(vectors)
Finally, we can search the index with the embedding of the query text.
search_text = "where is your office?"
search_vector = encoder.encode(search_text)
_vector = np.array([search_vector])
faiss.normalize_L2(_vector)
k = index.ntotal
distances, ann = index.search(_vector, k=k)
Full example
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
data = [
["Where are your headquarters located?", "location"],
["Throw my cellphone in the water", "random"],
["Network Access Control?", "networking"],
["Address", "location"],
]
df = pd.DataFrame(data, columns=["text", "category"])
text = df["text"].tolist()
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
vectors = encoder.encode(text)
vector_dimension = vectors.shape[1]
index = faiss.IndexFlatL2(vector_dimension)
faiss.normalize_L2(vectors)
index.add(vectors)
search_text = "where is your office?"
search_vector = encoder.encode(search_text)
_vector = np.array([search_vector])
faiss.normalize_L2(_vector)
k = index.ntotal
distances, ann = index.search(_vector, k=k)
results = pd.DataFrame({"distances": distances[0], "ann": ann[0]})
print(results)
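The ann column holds positional row ids into df, so a merge against df's index recovers the matched text and category for every hit. A sketch, using hypothetical distance values in place of real FAISS output:

```python
import pandas as pd

# Toy stand-ins for the frames in the example above: df is the corpus,
# results holds FAISS's (distance, row-id) pairs for one query.
df = pd.DataFrame(
    [
        ["Where are your headquarters located?", "location"],
        ["Throw my cellphone in the water", "random"],
        ["Network Access Control?", "networking"],
        ["Address", "location"],
    ],
    columns=["text", "category"],
)
results = pd.DataFrame(
    {"distances": [0.35, 1.02, 1.41, 1.56],  # hypothetical distances
     "ann": [0, 3, 2, 1]}
)

# ann values are row positions in df, so joining on df's index attaches
# the original text and category to each neighbor.
merged = results.merge(df, left_on="ann", right_index=True)
print(merged)
```

The first row of merged is then the closest match to the query, text and all.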