databases/elasticsearch/manifests/03-notebook/vector-database.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "QweRw5aWgx2E", "metadata": { "id": "QweRw5aWgx2E" }, "source": [ "Copyright 2024 Google LLC\n", "#\n", "Licensed under the Apache License, Version 2.0 (the \"License\");\n", "you may not use this file except in compliance with the License.\n", "You may obtain a copy of the License at\n", "#\n", " https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "Unless required by applicable law or agreed to in writing, software\n", " distributed under the License is distributed on an \"AS IS\" BASIS,\n", "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "See the License for the specific language governing permissions and\n", "limitations under the License." ] }, { "cell_type": "markdown", "id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7", "metadata": { "id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7" }, "source": [ "This section shows you how to upload Vectors into a new Elasticsearch index and run simple search queries using the official Elasticsearch client.\n", "\n", "In this example, you use a dataset from a CSV file that contains a list of books in different genres. Elasticsearch will serve as a search engine.\n", "\n", "Install kubectl and the Google Cloud SDK with the necessary authentication plugin for Google Kubernetes Engine (GKE)." ] }, { "cell_type": "code", "execution_count": null, "id": "VTHbyhzyhWFv", "metadata": { "id": "VTHbyhzyhWFv" }, "outputs": [], "source": [ "%%bash\n", "\n", "curl -LO \"https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl\"\n", "sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl\n", "apt-get update && apt-get install apt-transport-https ca-certificates gnupg\n", "curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg\n", "echo \"deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main\" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list\n", "apt-get update && sudo apt-get install google-cloud-cli-gke-gcloud-auth-plugin" ] }, { "cell_type": "markdown", "id": "51247bbb-a52f-4003-9596-439f60f3b3c9", "metadata": { "id": "51247bbb-a52f-4003-9596-439f60f3b3c9" }, "source": [ "Install Elasticsearch python client and fastembed libraries:" ] }, { "cell_type": "code", "execution_count": null, "id": "1c3b796a-3b3a-4322-a276-d72c1dc8540e", "metadata": { "id": "1c3b796a-3b3a-4322-a276-d72c1dc8540e" }, "outputs": [], "source": [ "! pip install elasticsearch fastembed python-dotenv" ] }, { "cell_type": "markdown", "id": "glSU95Rlhbvw", "metadata": { "id": "glSU95Rlhbvw" }, "source": [ "Replace \\<CLUSTER_NAME> with your cluster name, e.g. elasticsearch-cluster. Retrieve the GKE cluster's credentials using the gcloud command." ] }, { "cell_type": "code", "execution_count": null, "id": "e20-I1IkhZOt", "metadata": { "id": "e20-I1IkhZOt" }, "outputs": [], "source": [ "%%bash\n", "\n", "export KUBERNETES_CLUSTER_NAME= <CLUSTER_NAME>\n", "gcloud container clusters get-credentials $KUBERNETES_CLUSTER_NAME --region $GOOGLE_CLOUD_REGION" ] }, { "cell_type": "markdown", "id": "4VsgO6i0hEDN", "metadata": { "id": "4VsgO6i0hEDN" }, "source": [ "Download the dataset from Git." 
] }, { "cell_type": "code", "execution_count": 67, "id": "6iX2LINVhDhl", "metadata": { "executionInfo": { "elapsed": 348, "status": "ok", "timestamp": 1722329123514, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "6iX2LINVhDhl" }, "outputs": [], "source": [ "%%bash\n", "\n", "export DATASET_PATH=export DATASET_PATH=https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/databases/qdrant/manifests/04-notebook/dataset.csv\n", "curl -s -LO $DATASET_PATH" ] }, { "cell_type": "markdown", "id": "844f6040", "metadata": {}, "source": [ "Please run the next command and check if Elastic internal load balancer achieved an IP address. If you see ip address in the output proceed to the next step if blanc please repeat the command after a few minutes or check the status of elastic-ilb service from your console, proceed to the next step only when IP address appears." ] }, { "cell_type": "code", "execution_count": null, "id": "53aba95f", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kubectl get svc elastic-ilb -n elastic --output jsonpath=\"{.status.loadBalancer.ingress[0].ip}\"" ] }, { "cell_type": "markdown", "id": "JHsHRZS3ixCh", "metadata": { "id": "JHsHRZS3ixCh" }, "source": [ "Create an .env file with environment variables required for connecting to Elastic in a Kubernetes cluster." ] }, { "cell_type": "code", "execution_count": 4, "id": "Kk1qBn-ci0vk", "metadata": { "executionInfo": { "elapsed": 1226, "status": "ok", "timestamp": 1722325270444, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "Kk1qBn-ci0vk" }, "outputs": [], "source": [ "%%bash\n", "echo ELASTIC_ENDPOINT=\"https://$(kubectl get svc elastic-ilb -n elastic --output jsonpath=\"{.status.loadBalancer.ingress[0].ip}\"):9200\" > .env\n", "echo USER=$(kubectl get secret elasticsearch-ha-es-elastic-user -n elastic --template='{{index .data \"elastic\"}}'| base64 -d) > .env" ] }, { "cell_type": "markdown", "id": "vsuV0hW2-b1k", "metadata": { "id": "vsuV0hW2-b1k" }, "source": [ "Import required python libraries:" ] }, { "cell_type": "code", "execution_count": 69, "id": "bb5ca67b-607d-4b23-926a-6459ea584f45", "metadata": { "executionInfo": { "elapsed": 342, "status": "ok", "timestamp": 1722329145088, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "bb5ca67b-607d-4b23-926a-6459ea584f45" }, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "from elasticsearch import Elasticsearch\n", "from elasticsearch.helpers import bulk\n", "import os\n", "import csv\n", "from fastembed import TextEmbedding\n", "from typing import List\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "Xxa2Zt_g-WbD", "metadata": { "id": "Xxa2Zt_g-WbD" }, "source": [ "Load and prepare data from a CSV file for inserting it into an Elasticsearch index:" ] }, { "cell_type": "code", "execution_count": 70, "id": "013284ff-e4b6-4ad7-b330-17860121c4c1", "metadata": { "executionInfo": { "elapsed": 96876, "status": "ok", "timestamp": 1722329244565, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "013284ff-e4b6-4ad7-b330-17860121c4c1" }, "outputs": [], "source": [ "books = [*csv.DictReader(open('/content/dataset.csv'))]\n", "descriptions = [doc[\"description\"] for doc in books]\n", "embedding_model = TextEmbedding(model_name=\"BAAI/bge-small-en\")\n", "embeddings: List[np.ndarray] = list(embedding_model.embed(descriptions))" ] }, { "cell_type": "markdown", "id": "f6f00b67-901e-4470-ab26-94c3b0e010d8", "metadata": { "id": 
"f6f00b67-901e-4470-ab26-94c3b0e010d8" }, "source": [ "Establish a connection to the Elasticsearch cluster:" ] }, { "cell_type": "code", "execution_count": null, "id": "75f71220-349b-41f0-89ea-1ba7a1c52771", "metadata": { "id": "75f71220-349b-41f0-89ea-1ba7a1c52771" }, "outputs": [], "source": [ "load_dotenv()\n", "print([os.getenv(\"ELASTIC_ENDPOINT\")])\n", "client = Elasticsearch([os.getenv(\"ELASTIC_ENDPOINT\")], verify_certs=False,\n", " basic_auth=(\"elastic\",\n", " os.getenv(\"USER\"))\n", ")" ] }, { "cell_type": "markdown", "id": "df7eb305-6f3e-4215-8090-71d044a302aa", "metadata": { "id": "df7eb305-6f3e-4215-8090-71d044a302aa" }, "source": [ "Create an Elasticsearch index with defined schema:" ] }, { "cell_type": "code", "execution_count": null, "id": "b08ebd75-0b8c-4805-a40f-634d2d5df3de", "metadata": { "id": "b08ebd75-0b8c-4805-a40f-634d2d5df3de" }, "outputs": [], "source": [ "index_scheme = {\n", " \"settings\": {\n", " \"number_of_shards\": 3,\n", " \"number_of_replicas\": 1\n", " },\n", " \"mappings\": {\n", " \"dynamic\": \"true\",\n", " \"_source\": {\n", " \"enabled\": \"true\"\n", " },\n", " \"properties\": {\n", " \"title\": {\n", " \"type\": \"text\"\n", " },\n", " \"author\": {\n", " \"type\": \"text\"\n", " },\n", " \"publishDate\": {\n", " \"type\": \"text\"\n", " },\n", " \"description\": {\n", " \"type\": \"text\"\n", " },\n", " \"description_vector\": {\n", " \"type\": \"dense_vector\",\n", " \"dims\": 384\n", " }\n", " }\n", " }\n", "}\n", "client.options(ignore_status=[400,404]).indices.delete(index='books')\n", "client.indices.create(index=\"books\", body=index_scheme)\n" ] }, { "cell_type": "markdown", "id": "c491a826-0f86-4a25-a0ba-cfad62c79da5", "metadata": { "id": "c491a826-0f86-4a25-a0ba-cfad62c79da5" }, "source": [ "Prepare data for uploading:" ] }, { "cell_type": "code", "execution_count": 73, "id": "637e4922-d58c-4eb3-91c2-03252422c662", "metadata": { "executionInfo": { "elapsed": 321, "status": "ok", "timestamp": 1722329277923, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "637e4922-d58c-4eb3-91c2-03252422c662" }, "outputs": [], "source": [ "documents: list[dict[str, any]] = []\n", "\n", "for i, doc in enumerate(books):\n", " book = doc\n", " book[\"_op_type\"] = \"index\"\n", " book[\"_index\"] = \"books\"\n", " book[\"description_vector\"] = embeddings[i]\n", " documents.append(book)" ] }, { "cell_type": "markdown", "id": "UUF8rUP4ssw8", "metadata": { "id": "UUF8rUP4ssw8" }, "source": [ "Upload data:" ] }, { "cell_type": "code", "execution_count": null, "id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c", "metadata": { "id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c" }, "outputs": [], "source": [ "bulk(client, documents)" ] }, { "cell_type": "markdown", "id": "cbc7fb0d-1b75-4d39-b205-6fc2f7fff7ed", "metadata": { "id": "cbc7fb0d-1b75-4d39-b205-6fc2f7fff7ed" }, "source": [ "Define a function to query data from Elasticsearch.\n", "\n", "It prints each result separated by a line of dashes, in the following format :\n", "\n", "- Title: Title of the book, Author: Author of the book, Score: Elasticsearch relevancy score\n", "- Description of the book" ] }, { "cell_type": "code", "execution_count": 76, "id": "281a9791-8fb8-49f5-b80d-6ca849da4b88", "metadata": { "executionInfo": { "elapsed": 336, "status": "ok", "timestamp": 1722329310529, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "281a9791-8fb8-49f5-b80d-6ca849da4b88" }, "outputs": [], "source": [ "def handle_query(query, limit):\n", " query_vector = 
{ "cell_type": "markdown", "id": "cbc7fb0d-1b75-4d39-b205-6fc2f7fff7ed", "metadata": { "id": "cbc7fb0d-1b75-4d39-b205-6fc2f7fff7ed" }, "source": [ "Define a function to query data from Elasticsearch.\n", "\n", "It prints each result separated by a line of dashes, in the following format:\n", "\n", "- Title: Title of the book, Author: Author of the book, Score: Elasticsearch relevancy score\n", "- Description of the book" ] },
{ "cell_type": "code", "execution_count": 76, "id": "281a9791-8fb8-49f5-b80d-6ca849da4b88", "metadata": { "executionInfo": { "elapsed": 336, "status": "ok", "timestamp": 1722329310529, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "281a9791-8fb8-49f5-b80d-6ca849da4b88" }, "outputs": [], "source": [ "def handle_query(query, limit):\n", "    query_vector = list(embedding_model.embed([query]))[0]\n", "    script_query = {\n", "        \"script_score\": {\n", "            \"query\": {\"match_all\": {}},\n", "            \"script\": {\n", "                \"source\": \"cosineSimilarity(params.query_vector, 'description_vector') + 1.0\",\n", "                \"params\": {\"query_vector\": query_vector}\n", "            }\n", "        }\n", "    }\n", "    response = client.search(\n", "        index=\"books\",\n", "        body={\n", "            \"size\": limit,\n", "            \"query\": script_query,\n", "            \"_source\": {\"includes\": [\"description\", \"title\", \"author\"]}\n", "        }\n", "    )\n", "    for hit in response[\"hits\"][\"hits\"]:\n", "        print(\"Title: {}, Author: {}, Score: {}\".format(hit[\"_source\"][\"title\"], hit[\"_source\"][\"author\"], hit[\"_score\"]))\n", "        print(hit[\"_source\"][\"description\"])\n", "        print(\"---------\")" ] },
{ "cell_type": "markdown", "id": "d6f3c7fe-0452-4b4d-aa81-aa833542f617", "metadata": { "id": "d6f3c7fe-0452-4b4d-aa81-aa833542f617" }, "source": [ "Query the Elasticsearch database. This runs a search query for `drama about people and unhappy love` and displays the results." ] },
{ "cell_type": "code", "execution_count": null, "id": "8351514a-a423-440b-8138-68face2b0417", "metadata": { "id": "8351514a-a423-440b-8138-68face2b0417" }, "outputs": [], "source": [ "handle_query(\"drama about people and unhappy love\", 2)" ] }
], "metadata": { "colab": { "name": "vector-database.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0rc1" } }, "nbformat": 4, "nbformat_minor": 5 }