databases/postgres-pgvector/manifests/02-notebook/vector-database.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "6af02c0b", "metadata": {}, "source": [ " Copyright 2024 Google LLC\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");\n", " you may not use this file except in compliance with the License.\n", " You may obtain a copy of the License at\n", "https://www.apache.org/licenses/LICENSE-2.0\n", "\n", " Unless required by applicable law or agreed to in writing, software\n", " distributed under the License is distributed on an \"AS IS\" BASIS,\n", " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "See the License for the specific language governing permissions and limitations under the License." ] }, { "cell_type": "markdown", "id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7", "metadata": { "id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7" }, "source": [ "This section shows you how to upload Vectors into a Postgres table and run simple search queries using the SQL syntax.\n", "\n", "In this example, you use a dataset from a CSV file that contains a list of books in different genres. Pgvector will serve as a search engine." ] }, { "cell_type": "markdown", "id": "51247bbb-a52f-4003-9596-439f60f3b3c9", "metadata": { "id": "51247bbb-a52f-4003-9596-439f60f3b3c9" }, "source": [ "Install a Postgres database connector, a Pgvector module to support vector types with INSERT commands, and a fastembed library to vectorize the data.\n", "\n", "Install kubectl and the Google Cloud SDK with the necessary authentication plugin for Google Kubernetes Engine (GKE)." ] }, { "cell_type": "code", "execution_count": null, "id": "O8a7cfxi0eQw", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 18598, "status": "ok", "timestamp": 1722001244912, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "O8a7cfxi0eQw", "outputId": "8f8cc053-6d77-4b1b-ae85-81ce5aa20bf1" }, "outputs": [], "source": [ "%%bash\n", "\n", "curl -LO \"https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl\"\n", "sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl\n", "apt-get update && apt-get install apt-transport-https ca-certificates gnupg\n", "curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg\n", "echo \"deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main\" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list\n", "apt-get update && sudo apt-get install google-cloud-cli-gke-gcloud-auth-plugin" ] }, { "cell_type": "markdown", "id": "Gxdu6vP0LeZd", "metadata": { "id": "Gxdu6vP0LeZd" }, "source": [ "Install the Postgres client" ] }, { "cell_type": "code", "execution_count": null, "id": "1c3b796a-3b3a-4322-a276-d72c1dc8540e", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 9208, "status": "ok", "timestamp": 1722005074291, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "1c3b796a-3b3a-4322-a276-d72c1dc8540e", "outputId": "94d374d5-dd38-418d-fa51-f1b292daa21a" }, "outputs": [], "source": [ "! pip install --upgrade --upgrade-strategy only-if-needed pgvector psycopg-binary psycopg fastembed python-dotenv" ] }, { "cell_type": "markdown", "id": "VIcjQZuG06NA", "metadata": { "id": "VIcjQZuG06NA" }, "source": [ "Replace \\<CLUSTER_NAME> with your cluster name, e.g. postgres-cluster. Retrieve the GKE cluster's credentials using the gcloud command." ] }, { "cell_type": "code", "execution_count": null, "id": "hycB0MTP07v7", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1228, "status": "ok", "timestamp": 1722001524988, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "hycB0MTP07v7", "outputId": "b4f6a981-7d55-4026-c2eb-0b45abead7ab" }, "outputs": [], "source": [ "%%bash\n", "\n", "export KUBERNETES_CLUSTER_NAME=<CLUSTER_NAME>\n", "gcloud container clusters get-credentials $KUBERNETES_CLUSTER_NAME --region $GOOGLE_CLOUD_REGION" ] }, { "cell_type": "markdown", "id": "YNmb37LZ0-6h", "metadata": { "id": "YNmb37LZ0-6h" }, "source": [ " Download the dataset from Git." ] }, { "cell_type": "code", "execution_count": null, "id": "PlzciQKZ1DT9", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 344, "status": "ok", "timestamp": 1722001527450, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "PlzciQKZ1DT9", "outputId": "1305840d-c3b7-4789-e2ff-de2d15469e28" }, "outputs": [], "source": [ "%%bash\n", "\n", "export DATASET_PATH=https://raw.githubusercontent.com/epam/kubernetes-engine-samples/internal_lb/databases/postgres-pgvector/manifests/02-notebook/dataset.csv\n", "curl -s -LO $DATASET_PATH" ] }, { "cell_type": "markdown", "id": "e8dcafca", "metadata": {}, "source": [ "Please run the next command and check if PGvector internal load balancer achieved an IP address. If you see ip address in the output proceed to the next step if blanc please repeat the command after a few minutes or check the status of pg-ilb service from your console, proceed to the next step only when IP address appears." ] }, { "cell_type": "code", "execution_count": null, "id": "095e6987", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kubectl get svc pg-ilb -n pg-ns --output jsonpath=\"{.status.loadBalancer.ingress[0].ip}\"" ] }, { "cell_type": "markdown", "id": "JLLKxwDc1Gxo", "metadata": { "id": "JLLKxwDc1Gxo" }, "source": [ "Create an .env file with environment variables required for connecting to PGvector in a Kubernetes cluster." ] }, { "cell_type": "code", "execution_count": null, "id": "9k6yRV6_1M70", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 848, "status": "ok", "timestamp": 1722002565179, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "9k6yRV6_1M70", "outputId": "afeb04df-6aa4-48b7-a06a-156ab1a744c0" }, "outputs": [], "source": [ "%%bash\n", "echo PGVECTOR_ENDPOINT=$(kubectl get svc pg-ilb -n pg-ns --output jsonpath=\"{.status.loadBalancer.ingress[0].ip}\") > .env\n", "echo CLIENTUSERNAME=$(kubectl get secret gke-pg-cluster-superuser -n pg-ns --template='{{index .data \"username\"}}'| base64 -d) >> .env\n", "echo CLIENTPASSWORD=$(kubectl get secret gke-pg-cluster-superuser -n pg-ns --template='{{index .data \"password\"}}'| base64 -d) >> .env" ] }, { "cell_type": "markdown", "id": "320f0cb6-61c9-42fe-b361-ea3c92c35421", "metadata": { "id": "320f0cb6-61c9-42fe-b361-ea3c92c35421" }, "source": [ "Import required python libraries:" ] }, { "cell_type": "code", "execution_count": 6, "id": "bb5ca67b-607d-4b23-926a-6459ea584f45", "metadata": { "executionInfo": { "elapsed": 1, "status": "ok", "timestamp": 1722003109891, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "bb5ca67b-607d-4b23-926a-6459ea584f45" }, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "from pgvector.psycopg import register_vector\n", "import psycopg\n", "import os\n", "import sys\n", "import csv\n", "from fastembed import TextEmbedding\n", "from typing import List\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "15f64563-f932-4a38-bd96-5b9d5cfadfd3", "metadata": { "id": "15f64563-f932-4a38-bd96-5b9d5cfadfd3" }, "source": [ "Connect to the Postgres cluster:" ] }, { "cell_type": "code", "execution_count": 21, "id": "013284ff-e4b6-4ad7-b330-17860121c4c1", "metadata": { "executionInfo": { "elapsed": 329, "status": "ok", "timestamp": 1722004908307, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "013284ff-e4b6-4ad7-b330-17860121c4c1" }, "outputs": [], "source": [ "load_dotenv()\n", "conn = psycopg.connect(\n", " dbname=\"app\",\n", " host=os.getenv(\"PGVECTOR_ENDPOINT\"),\n", " user=os.getenv(\"CLIENTUSERNAME\"),\n", " password=os.getenv(\"CLIENTPASSWORD\"))" ] }, { "cell_type": "markdown", "id": "f6f00b67-901e-4470-ab26-94c3b0e010d8", "metadata": { "id": "f6f00b67-901e-4470-ab26-94c3b0e010d8" }, "source": [ "Ensure the pgvector is installed and prepare the table for vectors:" ] }, { "cell_type": "code", "execution_count": null, "id": "75f71220-349b-41f0-89ea-1ba7a1c52771", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 307, "status": "ok", "timestamp": 1722004925255, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "75f71220-349b-41f0-89ea-1ba7a1c52771", "outputId": "1c824f1c-ca8f-46f8-d5c0-7c29151311e3" }, "outputs": [], "source": [ "conn.execute('CREATE EXTENSION IF NOT EXISTS vector;')\n", "register_vector(conn)\n", "conn.execute('DROP TABLE IF EXISTS documents;')\n", "conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, author text, title text, description text, embedding vector(384));')" ] }, { "cell_type": "markdown", "id": "df7eb305-6f3e-4215-8090-71d044a302aa", "metadata": { "id": "df7eb305-6f3e-4215-8090-71d044a302aa" }, "source": [ "Load data from a CSV: It reads the dataset.csv file for inserting data into a Postgres table." ] }, { "cell_type": "code", "execution_count": 24, "id": "b08ebd75-0b8c-4805-a40f-634d2d5df3de", "metadata": { "executionInfo": { "elapsed": 348, "status": "ok", "timestamp": 1722004932356, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "b08ebd75-0b8c-4805-a40f-634d2d5df3de" }, "outputs": [], "source": [ "books = [*csv.DictReader(open('/content/dataset.csv'))]" ] }, { "cell_type": "markdown", "id": "c491a826-0f86-4a25-a0ba-cfad62c79da5", "metadata": { "id": "c491a826-0f86-4a25-a0ba-cfad62c79da5" }, "source": [ "Declare the Embedding model and vectorize books descriptions from the dataset:" ] }, { "cell_type": "code", "execution_count": 25, "id": "637e4922-d58c-4eb3-91c2-03252422c662", "metadata": { "executionInfo": { "elapsed": 86378, "status": "ok", "timestamp": 1722005022023, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "637e4922-d58c-4eb3-91c2-03252422c662" }, "outputs": [], "source": [ "descriptions = [doc[\"description\"] for doc in books]\n", "embedding_model = TextEmbedding(model_name=\"BAAI/bge-small-en\")\n", "embeddings: List[np.ndarray] = list(embedding_model.embed(descriptions))" ] }, { "cell_type": "markdown", "id": "d1052335-63d3-4dd5-8e54-6627bb289abf", "metadata": { "id": "d1052335-63d3-4dd5-8e54-6627bb289abf" }, "source": [ "Insert data (author, title, book description and the corresponding vector) into the Postgres table:" ] }, { "cell_type": "code", "execution_count": null, "id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 346 }, "executionInfo": { "elapsed": 375, "status": "error", "timestamp": 1722003281450, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c", "outputId": "922c78e6-84c1-4c8b-dd2d-de77e8436143" }, "outputs": [], "source": [ "for i, doc in enumerate(books):\n", " conn.execute('INSERT INTO documents (author, title, description, embedding) VALUES (%s, %s, %s, %s)', (doc[\"author\"], doc[\"title\"], doc[\"description\"], embeddings[i]))" ] }, { "cell_type": "markdown", "id": "a77a57d3", "metadata": {}, "source": [ "Query the Postgres database. It runs a search query about `drama about people and unhappy love` and displays results.\n", "\n", "This query performs a semantic search against the documents table in Postgres, retrieving a maximum of two results with highest match score relevant to your query text.\n", "It prints each result separated by a line of dashes, in the following format:\n", "\n", "- Title: Title of the book, Author: Author of the book\n", "- Book description as stored in your document's description row." ] }, { "cell_type": "code", "execution_count": null, "id": "281a9791-8fb8-49f5-b80d-6ca849da4b88", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 401 }, "executionInfo": { "elapsed": 390, "status": "error", "timestamp": 1722003318706, "user": { "displayName": "", "userId": "" }, "user_tz": -180 }, "id": "281a9791-8fb8-49f5-b80d-6ca849da4b88", "outputId": "6a3fca71-e066-411e-df0c-9fcc1e52d8c9" }, "outputs": [], "source": [ "query_vector = list(embedding_model.embed([\"drama about people and unhappy love\"]))[0]\n", "response = conn.execute('SELECT title, author, description FROM documents ORDER BY embedding <-> %s LIMIT 2', (query_vector,)).fetchall()\n", "for hit in response:\n", " print(\"Title: {}, Author: {}\".format(hit[0], hit[1]))\n", " print(hit[2])\n", " print(\"---------\")" ] } ], "metadata": { "colab": { "name": "vector-database (2).ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0rc1" } }, "nbformat": 4, "nbformat_minor": 5 }

databases/postgres-pgvector/manifests/02-notebook/vector-database.ipynb (542 lines of code) (raw):