databases/weaviate/manifests/02-notebook/vector-database.ipynb (399 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "id": "fcd145fa-10d4-4597-9250-1c61984fc5bb", "metadata": { "id": "fcd145fa-10d4-4597-9250-1c61984fc5bb" }, "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7", "metadata": { "id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7" }, "source": [ "This section shows you how to upload vectors into a new Weaviate collection and run simple search queries using the official Weaviate client. In this example, you use a dataset from a CSV file that contains a list of books in different genres. Weaviate serves as the search engine." ] }, { "cell_type": "markdown", "source": [ "Install **kubectl** and the **Google Cloud SDK** with the necessary authentication plugin for Google Kubernetes Engine (GKE)." 
], "metadata": { "id": "m6xkFsP9lANt" }, "id": "m6xkFsP9lANt" }, { "cell_type": "code", "source": [ "%%bash\n", "\n", "curl -LO \"https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl\"\n", "sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl\n", "sudo apt-get update && sudo apt-get install -y apt-transport-https ca-certificates gnupg\n", "curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg\n", "echo \"deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main\" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list\n", "sudo apt-get update && sudo apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin" ], "metadata": { "id": "N1HivL_jlEL-" }, "id": "N1HivL_jlEL-", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Retrieve the GKE cluster's credentials using the `gcloud` command. **Replace** \\<CLUSTER_NAME> with your cluster name, for example `weaviate-cluster`. The cell expects the `GOOGLE_CLOUD_REGION` environment variable to already be set to the cluster's region." ], "metadata": { "id": "v7x8MfDRl9Z9" }, "id": "v7x8MfDRl9Z9" }, { "cell_type": "code", "source": [ "%%bash\n", "\n", "export KUBERNETES_CLUSTER_NAME=<CLUSTER_NAME>\n", "gcloud container clusters get-credentials $KUBERNETES_CLUSTER_NAME --region $GOOGLE_CLOUD_REGION" ], "metadata": { "id": "W0tJ9DpImOP_" }, "id": "W0tJ9DpImOP_", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Download the dataset from GitHub." 
], "metadata": { "id": "I77EqyFcmcn5" }, "id": "I77EqyFcmcn5" }, { "cell_type": "code", "source": [ "%%bash\n", "\n", "export DATASET_PATH=https://raw.githubusercontent.com/epam/kubernetes-engine-samples/Weaviate/databases/weaviate/manifests/02-notebook/dataset.csv\n", "curl -s -LO $DATASET_PATH" ], "metadata": { "id": "F8Zy_NIzmeJR" }, "id": "F8Zy_NIzmeJR", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Create a `.env` file with the environment variables required to connect to Weaviate in the Kubernetes cluster. `PALM_APIKEY` holds a short-lived Google Cloud access token, so rerun this cell if the token expires." ], "metadata": { "id": "sNn04YZ-m9pZ" }, "id": "sNn04YZ-m9pZ" }, { "cell_type": "code", "source": [ "%%bash\n", "\n", "echo WEAVIATE_ENDPOINT=$(kubectl get svc weaviate-ilb -n weaviate --output jsonpath=\"{.status.loadBalancer.ingress[0].ip}\") > .env\n", "echo APIKEY=$(kubectl get secret apikeys -n weaviate --template={{.data.AUTHENTICATION_APIKEY_ALLOWED_KEYS}} | base64 -d) >> .env\n", "echo PALM_APIKEY=$(gcloud auth print-access-token) >> .env" ], "metadata": { "id": "uZnC1EZPm-7g" }, "id": "uZnC1EZPm-7g", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "id": "51247bbb-a52f-4003-9596-439f60f3b3c9", "metadata": { "id": "51247bbb-a52f-4003-9596-439f60f3b3c9" }, "source": [ "Install the Weaviate Python client and python-dotenv:" ] }, { "cell_type": "code", "execution_count": null, "id": "c77e963b-c9ea-47a9-bdbb-029372838364", "metadata": { "id": "c77e963b-c9ea-47a9-bdbb-029372838364" }, "outputs": [], "source": [ "! 
pip install weaviate-client python-dotenv" ] }, { "cell_type": "markdown", "id": "320f0cb6-61c9-42fe-b361-ea3c92c35421", "metadata": { "id": "320f0cb6-61c9-42fe-b361-ea3c92c35421" }, "source": [ "Import Python libraries:" ] }, { "cell_type": "code", "execution_count": null, "id": "bb5ca67b-607d-4b23-926a-6459ea584f45", "metadata": { "id": "bb5ca67b-607d-4b23-926a-6459ea584f45" }, "outputs": [], "source": [ "import os\n", "import csv\n", "import weaviate\n", "import json\n", "from weaviate.connect import ConnectionParams\n", "from weaviate.classes.config import Configure\n", "from typing import List\n", "import numpy as np\n", "from dotenv import load_dotenv" ] }, { "cell_type": "markdown", "id": "15f64563-f932-4a38-bd96-5b9d5cfadfd3", "metadata": { "id": "15f64563-f932-4a38-bd96-5b9d5cfadfd3" }, "source": [ "Load the book data from the CSV file so that it can be inserted into a Weaviate collection:" ] }, { "cell_type": "code", "execution_count": null, "id": "013284ff-e4b6-4ad7-b330-17860121c4c1", "metadata": { "id": "013284ff-e4b6-4ad7-b330-17860121c4c1" }, "outputs": [], "source": [ "with open('/content/dataset.csv', newline='') as csv_file:\n", "    books = [*csv.DictReader(csv_file)]" ] }, { "cell_type": "markdown", "id": "df7eb305-6f3e-4215-8090-71d044a302aa", "metadata": { "id": "df7eb305-6f3e-4215-8090-71d044a302aa" }, "source": [ "Define a Weaviate connection. It requires an API key for authentication:" ] }, { "cell_type": "code", "execution_count": null, "id": "13427782-ab29-4acc-9b24-59cf1f287619", "metadata": { "id": "13427782-ab29-4acc-9b24-59cf1f287619" }, "outputs": [], "source": [ "load_dotenv()\n", "auth_config = weaviate.auth.AuthApiKey(api_key=os.getenv(\"APIKEY\"))\n", "client = weaviate.WeaviateClient(\n", "    connection_params=ConnectionParams.from_params(\n", "        http_host=os.getenv(\"WEAVIATE_ENDPOINT\"),\n", "        http_port=8080,\n", "        http_secure=False,\n", "        grpc_host=os.getenv(\"WEAVIATE_ENDPOINT\"),\n", "        grpc_port=50051,\n", "        grpc_secure=False,\n", "    ),\n", "    
additional_headers={\n", "        \"X-Palm-Api-Key\": os.getenv(\"PALM_APIKEY\")\n", "    },\n", "    auth_client_secret=auth_config\n", ")\n", "client.connect()" ] }, { "cell_type": "markdown", "id": "69af7764-5477-4e72-b0d2-0cdcc7127443", "metadata": { "id": "69af7764-5477-4e72-b0d2-0cdcc7127443" }, "source": [ "Create the \"Book\" collection, deleting it first if it already exists. Weaviate will vectorize all book descriptions using a Vertex AI embedding model:" ] }, { "cell_type": "code", "execution_count": null, "id": "030e257a-b3ee-435a-9be2-64e3ed08b8cf", "metadata": { "id": "030e257a-b3ee-435a-9be2-64e3ed08b8cf" }, "outputs": [], "source": [ "if client.collections.exists(\"Book\"):\n", "    client.collections.delete(\"Book\")\n", "collection = client.collections.create(\n", "    name=\"Book\",\n", "    vectorizer_config=[\n", "        Configure.NamedVectors.text2vec_palm(\n", "            name=\"description_vector\",\n", "            source_properties=[\"description\"],\n", "            project_id=os.getenv(\"GOOGLE_CLOUD_PROJECT\"),\n", "            model_id=\"text-embedding-005\"\n", "        )\n", "    ],\n", ")" ] }, { "cell_type": "markdown", "id": "933e7a3d-843e-4dd1-9ab0-a4405947fc50", "metadata": { "id": "933e7a3d-843e-4dd1-9ab0-a4405947fc50" }, "source": [ "Insert the books into the Weaviate collection in batches:" ] }, { "cell_type": "code", "execution_count": null, "id": "10ff90db-52f8-43f8-9ba2-615578b603f8", "metadata": { "id": "10ff90db-52f8-43f8-9ba2-615578b603f8" }, "outputs": [], "source": [ "with collection.batch.dynamic() as batch:\n", "    for i, doc in enumerate(books):  # Batch import data\n", "        print(f\"importing book: {i+1}\")\n", "        batch.add_object(properties=doc)" ] }, { "cell_type": "markdown", "id": "9d0ca596-9688-4df3-a8cc-dc384c1e5234", "metadata": { "id": "9d0ca596-9688-4df3-a8cc-dc384c1e5234" }, "source": [ "Define the Weaviate query function. 
Weaviate converts the text query into an embedding and runs a vector search, and the function prints the results.\n", "\n", "Each result is followed by a line of dashes and is printed in the following format:\n", "\n", "- Title: Title of the book\n", "- Author: Author of the book\n", "- Publish date: Book publication date\n", "- Description: The book description, as stored in the object's `description` property" ] }, { "cell_type": "code", "execution_count": null, "id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c", "metadata": { "id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c" }, "outputs": [], "source": [ "def handle_query(query, limit):\n", "    # Weaviate embeds the query text and returns the closest matches.\n", "    result = collection.query.near_text(\n", "        query=query,\n", "        limit=limit\n", "    )\n", "    for hit in result.objects:\n", "        book = hit.properties\n", "        print(\"Title: {}, Author: {}, Publish date: {}\".format(book[\"title\"], book[\"author\"], book[\"publishDate\"]))\n", "        print(book[\"description\"])\n", "        print(\"---------\")" ] }, { "cell_type": "markdown", "id": "84d4af63-4527-4e3d-9f1b-5d0cf112caf2", "metadata": { "id": "84d4af63-4527-4e3d-9f1b-5d0cf112caf2" }, "source": [ "Run the query `drama about people and unhappy love`:" ] }, { "cell_type": "code", "execution_count": null, "id": "365ac7bc-f06c-4db9-81f9-67354fe18e44", "metadata": { "id": "365ac7bc-f06c-4db9-81f9-67354fe18e44" }, "outputs": [], "source": [ "handle_query(\"drama about people and unhappy love\", 2)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0rc1" }, "colab": { "provenance": [] } }, "nbformat": 4, "nbformat_minor": 5 }