databases/weaviate/manifests/02-notebook/vector-database.ipynb (399 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "fcd145fa-10d4-4597-9250-1c61984fc5bb",
"metadata": {
"id": "fcd145fa-10d4-4597-9250-1c61984fc5bb"
},
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7",
"metadata": {
"id": "201cd5fa-25e0-4bd7-8a27-af1fc85a12e7"
},
"source": [
"This section shows you how to upload vectors into a new Weaviate collection and run simple search queries using the official Weaviate client. In this example, you use a dataset from a CSV file that contains a list of books in different genres. Weaviate serves as the search engine."
]
},
{
"cell_type": "markdown",
"source": [
"Install **kubectl** and the **Google Cloud SDK** with the necessary authentication plugin for Google Kubernetes Engine (GKE)."
],
"metadata": {
"id": "m6xkFsP9lANt"
},
"id": "m6xkFsP9lANt"
},
{
"cell_type": "code",
"source": [
"%%bash\n",
"\n",
"curl -LO \"https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl\"\n",
"sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl\n",
"sudo apt-get update && sudo apt-get install -y apt-transport-https ca-certificates gnupg\n",
"curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg\n",
"echo \"deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main\" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list\n",
"sudo apt-get update && sudo apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin"
],
"metadata": {
"id": "N1HivL_jlEL-"
},
"id": "N1HivL_jlEL-",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Replace** \\<CLUSTER_NAME> with your cluster name, e.g. weaviate-cluster. Retrieve the GKE cluster's credentials using the gcloud command."
],
"metadata": {
"id": "v7x8MfDRl9Z9"
},
"id": "v7x8MfDRl9Z9"
},
{
"cell_type": "code",
"source": [
"%%bash\n",
"\n",
"export KUBERNETES_CLUSTER_NAME=<CLUSTER_NAME>\n",
"gcloud container clusters get-credentials $KUBERNETES_CLUSTER_NAME --region $GOOGLE_CLOUD_REGION"
],
"metadata": {
"id": "W0tJ9DpImOP_"
},
"id": "W0tJ9DpImOP_",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Download the dataset from GitHub."
],
"metadata": {
"id": "I77EqyFcmcn5"
},
"id": "I77EqyFcmcn5"
},
{
"cell_type": "code",
"source": [
"%%bash\n",
"\n",
"export DATASET_PATH=https://raw.githubusercontent.com/epam/kubernetes-engine-samples/Weaviate/databases/weaviate/manifests/02-notebook/dataset.csv\n",
"curl -s -LO $DATASET_PATH"
],
"metadata": {
"id": "F8Zy_NIzmeJR"
},
"id": "F8Zy_NIzmeJR",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Create a `.env` file with the environment variables required to connect to Weaviate in the Kubernetes cluster."
],
"metadata": {
"id": "sNn04YZ-m9pZ"
},
"id": "sNn04YZ-m9pZ"
},
{
"cell_type": "code",
"source": [
"%%bash\n",
"\n",
"echo WEAVIATE_ENDPOINT=$(kubectl get svc weaviate-ilb -n weaviate --output jsonpath=\"{.status.loadBalancer.ingress[0].ip}\") > .env\n",
"echo APIKEY=$(kubectl get secret apikeys -n weaviate --template={{.data.AUTHENTICATION_APIKEY_ALLOWED_KEYS}} | base64 -d) >> .env\n",
"echo PALM_APIKEY=$(gcloud auth print-access-token) >> .env"
],
"metadata": {
"id": "uZnC1EZPm-7g"
},
"id": "uZnC1EZPm-7g",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "51247bbb-a52f-4003-9596-439f60f3b3c9",
"metadata": {
"id": "51247bbb-a52f-4003-9596-439f60f3b3c9"
},
"source": [
"Install a Weaviate client:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c77e963b-c9ea-47a9-bdbb-029372838364",
"metadata": {
"id": "c77e963b-c9ea-47a9-bdbb-029372838364"
},
"outputs": [],
"source": [
"! pip install weaviate-client python-dotenv"
]
},
{
"cell_type": "markdown",
"id": "320f0cb6-61c9-42fe-b361-ea3c92c35421",
"metadata": {
"id": "320f0cb6-61c9-42fe-b361-ea3c92c35421"
},
"source": [
"Import Python libraries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb5ca67b-607d-4b23-926a-6459ea584f45",
"metadata": {
"id": "bb5ca67b-607d-4b23-926a-6459ea584f45"
},
"outputs": [],
"source": [
"import csv\n",
"import os\n",
"\n",
"import weaviate\n",
"from dotenv import load_dotenv\n",
"from weaviate.classes.config import Configure\n",
"from weaviate.connect import ConnectionParams"
]
},
{
"cell_type": "markdown",
"id": "15f64563-f932-4a38-bd96-5b9d5cfadfd3",
"metadata": {
"id": "15f64563-f932-4a38-bd96-5b9d5cfadfd3"
},
"source": [
"Load the book records from the CSV file so that they can be inserted into a Weaviate collection:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "013284ff-e4b6-4ad7-b330-17860121c4c1",
"metadata": {
"id": "013284ff-e4b6-4ad7-b330-17860121c4c1"
},
"outputs": [],
"source": [
"with open('/content/dataset.csv') as csv_file:\n",
"    books = [*csv.DictReader(csv_file)]"
]
},
{
"cell_type": "markdown",
"id": "df7eb305-6f3e-4215-8090-71d044a302aa",
"metadata": {
"id": "df7eb305-6f3e-4215-8090-71d044a302aa"
},
"source": [
"Define a Weaviate connection. It requires an API key for authentication:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13427782-ab29-4acc-9b24-59cf1f287619",
"metadata": {
"id": "13427782-ab29-4acc-9b24-59cf1f287619"
},
"outputs": [],
"source": [
"load_dotenv()\n",
"auth_config = weaviate.auth.AuthApiKey(api_key=os.getenv(\"APIKEY\"))\n",
"client = weaviate.WeaviateClient(\n",
"    connection_params=ConnectionParams.from_params(\n",
"        http_host=os.getenv(\"WEAVIATE_ENDPOINT\"),\n",
"        http_port=8080,\n",
"        http_secure=False,\n",
"        grpc_host=os.getenv(\"WEAVIATE_ENDPOINT\"),\n",
"        grpc_port=50051,\n",
"        grpc_secure=False,\n",
"    ),\n",
"    additional_headers={\n",
"        \"X-Palm-Api-Key\": os.getenv(\"PALM_APIKEY\")\n",
"    },\n",
"    auth_client_secret=auth_config\n",
")\n",
"client.connect()"
]
},
{
"cell_type": "markdown",
"id": "69af7764-5477-4e72-b0d2-0cdcc7127443",
"metadata": {
"id": "69af7764-5477-4e72-b0d2-0cdcc7127443"
},
"source": [
"Create or recreate a collection named \"Book\". Weaviate vectorizes all book descriptions using a Vertex AI embedding model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "030e257a-b3ee-435a-9be2-64e3ed08b8cf",
"metadata": {
"id": "030e257a-b3ee-435a-9be2-64e3ed08b8cf"
},
"outputs": [],
"source": [
"if client.collections.exists(\"Book\"):\n",
"    client.collections.delete(\"Book\")\n",
"collection = client.collections.create(\n",
"    name=\"Book\",\n",
"    vectorizer_config=[\n",
"        Configure.NamedVectors.text2vec_palm(\n",
"            name=\"description_vector\",\n",
"            source_properties=[\"description\"],\n",
"            project_id=os.getenv(\"GOOGLE_CLOUD_PROJECT\"),\n",
"            model_id=\"text-embedding-005\"\n",
"        )\n",
"    ],\n",
")"
]
},
{
"cell_type": "markdown",
"id": "933e7a3d-843e-4dd1-9ab0-a4405947fc50",
"metadata": {
"id": "933e7a3d-843e-4dd1-9ab0-a4405947fc50"
},
"source": [
"Insert data into the Weaviate collection:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10ff90db-52f8-43f8-9ba2-615578b603f8",
"metadata": {
"id": "10ff90db-52f8-43f8-9ba2-615578b603f8"
},
"outputs": [],
"source": [
"with collection.batch.dynamic() as batch:\n",
"    for i, doc in enumerate(books):  # Batch import data\n",
"        print(f\"importing book: {i+1}\")\n",
"        batch.add_object(properties=doc)"
]
},
{
"cell_type": "markdown",
"id": "9d0ca596-9688-4df3-a8cc-dc384c1e5234",
"metadata": {
"id": "9d0ca596-9688-4df3-a8cc-dc384c1e5234"
},
"source": [
"Define the Weaviate query function. Weaviate converts the text query into an embedding and runs a vector search, and the function prints the results.\n",
"\n",
"It prints each result separated by a line of dashes, in the following format:\n",
"\n",
"- Title: Title of the book\n",
"- Author: Author of the book\n",
"- Publish date: Book publication date\n",
"- Description: As stored in your document's description metadata field"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c",
"metadata": {
"id": "7d1cae5f-ffa3-44ea-8b9e-fd376cdc185c"
},
"outputs": [],
"source": [
"def handle_query(query, limit):\n",
"    result = (\n",
"        collection.query.near_text(\n",
"            query=query,\n",
"            limit=limit\n",
"        )\n",
"    )\n",
"    for hit in result.objects:\n",
"        book = hit.properties\n",
"        print(\"Title: {}, Author: {}, Publish date: {}\".format(book[\"title\"], book[\"author\"], book[\"publishDate\"]))\n",
"        print(book[\"description\"])\n",
"        print(\"---------\")"
]
},
{
"cell_type": "markdown",
"id": "84d4af63-4527-4e3d-9f1b-5d0cf112caf2",
"metadata": {
"id": "84d4af63-4527-4e3d-9f1b-5d0cf112caf2"
},
"source": [
"Run the query `drama about people and unhappy love`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "365ac7bc-f06c-4db9-81f9-67354fe18e44",
"metadata": {
"id": "365ac7bc-f06c-4db9-81f9-67354fe18e44"
},
"outputs": [],
"source": [
"handle_query(\"drama about people and unhappy love\", 2)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0rc1"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 5
}