notebooks/prompt_critic.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Prompt Critic\n",
"\n",
"The `prompt_critic.ipynb` notebook is designed to critically evaluate prompt\n",
"criteria used in the context of Large Language Models (LLMs). This notebook\n",
"loads specific criteria for prompt evaluation and provides a structured critique\n",
"on each of them. The primary areas of focus include:\n",
"\n",
"- **Ambiguity**: Assessing whether the criteria are clear and unambiguous,\n",
" ensuring that there is no room for misinterpretation.\n",
"- **Feasibility**: Evaluating if the criteria are practical and achievable within the constraints of the LLM's capabilities."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import yaml\n",
"\n",
"from langchain_openai import AzureChatOpenAI\n",
"from langchain_core.messages import HumanMessage\n",
"from IPython.display import display, Markdown\n",
"\n",
"# Directory where the evaluation scenarios are stored\n",
"base_path = \"../data/claude_sonnet_3_5_20240627\"\n",
"\n",
"# Scenario to evaluate\n",
"scenario_id = \"12\"\n",
"\n",
"\n",
"def get_model():\n",
" \"\"\"Get the evaluator model.\"\"\"\n",
" # Specify configuration for the AI Dial endpoint\n",
" openai_endpoint = \"https://ai-proxy.lab.epam.com\"\n",
" openai_deploymet_name = \"gpt-4o-2024-05-13\"\n",
" openai_api_version = \"2024-05-01-preview\"\n",
"\n",
" # Read API key from the environment variables\n",
" # Putting the key inside the notebook is not secure\n",
" openai_api_key = os.environ[\"API_KEY\"]\n",
"\n",
" # Define GPT-4-omni model\n",
" model = AzureChatOpenAI(\n",
" temperature=0, # request deterministic behavior\n",
" azure_endpoint=openai_endpoint,\n",
" azure_deployment=openai_deploymet_name,\n",
" api_version=openai_api_version,\n",
" api_key=openai_api_key,\n",
" )\n",
"\n",
" return model\n",
"\n",
"\n",
"def read_file(file_path):\n",
" \"\"\"Read the content of a file and return it as a string.\"\"\"\n",
" with open(file_path, \"r\", encoding=\"utf-8\") as f:\n",
" content = f.read()\n",
" return content"
]
},
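{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before any model call, it can help to confirm that the `API_KEY` environment\n",
"variable is set. The optional check below is a minimal sketch: it fails early\n",
"with a clear message instead of a bare `KeyError` inside `get_model()` later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): fail fast if the API key is missing.\n",
"if \"API_KEY\" not in os.environ:\n",
"    raise RuntimeError(\n",
"        \"Set the API_KEY environment variable before running this notebook\"\n",
"    )"
]
},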
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"12_vanilla_to_react on piano_js\n"
]
}
],
"source": [
"# load scenario metadata and criteria\n",
"def get_metadata(base_path, scenario_id):\n",
" _metadata = read_file(os.path.join(base_path, scenario_id, \"meta.yaml\"))\n",
"\n",
" metadata = yaml.safe_load(_metadata)\n",
"\n",
" return (\n",
" metadata.get(\"metadata\", {}),\n",
" metadata.get(\"evaluation_steps\", {}).get(\"completeness\", []),\n",
" metadata.get(\"evaluation_steps\", {}).get(\"accuracy\", []),\n",
" )\n",
"\n",
"\n",
"(metadata, completeness, accuracy) = get_metadata(base_path, scenario_id)\n",
"\n",
"print(\n",
" f'{metadata[\"scenario_id\"]}_{metadata[\"scenario_name\"]} '\n",
" f'on {metadata[\"repository\"]}'\n",
")"
]
},
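{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, `get_metadata` assumes `meta.yaml` parses into roughly the\n",
"structure sketched below. This is illustrative only, reconstructed from the\n",
"keys the loader reads and the values printed above; exact field types and any\n",
"additional fields may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the expected yaml.safe_load result (not read from\n",
"# the real file; values are taken from the printed output above).\n",
"_example_meta = {\n",
"    \"metadata\": {\n",
"        \"scenario_id\": \"12\",\n",
"        \"scenario_name\": \"vanilla_to_react\",\n",
"        \"repository\": \"piano_js\",\n",
"    },\n",
"    \"evaluation_steps\": {\n",
"        \"completeness\": [\"Verify the application renders piano keys ...\"],\n",
"        \"accuracy\": [\"Ensure the application does not contain unused imports ...\"],\n",
"    },\n",
"}"
]
},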
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Completeness:\n",
"\n",
"00: Verify the application contains a button container with \"Notes\" and \"Letters\" buttons.\n",
"01: Verify the application renders piano keys for both natural and sharp notes.\n",
"02: Verify clicking on a piano key plays the correct sound and highlights the key.\n",
"03: Verify releasing the mouse button or moving the cursor out of the key stops the sound and removes the highlight.\n",
"04: Verify pressing and releasing keyboard keys play and stop the corresponding piano notes.\n",
"05: Verify clicking the \"Notes\" button displays note names on the keys.\n",
"06: Verify clicking the \"Letters\" button displays letter names on the keys.\n",
"07: Verify the fullscreen button toggles fullscreen mode for the application.\n",
"08: Verify the application correctly handles simultaneous multiple key presses.\n"
]
}
],
"source": [
"print(\"Completeness:\\n\")\n",
"for k, v in enumerate(completeness, 0):\n",
" print(f\"{k:02d}: {v}\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy:\n",
"\n",
"00: Ensure the application does not contain unused imports or code.\n",
"01: Ensure the codebase is structured and follows React best practices (state, hooks, effects).\n",
"02: Ensure the application is free of memory leaks and unnecessary re-renders.\n",
"03: Ensure the application is compatible with the latest version of React and TypeScript.\n",
"04: Verify the application works correctly across modern browsers.\n",
"05: Ensure the application is free of console errors and warnings.\n",
"06: Verify that the new React app initializes successfully without errors.\n",
"07: Verify that the codebase does not contain any TODOs.\n",
"08: Ensure the application has the same DOM tree structure and classes as in the original application.\n",
"09: Verify the application UI matches the original HTML structure visually.\n",
"10: Verify the application handles rapid sequential key presses without audio overlap issues.\n"
]
}
],
"source": [
"print(\"Accuracy:\\n\")\n",
"for k, v in enumerate(accuracy, 0):\n",
" print(f\"{k:02d}: {v}\")"
]
},
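{
"cell_type": "markdown",
"metadata": {},
"source": [
"The zero-based indices printed above are used to select a single criterion for\n",
"critique below; for example, `accuracy[10]` is the rapid-sequential-key-presses\n",
"criterion whose critique is shown in the next cell."
]
},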
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"# Verify the application handles rapid sequential key presses without audio overlap issues\n",
"\n",
"The criterion is unclear for several reasons:\n",
"\n",
"1. **Ambiguity**: The term \"audio overlap issues\" is not clearly defined. It is not specified what constitutes an \"issue\" in this context. Does it refer to audio clipping, distortion, or multiple audio tracks playing simultaneously?\n",
"2. **Assessment by LLM**: An LLM like GPT-4 can analyze code and provide insights based on patterns and best practices, but it cannot simulate rapid sequential key presses or play audio to detect overlap issues. This requires running the application in a browser and observing the behavior, which an LLM cannot do without additional tools.\n",
"3. **Specificity**: The criterion does not specify what kind of audio is being referred to (e.g., sound effects, background music) or how the application should handle rapid key presses (e.g., by queuing sounds, by ignoring subsequent presses until the current sound finishes).\n",
"\n",
"To make the criteria clearer and assessable by an LLM, consider these improvements:\n",
"\n",
"1. **Clarify the Scope**: Specify what types of audio overlap issues should be checked and how the application should handle rapid key presses.\n",
"2. **Static Analysis**: Since an LLM cannot run the code, focus on static code analysis to ensure best practices are followed to minimize the likelihood of audio overlap issues.\n",
"\n",
"Refined criterion:\n",
"\n",
"Ensure the JavaScript code includes mechanisms to prevent audio overlap issues, such as debouncing key press events or managing audio playback queues, to handle rapid sequential key presses effectively."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"CRITIC_PROMPT = \"\"\"\n",
"Assess the following criteria to determine if they are unambiguous,\n",
"straightforward, specific, and can be assessed by an LLM without requiring tools\n",
"such as a compiler, interpreter, or any other software and APIs.\n",
"\n",
"Criteria will be applied to the UI app code written in JavaScript and running in\n",
"user browser.\n",
"\n",
"Provide concise reasoning for you decision and recommendation on how to improve\n",
"the criterion. Be straightforward and clear in your assessment.\n",
"\n",
"Criterion: {evaluation_criteria}\n",
"\n",
"DO NOT assess any other criteria except the one provided.\n",
"\n",
"DO NOT recommend using external tools or APIs to assess the criterion.\n",
"\n",
"DO NOT recommend adding examples to the criterion.\n",
"\n",
"---Example 01---\n",
"\n",
"Input:\n",
"\n",
"Criterion: Ensure the application is free of console errors and warnings\n",
"\n",
"Output in Markdown format:\n",
"\n",
"# Ensure the application is free of console errors and warnings\n",
"\n",
"The criterion is unclear for several reasons:\n",
"\n",
"1. Ambiguity: The term \"console errors and warnings\" can be understood in\n",
" different ways. It might refer to errors and warnings from JavaScript code,\n",
" network issues, browser extensions, or third-party libraries.\n",
"2. Assessment by LLM: An LLM like GPT-4 can analyze code and provide insights\n",
" based on patterns and best practices, but it cannot run the code to check for\n",
" runtime errors or warnings. Detecting console errors and warnings requires\n",
" running the application in a browser and observing the console output, which\n",
" an LLM cannot do without additional tools.\n",
"3. Context Dependency: Console errors and warnings can vary depending on the\n",
" environment. Different browsers or browser versions might produce different\n",
" console outputs for the same code.\n",
"\n",
"To make the criteria clearer and assessable by an LLM, consider these\n",
"improvements:\n",
"\n",
"1. Clarify the Scope: Specify which types of console errors and warnings should\n",
" be considered.\n",
"2. Static Analysis: Since an LLM cannot run the code, focus on static code\n",
" analysis.\n",
"\n",
"Refined criterion:\n",
"\n",
"Ensure the JavaScript code follows best practices to minimize the likelihood of\n",
"runtime errors and warnings in the browser console.\n",
"\n",
" \n",
"---Example 02---\n",
"\n",
"Input:\n",
"\n",
"Criterion: Ensure the application is free of console errors and warnings\n",
"\n",
"Output in Markdown format:\n",
"\n",
"# Check that @ngrx/store is used for state management\n",
"\n",
"The criterion is clear and straightforward.\n",
"\n",
"To ensure the criterion is even more precise, consider these minor improvements:\n",
"\n",
"1. Clarify the Scope: Specify what aspects of @ngrx/store usage should be\n",
" checked.\n",
"\n",
"Refined criterion: \n",
"\n",
"Check that @ngrx/store is imported and used for state management in the\n",
"application, including the creation of actions, reducers, and selectors.\n",
"\"\"\"\n",
"\n",
"model = get_model()\n",
"\n",
"prompt = CRITIC_PROMPT.format(evaluation_criteria=accuracy[10])\n",
"\n",
"message = HumanMessage(content=prompt)\n",
"\n",
"api_response = model.invoke([message])\n",
"\n",
"display(Markdown(api_response.content))"
]
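},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above critiques a single criterion (`accuracy[10]`). To review every\n",
"criterion in the scenario, a loop like the sketch below can be used; it reuses\n",
"the `model`, `CRITIC_PROMPT`, `completeness`, and `accuracy` objects defined\n",
"earlier and issues one API call per criterion."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: critique every criterion in turn (one model call per criterion).\n",
"for criterion in completeness + accuracy:\n",
"    prompt = CRITIC_PROMPT.format(evaluation_criteria=criterion)\n",
"    api_response = model.invoke([HumanMessage(content=prompt)])\n",
"    display(Markdown(api_response.content))"
]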
}
],
"metadata": {
"kernelspec": {
"display_name": "auto_llm_eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}