# Prompt Critic

The `prompt_critic.ipynb` notebook critically evaluates the criteria used to
assess prompts for Large Language Models (LLMs). It loads the evaluation
criteria for a scenario and produces a structured critique of each one. The
primary areas of focus are:

- **Ambiguity**: Assessing whether the criteria are clear and unambiguous,
  ensuring that there is no room for misinterpretation.
- **Feasibility**: Evaluating if the criteria are practical and achievable
  within the constraints of the LLM's capabilities.

In [29]:
import os
import yaml

from langchain_openai import AzureChatOpenAI
from langchain_core.messages import HumanMessage
from IPython.display import display, Markdown

# Directory where the evaluation scenarios are stored
base_path = "../data/claude_sonnet_3_5_20240627"

# Scenario to evaluate
scenario_id = "12"


def get_model():
    """Get the evaluator model."""
    # Specify configuration for the AI Dial endpoint
    openai_endpoint = "https://ai-proxy.lab.epam.com"
    openai_deployment_name = "gpt-4o-2024-05-13"
    openai_api_version = "2024-05-01-preview"

    # Read API key from the environment variables
    # Putting the key inside the notebook is not secure
    openai_api_key = os.environ["API_KEY"]

    # Define the GPT-4o model
    model = AzureChatOpenAI(
        temperature=0,  # request deterministic behavior
        azure_endpoint=openai_endpoint,
        azure_deployment=openai_deployment_name,
        api_version=openai_api_version,
        api_key=openai_api_key,
    )

    return model


def read_file(file_path):
    """Read the content of a file and return it as a string."""
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    return content
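
Reading `os.environ["API_KEY"]` directly raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier lookup follows; the helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and are not part of the notebook.

```python
import os


def require_env(name):
    """Return a non-empty environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before starting the notebook."
        )
    return value


# Demo only: a placeholder value stands in for a real key.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = require_env("DEMO_API_KEY")
print(f"Loaded a key of length {len(demo_key)}")
```

In `get_model()`, such a helper could replace the direct `os.environ["API_KEY"]` lookup.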

In [30]:
# Load scenario metadata and criteria
def get_metadata(base_path, scenario_id):
    """Return (metadata, completeness criteria, accuracy criteria) for a scenario."""
    _metadata = read_file(os.path.join(base_path, scenario_id, "meta.yaml"))

    metadata = yaml.safe_load(_metadata)

    return (
        metadata.get("metadata", {}),
        metadata.get("evaluation_steps", {}).get("completeness", []),
        metadata.get("evaluation_steps", {}).get("accuracy", []),
    )


(metadata, completeness, accuracy) = get_metadata(base_path, scenario_id)

print(
    f'{metadata["scenario_id"]}_{metadata["scenario_name"]} '
    f'on {metadata["repository"]}'
)

12_vanilla_to_react on piano_js
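
The structure that `get_metadata` expects can be sketched with a plain dictionary standing in for the parsed `meta.yaml`. The layout below is inferred from the keys the function reads and from the printed output; the real file may contain more fields, and the helper name `split_sections` is hypothetical.

```python
# Assumed shape of the parsed meta.yaml (inferred, not copied from the file).
parsed = {
    "metadata": {
        "scenario_id": 12,
        "scenario_name": "vanilla_to_react",
        "repository": "piano_js",
    },
    "evaluation_steps": {
        "completeness": [
            "Verify the application renders piano keys for both natural and sharp notes.",
        ],
        # "accuracy" is omitted on purpose to show the fallback to an empty list.
    },
}


def split_sections(parsed):
    """Split parsed metadata into (metadata, completeness, accuracy)."""
    # `or {}` also covers a key that is present but empty (None after parsing).
    steps = parsed.get("evaluation_steps") or {}
    return (
        parsed.get("metadata") or {},
        steps.get("completeness") or [],
        steps.get("accuracy") or [],
    )


metadata, completeness, accuracy = split_sections(parsed)
print(metadata["repository"], len(completeness), accuracy)
```

The `or` fallbacks are a design choice: `dict.get(key, default)` returns `None` rather than the default when the key exists with an empty value, which would otherwise make the chained `.get` fail.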


In [31]:
print("Completeness:\n")
for k, v in enumerate(completeness, 0):
    print(f"{k:02d}: {v}")

Completeness:

00: Verify the application contains a button container with "Notes" and "Letters" buttons.
01: Verify the application renders piano keys for both natural and sharp notes.
02: Verify clicking on a piano key plays the correct sound and highlights the key.
03: Verify releasing the mouse button or moving the cursor out of the key stops the sound and removes the highlight.
04: Verify pressing and releasing keyboard keys play and stop the corresponding piano notes.
05: Verify clicking the "Notes" button displays note names on the keys.
06: Verify clicking the "Letters" button displays letter names on the keys.
07: Verify the fullscreen button toggles fullscreen mode for the application.
08: Verify the application correctly handles simultaneous multiple key presses.


In [32]:
print("Accuracy:\n")
for k, v in enumerate(accuracy, 0):
    print(f"{k:02d}: {v}")

Accuracy:

00: Ensure the application does not contain unused imports or code.
01: Ensure the codebase is structured and follows React best practices (state, hooks, effects).
02: Ensure the application is free of memory leaks and unnecessary re-renders.
03: Ensure the application is compatible with the latest version of React and TypeScript.
04: Verify the application works correctly across modern browsers.
06: Verify that the new React app initializes successfully without errors.
07: Verify that the codebase does not contain any TODOs.
08: Ensure the application has the same DOM tree structure and classes as in the original application.
09: Verify the application UI matches the original HTML structure visually.
10: Verify the application handles rapid sequential key presses without audio overlap issues.


In [34]:
CRITIC_PROMPT = """
Assess the following criteria to determine if they are unambiguous,
straightforward, specific, and can be assessed by an LLM without requiring tools
such as a compiler, interpreter, or any other software and APIs.

Criteria will be applied to the UI app code written in JavaScript and running in
the user's browser.

Provide concise reasoning for your decision and a recommendation on how to
improve the criterion. Be straightforward and clear in your assessment.

Criterion: {evaluation_criteria}

DO NOT assess any other criteria except the one provided.

DO NOT recommend using external tools or APIs to assess the criterion.

DO NOT recommend adding examples to the criterion.

---Example 01---

Input:

Criterion: Ensure the application is free of console errors and warnings

Output in Markdown format:

# Ensure the application is free of console errors and warnings

The criterion is unclear for several reasons:

1. Ambiguity: The term "console errors and warnings" can be understood in
   different ways. It might refer to errors and warnings from JavaScript code,
   network issues, browser extensions, or third-party libraries.
2. Assessment by LLM: An LLM like GPT-4 can analyze code and provide insights
   based on patterns and best practices, but it cannot run the code to check for
   runtime errors or warnings. Detecting console errors and warnings requires
   running the application in a browser and observing the console output, which
   an LLM cannot do without additional tools.
3. Context Dependency: Console errors and warnings can vary depending on the
   environment. Different browsers or browser versions might produce different
   console outputs for the same code.

To make the criteria clearer and assessable by an LLM, consider these
improvements:

1. Clarify the Scope: Specify which types of console errors and warnings should
   be considered.
2. Static Analysis: Since an LLM cannot run the code, focus on static code
   analysis.

Refined criterion:

Ensure the JavaScript code follows best practices to minimize the likelihood of
runtime errors and warnings in the browser console.

   
---Example 02---

Input:

Criterion: Check that @ngrx/store is used for state management

Output in Markdown format:

# Check that @ngrx/store is used for state management

The criterion is clear and straightforward.

To ensure the criterion is even more precise, consider these minor improvements:

1. Clarify the Scope: Specify what aspects of @ngrx/store usage should be
   checked.

Refined criterion:

Check that @ngrx/store is imported and used for state management in the
application, including the creation of actions, reducers, and selectors.
"""

model = get_model()

prompt = CRITIC_PROMPT.format(evaluation_criteria=accuracy[10])

message = HumanMessage(content=prompt)

api_response = model.invoke([message])

display(Markdown(api_response.content))

# Verify the application handles rapid sequential key presses without audio overlap issues

The criterion is unclear for several reasons:

1. **Ambiguity**: The term "audio overlap issues" is not clearly defined. It is not specified what constitutes an "issue" in this context. Does it refer to audio clipping, distortion, or multiple audio tracks playing simultaneously?
2. **Assessment by LLM**: An LLM like GPT-4 can analyze code and provide insights based on patterns and best practices, but it cannot simulate rapid sequential key presses or play audio to detect overlap issues. This requires running the application in a browser and observing the behavior, which an LLM cannot do without additional tools.
3. **Specificity**: The criterion does not specify what kind of audio is being referred to (e.g., sound effects, background music) or how the application should handle rapid key presses (e.g., by queuing sounds, by ignoring subsequent presses until the current sound finishes).

To make the criteria clearer and assessable by an LLM, consider these improvements:

1. **Clarify the Scope**: Specify what types of audio overlap issues should be checked and how the application should handle rapid key presses.
2. **Static Analysis**: Since an LLM cannot run the code, focus on static code analysis to ensure best practices are followed to minimize the likelihood of audio overlap issues.

Refined criterion:

Ensure the JavaScript code includes mechanisms to prevent audio overlap issues, such as debouncing key press events or managing audio playback queues, to handle rapid sequential key presses effectively.
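
The cell above critiques a single criterion (`accuracy[10]`). A minimal sketch of extending this to every criterion in a list follows. The names `critique_all`, `fake_ask`, and `TEMPLATE` are hypothetical; the model call is passed in as a plain callable so the loop can be tried without network access.

```python
# Stand-in for CRITIC_PROMPT; only the placeholder name matches the notebook.
TEMPLATE = "Criterion: {evaluation_criteria}"


def critique_all(criteria, ask, template=TEMPLATE):
    """Return a list of (index, criterion, critique) tuples, one per criterion."""
    results = []
    for index, criterion in enumerate(criteria):
        prompt = template.format(evaluation_criteria=criterion)
        results.append((index, criterion, ask(prompt)))
    return results


def fake_ask(prompt):
    """Offline substitute for the model call; echoes the prompt back."""
    return f"[critique of] {prompt}"


sample_criteria = [
    "Ensure the application does not contain unused imports or code.",
    "Verify that the codebase does not contain any TODOs.",
]

reviews = critique_all(sample_criteria, fake_ask)
for index, criterion, critique in reviews:
    print(f"{index:02d}: {critique}")
```

In the notebook, `ask` could be something like `lambda p: model.invoke([HumanMessage(content=p)]).content`, with `CRITIC_PROMPT` passed as `template`. Each criterion costs one model call.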