Can Large Language Models (LLMs) be used to generate procedural side quests for games in a machine-readable format (JSON) by providing them with detailed game context?

Introduction

This study investigates the use of Large Language Models (LLMs) to generate game side quests in a machine-readable format (JSON). The experiment focuses on freely accessible models, including ChatGPT 3.5, Claude 2.0, and Gemini 1.0, and examines how comprehensive game context, precise instructions, and example outputs influence the accuracy and structure of the generated quests.

Background and Motivation for the Research

The gaming industry has witnessed a significant increase in games' complexity and creative scope. Large-scale titles often incorporate numerous side quests, offering players rewards and items that contribute to their primary objectives. However, the extensive nature of these games can lead to repetitive side quest design, requiring players to complete quests multiple times to collect desired rewards. This is particularly evident in massively multiplayer online role-playing games (MMORPGs), where daily quest systems often necessitate repetitive gameplay, reducing player engagement. Traditionally, these games store pre-designed quest details in a database, retrieving them when players initiate interaction. Fourth-generation Large Language Models (LLMs) demonstrate enhanced capabilities in creative story generation and structured sequence production. This study investigates the potential of GPT-3.5, Claude 2.0, and Gemini 1.0 to generate side quests in JSON format. The goal is to enable direct integration of these quests into game environments after validation, potentially mitigating repetition issues and enhancing the player experience.

Literature review

Summary of findings from research papers

Previous research on procedural quest generation using GPT language models has shown promising results. However, these studies have been limited by the capabilities of publicly available models, the need to train models on large datasets, and a focus on either quest generation or dialogue generation alone. One recent study, "Generative AI in Mafia-like Game Simulation" (Kim, M. and Kim, S., 2023), demonstrated the potential of GPT-4 for generating meaningful dialogue and interacting with players in a game environment. Another study, "Generating Role Playing Game Quest Descriptions With the GPT-2 Language Model" (Värtinen, S., 2022), showed that fine-tuning GPT-2 on a dataset of RPG quests can produce acceptable quest descriptions. "Generating Video Game Quests from Stories" (Mishra, M.K., 2023) examines the impact of incorporating knowledge-graph (KG) data on the quality and relevance of generated quests, an area that remains relatively unexplored in automated game design. "Text generation for quests in multiplayer role-playing video games" (Koomen, S.B., 2023) fine-tuned large language models for quest backstory generation, and "Player-Driven Emergence in LLM-Driven Game Narrative" (Peng et al., 2024) showed the ability to create emergent game narratives using LLMs. Other studies (Stegeren, J.V. and Myśliwiec, J., 2021; Al-Nassar, S. et al., 2023; Värtinen, S., Hämäläinen, P. and Guckelsberger, C., 2022) have focused on using GPT-2 to generate dialogue lines for quest-giver NPCs and on creating engaging quests by combining procedural content generation (PCG) with NLP using BERT and GPT-2. Overall, the existing research showcases the ability of GPT language models to generate quest narratives and suggests they could power game NPCs that generate side quests procedurally. However, more research is needed to develop and evaluate new methods for fine-tuning and using these models in game environments.

LLMs (Large Language Models)

Large language models (LLMs) are a class of statistical language models notable for their size and capabilities. LLMs leverage the transformer architecture, a neural network architecture introduced by Vaswani et al. (2017) that relies on self-attention mechanisms to process sequential data. The transformer's effectiveness in handling long-range dependencies within text sequences has been pivotal in the success of LLMs. Through extensive pre-training on massive text corpora, LLMs develop the ability to perform a wide range of natural language processing (NLP) tasks, including translation, summarization, and text generation in ways that are often indistinguishable from human-written content. Pioneering research institutions and companies such as Google AI, OpenAI, and DeepMind have been instrumental in developing and advancing LLMs.

Context Length and tokens

Within the realm of LLMs, context length plays a crucial role. Context length refers to the maximum number of tokens an LLM can process in a single input. A token, in this context, represents a unit of text, often a sub-word segment or a portion of a word. This tokenization process allows the model to handle vocabulary complexities. The context length directly influences the model's ability to understand relationships across longer spans of text. A larger context length enables the LLM to maintain a broader understanding of the input, enhancing its ability to generate more coherent and contextually relevant responses or translations.
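As a rough illustration only (not part of this study's method), a developer can estimate whether an assembled game context will fit within a model's context window before sending it. The four-characters-per-token ratio below is a common rule of thumb for English text, and the 8,000-token limit is a placeholder rather than any specific model's limit:

// Rough token estimate for English text (~4 characters per token).
// Real counts depend on the specific model's tokenizer.
function estimateTokenCount(text) {
  return Math.ceil(text.length / 4);
}

const contextLimit = 8000; // placeholder limit; varies by model
const gameContext = "Game title, environment, NPCs, items, enemy types ..."; // assembled game description
console.log(`~${estimateTokenCount(gameContext)} tokens of ${contextLimit} available`);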

Research Problem

This research centers on utilizing Large Language Models (LLMs) for generating game side quests in a machine-readable format. The experiment focuses on freely accessible models, including ChatGPT 3.5, Claude 2.0, and Gemini 1.0. While these models do not require subscriptions or fees, their usage is subject to request limits within specific timeframes. Nonetheless, these limits remain sufficient for this study. These LLMs excel in content generation, and their output accuracy can be enhanced through precise instructions. To achieve procedural side quest generation, the models will first receive comprehensive context about the game, including environment details, background narrative, NPC information, items, enemy types, and other relevant elements. Clarity and specificity in these instructions will directly influence the quality of the generated content. The study will instruct the models to respond in JSON format, a structured data format widely used in game development. Additionally, providing examples of the desired output will further improve response accuracy. The models will generate the following elements in a sequential JSON structure: quest title, giver, objective, locations, enemies encountered, collectible items, and quest rewards. Game developers can then use the generated JSON file, conduct validation checks against the game context, and directly generate side quests via an algorithm.

Increasing accuracy in responses

The method of providing instructions can vary from one LLM to another; however, instructions should always be precise and accurate. Consider a monster-hunting game. To optimize the accuracy and relevance of responses generated by the language model (LM), input instructions should prioritize specificity and consistency regarding in-game elements such as items, locations, monsters, and other pertinent details. Consider the following guidelines:

  1. Item Nomenclature: Furnish an explicit list of valid item names from which the LM can select during quest generation. This establishes a controlled vocabulary.

  2. Location Specificity: Define locations with distinct names and concise descriptions or categorical assignments (e.g., "Whispering Forest," "Abandoned Mine," "Royal Castle"). This provides contextual clarity for the LM.

  3. Monster Taxonomy: Categorize monsters according to archetypes (e.g., undead, mythical creatures, humanoids). This helps the LM associate monsters with appropriate environments and mechanics.

  4. Monster Attributes: Detail monster abilities, vulnerabilities, and other relevant characteristics using a structured format (a sketch of such a format follows this list). This allows the LM to incorporate nuanced combat encounters into quests.

  5. Terminological Consistency: Maintain uniformity in the terminology used throughout the instructions. This prevents ambiguity and ensures the LM's correct interpretation of concepts.

  6. Diverse Examples: Provide quest examples that exhibit varying degrees of complexity, encompassing different combinations of locations, monsters, and items. This demonstrates the desired range of output to the LM.

  7. Structural Definition: Explicitly specify the required JSON structure and key-value pairs to be employed in quest generation. This guarantees compatibility between the LM's output and the game's quest generation algorithms.

  8. Negative Examples: Supplementing positive examples with instances of incorrect or undesirable quest data can help the LM learn to avoid generating similar outputs.
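As an illustration of guidelines 2–4 above, the in-game elements can be supplied to the LM as structured data. All names, descriptions, and attributes below are placeholders, not the actual game content used in this study:

// Illustrative controlled vocabulary for a monster-hunting game (placeholder data).
// Categorizing monsters by archetype and listing attributes in a structured format
// helps the LM associate monsters with appropriate locations and mechanics.
const monsters = [
  { name: "ghoul", archetype: "undead", abilities: ["night vision"], vulnerabilities: ["fire"] },
  { name: "griffin", archetype: "mythical creature", abilities: ["flight"], vulnerabilities: ["crossbow bolts"] },
  { name: "bandit", archetype: "humanoid", abilities: ["ambush"], vulnerabilities: [] }
];

const locations = [
  { name: "Whispering Forest", description: "Dense woodland on the village outskirts" },
  { name: "Abandoned Mine", description: "Collapsed tunnels haunted by the undead" },
  { name: "Royal Castle", description: "Seat of the kingdom and its court" }
];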

    Providing instructions to LLM

    When giving instructions to an LLM, use a precise structure; a sketch of a fully assembled prompt follows this list.

    1. State the role of the LM and explain the task background.

      • Ex: “You are an assistant that creates side quests in a game. You need to create side quests in JSON format. Follow the same structure. You can change the number of locations, items, objectives, and monsters according to the quest.”

    2. Include the structured game details.

      1. Game Title

      2. Simple description of the environment.

      3. The surrounding area details, enemies, collectible/valuable items, and Non-Playable Characters are explained in point form.

    3. Provide necessary JSON examples.

      1. Example 1

      2. Example 2

    4. Request the task.

      • Ex: “Task: Create 2 side quests.”
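Putting the four parts together, a prompt might be assembled as follows. This is only a sketch: the game details, example quest, and task are illustrative placeholders, not the prompts used in this study.

// Illustrative prompt assembly following the structure above (placeholder content).
const role = "You are an assistant that creates side quests in a game. " +
  "You need to create side quests in JSON format and follow the same structure as the examples.";

const gameDetails = [
  "Game Title: Example Hunt",
  "Environment: A small island with a fishing village and a haunted forest.",
  "Locations: Fishing Village, Haunted Forest, Abandoned Mine",
  "Monsters: ghoul, wolf, bandit",
  "Items: silver dagger, healing herb, old coin",
  "NPCs: Mira the Herbalist, Captain Brack"
].join("\n");

const exampleQuest = JSON.stringify({
  location: "Haunted Forest",
  monster_types: "ghoul",
  monster_numbers: "3",
  name_of_item: "silver dagger",
  reward_n_amount: "healing herb, 2"
}, null, 2);

const prompt = [
  role,
  "Game details:\n" + gameDetails,
  "Example quest:\n" + exampleQuest,
  "Task: Create 2 side quests."
].join("\n\n");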

A Progressive Complexity Approach

The simplest form of quests can be generated by following these steps:

  1. Randomly selecting a location from the map.

  2. Adding random monsters to the selected location.

  3. Adding random items to the location as a reward upon quest completion.

Initially, this approach was tested by providing simple descriptions, including lists of monsters, locations, and items, to the selected large language models (LLMs). The results from all three models were 100% accurate in generating the simplest type of quests based on these inputs. Consequently, when a player engages in a conversation with a non-player character (NPC) backed by an LLM chatbot, the NPC can accurately generate these types of simple quests upon the player's request.

Example expected output:

{
"location": "random location from given list",
"monster_types": "random monster list",
"monster_numbers": "x, y, z ...", // random amount for selected monster list
"name_of_item": "random item or items from given list",
"reward_n_amount": "selected item, random number to indicate amount of the 
item"
}

During the testing process, it was observed that all three chosen LLMs were 100% accurate in identifying the point where the player requested a quest during the conversation and responding with the simplest form of quests. However, since these LLMs are designed for conversational purposes, their responses often included extraneous phrases such as "Here is your quest code," "Here is the JSON file you requested for," or "Here is JSON data." Additionally, the JSON file should not be visible to the player.

To address this issue, a middle algorithm is required to:

  1. Identify and separate the JSON file from the conversation.

  2. Identify and remove the aforementioned word phrases.

An alternative method to avoid this situation is to introduce a second bot. The first bot would function as a simple chatbot, while the second bot, also an LLM conversational bot, would observe the conversation and respond with the quest JSON file. In this approach, the game backend logic can simply extract the JSON portion of the response and ignore any accompanying explanations or recommendations.
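A minimal sketch of such a middle-layer extraction step, assuming the model's reply contains a single JSON object surrounded by conversational text (the reply string below is an illustrative example, not an actual model output):

// Extract the first JSON object from a conversational reply and drop phrases
// such as "Here is the JSON data". Assumes exactly one JSON object per reply.
function extractQuestJson(reply) {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return null; // no JSON object found in the reply
  }
  try {
    return JSON.parse(reply.slice(start, end + 1));
  } catch (err) {
    return null; // malformed JSON; the quest should be re-requested or repaired
  }
}

// Example: a typical chatty response from the quest-generation bot.
const reply = 'Here is your quest code: { "location": "Abandoned Mine", "monster_types": "ghoul", "monster_numbers": "3", "name_of_item": "silver dagger", "reward_n_amount": "silver dagger, 1" } Let me know if you need more!';
const quest = extractQuestJson(reply); // the raw JSON is never shown to the player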

While all the quests generated so far are in their simplest form, increasing the complexity of the quests can be achieved by introducing additional instructions to the LLM bot responsible for quest generation. The requested JSON quest structure can be expanded to accommodate more complex quests.

The complexity of the requested quest structure was then increased:

{
"title": "title",
"giver": "NPC from given list",
"objective": "quest objective",
"locations": [
 { "name": "location", "role": "role" },
 { "name": "location", "role": "role" },
 { "name": "location", "role": "role" },
 { "name": "location", "role": "role" },
 { "name": "location", "role": "role" }
],
"monsters": [
 { "location": "location", "type": "monster", "count": 2 },
 { "location": "location", "type": "monster", "count": 2 },
 { "location": "location", "type": "monster", "count": 1 }
],
"items": [
 { "location": "location", "name": "item" },
 { "location": "location", "name": "item" }
],
"reward": {
 "xp": "random number; replaced by the game algorithm based on player level",
 "reputation": "random string; replaced by the game algorithm",
 "item": "item"
}
}
  • Title: LLM should create any title based on the game description.

  • giver: LLM should choose the giver from the given list.

  • location: LLM should choose the location from the given list.

  • monster: LLM should choose the monster from the given list.

  • item: LLM should choose the item from the given list.

  • objective: LLM should create any objective based on the game description.

  • role: the role can be “start”, “goal”, “challenge”, or “return”; a sketch of how a game algorithm might interpret these roles follows this list.

    • Start: starting location

    • Return: ending location

    • Goal: if there is an item to collect

    • Challenge: if there are monsters to defeat
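As an illustration only, a game backend could use these role values to order a quest's steps. The quest object is assumed to follow the expanded JSON structure shown above:

// Order quest locations by role: start -> goal -> challenge -> return.
function orderQuestSteps(quest) {
  const byRole = (role) => quest.locations.filter((loc) => loc.role === role);
  return [
    ...byRole("start"),      // starting location
    ...byRole("goal"),       // locations with items to collect
    ...byRole("challenge"),  // locations with monsters to defeat
    ...byRole("return")      // ending location
  ];
}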

With this approach, all three models struggled to create quests accurately. Because of this, the complexity of the instructions was reduced by removing the instructions related to:

  • Role

  • XP

  • Reputation

Data and data descriptions

To analyze the accuracy of these models:

  • 10 game descriptions are created.

  • All 10 descriptions follow the same structure.

  • Each model was asked to generate 10 quests for each description.

  • An analysis based on automatic and manual checks was then performed on all 300 quests.

For each quest, the checks cover:

  • Whether they follow the structure.

  • Do they use strings that are not in given lists?

  • Do they have JSON syntax errors?

    • Do they have repeated keys?

    • Do they have comments?

    • Do they have trailing commas?

Data Analysis Method

To analyze the data collected from the model responses, the process is broken down into three steps.

  • Validate the JSON files.

    • For this validation, the JSON files were opened in a coding IDE (WebStorm Help, n.d.), which uses JSON Schema to validate them. Syntax errors were then fixed manually before the subsequent checks.

  • Validate the file with the given structure.

    • A simple JavaScript algorithm checks the given structure by verifying whether the required “keys” are present in the JSON file.

  • Validate both “key” and “value” strings in the files.

    • A simple JavaScript algorithm compares the values in the JSON files against the expected values.

For the JavaScript algorithm see reference (Poornajith, 2024).
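A minimal sketch of the structure and value checks described above. The actual algorithm is available in the cited reference; the required keys follow the quest structure used in this study, while the allowed lists are illustrative placeholders:

// Check that required keys are present and that string values come from the given lists.
const requiredKeys = ["title", "giver", "objective", "locations", "monsters", "items", "reward"];

const allowed = {
  givers: ["Mira the Herbalist", "Captain Brack"], // placeholder NPC list
  locations: ["Fishing Village", "Haunted Forest", "Abandoned Mine"],
  monsters: ["ghoul", "wolf", "bandit"],
  items: ["silver dagger", "healing herb", "old coin"]
};

function validateQuest(quest) {
  const errors = [];
  // Structure check: are all required keys present?
  for (const key of requiredKeys) {
    if (!(key in quest)) errors.push(`missing key: ${key}`);
  }
  // Value check: do string values come from the given lists?
  if (quest.giver && !allowed.givers.includes(quest.giver.trim())) {
    errors.push(`unknown giver: ${quest.giver}`);
  }
  for (const loc of quest.locations || []) {
    if (!allowed.locations.includes((loc.name || "").trim())) errors.push(`unknown location: ${loc.name}`);
  }
  for (const monster of quest.monsters || []) {
    if (!allowed.monsters.includes((monster.type || "").trim())) errors.push(`unknown monster: ${monster.type}`);
  }
  for (const item of quest.items || []) {
    if (!allowed.items.includes((item.name || "").trim())) errors.push(`unknown item: ${item.name}`);
  }
  return errors; // an empty array means the quest passed both checks
}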

Model performance evaluation.

The JSON responses from each model were evaluated against the following criteria:

  • Adherence to the JSON structure: Check if the responses strictly follow the provided JSON structure and schema. Ensure that all the required fields are present and properly formatted.

  • Consistency with the game setting: Evaluate whether the generated quests, locations, monsters, and items are consistent with the game idea described. Look for any inconsistencies or elements that don't fit the theme.

  • Variety and originality: Assess the variety and originality of the responses. Are the quests, locations, and descriptions sufficiently diverse, or do they feel repetitive or generic? Look for creative and unique elements that enhance the gameplay experience.

  • Integration with existing content: If the chatbot is generating content for an existing game, evaluate how well the generated quests integrate with the existing game world, lore, and mechanics.

  • Usability and compatibility: Check if the generated JSON data is usable and compatible with the game's algorithm or system for procedural quest generation.

Claude2

  • Number of quests outside the given structure: 6

  • Number of quests with comments in the JSON file: 0

  • Number of quests with trailing commas in the JSON file: 0

  • Accuracy of “key” strings = 100%

  • Accuracy of “value” strings = 74.304%

Gemini 1.0

  • Number of quests outside the given structure: 5

  • Number of quests with comments in the JSON file: 52

  • Number of quests with trailing commas in the JSON file: 15

  • Accuracy of “key” strings = 100%

  • Accuracy of “value” strings = 46.95975%

GPT 3.5

  • Number of quests outside the given structure: 2

  • Number of quests with comments in the JSON file: 0

  • Number of quests with trailing commas in the JSON file: 0

  • Accuracy of “key” strings = 100%

  • Accuracy of “value” strings = 51.606%

Challenges

This research initially envisioned utilizing ChatGPT-4, Bard, LLaMa, and GPT4all for its experimental analysis. However, the cost limitations associated with ChatGPT-4, even with a subscription, became a significant barrier. Furthermore, its full capabilities remain accessible only via an even more expensive enterprise subscription plan. Google's recent introduction of its 'Gemini' model, with capabilities exceeding ChatGPT-4 in certain domains, presented a compelling alternative. Considering Bard and LLaMa's closer alignment with third-generation LLMs, and GPT-3.5's categorization as intermediate between the third and fourth generations, this study ultimately selected GPT-3.5, Gemini 1.0, and the freely accessible Claude 2.0 for its experimentation. These models offer a strong balance of capability and accessibility. The original research design, which aimed to create NPCs capable of both quest generation and dynamic player conversation, proved overly complex due to the potential for the model to deviate from the established logical narrative. Consequently, the study pivoted to prioritize the procedural quest generation aspect. It was determined that utilizing a separate instance of the model for quest generation and leveraging a well-developed external conversational LLM would be the most effective approach, ensuring a focused analysis of procedural quest generation capabilities.

Findings

Based on the analytical findings, the model with the highest accuracy is Claude2. However, for its application in a game, an accompanying validation algorithm is necessary. This algorithm would be responsible for validating the JSON string provided by the model and replacing any incorrect string values with the correct ones before generating a playable quest. The analysis also indicates that while the model has the potential to create straightforward quests, its accuracy diminishes as the complexity of the quest increases. Given that all three models are designed as conversational models, they consistently exhibit their conversational capabilities, and each quest is accompanied by an explanation. Notably, Gemini augmented the JSON file with comments to enhance its human readability. Based on the manual analysis, Gemini exhibits superior narrative content. Its quests are not confined to a repetitive style; instead, they span a variety of areas. In contrast, GPT-3.5 adheres closely to the predefined structure, resulting in quests that appear similar due to the use of random values; these quests tend to follow the same type with identical goals. Claude, on the other hand, attempts to emulate the example structure while also incorporating narrative elements. This balance between structure and narrative highlights the potential accuracy of both Claude and Gemini, particularly when strict instruction sets are available.

Strategies for enhancing model accuracy, based on the analysis:

  • Strict Instructions: Providing clear and precise instructions is crucial. By formulating strict guidelines, we guide the models toward more accurate outputs. These instructions act as guardrails, ensuring that the generated content aligns with the desired outcome.

  • Tailoring Instructions to Specific Models: Recognizing the unique characteristics of each model allows us to tailor instructions accordingly. Different models have varying strengths and weaknesses. By adapting instructions to suit a specific model’s capabilities, we can optimize its performance.

  • Sufficient Data Feeding: Models thrive on data. Feeding them with diverse and extensive datasets enhances their understanding and ability to generalize. Sufficient data exposure helps models learn patterns, context, and nuances, ultimately contributing to improved accuracy.

Conclusion and future work.

Based on the findings from the analysis, all three selected models demonstrate the potential to generate procedural side quests in games. When sufficiently trained with a substantial dataset, they may achieve even more accurate results. Among the three models—Gemini 1.0, GPT 3.5, and Claude2—Claude2 stands out for its more accurate quest generation. However, GPT 3.5 excels in adhering to the given structure.

This study highlights the potential of the next generation of large language models (LLMs) to extend procedural generation beyond side quests. Main story quests could also be procedurally generated using LLMs. The study suggests that an LLM bot capable of observing player actions and decisions could generate executable code pieces that dynamically alter the game’s storyline based on player interactions.

Funders

No sources of funding have been specified for this Research Problem.

Conflict of interest

This Research Problem does not have any specified conflicts of interest.