(1) Overview
Introduction
The recent NetHack Learning Environment Challenge [] established the NetHack Challenge Task within the NetHack Learning Environment (NLE) [] as a grand challenge for AI. The environment has been extended with MiniHack [], which features a suite of smaller environments and subtasks for investigating specific behaviors. MiniHack also allows for easily creating new environments using the rich features and dynamics of the NetHack game.
Learning from scratch in NetHack is extremely challenging because of the very large observation space (thousands of different entities in randomized configurations), large action space (more than 100 distinct compositional actions), and length of the game (>10k steps). During the NetHack Challenge, none of the submitted symbolic, deep RL, or hybrid agents were able to win the game. An alternative approach is to use the tremendous amount of prior knowledge available in written texts, especially the NetHackWiki.
Transfer learning has become the standard approach for solving Natural Language Processing (NLP) tasks, and has been used to achieve state-of-the-art results []. Sample efficiency can also be improved drastically with Large Language Models (LLMs) demonstrating in-context few-shot learning []. Smaller language models have also been shown to perform few-shot learning with fine-tuning []. This makes it attractive to frame a task as a language problem to leverage these models and utilize the knowledge in the pretrained model. Pretrained models have been successfully applied to Reinforcement Learning (RL) [], Interactive Decision-Making [], and Instruction Following []. However, these examples still utilize non-language elements with additional parameters that require training from scratch, precluding few-shot levels of sample efficiency.
For common sense reasoning, language models trained on natural language do not perform as well as humans []. One approach to improving the common sense reasoning capability of language models is by interacting with an environment to gather knowledge not normally written down. For example, if I put an item into a bag I can take it out again later.
The use of language representations in the observation and action space can improve generalization when compared to vector observations []. This is likely because of the compositional structure of language. Language representations can also be used to describe all observation modalities, allowing a proven language model architecture, like the transformer [], to be used without the need for bespoke multi-modal models.
In this paper, we present a wrapper for the NLE that uses language to encode the non-textual observations and similarly decodes language actions to the supported discrete actions. To summarize, the main reasons for creating a language interface are as follows:
- A pretrained language model can be directly connected with the environment, so knowledge learned during language model pretraining can transfer to the agent's task, improving sample efficiency.
- Language modeling alone does not always work for learning common sense. By embodying the agent, the language model would need to learn common sense to achieve its goals.
- By using compositional language observations and actions we enable agents capable of compositional generalization.
- Exclusively using the language modality for observations and actions allows for adding additional features, new actions, new goals, and outputting explanations without changing the model architecture.
- To facilitate further research on language agents.
Related Work
Some of the earliest work on applying deep reinforcement learning to text-based games [, ] used the Evennia Game Engine, testing on existing games or developing new ones to facilitate the research. The TextWorld environment [] is designed to test transfer learning and generalization on different language-based games. However, TextWorld lacks the complexity of real games because of its limited quest generation capability. TextWorld is also slow to run; TextWorldExpress [] entirely mitigates this limitation, but its environments are still limited in complexity. Another text environment is WordCraft [], in which the agent combines constituent entities in the presence of distractors to produce a goal entity. This tests common-sense knowledge but also intentionally limits the scope of the environment.
Implementation and architecture
To translate the NLE non-language observations to language representations and language actions to the keyboard actions of the environment we wrap the base NLE. This process is visualized in the block diagram in Figure 1. The space of language interpretations of NLE observations is large, so we define some design goals to target the key functionality required to solve the problem:
- Include key entities, e.g. the agent must know about monsters so that it can attack or escape.
- Allow navigation. To be able to navigate the agent should be able to see obstacles and find a path to its goal.
- Allow effective usage of ranged tools. Ranged weapons are obstructed by obstacles and ray weapons can be reflected. So the agent must know the position of nearby obstacles.
- Performant. The NLE environment is extremely fast making it useful for RL experiments, and we want to minimize the overhead when adding the wrapper.
Language Encoding for Visual Elements
In the NLE the screen observations include glyphs, which are integers that uniquely represent all the objects in the game. The screen is 79 × 21 for a total of 1659 glyphs. To convert this into a serialized language representation we use the subject complement grammar. We maintain an array of glyph strings, so for each integer glyph we can look up the corresponding string entity in O(1) time. The entity is used as the subject; we then add the distance and direction relative to the agent as its subject complement, forming a triple of entity, distance, and direction, e.g. {“giant ant”, “far”, “northeast”} (see for the flow). Performing this operation for each glyph yields a collection of triples. These triples go through a number of post-processing steps, shown and described in Figure 2b, to produce a language observation. The distances are quantized into buckets of adjacent, very near, near, far, and very far. The directions are also quantized into cardinal, inter-cardinal, and intermediate directions, e.g. north, northeast, and northnortheast. See Figure 3 for a visualization of how the distances and directions are calculated. A complete example observation is shown in Figure 4.
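The triple construction can be illustrated with a minimal Python sketch. It is not the wrapper's C++ implementation: the glyph table, the bucket thresholds, the use of Chebyshev distance, and the function names are all assumptions chosen for illustration. Only the bucket and direction labels come from the text.

```python
import math

# 16 compass labels, clockwise from north (cardinal, inter-cardinal, intermediate).
DIRECTIONS = [
    "north", "northnortheast", "northeast", "eastnortheast",
    "east", "eastsoutheast", "southeast", "southsoutheast",
    "south", "southsouthwest", "southwest", "westsouthwest",
    "west", "westnorthwest", "northwest", "northnorthwest",
]

# Hypothetical upper bounds (in tiles) per bucket; the real thresholds may differ.
DISTANCE_BUCKETS = [(1, "adjacent"), (3, "very near"), (8, "near"), (20, "far")]

# Hypothetical toy glyph table; the real table covers every glyph integer.
GLYPH_STRINGS = ["giant ant", "killer bee", "dagger"]


def quantize_distance(dx, dy):
    # Assumption: Chebyshev distance, since a diagonal step costs one move.
    dist = max(abs(dx), abs(dy))
    for upper, label in DISTANCE_BUCKETS:
        if dist <= upper:
            return label
    return "very far"


def quantize_direction(dx, dy):
    # Screen rows grow downward, so north corresponds to negative dy.
    angle = math.degrees(math.atan2(dx, -dy)) % 360  # 0 = north, clockwise
    return DIRECTIONS[int((angle + 11.25) // 22.5) % 16]


def encode(glyph, glyph_pos, agent_pos):
    """Return the (entity, distance, direction) triple for one glyph."""
    dx = glyph_pos[0] - agent_pos[0]
    dy = glyph_pos[1] - agent_pos[1]
    entity = GLYPH_STRINGS[glyph]  # O(1) array lookup
    return (entity, quantize_distance(dx, dy), quantize_direction(dx, dy))
```

Under these assumed thresholds, `encode(0, (50, 0), (40, 10))` yields the triple used as the example in the text: ("giant ant", "far", "northeast").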
Visual View
Despite the compact language representation, a screen of 1659 glyphs means that an exhaustive description of every glyph would be enormous. To address this we draw inspiration from the unconscious and conscious bandwidth of the brain. The eyes process orders of magnitude more information unconsciously than can be read consciously []. The limited information throughput of reading compared to visual input implies that, for the representations to be interpreted in a similar time frame (at least for a human), we should keep only the salient information and discard the rest. Discarding information in the input and reducing the environment’s observability will negatively impact the performance of an optimal agent. However, the objective of this work is to build an environment that enables the agent to find a solution, not necessarily the optimal one.
Discarding unnecessary information is achieved using Views, which capture specific glyphs based on rules. Currently, the wrapper has two views defined, the Fullscreen View and the Visual View. It is possible to add more Views by following the same design pattern.
Fullscreen View
This View captures all glyphs that are not floors, walls, or unexplored tiles. This process discards the majority of the information on the screen but retains the important items of interest, such as monsters, items, waypoints, etc. Using the Fullscreen View alone may potentially be enough to make progress on the game, but the exclusion of obstacles has disadvantages the Visual View tries to address.
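As a rough illustration of this filtering rule, the sketch below treats the screen as a grid of already-decoded entity names. The background names and the function name are hypothetical and do not reflect the wrapper's internal representation.

```python
# Hypothetical entity names standing in for floor, wall, and unexplored glyphs.
BACKGROUND = {"floor", "wall", "unexplored"}


def fullscreen_view(screen):
    """Return (entity, x, y) for every tile that is not background."""
    return [
        (entity, x, y)
        for y, row in enumerate(screen)
        for x, entity in enumerate(row)
        if entity not in BACKGROUND
    ]


screen = [
    ["wall", "wall", "wall"],
    ["floor", "giant ant", "floor"],
    ["unexplored", "floor", "dagger"],
]
items = fullscreen_view(screen)  # keeps only the giant ant and the dagger
```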
One problem with the Fullscreen View is the lack of information on obstacles. This makes navigation more difficult. Another challenge is that some actions can be impacted by obstacles, e.g. ray spells can be reflected by obstacles and hit the agent. To address this problem we try to include the key obstacles in the language observation using the Visual View. Here we simulate vision for the agent in the cardinal and inter-cardinal directions (the only directions in which the agent can directly interact). By implementing a simple ray marching system we cast a ray that stops when it encounters a glyph that is either blocking vision or cannot be perceived, which can occur as NetHack also simulates a field of view. This ray reports any glyph that is not a floor and is not already captured by the Fullscreen View (to avoid duplicates). This approach offers the agent some basic navigational cues and allows for safe usage of ray spells. To solve navigational tasks the agent will need to be more reliant on memory or apply a simpler strategy like wall following compared to a multi-modal agent using the raw NLE input.
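The ray marching idea can be sketched as follows. This is a simplified illustration on a character grid, not the wrapper's implementation: the tile symbols, the function name, and the `already_reported` parameter (standing in for positions the Fullscreen View has captured) are assumptions.

```python
# Unit steps for the eight directions the agent can interact in (x right, y down).
RAYS = {
    "north": (0, -1), "northeast": (1, -1), "east": (1, 0), "southeast": (1, 1),
    "south": (0, 1), "southwest": (-1, 1), "west": (-1, 0), "northwest": (-1, -1),
}
# Assumed toy symbols: '#' blocks vision, '.' is floor, ' ' cannot be perceived.
WALL, FLOOR, UNSEEN = "#", ".", " "


def visual_view(grid, agent_x, agent_y, already_reported=frozenset()):
    """March one ray per direction and report (tile, steps, direction)."""
    seen = []
    for name, (dx, dy) in RAYS.items():
        x, y, steps = agent_x + dx, agent_y + dy, 1
        while 0 <= y < len(grid) and 0 <= x < len(grid[y]):
            tile = grid[y][x]
            if tile == UNSEEN:
                break  # cannot be perceived: stop without reporting
            if tile != FLOOR and (x, y) not in already_reported:
                seen.append((tile, steps, name))
            if tile == WALL:
                break  # blocks vision: reported above, then the ray stops
            x, y, steps = x + dx, y + dy, steps + 1
    return seen


grid = [
    "#####",
    "#..a#",
    "#.@.#",
    "#...#",
    "#####",
]
# The monster 'a' at (3, 1) is assumed to be covered by the Fullscreen View.
obstacles = visual_view(grid, 2, 2, already_reported={(3, 1)})
```

In this toy room each of the eight rays ends at a wall two steps away, so the agent learns where the nearest obstacles lie without receiving a description of every wall tile.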
Vector observations
The vector observation, or bottom-line stats (blstats), includes useful features like hitpoints, gold, hunger, etc. To textualize these features we create pairs from one or more vector features using the template [label]:[value], e.g. the vector values for current hitpoints and maximum hitpoints become “HP: 12/24”. When possible we encode the vector value using text, e.g. hunger values 0, 1, 2 become “Satiated”, “Not Hungry” and “Hungry”, which are the same as the in-game representations of these states. An example observation for hunger is “Hunger: Satiated”.
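A minimal sketch of the [label]:[value] template is given below. The function name and its arguments are hypothetical (the real blstats vector is indexed by position and has more fields), and falling back to the raw number for hunger values not listed in the text is an assumption.

```python
# Text labels for the hunger values named in the text.
HUNGER_LABELS = {0: "Satiated", 1: "Not Hungry", 2: "Hungry"}


def textualize(hp, max_hp, hunger):
    """Render a few bottom-line stats as label: value lines."""
    lines = [
        f"HP: {hp}/{max_hp}",  # two vector features share one label
        # Assumption: values outside the table fall back to the raw number.
        f"Hunger: {HUNGER_LABELS.get(hunger, str(hunger))}",
    ]
    return "\n".join(lines)


stats_text = textualize(12, 24, 0)
```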
Language Encoding Actions
There is a unique action in NetHack assigned to each keyboard button, and additional actions are available by pressing the modifier keys Ctrl and Shift. These actions also have names consisting of one or more words like apply, north, dip, etc. With a language action space it is therefore natural to assign these words to the actions. However, because many of the actions can be composed, the meaning of a key can change, e.g. the a key refers to the “apply” action, so we map the action string “apply” to that key, but it can also refer to the first item in the agent’s inventory, so we might perform a composed action like “eat” “a”. Therefore, we also include a mapping of the action string “a” to the keyboard button a so both semantic language actions are available to the agent. Invalid actions raise a ValueError exception, which must be handled by the agent.
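The dual mapping and the error behaviour can be sketched as below. The table is a toy subset using standard NetHack key bindings, and `decode_action` is a hypothetical name rather than the wrapper's API.

```python
# Toy subset of the language-action table (the real table is much larger).
ACTION_TO_KEY = {
    "apply": "a",  # the command bound to the a key
    "a": "a",      # the same key used as an inventory letter
    "eat": "e",
    "north": "k",
}


def decode_action(action):
    """Map a language action to its key, raising ValueError if unknown."""
    try:
        return ACTION_TO_KEY[action]
    except KeyError:
        raise ValueError(f"Unknown action: {action!r}") from None


# A composed action: choose the eat command, then inventory item a.
keys = [decode_action(word) for word in ("eat", "a")]

# An agent is expected to catch the error and choose another action.
try:
    decode_action("fly")
    handled = False
except ValueError:
    handled = True
```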
Scalability
The NLE is fast compared to other environments, running at 14.4k steps per second on an Intel Core i7 2.9 GHz CPU []. Implementing this wrapper requires extensive string manipulation logic. This is compute intensive, so we implement the transformations in C++ and expose them to Python using pybind11. We compare the environment Steps Per Second (SPS) on a Ryzen 1700 CPU in Table 1, taking random actions for 10k steps and averaging over 3 runs. The results show that despite the complexity of the string manipulation, the wrapper retains 40% of the performance of the NLE (6k of 15k SPS), which is enough to run RL experiments. We have also implemented equivalent performance integration tests in the test suite to avoid regressions when refactoring or adding new features.
ENVIRONMENT | SPS |
---|---|
NLE | 15k |
NLE with Language Wrapper | 6k |
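The measurement procedure described above (random actions for 10k steps, averaged over 3 runs) can be approximated with a timing loop like the following sketch. Here `step_fn` is a hypothetical stand-in for one environment step with a random action. A real benchmark would also need to reset the environment when an episode ends, which is omitted.

```python
import time


def measure_sps(step_fn, n_steps=10_000, n_runs=3):
    """Average steps per second of step_fn over n_runs runs of n_steps."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        for _ in range(n_steps):
            step_fn()
        rates.append(n_steps / (time.perf_counter() - start))
    return sum(rates) / n_runs


def retained_fraction(wrapped_sps, base_sps):
    """Fraction of the base environment's throughput kept by the wrapper."""
    return wrapped_sps / base_sps


# With the rounded figures from Table 1: 6k / 15k = 0.4, i.e. 40%.
fraction = retained_fraction(6_000, 15_000)
```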
Experiments
To validate the wrapper, a Sample Factory [] implementation is included. This uses Asynchronous Proximal Policy Optimization (APPO) to optimize an agent online. The agent uses the Hugging Face Transformers library [] for its policy and value function model. As a baseline for this implementation, we used the nle-sample-factory-baseline. The results are shown in Table 2.
EXPERIMENT | AVG REWARD |
---|---|
sample factory baseline | 566 (16) |
language wrapper | 695 (22) |
Quality control
The wrapper includes integration tests to validate key functionality. It also includes performance tests to prevent regression during further development. All the tests are listed in Table 3. The environment has been validated at scale while training the included agent for 1 billion steps. Detailed explanations and examples of how to run and test the environment are documented in the README.
TEST NAME | RESULT |
---|---|
test_message_spell_menu | PASSED |
test_message_more_end | PASSED |
test_message_full_stop_end | PASSED |
test_message_bracket_end | PASSED |
test_message_parenthesis_end | PASSED |
test_message_multipage | PASSED |
test_message_takeoffall | PASSED |
test_filter_map_from_conduct | PASSED |
test_empty_tty_chars_returns_empty_message | PASSED |
test_filter_map_from_name | PASSED |
test_filter_map_travel | PASSED |
test_create_env_real | PASSED |
test_env_language_action_space | PASSED |
test_env_discrete_action_space | PASSED |
test_env_obsv_space | PASSED |
test_step_real | PASSED |
test_step_invalid_action | PASSED |
test_action_actions_maps_reflect_valid_actions | PASSED |
test_step_valid_action_not_supported | PASSED |
test_obsv_fake | PASSED |
test_blstats_condition_none | PASSED |
test_blstats_condition_flying | PASSED |
test_multiple_obsv_fake | PASSED |
test_step_fake | PASSED |
test_statue | PASSED |
test_warning | PASSED |
test_swallow | PASSED |
test_zap_beam | PASSED |
test_explosion | PASSED |
test_illegal_object | PASSED |
test_weapon | PASSED |
test_armour | PASSED |
test_ring | PASSED |
test_amulet | PASSED |
test_tool | PASSED |
test_food | PASSED |
test_potion | PASSED |
test_scroll | PASSED |
test_spellbook | PASSED |
test_wand | PASSED |
test_coin | PASSED |
test_gem | PASSED |
test_rock | PASSED |
test_ball | PASSED |
test_chain | PASSED |
test_venom | PASSED |
test_ridden | PASSED |
test_corpse | PASSED |
test_invisible | PASSED |
test_detected | PASSED |
test_tame | PASSED |
test_monster | PASSED |
test_plural_end_ey | PASSED |
test_plural_end_y | PASSED |
test_plural_default | PASSED |
test_plural_end_s | PASSED |
test_plural_end_f | PASSED |
test_plural_end_ff | PASSED |
test_plural_lava | PASSED |
test_wrapper_only_works_with_nle_envs | PASSED |
test_wrapper_requires_all_keys | PASSED |
test_play | PASSED |
test_time_reset | PASSED |
test_time_step | PASSED |
(2) Availability
Operating system
macOS, Linux, and Windows using WSL.
Programming language
Python 3.7 or higher.
Additional system requirements
None
Dependencies
See Table 4 for a list of the project dependencies.
COMPONENT | DEPENDENCY | FUNCTION |
---|---|---|
base | gym>=0.15, gym<=0.23 | Wrapper base class |
base | minihack>=0.1.3 | Enable wrapper for MiniHack |
base | nle==0.8.1 | Base environment |
base | pybind11>=2.9 | Implement high performance functions |
dev | black>=22.6.0 | Formatting Python |
dev | flake8>=4.0.1 | Linting Python |
dev | pytest>=7.1.2 | Test framework |
dev | pytest-cov>=3.0.0 | Test coverage |
dev | pytest-mock>=3.7.0 | Test mocks |
dev | pygame>=2.1.2 | Used for specific test |
dev | isort>=5.10.1 | Sort dependencies |
dev | numpy>=1.21.0 | Used for test framework |
agent | sample_factory>=1.121.4 | RL framework |
agent | transformers>=4.17.0 | Language model for agent |
Software location
Code repository GitHub
Name: https://github.com/ngoodger/nle-language-wrapper
DOI: https://www.doi.org/10.5281/zenodo.7456086
Licence: MIT License
Date published: 04/07/22
Language
English
(3) Reuse potential
As an interactive environment using the de facto standard OpenAI gym interface, this wrapper is directly suited to developing online RL algorithms using language models. It could also be used for offline RL, imitation learning, and decision-making research if additional work is done to record and save trajectories from a policy. Other possibilities for reuse include extending or forking the wrapper, which is fully permitted under the MIT License and may be useful if users wish to modify some or all of the functionality. Feedback or contributions are welcome and can be made by raising GitHub Issues or Pull Requests against the repository. Support can also be obtained by raising a GitHub Issue.