(1) Overview

Introduction

The recent NetHack Challenge [] established the task of playing NetHack, within the NetHack Learning Environment (NLE) [], as a grand challenge for AI. The environment has been extended with MiniHack [], which features a suite of smaller environments and subtasks for investigating specific behaviors. MiniHack also allows new environments to be created easily using the rich features and dynamics of NetHack.

Learning from scratch in NetHack is extremely challenging because of the very large observation space (thousands of different entities in randomized configurations), the large action space (more than 100 distinct compositional actions), and the length of the game (more than 10k steps). During the NetHack Challenge, none of the submitted symbolic, deep RL, or hybrid agents was able to win the game. An alternative approach is to use the tremendous amount of prior knowledge available in written texts, especially the NetHackWiki.

Transfer learning has become the standard approach for solving Natural Language Processing (NLP) tasks, and has been used to achieve state-of-the-art results []. Sample efficiency can also be improved drastically, with Large Language Models (LLMs) demonstrating in-context few-shot learning []. Smaller language models have also been shown to perform few-shot learning with fine-tuning []. This makes it attractive to frame a task as a language problem, leveraging the knowledge already present in a pretrained model. Pretrained models have been successfully applied to Reinforcement Learning (RL) [], interactive decision-making [], and instruction following []. However, these examples still use non-language components with additional parameters that must be trained from scratch, precluding few-shot levels of sample efficiency.

For common sense reasoning, language models trained on natural language alone do not perform as well as humans []. One approach to improving the common sense reasoning capability of language models is interaction with an environment, which gathers knowledge that is not normally written down; for example, that an item put into a bag can later be taken out again.

The use of language representations in the observation and action space can improve generalization when compared to vector observations [], likely because of the compositional structure of language. Language representations can also describe all observation modalities, allowing a proven language model architecture, such as the transformer [], to be used without the need for bespoke multi-modal models.

In this paper, we present a wrapper for the NLE that uses language to encode the non-textual observations and, in the reverse direction, decodes language actions into the supported discrete actions. To summarize, the main reasons for choosing a language interface are as follows:

  1. A pretrained language model can be connected directly to the environment, so knowledge learned during language model pretraining can transfer to the agent's task, improving sample efficiency.
  2. Language modeling alone does not always suffice for learning common sense. An embodied language model must learn common sense to achieve its goals.
  3. By using compositional language observations and actions we enable agents capable of compositional generalization.
  4. Exclusively using the language modality for observations and actions allows for adding additional features, new actions, new goals, and outputting explanations without changing the model architecture.
  5. To facilitate further research on language agents.

Some of the earliest work on applying deep reinforcement learning to text-based games [, ] uses the Evennia Game Engine, testing on existing games or developing new ones. The TextWorld environment [] is designed to test transfer learning and generalization across different language-based games. However, TextWorld lacks the complexity of real games because of its limited quest generation capability. TextWorld is also slow to run; this particular limitation was entirely mitigated by TextWorldExpress [], but the environments remain limited in complexity. Another text environment is WordCraft [], in which constituent entities are combined in the presence of distractors to produce a goal entity; it tests common-sense knowledge but also intentionally limits the scope of the environment.

Implementation and architecture

To translate the NLE's non-language observations into language representations, and language actions into the keyboard actions of the environment, we wrap the base NLE (a minimal usage sketch follows the design goals below). This process is visualized in the block diagram in Figure 1. The space of possible language interpretations of NLE observations is large, so we define design goals that target the key functionality required to solve the problem:

Figure 1 

NLE Language Wrapper Block Diagram.

  1. Include key entities, e.g. the agent must know about monsters so that it can attack or escape them.
  2. Allow navigation. To navigate, the agent should be able to see obstacles and find a path to its goal.
  3. Allow effective usage of ranged tools. Ranged weapons are obstructed by obstacles and ray weapons can be reflected, so the agent must know the positions of nearby obstacles.
  4. Be performant. The NLE is extremely fast, which makes it useful for RL experiments, and we want to minimize the overhead added by the wrapper.
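The sketch below shows the basic usage pattern following the repository README; treat the exact class and environment identifiers as assumptions for this sketch.

```python
# Minimal usage sketch of the wrapper over a base NLE environment.
import gym
import nle  # noqa: F401  # importing nle registers the NetHack envs with gym
from nle_language_wrapper import NLELanguageWrapper  # assumed import path

env = NLELanguageWrapper(gym.make("NetHackScore-v0"))
obsv = env.reset()
# A language action string is decoded to the corresponding keypress.
obsv, reward, done, info = env.step("north")
print(obsv)  # dict of language observations (glyphs, message, stats, ...)
```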

Language Encoding for Visual Elements

In the NLE, the screen observations include glyphs: integers that uniquely represent all the objects in the game. The screen is 79 × 21, for a total of 1659 glyphs. To convert this into a serialized language representation we use a subject-complement grammar. We maintain an array of glyph strings, so for each integer glyph we can look up the corresponding string entity with O(1) efficiency. The entity is used as the subject; we then add the distance and direction relative to the agent as its subject complement, forming a triple of entity, distance, and direction, e.g. {“giant ant”, “far”, “northeast”} (see Figure 2a for the flow). Performing this operation for each glyph yields a collection of triples. These triples go through a number of post-processing steps, shown and described in Figure 2b, to produce the language observation. The distances are quantized into the buckets adjacent, very near, near, far, and very far. The directions are quantized into cardinal, inter-cardinal, and intermediate directions, e.g. north, northeast, and northnortheast. See Figure 3 for a visualization of how the distances and directions are calculated. A complete example observation is shown in Figure 4.
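To make the quantization concrete, the following is a minimal Python sketch of the triple construction. The distance cutoffs are illustrative assumptions, not the wrapper's exact values.

```python
import math

# Illustrative distance buckets (in tiles); the wrapper's cutoffs may differ.
DISTANCE_BUCKETS = [(1, "adjacent"), (3, "very near"), (7, "near"),
                    (15, "far"), (math.inf, "very far")]

def quantize_distance(dx, dy):
    d = max(abs(dx), abs(dy))  # Chebyshev distance: a diagonal step costs 1
    for cutoff, label in DISTANCE_BUCKETS:
        if d <= cutoff:
            return label

def quantize_direction(dx, dy):
    # dx: tiles east of the agent; dy: tiles south of the agent.
    ns = "north" if dy < 0 else ("south" if dy > 0 else "")
    ew = "east" if dx > 0 else ("west" if dx < 0 else "")
    if not ns or not ew:
        return ns or ew        # exactly on a cardinal line
    if abs(dx) == abs(dy):
        return ns + ew         # exactly on an inter-cardinal diagonal
    if abs(dy) > abs(dx):
        return ns + ns + ew    # intermediate band, e.g. "northnortheast"
    return ew + ns + ew        # intermediate band, e.g. "eastnortheast"

def glyph_triple(entity, glyph_xy, agent_xy):
    dx, dy = glyph_xy[0] - agent_xy[0], glyph_xy[1] - agent_xy[1]
    return (entity, quantize_distance(dx, dy), quantize_direction(dx, dy))

# A giant ant 5 tiles east and 8 tiles north of the agent:
print(glyph_triple("giant ant", (10, 2), (5, 10)))
# -> ('giant ant', 'far', 'northnortheast')
```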

Figure 2 

These figures show the computation flow for converting glyphs to the language observation. Figure 2a shows how, for each glyph, we compute the distance, direction, and entity strings to produce a triple. Figure 2b shows the mapping from the collection of triples to the language observation, with example data on the right-hand side.

Figure 3 

In this figure, the player is represented by the @ symbol in the center. The color bands around the player represent distances, which are defined in the legend on the right. Glyphs that lie exactly on cardinal or inter-cardinal directions (e.g. north, northeast, east) are assigned that direction. Glyphs located between a cardinal and an inter-cardinal direction are assigned the intermediate direction of the band they fall within (e.g. northnortheast), regardless of their exact position within the band. Using this chart we can see how, for any glyph position, the distance and direction can be quantized to text relative to the player.

Figure 4 

The screen in Figure 4a shows the glyphs included in the language observation listed in Figure 4b. The elements and rays from the Visual View are indicated in red, and the elements from the Fullscreen View are indicated in blue.

Visual View

Despite the compact language representation, a screen of 1659 glyphs means that an exhaustive description of every glyph would be enormous. To address this we draw inspiration from the unconscious and conscious bandwidth of the brain: the eyes process orders of magnitude more information unconsciously than can be read consciously []. The limited throughput of reading compared to visual input implies that, for the representations to be interpreted in a similar time frame (at least by a human), we should keep only the salient information and discard the rest. Discarding information in the input reduces the environment's observability and will negatively impact the performance of an optimal agent. However, the objective of this work is to build an environment that enables the agent to find a solution, not necessarily the optimal one.

Discarding unnecessary information is achieved using Views, which capture specific glyphs based on rules. Currently the wrapper defines two Views, the Fullscreen View and the Visual View; more Views can be added by following the same design pattern.

Fullscreen View

This View captures all glyphs that are not floors, walls, or unexplored tiles. It discards the majority of the information on the screen but retains the important items of interest, such as monsters, items, and waypoints. The Fullscreen View alone may be enough to make progress in the game, but its exclusion of obstacles has disadvantages that the Visual View addresses.
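A hypothetical sketch of this filtering rule is shown below; the glyph predicates stand in for the wrapper's actual C++ glyph checks.

```python
def fullscreen_view(glyphs, is_floor, is_wall, is_unexplored):
    """Yield (x, y, glyph) for every item of interest on the screen."""
    for y, row in enumerate(glyphs):
        for x, glyph in enumerate(row):
            # Keep monsters, items, waypoints, etc.; drop the background.
            if not (is_floor(glyph) or is_wall(glyph) or is_unexplored(glyph)):
                yield (x, y, glyph)
```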

One problem with the Fullscreen View is the lack of information about obstacles, which makes navigation more difficult. Another is that some actions are affected by obstacles, e.g. ray spells can be reflected by obstacles and hit the agent. To address this problem we include the key obstacles in the language observation using the Visual View. Here we simulate vision for the agent in the cardinal and inter-cardinal directions (the only directions in which the agent can directly interact). Using a simple ray-marching system, we cast a ray that stops when it encounters a glyph that either blocks vision or cannot be perceived, which can occur because NetHack also simulates a field of view. The ray reports any glyph that is not a floor and is not already captured by the Fullscreen View (to avoid duplicates). This approach offers the agent some basic navigational cues and allows for safe usage of ray spells. To solve navigational tasks, the agent must rely more on memory, or apply a simpler strategy such as wall following, than a multi-modal agent using the raw NLE input.
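A minimal sketch of this ray marching follows; the predicate names are hypothetical stand-ins for the wrapper's actual C++ glyph checks.

```python
# The eight directions in which the agent can directly interact.
DIRECTIONS = {
    "north": (0, -1), "northeast": (1, -1), "east": (1, 0),
    "southeast": (1, 1), "south": (0, 1), "southwest": (-1, 1),
    "west": (-1, 0), "northwest": (-1, -1),
}

def visual_view(glyphs, agent_x, agent_y, is_floor, stops_ray, in_fullscreen):
    height, width = len(glyphs), len(glyphs[0])
    reported = []
    for direction, (dx, dy) in DIRECTIONS.items():
        x, y = agent_x + dx, agent_y + dy
        while 0 <= x < width and 0 <= y < height:
            glyph = glyphs[y][x]
            # Report non-floor glyphs not already captured by the
            # Fullscreen View, so the two Views never duplicate entries.
            if not is_floor(glyph) and not in_fullscreen(glyph):
                reported.append((glyph, direction))
            if stops_ray(glyph):  # blocks vision or cannot be perceived
                break
            x, y = x + dx, y + dy
    return reported
```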

Vector observations

The vector observation, or bottom-line stats (blstats), includes useful features such as hitpoints, gold, and hunger. To textualize these features we create pairs of one or more vector features using the template [label]: [value]; e.g. the vector values for current and maximum hitpoints become “HP: 12/24”. When possible we encode the vector value as text, e.g. hunger values 0, 1, and 2 become “Satiated”, “Not Hungry”, and “Hungry”, matching the in-game representations of these states. An example observation for hunger is “Hunger: Satiated”.
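The sketch below illustrates the template; the field selection and the mapping table are assumptions for illustration, not the wrapper's exact layout.

```python
HUNGER_STATES = {0: "Satiated", 1: "Not Hungry", 2: "Hungry"}

def textualize_blstats(hp, hp_max, gold, hunger):
    return "\n".join([
        f"HP: {hp}/{hp_max}",  # two vector features paired in one line
        f"Gold: {gold}",
        f"Hunger: {HUNGER_STATES.get(hunger, str(hunger))}",
    ])

print(textualize_blstats(12, 24, 53, 0))
# HP: 12/24
# Gold: 53
# Hunger: Satiated
```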

Language Encoding Actions

NetHack assigns a unique action to each keyboard button, with additional actions available via the Ctrl and Shift modifier keys. These actions also have names consisting of one or more words, such as apply, north, and dip, so with a language action space it is natural to assign these words to the actions. However, because many actions can be composed, the meaning of a key can change: the a key refers to the “apply” action, so we map the action string “apply” to that key, but a can also refer to the first item in the agent's inventory, as in the composed action “eat” followed by “a”. We therefore also map the action string “a” to the keyboard button a, so that both semantic language actions are available to the agent. Invalid actions raise a ValueError exception, which must be handled by the agent.
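A minimal sketch of this two-way mapping follows; the table is a tiny illustrative subset of the full action set.

```python
# Both the semantic name and the literal key resolve to the same keypress.
LANGUAGE_TO_KEY = {
    "apply": "a",
    "a": "a",        # literal key, e.g. selects the first inventory item
    "eat": "e",
    "north": "k",
}

def to_keypress(action: str) -> str:
    try:
        return LANGUAGE_TO_KEY[action]
    except KeyError:
        raise ValueError(f"Invalid action: {action!r}")  # agent must handle

# A composed action is a sequence of language actions:
keys = [to_keypress(a) for a in ("eat", "a")]  # eat the first inventory item
```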

Scalability

The NLE is fast compared to other environments, running at 14.4k steps per second on an Intel Core i7 2.9 GHz CPU []. The wrapper requires extensive string manipulation logic, which is compute-intensive, so we implement the transformations in C++ using pybind11. We compare the environment Steps Per Second (SPS) on a Ryzen 1700 CPU in Table 1 by taking random actions for 10k steps and averaging 3 runs. The results show that, despite the complexity of the string manipulation, the wrapper retains 40% of the performance of the NLE, which is sufficient for RL experiments. We also include equivalent performance tests in the test suite to avoid regressions when refactoring or adding new features.

Table 1

NLE Language Wrapper Performance.


ENVIRONMENT                  SPS
NLE                          15k
NLE with Language Wrapper    6k
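A sketch of the SPS measurement described above is shown below: random actions for 10k steps, averaged over 3 runs. The import path and the action subset are assumptions for this sketch.

```python
import random
import time
import gym
import nle  # noqa: F401  # registers the NetHack environments with gym
from nle_language_wrapper import NLELanguageWrapper  # assumed import path

ACTIONS = ["north", "south", "east", "west", "search"]  # illustrative subset

def measure_sps(num_steps=10_000):
    env = NLELanguageWrapper(gym.make("NetHackScore-v0"))
    env.reset()
    start = time.perf_counter()
    for _ in range(num_steps):
        _, _, done, _ = env.step(random.choice(ACTIONS))
        if done:
            env.reset()
    return num_steps / (time.perf_counter() - start)

print(f"Mean SPS: {sum(measure_sps() for _ in range(3)) / 3:.0f}")
```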

Experiments

To validate the wrapper, a Sample Factory [] implementation is included. This uses Asynchronous Proximal Policy Optimization (APPO) to optimize an agent online. The agent uses the Hugging Face Transformers library [] for its policy and value function model. As a baseline for this implementation we used the nle-sample-factory-baseline. The results are shown in Table 2.

Table 2

Average reward Mean (Standard Deviation) of 3 runs for 1B steps.


EXPERIMENT                 AVG REWARD
sample factory baseline    566 (16)
language wrapper           695 (22)

Quality control

The wrapper includes integration tests to validate key functionality, as well as performance tests to prevent regressions during further development. All the tests are listed in Table 3. The environment has been validated at scale by training the included agent for 1 billion steps. Detailed explanations and examples of how to run and test the environment are documented in the README.

Table 3

NLE Language Wrapper integration tests.


TEST NAME                                          RESULT
test_message_spell_menu                            PASSED
test_message_more_end                              PASSED
test_message_full_stop_end                         PASSED
test_message_bracket_end                           PASSED
test_message_parenthesis_end                       PASSED
test_message_multipage                             PASSED
test_message_takeoffall                            PASSED
test_filter_map_from_conduct                       PASSED
test_empty_tty_chars_returns_empty_message         PASSED
test_filter_map_from_name                          PASSED
test_filter_map_travel                             PASSED
test_create_env_real                               PASSED
test_env_language_action_space                     PASSED
test_env_discrete_action_space                     PASSED
test_env_obsv_space                                PASSED
test_step_real                                     PASSED
test_step_invalid_action                           PASSED
test_action_actions_maps_reflect_valid_actions     PASSED
test_step_valid_action_not_supported               PASSED
test_obsv_fake                                     PASSED
test_blstats_condition_none                        PASSED
test_blstats_condition_flying                      PASSED
test_multiple_obsv_fake                            PASSED
test_step_fake                                     PASSED
test_statue                                        PASSED
test_warning                                       PASSED
test_swallow                                       PASSED
test_zap_beam                                      PASSED
test_explosion                                     PASSED
test_illegal_object                                PASSED
test_weapon                                        PASSED
test_armour                                        PASSED
test_ring                                          PASSED
test_amulet                                        PASSED
test_tool                                          PASSED
test_food                                          PASSED
test_potion                                        PASSED
test_scroll                                        PASSED
test_spellbook                                     PASSED
test_wand                                          PASSED
test_coin                                          PASSED
test_gem                                           PASSED
test_rock                                          PASSED
test_ball                                          PASSED
test_chain                                         PASSED
test_venom                                         PASSED
test_ridden                                        PASSED
test_corpse                                        PASSED
test_invisible                                     PASSED
test_detected                                      PASSED
test_tame                                          PASSED
test_monster                                       PASSED
test_plural_end_ey                                 PASSED
test_plural_end_y                                  PASSED
test_plural_default                                PASSED
test_plural_end_s                                  PASSED
test_plural_end_f                                  PASSED
test_plural_end_ff                                 PASSED
test_plural_lava                                   PASSED
test_wrapper_only_works_with_nle_envs              PASSED
test_wrapper_requires_all_keys                     PASSED
test_play                                          PASSED
test_time_reset                                    PASSED
test_time_step                                     PASSED

(2) Availability

Operating system

macOS, Linux, and Windows (using WSL).

Programming language

Python 3.7 or higher.

Additional system requirements

None

Dependencies

See Table 4 for a list of the project dependencies.

Table 4

List of core dependencies for the wrapper. The Component column identifies what the dependency is required for, where those marked base are required to use the wrapper, dev are required for development, and agent are required to train or run the included sample factory agent. The Dependency column specifies the library name and the version. Finally, the Function column specifies the role of the library in the project.


COMPONENT   DEPENDENCY                 FUNCTION
base        gym>=0.15, gym<=0.23       Wrapper base class
base        minihack>=0.1.3            Enable wrapper for MiniHack
base        nle==0.8.1                 Base environment
base        pybind11>=2.9              Implement high-performance functions
dev         black>=22.6.0              Formatting Python
dev         flake8>=4.0.1              Linting Python
dev         pytest>=7.1.2              Test framework
dev         pytest-cov>=3.0.0          Test coverage
dev         pytest-mock>=3.7.0         Test mocks
dev         pygame>=2.1.2              Used for a specific test
dev         isort>=5.10.1              Sort imports
dev         numpy>=1.21.0              Used by the test framework
agent       sample_factory>=1.121.4    RL framework
agent       transformers>=4.17.0       Language model for agent

Software location

Code repository GitHub

Name: https://github.com/ngoodger/nle-language-wrapper

DOI: https://www.doi.org/10.5281/zenodo.7456086

Licence: MIT License

Date published: 04/07/22

Language

English

(3) Reuse potential

As an interactive environment using the de facto standard OpenAI Gym interface, this wrapper is directly suited to developing online RL algorithms with language models. It could also be used for offline RL, imitation learning, and decision-making research if additional work is done to record and save trajectories from a policy. Other possibilities for reuse include extending or forking the wrapper, which is fully permitted under the MIT License and may be useful if users wish to modify some or all of the functionality. Feedback and contributions are welcome and can be made by raising GitHub Issues or Pull Requests against the repository. Support can also be obtained by raising a GitHub Issue.