Gridsampler – A Simulation Tool to Determine the Required Sample Size for Repertory Grid Studies

The repertory grid is a psychological data collection technique that is used to elicit qualitative data in the form of attributes as well as quantitative ratings. A common approach for evaluating multiple repertory grid data is sorting the elicited bipolar attributes (so-called constructs) into mutually exclusive categories by means of content analysis. An important question when planning this type of study is determining the sample size needed to a) discover all attribute categories relevant to the field and b) yield a predefined minimal number of attributes per category. For most applied researchers who collect multiple repertory grid data, programming a numeric simulation to answer these questions is not feasible. The gridsampler software facilitates determining the required sample size by providing a GUI for conducting the necessary numerical simulations. Researchers can supply a set of parameters suitable for the specific research situation, determine the required sample size, and easily explore the effects of changes in the parameter set.


Introduction
The repertory grid interview (RGI) is a person-centered data collection method originating from the field of Personal Construct Psychology (PCP) [1]. The goal of an RGI is to gain insight into how an individual sees (a particular part of) the world. While the RGI was originally invented to be used in clinical psychology, it is now applied across a wide range of research fields, for example, market research, political research, organizational research, management research, etc. (see [2] for an overview). The RGI can be described as a semi-structured interviewing technique. In its classic form, the RGI is conducted in a face-to-face setting. However, over the last decade, several software programs have become available which allow conducting automated online RGIs without an interviewer being present [e.g., 3]. The input to an RGI is a list of objects from the field under investigation, for example, a list of car brands ("Mercedes", "Porsche", etc.) in a market research study. During the interview, subjects are prompted to talk about these objects, usually by comparing two or three of them at a time, thereby stating what makes them similar or different (see [2] for more details on the procedure). The differences and similarities identified by the subjects are recorded in the form of bipolar attributes (e.g., "good quality vs. poor quality") and are called constructs in PCP terminology (we will use the more generic term attribute in this article). In PCP, the attributes which are elicited in an RGI are hypothesized to reveal through which patterns a subject looks at the world. Knowing these patterns allows for better understanding of the subject [1]. Unlike other qualitative interviews, an RGI not only yields qualitative data (attributes) but also quantitative data. The quantitative data are rating scores of the objects on the elicited attributes (e.g., "Mercedes" scores high on "good quality"). In other words, ratings of all objects on all attributes, which represent a relevant difference or similarity to the subject, are obtained. In a standard RGI, usually 8 to 30 objects are used and 8 to 12 attributes are elicited. Figure 1 depicts the results of a small exemplary repertory grid interview from the field of market research with five objects and six attributes. Each row represents one bipolar attribute and each column represents one object. The numeric values reflect the object's ratings on the attributes. A value of 1 indicates that the left pole fully applies and a value of 5 that the right pole fully applies. It can be seen that the subject experiences the brand Toyota as having a rather boring design and Porsche and Tesla as having a good design. The ratings in Figure 1 are additionally color coded to facilitate visual discrimination of the values.
The classical use case of RGIs is in clinical diagnostics, where one is interested in the data of a single subject only. Here, for example, a therapist may use the RGI to facilitate the communication with the client or support the formulation of clinical hypotheses [e.g., 1,4]. In other areas, like market or organizational research, the use cases usually focus on more than one individual. For example, a researcher may be interested in identifying the salient topics contained in the attributes of a whole group or population [e.g., 5,6]. The software presented in this article is aimed at the latter scenario. The vantage point for this scenario is a list of elicited attributes from a set of collected RGIs. To identify relevant topics in this list of attributes, content analysis is a commonly used methodological approach [7,8]. The goal of content analysis is to group attributes with similar meaning into the same category. The categories themselves can either be predefined or be inductively derived from the attributes themselves [7,8]. As a result of the content analysis procedure, each attribute is assigned to one of the mutually exclusive categories. For example, the category "high vs. low quality" may contain attributes like "reliable quality vs. makes a lot of trouble", "quality car vs. poor quality car", etc. Each category reflects one topic that is present in the attribute data (the terms category and topic are mostly interchangeable). Especially in consumer and market research, this type of analysis is frequently applied to identify relevant key topics of a market segment [e.g., 3,5,6]. Table 1 shows the results for the top 8 categories from a content analysis. The columns show the names of the categories, the number of attributes, and a few sample attributes assigned to each category.
When conducting a study with multiple repertory grids followed by content analysis, two goals with regard to the collected data are typical: a) identifying all (or the 90% most frequent) topics which play a role in the subjects' perception of the field under investigation, that is, the most frequent descriptive categories used by subjects when they talk or think about a certain set of objects; b) eliciting a minimal number of attributes per category, for example, to fulfill the requirements for a follow-up statistical analysis. To achieve these goals, it is essential to choose an appropriate sample size during the planning phase of the study. In statistical analysis, calculating the required sample size is often a standard procedure once parameters like the alpha and beta error have been set and the (usually unknown) effect size has been hypothesized. For the subsequent calculation, specialized computer programs are readily available [9]. However, in qualitative research settings, for example using RGIs, the situation is less well defined as most parameters are unknown, making the estimation of a required sample size difficult. Also, no standard method or procedure seems to be available. As a consequence, a priori determination of the required sample size is rarely done in the literature. In practice, several (auxiliary) approaches are used. To address goal a), additional RGIs are conducted until a saturation of categories occurs, i.e., until an additional RGI no longer elicits attributes that do not fit into any of the existing categories [10,11]. Alternatively, a simple rule of thumb is applied, suggesting around 15 to 25 RGIs to reach category saturation [e.g., 12]. However, relying on sampling additional RGIs until saturation occurs may not be feasible when interviews have to be scheduled a long time in advance, or when resources are known to be limited. To formally address goal b), to the best of our knowledge, no systematic approach has been applied in the literature in conjunction with repertory grid studies.

Objectives
Currently, there is no software available which supports the determination of the required sample size for repertory grid studies. The software gridsampler is built to fill that void, allowing the researcher to explore the effects on the required sample size when input parameters are systematically varied according to different research goals or study settings (see section User Interface for details on the parameters that can be set). Three typical sample scenarios, for which the software is useful, are the following: 1) For a marketing study, a researcher wants to recover the majority (e.g., 80%) of the topics (represented by the categories) which are relevant to the target population. Additionally, to improve interpretability of each topic, the goal is to obtain at least three attributes per category. 2) For a usability study, a researcher wants to discover all usability problems (i.e., 100%) that the target customers experience. Here, it suffices if each usability problem is mentioned once. 3) A researcher wants to examine if the attribute counts per category differ between two groups, for example, women and men. For this purpose, she wants to make sure that the number of attributes in the most frequent categories is sufficiently high (e.g., at least 5) to not violate the requirements for subsequent statistical tests.
All scenarios above can be transformed into the following generic question format, which the software helps to answer: "What is the probability of obtaining a result where at least C percent of the categories contain a minimum of M attributes when using the sample size N?" In all of the above scenarios, choosing a too small sample size may have practical consequences, for example, missing a business opportunity. Hence, the researcher will be interested in estimating the probability for achieving the goal of the study, given a certain sample size. In order to enable the calculation of a required sample size for a particular study setting, a conceptual framework of how the attribute data is generated must be defined. Within this framework, several parameters that can be adjusted according to different research scenarios play a role: a) the expected distribution of the number of elicited attributes, b) the expected distribution of the category counts, and c) the number of subjects sampled in the study. For each set of input parameters, the gridsampler software runs simulations which are suited to answer the generic question above. By reviewing the resulting probabilities for different parameter sets, researchers can make an informed choice with regard to an appropriate sample size. The presented simulation approach by itself is computationally straightforward. However, programming a numerical analysis is often not feasible for many applied researchers from the fields where repertory grids are used.

Table 1
No. | Topic/Category Name | Attribute Count | Sample attributes assigned to category
1 | High quality - low quality | 30 | "reliable performance vs. makes a lot of trouble"; "quality car vs. poor quality car"; "produced to last long vs. frequent repairs"
2 | Good safety features - low safety | 24 | "sets safety standards vs. no really safe car"; "gives a feeling of protection vs. not feeling safe"
3 | Good design - bad design | 18 | "aesthetic proportions vs. ugly"; "modern design - old fashioned look"; "looks special - ordinary design"
4 | Good price - overpriced | 14 | "expensive - reasonably priced"; "good value for money - little value for money"
5 | Technically advanced vs. technically behind | 12 | "technically advanced vs. poor engineering"; "highly advanced vs. low technology level"
6 | Eco-friendliness vs. non-eco-friendly | 9 | "eco-friendly - not eco-friendly"; "low fuel consumption vs. unacceptably high fuel consumption"
7 | Conveys status vs. no status car | 7 | "high-status car - embarrassing to drive"; "demonstrate personal value - nobody takes notice"
8 | Sufficient space vs. too small | 7 | "spacious interior vs. feeling boxed in"; "enough space for my needs vs. problems transporting things"

In the next section, we will outline the data generation steps and describe the parameters that can be adjusted at each step.

Implementation and architecture

User interface
The gridsampler GUI is displayed in Figure 2. It consists of three panels where different study parameters can be set (panels 1 and 2) and the simulation can be started (panel 3). In the following, the three panels along with the underlying data generating process are described in detail.

Number of attributes per RGI
When conducting an RGI, the number of attributes elicited per interview usually varies across subjects and different fields of study. For some fields of study (e.g., interpersonal relations), subjects may on average possess more attributes than for other fields (e.g., yogurt brands). Also, different subjects tend to differ with regard to the number of attributes they have available to describe the objects under investigation. For example, one subject might mention six attributes and another subject ten. Furthermore, in some research settings, the number of elicited attributes may be fixed by design or limited by time constraints of the participants (e.g., for high-level executives). Hence, depending on the specific research setting, different distributions of the number of elicited attributes per interview can be expected. The software allows defining an expected distribution for the number of elicited attributes. To simulate the number of attributes, a sample is drawn from this distribution where each random value represents the number of attributes in a single RGI. To specify the distribution, the minimum and maximum number of attributes and the probability for each number of attributes to occur can be adjusted interactively. For convenience, the "Probability Presets" section at the bottom of panel 1 allows choosing a predefined distribution. Note that we use continuous distributions (e.g., the normal distribution) in the presets, as we assume that most readers will be familiar with their shapes. However, we discretize the distribution afterward, as the number of attributes is a discrete variable. In most research cases, the attribute distribution will have a bell-shaped form, with most subjects expressing an average number of attributes and low and high numbers being less probable. A typical distribution is shown in Figure 2.
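The discretization step can be illustrated with a minimal Python sketch. It is not the package's own R code: the function names are hypothetical, and the binning rule (probability mass of the interval from n - 0.5 to n + 0.5, renormalized over the allowed range) is an assumption, since the text does not state how the presets are discretized.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def discretize_normal(min_attr, max_attr, mean, sd):
    """Return {n: probability} for n = min_attr..max_attr.

    Assumed rule: each integer n receives the normal probability mass of
    the interval [n - 0.5, n + 0.5); the masses are then rescaled so that
    they sum to 1 over the allowed range.
    """
    raw = {
        n: normal_cdf(n + 0.5, mean, sd) - normal_cdf(n - 0.5, mean, sd)
        for n in range(min_attr, max_attr + 1)
    }
    total = sum(raw.values())
    return {n: p / total for n, p in raw.items()}

# Illustrative values only: between 4 and 8 attributes, centered on 6
attribute_probs = discretize_normal(4, 8, mean=6, sd=1)
for n, p in attribute_probs.items():
    print(n, round(p, 3))
```

The result is a bell-shaped discrete distribution over the allowed attribute counts, which can then be adjusted by hand in the same way the GUI allows.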

Category elicitation probabilities
In panel 2, the number of categories (i.e., topics) that are assumed to exist in the target population can be set. Additionally, the probability for an attribute from each category to be elicited can be adjusted interactively. For each interview, the attributes are sampled according to this probability distribution. This model assumes that the individual interview results are produced by an underlying population distribution. The repertory grid procedure requires that identical or highly similar attributes are not elicited twice, such that an attribute from each category will only occur once per interview [2]. Hence, sampling the attributes is a process that is not stochastically independent, as sampling without replacement is applied. Like the number of attributes generated per subject, the probabilities and the total number of categories are usually unknown to the researcher. However, based on previous research literature, reasonable assumptions can often be made. In most research using content analysis of attributes, the empirical distribution of category counts follows approximately an exponential distribution [8,13,14]. A typical distribution is shown in Figure 2.
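As a hedged illustration of this step, the following Python sketch builds an exponentially decreasing category distribution and simulates the categories drawn in a single interview. The function names, the decay rate, and the number of drawn attributes are assumptions chosen for illustration; the sketch is not taken from the gridsampler source.

```python
import math
import random

def exponential_category_probs(n_categories, rate):
    """Probabilities proportional to exp(-rate * k) for k = 0..n_categories-1."""
    weights = [math.exp(-rate * k) for k in range(n_categories)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_interview(category_probs, n_attributes, rng):
    """Draw n_attributes distinct categories, one at a time.

    At each draw, a category is chosen with probability proportional to its
    weight among the categories not yet drawn (sampling without replacement),
    so every category occurs at most once per interview.
    """
    if n_attributes > len(category_probs):
        raise ValueError("cannot draw more attributes than categories")
    remaining = list(range(len(category_probs)))
    weights = list(category_probs)
    drawn = []
    for _ in range(n_attributes):
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        drawn.append(remaining.pop(pos))
        weights.pop(pos)
    return drawn

rng = random.Random(1)  # fixed seed for reproducibility
category_probs = exponential_category_probs(20, rate=0.15)  # assumed decay rate
interview = sample_interview(category_probs, 6, rng)
print(interview)  # indices of the categories elicited in one simulated RGI
```

Because drawn categories are removed before the next draw, frequent categories appear in most interviews while rare ones surface only occasionally.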

Methodology
To model the elicitation process of attributes in RGIs, a two-stage sampling approach is applied. At the first stage, a discrete probability distribution for the number of attributes drawn per interview is defined. A single interview is simulated by randomly drawing one time from the first distribution with replacement. This value represents the number of attributes (A) sampled for subject S. At the second stage, weighted sampling without replacement from a finite distribution is applied. The number of elements sampled in the second stage is given by the realization of the random variable from the first stage (i.e., value A). Drawing a fixed number of elements from a finite population without replacement, where a) the elements have unequal probabilities of being drawn, and b) the drawing of one element affects the probability of the remaining elements, leads to the multivariate Wallenius' noncentral hypergeometric distribution (MWNHD, [15]). In our case, however, the number of attributes drawn is not fixed but is itself a random variable, making the situation more complex. Hence, the resulting distribution is a compound distribution with the number of attributes being a random variable used as a hyper-parameter in the second sampling stage. As neither distribution necessarily follows a standard form, the resulting compound distribution is not trivial. For this reason, we apply a simulation approach to approximate the expected mean count per category and the corresponding quantiles of the resulting compound distribution.
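The two-stage procedure can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not the package's R implementation: all function names are hypothetical, the attribute and category distributions are made-up example values, and the quantiles are simple order statistics of the simulated counts.

```python
import random

def draw_categories(weights, n_draws, rng):
    """Stage 2: weighted sampling of distinct categories without replacement."""
    remaining = list(range(len(weights)))
    w = list(weights)
    drawn = []
    for _ in range(n_draws):
        pos = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        drawn.append(remaining.pop(pos))
        w.pop(pos)
    return drawn

def simulate_sample(attr_probs, cat_weights, n_interviews, rng):
    """Simulate one study and return the attribute count per category."""
    sizes = list(attr_probs)
    size_weights = [attr_probs[s] for s in sizes]
    counts = [0] * len(cat_weights)
    for _ in range(n_interviews):
        # Stage 1: number of attributes A for this interview
        a = rng.choices(sizes, weights=size_weights, k=1)[0]
        # Stage 2: which categories the A attributes fall into
        for c in draw_categories(cat_weights, a, rng):
            counts[c] += 1
    return counts

def summarize(runs, lower=0.05, upper=0.95):
    """Mean and simple empirical quantiles of the count in each category."""
    n_runs = len(runs)
    summary = []
    for k in range(len(runs[0])):
        values = sorted(run[k] for run in runs)
        mean = sum(values) / n_runs
        q_low = values[int(lower * (n_runs - 1))]
        q_high = values[int(upper * (n_runs - 1))]
        summary.append((mean, q_low, q_high))
    return summary

rng = random.Random(42)
attr_probs = {4: 0.1, 5: 0.2, 6: 0.4, 7: 0.2, 8: 0.1}  # assumed example values
cat_weights = [0.85 ** k for k in range(20)]           # assumed decay
runs = [simulate_sample(attr_probs, cat_weights, 30, rng) for _ in range(200)]
summary = summarize(runs)
for k, (mean, q_low, q_high) in enumerate(summary[:3], start=1):
    print(f"category {k}: mean={mean:.1f}, 5%={q_low}, 95%={q_high}")
```

Repeating the simulated study many times replaces the analytically awkward compound distribution with an empirical one from which means and quantiles can be read off directly.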

Documentation
The software is hosted on GitHub. Documentation of the software is provided on GitHub Pages (http://markheckmann.github.io/gridsampler) and on the "About" tab in the software itself (see Figure 2). Also, an interactive tour introducing the main GUI components is available by clicking on "Tour" in the navigation bar. Additionally, a tooltip is displayed when hovering over one of the six action buttons, with a brief description of what action is prompted.

Architecture
The software is written in R and implements a browser-based UI based on the shiny package [16]. The simulations are performed using functions contained in the standard R distribution. For the reorganization of the simulation results and the display of the data, several additional packages publicly available on the CRAN server (http://cran.r-project.org) are used (see section Dependencies). The software is installed and loaded by typing the commands install.packages("gridsampler") and library(gridsampler) into the R console. The GUI is opened by typing gridsampler(). For users who are not familiar with R or cannot install R on their system, a web version is available under http://gridsampler.openrepgrid.org.

Future developments
Currently, the software supports the a priori estimation of the required sample size. Another important class of questions concerns the post-hoc analysis of the results. After having conducted a study which yields a distribution of category counts, the top 3 to 10 categories often receive special attention, for example, by basing action plans on the most frequent topics. However, as the distribution of category counts is the result of a sampling process, it contains sampling error. In other words, the order of the top 10 categories might have been different for a different sample. In a future release, we plan to include features which allow the examination and visualization of this and related types of statistical uncertainty.

Quality control
For bug tracking and version control, git is used as provided by GitHub (see section Availability). The GitHub issue tracker allows for easy reporting of bugs, suggestions for enhancements and feature requests. It is meant to be the primary mode of user feedback. On the "About" tab in our software, a link to the GitHub issue page is provided to quickly route users.
To ensure that the sampling procedure works correctly, the simulation results are checked against theoretical results for specific cases where the resulting distribution is known. As outlined in section Methodology, the resulting theoretical distribution is a multivariate Wallenius' noncentral hypergeometric distribution when the number of sampled attributes is kept constant for all RGIs. In the tests, we check if the simulation results converge towards this distribution for a high number of draws. The tests are run multiple times, and each test uses a fixed but random number of attributes and random values for the category distribution. As a result, a different special case is checked in each test run. These functional tests are run automatically when rebuilding the package. The package is automatically rebuilt via Travis CI (https://travis-ci.org) each time the code in the GitHub repository is modified.
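The idea behind such a check can be sketched in Python for a small special case. The sketch does not reproduce the package's test suite and does not use a closed-form density: as an assumption made for illustration, the exact reference probabilities are obtained by enumerating every possible draw sequence of the sequential urn process (each category present once, fixed number of draws), and the simulated inclusion frequencies are compared against them. All names and values are hypothetical.

```python
import itertools
import random

def exact_inclusion_probs(weights, n_draws):
    """Exact probability that each category is among the n_draws drawn.

    Enumerates all ordered draw sequences; a sequence's probability is the
    product of weight / (total weight still in the urn) over its draws.
    """
    n_cat = len(weights)
    probs = [0.0] * n_cat
    for seq in itertools.permutations(range(n_cat), n_draws):
        p = 1.0
        remaining = sum(weights)
        for c in seq:
            p *= weights[c] / remaining
            remaining -= weights[c]
        for c in seq:
            probs[c] += p
    return probs

def simulated_inclusion_probs(weights, n_draws, n_runs, rng):
    """Relative frequency with which each category is drawn in simulation."""
    n_cat = len(weights)
    hits = [0] * n_cat
    for _ in range(n_runs):
        remaining = list(range(n_cat))
        w = list(weights)
        for _ in range(n_draws):
            pos = rng.choices(range(len(remaining)), weights=w, k=1)[0]
            hits[remaining.pop(pos)] += 1
            w.pop(pos)
    return [h / n_runs for h in hits]

weights = [0.4, 0.3, 0.15, 0.1, 0.05]  # assumed category probabilities
n_draws = 3                            # fixed number of attributes per RGI
exact = exact_inclusion_probs(weights, n_draws)
simulated = simulated_inclusion_probs(weights, n_draws, 20000, random.Random(7))
max_diff = max(abs(e - s) for e, s in zip(exact, simulated))
print("largest deviation between exact and simulated:", round(max_diff, 4))
```

Enumeration is only feasible for a handful of categories, which is sufficient for a functional test: if the sampler were biased, the deviation would not shrink as the number of simulated interviews grows.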

Operating system
The software runs on all operating systems that support a standard R installation. This includes the three major operating systems macOS, Windows, and Linux.

Additional system requirements
A browser must be installed on the system with JavaScript enabled.

(3) Reuse potential
The software can be used to explore and determine the required sample size during the design phase of any study that employs the repertory grid in conjunction with content analysis. This is a very common research scenario [e.g., 5,10,11,13,14], for which currently no software is available. While gridsampler was designed based on the authors' experience with repertory grid studies in particular, the software is, however, generic. gridsampler can also be used in studies which apply different attribute elicitation methods, for example, free choice profiling [17], sentence completion tasks [18], or the flash profile [19]. Hence, the software has potential to be used across different fields which extend beyond Personal Construct Psychology and psychology in general. Researchers interested in extending the software or fitting it to their own field-specific requirements may contact the authors or directly fork the GitHub repository and submit a pull request after additional features have been added.

Use case
In the following, a use case from the field of organizational performance evaluation is outlined [20]. The study was conducted in 2011. The supervisory board of an institution which was in charge of the economic and cultural development of a region in central Germany was unsatisfied with the institution's performance. While each executive had their own view on where the deficits lie, it was unclear how the performance was perceived by other stakeholders and what they would consider as problematic aspects. For this reason, a study was conducted with the goal to identify the central topics stakeholders use to evaluate the organization's performance and to assess how well the organization performs with regard to these criteria. Also, the board wanted to know to what extent the two main groups of stakeholders, regional politicians and company executives, differ in their perception of the institution. To answer these questions, the repertory grid method was considered an appropriate choice, as it allows collecting quantitative evaluations of those performance criteria which each stakeholder considers important. In order to find out which evaluative topics are relevant to all stakeholders and how the two groups differ, it was necessary to elicit an exhaustive list of evaluation criteria (i.e., attributes) with a sufficient number of attributes within each category (i.e., topic) to facilitate category interpretation and allow for a statistical follow-up analysis (e.g., group comparisons). For these purposes, a minimum of five attributes per category was considered sufficient by the researchers. This scenario corresponds to research goal b) described in the introduction. In order to estimate the required sample size, first, the number of expected attributes per RGI had to be defined. Most interviewees were high-level executives with limited time resources, so the maximum time they would spend on an interview was between 45 and 60 minutes. For this reason, the number of elicited attributes (i.e., performance criteria) per RGI was estimated to be quite low, ranging between 4 and 8. The settings for the distribution of the "Number of Attributes per RGI" in panel 1 in Figure 2 were set according to these expectations. It was expected that, on average, an RGI yields 6 attributes, with lower and higher numbers being less probable. The second parameter required for the sample size estimation is the distribution of the categories. From previous experience, it was estimated that approximately 20 topics would be relevant for the performance evaluation, with the topics exponentially decreasing in frequency. The distribution for the "Probability of Categories" (panel 2 in Figure 2) was set accordingly. Additionally, the setting "Minimum Count (M)" of attributes per category includes the value 5, which corresponds to the value defined as sufficient (see above). Figure 2 contains the complete settings for the simulation of the required sample size for the described study scenario.
The results of the simulation are shown in the lower part of panel 3 in Figure 2. A magnified version of the panel is shown in Figure 3. The y-axis shows the probability for obtaining a minimal number of attributes per category (M) for a specific proportion of the categories (coverage C) given a certain sample size (N). For our settings, it can be seen that for N = 70 interviews the probability of recovering all (100%) of the categories (i.e., coverage C = 1) with each containing at least five attributes is around .85 (see green line in rightmost graphic). The probability of recovering 95% of the categories (i.e., coverage C = .95), each containing at least five attributes, is already very close to 1.0 for N = 70. In other words, the probability for fulfilling the goals of the study, i.e., recovering most of the relevant categories used for performance evaluation and obtaining at least five attributes per category, was very high with N = 70 interviews. Even if some categories (e.g., 5%) would contain slightly less than five attributes, this would not have severe consequences for the follow-up analyses.
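The probabilities plotted in the results panel can be approximated along the following lines. The Python sketch below is an illustration only: the function names are hypothetical, and the attribute and category distributions are assumed stand-ins for the settings in Figure 2, so the numbers it produces should not be expected to match the reported curves.

```python
import math
import random

def simulate_counts(attr_probs, cat_weights, n_interviews, rng):
    """Simulate one study with n_interviews RGIs; return counts per category."""
    sizes = list(attr_probs)
    size_weights = [attr_probs[s] for s in sizes]
    counts = [0] * len(cat_weights)
    for _ in range(n_interviews):
        n_attr = rng.choices(sizes, weights=size_weights, k=1)[0]
        remaining = list(range(len(cat_weights)))
        w = list(cat_weights)
        for _ in range(n_attr):
            pos = rng.choices(range(len(remaining)), weights=w, k=1)[0]
            counts[remaining.pop(pos)] += 1
            w.pop(pos)
    return counts

def prob_goal_met(runs, min_count, coverage):
    """Share of simulated studies in which at least `coverage` of all
    categories contain at least `min_count` attributes."""
    n_cat = len(runs[0])
    needed = math.ceil(coverage * n_cat - 1e-9)  # categories that must qualify
    successes = sum(
        1 for counts in runs
        if sum(c >= min_count for c in counts) >= needed
    )
    return successes / len(runs)

rng = random.Random(2011)
attr_probs = {4: 0.1, 5: 0.2, 6: 0.4, 7: 0.2, 8: 0.1}  # assumed: 4 to 8, mean 6
cat_weights = [0.85 ** k for k in range(20)]           # assumed: 20 categories
results = {}
for n in (20, 40, 70):
    runs = [simulate_counts(attr_probs, cat_weights, n, rng) for _ in range(200)]
    results[n] = {c: prob_goal_met(runs, min_count=5, coverage=c)
                  for c in (0.95, 1.0)}
    print(n, results[n])
```

Each entry of `results` answers the generic question for one combination of N, M, and C; evaluating a grid of sample sizes yields curves of the kind shown in Figure 3.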
The settings used in the simulation were deemed realistic by the authors. However, to safeguard the study against unexpected results, the authors also construed a worst-case scenario. While it was considered improbable that there would be more than 20 categories, which would in turn lower the number of attributes per category, there was a substantial risk that fewer attributes per interview might be collected. The interview time would mostly be limited to one hour, and it may happen that some subjects have problems understanding the technique or that too much time is spent on explaining the study goals before starting with the core RGI procedure. As a result, the average number of attributes may turn out to be lower than expected. Figure 4 shows the results for the worst-case setting. Here, the number of attributes varies between 4 and 6 with 5 attributes on average. In this scenario, the probability of recovering all of the categories (C = 1.0), with each containing at least M = 5 attributes, is approximately 50%. However, the probability of recovering 95% (C = .95) of the categories, with each containing at least 5 attributes, is still around 90%. Even in this case, only a minority of categories would contain fewer than 5 attributes, which was still considered acceptable for analysis purposes. For these reasons, N = 70 interviews were considered sufficient for the purpose of the study.
Generally speaking, there is no golden rule for how to decide which sample size is adequate. It depends on what the study goal requires. The results of the simulation are curves which merely indicate the probability of obtaining a certain result given a set of assumptions. From these probability results, the users must infer themselves what can be considered sufficient. For example, if constraints on the supervisory board's budget had only allowed for conducting 40 interviews, the situation would have been different. In this case, it would have been clear that we might not have been able to recover enough attributes per category to allow for solid statistical follow-up analyses and would have needed to communicate this to the client.
To outline how the software can be used for other scenarios, we will assume that the study goal was a different one. In several studies, RGIs are used to collect a comprehensive list of attributes at a first stage, which is used for designing the questions for a quantitative study at the second stage [e.g., 21,22]. If this had been the goal, it would have been sufficient to elicit each category at least once. In this case, we could have set a minimal required count of M = 1 attribute in the simulation settings in the lower part of panel 3. This setting corresponds to the sample use case 2 described in section Objectives above. By pressing Redraw with New Settings with a "Minimum Count" of M = 1 and a "Coverage" of C = .95, 1, Figure 5 is obtained. In the rightmost plot, it can be seen that in order to recover 100% of the categories (i.e., C = 1) with a probability close to 1.0, N = 40 interviews are required. If the researcher is already satisfied when recovering 95% of the categories (C = .95), N = 20 to 30 RGIs would suffice to yield a probability close to 1.0. Note that the first value (N = 40) is different from the rule of thumb (N between 15 and 25) given in the literature [10,11,12]. This again shows the necessity of running simulations before conducting multiple grid studies in conjunction with content analysis.
The quality of the simulation results stands or falls by the accuracy of the assumptions they are based on. While it is possible to make reasonable choices for the distributions specified in panels 1 and 2 (see Figure 2), we must be aware that these choices involve error and that the drawn conclusions are only approximate. For this reason, it is advisable to explore the effects of several sets of assumptions before making a final choice about the sample size. In our experience, it is good practice to build a worst, normal and best case scenario, and compare the results. The facilitation of this exploration process makes gridsampler a valuable tool in all research designs which use attribute elicitation followed by content analysis.