(1) Overview

Introduction

There is a paradigm shift in the collection of data for medical research [, ]. As manual data collection through case report forms is expensive and time-consuming [], there is substantial interest in real world data (RWD), which is a rich source of information for scientific research. RWD is data on patient health status and/or delivery of healthcare that is routinely collected during care []. Using RWD for trials can reduce costs, but there are also scientific benefits, for example new data collection options and the possibility to answer research questions for which a trial would otherwise not be feasible []. An important challenge for RWD, however, is the management, processing and merging of large and divergent datasets [].

Castor Electronic Data Capture (EDC) is a cloud-based clinical data management platform allowing researchers to securely and efficiently manage their clinical research data []. While there are a few options for directly integrating RWD into a Castor database, there are limitations.

Importing simple RWD is possible using the user interface of Castor EDC, but this does not allow for import of survey data and does not account for differences in columns and variable coding between databases. Moreover, importing data through the user interface is limited to a maximum of 25,000 data points and cannot be done programmatically. It is possible to directly link the Castor database to an RWD database, such as an electronic health record [], but this needs to be configured separately for each database.

Castor EDC offers a public HyperText Transfer Protocol Secure (HTTPS) Application Programming Interface (API) that conforms to the Representational State Transfer (REST) architecture style. This API can be used to securely interact with the study database and import data []. A prior Python project implemented some of the endpoints in the API, but it has little support for differences between databases, has a small testing suite and can only be used with considerable knowledge of Python []. For R, there is an open source package available to interact with the API, but this package currently only supports reading data from the API, not writing data into the database [].

We aimed to develop an open source Python package to make integration of RWD accessible for clinical researchers, that can be used with little knowledge of programming or Python and which can be easily integrated with other data science tools such as pandas or R.

Implementation and architecture

The software package consists of three separate modules: the client (CastorClient), a local representation of the researcher’s database (CastorStudy) and a set of functions which can be used to import RWD or other study data (importer).

CastorClient

CastorClient supplies the key functionality of the package. It authenticates the user to the Castor API and has defined functions to interact with all the endpoints defined in the API. It allows for direct querying of database endpoints, but requires some knowledge of programming in Python. Examples of endpoints are creating a new survey package for a study participant, updating study data for a participant or retrieving information about the study. For an example, see Code Snippet 1.

Code Snippet 1 Example use of the CastorClient.


from castoredc_api import CastorClient
# Create a client with your credentials
c = CastorClient('MYCLIENTID', 'MYCLIENTSECRET', 'data.castoredc.com')
# Link the client to your study in the Castor EDC database
c.link_study('MYSTUDYID')
# Then you can interact with the API
# Get all records in the study
c.all_records()
# Create a new survey package
c.create_survey_package_instance(
    survey_package_id="SURVEY-PACKAGE-ID",
    record_id="TEST-001",
    email_address="example@fakemail.com",
    auto_send=True,
)

CastorStudy

CastorStudy is a local representation of the researcher’s study database. It downloads the raw study data through the API using the CastorClient module in the background and then augments and formats the study for easy searching. The local representation is split into two parts: one part defines the structure of the study and the other part defines the data collected in the study. For a visualisation of the study structure, see Figures 1 and 2.

Figure 1 

Overview of the Castor study structure.

Figure 2 

Unified modelling language (UML) diagram of the CastorStudy implementation.

The structure of a study is defined at the upper level by forms. Forms come in three types: study, report or survey. A study form is a planned phase in the study, a report form is an unscheduled or repeated event, and a survey form is a survey or electronic patient-reported outcome measure. Each form consists of a collection of fields, which are the variables specified by the study protocol to be collected. Forms and fields can be considered the templates in which data per record is filled.

The data of the study is defined at the upper level by records. Each record is a study patient and is linked to a collection of form instances. These are instances of the study, report and survey forms and contain information on among other things when forms were created and filled in. Each form instance contains a collection of data points, which are instances of fields. These hold values and information for each variable in the study per record.
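The split between structure (forms and fields) and data (records, form instances and data points) can be illustrated with a small sketch. The class and attribute names below are hypothetical and chosen only for illustration; they are not the classes used inside CastorStudy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Field:
    """A variable defined by the study protocol (structure level)."""
    name: str
    field_type: str  # for example "numeric", "date" or "dropdown"


@dataclass
class Form:
    """A study, report or survey form: a named collection of fields."""
    name: str
    form_type: str  # "study", "report" or "survey"
    fields: Dict[str, Field] = field(default_factory=dict)

    def add_field(self, new_field: Field) -> None:
        self.fields[new_field.name] = new_field


@dataclass
class DataPoint:
    """An instance of a field, holding the value for one record."""
    template: Field
    value: Optional[str]


@dataclass
class FormInstance:
    """A filled-in instance of a form, belonging to one record."""
    template: Form
    data_points: List[DataPoint] = field(default_factory=list)

    def set_value(self, field_name: str, value: str) -> None:
        # Only fields that exist on the form template can hold data
        if field_name not in self.template.fields:
            raise KeyError(f"{field_name!r} is not a field of {self.template.name!r}")
        self.data_points.append(DataPoint(self.template.fields[field_name], value))


@dataclass
class Record:
    """A study participant, linked to a collection of form instances."""
    record_id: str
    form_instances: List[FormInstance] = field(default_factory=list)


# Structure: one report form with two fields
medication = Form("Medication", "report")
medication.add_field(Field("med_name", "string"))
medication.add_field(Field("med_dose", "numeric"))

# Data: one record with one instance of that form
record = Record("110001")
instance = FormInstance(medication)
instance.set_value("med_name", "Azathioprine")
record.form_instances.append(instance)
```

Keeping the templates separate from the instances means a value can be checked against its field definition before it is stored, which is the same idea the package relies on when validating data to be imported.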

Mapping the study structure locally allows the researcher to quickly find specific fields or forms and is used when validating data to be imported. After the study structure is mapped, a representation of the study is created in Python. This representation contains only forms, steps and fields; it can be seen as the case report forms of the study and does not include data. When importing data, the structure is used to check whether fields exist in the study and to validate values. Moreover, one can search the structure for specific fields or forms to get information on them (see Code Snippet 2).

Code Snippet 2 Example use of the CastorStudy.


from castoredc_api import CastorStudy
# Link to Castor database
study = CastorStudy('MYCLIENTID', 'MYCLIENTSECRET', 'MYSTUDYID', 'data.castoredc.com')


# Map only the study structure locally
study.map_structure()
# Find the field labelled "med_name"
field = study.get_single_field("med_name")


# Map the study data locally (also maps structure)
study.map_data()
# Get all data points in the study
data_points = study.get_all_data_points()


# Export your study to pandas dataframes or CSV files
study.export_to_dataframe()
study.export_to_csv()

One can also map the study data, which first maps the study structure and then exports all data from the Castor database, linking it to the structure and allowing the data to be accessed as Python objects. These functions are used to export data from the API to data analysis tools such as Python (pandas) and R []. For examples, see Code Snippet 2.

Importer

The importer can be used to read, clean, validate and import RWD into the Castor database. The importer is started by calling the function import_data with a set of configuration options. First the simple case of uploading data is shown; thereafter the complex options are described. In all cases the main input is the data to be imported, an xls(x) file with records as rows and variables as columns (Table 1).

Table 1

Example of a data file.


PATIENT  MEDICATION    STARTDATE   STOPDATE    DOSE  UNITS
-------  ------------  ----------  ----------  ----  ----------
110001   Azathioprine  05-12-2019  05-12-2020  0.05  g/day
110002   Vedolizumab   17-08-2018  17-09-2020  300   mg/4 weeks
110003   Ustekinumab   19-12-2017  03-06-2019  90    mg/8 weeks
110004   Thioguanine   25-04-2020  27-05-2021  15    mg/day
110005   Tofacitinib   01-03-2020  31-12-2999  10    mg/day

This is combined with a second xls(x) file that maps the columns in the data file to the fields in the Castor database (Table 2). Headers “other” and “castor” are obligatory. For special cases, for example checkbox, radio or dropdown fields that define an ‘other’ option, see the software documentation [].

Table 2

Example of a link file.


OTHER       CASTOR
----------  ---------
patient     record_id
medication  med_name
startdate   med_start
stopdate    med_stop
dose        med_dose
units       med_units
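Conceptually, the link file is a lookup from column names in the data file to field names in the Castor database. The sketch below shows that idea with plain dictionaries; the function rename_columns is hypothetical and is not part of the package, and the sketch assumes the column names in the data row are written exactly as in the "other" column of the link file.

```python
# Contents of the link file (Table 2): data file column -> Castor field name
column_link = {
    "patient": "record_id",
    "medication": "med_name",
    "startdate": "med_start",
    "stopdate": "med_stop",
    "dose": "med_dose",
    "units": "med_units",
}


def rename_columns(row, link):
    """Return a copy of one data row keyed by Castor field names."""
    missing = [column for column in row if column not in link]
    if missing:
        # A column without a linked field cannot be imported
        raise KeyError(f"No Castor field linked for columns: {missing}")
    return {link[column]: value for column, value in row.items()}


# First row of the example data file (Table 1)
row = {
    "patient": "110001",
    "medication": "Azathioprine",
    "startdate": "05-12-2019",
    "stopdate": "05-12-2020",
    "dose": "0.05",
    "units": "g/day",
}
renamed = rename_columns(row, column_link)
```

After renaming, every key in the row is a Castor field name, so each value can be looked up against the study structure.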

Three other options need to be specified when uploading data: the study, whether the data is labelled, and the target of the import. The study is defined by a CastorStudy (see above). The labelled-data option defines whether the data file contains labels (woman/man) or values (0/1) for data points that use an option group. The target defines whether the data to be imported are for a study, report or survey form. When importing report or survey data, it is also necessary to define the name of the report or survey. When uploading survey data, a fall-back (fake or researcher-owned) e-mail address needs to be supplied. While the surveys are not sent to this e-mail address, the database does not accept survey instances without a fall-back e-mail address. For examples of each data type, see Code Snippet 3.

Code Snippet 3 Example use of simple import.


from castoredc_api import CastorStudy
from castoredc_api import import_data
# Link to Castor database
study = CastorStudy('MYCLIENTID', 'MYCLIENTSECRET', 'MYSTUDYID', 'data.castoredc.com')
# Import labelled study data
imported_data = import_data(
    data_source_path="studydatafile.xlsx",
    column_link_path="studylinkfile.xlsx",
    study=study,
    label_data=True,
    target="Study",
)
# Import non-labelled report data
imported_data = import_data(
    data_source_path="reportdatafile.xlsx",
    column_link_path="reportlinkfile.xlsx",
    study=study,
    label_data=False,
    target="Report",
    target_name="Medication",
)
# Import labelled survey data
imported_data = import_data(
    data_source_path="surveydatafile.xlsx",
    column_link_path="surveylinkfile.xlsx",
    study=study,
    label_data=True,
    target="Survey",
    target_name="Example Survey Package",
    email="example@fakemail.com",
)

Before uploading the data file to the Castor database, the package prepares the data for import and validates all values. Every column in the data file is checked to see whether a target field exists and is part of the target. Every record is checked to determine whether it exists in the study. All labelled data is translated to the corresponding values if applicable. For all columns the values to be imported are checked against minimum or maximum allowed values (numeric and year fields), the allowed option groups (dropdown, radio and checkbox fields), and the correct formatting (date, datetime and time fields). When one or more errors are encountered in the data file, the process is aborted and the program outputs a file indicating for which values an error was found.
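The checks described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the package's implementation: the FIELDS dictionary, the field named sex, the coding woman = 0 and man = 1, the numeric limits, and the functions validate_value and validate_rows are all hypothetical, and the date format is assumed to be day-month-year as in Table 1.

```python
from datetime import datetime

# Hypothetical field definitions; in practice these come from the study structure
FIELDS = {
    "med_dose": {"type": "numeric", "min": 0, "max": 1000},
    "med_start": {"type": "date"},
    "sex": {"type": "radio", "options": {"woman": "0", "man": "1"}},
}


def validate_value(field_name, raw, labelled=True):
    """Return (value, error); exactly one of the two is None."""
    spec = FIELDS.get(field_name)
    if spec is None:
        return None, f"{field_name}: field does not exist in the study"
    if spec["type"] == "numeric":
        try:
            number = float(raw)
        except (TypeError, ValueError):
            return None, f"{field_name}: {raw!r} is not a number"
        if not spec["min"] <= number <= spec["max"]:
            return None, f"{field_name}: {raw!r} is outside the allowed range"
        return raw, None
    if spec["type"] == "date":
        try:
            datetime.strptime(raw, "%d-%m-%Y")
        except (TypeError, ValueError):
            return None, f"{field_name}: {raw!r} is not formatted as dd-mm-yyyy"
        return raw, None
    if spec["type"] == "radio":
        options = spec["options"]
        if labelled:
            # Translate a label to its stored value
            if raw not in options:
                return None, f"{field_name}: {raw!r} is not a known label"
            return options[raw], None
        if raw not in options.values():
            return None, f"{field_name}: {raw!r} is not a known value"
        return raw, None
    return None, f"{field_name}: unsupported field type"


def validate_rows(rows, labelled=True):
    """Validate all rows first; return no clean rows if any error was found."""
    clean, errors = [], []
    for index, row in enumerate(rows):
        checked = {}
        for field_name, raw in row.items():
            value, error = validate_value(field_name, raw, labelled)
            if error is not None:
                errors.append((index, error))
            else:
                checked[field_name] = value
        clean.append(checked)
    # Abort the whole import when one or more values are invalid
    return ([], errors) if errors else (clean, [])


good_rows = [{"med_dose": "300", "med_start": "17-08-2018", "sex": "woman"}]
bad_rows = [{"med_dose": "5000", "med_start": "17-08-2018", "sex": "woman"}]
clean, errors = validate_rows(good_rows)
rejected, problems = validate_rows(bad_rows)
```

Collecting every error before aborting, rather than stopping at the first one, lets the researcher correct the whole data file in a single pass.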

The import_data function can be supplied with some extra configuration options for complex situations. These are translating variable names between databases, merging multiple variables into one, formatting options and an option for asynchronous interaction with the API. See the software documentation [].

Quality control

The software has been thoroughly tested with a testing suite of more than 500 tests. Test coverage is 98% and covers almost all aspects of the CastorClient, CastorStudy and importer modules. As a large proportion of the tests interact with the Castor database, tests are currently run through a GitHub-hosted runner. Other users can run tests on their local machine for development after contacting the code owner.

(2) Availability

Operating system

Platform Independent

Programming language

Python >= 3.8

Additional system requirements

No specific requirements

Dependencies

pandas >= 1.3.1; numpy >= 1.21.1; openpyxl >= 3.0.7; tqdm >= 4.62.0; httpx >= 0.19.0

List of contributors

van Linschoten, Reinier Cornelis Anthonius (a, b)

Knijnenburg, Sebastiaan Laurens (c)

Lutro, Andreas (c)

(a) Department of Gastroenterology & Hepatology, Franciscus Gasthuis & Vlietland, Rotterdam, Netherlands

(b) Department of Gastroenterology & Hepatology, Erasmus MC, Rotterdam, Netherlands

(c) Castor, Amsterdam, The Netherlands

Software location

Archive

Name: CastorEDC API

Persistent identifier: https://pypi.org/project/castoredc-api/

Licence: MIT

Publisher: Reinier Cornelis Anthonius van Linschoten

Version published: v0.1.9.1

Date published: 13/07/2023

Code repository

Name: CastorEDC API

Identifier: https://github.com/reiniervlinschoten/castoredc_api

Licence: MIT

Date published: 13/07/2023

Language

English

(3) Reuse potential

The described software originated from the IBD Value study []. This longitudinal multicentre non-randomised cluster trial combines patient-reported outcomes and RWD to study variation in quality of care in inflammatory bowel disease and the effect of a uniform care pathway on quality of care. CastorEDC API has been successfully used to merge data from eight hospitals, two different electronic health records and a registry with patient-reported outcomes into a single database in Castor EDC. Moreover, the Erasmus Medical Centre is testing this software for importing data into Castor EDC. The scope of the software is not limited to data from electronic health records or patient-reported outcomes. It allows creating automatic scripts to transform, validate and import differing types of data from large and diverging databases, such as data from wearable devices or lifestyle and environmental data, into a single Castor study database.

The use of test-driven development and continuous integration to ensure high code quality and proper code formatting led to a modular codebase which can be easily extended. As validity and reliability of data are of utmost importance in scientific research, a large testing suite was pivotal for ensuring correct functionality, both during development and when refactoring and cleaning up the code base.

While the software has been extensively tested and is ready for production, there remain some possible improvements. Most important among them are full asynchronous export of study data and more extensive configuration options for differences in data structures between the data file and the Castor database. When exporting study data for a large study, the API requests for the raw data take considerable time and block the program until the Castor database responds. Currently, export happens partially asynchronously, but fully asynchronous requests can speed this up considerably. Moreover, there is no support for mapping between the columns in the data file and the fields in the Castor database other than one-to-one or many-to-one. These improvements are being tracked in the project on GitHub, and users of the software are encouraged to contribute or submit other feature requests [].
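Why asynchronous requests help can be shown with a toy example. The sketch below does not call the Castor API: fetch_form_data is a hypothetical stand-in that sleeps for a fixed time to imitate a slow server response, and the limit of five simultaneous requests is an arbitrary assumption.

```python
import asyncio
import time


async def fetch_form_data(form_id, delay=0.05):
    """Hypothetical stand-in for one API request; sleeps instead of calling a server."""
    await asyncio.sleep(delay)
    return {"form": form_id, "data_points": []}


async def export_sequential(form_ids):
    # Each request waits for the previous one to finish
    return [await fetch_form_data(form_id) for form_id in form_ids]


async def export_concurrent(form_ids, max_concurrent=5):
    # A semaphore caps the number of requests that are in flight at once
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(form_id):
        async with semaphore:
            return await fetch_form_data(form_id)

    # gather returns the results in the same order as form_ids
    return await asyncio.gather(*(limited(form_id) for form_id in form_ids))


def timed(coroutine):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start


form_ids = [f"form-{number}" for number in range(10)]
sequential, seconds_sequential = timed(export_sequential(form_ids))
concurrent, seconds_concurrent = timed(export_concurrent(form_ids))
```

Both versions return the same data, but the concurrent version spends the waiting time of several requests at once, so the total time depends on the number of batches rather than the number of requests.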

Data Accessibility Statement

The described software is free to use, open source, and accessible at https://github.com/reiniervlinschoten/castoredc_api.