An ObsPy library for event detection and seismic attribute calculation: preparing waveforms for automated analysis

We have implemented an extension for the observational seismology obspy software package to provide a streamlined tool tailored to the processing of seismic signals from non-earthquake sources, in particular those from deforming systems such as glaciers and landslides. This seismic attributes library provides functionality to: (1) download and/or pre-process seismic waveform data; (2) detect and catalogue seismic events using multi-component signals from one or more seismometers; and (3) calculate characteristics ('attributes'/'features') of the identified events. The workflow is controlled by three main functions that have been tested for the breadth of data types expected from permanent and campaign-deployed seismic instrumentation. A selected STA/LTA-type (short-term average/long-term average), or other, event detection algorithm can be applied to the waveforms and user-defined functions implemented to calculate any required characteristics of the detected events. The code is written in Python 2/3 and is available on GitHub together with detailed documentation and worked examples.


Introduction
Seismology provides an attractive tool to investigate physical processes in deforming systems. The seismic signals from active glaciers, for example, could enable monitoring of mechanisms including basal sliding [6,13,24], fracturing [9], melt water drainage [4], and iceberg calving [14,15]. The detection of seismic events from the recorded continuous seismic waveform data is a vital first step in any analysis. Event catalogues thus constructed are needed for local seismicity studies, comparisons between locations, or detection of change over extended time periods. However, the automated detection of seismic events is complicated in environmental and geotechnical seismology by the diverse populations of signal generation mechanisms. Those generated by active glaciers, for example, can be expected to span several orders of magnitude in duration and amplitude [16]. Volcanoes [10], landslides [19] and mining activity [25] similarly produce a broad range of seismic signals. As a result, the majority of cryoseismology studies to date use manual identification of events [6,17,18]. Manual techniques are not readily scalable nor exactly reproducible: in particular, they are not a first choice for monitoring applications nor for the data-driven detection of change.
Established event detection algorithms have largely been developed for earthquake seismology [1,2,3,7]. These algorithms, including STA/LTA (short-term average/longterm average) [1] and template matching [3], are applied in real-time to seismic data to detect earthquakes and produce event catalogues [2,7,8,23]. The core classes in the obspy software package are therefore designed assuming event metadata is available online alongside the waveform data [22]; additional functions are included outside the core classes to select events directly from waveform data. These standard algorithms are generally only applicable to non-earthquake signals by employing an experimental approach to parameter selection [5,16]. The seismic attributes library provides software tools to download seismic waveform data from online repositories and detect environmental and geotechnical seismic events using a choice of algorithms. Algorithm options include the classic, recursive and delayed STA/LTA algorithms [22], and the newly developed multi-STA/LTA algorithm [11]. The multi-STA/LTA can simultaneously extract both short and long duration events of very different signal-to-noise levels and potentially enables real-time monitoring of cryoseismic events.
Event catalogues constructed using the seismic attributes library would be expected to comprise diverse signals generated by a range of mechanisms as noted above. The various signals typically need to be separated into related clusters based on the characteristics of their waveforms in order to study the events further [10,19]. For example, Provost et al. [19] consider 71 attributes based on waveform data in their study of landslide seismicity. These are broadly split into four categories: (1) waveform attributes (e.g. duration, energy, kurtosis); (2) spectral and spectrogram attributes (e.g. discrete Fourier transform); (3) network attributes (e.g. station with maximum amplitude); and (4) polarity attributes (e.g. azimuth, inclination). The seismic attributes library includes functions to calculate a number of standard signal properties: duration, ratio between ascending and descending time, energy in the autocorrelation function, energy in the frequency filtered spectrum, and the direction of wave propagation. These are provided in three bundles of attribute functions describing the waveform, spectrum and polarity of the signal; we do not include network attributes as these are unordered, discrete variables (and largely application dependent). User-defined functions can be added to derive customised characteristics within our software architecture. The correlation between attributes can be investigated using a plotting function to select the best subset for the subsequent application of (e.g.) clustering to a given set of events, informing the removal of redundant variables if appropriate. Data-driven techniques more generally, such as machine learning algorithms, may be readily applied to the calculated attributes to inform the current and future state of the glacier (i.e. identify signals in the lead-up to a large event [20]).

Implementation and architecture
The analysis of waveform data from environmental and geotechnical seismic deployments requires multiple distinct steps. Following the principle that one should write programs that do one thing well, and write programs that work together [21], the seismic waveforms library is correspondingly split into three primary functions to streamline the workflow. The name and purpose of these functions are as follows: • get waveforms(); this function downloads waveform data from an online repository or alternatively reads data stored locally.
• get events(); this function uses an event triggering algorithm to produce an event catalogue based on the seismic signals in one or more components of one or more seismic stations.
• get attributes(); this function produces a pandas DataFrame of the attributes for each event using functions included in the library, or user-defined attribute functions.
The library includes the numerous attribute functions that are called by get attributes(). Additional functions are also included to analyse the output of the primary functions (e.g. plotting), together with several private functions that support the primary workflow (documented in the code itself). The implementation of the three primary functions is outlined below.
get waveforms Seismic datasets can build to a high data volume (several terrabytes) if high sampling rates are required over extended periods of time, or if using data from seismic arrays with multiple seismometers. This can result in memory problems when processing the waveforms and also disk space limitations. The first function in our workflow therefore acts as a gateway, to manage the volume of data downloaded from online repositories using obspy clients. Waveform data from each seismometer is written into separate files comprising only a single day of data for a single component (e.g. vertical, north, east). The user can specify the write location of the files to split different time periods across several external drives. This same function checks if the requested data have previously been downloaded (by default) and can optionally not check for, or not download missing data (e.g. in the case of known data gaps). The waveform data are written as .mseed files using standard obspy functions. The get waveforms function recombines downloaded or locally stored files into a single obspy Trace object for the requested time period, thus preparing the downloaded files for the next steps in the workflow. The user should examine small subsets of the time period of interest (e.g. a week at a time) based on factors including data volume and computer hardware performance. The seismic waveforms for the requested time period and components of a given seismometer are output as an obspy Stream object. The streams from different seismometers are combined using the '+' operator into a single object.

get events
As noted in the introduction, several algorithms exist to trigger events from single component waveform data. The get events code first separates the signals from multiple seismometers and combines their (usually) three components into a single waveform. The user can optionally specify whether the component waveforms are combined as the sum-of-squares of their amplitude (i.e. Euclidean norm) to give the wave energy, or the absolute value of the wave amplitude. Further, to ensure that taking the absolute value of the amplitude does not affect the results (e.g. doubling frequency) we tested a computationally intensive option to fit the (time-varying) principle component of the direction of wave oscillation to obtain a signed amplitude; this gave identical event detections to the absolute amplitude and so is not included in our published version of the code due to the significantly longer computation time.
We use an adapted version of the obspy coincidence trigger function to create a reference catalogue of events based on the STA/LTA characteristic function at each seismometer for their combined component waveforms. Our version of the coincidence trigger function includes adjustments in the algorithm to better align with the outputs needed in our workflow and improvements to computational performance that are possible for our narrower use of this function (see Figure 1 for schematic of algorithm). This function can use the standard STA/LTA algorithms available within obspy [22], and optionally the recently developed multi-STA/LTA algorithm [11]. In all cases, locationspecific algorithm parameters are likely to be more successful for environmental seismology applications. The start and stop times of the detected event from a single seismometer can be extended to consider small gaps in between triggers (up to length thr event join), creating a single, longer-duration record. The reference event catalogue is created by finding times when n (i.e. thr coincidence sum) of the N seismometers in the array have temporally coincident records. The physical dimensions of the seismic array are calculated at this step to estimate an upper bound on the  Figure 1: Top: Diagrammatic representation of algorithm used to identify events from a single seismometer comprising one or more channels (traces show their Euclidean norm). The characteristic function for an STA/LTA algorithm is used to 'trigger' events (shown in red) and small gaps between these events (shorter than thr event join) are ignored (orange). No events are present at other times (shown in green). Bottom: Representation of algorithm used to identify events in the 'reference event' and 'trace' catalogues for seismic arrays (with multiple seismometers). The example shows an indicative array with N = 3 seismometers and thr coincidence sum 2 simultaneous detections. Events identified for the single seismometers traces are shown in red (as given in the top panel). The duration of these events is extended at each end by half the delay in arrival time of a wave between the most distant seismometers in the array (shown in deep red). The reference event is identified as the times when thr coincidence sum 2 seismometers have a detection (shown in red or deep red); small gaps with fewer seismometers are joined over (shown in orange). The events at each seismometer are simply the events identified from the single seismometer traces (red but not deep red), but with any times between events that occur during the reference event also included (shown in grey). delay in arrival time of the signal at different seismometers. This delay is added to the duration of the single seismometer records to ensure the reference events are not artificially shortened. The reference event catalogue, which includes a reference start time and duration for each event, is output as a pandas DataFrame to provide a convenient format to write to file and interrogate. Our adapted version of the coincidence trigger function provides additional utility for other researchers by outputting a secondary catalogue of trace metadata, for each seismometer, for all identified reference events. The triggered records at each seismometer that occur during (at least) part of the time period covered by a given reference event are joined together (if not already a single record). This record may extend beyond the time period of the reference event, be contained entirely within it, or have no detection at all in the case of weaker events. This trace (metadata) catalogue, which includes the trigger start time and duration for all seismometers, is similarly output as a pandas DataFrame. The trace metadata can be used to extract waveform data for identified events in the catalogue in a format that will be familiar to seismologists, for example, as shown in Figure 2

get attributes
The get attributes function is written following further principles [21], that one should write flexible and open programs. This function extracts the waveform data for a given seismometer for the duration of a given event listed in the reference event, or trace (metadata) catalogue. The waveforms for each component are stored as separate Trace objects in a single stream. This Stream object is passed to one or multiple attribute functions to derive the characteristics of the given event as measured by a given seismometer. The values of the requested attributes for the array of seismometers are output as a pandas DataFrame. The correlation between the spectral attributes included in the library are shown in Figure 3 for events detected at Ilulissat, Greenland (DK.ILULI) on 1 January 2018 using the recursive STA/LTA algorithm.
Optionally, custom attribute functions may be added to calculate any chosen characteristic of a waveform. Custom functions must take a Stream object containing one or more components as an input, and output the attribute name and value. Public functions to combine the component waveforms into the wave energy or absolute value of the wave amplitude are included in the seismic attributes library to aid the user. The attribute functions (as many as are required) are passed to the get attributes function as optional parameters, with return values stored as above.

Quality control
We have tested the seismic attributes library to find any software bugs and to ensure outputs are reproducible by other researchers running the code or using different platforms [12]. We first created test cases based on glacier seismic signals from a field campaign seismic array on the Whillans Ice Stream in West Antarctica [24]. Plausible uses of the code were tested by selecting some or all of the seismometers in the array, some or all of the three components of the signal recorded at each seismometer (i.e. vertical, north and east), and considering different durations of requested time (with and without small data gaps). The optional function inputs were also tested in this manner. In application, e.g. [11], the event catalogue produced using the seismic attributes workflow compared favourably to events that were visually classified using the same set of seismometers [18]. Finally, the attribute functions included in the library were tested on mock waveform data to ensure they produce the theoretically expected results.
Testing was primarily carried out using a Python 3.8 installation on a macOS 10/11 system (Unix). The key features of the seismic attributes library were tested on other platforms for compatability, including both Python 2.7 and 3.7/8 for the most up to date version of obspy, numpy, pandas and seaborn available for that installation.
The code was further tested on both macOS 10/11 and Windows operating systems, especially to verify functionality of the different file systems. The reference event and trace (metadata) catalogues were compared between these platforms to assess the consistency of code functionality.  Figure 3: Corner (or pair) plot illustrating the correlation between spectral attributes for the events detected at Ilulissat, Greenland on 1 January 2018 using the recursive STA/LTA algorithm. The attribute names are defined to match those used by Provost et al. [19]. In this example, attribute 26 considers a frequency range close to the sampling rate of the recorded signal, therefore, it has null value for all events. The code to reproduce this plot is provided as a worked example in the detailed documentation provided with the software.
Programming language Python 2 or 3.
Additional system requirements Memory and disk space will limit the volume of data that can be processed in a contiguous chunk; the code will work on any system that can support obspy. Internet access is required to download new waveform data from online repositories.

Dependencies
The minimal dependency for use is Python 2/3 with obspy, numpy, pandas and seaborn packages installed. (3) Reuse potential The main purpose behind the seismic attributes library is to provide a streamlined workflow for researchers to detect and categorise seismic events from environmental and geotechnical sources in a rigorous and reproducible manner. The code will find extensive use in areas of seismology where a diverse population of seismic signals are present, including glaciers, volcanoes, landslides and mine sites. The get events function is expected to be especially useful, in particular in glacier seismology, to provide a consistent method for creating event catalogues for ongoing machine learning and conventional seismological analysis. We anticipate that our get attributes function may also be useful in applications outside seismology to aid in the construction of a catalogue of waveform attributes for use in machine learning. The code would need only minor modifications to handle time series data, outside of seismology, stored in different formats. It could conceivably find applications other areas of science (e.g. astrophysics), economics and many other potential applications. Limited support may be provided by contacting the corresponding author.

Funding statement
This research was supported under Australian Research Council's (ARC) Special Research Initiative for Antarctic Gateway Partnership (project ID SR140300001), and ARC Discovery Projects DP190100418 and DP2101000834.