Scikit-spectra: Explorative Spectroscopy in Python

Scikit-spectra is an intuitive framework for explorative spectroscopy in Python. It leverages the Pandas library for data processing and provides datastructures and an API designed for spectroscopy. Using the new IPython Notebook widget system, scikit-spectra is headed towards a "GUI when you want it, API when you need it" approach to spectral analysis. As an application, an analysis is presented of the surface plasmon resonance shift in a solution of gold nanoparticles, induced by proteins binding to the gold's surface. Please refer to the scikit-spectra website for full documentation and support: http://hugadams.github.io/scikit-spectra/


Introduction
Spectroscopy, the study of the interaction of light and matter, is used as an experimental technique in chemistry (IR/Raman), physics (NMR/X-ray), biology and nanotechnology (UV-Vis/circular dichroism), and many other scientific fields. Despite widespread interest, spectroscopy software development is not often a research focus; researchers traditionally rely on commercial software bundled with instrumentation, such as a benchtop spectrometer or a Raman microscope. Such software is expensive and usually tailored to a particular research domain or application. Open-source solutions are less abundant, and also tend to be specialized.
Python has emerged as a Swiss Army knife for scientific research, due in large part to a core group of scientific libraries known as SciPy [1], perhaps the most prominent of which are NumPy [2], Pandas [3] and IPython [4]. To integrate with the SciPy ecosystem, new libraries must be NumPy-compatible. For example, the scikit-image¹ [5] library stores images as pure NumPy arrays, while Pandas' primary datastructures are directly subclassed from NumPy arrays. In regard to spectroscopy in Python, a handful of domain-specific, SciPy-compatible libraries are available; for example, NMRGlue [6] and PySpecKit [7] are great resources for nuclear magnetic resonance and astronomy applications, respectively.
Interoperability between Python's spectroscopy libraries is challenging, even when they are NumPy-compatible. The primary difficulty arises in storing metadata. Spectral data is tabular: a matrix of n spectra measured at m timepoints, or more generally, m variational points, with labeled rows (e.g. wavelength) and columns (e.g. time). Most would recognize this datastructure from an Excel spreadsheet. Fortunately, tabular data is already well supported in Python by the widely used Pandas library. Pandas provides an intuitive, NumPy-friendly API for IO, plotting, statistical analysis and data manipulation. Unfortunately, it is not straightforward to simply repurpose Pandas for spectroscopy, since Pandas datastructures do not preserve arbitrary metadata, nor do they support conventional Python subclassing. Such obstacles reduce Pandas' applicability to spectral analysis.
Herein, scikit-spectra is presented, a Python library that provides generalized datastructures and APIs for explorative² spectroscopy. Scikit-spectra overcomes the aforementioned Pandas metadata and subclassing obstacles to provide spectroscopy datastructures that behave identically to Pandas objects, leading to a much more intuitive framework for IO, manipulation and plotting of spectral data. The ways that scikit-spectra extends Pandas to suit the needs of spectroscopists include:
1. Nearest-neighbor slicing to index data based on an approximate range of values.
2. 2D and 3D contour, waterfall, auto-correlation, and other spectral plots seamlessly integrated into Pandas' pre-existing plotting API.
3. Unit-aware indexing objects for easy unit conversions and integration with the plotting API.
4. IPython Notebook graphical user interfaces (GUIs) to expedite many common tasks such as resampling, normalization, and plotting without ever leaving the notebook environment.
5. Spectra and Spectrum classes to replace the Pandas DataFrame and Series, respectively. These objects retain all of the functionality of their Pandas counterparts, and add capabilities such as persistence of metadata, the notion of baseline and reference spectra, and reversible spectral normalization: for example, converting raw data into transmission (T), percent transmission (%T), or absorbance (A) spectra.
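Nearest-neighbor slicing (item 1) can be illustrated with plain NumPy and Pandas. The sketch below is only an assumption about how such a lookup might work; the helper name nearest_slice is hypothetical and is not scikit-spectra's implementation. It selects the rows between the labels closest to the requested bounds, which is useful when the wavelength grid never lands exactly on round values.

```python
import numpy as np
import pandas as pd


def nearest_slice(frame, start, stop):
    """Return the rows between the index labels nearest to start and stop (inclusive)."""
    labels = frame.index.to_numpy(dtype=float)
    # Position of the label closest to each requested bound.
    i0 = int(np.abs(labels - start).argmin())
    i1 = int(np.abs(labels - stop).argmin())
    lo, hi = sorted((i0, i1))
    return frame.iloc[lo:hi + 1]


# A wavelength grid that never hits 400 or 700 exactly (illustrative values).
wavelengths = np.arange(399.7, 702.0, 0.5)
values = np.random.default_rng(0).random((wavelengths.size, 3))
data = pd.DataFrame(values, index=wavelengths, columns=["t0", "t1", "t2"])

# A label-exact slice would need 400.0 and 700.0 to exist; this one does not.
cropped = nearest_slice(data, 400.0, 700.0)
```

Here the crop starts at the label nearest to 400 and ends at the label nearest to 700, even though neither value is present in the index.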

Implementation and architecture
Scikit-spectra's core datastructures are the Spectrum and Spectra, which behave as if they were directly subclassed from the Pandas Series and DataFrame, respectively. Spectrum and Spectra are actually composite classes: pure Python classes that store both a Pandas object and metadata attributes, yet operate identically to Pandas objects from the end user's point of view. In the future, scikit-spectra may be refactored to truly subclass from Pandas objects, as libraries like GeoPandas [8] and Xray [9] have recently shown how to do this properly.
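The composite pattern described above can be sketched in a few lines. This is a simplified illustration, not scikit-spectra's code: the class name MetaFrame and its metadata keys are hypothetical. The wrapper stores a DataFrame plus a metadata dictionary, forwards unknown attribute lookups to the DataFrame, and re-wraps any DataFrame result so that the metadata persists.

```python
import pandas as pd


class MetaFrame:
    """Hypothetical composite: wraps a DataFrame and carries metadata alongside it."""

    def __init__(self, frame, **metadata):
        self._frame = frame
        self.metadata = dict(metadata)

    def _rewrap(self, result):
        # Re-wrap DataFrame results so metadata survives chained operations.
        if isinstance(result, pd.DataFrame):
            return MetaFrame(result, **self.metadata)
        return result

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward to the wrapped DataFrame.
        if name.startswith("_"):
            raise AttributeError(name)
        attr = getattr(self._frame, name)
        if not callable(attr):
            return self._rewrap(attr)

        def method(*args, **kwargs):
            return self._rewrap(attr(*args, **kwargs))

        return method

    def __getitem__(self, key):
        return self._rewrap(self._frame[key])


raw = pd.DataFrame({"t0": [0.1, 0.2], "t1": [0.3, 0.4]}, index=[400.0, 401.0])
spec = MetaFrame(raw, name="AuNPs in water", specunit="nm")

doubled = spec.mul(2)   # DataFrame method; the result is re-wrapped with metadata
col_max = spec.max()    # Series result passes through unchanged
```

Python looks up operators such as + and * on the type rather than through __getattr__, so a complete composite would also need explicit special methods. That bookkeeping is one reason true subclassing is attractive.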
In addition, scikit-spectra provides a SpecStack class for operating on multiple Spectra, analogous to the Pandas Panel class. SpecStack provides basic functionality for operations on multiple datasets, but does not try to emulate the API of the Panel.
Scikit-spectra defines custom Pandas Index objects, such as the SpecIndex, TimeIndex and TempIndex, to support common spectral labels. For example, a TimeIndex supports timestamped labels and can convert to interval representations (e.g. seconds elapsed), and the SpecIndex can transform spectral units, for example nanometers to inverse centimeters to electron volts. Arbitrary unit systems can also be defined through the Unit class. For example, a custom unit to denote polarization would be defined as follows:

    from skspec.units import Unit

The short, full and symbol attributes ensure new units will automatically interface to the indexing and plotting systems.
Scikit-spectra includes graphical applications developed fully with IPython's widget API, and is one of the first libraries to do so. The documentation is built with Sphinx [10], using the Bootstrap theme [11] and sphinx gallery extensions [12], and is heavily inspired by the scikit-image and scikit-learn [13] docs.
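The spectral unit chain mentioned above (nanometers to inverse centimeters to electron volts) follows from standard physical relations. The sketch below is independent of scikit-spectra; the function names are illustrative only, and the constants are the exact SI values.

```python
# Exact SI constants (2019 redefinition).
PLANCK = 6.62607015e-34             # J s
LIGHT_SPEED = 299792458.0           # m / s
ELEMENTARY_CHARGE = 1.602176634e-19  # C, so 1 eV = 1.602176634e-19 J


def nm_to_wavenumber(nm):
    """Wavelength in nm -> wavenumber in cm^-1 (1 cm = 1e7 nm)."""
    return 1.0e7 / nm


def wavenumber_to_ev(cm_inv):
    """Wavenumber in cm^-1 -> photon energy in eV, using E = h * c * wavenumber."""
    per_meter = cm_inv * 100.0
    return PLANCK * LIGHT_SPEED * per_meter / ELEMENTARY_CHARGE


def nm_to_ev(nm):
    """Wavelength in nm -> photon energy in eV, chaining the two conversions."""
    return wavenumber_to_ev(nm_to_wavenumber(nm))


plasmon_cm = nm_to_wavenumber(525.0)  # wavenumber of a 525 nm resonance
plasmon_ev = nm_to_ev(525.0)          # photon energy of a 525 nm resonance
```

The functions also accept NumPy arrays, so a whole wavelength axis can be converted at once.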

Examples of use³
To illustrate some basic functionality of scikit-spectra, data are analyzed from a system of gold nanoparticles (AuNPs) in a cuvette of water before and after protein has been added to the solution. Binding between the protein and the nanoparticles yields a characteristic shift towards longer wavelengths in the gold's absorbance peak, known as the localized surface plasmon resonance [14,15]. For brevity, a bundled dataset, aunps_water(), is used; however, reading data from a CSV file is quite easy, as scikit-spectra wraps Pandas' read_csv() parser. The iloc indexer was used to display only the first five rows and columns, as seen in Table 1. If working in the IPython Notebook, Spectrum and Spectra objects will automatically render as HTML tables, with metadata attributes such as name, units, shape and baseline, as well as the reference and normalization states, shown in the header. The ability to store arbitrary metadata is crucial to repurposing Pandas for specific applications.
Table 1: HTML output of first five rows and columns of gold nanoparticles in water. This built-in dataset is preset with a stored baseline and reference spectra, and has column labels of timestamps.

In this dataset, the baseline comes preset, but is not subtracted. The dataset is shown in Fig. 1; the baseline and reference spectra are plotted as dashed lines for clarity.

    ts.reference = 0
    ts.varunit = 's'
    # Plot the dataset, then overlay the baseline and reference as dashed lines
    ax = ts.plot()
    ts.baseline.T.plot(color='k', ls='--', ax=ax)
    ts.reference.T.plot(color='magenta', ls='--', ax=ax);

Spectral data in their raw form are typically not very useful; it is better to work with the absorbance spectrum, which is defined by the transformation:

$$A_n(\lambda) = -\log_{10}\left(\frac{S_n(\lambda) - B_n(\lambda)}{R_n(\lambda) - B_n(\lambda)}\right)$$

where A_n(λ) is the absorbance of the nth curve, S_n(λ) is the spectrum, R_n(λ) is the reference spectrum, and B_n(λ) is the baseline. The absorbance data tend to be very noisy in the short-wavelength region, due to the small signal in the raw data, so the usual procedure is to crop the data to the 400-700 nm range. The curves in Fig. 2 show the surface plasmon resonance around 525 nm and its clear shift to the right after proteins are added.
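As a rough illustration of the absorbance transformation, the NumPy sketch below assumes the conventional baseline-corrected form A = -log10((S - B) / (R - B)), applied column by column. The function name and the numbers are hypothetical. This is not scikit-spectra's normalization code, which performs the step through the Spectra object itself.

```python
import numpy as np


def absorbance(spectra, reference, baseline):
    """Baseline-corrected absorbance.

    spectra   : array of shape (n_wavelengths, n_curves), raw intensities
    reference : array of shape (n_wavelengths,), reference spectrum
    baseline  : array of shape (n_wavelengths,), baseline (dark) spectrum
    """
    signal = spectra - baseline[:, None]   # subtract baseline from every curve
    ref = (reference - baseline)[:, None]  # same correction for the reference
    return -np.log10(signal / ref)


# Illustrative numbers: two wavelengths, two curves.
baseline = np.array([1.0, 1.0])
reference = np.array([101.0, 11.0])
spectra = np.column_stack([reference, np.array([11.0, 2.0])])

A = absorbance(spectra, reference, baseline)
```

The first curve equals the reference, so its absorbance is zero everywhere. The second transmits one tenth of the reference signal at each wavelength, giving an absorbance of one.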
The plasmon resonance refers to the wavelength at which the nanoparticles maximally absorb, A_n(λ_max). Prior to analysis, the first few curves must be eliminated. These correspond to the timepoints taken before the addition of nanoparticles to the cuvette; that is, A_n(λ_max) ≈ 0. This is most easily done through boolean masking, which nicely exemplifies the notion of NumPy-compatibility.
The blue curves in Fig. 2 correspond to timepoints taken before the addition of nanoparticles. To remove these and retain only the subset of curves with significant absorbance, a mask is defined with a lower threshold of 0.10 absorbance units. The new TimeSpectra, ts_cut, retains only the curves recorded after the addition of AuNPs, that is, from the timepoint at which the absorbance maximum rises above the threshold value.
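The masking step can be sketched with a plain Pandas DataFrame standing in for the TimeSpectra. The values below are invented for illustration; only the 0.10 threshold comes from the text. Each column is one curve, labeled by elapsed seconds.

```python
import pandas as pd

# Rows: wavelength (nm); columns: elapsed time (s). Values are illustrative.
absorb = pd.DataFrame(
    {
        0: [0.01, 0.02, 0.01],   # before nanoparticles are added
        30: [0.02, 0.03, 0.02],  # before nanoparticles are added
        60: [0.35, 0.62, 0.40],  # after addition
        90: [0.36, 0.64, 0.41],  # after addition
    },
    index=[500.0, 525.0, 550.0],
)

# Boolean mask over columns: keep curves whose maximum exceeds the threshold.
mask = absorb.max(axis=0) > 0.10
ts_cut = absorb.loc[:, mask]
```

Because the mask is an ordinary boolean Series, the same expression works on any object that follows the Pandas indexing API.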
Next, the plasmon resonance shift vs. time is analyzed. Pandas already has a method that returns the index corresponding to the maximum value for every curve in the dataset: idxmax(). Since scikit-spectra objects expose all Pandas methods, idxmax() is also available on a TimeSpectra.
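A small, self-contained sketch of the idxmax() step, again using a plain DataFrame with synthetic Gaussian peaks rather than the real dataset. The peak centers are invented to mimic a red shift over time.

```python
import numpy as np
import pandas as pd

wavelengths = np.arange(500.0, 561.0, 1.0)  # nm

# Synthetic curves: elapsed time (s) -> peak center (nm), drifting to the red.
centers = {0: 525.0, 60: 530.0, 120: 534.0}
curves = {
    t: np.exp(-(((wavelengths - center) / 20.0) ** 2))
    for t, center in centers.items()
}
absorb = pd.DataFrame(curves, index=wavelengths)

# idxmax() returns, for each column, the index label of its maximum value.
peak_positions = absorb.idxmax()
# Shift of the resonance relative to the first curve.
shift = peak_positions - peak_positions.iloc[0]
```

The result is a Series indexed by time whose values are wavelengths, which is the shape needed for a shift-versus-time plot.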
Most of the analysis so far could have been performed in one of scikit-spectra's Notebook GUIs. At any point in the workflow, one could have opened the GUI, manipulated the data, exported it back into the Notebook namespace, and resumed working through the API, exemplifying the philosophy of "the GUI when you want it, the API when you need it". The code needed to run the GUI and a screenshot of the result are shown below in Fig. 4. The GUI supports nine plot types, as well as interactive plots through mpld3 [16]. A video tutorial of the GUI is available on the scikit-spectra website [17].
Asynchronicity, Ψ(λ,λ), measures the spectral distribution of uncorrelated events over a time⁵ interval. For example, if a peak at λ=a forms early in an experiment, and then later a second peak at λ=b appears, then there is asynchronicity at Ψ(a,b) because these events occurred at different times: they are uncorrelated and likely due to different underlying processes in the system. If the peaks had formed at the same time, then they would be regarded as highly synchronous. Together, synchronicity and asynchronicity encompass all of the variance in the dataset. Applications of two-dimensional correlation spectroscopy (2DCS) are discussed much more extensively in the scikit-spectra documentation. For datasets with many spectral peaks, 2DCS can often resolve otherwise intractable information about the order and nature of events in the system.

    from skspec.correlation import Corr2d

    cspec = Corr2d(ts.nearby[:600])
    # Note: `async` is a reserved word from Python 3.7 onward, so this line
    # is valid only on earlier interpreters.
    cspec.async.plot(contours=128, cbar=True)

Fig. 5 illustrates the asynchronicity at wavelengths ranging from 200 to 600 nm in the absorbance spectra. The time-span, spectral unit, spectral symbol and other metadata appear in the default plot labels, demonstrating the connectedness of scikit-spectra's units and plotting APIs. Strong cross-peaks between the ultraviolet and the plasmon resonance regions indicate that at some point in the experiment these peaks change asynchronously. This is verified by looking back at the data and observing that the initial protein binding leaves the nanoparticles saturated, with no binding sites for a second addition of proteins. However, the additional protein does increase the absorbance in the UV region. In other words, adding protein late in the experiment increases short-wavelength absorption, but has no effect on the plasmon resonance shift, resulting in asynchronicity between the 250 nm and 530 nm regions.
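For readers unfamiliar with how synchronous and asynchronous spectra are obtained, the sketch below follows Noda's standard discrete formulation: mean-center the spectra along time, then form the synchronous spectrum from their covariance and the asynchronous spectrum through the Hilbert-Noda transformation matrix. It is written from that textbook definition, not taken from scikit-spectra's Corr2d class, and the function name and synthetic data are assumptions.

```python
import numpy as np


def correlation_2d(spectra):
    """Synchronous and asynchronous 2D correlation spectra.

    spectra : array of shape (n_wavelengths, m_times)
    """
    m = spectra.shape[1]
    # Dynamic spectra: subtract the time-averaged spectrum at each wavelength.
    dynamic = spectra - spectra.mean(axis=1, keepdims=True)

    # Hilbert-Noda matrix: N[j, k] = 0 if j == k, else 1 / (pi * (k - j)).
    j, k = np.indices((m, m))
    diff = k - j
    safe = np.where(diff == 0, 1, diff)  # avoid dividing by zero on the diagonal
    noda = np.where(diff == 0, 0.0, 1.0 / (np.pi * safe))

    sync = dynamic @ dynamic.T / (m - 1)
    asyn = dynamic @ noda @ dynamic.T / (m - 1)
    return sync, asyn


# Synthetic example: one peak grows early, a second peak grows late.
wavelengths = np.linspace(200.0, 600.0, 81)
times = np.linspace(0.0, 1.0, 50)
early = 1.0 - np.exp(-8.0 * times)
late = times ** 3
peak_vis = np.exp(-(((wavelengths - 530.0) / 20.0) ** 2))
peak_uv = np.exp(-(((wavelengths - 250.0) / 20.0) ** 2))
spectra = np.outer(peak_vis, early) + np.outer(peak_uv, late)

sync, asyn = correlation_2d(spectra)
```

The synchronous spectrum is symmetric and the asynchronous spectrum is antisymmetric about the diagonal. When every wavelength follows the same time profile, the asynchronous spectrum vanishes, which is why cross-peaks in it signal events that occur at different times.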

Quality control
Scikit-spectra includes a preliminary nose [24] test framework, inspired by the excellent Pandas test suite. A collection of tutorials and website examples is batch-run to catch breaking changes that are not covered by the nose tests.

Reuse potential
Scikit-spectra is designed for generalized applications and is built on the already successful Pandas library. Scikit-spectra's core design is adaptable to many branches of spectroscopy, such as NMR, IR and Raman. Ideally, each branch will eventually be supported as a distinct subset of scikit-spectra, built on the same core framework. This is a long-term goal and will require contributions from many scientists and developers. The vision for scikit-spectra is to adapt and interface with other spectroscopy libraries, not supplant them. Unifying Python's spectroscopy libraries, whether or not it ultimately involves scikit-spectra, is critical to bringing open-source solutions to the research community, much in the same way that ImageJ [25] has brought open-source image processing into the mainstream. Interested developers are welcome to contact the authors with suggestions or ideas.
• Nicholas Bollweg (IPython)
• Jeff Reback and Stephan Hoyer (Pandas)
• Jonathan March and Robert Kern (Traits)

Fig. 5: Asynchronous correlation spectrum of gold nanoparticle absorbance at wavelengths ranging from 200-600 nm. The strong cross-peaks between the UV and plasmon resonance regions (roughly 525-550 nm) correspond to the time after nanoparticle-protein binding is saturated. Addition of more protein causes a UV response, but no response in the plasmon resonance region, resulting in this asynchronicity. Sideplots show the mean-centered average spectrum from the full set.
Notes
1 Scikit stands for SciPy Toolkit; scikits are SciPy-based libraries deemed too specialized to live in the core SciPy distribution.
2 The term "explorative" refers to an API compatible with SciPy libraries to streamline customized analysis and visualization.
3 These examples are available in a single notebook at: http://nbviewer.ipython.org/github/hugadams/scikit-spectra/blob/master/examples/Notebooks/grad_presentation.ipynb
4 The synchronous and asynchronous correlation spectra are fundamental to 2DCS. Some important new developments include generalized scaling of correlation spectra[X] and the derivation of the so-called codistribution spectra[X], both of which are built into scikit-spectra's 2DCS API.
5 2DCS is also applicable to non-temporal datasets, for example spectra changing as a function of pressure, temperature or any other "perturbation variable".