J. Orzel et al. / Fuel 117 (2014) 224-229
etc., facilitates the analysis of any trends in the data. When loading plots are interpreted, one can study the contribution of the explanatory variables to the construction of each PC and explain the specific data structure that is observed on the corresponding score plot [18].
2.2.2. Construction of discrimination models - partial least squares discriminant analysis
The purpose of discrimination methods is to divide a set of samples into a number of groups that are characterized according to their physico-chemical properties, e.g., the discrimination of fuel samples with an increased or decreased tax rate. Discrimination based on multivariate chemical data can be performed using partial least squares discriminant analysis (PLS-DA) [19]. It is a useful technique when the data contain a large number of correlated variables because only a few latent factors are necessary to efficiently model the relationship between a set of explanatory variables (EEMs) and a dependent variable. For a two-group problem, the dependent variable is a binary vector that indicates the group membership of each sample (a tax level, increased or decreased). A PLS-DA model is constructed for a limited number of latent variables, and their number is called the model complexity, f. Constructing a model with acceptable discrimination power requires optimizing the number of latent variables. This is usually done using a cross-validation procedure [20].
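The model-building idea described above can be illustrated with a minimal sketch. It assumes a single response coded +1/-1, a plain PLS1 fit with score-by-score deflation, and leave-one-out cross-validation that counts misclassified samples; the function names (`pls1_fit`, `pls1_predict`, `loo_errors`) and the toy data are hypothetical and are not taken from the paper, whose cross-validation settings are not stated in this section.

```python
import math

def center(X, y):
    """Column-center X and y; return centered copies plus the means."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]
    return Xc, yc, x_mean, y_mean

def pls1_fit(X, y, n_lv):
    """Fit a PLS1 model with at most n_lv latent variables (assumed +1/-1 coding of y)."""
    Xc, yc, x_mean, y_mean = center(X, y)
    n, m = len(Xc), len(Xc[0])
    comps = []  # one (weights w, loadings p, inner coefficient q) per latent variable
    for _ in range(n_lv):
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # no covariance left to model
            break
        w = [v / norm for v in w]
        t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent variable.
        Xc = [[Xc[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        yc = [yc[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict the continuous response for one sample; its sign gives the group."""
    x_mean, y_mean, comps = model
    r = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(r[j] * w[j] for j in range(len(r)))
        y_hat += q * t
        r = [r[j] - t * p[j] for j in range(len(r))]
    return y_hat

def loo_errors(X, y, max_lv):
    """Leave-one-out misclassification count for each model complexity."""
    errors = {}
    for n_lv in range(1, max_lv + 1):
        wrong = 0
        for i in range(len(X)):
            model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_lv)
            predicted = 1 if pls1_predict(model, X[i]) >= 0 else -1
            wrong += predicted != y[i]
        errors[n_lv] = wrong
    return errors

# Toy data invented for illustration: two correlated variables plus one noisy one.
X = [[1.0, 2.0, 0.1], [1.2, 2.3, -0.1], [0.9, 1.9, 0.0], [1.1, 2.1, 0.2],
     [-1.0, -2.0, 0.1], [-1.1, -2.2, -0.2], [-0.9, -1.8, 0.0], [-1.2, -2.4, 0.1]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

errors = loo_errors(X, y, max_lv=2)
# Pick the smallest complexity among those with the fewest cross-validation errors.
best_lv = min(errors, key=lambda k: (errors[k], k))
```

Preferring the smallest complexity among equally good candidates is a design choice of this sketch; other selection rules are possible.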
2.3. Data set and data pretreatment
2.3.1. Data preprocessing
The first preprocessing step for excitation-emission fluorescence matrices is to handle the Rayleigh scattering. This is a chemically irrelevant component of the EEM spectra and can thus negatively influence their modeling. In the literature, there are several proposals as to how it can be corrected. For instance, the Rayleigh scattering can be removed from the EEM data by replacing the corresponding values with zeros [21,22], treating them as missing values [23], or interpolating over them using different approaches [24,25]. In this paper, the Rayleigh scattering was removed from the EEMs and the removed region was interpolated using Delaunay triangulation, as described in Ref. [24].
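The two simplest options listed above (zeros or missing values) can be sketched as follows. This is not the Delaunay-based interpolation used in the paper: it only marks the first-order Rayleigh band, assumed here to be the region where the emission wavelength lies within a fixed distance of the excitation wavelength. The function name `mask_rayleigh`, the `half_width` value, and the wavelength grids are hypothetical.

```python
def mask_rayleigh(eem, ex_wl, em_wl, half_width=10.0, fill=None):
    """Return a copy of `eem` (rows = excitation, columns = emission) in which
    the assumed first-order Rayleigh band |em - ex| <= half_width is replaced
    by `fill` (None marks a missing value, 0.0 replaces the band with zeros)."""
    masked = []
    for i, ex in enumerate(ex_wl):
        row = []
        for j, em in enumerate(em_wl):
            in_band = abs(em - ex) <= half_width
            row.append(fill if in_band else eem[i][j])
        masked.append(row)
    return masked

# Tiny illustrative grid (wavelengths in nm, intensities all 1.0).
ex_wl = [300.0, 320.0]
em_wl = [290.0, 300.0, 310.0, 320.0, 330.0, 340.0]
eem = [[1.0] * len(em_wl) for _ in ex_wl]

as_missing = mask_rayleigh(eem, ex_wl, em_wl)          # band marked as missing
as_zeros = mask_rayleigh(eem, ex_wl, em_wl, fill=0.0)  # band replaced by zeros
```

Entries marked as missing would then be the points to fill in by whichever interpolation scheme is chosen.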
Next, the EEMs were unfolded. A single fluorescence matrix was rearranged into a row vector containing 46 augmented emission spectra (each consisting of 226 sampling points, i.e., 46 × 226 = 10396 elements), see Fig. 1. The 180 unfolded EEMs produced a data matrix of size 180 × 10396.
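As a minimal sketch of the unfolding step, assuming each EEM is stored with one emission spectrum per row (46 rows of 226 points), the rows can simply be concatenated. The function name `unfold` and the placeholder intensities are hypothetical.

```python
def unfold(eem):
    """Concatenate the emission spectra (rows) of one EEM into a single row vector."""
    return [value for spectrum in eem for value in spectrum]

n_spectra, n_points = 46, 226

# Placeholder EEM whose entries encode their own position, for illustration only.
eem = [[float(i * n_points + j) for j in range(n_points)] for i in range(n_spectra)]

row = unfold(eem)                           # one unfolded EEM
data = [unfold(eem) for _ in range(180)]    # 180 unfolded EEMs stacked as rows
```

With this layout, element `k` of the unfolded vector corresponds to spectrum `k // 226` and sampling point `k % 226`.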
The last step in the data preprocessing workflow consisted of averaging the fluorescence intensities over the three technical replicates recorded for each laboratory sample. The mean spectra formed the final data set, a matrix of size 60 × 10396 (samples × (emission wavelengths × excitation wavelengths)).
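The replicate averaging can be sketched as below, under the assumption (not stated in the text) that the three technical replicates of each laboratory sample occupy consecutive rows of the unfolded data matrix. The function name `average_replicates` is hypothetical.

```python
def average_replicates(rows, n_rep=3):
    """Average every block of `n_rep` consecutive rows, column by column."""
    if len(rows) % n_rep != 0:
        raise ValueError("number of rows must be a multiple of n_rep")
    averaged = []
    for start in range(0, len(rows), n_rep):
        group = rows[start:start + n_rep]
        averaged.append([sum(column) / n_rep for column in zip(*group)])
    return averaged

# Two laboratory samples, three replicates each, two variables (illustrative values).
replicates = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],
              [0.0, 0.0], [0.0, 3.0], [0.0, 6.0]]
means = average_replicates(replicates)
```

Applied to a 180 × 10396 matrix, the same function would return 60 mean spectra of length 10396.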
2.3.2. Discrimination schemes
Before the construction of a dependent variable, a careful study of any possible fluctuations in the chemical composition of the samples should be carried out. For example, if a diesel oil with a full-rate tax value has been transported in a container that had previously been used to transport a fuel with another tax level, the container may be a source of residues from a spiked oil that will mix with a diesel oil that does not have excise tax additives. Such samples can therefore be incorrectly recognized as samples that have undergone the sorption process. Thus, the mere presence of tax additives would be an insufficient criterion for discriminating oil samples and drawing a definite conclusion. A discrimination model should also take into account the concentration levels of the excise tax additives (described by law) as a discrimination criterion. According to Polish law [3], Solvent Yellow 124 is introduced to diesel oil in an amount that varies between 6.0 mg L⁻¹ and 9.0 mg L⁻¹. The concentration level of SR19 must be at least 6.3 mg L⁻¹. Taking into account the concentration criterion and the experimental settings, some of the available samples can be labeled as either full rate of duty or rebated tax samples. However, a data set may contain samples whose marker and dye concentrations are ambiguous according to the law and cannot support any definite discrimination. For example, a sample may have a marker concentration of 7 mg L⁻¹ (which defines a rebated tax oil) and a dye concentration of 2 mg L⁻¹ (which indicates an oil with a full rate of duty). Therefore, in this study four different discrimination problems were considered. Each discrimination problem is described by a dependent binary variable. In all of the discrimination cases, the available samples were divided into two groups with respect to the concentration of the additive(s).
Samples included in the full rate of duty group of oils are denoted as '1' and the remaining samples (from the rebated tax group) as '-1'. The design of the experiment and the discrimination schemes that were investigated are presented in Fig. 2.
The first discrimination problem, n1, is based on the concentration of the marker compound (samples with an SY124 concentration below the level described in the law were included in the full rate of duty group, and the rest of the samples formed the rebated tax group). The concentration of the dye was used as the second criterion (n2). The remaining two discrimination problems focus on both additives. For n3, the concentrations of both the marker and the dye should simultaneously exceed the levels described by the law to assign samples to the group of oils with a full rate of duty. For n4, the concentration of at least one of the tax additives should be equal to or larger than the specified concentration in order to recognize a sample as a rebated tax oil.
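The construction of the dependent variable can be sketched for the two single-additive rules and the "at least one additive" rule (n4). The sketch assumes the thresholds quoted in the text (6.0 mg/L for the marker, 6.3 mg/L for the dye), treats a concentration equal to the threshold as reaching the legal level, and codes the full rate of duty group as +1 and the rebated tax group as -1. The function names are hypothetical, and the two-additive rule n3 is left out of the sketch.

```python
MARKER_MIN = 6.0  # mg/L, lower legal level of the marker quoted in the text
DYE_MIN = 6.3     # mg/L, minimum legal level of the dye quoted in the text

FULL_RATE, REBATED = 1, -1  # assumed coding of the binary dependent variable

def label_marker_only(marker, dye):
    """Marker-based rule: full rate when the marker is below its legal level."""
    return FULL_RATE if marker < MARKER_MIN else REBATED

def label_dye_only(marker, dye):
    """Dye-based rule: full rate when the dye is below its legal level."""
    return FULL_RATE if dye < DYE_MIN else REBATED

def label_at_least_one(marker, dye):
    """n4-style rule: rebated when at least one additive reaches its legal level."""
    return REBATED if (marker >= MARKER_MIN or dye >= DYE_MIN) else FULL_RATE

# The ambiguous sample from the text: marker 7 mg/L, dye 2 mg/L.
marker, dye = 7.0, 2.0
labels = (label_marker_only(marker, dye),
          label_dye_only(marker, dye),
          label_at_least_one(marker, dye))
```

For the ambiguous sample the marker-based and dye-based rules disagree, which is the situation that motivates considering several discrimination problems.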
Fig. 1. Different possibilities of EEM data representation - the unfolding of the EEM fluorescence fingerprint. (Panels: original data; unfolded data.)