
Java-based Exploratory Data Analysis for EEG

Developer notes

 

Introduction

A developer should read the User Documentation first, since it describes the top-down rationale for the processing pipeline. Then they should look at the section Outline of application below for the bottom-up structure of this application: the code and additional resources are diverse and scattered, so this roadmap may be useful. Later sections deal with more nitty-gritty matters.

Outline of application

The distribution includes not just Java source code, but also configuration files, native code, some data, libraries, documentation, scripts, and code used for performing tests.
|-- build/
|-- data/
|-- html/
|-- jri/
|-- lib/
|-- native32/
|-- native64/
|-- native-src/
|-- scripts/
|-- servercache/
|-- shellscripts/
|-- src/
|-- test-src/
|-- backend.properties
|-- build.xml
|-- demo
|-- demo.bat
|-- derby.properties
|-- info
|-- info.bat
`-- logging.properties

The main body of source code is located under src/, and comprises about 140 Java files in 12 functional groups. A (truncated) listing is shown below:

|-- backendClasses/
|   `-- Preproc.java
|-- channelClasses/
|   |-- Channel.java
|   |-- ChannelScript.java
|   `-- Gratton.java
|-- com/
|   `-- nr/
|-- dsp/
|   |-- FIR/
|   `-- IIR/
|-- epochClasses/
|   |-- EpochScript.java
|   |-- EpochSelector.java
|   |-- QueryResult.java
|   |-- QueryResultServerside.java
|   |-- QueryResultSuper.java
|   `-- Subset.java
|-- frontendClasses/
|   |-- CLI.java
|   |-- Demo.java
|   |-- MemoryMonitor.java
|   |-- Prescorer.java
|   `-- QueryDB.java
|-- generalClasses/
|   |-- BrowserLaunch.java
|   |-- CliHelpFrame.java
|   |-- ColorTableCellEditor.java
|   |-- CommonLog.java
|   |-- DataMode.java
|   |-- DetrendedFA.java
|   |-- Event.java
|   |-- GBC.java
|   |-- HmsFormatter.java
|   |-- HtmlHelpFrame.java
|   |-- IsoFormatter.java
|   |-- OptionArg.java
|   |-- ParamUtils.java
|   |-- QRFormatterDOM.java
|   |-- QRFormatterMat.java
|   |-- QRFormatterR.java
|   |-- QRFormatterSQL.java
|   |-- QRFormatterTxt.java
|   |-- QRSpt.java
|   |-- QueryResultGrid.java
|   |-- RAMJavaFileObject.java
|   |-- RuntimeCompiler.java
|   |-- Site.java
|   |-- SiteScalp.java
|   |-- SiteSet.java
|   |-- StreamingLogHandler.java
|   |-- StringJavaFileObject.java
|   |-- TabulateSeries.java
|   |-- Trans.java
|   |-- Unit.java
|   `-- Units.java
|-- jnt/
|   `-- FFT/
|-- outputClasses/
|   |-- DBExport.java
|   |-- Export.java
|   `-- GenericOutput.java
|-- plotClasses/
|   |-- scores/
|   |-- Cartesian.java
|   |-- CartesianX.java
|   |-- GenericDisplayFrame.java
|   |-- QRStatistics.java
|   |-- Stack.java
|   |-- Summary.java
|   `-- TiledStack.java
|-- recordingClasses/
|   |-- FileFormat.java
|   |-- Recording.java
|   |-- RecordingCnt.java
|   |-- RecordingEdf.java
|   |-- RecordingEeg.java
|   |-- RecordingException.java
|   |-- RecordingExceptions.java
|   |-- RecordingMec.java
|   `-- RecordingNs5.java
|-- scorerClasses/
|   |-- scores/
|   |-- CallR.java
|   |-- DFA.java
|   |-- EegFit.java
|   |-- EegFitInterface.java
|   |-- GenericScorer.java
|   |-- HeartRate.java
|   |-- ReactionTime.java
|   |-- SpectPL.java
|   |-- TradERP.java
|   `-- TradSpect.java
|-- seriesClasses/
|   |-- binaryEvents/
|   |-- seriesGeneration/
|   |-- BasicStats.java
|   |-- Series.java
|   |-- SeriesAnalog.java
|   |-- SeriesBinary.java
|   |-- SeriesException.java
|   `-- Transformation.java
|-- serverPlotClasses/
|   |-- FullTimeSeries.java
|   `-- RecordingSummary.java
`-- utilClasses/
    |-- BuildDB1.java
    |-- BuildDB2.java
    `-- BuildDB3.java

All building and testing of Jeda can be performed via Ant tasks:

 ant all         Clean build directories, then compile and test [default]
 ant archive     Create archive of all source files, JARs, scripts
 ant clean       Delete build, docs and test directories
 ant cleanCache  Delete all files in the server cache
 ant compile     Compile Java sources and associated native code
 ant help        Describe Ant and program options (Linux version only)
 ant isodist     Create ISO comprising Java source+class and supporting files
 ant jardist     Create jar of Java source files, and supporting files
 ant javadoc     Create static and API documentation
 ant linecount   Count lines of original java source code
 ant non         Compile the Java source files
 ant r2r         Create binary distribution, with supporting files
 ant testa       Compile tests
 ant testb       Compile and run tests: unit only
 ant testc       Compile and run tests: unit and end-to-end
 ant testd       Compile and run tests: tests under development

The functional outline is sketched below. It shows how

Schematic overview of program

Adding new time-series options

Time-series operations (those associated with the -scriptChannel command line option) cover filtering, re-referencing, artifact correction, resampling and decompositions. Most of the code for these operations is built into the application, so the scripts usually just need to call pre-existing methods within the Channel or the SeriesAnalog classes.

The same operations can also be used to generate artificial (or 'mock') time-series, which is very useful for testing new analysis methods. The directory scripts/channelClasses/ contains examples of scripts that generate mock datasets, each approximating data from one of the standard Brain Resource paradigms. These datasets can have known ERPs, known variations across the scalp, known noise characteristics, and known event timings.

Each script typically (i) defines a subset of channels, (ii) performs some operation on that subset, and (iii) updates the parent list. Often the input data will contain a mixture of EEG, EOG, EMG, ECG, etc. channels. The scripts will often have to perform operations tailored to each of the recording modalities, and repeat the above steps for each subset.

There are many methods available for creating and refining subsets:

listsUnion(list1,list2,...) Returns a merging of multiple lists
listsIntersect(list1,list2) Returns the intersection of two lists
selectMode(DataMode mode) Returns a list of time-series matching 'mode'
discardMode(DataMode mode) Returns a list of time-series not matching 'mode'
selectSite(String regex) Returns a list of time-series with names matching 'regex'
selectSite(String regex, list) Returns the subset of 'list' that have names matching 'regex'
discardSite(String regex, list) Returns the subset of 'list' that have names not matching 'regex'
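
As a rough illustration of how these subset helpers compose, here is a minimal self-contained sketch. It is not Jeda's code: the class SubsetSketch is hypothetical, channels are reduced to plain site names rather than time-series objects, and I assume that a regex must match the whole name (the real methods may behave differently).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the subset helpers; a channel is just its site name here. */
class SubsetSketch {

    /** Subset of 'list' whose names match 'regex' (whole-name match is an assumption). */
    static List<String> selectSite(String regex, List<String> list) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String name : list) {
            if (p.matcher(name).matches()) out.add(name);
        }
        return out;
    }

    /** Subset of 'list' whose names do not match 'regex'. */
    static List<String> discardSite(String regex, List<String> list) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String name : list) {
            if (!p.matcher(name).matches()) out.add(name);
        }
        return out;
    }

    /** Merges several lists, keeping first-seen order and dropping duplicates. */
    @SafeVarargs
    static List<String> listsUnion(List<String>... lists) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> l : lists) merged.addAll(l);
        return new ArrayList<>(merged);
    }

    /** Elements of 'a' that also occur in 'b'. */
    static List<String> listsIntersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        List<String> all = List.of("Fz", "Cz", "Pz", "VEOG", "HEOG", "ECG");
        List<String> midline = selectSite("[A-Z]z", all);   // scalp midline sites
        List<String> eog = selectSite(".*EOG", all);        // both EOG channels
        System.out.println(listsUnion(midline, eog));
        System.out.println(discardSite(".*EOG|ECG", all));
        System.out.println(listsIntersect(midline, List.of("Cz", "Oz")));
    }
}
```

A real script would pass lists of time-series rather than strings, but the pattern of building one subset per recording modality is the same.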

The second step is typically to do something to a subset of time-series. The script will iterate through a list, calling the appropriate methods on each time-series. Many methods exist within the SeriesAnalog class for basic waveform arithmetic and filtering. Methods that operate on multiple channels simultaneously are:

av(list) Returns the result of spatial averaging
doGratton(float epochDur) Performs EOG artifact correction
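
To make 'spatial averaging' concrete, the sketch below computes a point-by-point mean across channels and then subtracts it from one channel, as in average re-referencing. This is an assumption-laden illustration: SpatialAverageSketch is a hypothetical class, waveforms are bare float arrays, and the real av(list) operates on time-series objects whose internals are not shown here.

```java
import java.util.Arrays;

/** Hypothetical illustration of spatial averaging; not the Jeda implementation. */
class SpatialAverageSketch {

    /** Point-by-point mean across channels; all channels must have the same length. */
    static float[] av(float[][] channels) {
        if (channels.length == 0) throw new IllegalArgumentException("no channels");
        int n = channels[0].length;
        float[] mean = new float[n];
        for (float[] ch : channels) {
            if (ch.length != n) throw new IllegalArgumentException("unequal lengths");
            for (int i = 0; i < n; i++) mean[i] += ch[i];
        }
        for (int i = 0; i < n; i++) mean[i] /= channels.length;
        return mean;
    }

    /** Re-references one channel by subtracting a reference waveform. */
    static float[] subtract(float[] series, float[] reference) {
        if (series.length != reference.length) throw new IllegalArgumentException("unequal lengths");
        float[] out = new float[series.length];
        for (int i = 0; i < series.length; i++) out[i] = series[i] - reference[i];
        return out;
    }

    public static void main(String[] args) {
        float[][] eeg = { {1f, 2f, 3f}, {3f, 4f, 5f} };
        float[] mean = av(eeg);
        System.out.println(Arrays.toString(mean));
        System.out.println(Arrays.toString(subtract(eeg[0], mean)));
    }
}
```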

The third step is to update the list of raw time-series, or — in the case of mock data — to initialize the list. It may also be necessary to initialize the list of stimulus/response events: this too is possible.

replaceAllSeries(list) Replace the current Recording's series with the specified list
updateMatchingSeries(list) Update the current Recording's series with the specified list
appendToCurrentSeries(series) Append one series to the Recording's set of series
replaceAllEvents(newEvents) Replace the current Recording's event list with the specified list

New scripts should be placed somewhere under scripts/channelClasses, and called via the -scriptChannel command line option.

Adding new epoch options

Epoch-related scripts (those associated with the -scriptEpoch command line option) generate a table of event attributes. The rows of the table correspond to 'events'; these events might be stimulus events, response events, or synthetic events computed by the script. The script takes the list of low-level events created during reading of the data file, and is free to augment or transform that list in any way. Thus a single low-level event of known finite duration might be represented in the attribute table as one onset event and one offset event. Or synthetic events might be created that are time-locked to the onset of an alpha burst. There are no restrictions on what is regarded as an event.

The columns of the attribute table are similarly unconstrained. Each column has a label and data type, and one value per row. The column labels are alphanumeric strings (no spaces, case-insensitive); the allowed data types are integer, float, double, string, or timestamp; and the values are the corresponding Java objects (or null). The data types are named by strings like "FLOAT", "INTEGER", "VARCHAR5", "VARCHAR10", etc. (strings are represented this way to facilitate their subsequent export to an SQL table).

Accordingly, the principal job of the script is to generate two arrays, String[2][nCols] th and Object[nRows][nCols] td, and to return this information to the calling method. The meta information, th, combines the label and data type of each column. The abstract superclass specifies methods such that the attribute values, td, are returned one row at a time via an iterator: this suits later translation into SQL UPDATE commands.
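
The shape of these two arrays can be sketched as follows. Everything here is illustrative: AttributeTableSketch is a hypothetical class, not the abstract superclass mentioned above, and I assume that th[0] holds the labels and th[1] the data types (the text does not say which way round they are stored).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical container for the th/td pair; not Jeda's actual superclass. */
class AttributeTableSketch implements Iterable<Object[]> {
    final String[][] th;  // assumed layout: th[0] = column labels, th[1] = data types
    final Object[][] td;  // one row per event, one value (or null) per column

    AttributeTableSketch(String[][] th, Object[][] td) {
        if (th.length != 2 || th[0].length != th[1].length) {
            throw new IllegalArgumentException("th must be String[2][nCols]");
        }
        for (Object[] row : td) {
            if (row.length != th[0].length) {
                throw new IllegalArgumentException("every row needs nCols values");
            }
        }
        this.th = th;
        this.td = td;
    }

    /** Hands back the attribute values one row at a time. */
    @Override
    public Iterator<Object[]> iterator() {
        return new Iterator<Object[]>() {
            private int rowIndex = 0;

            @Override
            public boolean hasNext() {
                return rowIndex < td.length;
            }

            @Override
            public Object[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                return td[rowIndex++];
            }
        };
    }

    public static void main(String[] args) {
        String[][] th = {
            {"Time", "Stim", "TgOrd"},
            {"DOUBLE", "VARCHAR5", "INTEGER"}
        };
        Object[][] td = {
            {3.232, "Tg", 1},
            {6.459, "Bg", null}  // null marks a missing value
        };
        for (Object[] row : new AttributeTableSketch(th, td)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Returning rows through an iterator means the caller never needs the whole table at once, which fits the row-at-a-time translation into SQL.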

Much of the power of this program derives from imaginative use of this script. Attaching an arbitrary number of attributes to any number of specific time points means that epochs can be binned in a way to tell any story.

Apart from the attribute table, the script also returns the offset (from each event) to the start of the epoch, and the duration of epochs. This might seem a separate matter that can be chosen later; however, choosing these values within the script allows extra checking of which events to include in the attribute table. Events near the start and end of a recording will be disqualified if the corresponding epoch extends beyond the limits of the recording.
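
The disqualification rule amounts to a simple bounds check. The sketch below is my own reading of it, under stated assumptions: times are in seconds from the start of the recording, the offset is signed (negative for a pre-event baseline), and the class and method names are hypothetical.

```java
/** Hypothetical check of whether an epoch lies wholly inside the recording. */
class EpochBoundsSketch {

    /**
     * @param eventTime         time of the event, in seconds from the start of the recording
     * @param offset            signed offset from the event to the start of the epoch, in seconds
     * @param duration          epoch duration, in seconds
     * @param recordingDuration total length of the recording, in seconds
     */
    static boolean fitsInRecording(double eventTime, double offset,
                                   double duration, double recordingDuration) {
        double start = eventTime + offset;
        double end = start + duration;
        return start >= 0.0 && end <= recordingDuration;
    }

    public static void main(String[] args) {
        double offset = -0.2, duration = 1.0, recording = 600.0;
        System.out.println(fitsInRecording(0.1, offset, duration, recording));   // epoch starts before the recording
        System.out.println(fitsInRecording(3.232, offset, duration, recording)); // epoch lies inside the recording
        System.out.println(fitsInRecording(599.5, offset, duration, recording)); // epoch runs past the end
    }
}
```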

New scripts should be placed somewhere under scripts/epochClasses, and called via the -paradigm and -scriptEpoch command line options.

Adding new scoring options

The result of the prior stages in the processing pipeline is a data object comprising a 2-D grid of 'frames', where each frame is a set of waveforms. See this example of the canonical representation of such an object. The rows and columns of the grid are implied by the arguments to -binV and -binH, while the waveforms within each frame are distinguished by the argument to -binZ.

The job of the scoring plug-in is to compute new values: values that might be associated with individual waveforms, or global values associated with the entire frame. Some examples are provided under the directory scorerClasses (simple examples and templates).

The second job of each scoring plug-in is to bundle the scores in such a way that the values can be plotted (where appropriate) by display plug-ins, and can be exported in various formats for separate statistical analysis. This is a complex matter. See How to manage scores for more about the data structures used for scores.

Adding new display options

Display (and export) options take the grid of frames, possibly augmented by scores, and generate output.

It is a challenge of sorts to obtain the time-series and scores from the grid of frames (particularly the fancy use of reflection). However it should be minor compared with the challenge of building a good GUI.

Ideas for additional display options:

Verification and validation

Verification and validation during development involves both fine-grained and overall tests. There is a unit testing suite implemented for Jeda, which tests large or small program components in isolation.

In one set of tests, specific inputs are supplied to methods, and the returned values are compared to expected values (ant testb). This is well-suited to verification of the preprocessing (server) aspects of this application. In another set of tests, command-line arguments are passed to the main routine, and the full pipeline is checked against expected results (ant testc).

 ant testa       Compile tests
 ant testb       Compile and run tests: unit only
 ant testc       Compile and run tests: unit and end-to-end
 ant testd       Compile and run tests: tests under development

Unit testing requires that the Jar file from TestNG be installed. I unzip the TestNG package in its own directory, and then edit the property testng.jar in the build file to point to the relevant jar file.

The command line options -previewSeries, -reviewSeries and -v also provide diagnostics, with user-selectable levels of detail.

Further remarks

This division of the overall task into three successive steps (dealing with time-series operations, epoch selection, and output operations) seems reasonable. Generally time-series operations do not depend on events, and epoch selection does not affect time-series. However it is easy to imagine situations that challenge this separation of tasks. For example epochs might have as an attribute the phase of the alpha power at the time of the stimulus: this would be generated as part of epoch selection but requires access to the EEG time-series from one scalp site. Another example is if some time-series filter operation has to be applied at times marked by events, such as events indicating the onset of an fMRI volume acquisition. These occasional complications can be accommodated, but are mentioned here to show why EEG analysis programs will never be simple to program, nor simple to use.

The complications multiply when there are acceptance criteria applied to the data. It can be unclear at what stage of the process the criteria are applied: before or after filtering, before or after epoching?

The approach adopted here embraces scripting, which is explicit, flexible and reproducible. Also it copes well with arbitrary complexity, unlike GUIs. The scripts used by this program are central to its ability to transcend the humdrum, over-familiar 'Tg vs Bg' level of analysis, and to do so with some level of convenience. Thus much of the complexity of analyses is encapsulated in a set of scripts for performing operations related to time-series, and another set related to epoch selection. This, plus the arbitrary number of view and scoring options, should cover most eventualities.

A nice example of the virtues of scripting is in the testing of scoring and display algorithms. The scripts can be used to generate continuous time-series with particular known characteristics, which can be compared to the output.

So far it appears that time-series operations (spatial averaging, re-montaging, filtering, EOG correction, etc) are relatively simple to implement. That is not true of the event information.

The plan is to deal with all these issues in a stepwise fashion, from decoding the raw events through to presenting a table from which users can choose epochs in a natural way. The following should be performed during the loading of each data file, due to the hardware-dependent nature of this data:

  1. Combine raw events within the specified file with any other sources of information (DB, .poX files, etc)
  2. Translate raw events and other information into a generic form, which may have a duration as well as a time associated with it.

The following event processing should be performed within a paradigm-specific script:

  1. It may be necessary to scan the generic events to determine, say, the mean reaction time to targets, which could be used in a later step to classify responses as one of {slow, ≈mean, fast}.
  2. It may be necessary to scan the time series to, say, separate SCL from SCR, or to do automatic detection of hypoxic events.
  3. Define a set of factors, each with a number of levels, for example StimResp={TgNoresp, BgNoresp, TgResp, BgResp}. Also, some additional levels might be calculated from the time-series, like alpha phase at the time of the stimulus, or some estimate of the size of possible artifacts.
  4. For each factor level specify a pattern in terms of generic events. These patterns will often specify some semantic context, e.g. 'TgResp' is 'stim=Tg and button response between 0.1 and 0.5 seconds later', or 'OffResp' is 'end of visual stimulus'. Note that the awkwardness arising from punctate and extended events is eliminated by proper factor definitions, so that the subsequent epoch selection process is uncomplicated by consideration of event durations.
  5. Step through all generic events, and see what patterns are matched. This will enable a database table to be generated, consisting of event times and factor levels, one per factor.
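
Steps 4 and 5 can be sketched in a few lines for the StimResp factor. This is only an illustration of the idea, not the planned implementation: the Event class, the event kinds "Tg", "Bg" and "Button", and the inclusive 0.1 to 0.5 second window are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pattern matcher for the StimResp factor. */
class StimRespSketch {

    /** A generic event reduced to a time (seconds) and a kind; both fields are illustrative. */
    static final class Event {
        final double time;
        final String kind;

        Event(double time, String kind) {
            this.time = time;
            this.kind = kind;
        }
    }

    /** Returns one StimResp level per stimulus event, in order of occurrence. */
    static List<String> classify(List<Event> events) {
        List<String> levels = new ArrayList<>();
        for (Event stim : events) {
            if (!stim.kind.equals("Tg") && !stim.kind.equals("Bg")) continue;
            boolean responded = false;
            for (Event e : events) {
                double lag = e.time - stim.time;
                // Pattern: button response between 0.1 and 0.5 seconds after the stimulus
                if (e.kind.equals("Button") && lag >= 0.1 && lag <= 0.5) {
                    responded = true;
                    break;
                }
            }
            levels.add(stim.kind + (responded ? "Resp" : "Noresp"));
        }
        return levels;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(3.232, "Tg"));
        events.add(new Event(4.452, "Tg"));
        events.add(new Event(4.800, "Button"));  // 0.348 s after the second target
        events.add(new Event(6.459, "Bg"));
        events.add(new Event(7.862, "Tg"));
        events.add(new Event(8.500, "Button"));  // 0.638 s after the last target: too late
        System.out.println(classify(events));
    }
}
```

Each returned level would become one cell of the StimResp column, alongside the event time.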

Then the user, using standard SQL commands, can select a number of rows from the database specific to the analysis goal, e.g. "(StimResp='TgNoresp' or StimResp='TgResp') and Site LIKE '_z' and Artifact<100". Thus a subset of both times and channels can be selected by a common mechanism. See the following suggestion of what a DB table might look like.

Time   Stim  StimResp  TgOrd  BgOrd  SCR  Site  Artifact  ID
3.232  Tg    TgNoResp  1      NA     f    Cz    5.4       10023883
4.452  Tg    TgResp    2      NA     t    Cz    9.4       10023883
6.459  Bg    BgNoResp  NA     1      f    Cz    32.1      10023883
7.862  Tg    TgNoResp  3      NA     t    Pz    105.8     10023883
:      :     :         :      :      :    :     :         :

The table is useful for selection of epochs, but also the factor levels perform an additional role as epoch attributes during the viewing and scoring phases. In those phases they have a slightly different meaning, as they are regarded as attributes of the entire epoch, rather than being linked to a specific time during the acquisition period.
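
To show what the quoted selection expression would pick out, the sketch below applies the same three conditions in plain Java to the four example rows, as I read them from the table. The class is hypothetical, only four of the columns are kept, and the factor levels are spelled as in the table. The SQL pattern '_z' (any single character followed by 'z') is rendered as the regex ".z".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory equivalent of the SQL row selection. */
class EpochSelectSketch {

    /** One table row, reduced to the columns used by the selection. */
    static final class Row {
        final double time;
        final String stimResp;
        final String site;
        final double artifact;

        Row(double time, String stimResp, String site, double artifact) {
            this.time = time;
            this.stimResp = stimResp;
            this.site = site;
            this.artifact = artifact;
        }
    }

    /** (StimResp='TgNoResp' or StimResp='TgResp') and Site LIKE '_z' and Artifact<100 */
    static boolean matches(Row r) {
        boolean stimOk = r.stimResp.equals("TgNoResp") || r.stimResp.equals("TgResp");
        boolean siteOk = r.site.matches(".z");
        return stimOk && siteOk && r.artifact < 100.0;
    }

    static List<Row> select(List<Row> rows) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (matches(r)) out.add(r);
        }
        return out;
    }

    /** The four example rows, as read from the table (an interpretation of its layout). */
    static List<Row> exampleRows() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(3.232, "TgNoResp", "Cz", 5.4));
        rows.add(new Row(4.452, "TgResp", "Cz", 9.4));
        rows.add(new Row(6.459, "BgNoResp", "Cz", 32.1));
        rows.add(new Row(7.862, "TgNoResp", "Pz", 105.8));
        return rows;
    }

    public static void main(String[] args) {
        for (Row r : select(exampleRows())) {
            System.out.println(r.time + " " + r.stimResp + " " + r.site);
        }
    }
}
```

With these rows the background trial fails the StimResp condition and the last target trial fails the artifact threshold, leaving the first two rows.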

Overall, it can be seen that epoch selection is very structured. Note how hardware-specific tasks are built into the file reader, paradigm-specific operations are implemented in a suitably crafted script, and the epoch choice specific to the analysis goal is just a single line to be typed by the user. At each stage it is the right method for the job (cf. an alternative approach).

The present scheme allows for good flexibility in bundling single trials into arbitrary subsets and then either scoring or viewing the result. But how should differencing be performed prior to scoring? Differencing might be added as an option to the viewing module, and to the scoring module, but clearly the proper solution is to factor out such operations. How about grouping layout with epoch selection, as they collectively contribute to the subsetting problem; simplifying the viewing and scoring group to show only viewing options; and adding the scoring to the view window, showing only the relevant scoring options?

It might be argued that subsetting is of little appeal when applied only to individual recordings. There would be too little signal to noise when subsets are formed. Perhaps the ideas expressed here should be expanded to allow multiple recordings, each with distinguishing attributes like subject ID or diagnostic category. Then waveforms from different subjects could be overlaid, as could spectra from different age bins. These are very desirable options indeed. However it is challenging for at least three reasons:

  1. There is a need for all manner of additional subject-specific information to be fed into the program: age, sex, psychometric scores, psychophysiology scores, etc. This practically demands a DB interface, and makes the program very dependent on things out of its control.
  2. When looking at 1000s of subjects it becomes hard to guarantee that all are compatible; that all express identical attributes. Does the number of channels remain the same? And does it matter?
  3. The program currently offers access to all waveforms, all the way down to the single trial level. When looking at 1000s of subjects this becomes difficult to implement: 1000 subjects × 50 epochs × 40 channels × 500 points implies 4 GB of data, 64-bit CPUs, and performance issues.

I think this calls for storage of epochs in a database, along with all their corresponding attributes. That would (a) relieve the JVM of memory constraints, (b) achieve persistence, and (c) leverage the power of SQL for epoch selection. See the embeddable database Derby. Database access is still exacting, but see the Java Persistence API (JPA), which simplifies the storage and retrieval of serialized Java objects.
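
The 4 GB figure in the third item can be checked with a few lines of arithmetic. The sample width is not stated in the text; the sketch assumes 4-byte floats, which is what reproduces the quoted figure. The class name is hypothetical.

```java
/** Back-of-envelope estimate of the memory needed to hold every single trial. */
class EpochMemorySketch {

    /** Total bytes, using overflow-checked long arithmetic. */
    static long bytesNeeded(long subjects, long epochs, long channels,
                            long points, long bytesPerSample) {
        long samples = Math.multiplyExact(
                Math.multiplyExact(subjects, epochs),
                Math.multiplyExact(channels, points));
        return Math.multiplyExact(samples, bytesPerSample);
    }

    public static void main(String[] args) {
        // 1000 subjects x 50 epochs x 40 channels x 500 points, 4 bytes per sample (assumed)
        long bytes = bytesNeeded(1000, 50, 40, 500, Float.BYTES);
        System.out.println(bytes + " bytes");
        System.out.println(bytes / 1_000_000_000.0 + " GB (decimal)");
    }
}
```

The total is 10^9 samples, so the byte count already exceeds what a 32-bit address space can hold, consistent with the remark about 64-bit CPUs.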

It is feasible to generate figures and scores from the command line. The value of this lies in reproducibility. Once the command line incantation is finalized, it can be used without any hesitation or complication to regenerate results. Also, in the context of report generation, the chance of human error can be reduced by command shell scripting of all figure generation.

It is feasible to carry out statistics and high-quality visualization directly from this application. With the aid of Java wrappers for R from Omegahat/RSJava, and the shared object version of R (/usr/lib/R/lib/libR.so), it is possible to perform any R operation, and return the results into the Java environment. See the tutorial RFromJava.pdf. This is one way to extend the scope of this application to cover the generation of figures for reports. The same functionality is available via rJava plus JRI. Likewise, JMatLink is well suited to help exploit Matlab.

Runtime monitoring

There are various ways to monitor memory usage, number of threads, etc. Utilities include jstat, jconsole and jvisualvm. All work well and reliably when both they and the target application are on the same host. Problems only arise when trying to monitor a remote system.

Remote RMI registry and jstatd

For remote monitoring it is essential that there is an RMI registry running on the remote host, and for jstatd to be running. The local host will try to initiate a TCP connection to the remote RMI registry, and will fail (with a poor explanation) if that registry is not there or if blocked by a firewall. The simplest way to launch a registry is to run jstatd on the remote host thus:

   remote:~/progs $ jstatd  \
                      -J-Djava.security.policy=jstatd.all.policy \
                      [-J-Djava.rmi.server.logCalls=true] \
                      -J-Djava.rmi.server.hostname=172.20.22.12  \
                      [-p 1099] [-n JStatRemoteHost] &

Started this way (no -nr option) jstatd will create its own RMI registry, will listen by default on TCP port 1099, and will use the name 'JStatRemoteHost' to distinguish it from other registries that may be listening on the same port. [Note that as of late 2009 (Java version 1.6.0_17) it was essential that jstatd use the default port and registry name.] There seem to be other ports involved, however (try netstat -utanp | grep jstatd), so I find that firewalling must be completely disabled on the remote machine before monitoring can take place. Specifying a policy file is essential: the standard one is a three-liner, and easily found in the documentation. The hostname specification is peculiar. Most users don't require this property to be set, but I found that my machines infer their IP address to be 127.0.1.1, and that this cripples RMI. Setting the 'logCalls' property to true is extremely useful initially. You will probably want to start jstatd in the background ('&'), since you can then close the shell and jstatd will continue running (i.e. disown is not required).

Once jstatd is started it will report two items (if logging is turned on) then wait. You can further confirm that all is OK by turning to the local machine (firewalling of the local machine is OK), and running

   local:~/ $ jps -l [-m] [-v]  rmi://172.20.22.12
   local:~/ $ jvisualvm

jvisualvm will need to be told the name and IP of the remote host. Then it will list all java apps running on that machine, including jstatd.

Now run an application such as Jeda on the remote host. Run with the following additional properties:

   remote:~/progs $ java  \
                 -Dcom.sun.management.jmxremote.port=12345 \
                 -Dcom.sun.management.jmxremote.ssl=false  \
                 -Dcom.sun.management.jmxremote.authenticate=false ....

I am not sure of the significance or necessity of the port number. And 'authenticate' can be set either to true or false, it seems (maybe the permissive jstatd.all.policy file used by jstatd renders this property irrelevant).

The application should immediately be registered by jvisualvm, and monitoring will commence after right-clicking on that tree item, and selecting 'Open'. It is not necessary to enter the port number 12345, nor is it necessary to specify any username/password. Nor is there any call to enter any role/rolePassword defined under $JAVA_HOME/jre/lib/management. You can then monitor memory usage, number of threads, and number of classes.
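
The same three quantities (memory usage, thread count, loaded classes) can also be read from inside a running JVM through the standard java.lang.management beans, which is sometimes handy as a sanity check on what a monitoring tool reports. The sketch below is a generic illustration with a hypothetical class name; it is not taken from Jeda.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

/** Prints a one-line snapshot of the JVM that runs this code. */
class RuntimeSnapshotSketch {

    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format("heapUsed=%d bytes, threads=%d, loadedClasses=%d",
                heap.getUsed(), threads.getThreadCount(), classes.getLoadedClassCount());
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

The numbers vary from run to run, so no expected output is shown.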

However...

The above is quite complex, and there are residual doubts about ports and authentication (does jmxremote.port=12345 allow jstatd to be bypassed?).

Overall, I suspect that this exercise is of dubious value, when there is the alternative of doing an ssh -X to the remote machine, and running jvisualvm on the remote host. This is just as responsive in my case, and jvisualvm is able to report more diagnostics when run natively. Also there is no need for jstatd, nor for all the extra property settings when starting an application, nor for firewalling to be disabled.

Comparisons

There are many analysis tools (see here) although few are both alive and directly comparable to Jeda. The most interesting available peer comparison is with EEGLAB. This is built on Matlab, and so has all the associated strengths and weaknesses: easy vector and matrix operations, accessible scripts, user extensions, a constrained model for plotting, no (enforced) object orientation.

There is another relevant package, named bioelectromagnetism, which is definitely aimed at Matlab enthusiasts. It only accepts epoched data as input, so is complementary to the core functionality of Jeda. It might be of use for MRI-ERP data fusion, and for source localization via BrainStorm.

The Biosignal package is designed to handle many different recording modalities.

BrainVision Analyzer is a very complete package for EEG and ERP analysis.

 

Last changed 2010-10-07