User's guide for the EnLink code:

Introduction


EnLink is a density-based hierarchical group finder for multidimensional data sets. Given a data file, it first calculates the densities and neighbor lists of the data points and then performs the clustering analysis. The novel feature of the code is the use of a Shannon-entropy-based scheme to compute an adaptive metric for each data point. This is the best option when no prior information is available about the dimensions of the data. However, when the underlying metric does not vary much in space and one has a reasonable guess for it, one can use a custom constant metric, e.g. Euclidean. The estimated densities might be slightly less prone to noise in such cases.

If distance is defined by ds^2 = dx^T G dx, then by metric we mean sqrt(g_ij).
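
As a small illustration of this definition, the following Python sketch evaluates ds^2 = dx^T G dx for a constant metric tensor G given as a nested list. The function name squared_distance and the two example matrices are assumptions made for illustration only; they are not part of EnLink.

```python
def squared_distance(dx, G):
    """Return ds^2 = dx^T G dx for a displacement dx and metric tensor G."""
    d = len(dx)
    return sum(dx[i] * G[i][j] * dx[j] for i in range(d) for j in range(d))

# Euclidean metric in 3 dimensions: G is the identity matrix.
G_euclid = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A diagonal metric that weights the third axis more strongly (g_33 = 4).
G_scaled = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]

dx = [1.0, 2.0, 2.0]
print(squared_distance(dx, G_euclid))  # 1 + 4 + 4 = 9.0
print(squared_distance(dx, G_scaled))  # 1 + 4 + 16 = 21.0
```

With a non-Euclidean G the same displacement yields a different distance, which is why the choice of metric affects which points count as nearest neighbors.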

The code outputs a list of the densities of the particles, a list of the group IDs of the particles, and a table of the properties of the groups. A list of subgroups is also available. Group IDs are labelled starting from 0. Due to the hierarchical nature of the clustering there are two types of groups.

Independent groups
      These are groups which are not subgroups of any other group.
      The parent ID is equal to the group ID of the group itself.
      The status of the group is denoted by 1.

Subgroups
      These are groups which are subgroups of another group.
      The parent ID is not equal to the group ID of the group.
      The status of the group is denoted by 0.



Group-0:

This is the group with ID 0. It holds the data points which cannot be reliably assigned to any group and should be ignored. In normal cases Group-0 is empty. However, in some cases it may contain some points. These are a) points below the user-specified density, or b) independent groups having size and/or significance below user-specified limits.
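
When post-processing results, Group-0 points are typically set aside first. A minimal Python sketch, assuming the group IDs have already been loaded into a plain list (the function name split_unassigned and the sample list are hypothetical):

```python
def split_unassigned(gids):
    """Split point indices into Group-0 (unassigned) and assigned points."""
    unassigned = [i for i, g in enumerate(gids) if g == 0]
    assigned = [i for i, g in enumerate(gids) if g != 0]
    return unassigned, assigned

# Hypothetical group IDs for six data points.
gids = [1, 1, 0, 2, 1, 0]
unassigned, assigned = split_unassigned(gids)
print(unassigned)  # [2, 5]
print(assigned)    # [0, 1, 3, 4]
```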




Installation Python version

For Python version do
tar -zxvf enlink-0.1.0.tar.gz
cd enlink-0.1.0
python setup.py install

Usage Python version

>>> import enlink
>>> help(enlink)

Smoothing lengths for use in Galaxia

The smoothing lengths are used by Galaxia to sample N-body particles. The following code describes how to extract them from enlink.
enlink_galaxia.py

Installation ebfpy module

You will need the Python ebfpy module to read/write files. To install the latest version:
pip install --user --upgrade git+https://github.com/sanjibs/ebfpy.git@master

Installation C++ version

For the C++ version do
tar -zxvf Enlink-x.x.tar.gz
cd Enlink-x.x/src
make clean
make

To test, run the C++ version on the supplied sample data file poisson.ascii, which contains homogeneous Poisson-sampled data with no groups.
cd ..
Enlink -dlg --dim=3 data/poisson.ascii

Usage C++ version


Enlink  [OPTIONS] file.ascii
OR
Enlink  [OPTIONS] file.ebf

On successful completion, the last output line is the following

Status-->SUCCESS     Total Time     time_in_secs
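
A script that drives EnLink can check for this final line. The Python sketch below is one way to do it, assuming the executable is called Enlink and is on the PATH, and that the status line is written to standard output; both are assumptions, as are the function names and the sample log text.

```python
import subprocess

def run_succeeded(output_text):
    """Return True if the last non-empty output line reports SUCCESS."""
    lines = [ln.strip() for ln in output_text.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].startswith("Status-->SUCCESS")

def run_enlink(args):
    """Run Enlink with the given argument list and report success.

    Assumes an executable named 'Enlink' on the PATH that prints its
    status line to standard output.
    """
    result = subprocess.run(["Enlink"] + list(args),
                            capture_output=True, text=True)
    return run_succeeded(result.stdout)

# Hypothetical output text; only the last line follows the documented form.
sample = "some earlier output\nStatus-->SUCCESS     Total Time     12.3\n"
print(run_succeeded(sample))  # True
```

A call such as run_enlink(["-dlg", "--dim=3", "data/poisson.ascii"]) would then mirror the test run described in the installation section.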


Examples-
Enlink -d      --dim=3 file.ascii     (calculate and save density)
Enlink -dl     --dim=3 file.ascii     (calculate and save density + neighbor list)
Enlink -dlg    --dim=3 file.ascii     (calculate and save density + neighbor list + group finding results)
Enlink -dlg    --dim=3 --ngb=30  --dsuffix=_n30 --ph=4.0 --gsuffix=_g4.0  file.ascii
                                      (a run with user-specified settings)
Enlink -dlgsbr --dim=3 --ngb=30  --dsuffix=_n30 --ph=4.0 --gsuffix=_g4.0  file.ascii
                                      (another run with user-specified settings and adaptive entropy-based metric)
Enlink -g      --dim=3 --ngb=30  --dsuffix=_n30 --ph=5.0 --gsuffix=_g5.0  file.ascii
                                      (run a clustering analysis on a previously calculated density and neighbor list)

Options-
      -dlg
                The simplest setting; uses a constant Euclidean metric.
      -dlg --gmetric=1
                Uses a constant metric, normalized using the variance of the data along each dimension.
      -dlg --gmetric=2
                Uses a custom constant metric as specified in file "./metric_default.txt".
      -dlg --gmetric=3
                Angular search on a sphere. The first three coordinates should be x,y,z with r=1, followed by the others, e.g. vr. Uses a custom constant metric as
                specified in file "./metric_default.txt"  (1.0 1.0 1.0  sigma_lb(radians)/sigma_vr).
      -dlgsbr
                The best available metric calculation and the recommended option. Computes an adaptive metric using an entropy-based scheme.
                The metric can also have non-diagonal elements, which is useful for anisotropic structures. A bootstrap technique is used for smoothing the metric.

      -dlgsb
                Similar to the above, but the metric is diagonal since rotational transformations (option -r) are not done. The code also runs faster in this case.
      --dim=d 
                The dimensionality of data. This should match the one specified in input file, see section Input/Output for more details.
      --ph=STh
                This is the parameter STh, the significance threshold or peak height threshold. The significance of a group having peak density ρ_max
                and lowest density ρ_min is defined as

                S = {ln(ρ_max) - ln(ρ_min)} / σ_lnρ ,   where σ_lnρ = sqrt{V_d ||W||_2^2 / k}.

                STh is the minimum value of significance above which groups are reported. If not specified, a default value is used such that G(S>STh)=0.5,
                meaning that the expected number of groups due to Poisson noise for the given number of points and dimensionality is 0.5. Setting it lower
                gives more groups, but some of them might be fake. The number of groups as a function of significance S for Poisson-sampled data with N points
                and dimensionality d is given by the following equation (k being the number of nearest neighbors used for density estimation).

                G(S>STh) = (0.4 N/k) erfc[(STh/sqrt{2}) sqrt{(d/4)(1-2.3/k)}]
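
The Poisson-noise formula above can be evaluated numerically. The Python sketch below implements G(S>STh) as written and uses bisection to find the STh at which G equals a target such as 0.5. The function names, the bisection bracket [0, 50] and the example values of N, d and k are assumptions for illustration; how EnLink computes its default internally may differ in detail.

```python
import math

def expected_poisson_groups(s_th, n_points, dim, k):
    """G(S>STh) = (0.4 N/k) erfc[(STh/sqrt(2)) sqrt((d/4)(1 - 2.3/k))]."""
    width = math.sqrt((dim / 4.0) * (1.0 - 2.3 / k))
    return (0.4 * n_points / k) * math.erfc(s_th / math.sqrt(2.0) * width)

def threshold_for(target, n_points, dim, k, lo=0.0, hi=50.0):
    """Bisect for the STh at which G(S>STh) equals target.

    G decreases monotonically with STh, so the root is bracketed as long
    as G(lo) > target > G(hi). Requires k > 2.3.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_poisson_groups(mid, n_points, dim, k) > target:
            lo = mid   # too many spurious groups: raise the threshold
        else:
            hi = mid   # few enough spurious groups: try a lower threshold
    return 0.5 * (lo + hi)

# Example values (assumed): 100000 points, 3 dimensions, k = 10 neighbors.
s_default = threshold_for(0.5, 100000, 3, 10)
print(s_default, expected_poisson_groups(s_default, 100000, 3, 10))
```

Raising k or lowering N reduces the threshold needed to keep the expected number of spurious groups at the same level.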
                      
      --ngb=k         
                The number of neighbors over which to evaluate density. Default is 10.  Suggested optimum is 30.
                      
      --sepcrit=k         
                Subgroup truncation criterion: 0 to truncate the subgroup immediately, 1 to never truncate the subgroup. Default is 0.
      --mng=MNG
                The minimum number of points a group should have to be qualified as a group. Default is 10. 
      --psuffix=suffixp    
                 A string which is an extra suffix for output files. Currently unused.
                 {p3,h3(hammer aitoff),s3 (r=5*alog(r*100)-8),en (energy)},{ v3,u3(vl,vb,vr),u2,m3(mul,mub,vr),m2,vr, l3(lx,ly,lz), l2 (lz,|l|), lz},{feal?}
      --dsuffix=suffixd    
                 A string which is a suffix for density output files. Note, suffixpd=suffixp+suffixd
                 density file-    file[suffixpd]_den.txt                          (densities + neighbor-list of particles)
      --gsuffix=suffixg    
                 Suffix for group finder output files.
                 grp file      -    file[suffixpd]_den[suffixg]_grp.txt     (group properties and its hierarchy)
 

Input/Output

One can either use ASCII format files or EBF format files. If ASCII is used output is also in ASCII.


Input

If EBF is used then the file should contain a multidimensional array with tag name "/pos" (N x d) or "/pos3" (N x 3),
N being the number of data points and d the number of dimensions. Clustering is done on this data only.
See http://ebfformat.sourceforge.net/ for further details on how to read or write ebf files.

If the input file is ASCII text then it should be in the following format.

<head>
      rows    columns    c_mass    d   c_1    c_2  ....  c_d   
<head>
x[0][0]       x[0][1]      .... x[0][columns-1]
..................................................................
..................................................................
x[rows-1][0]  x[rows-1][1] .... x[rows-1][columns-1]

In line 2 all variables are integers. d is the dimensionality of the data to be analyzed. The numbers after it specify the columns corresponding to the dimensions.
Note, column numbers run from 0 to columns-1. If d<=0, then it is set to d=columns by default. c_mass is the column number for mass. It can also be set to -1 if all data points have the same mass, in which case all points are assigned a mass of 1.0. The number d should match the dimension specified in the argument --dim when running the code.
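
The ASCII layout above can be produced with a few lines of Python. This is a sketch under stated assumptions: values are separated by whitespace, each data row sits on its own line, and the two marker lines are written literally as shown in the format description. The function name write_enlink_ascii and the sample data are hypothetical.

```python
import os
import tempfile

def write_enlink_ascii(path, data, dims, c_mass=-1):
    """Write rows of numbers in the ASCII layout described above."""
    rows, columns = len(data), len(data[0])
    # Header line: rows columns c_mass d c_1 ... c_d (all integers).
    header = [rows, columns, c_mass, len(dims)] + list(dims)
    with open(path, "w") as f:
        f.write("<head>\n")
        f.write(" ".join(str(v) for v in header) + "\n")
        f.write("<head>\n")
        for row in data:
            f.write(" ".join(repr(float(x)) for x in row) + "\n")

# Hypothetical data: two points with columns x, y, z, mass.
points = [[0.1, 0.2, 0.3, 1.0], [0.4, 0.5, 0.6, 2.0]]
path = os.path.join(tempfile.mkdtemp(), "sample.ascii")
write_enlink_ascii(path, points, dims=[0, 1, 2], c_mass=3)
with open(path) as f:
    lines = f.read().splitlines()
print(lines[1])  # 2 4 3 3 0 1 2
```

A file written this way uses d=3, so it would be analyzed with --dim=3.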


Output
The code outputs three files:

1) density file -  file[suffixpd]_den.ebf                  (densities + neighbor-list of particles)
2) gid file     -  file[suffixpd]_den[suffixg]_gid.ebf     (group id of particles)
3) grp file     -  file[suffixpd]_den[suffixg]_grp.txt     (group properties and its hierarchy)


Here, suffixpd=suffixp+suffixd. Note, brackets around suffix names are inserted for convenience only and are not present in actual output file names.
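
The naming scheme can be written out as a small helper. The Python sketch below assumes that "file" stands for the input file name with its extension removed, and that any input not ending in .ebf is treated as ASCII; neither point is stated explicitly in this guide, and the function name output_names is hypothetical.

```python
import os

def output_names(input_path, suffixp="", suffixd="", suffixg=""):
    """Build the three output file names from the documented pattern."""
    base, ext = os.path.splitext(input_path)
    # ASCII input gives .txt density and gid files; EBF input gives .ebf.
    data_ext = ".ebf" if ext == ".ebf" else ".txt"
    den_stem = base + suffixp + suffixd + "_den"   # file[suffixpd]_den
    return {
        "density": den_stem + data_ext,
        "gid": den_stem + suffixg + "_gid" + data_ext,
        "grp": den_stem + suffixg + "_grp.txt",
    }

names = output_names("file.ascii", suffixd="_n30", suffixg="_g4.0")
print(names["density"])  # file_n30_den.txt
print(names["gid"])      # file_n30_den_g4.0_gid.txt
print(names["grp"])      # file_n30_den_g4.0_grp.txt
```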

The grp file has the following format-

rows columns
list of column names
list of data type codes for each column (2 for integer and   4 for float)
x[0][0]        x[0][1] ............x[0][columns-1]
...........................................
...........................................
x[rows-1][0]  x[rows-1][1] .......x[rows-1][columns-1]
list of subgroups, of size = Sum(x[0:rows-1][1]), i.e. the sum of the nsgp column



Explanation of Columns-

Column     Description
id         Group id
nsgp       Number of subgroups
status     Status:
           0) Independent group, i.e., it is not a subgroup of any other group
           1) Is a subgroup of some other group
isgp       Starting location of subgroups. Subgroups belonging to the group are given by the list
           sgplist[isgp:(isgp+nsgp)]
iden       Point id of densest data point
count      Number of points in the group
parent     Group id of parent group
sig        Significance
mass       Cumulative mass, summed over the group and its subgroups
mass_vir
vol        Volume
cr
rho_max    Maximum density
rho_min    Minimum density
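
The grp file layout and the sgplist indexing can be exercised with a short parser. The Python sketch below assumes whitespace-separated fields, that the subgroup list occupies the remaining lines after the data rows, and that its entries are group ids. The sample text is invented and uses only a subset of the columns; the function names are hypothetical.

```python
def parse_grp(text):
    """Parse grp-file text into a list of group records and the subgroup list."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows, columns = (int(v) for v in lines[0].split())
    names = lines[1].split()
    codes = [int(c) for c in lines[2].split()]  # 2 = integer, 4 = float
    groups = []
    for ln in lines[3:3 + rows]:
        fields = ln.split()
        rec = {}
        for name, code, field in zip(names, codes, fields):
            rec[name] = int(field) if code == 2 else float(field)
        groups.append(rec)
    # Everything after the data rows is taken to be the subgroup list.
    sgplist = [int(v) for ln in lines[3 + rows:] for v in ln.split()]
    return groups, sgplist

def subgroups_of(group, sgplist):
    """Return sgplist[isgp:(isgp+nsgp)] for one group record."""
    return sgplist[group["isgp"]:group["isgp"] + group["nsgp"]]

# Invented sample: group 1 is independent and has subgroups 2 and 3.
SAMPLE = """3 5
id nsgp isgp count parent
2 2 2 2 2
1 2 0 200 1
2 0 2 50 1
3 0 2 30 1
2 3
"""
groups, sgplist = parse_grp(SAMPLE)
print(subgroups_of(groups[0], sgplist))                       # [2, 3]
print([g["id"] for g in groups if g["parent"] == g["id"]])    # [1]
```

The last line selects independent groups using the rule that their parent ID equals their own group ID.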



If the input file is ASCII then the first two files are in ASCII format with suffix .txt instead of .ebf. In that case the
density file has the following format-


rows columns
density[0]
density[1]
....
....
density[rows-1]



The gid file has the following format-

rows columns
id[0]
id[1]
....
....
id[rows-1]
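
Both ASCII files share the same shape: a "rows columns" line followed by one record per line. A minimal Python reader, assuming whitespace-separated values (the function name read_column_file and the sample text are hypothetical), could look like this:

```python
from collections import Counter

def read_column_file(text, cast=float):
    """Read a 'rows columns' header followed by rows lines of values."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows, columns = (int(v) for v in lines[0].split())
    table = [[cast(v) for v in ln.split()] for ln in lines[1:1 + rows]]
    if len(table) != rows or any(len(r) != columns for r in table):
        raise ValueError("file contents do not match the header")
    return table

# Invented gid file contents for five points.
gid_text = "5 1\n1\n1\n2\n0\n1\n"
gids = [row[0] for row in read_column_file(gid_text, cast=int)]
sizes = Counter(gids)   # number of points per group id
print(gids)             # [1, 1, 2, 0, 1]
print(sizes[1])         # 3
```

The same function reads a density file with cast=float.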