|
| 1 | +Python CHM Train and Test Algorithms |
| 2 | +==================================== |
| 3 | + |
| 4 | +This package includes the CHM train and test algorithms written in Python with several |
| 5 | +enhancements, including major speed and memory improvements, improved and added filters, and |
| 6 | +improvements to the processing algorithm. |
| 7 | + |
| 8 | +Models are not always compatible. MATLAB models created with the MATLAB `CHM_train` can be used |
| 9 | +with the Python `chm.test` with only minor differences in output. However, Python models cannot be |
| 10 | +used with the MATLAB `CHM_test`. |
| 11 | + |
| 12 | + |
| 13 | +Installation |
| 14 | +------------ |
| 15 | +The following libraries must be installed: |
| 16 | + * gcc and gfortran (or another C and Fortran compiler) |
| 17 | + * Python 2.7 |
| 18 | + * Python headers |
| 19 | + |
| 20 | +The following libraries are strongly recommended: |
| 21 | + * virtualenv |
| 22 | + * linear-algebra package including devel (in order of preference: MKL, ATLAS+LAPACK, OpenBLAS+LAPACK, any BLAS library) |
| 23 | + * devel packages for image formats you wish to read: |
| 24 | + * PNG: zlib (note: uncompressed always available) |
| 25 | + * TIFF: libtiff (note: uncompressed always available) |
| 26 | + * JPEG: libjpeg or libjpeg-turbo |
| 27 | + * etc... |
| 28 | + * hdf5 devel package for reading and writing modern MATLAB files |
| 29 | + * fftw devel package for faster FFT calculations |
| 30 | + |
| 31 | +These can all be installed with various Python managers such as Anaconda, Enthought, Python(x,y), |
| 32 | +or WinPython. On Linux machines they can be installed globally with `yum`, `apt-get`, or similar. |
| 33 | +For example on CentOS-7 all of these can be installed with the following: |
| 34 | + |
| 35 | + yum install gcc gcc-gfortran python python-devel python-virtualenv \ |
| 36 | + atlas atlas-devel lapack lapack-devel lapack64 lapack64-devel \ |
| 37 | + zlib zlib-devel libtiff libtiff-devel libjpeg-turbo libjpeg-turbo-devel \ |
| 38 | + hdf5 hdf5-devel fftw fftw-devel |
| 39 | + |
| 40 | +The recommended way to install from here is to create a Python virtual environment with all of |
| 41 | +the dependent Python packages. On Linux machines, setting this up would look like: |
| 42 | + |
| 43 | + # Create the folder for the virtual environment |
| 44 | + # Adjust this as you see fit |
| 45 | + mkdir ~/virtenv |
| 46 | + cd ~/virtenv |
| 47 | + |
| 48 | + # Create and activate the virtual environment |
| 49 | + virtualenv . |
| 50 | + source bin/activate |
| 51 | + |
| 52 | + # Install some of the dependencies |
| 53 | + # Note: this step can be skipped but it greatly speeds up the other commands |
| 54 | + pip install numpy cython scipy |
| 55 | + |
| 56 | + # Install the devel pysegtools (and all dependencies) |
| 57 | + git clone [email protected]:slash-segmentation/segtools.git |
| 58 | + pip install -e segtools[PIL,MATLAB,OPT] |
| 59 | + |
| 60 | + # Install the devel PyCHM |
| 61 | + git clone [email protected]:slash-segmentation/CHM.git |
| 62 | + pip install -e CHM/python[OPT] |
| 63 | + |
| 64 | +Since the pysegtools and CHM packages are installed in editable mode (`-e`), if there are updates |
| 65 | +to the code, you can do the following to update them: |
| 66 | + |
| 67 | + cd ~/virtenv/segtools |
| 68 | + git pull |
| 69 | + |
| 70 | + cd ~/virtenv/CHM/python |
| 71 | + git pull |
| 72 | + ./setup.py build_ext --inplace # builds any changed Cython modules |
| 73 | + |
| 74 | +For hints on getting this to work on a cluster (e.g. Comet) see the INSTALL guide for segtools. |
| 75 | + |
| 76 | + |
| 77 | +CHM Test |
| 78 | +-------- |
| 79 | +Basic usage is: |
| 80 | + |
| 81 | + python -m chm.test model input-image output-image <options> |
| 82 | + |
| 83 | +General help will be given if no arguments (or invalid arguments) are provided: |
| 84 | + |
| 85 | + python -m chm.test |
| 86 | + |
| 87 | +The model must be a directory of a MATLAB model or a Python model file. For MATLAB models the |
| 88 | +folder contains param.mat and MODEL\_level#\_stage#.mat. For Python models it is a single file. |
| 89 | +*The MATLAB command line made this optional but it is now mandatory.* |
| 90 | + |
| 91 | +The CHM test program takes a single 2D input image, calculates the labels according to a model, |
| 92 | +and saves the labels to the 2D output image. The images can be anything supported by the `imstack` |
| 93 | +program for 2D images. The output-image can be given as a directory in which case the input image |
| 94 | +filename and type are used. *The MATLAB command line allowed multiple files, but now only a |
| 95 | +single image is allowed. To process more images you must write a loop in bash or similar.* |
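
As an alternative to a bash loop, the same thing can be sketched in Python. This is a minimal sketch: the helper names `build_test_commands` and `run_all` are hypothetical, and only the documented `python -m chm.test model input-image output-image` form is used.

```python
import subprocess


def build_test_commands(model, input_images, output_dir, extra_options=()):
    """Return one `python -m chm.test` argument list per input image."""
    commands = []
    for image in input_images:
        # Giving a directory as the output reuses the input filename and type.
        command = ["python", "-m", "chm.test", model, image, output_dir]
        commands.append(command + list(extra_options))
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

For example, `run_all(build_test_commands("model", ["a.png", "b.png"], "out"))` would process the two (hypothetical) images in sequence.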
| 96 | + |
| 97 | +The CHM test program splits the input image into a bunch of tiles and operates on each tile |
| 98 | +separately. The size of the tiles to process can be given with `-t #x#`, e.g. `-t 512x512`. The |
| 99 | +default is 512x512. The tile size must be a multiple of 2^Nlevel of the model (typically Nlevel<=4, |
| 100 | +so should be a multiple of 16). *The MATLAB command line called this option `-b`. Additionally, the |
| 101 | +MATLAB command line program overlapped tiles (specified with `-o`) which is no longer needed or |
| 102 | +supported. Finally, in MATLAB this was a critical quality vs speed option and now the default |
| 103 | +should really always be used.* |
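
The tile size constraint can be checked before starting a run. The helpers below are hypothetical and only illustrate the rule stated above (each dimension a multiple of 2^Nlevel); they are not part of the package.

```python
def parse_tile_size(text):
    """Parse a '#x#' tile size such as '512x512' into (width, height)."""
    width, height = text.lower().split("x")
    return int(width), int(height)


def is_valid_tile_size(width, height, nlevel):
    """Return True if both dimensions are multiples of 2**nlevel."""
    multiple = 2 ** nlevel
    return width % multiple == 0 and height % multiple == 0
```

With Nlevel=4 the required multiple is 16, so `512x512` is accepted while `500x512` is not.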
| 104 | + |
| 105 | +Instead of computing the labels for every tile, tiles can be specified using either tile groups |
| 106 | +(recommended) or individually. Tile groups use `-g` to specify which group to process and `-G` to |
| 107 | +specify the number of groups. Thus when distributing the testing process among three machines the |
| 108 | +processes would be run with `-g 1 -G 3`, `-g 2 -G 3` and `-g 3 -G 3` on the three machines. The |
| 109 | +testing process will determine which tiles it is to process based on this information in a way that |
| 110 | +reduces any extra work and ensures each process has roughly the same amount of work. The total |
| 111 | +number of groups should not be larger than 30. Individual tiles can be specified using `-T #,#`, |
| 112 | +e.g. `-T 0,2` computes the tile in the first column, third row. This option can be specified any |
| 113 | +number of times to cause multiple tiles to be calculated at once. All tiles not calculated will |
| 114 | +output as black (0). Any tile indices out of range are simply ignored. *The MATLAB command line |
| 115 | +called this option `-t`, started indexing at 1 instead of 0, and did not support tile groups.* |
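
The per-machine command lines for tile groups can be generated mechanically. This is a sketch with a hypothetical helper name; treating the limit of 30 groups as a hard error is an assumption made here for illustration. How the partial outputs are combined afterwards is not covered.

```python
def group_commands(model, input_image, output_image, ngroups):
    """Return one chm.test argument list per tile group (groups are 1-based)."""
    if not 1 <= ngroups <= 30:
        # Assumption: reject counts outside the recommended range.
        raise ValueError("the number of groups should be between 1 and 30")
    base = ["python", "-m", "chm.test", model, input_image, output_image]
    return [base + ["-g", str(group), "-G", str(ngroups)]
            for group in range(1, ngroups + 1)]
```

Calling `group_commands("model", "in.png", "out.png", 3)` yields the three commands ending in `-g 1 -G 3`, `-g 2 -G 3`, and `-g 3 -G 3`.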
| 116 | + |
| 117 | +By default the output image data is saved as single-byte grayscale data (0-255). The output data |
| 118 | +type can be changed with `-d type` where `type` is one of `u8` (8-bit integer from 0-255), `u16` |
| 119 | +(16-bit integer from 0-65535), `u32` (32-bit integer from 0-4294967295), `f32` (32-bit floating |
| 120 | +point number from 0.0-1.0), or `f64` (64-bit floating-point number from 0.0-1.0). All of the other |
| 121 | +data types increase the resolution of the output data. However the output image format must |
| 122 | +support the data type (for example, PNG only supports u8 and u16 while TIFF supports all of the |
| 123 | +types). |
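
To make the ranges concrete, the sketch below maps a label probability in 0.0-1.0 onto each output type. The linear scaling with rounding is an assumption for illustration, not necessarily how the program converts values, and `quantize` is a hypothetical name.

```python
# Maximum value of each integer output type, as listed above.
MAX_VALUE = {"u8": 255, "u16": 65535, "u32": 4294967295}


def quantize(probability, dtype):
    """Map a probability in [0.0, 1.0] to the range of the given output type."""
    if dtype in ("f32", "f64"):
        # Floating-point outputs keep the 0.0-1.0 range.
        return float(probability)
    # Assumption: linear scaling followed by rounding to the nearest integer.
    return int(round(probability * MAX_VALUE[dtype]))
```

Under this assumption `u8` distinguishes 256 levels while `u16` distinguishes 65536, which is the sense in which the larger types increase resolution.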
| 124 | + |
| 125 | +Finally, by default the CHM test program will attempt to use as much memory as is available and all |
| 126 | +logical processors available. If this is not desired, it can be tweaked using the `-n` and `-N` |
| 127 | +options. The `-n` option specifies the number of tasks while `-N` specifies the number of threads |
| 128 | +per task. Each additional task will take up significant memory (up to 2 GiB for default tiles and |
| 129 | +Nlevel=4) while each additional thread per task doesn't affect memory significantly. However, |
| 130 | +splitting the work into two tasks with one thread each will be much faster than having one task |
| 131 | +with two threads. The default uses twice as many tasks as would fit in memory (since the maximum |
| 132 | +memory usage is only used for a short period of time) up to the number of CPUs and then divides |
| 133 | +the threads between all tasks. If only one of `-n` or `-N` is given, the other is derived based on |
| 134 | +it. *The MATLAB command line only had the option to be multithreaded or not with `-s`.* |
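
The default described above can be sketched roughly as follows. The function name, the way available memory is measured, and the rounding are all assumptions; the actual program may differ in these details.

```python
def default_tasks_and_threads(ncpus, free_memory, memory_per_task):
    """Return (tasks, threads per task) following the documented default."""
    # Number of tasks that would fit in memory at peak usage (at least one).
    tasks_in_memory = max(1, free_memory // memory_per_task)
    # Twice as many tasks as fit in memory, capped at the number of CPUs.
    ntasks = max(1, min(2 * tasks_in_memory, ncpus))
    # Divide the logical processors between the tasks.
    nthreads = max(1, ncpus // ntasks)
    return ntasks, nthreads
```

For example, 16 CPUs with 4 GiB free and 2 GiB per task gives 4 tasks with 4 threads each, which would correspond to `-n 4 -N 4`.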
| 135 | + |
| 136 | +*The MATLAB command line option `-M` is completely gone as there is no need for MATLAB or MCR to be |
| 137 | +installed. The MATLAB version also did histogram equalization by default. This is no longer supported and |
| 138 | +should be done separately.* |
| 139 | + |
| 140 | + |
| 141 | +CHM Train |
| 142 | +--------- |
| 143 | +Basic usage is: |
| 144 | + |
| 145 | + python -m chm.train model inputs labels <options> |
| 146 | + |
| 147 | +General help will be given if no arguments (or invalid arguments) are provided: |
| 148 | + |
| 149 | + python -m chm.train |
| 150 | + |
| 151 | +The CHM train program takes a set of input images and labels and creates a model for use with CHM |
| 152 | +test. The model created from Python cannot be used with the MATLAB CHM test program. The input |
| 153 | +images are specified as a single argument for anything that can be given to `imstack -L` (the value |
| 154 | +may need to be enclosed in quotes to make it a single argument). They must be grayscale images. The |
| 155 | +labels work similarly except that anything that is 0 is considered background while anything else |
| 156 | +is considered a positive label. The inputs and labels are matched up in the order they are given |
| 157 | +and paired images must be the same size. |
| 158 | + |
| 159 | +The model is given as a path to a file for where to save the model to. If the option `-r` is also |
| 160 | +specified and the path already contains (part of) a Python model, then the model is run in 'restart' |
| 161 | +mode. In restart mode, the previous model is examined and as much of it is reused as possible. This |
| 162 | +is useful for when a previous attempt failed partway through or when desiring to add additional |
| 163 | +stages or levels to a model. If the filters are changed from the original model, any completed |
| 164 | +stages/levels will not use the new filters but new stages/levels will. The input images and labels |
| 165 | +must be the same when restarting. |
| 166 | + |
| 167 | +The default number of stages and levels are 2 and 4 respectively. They can be set using `-S #` and |
| 168 | +`-L #` respectively. The number of stages must be at least 2 while the number of levels must be at |
| 169 | +least 1. Each additional stage will require very large amounts of time to compute, both while |
| 170 | +training and testing. Additional levels don't add too much additional time to training or testing, |
| 171 | +but do increase both. Typically, more levels are required for larger structures and do |
| 172 | +not contribute much for smaller structures. Some testing has shown that using more than 2 levels |
| 173 | +does not contribute much to increased quality - at least for non-massive structures. |
| 174 | + |
| 175 | +The filters used for generating features are, by default, the same used by MATLAB CHM train but |
| 176 | +without extra compatibility. The filters can be adjusted using the `-f` option in various ways. The |
| 177 | +available filters are `haar`, `hog`, `edge`, `gabor`, `sift`, `frangi`, and `intensity-<type>-#` |
| 178 | +(where `<type>` is either `stencil` or `square` and `#` is the radius in pixels). To add filters to |
| 179 | +the list of current filters do something like `-f +frangi`. To remove filters from the list of |
| 180 | +current filters do something like `-f -hog`. Additionally, the list of filters can be specified |
| 181 | +directly, for example `-f haar,hog,edge,gabor,sift,intensity-stencil-10` which would specify the |
| 182 | +default set of filters. More than one `-f` option can be given and they will build off of each other. |
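
The way several `-f` options build on each other can be illustrated with a small sketch. This only models the semantics described above; `apply_filter_options` is a hypothetical helper, not the package's own argument parser, and it assumes `+` and `-` each take a single filter name.

```python
DEFAULT_FILTERS = ["haar", "hog", "edge", "gabor", "sift", "intensity-stencil-10"]


def apply_filter_options(options, filters=None):
    """Apply a sequence of -f values in order and return the filter list."""
    current = list(DEFAULT_FILTERS if filters is None else filters)
    for option in options:
        if option.startswith("+"):
            # Add a filter to the current list.
            if option[1:] not in current:
                current.append(option[1:])
        elif option.startswith("-"):
            # Remove a filter from the current list.
            if option[1:] in current:
                current.remove(option[1:])
        else:
            # A comma-separated list replaces the current list.
            current = option.split(",")
    return current
```

So `-f haar,edge -f +frangi` would end up with `haar`, `edge`, and `frangi`, while `-f -hog` alone gives the default set without `hog`.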
| 183 | + |
| 184 | +Besides filters being used to generate the features for images, a filter is used on the 'contexts' |
| 185 | +from the previous stages and levels to generate additional features. This filter can be specified |
| 186 | +with `-c`, e.g. `-c intensity-stencil-7` specifies the default filter used. This only supports a |
| 187 | +single filter, so `+`, `-`, or a list of filters cannot be given. |
| 188 | + |
| 189 | +To speed up the training process the training data can be subsampled by using the `-s` option. If |
| 190 | +the number of samples for a particular stage/level is over the value given, at most half of that |
| 191 | +many positive and negative samples are kept. The default is to keep all samples. *MATLAB had a fixed |
| 192 | +value of 6000000.* |
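
The subsampling rule can be sketched as below. Choosing the kept samples at random is an assumption, and `subsample` is a hypothetical name; only the limit of half the `-s` value per class comes from the description above.

```python
import random


def subsample(labels, max_samples, seed=0):
    """Return the sorted indices of the samples to keep."""
    if len(labels) <= max_samples:
        # Not over the limit: keep everything.
        return list(range(len(labels)))
    rng = random.Random(seed)
    half = max_samples // 2
    positives = [i for i, label in enumerate(labels) if label]
    negatives = [i for i, label in enumerate(labels) if not label]
    keep = []
    for group in (positives, negatives):
        # Keep at most `half` samples of each class.
        keep.extend(group if len(group) <= half else rng.sample(group, half))
    return sorted(keep)
```

With 10 positive and 90 negative samples and a limit of 40, all 10 positives and 20 of the negatives are kept.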
| 193 | + |
| 194 | +The training algorithm also requires running the testing algorithm internally. If desired these |
| 195 | +results can be saved using the option `-o`. This option takes anything that can be given to |
| 196 | +`imstack -S` although quotes may be required to make it a single argument. This means that if you |
| 197 | +want to see the results of running CHM test on the training data you can get it for free. Like CHM |
| 198 | +test, this supports the `-d` argument to specify the data type used to save the data. |
| 199 | + |
| 200 | +If not all of the training data should be used, a set of mask images can be provided to select |
| 201 | +which pixels will be considered during training. This is done with `-M masks` where `masks` can be |
| 202 | +anything that can be given to `imstack -L` although quotes may be required to make it a single |
| 203 | +argument. The number and size of the images must be the same as the input-images and label-images. |
| 204 | +Any non-zero pixel in the mask will be used to train with. Note that all pixels, even if not |
| 205 | +covered by the mask, are considered for generation of the features. |
| 206 | + |
| 207 | +*The MATLAB command line option `-M` is completely gone as there is no need for MATLAB or MCR to be |
| 208 | +installed. Additionally the option `-m` is now mandatory and listed first.* |
| 209 | + |
| 210 | + |
| 211 | +Filters |
| 212 | +------- |
| 213 | + |
| 214 | +### Haar |
| 215 | + |
| 216 | +Computes the Haar-like features of the image. This uses Viola and Jones' (2001) 2-rectangle |
| 217 | +features (x and y directional features) of size 16. It uses their method of computing the integral |
| 218 | +image first and then using fast lookups to get the features. |
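
The integral image approach can be sketched in plain Python as follows. This is an illustration of the general technique, not the package's implementation: the function names, the sign convention, and the window placement are assumptions.

```python
def integral_image(image):
    """Return the summed-area table with an extra leading row and column of zeros."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, top, left, height, width):
    """Sum of a rectangle of pixels using four table lookups."""
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def haar_x(table, top, left, size):
    """2-rectangle feature in x: right half minus left half of the window."""
    half = size // 2
    return (rect_sum(table, top, left + half, size, half)
            - rect_sum(table, top, left, size, half))


def haar_y(table, top, left, size):
    """2-rectangle feature in y: bottom half minus top half of the window."""
    half = size // 2
    return (rect_sum(table, top + half, left, half, size)
            - rect_sum(table, top, left, half, size))
```

Once the table is built, each feature costs a fixed number of lookups regardless of the window size, which is what makes a size of 16 cheap to evaluate at every pixel.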
| 219 | + |
| 220 | +When used for MATLAB models the computations are a bit slower but reduce the drift errors compared |
| 221 | +to the MATLAB output. Technically it is slightly less accurate, but the numbers are in the range of |
| 222 | +1e-8 off for a 1000x1000 tile. |
| 223 | + |
| 224 | +### HOG |
| 225 | + |
| 226 | +Computes the HOG (histogram of oriented gradients) features of the image. |
| 227 | + |
| 228 | +The original MATLAB function used float32 values for many intermediate values so the outputs from |
| 229 | +this filter are understandably off by up to 1e-7. The improved accuracy is used for MATLAB or |
| 230 | +Python models since it just adds more accuracy. |
| 231 | + |
| 232 | +*TODO: more details - reference and parameters used and the new HOG* |
| 233 | + |
| 234 | +### Edge |
| 235 | + |
| 236 | +Computes the edge features of the image. This calculates the edge magnitudes by using convolution |
| 237 | +with the second derivative of a Gaussian with a sigma of 1.0 then returns all neighboring offsets |
| 238 | +in a 7x7 block. |
| 239 | + |
| 240 | +When used for MATLAB models, the tiles/images are padded with 0s instead of reflection. This is not |
| 241 | +a good approach since a transition from 0s to the image data will result in a massive edge. |
| 242 | + |
| 243 | +*TODO: more details - reference* |
| 244 | + |
| 245 | +### Frangi |
| 246 | + |
| 247 | +Computes the Frangi features of the image using the eigenvalues of the Hessian to compute the |
| 248 | +likeliness of an image region to contain vessels or other image ridges, according to the method |
| 249 | +described by Frangi (1998). This uses seven different Gaussian sigmas of 2, 3, 4, 5, 7, 9, and 11 |
| 250 | +each done with the image and the inverted image (looking for white and black ridges). The beta |
| 251 | +value is fixed to 0.5 and c is dynamically calculated as half of the maximum Frobenius norm of all |
| 252 | +Hessian matrices. |
| 253 | + |
| 254 | +This is not used by default in Python models or at all for MATLAB models. |
| 255 | + |
| 256 | +### Gabor |
| 257 | + |
| 258 | +Computes several different Gabor filters on the image using all combinations of the following |
| 259 | +parameters to create the kernels: |
| 260 | + * sigma: 2, 3, 4, 5, 6 |
| 261 | + * lambdaFact: 2, 2.25, 2.5 (lambdaFact = lambda / sigma) |
| 262 | + * orient: pi/6, pi/3, pi/2, 2pi/3, 5pi/6, pi, 7pi/6, 4pi/3, 3pi/2, 5pi/3, 11pi/6, 2pi |
| 263 | + * gamma: 1 |
| 264 | + * psi: 0 (phase) |
| 265 | +The magnitude of the complex filtered image is used as the feature. |
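
The parameter combinations listed above can be enumerated as follows. This sketch only builds the parameter sets (the helper name and dictionary keys are illustrative); it does not construct or apply the kernels.

```python
import itertools
import math

SIGMAS = [2, 3, 4, 5, 6]
LAMBDA_FACTORS = [2, 2.25, 2.5]
# pi/6, pi/3, ..., 2pi in steps of pi/6
ORIENTATIONS = [k * math.pi / 6 for k in range(1, 13)]


def gabor_parameter_sets():
    """Return one parameter dictionary per Gabor kernel."""
    return [
        # lambdaFact = lambda / sigma, so lambda = lambdaFact * sigma
        {"sigma": sigma, "lambda": factor * sigma, "orient": orient,
         "gamma": 1, "psi": 0}
        for sigma, factor, orient in itertools.product(
            SIGMAS, LAMBDA_FACTORS, ORIENTATIONS)
    ]
```

This gives 5 x 3 x 12 = 180 parameter combinations.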
| 266 | + |
| 267 | +The original MATLAB code has a serious bug: it uses the `imfilter` function with a `uint8` |
| 268 | +input which causes the filter output to be clipped to 0-255 and rounded to an integer before taking |
| 269 | +the complex magnitude. This is a major problem as the Gabor filters have negative values (all of |
| 270 | +which are set to 0) and can produce results above the input range, along with losing lots of |
| 271 | +resolution in the data (from ~16 significant digits to ~3). So for MATLAB models the data is |
| 272 | +simplified in this way, otherwise a much higher accuracy version is used. |
| 273 | + |
| 274 | +*TODO: more details - reference* |
| 275 | + |
| 276 | +### SIFT |
| 277 | + |
| 278 | +*TODO* |
| 279 | + |
| 280 | +### Intensity |
| 281 | + |
| 282 | +Computes the neighborhood/intensity features of the image. Essentially, it shifts the image around and |
| 283 | +uses the values of the shifted images. Supports square or stencil neighborhoods of any radius >=1. In |
| 284 | +compatibility mode these are always fixed to a particular type (stencil) and radius (10). |