Ecological Modeling

From GCube System
Revision as of 19:20, 15 February 2012 by Gianpaolo.coro (Talk | contribs) (Evaluators)

Ecological Modeling is a set of functionalities available in gCube for performing mining operations on biological data. It is delivered as a library and as a service (the Statistical Manager) in the infrastructure, and it can train models that are then combined with geographical information to produce projections over several environmental scenarios or time periods. The system supports the analysis of complex phenomena: for example, predicting the impact of climate change on biodiversity, preventing the spread of invasive species, identifying geographical and ecological aspects of disease transmission, supporting conservation planning, and guiding field surveys.

Overview

The library provides a set of features, which can be summarized as follows:

  • GENERATORS: include probability distributions, classifications, matching or distance measurements, etc.
  • MODELING: includes models to be trained, e.g. neural networks, species envelopes, support vector machines, etc. The result is typically a binary file.
  • CLUSTERING: clustering procedures for grouping together phenomena or multidimensional points.
  • TRANSDUCERS: algorithms for transforming one dataset into another.
  • EVALUATORS: a set of procedures for measuring the quality of a model.

The system is currently able to run processes on the following infrastructures:

  • LOCAL MULTICORE MACHINE
  • RAINY CLOUD


Generative Algorithms

Currently the following algorithms are supported for projecting probability distributions on geographical maps:

  • AQUAMAPS_SUITABLE: Aquamaps Suitable habitat production
  • AQUAMAPS_NATIVE: Aquamaps Native habitat production
  • AQUAMAPS_NATIVE_2050: Aquamaps Native for 2050 projections
  • AQUAMAPS_SUITABLE_2050: Aquamaps Suitable for 2050 projections
  • REMOTE_AQUAMAPS_SUITABLE: Aquamaps Suitable habitat generated by invoking Rainy Cloud
  • REMOTE_AQUAMAPS_NATIVE: Aquamaps Native habitat generated by invoking Rainy Cloud
  • REMOTE_AQUAMAPS_NATIVE_2050: Aquamaps Native 2050 habitat generated by invoking Rainy Cloud
  • REMOTE_AQUAMAPS_SUITABLE_2050: Aquamaps Suitable 2050 habitat generated by invoking Rainy Cloud
  • AQUAMAPS_NATIVE_NEURALNETWORK: Aquamaps Native Distribution by using a feed-forward Neural Network
  • AQUAMAPS_SUITABLE_NEURALNETWORK: Aquamaps Suitable Distribution by using a feed-forward Neural Network
  • AQUAMAPS_NEURAL_NETWORK_NS: Aquamaps Suitable Distribution by using a feed-forward Neural Network provided by Neurosolutions (http://www.neurosolutions.com/)

The library manages the above algorithms automatically and selects a computational infrastructure suited to running each generation algorithm.

Modelers

  • HSPEN: Hspen model by Aquamaps
  • AQUAMAPSNN: Feedforward Neural Network for usage in Aquamaps generations
  • AQUAMAPSNNNS: Feedforward Neural Network by Neurosolutions (http://www.neurosolutions.com/) for usage in Aquamaps generations

As with the generative algorithms, the library manages the above modelers automatically and selects a computational infrastructure suited to running them.

Clustering

No clustering algorithms are currently available.

Transducers

No transducers are currently available.

Evaluators

Available evaluation techniques are the following:

  • CLASSIFICATION QUALITY ANALYSIS

This evaluation method applies to a probability distribution and a set of occurrence/absence points. It calculates the following values:

    • TRUE_POSITIVES
    • FALSE_NEGATIVES
    • TRUE_NEGATIVES
    • FALSE_POSITIVES
    • ACCURACY
    • SENSITIVITY
    • SPECIFICITY
  • DISCREPANCY ANALYSIS - BETWEEN TWO SPATIAL DISTRIBUTIONS

Evaluates the distance between two spatial probability distributions with the same resolution, in terms of:

    • ACCURACY
    • MEAN ERROR
    • VARIANCE
    • NUMBER_OF_ERRORS
    • MAXIMUM_ERROR
    • MAXIMUM_ERROR_POINT
    • NUMBER_OF_COMPARISONS
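
The measures listed above can be illustrated with a minimal Python sketch. It is not the gCube implementation: the function names, the 0.5 presence threshold, the 0.1 error tolerance and the exact meaning given to each discrepancy value (for instance, ACCURACY as the share of cells whose difference stays within the tolerance) are assumptions made for illustration. Only the classification values follow the standard confusion-matrix definitions.

```python
from statistics import mean, pvariance


def classification_quality(presence_probs, absence_probs, threshold=0.5):
    """Confusion-matrix values for a distribution sampled at known points.

    presence_probs / absence_probs: predicted probabilities at observed
    occurrence / absence points. threshold is an assumed cut-off above
    which a point counts as predicted presence.
    """
    tp = sum(1 for p in presence_probs if p >= threshold)
    fn = len(presence_probs) - tp
    fp = sum(1 for p in absence_probs if p >= threshold)
    tn = len(absence_probs) - fp
    total = tp + fn + tn + fp
    return {
        "TRUE_POSITIVES": tp,
        "FALSE_NEGATIVES": fn,
        "TRUE_NEGATIVES": tn,
        "FALSE_POSITIVES": fp,
        "ACCURACY": (tp + tn) / total if total else 0.0,
        "SENSITIVITY": tp / (tp + fn) if tp + fn else 0.0,
        "SPECIFICITY": tn / (tn + fp) if tn + fp else 0.0,
    }


def discrepancy(dist_a, dist_b, tolerance=0.1):
    """Compare two distributions given as {cell_id: probability} dicts.

    Assumed semantics: an "error" is a cell whose absolute difference
    exceeds the tolerance; mean and variance are taken over the absolute
    differences of all compared cells.
    """
    shared = sorted(set(dist_a) & set(dist_b))
    if not shared:
        raise ValueError("the two distributions share no cells")
    errors = [abs(dist_a[cell] - dist_b[cell]) for cell in shared]
    n_errors = sum(1 for e in errors if e > tolerance)
    worst = max(range(len(shared)), key=errors.__getitem__)
    return {
        "ACCURACY": 1.0 - n_errors / len(shared),
        "MEAN_ERROR": mean(errors),
        "VARIANCE": pvariance(errors),
        "NUMBER_OF_ERRORS": n_errors,
        "MAXIMUM_ERROR": errors[worst],
        "MAXIMUM_ERROR_POINT": shared[worst],
        "NUMBER_OF_COMPARISONS": len(shared),
    }
```

For example, `classification_quality([0.9, 0.8, 0.3], [0.1, 0.6])` evaluates three presence points and two absence points, and `discrepancy(map_a, map_b)` compares only the cells present in both maps.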

Experiments

Manual Review vs. Automatic Reviews

An experiment was performed using the Statistical Service. We compared several Aquamaps distributions, generated both automatically and manually, on a test species (the basking shark). We selected a species for which the following were available:

  • a good number of occurrence points
  • a manually reviewed map
  • a hspec-suitable map generated by the Aquamaps algorithm

The choice fell on the basking shark (Cetorhinus maximus, Fis-22747), for which 449 presence records were available. Figure 1_PresenceData.jpg depicts the distribution of the presence data. Figure 2_Reviewed.jpg depicts the manually reviewed distribution. Figure 3_Aquamaps.jpg depicts the original distribution produced by the Aquamaps-Suitable algorithm.

We ran two experiments to test whether an automatic machine learning system could extract the environmental preferences of the species from the same parameters used by the Aquamaps algorithm. The machine learning system was trained with both presence and absence data; absence points were extracted from the reviewed map, from places with probability less than 0.1. We chose a feed-forward neural network as the machine learning tool, and the parameters used for training were the same as in the Aquamaps algorithm: depth mean, depth max, depth min, sst mean, sbt mean, salinity mean, salinity b mean, primary production mean, ice concentration, distance from land, ocean area.

The first experiment used 449 absence points, all coming from the same region, where the reviewed map reported probability values less than 0.1. Figure 4_CloseAbsenceData.jpg depicts the distribution of these absence data. We trained the network with all the presence and absence points. The best-performing neural network had 1 inner layer with 100 neurons. The map produced by this system is depicted in Figure 5_NeuralNetworkCloseAbsence.jpg and shows a wide spread across the ocean. The map overlaps the reviewed one, but it is quite far from the Aquamaps-Suitable distribution. The holes left by the neural network correspond mainly to low-probability points in the reviewed map. Figure 6_NeuralNetwork1VSReviewed.jpg depicts this overlap.

The second experiment used absence data chosen at random among the low-probability points of the reviewed map. Figure 7_RandomAbsenceData.jpg depicts the distribution of these absence data. We trained the neural network again with all these points. This time the best-performing network had 1 inner layer with 300 neurons. Figure 8_NeuralNetworkRandomAbsence.jpg depicts the resulting distribution. As the overlay map in Figure 9_NeuralNetwork2VSReviewed.jpg shows, this time the distribution is close to the one from the Aquamaps algorithm rather than to the reviewed map.
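
The two experiments differ only in how the absence points are drawn from the reviewed map. A minimal sketch of the two strategies follows, assuming the reviewed map is available as a dictionary from (latitude, longitude) cells to probabilities. The function names, the bounding-box representation of a region and the fixed random seed are hypothetical and are not taken from the Statistical Service.

```python
import random


def low_probability_cells(reviewed_map, threshold=0.1):
    """Cells whose reviewed probability is below the threshold (0.1 in the text)."""
    return [cell for cell, prob in reviewed_map.items() if prob < threshold]


def draw(candidates, n, seed=0):
    """Pick up to n candidates without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))


def regional_absences(reviewed_map, bbox, n, threshold=0.1, seed=0):
    """First-experiment style: absences taken from a single region.

    bbox is an assumed (min_lat, max_lat, min_lon, max_lon) tuple.
    """
    min_lat, max_lat, min_lon, max_lon = bbox
    candidates = [
        (lat, lon)
        for lat, lon in low_probability_cells(reviewed_map, threshold)
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    ]
    return draw(candidates, n, seed)


def random_absences(reviewed_map, n, threshold=0.1, seed=0):
    """Second-experiment style: absences drawn from all low-probability cells."""
    return draw(low_probability_cells(reviewed_map, threshold), n, seed)
```

With 449 presence records, calling either function with `n=449` would give a balanced training set, provided the map contains enough low-probability cells.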

We offer some comments on this result. If we assume that the neural network is working correctly and is able to learn something about the fish's preferences from the characteristics of the sea associated with the occurrence and absence points, this could indicate that the manually reviewed map was built on partial information about the fish. Furthermore, this could mean that the reviewer made the same considerations as the neural network. On the other hand, if we are certain that the reviewed map is correct, then we must admit that the information extracted from the sea is not sufficient to understand the fish's preferred habitat. Note that two automatic systems almost agree on a distribution for the fish that is far from the reviewed one, which could indicate an evaluation error in the reviewed map. A case like this could be useful for implementing an alert for a biologist who wants to manually revise a map.

Some final notes:

  • For the basking shark, all the maps are very similar in both the native and the suitable distributions.
  • The neural network was trained many times with different topologies, in order to use the best configuration in each experiment.
  • The neural network does not need expert knowledge to produce the map from the inputs, but it does need absence data, which come essentially from expert knowledge. This is something of a paradox, as neural networks are reported in the literature to be among the best-performing systems for producing distribution maps; the inputs still depend on human knowledge.
  • The values reported in the attached file "Aquamaps and NeuralNetwork performances.xlsx" refer to a training session using 449 presence and 449 absence points. Experiments were also run using 80% of the set for training and 20% for testing, and were repeated using 60% for training and 40% for testing. The above considerations remained valid.
  • Numeric comparisons were made in order to calculate the performance of the distributions. The attached Excel sheet gives the details.
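
The 80%/20% and 60%/40% sessions mentioned in the notes can be sketched as a simple holdout split. This is an illustration only: the function name is hypothetical, and splitting the presence and absence points separately (to keep the two classes balanced in both subsets) is an assumption, since the text does not say how the sets were partitioned.

```python
import random


def holdout_split(points, train_fraction, seed=0):
    """Shuffle the points and split them into (training, testing) lists."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for the 449 presence and 449 absence points.
presence = [("presence", i) for i in range(449)]
absence = [("absence", i) for i in range(449)]

for fraction in (0.8, 0.6):
    train_p, test_p = holdout_split(presence, fraction)
    train_a, test_a = holdout_split(absence, fraction)
    training_set = train_p + train_a
    testing_set = test_p + test_a
    print(fraction, len(training_set), len(testing_set))
```

Each point ends up in exactly one of the two subsets, so the testing set is never seen during training.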