--- title: "Gaussian and Poisson CRFs" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Gaussian and Poisson CRFs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The theory behind using nodewise regression to approximate Markov Random Fields and Conditional Random Fields models was developed with binary presence-absence data in mind. However, the same principle can hold for other data structures, as long as we recognise the key limitations. For Gaussian and Poisson variables, the primary issue with fitting MRF or CRF models in this fashion is that the outcome variables are not necessarily on the same scale. Consider a community where we have three species that show some associations in their abundances, but the ranges of these abundances are quite different (this could perhaps apply to a quickly-reproducing prey species and some wide-ranging predator species, for example). Here, we simulate a dataset with two fake covariates and three species to illustrate the problem ```{r} cov <- rnorm(500, 0.2) cov2 <- rnorm(500, 4) sp.2 <- ceiling(rnorm(500, 1) + (cov * 2)) sp.2[sp.2 < 0] <- 0 poiss.dat <- data.frame(sp.1 = ceiling(rnorm(500, 4) + (cov2 * 15.5) + (sp.2 * -0.15)), sp.2 = sp.2, sp.3 = ceiling((sp.2 * 1.5) + rnorm(500, 0.1))) poiss.dat[poiss.dat < 0] <- 0 poiss.dat$cov <- cov poiss.dat$cov2 <- cov2 ``` By investigating the ranges of the species' abundance vectors, we can see that they are quite different, with `sp.1` generally showing higher abundance than the other two species ```{r} apply(poiss.dat[, c(1:3)], 2, range) ``` This can cause a problem for fitting MRF / CRF models, as we are essentially regressing each species' abundance against the abundances of all other species (and covariates) and then *symmetrising* the respective parameter estimates (taking the `mean` of the two estimates, by default). This will undoubtedly lead to severe overestimates for the species with lower abundance ranges and underestimates for the common species. We can illustrate this here for `sp.1` and `sp.2` by running a simple `glm` for each species by including the other species as a predictor ```{r} sp.1.vs.sp.2 <- coef(glm(sp.1 ~ sp.2, family = 'poisson', data = poiss.dat))[2] sp.2.vs.sp.1 <- coef(glm(sp.2 ~ sp.1, family = 'poisson', data = poiss.dat))[2] mean.coef <- mean(sp.1.vs.sp.2, sp.2.vs.sp.1) # Exponentiate, due to the log link mean.coef <- exp(mean.coef) # Make plots of predicted vs observed # First, predict sp.1 (the common species) abundances plot(poiss.dat$sp.1, poiss.dat$sp.2 * mean.coef, xlab = 'Observed', ylab = 'Predicted') # Now predict sp.2 (the more rare species) abundances plot(poiss.dat$sp.2, poiss.dat$sp.1 * mean.coef, xlab = 'Observed', ylab = 'Predicted') ``` To overcome the issue of possible differences in ranges for Poisson variables, `MRFcov` uses a nonparanormal transformation method whereby each variable is `log2(x + 0.1)` transformed and mapped to a normal distribution via a copula approach prior to fitting the model([see here for more information](https://www.jmlr.org/papers/v10/liu09a.html)). The model is then fitted by considering the variables as Gaussian (e.g., there is no link function for these models). In doing so, the variables are all normalised without adding too much noise (though this transformation should always be recognised as a limitation). Any Poisson models that are fit will return an additional item called `poiss_sc_factors`, which will be necessary to back-transform outputs back to the variables' original distributions. This is done by estimating each variable's negative binomial or poisson parameters, depending on which distribution was more appropriate ```{r} library(MRFcov) poiss.crf <- MRFcov(data = poiss.dat, n_nodes = 3, family = 'poisson') poiss.crf$poiss_sc_factors ``` Our coefficient for the effect of `cov2` on `sp.1` above was `15.5`. To see how the model did in isolating this effect, we can get an *approximation* by converting the effect using the `sd` of the raw variable (this is because the model was fitted to normalised variables that were at unit variance) ```{r} sd(poiss.dat[,1]) * poiss.crf$direct_coefs$cov2[1] ``` Not bad, but again this is an approximation. However for generating predictions, `MRFcov` will first generate the normalised predictions and then map these to each raw variable's appropriate distribution (either negative binomial or poisson, depending on which was more appropriate). This is a better way to ensure that outcomes are converted to the correct range. For example, we can use our fitted model to predict the raw data and generate similar plots to those that we created above ```{r} poiss.preds <- predict_MRF(data = poiss.dat, MRF_mod = poiss.crf, n_cores = 1) plot(poiss.dat$sp.1, poiss.preds[,1], xlab = 'Observed', ylab = 'Predicted') plot(poiss.dat$sp.2, poiss.preds[,2], xlab = 'Observed', ylab = 'Predicted') ``` These plots should look much better. We can also inspect the outputs of the `cv` functions for Poisson data. `MRFcov` can use cross-validation (fitting models to specific subsets of the data and using resulting equations to predict observations for the remaining subset) to describe model fit and compare models with and without covariates. For `family = binomial` models, as seen at the end of the `MRFs and CRFs for Bird.parasites data` vignette, the functions return information on classification accuracy. But for Poisson data, it instead returns information on model deviance and mean squared error. We can showcase this functionality here using the fake data we have generated. We would expect that a CRF would fit the data better in this case, as this model includes the two covariates that we know are influential ```{r} poiss.cv <- cv_MRF_diag(data = poiss.dat, n_nodes = 3, n_folds = 5, n_cores = 1, family = 'poisson', compare_null = TRUE, plot = FALSE) # CRF (with covariates) model deviance range(poiss.cv$Deviance[poiss.cv$model == 'CRF']) # MRF (no covariates) model deviance range(poiss.cv$Deviance[poiss.cv$model != 'CRF']) ``` Lower deviance for the CRF model confirms the idea that inclusion of covariates is supported. Please note that the type of normalisation used above **only applies to Poisson models!!** Gaussian variables are not standardised before fitting CRFs / MRFs, so users should inspect the ranges of their variables and consider using the `scale` function to standardise beforehand if need be. The reason Gaussian models do not do this by default is that there are many options for normalisation / standardisation for Gaussian data and so users may wish to have more control over how this is done before fitting models