RMark and R Markdown

The following is much more specialized than previous posts. Here I focus on doing capture-mark-recapture analyses with the package RMark, specifically the known fate model, and integrating the results with R Markdown.

A brief history and introduction:

I used MARK at a very cursory level for my PhD to calculate red squirrel densities using the robust design model - there is a good M.Sc. thesis here for anyone who is interested. But things really got going when I was fortunate enough to take the Program MARK Intermediate Workshop in 2010. Since then, I have used Program MARK extensively for my work on caribou survival. I love MARK and am much stronger statistically for having used it, not just in terms of the information-theoretic approach but also in my understanding of maximum likelihood and how statistical models actually work, i.e., how the matrix of independent variables (the design matrix) interacts with the parameter matrix (the betas).

Given my interest in optimizing workflows and producing good, reproducible research (previous post), I was intrigued when I heard about RMark, an R package for interfacing with MARK, especially because I was experiencing several significant workflow issues in MARK. First, there are the design matrices (DMs). DMs are not hard to produce in MS Excel, but after returning to work after months away, it is often not clear why a particular structure was chosen or what the source of data for the covariates was (this could be alleviated by good notes or metadata, but I often don't write these down when I'm learning a new model or when I'm in a rush). DMs also tend to get large and cumbersome for my analyses, and MARK seems to have limited functionality for cut/paste, which increases the chance of a copy/paste error. There are the usual version issues, e.g., I have three DM files - which is the right one? DMs also need to be modified as new data are added (e.g., years or new explanatory variables). Second, I usually want MARK output in the form of AIC tables and (derived) parameter estimates so I can plot them. Using MARK, this means exporting a csv file from MARK and having a separate R file for producing the figure and/or knitting outputs by hand into a document (see previous blog). Third, there is the more general problem of re-running analyses in MARK, which can be tedious and contributes to multiple file versions. As these three issues became increasingly problematic, RMark, coupled with R Markdown, looked more attractive (note that R Markdown is a simple language for creating dynamic documents from R - it is not related to MARK despite the similarity in name - click on these links for an intro and tutorial).

Before I go further, I would really like to thank Jeff Laake and Eric Rexstad for integrating these two really fantastic pieces of (free!!!!) software - it is a great piece of work, and the RMark appendix to the Gentle Introduction (GI) is thorough and extensive. The authors continue the excellent tradition of the GI by making sure that RMark is not used as a black box. However, and this may just be me, I found the RMark chapter much harder than other chapters in the GI; I got bogged down in the details and had a lot of difficulty at virtually every stage.

Objective:

So the following is meant to spare you some of my pain/confusion, i.e., it is a brief tutorial on RMark focusing on the known fate model and integrating it with R Markdown. It is meant to get you up and running quickly; for the full details, the GI is, and always will be, the bible! Also, a solid understanding of MARK and how it works is needed for this blog to be of any use - again, no software should ever be used as a black box. I'll also make the disclaimer that everything below is based on my experience and, therefore, my limitations. Someone with more R ability may have a better way.

Click here for the *.R files for the main code and associated functions and the dummy data (*.inp file) on caribou survival in Newfoundland and Labrador - my apologies, but these are *.docx files because WIX will not allow other file formats - you'll need to copy and save them as appropriate. Briefly, adult caribou were collared and monitored over a number of years. Each row in the encounter history has a dummy id variable. The encounter history for the known fate model in MARK (see the GI for more details) uses the LDLD format. The first position (L) indicates whether an animal is alive at the beginning of an interval (1) or not tagged (0). The second position (D) indicates whether an animal is alive (0) or dead (1) at the end of an interval. In this study the interval is one month, so each month is coded as 10 (alive at the end) or 11 (dead at the end). 00 means that the animal was not tagged at the beginning and is censored from the analysis; 01 is not valid. Since survival is measured month-to-month, there are 12 LD pairs per row. Then there are 3 columns to represent the group variable year. And finally, there are two individual covariates - the year the animal was captured and age since capture. The second file contains various climate and weather data, such as the North Atlantic Oscillation (NAO). I will not describe these further, as NAO is the only variable in this data set I will use. Most of the covariates were summaries of climate and weather variables that came from different data sources. These required a fair bit of data wrangling, so, in the spirit of reproducible research, it seemed best to bring the two data streams into R separately and merge them for this analysis.
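To make the LDLD coding concrete, here are a few hypothetical rows in the style of the *.inp file (the ids, group assignments, and covariate values are invented for illustration; the real file has the 12 LD pairs, 3 year-group columns, and the two individual covariates described above):

```
/* 12 LD pairs, then 3 year-group columns, then capture year and age since capture */
/* id 001 */ 101010101010101010101010 1 0 0 1 2;  /* alive at the end of every month, year 1 */
/* id 002 */ 101011000000000000000000 0 1 0 2 4;  /* died at the end of month 3, year 2 */
/* id 003 */ 000000001010101010101010 0 0 1 3 1;  /* collared at the start of month 5, year 3 */
```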

Tutorial:

In this blog, I'll provide a general and/or specific explanation for each step and then show the annotated code for context. The exception is the first step which is simply to load these data into R.

##################################################################################################

rm(list=ls())

#cleanup(ask = FALSE)

library(RMark)

library(ggplot2)

#read in source code

source("P:/projects/labrador/survival-analyses/function.R")

# set directory for the encounter histories

setwd("P:/projects/labrador/survival-analyses/blog")

# bring in the group covariates

group.covars <- read.csv("summary-gca-nao-group.csv", header=TRUE, sep=",")

group.covars

getwd()

##################################################################################################

RMark - First step - General:

The first main step with RMark corresponds to the main menu in MARK (i.e., "Enter specifications for MARK Analysis") and the creation of the design matrix. Now that I'm used to it, I like the code-based approach better (how many times have you started a file only to have to go and find the little tiny errors in the *.inp file and then re-enter (cut/paste) everything!).

First step - Specific:

The 'convert.inp' function is pretty straightforward - it brings the *.inp file into R, just like the "Click to Select File" button in MARK. This includes the group and individual covariates. The 'process.data' function is also straightforward. It takes the place of the "Select Data Type" box, as well as the encounter occasions, attribute groups, and individual covariates boxes. This function creates a list with this information as well as the encounter histories - it can be a bit finicky, but as with all things in MARK, getting it right means you have a good understanding of what you are doing. The model and time.intervals arguments are simple: in this case I wanted a known fate model ("Known"), the time interval is one month (1/12), and I had 3 years. For right now, just leave begin.time at the default or enter year1 - I'll explain this in more detail below. There are also a number of handy arguments that are not in MARK; for example, the counts argument can be included, which is often useful in later output.

Once the "main menu" is filled out, the next step is creating the design data using "make.design.data". The code here is really straightforward, although I struggled with the concept. Essentially, this function creates the Parameter Index Matrix (PIM) structure and combines it with "design data", which makes it possible to use formulae to create models in R. First, note that the object created by make.design.data does not show a PIM in a form that corresponds to PIMs in MARK. To see a PIM, first run a model and then use the PIMS function (see below). However, the PIM structure is represented in the make.design.data output by the indices. In this case, note that there are 3 years with 12 months/year = 36 occasions, and therefore there are 36 rows in df3.ddl (just like the 36 rows in the DM and 36 parameters in the PIM in MARK)! Also notice that there are a lot of other fields in df3.ddl that are not normally seen in MARK. Some of these are quite simple. For example, par.index is the index within the simplified set of real parameters; these will tend to repeat in models with both phi and p. model.index serves a similar function, except it applies to the whole model, so it does not repeat. These indices are therefore redundant in this case. As stated in the GI, by convention 'Time' and 'Age' are continuous versions of the factors 'time' and 'age'. Time starts at 0, time does not - this will be important later. Another way to think about it is:

  1. p(t) represents a model with a parameter for each occasion (time)

  2. p(T) represents a linear trend model with a single slope (plus intercept).

So at this point, the data are brought into R, the "main menu" is filled out, and the design data are created. What does this all do? Essentially, RMark uses the function model.matrix behind the scenes to combine a formula you specify later (say constant survival, or S(.)) with the design data (PIM and data) to create the DMs used to calculate the beta and real parameters in MARK. With the design data step, you do not physically have to create a DM like in MARK - that is done for you - pretty cool!!!
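You can see the factor-versus-continuous distinction (and what model.matrix does with a formula) in a toy base-R example, independent of RMark; the data frame here is invented purely for illustration:

```r
# Toy design data: 'time' as a factor, 'Time' as its continuous version starting at 0
dd <- data.frame(time = factor(1:4), Time = 0:3)

# A factor yields an intercept plus k-1 indicator columns - the S(t) structure
model.matrix(~ time, dd)  # 4 columns: (Intercept), time2, time3, time4

# A numeric yields an intercept plus a single slope column - the S(T) structure
model.matrix(~ Time, dd)  # 2 columns: (Intercept), Time
```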

One final step is particular to these data: I combined df3.ddl with all the group covariates that were not convenient to merge into the *.inp file. As indicated above, the caribou data were completely separate from most of the climate explanatory variables, so I brought these in separately and merged them at this step.

So now, the sometimes painful process of getting data into (R)MARK is over - congratulations! While this took me longer to figure out than I care to admit, and I went through phases of giving up and cursing RMark, I really do think it is pretty elegant once all the quirks are figured out.

#read in data and convert

###########################################################################

# bring in the *.inp file

year.inp <- c("year1", "year2", "year3")

year.inp

covars <- c("age")

df3 <- convert.inp("KF-data-v1.inp",
                   group.df = data.frame(year = year.inp),
                   covariates = covars)

str(df3)

head(df3)

df3

# process the new dataframe - creates a list of encounter histories, frequency matrix, time intervals,

# beginning times, group covariates, counts, etc.

#year.process <- c(1, 2, 3) # this could be used in begin.time but it makes things complicated. It's up to you, but know what you are doing!

#year.process

count.process <- c(15, 14, 30)

count.process

# see blog text. begin.time can be entered as a vector; this should be the same in terms of derived variables but it does get messy

df3.process <- process.data(df3, model = "Known", groups = c("year"),
                            time.intervals = rep(1/12, 12),
                            begin.time = 1,
                            counts = count.process)

str(df3.process)

df3.process$data

df3.process$freq

df3.process$time.intervals <- round(df3.process$time.intervals, 4)

df3.process$time.intervals

# make the design data, e.g. group, age, time (factors, Continuous, year)

df3.ddl <- make.design.data(df3.process)

str(df3.ddl)

df3.ddl

write.csv(df3.ddl$S, "check-dd.csv", row.names=F, na="")

# add additional group covariates, all weather/climate in this case

df3.ddl$S <- merge_design.covariates(df3.ddl$S, group.covars, bygroup=T, bytime=F)

df3.ddl$S

str(df3.ddl$S)

write.csv(df3.ddl$S, "check-dd2.csv", row.names=F, na="")

###########################################################################

Second step - General: Now it's time for the fun stuff. Begin by running a single model and looking at the PIMs (which is always good to do). Then, create a function for running a suite of models - one of the main advantages of multi-model inference.

Second step - specific:

Run a single model using the function "mark": simply specify the processed data and the design data and create a linear model as follows. Then display the PIM, just to make sure that all is well. The PIMS function in RMark doesn't do a great job of this in this situation, nor does it create an R object, so Jeff Laake has generously put an alternative function on Phidot, which I use below.

Creating single models is fine, but it's much more convenient to create a function. First, define a number of models using evocative names. Then create a list of models as shown below, run it, and examine the output.
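The actual run.knownfate() wrapper lives in the sourced function.R file linked above; as a hedged sketch, such a wrapper in RMark typically follows the create.model.list/mark.wrapper pattern from the GI. The model names below match the results objects used later in this post, but the exact formulas (and the nao covariate column name from the merge step) are my assumptions, not necessarily the file's contents:

```r
# Sketch of a model-suite wrapper in the style of the RMark/GI examples.
# Assumes df3.process and df3.ddl exist as created above, and that the
# merged design data contain an 'nao' column.
run.knownfate <- function() {
  # each object defines one candidate model for S via a formula
  S.dot            <- list(formula = ~1)            # constant survival, S(.)
  S.age            <- list(formula = ~age)          # individual covariate
  S.nao            <- list(formula = ~nao)          # climate covariate from the merge
  S.time.int.group <- list(formula = ~time * group) # full time-by-year model

  # collect all objects named S.* into a model list and run them through MARK
  cml <- create.model.list("Known")
  mark.wrapper(cml, data = df3.process, ddl = df3.ddl)
}
```

mark.wrapper returns a marklist, which is why caribou.results$model.table and caribou.results$S.age etc. work below.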

###########################################################################

#create some models

# this model is consistent in all respects with the one produced in MARK, i.e. the output files are almost exactly identical, with the exception of 1) the betas (which can differ yet produce the same model output - differences are due to equivalent but different DM structures) and 2) some rounding error

time.int.group <- mark(df3.process, df3.ddl, model.parameters=list(S=list(formula=~time * group)))

str(time.int.group)

time.int.group

#check design matrix

head(time.int.group$design.matrix)

write.csv(time.int.group$design.matrix, "check-dm.csv", row.names=F, na="")

# check PIMs - the PIMS function output is hard to read this way. But the sourced function file has code from Laake (myPIMS) that makes for a readable and exportable PIM

PIMS(time.int.group, "S")

phipim <- myPIMS(time.int.group, "S", simplified=F)

phipim

pim.out <-do.call("rbind", phipim)

write.csv(pim.out, "check-pim.csv", row.names=F, na="")

# run all models - all below matches MARK perfectly except for NAO (see below for explanation)

caribou.results = run.knownfate()

caribou.results

caribou.results$S.age$results$derived$S

##########################################################

So this is great. Although it's probably quicker to do this in MARK a single time, the above is annotated code to run a suite of models. This is completely reproducible, and if any changes are required, the analysis can be rapidly redone and the changes are clear. But first, a few warnings, because this is where I really had some problems:

  1. Where I really had trouble was specifying the model. Specifically, I wanted to set up the t*g (time * group) model just like in MARK. The problem was that I followed the advice of many websites and used a vector for "begin.time" (e.g. "2003", "2004", etc. - see this link for an example) but did not understand what this was doing to the design data. Basically, in MARK there is a column for the intercept, plus k-1 columns each for time and group (k-1 = 11 for time with 12 levels, and k-1 = 2 for year with 3 levels - see GI Chapter 6 if you don't understand this). The interaction therefore has 11*2 = 22 columns, so there are 1 + 11 + 2 + 22 = 36 columns in the DM. In RMark, specifying "begin.time = 2003" creates a similar DM (but see the next point for differences between MARK and R). But if you use a different begin.time for each group, a column is created for each time period within each group (12*3 - 1 = 35). In this case, the interaction term is set up very differently: because there is a column for each group*time-period, only 24 columns are needed (36 - 12 = 24), giving a different interaction term for each parameter (see GI Chapter 6 for more details). So, 1 + 35 + 2 + 24 = 62 columns. Obviously the DMs are different, and therefore so are the betas and the real parms. But the good news is that the derived variables are practically identical because of the delta method (see the Appendix in the GI). I suspect the delta method could also be used to make other outputs, like monthly survival, equivalent. This gets back to what GI Chapter 6 says about different DMs producing similar results. Moral of the story - know what you are doing, and check the ddl file real careful like, as recommended by Laake and Rexstad (and throughout the GI).

  2. The second problem was very simple and, as it turns out, a problem with my understanding, not with the code. I was dutifully checking my output, which now matched my MARK output for derived variables (to 4-5 decimal places), but I expected the rest of the output to be identical. However, the beta estimates were not even close. In this case I had done it right but didn't fully understand the output and had, unfortunately, not read Jeff's post on Phidot. The reason is that although RMark runs directly through MARK, MARK uses the SAS convention where the last factor level is the reference (intercept), while RMark uses the R convention where the first factor level is the reference. Plus, RMark adds a whole bunch of extra "stuff" into the output. Moral of this story - as long as the general DM structure is the same, the results should be the same (but note that different DMs can produce the same outcome (see the Gentle Intro), just as different betas can produce the same real parms). I recommend reading Laake's above post on Phidot.

  3. The model for NAO (S.Nao) gives slightly different results than MARK. This is because (I think) MARK calculates a real parameter for every parameter in the PIM. It seems that RMark does not do this and calculates it only for every level. Therefore, there are 36 real parameters in the MARK output and 3 real parameters in the RMark output. Fortunately, they are virtually equivalent when averaged.

  4. Remember from above that by convention, 'Time' and 'Age' are continuous versions of the factors 'time' and 'age'.
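The factor-coding difference in point 2 above can be seen directly in base R, with no need for MARK or RMark (a toy example using the built-in contrast functions):

```r
# R's default treatment contrasts use the FIRST level as the reference (intercept),
# whereas contr.SAS - the convention MARK follows - uses the LAST level.
# Rows are factor levels; the all-zero row is the reference level.
contr.treatment(3)  # reference is level 1
contr.SAS(3)        # reference is level 3
```

Same model, different parameterizations: the betas differ, but the real parameters they produce are identical.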

But enough of the nasty stuff - time for some of the cool stuff. Once the models are run, values can simply be extracted from the results list for the desired purpose, e.g., put into a table, figure, or text, and you never have to leave the R environment!!! The only issue here is the lists. If you are not good with lists (like me), this is a good wake-up call. The hard part is that the lists in RMark output tend to be voluminous and deeply nested, so finding the proper subscript for the dataframe within the list can be challenging. But here are a number of examples that could be helpful. See the end of the RMark section in the GI for a great explanation of lists in R.

Below are some examples of how to extract values from models.

##############################################################################

# show a variety of real and derived results and creates objects for some

s.t <- caribou.results$S.time.int.group$results$real

s.t

s.dot <- caribou.results$S.dot$results$real

s.dot

s.dot$estimate

s.dot$lcl

s.dot$ucl

x <- s.dot$estimate

s.t.d <- caribou.results$S.time.int.group$results$derived

s.t.d

s.dot.d <- caribou.results$S.dot$results$derived

s.dot.d

s.nao.d <- caribou.results$S.nao$results$derived

s.nao.d

caribou.results$S.nao

#gives the elements of results

caribou.results$S.time.int.group$results

#extract elements from results

caribou.results$S.time.int.group$results$npar

caribou.results$S.time.int.group$results$beta

##########################################################

#For the R Markdown report

#convert the naming to something standard

df <- s.t.d$S

df

str(df)

#merge the dataframes

df <- cbind(df, df3.process$counts, df3.process$group.covariates)

names(df)[names(df)=="df3.process$counts"] <- "count"

df

# get the S.dot

df1 <- s.dot

df1

#export survival by year and for constant survival (s-dot)

write.csv(df, "KF-output-s-t-d.csv", row.names=F, na="")

write.csv(df1, "KF-output-s-dot.csv", row.names=F, na="")

write.csv(caribou.results$model.table, "KF-output-aic-table.csv", row.names=F, na="")

##########################################################

Click here for an example of how to integrate the above analysis in a simple report in R Markdown.

##########################################################

# some useful websites

