A question of scale: Replication and the effective evaluation of conservation interventions

Conservation interventions can keep critically endangered species from going extinct and stabilize threatened populations. However, the species-specific, case-by-case approaches and small sample sizes inherent to applied conservation measures are not well suited to scientific evaluations of outcomes, and debates about whether a method "works" become entrenched in a vote-counting framework. Furthermore, population-level replication is rare, but it is necessary for disentangling the effects of an intervention from other drivers of population change. Turtle headstarting is a conservation tool that has attracted strong opinions but little robust data. Logistical limitations, such as those imposed by the long lives of turtles, have slowed experimental evaluation and constrained the use of replication or experimental controls. Headstarting goals vary among projects and stakeholders, and success is not always explicitly defined. To facilitate robust evaluations, we provide direction for data collection and reporting to guide the application of conservation interventions in logistically challenging systems. We offer recommendations for standardized data collection that allow projects of any magnitude to contribute their valuable results to the development of best practices. An evidence-based and collaborative approach will lead to improved program design and reporting, and will facilitate constructive evaluation of interventions both within and among conservation programs.


Introduction
Rapid global declines in biodiversity linked to human impacts are a growing conservation concern (Dirzo et al. 2014). In the face of precipitous losses, applied conservation interventions have become increasingly necessary. The appropriate conservation intervention to prevent extinction depends on the biology of the target population or species and the threats it faces (Scott et al. 2005). Evaluating the effectiveness of previous applications, and reporting the circumstances under which they occurred, is critical to developing best practices for an intervention. However, evaluations of the overall effectiveness of a conservation intervention (i.e., "does it work?") often use vote counting, in which the number of projects perceived as successes and failures is tallied to determine an overall score (Stewart 2010; Koricheva and Gurevitch 2013). Vote counting has been widely criticized, and its statistical drawbacks are numerous: results are inherently biased because studies of different sample sizes are given equal weight (one vote), and small sample sizes are less likely to provide significant results (Friedman 2001; Scheiner and Gurevitch 2001; Hedges and Olkin 2014). Vote counting results in either accepting or discounting a tool, rather than focusing on the circumstances in which the tool was applied (e.g., Ricciardi and Simberloff 2009; Miteva et al. 2012). This approach creates a polarized, yes-or-no framework that is poorly suited to evaluating conservation interventions in complex, real-world situations, limiting critical discussion about appropriate applications of conservation tools.
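To make the statistical objection concrete, the following sketch (Python) contrasts vote counting with a precision-weighted fixed-effect meta-analysis; the five "project" effect sizes and standard errors are invented for illustration, not drawn from any real study.

```python
import numpy as np

# Hypothetical per-project effects of an intervention on post-release survival
# (log odds ratios) with standard errors; smaller projects have larger SEs.
effects = np.array([0.40, 0.35, 0.50, -0.10, 0.05])
se = np.array([0.15, 0.20, 0.25, 0.60, 0.70])

# Vote counting: tally projects whose 95% CI excludes zero.
significant = np.abs(effects / se) > 1.96
print(f"vote count: {significant.sum()} 'successes' vs "
      f"{len(effects) - significant.sum()} 'failures'")
# -> 2 'successes' vs 3 'failures': the tool appears to fail more often than not.

# Fixed-effect meta-analysis: weight each project by its precision (1/SE^2).
w = 1.0 / se**2
pooled = np.sum(w * effects) / np.sum(w)
ci = 1.96 * np.sqrt(1.0 / np.sum(w))
print(f"pooled effect: {pooled:.2f} (95% CI {pooled - ci:.2f} to {pooled + ci:.2f})")
# -> a clearly positive pooled effect; the small, noisy projects were simply
#    underpowered, which is exactly the bias described above.
```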
We argue that "does it work?" (e.g., Dodd and Seigel 1991;Mitrus 2005;Pérez-Buitrago et al. 2008;Miteva et al. 2012) is overly simplistic and reduces our ability to effectively evaluate conservation interventions. First, "does it work?" is difficult to answer without a consistent, explicit definition of a "working" intervention. This definition may vary widely among studies (e.g., Mitrus 2005;Pérez-Buitrago et al. 2008). Second, even if the term is clearly defined, the question still leads directly to a limiting yes or no response. We suggest that this language may stymie conservation efforts and creativity by discouraging the testing of techniques that may prove appropriate and valuable for particular conservation efforts. We are obviously unable to use the peer-reviewed literature to quantify how many conservation interventions have not been tested. However, in the course of our conservation work we have experienced several examples of "does it work?" approaches preventing "risky" but potentially useful research on conservation intervention. To provide one example, one of us observed two senior researchers at a conference sharing their strong opinions that a particular conservation intervention "does not work" (this is a direct quote, and they are of course entitled to their expert opinions). No evidence was provided at the time, and these strong statements led directly to a graduate student changing the focus of their research away from that intervention, despite the fact that its efficacy has never been rigorously evaluated.
We suggest that the critical question for evaluation of a conservation intervention should be whether its application achieves the explicitly defined objectives of the intervention (e.g., stabilizing or reversing a population decline, increasing the survivorship of an age class, expanding the range of the species, or establishing a viable assurance population). The success of most interventions will depend on the details, so projects should also experimentally evaluate best practices (e.g., husbandry and release strategies) (Pullin and Knight 2009; Smith et al. 2010).
A major challenge in evaluating conservation successes is defining the scale of evaluation, and then finding adequate and appropriate replicates at that scale. When testing the impact of a conservation intervention on individual fitness, sample size is equal to the number of individuals in the treatment and control groups. However, if the objective is to evaluate the impact of an intervention on population growth, then the sample size is equal to the number of study populations involved. One cannot evaluate the effectiveness of an intervention on population growth rates by studying a single population, no matter how much research is done on that population. The level of replication for such an analysis is the population; therefore, the sample size is one. In an experimental framework, the proper comparison would be among multiple populations in which the conservation intervention is applied and several control populations, where site-level differences are accounted for in a quantitative analysis. Evaluating the population-level effects of conservation techniques therefore requires either studying multiple populations at once, or meta-analytical approaches that pool data from multiple projects (e.g., Marczak et al. 2010;Branton and Richardson 2011).
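As a concrete illustration of this point, consider a sketch (Python) in which each population contributes a single estimated growth rate; the population counts, λ values, and the simple Welch t-test are illustrative assumptions, not a prescribed design.

```python
import numpy as np
from scipy import stats

# Hypothetical estimated annual population growth rates (lambda), one per
# population; all values are invented for illustration.
headstarted = np.array([1.02, 0.99, 1.05, 1.01])  # populations receiving the intervention
controls = np.array([0.97, 0.95, 0.99])           # comparable reference populations

# The unit of replication is the population: n = 7 here, regardless of how
# many individual turtles were marked or tracked within each population.
t, p = stats.ttest_ind(headstarted, controls, equal_var=False)
print(f"n = {headstarted.size + controls.size} populations, t = {t:.2f}, p = {p:.3f}")
```

In practice, site-level differences would enter as covariates or random effects in a mixed model, but the unit of replication would remain the population.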
In this paper, we examine the problem of inadequate replication in evaluations of conservation interventions. We selected turtle headstarting as an example to illustrate our concerns because we have worked directly with this method, and because it has generated a great deal of controversy, advocacy, criticism, and debate in the literature. Despite decades of peer-reviewed studies on headstarting in single populations (reviewed by Burke 2015), the method itself has not yet received a great deal of hypothesis-driven or evidence-based evaluation. This is partly due to turtles' long-lived life history strategies. Turtles are one of the most vulnerable vertebrate groups in the world, with approximately half of the described species listed as threatened with extinction (Turtle Conservation Coalition 2011). Most species of turtle are long-lived, exhibit delayed sexual maturity, and experience high embryo and juvenile predation rates (Gibbons 1987; Wilbur and Morin 1988; Congdon et al. 1994). Recovery from population-level mortality events is extremely slow, requiring several decades (Brooks et al. 1991; Congdon et al. 1993, 1994; Doak et al. 1994; Heppell et al. 1996; Heppell 1998). This life history complicates conservation efforts, particularly where multiple, emergent threats exist (e.g., Crawford et al. 2014).
The goal of turtle headstarting is to boost recruitment to declining populations (Burke 2015), but recruitment to reproductive age classes is rarely measured directly, and is not compared with appropriate control populations. Thus, arguments on both sides of the debate remain largely based on within-population analyses, despite decades of excellent research on particular populations (e.g., Mitrus 2005; Shaver and Rubio 2008; Nagy et al. 2015). To provide context to our discussion, we briefly review the historical application of headstarting as a conservation tool for turtles. Next, we examine the current critique of headstarting and ask how success has been measured in turtle headstarting to date. We then examine evaluations of headstarting at different scales of analysis. Finally, we advocate for a hypothesis-testing framework that recognizes the limitations of data collection at different scales, recommends appropriate data collection and communication at those scales, and ultimately facilitates meaningful comparisons among conservation projects.
All data have value, but not all data can answer the same questions. It is not our intention to prescribe one set of methods to all conservation projects, but rather to encourage projects of all sizes to maximize their potential impact. To this end, we offer clear criteria for success at various timescales, and clarify the appropriate units of comparison for scientific evaluation. Large, well-funded projects with an explicit research component may have the capacity to collect and analyze large data sets. Smaller, purely applied projects can increase their conservation impact by contributing their results to larger meta-analyses, with minimal additional effort or cost. A lack of data collection during individual conservation intervention projects does not diminish their potential impact on the target population(s), and data collection may not be of interest to all project managers. Nevertheless, a collaborative approach in which relevant projects of all sizes and capacities collect and share standardized data can expand their collective impact, providing evidence-based tools that can be more broadly applied to wildlife conservation, and that are adaptable to a variety of circumstances.
Headstarting refers to the ex situ rearing of individuals hatched in captivity from wild eggs or collected as hatchlings from the wild (Burke 2015). Originally established on an experimental basis for endangered sea turtles (Pritchard 1979), headstarting programs are now increasingly used in recovery programs for tortoises and freshwater turtles (Seigel and Dodd 2000). The goal of headstarting is to circumvent low hatching success and (or) high juvenile mortality to increase recruitment, often by accelerating growth in captivity (Moll and Moll 2000), resulting in stabilized or increased population growth rates (Frazer et al. 1990; Iverson 1991; Spinks et al. 2003). Headstarting can also be used to rear turtles for reintroduction (van Leuven et al. 2006; Buhlmann et al. 2015). Once turtles hatch in captivity, they may be released soon after hatching (often termed artificial incubation: García et al. 2003), or reared indoors for a period of time before translocation into native habitat (e.g., Haskell et al. 1996; Mitrus 2005; Buhlmann et al. 2015; Nagy et al. 2015). Below, we use turtle headstarting to highlight the problem of evaluating conservation interventions at appropriate scales, drawing from our own experiences in freshwater turtle headstarting and examples from relevant literature.
Headstarting and captive breeding are generally evaluated as separate conservation interventions. However, in the absence of parental care, the experience of the hatchling turtle, whether from a wild or captive-sourced egg, is ultimately the same: protection in captivity until release. Thus, evaluations of headstarting as an intervention can encompass projects using turtle eggs from any source, although selective pressures (e.g., local adaptation, artificial selection) warrant consideration in evaluations of post-release success. Although such selection is unlikely to be as strong in captive turtles because of their different mating strategies and long generation times (but see Jensen et al. 2015), selection on hatchery-raised fish can have major impacts on post-release survival (Le Cam et al. 2015; Jensen et al. 2016), and this possibility should not be discounted in other taxa.

Critiques of the headstarting method for turtles
Whether turtle headstarting boosts recruitment to populations has generated much controversy and debate (Woody 1990; Heppell et al. 1996; Seigel and Dodd 2000). Some of the concerns raised about headstarting reflect a lack of post-release monitoring and reporting of individual turtle outcomes, resulting in speculation about the relative fitness of headstarted turtles and the health of headstarted populations. Critics point out that habituation to captivity could affect predator avoidance (Frazer 1992; Meylan and Ehrenfeld 2000), foraging (East and Ligon 2013), and habitat selection behaviours (Okuyama et al. 2010). Headstarts from temperate populations that are kept active year-round in captivity to maximize growth will lack overwintering experience prior to release, potentially compromising post-release overwintering survivorship (Frazer 1992; Bjorndal et al. 2003). Furthermore, omitting overwintering can reduce reproductive potential in smooth green snake (Opheodrys vernalis) headstarts, where brumation appears necessary to trigger gametogenesis (Sacerdote-Velat et al. 2014). This may also prove to be a concern for turtles.
Critics have also raised concerns about the spread of disease from captive to wild populations (Flanagan 2000; Moll and Moll 2000; Seigel and Dodd 2000). Disease management is a critical component of any captive management program, and the introduction of novel pathogens or increases in the prevalence of endemic pathogens can devastate a declining population. However, disease management relates directly to the management of captive populations prior to release, rather than to the effectiveness of headstarting (or any other conservation intervention). Headstarting programs can largely eliminate disease concerns by adhering to strict biological controls in captivity and assessing disease risk prior to reintroductions (e.g., Jakob-Hoff et al. 2014). The conflation of concerns around disease management and concerns about the effectiveness of headstarting further illustrates the need for replicated, experimentally structured evaluations of this conservation method. Disease is a potential confounding factor that could threaten the success of particular projects, but is unlikely to affect many projects equally. In a vote-counting "does it work?" framework, conflating poor husbandry and disease management with headstarting itself would result in "votes" against the efficacy of headstarting, even though disease would be responsible for the poor results.
Other critiques of turtle headstarting arise from demographic models comparing the impact of headstarting on population growth rates with that of increasing adult survivorship. Elasticity analyses of population models show that adult female survivorship has the highest impact on population growth rates, and several authors have interpreted these results to mean that resources devoted to increasing juvenile recruitment would be better spent on protecting adults (Congdon et al. 1993; Heppell et al. 1996; Heppell 1998; Mitrus 2005; Enneson and Litzgus 2008). These criticisms tend to consider headstarting in isolation, without accounting for concurrent mitigation of the initial causes of population declines (Dodd and Seigel 1991; Frazer 1992; Heppell 1998; Meylan and Ehrenfeld 2000; Vander Haegen et al. 2009; Smeenk 2010; Crawford et al. 2014). Models incorporating multiple conservation actions may be more realistic and may lead to more successful management plans for threatened populations. For example, models of conservation interventions for turtle populations simultaneously impacted by road mortality and nest predation showed that reducing adult mortality is insufficient to increase the population growth rate without simultaneous nest protection (Crawford et al. 2014).
Furthermore, elasticity values based on theoretical models may not accurately represent biological reality, as interpretation of model outcomes is restricted by assumptions (i.e., time invariance, density independence, and population sex ratio) that do not necessarily reflect real-world population dynamics (Benton and Grant 1999). A lack of information on the survivorship and growth rates of wild juvenile turtles also means that model parameters are estimated from similar species or other life-history stages (e.g., Heppell et al. 1996; Mitrus 2005), which may not accurately represent the target population. Although modelling exercises and elasticity analyses can lead to valuable insights for focusing conservation efforts (e.g., Doak et al. 1994; Govindarajulu et al. 2005), it is important to remember that models are hypotheses to be tested and refined rather than perfect reflections of population processes. Uncritical acceptance of model outcomes, particularly under a vote-counting paradigm, may mislead conservation efforts away from what could otherwise be effective interventions.
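To make the basis of these elasticity arguments concrete, the following minimal sketch (Python) computes λ and the elasticity matrix for a hypothetical three-stage turtle matrix. All vital rates are invented for illustration; as the critique above implies, the conclusion shifts with the assumed rates.

```python
import numpy as np

# Hypothetical stage-structured (Lefkovitch) matrix: juvenile, subadult, adult.
# All vital rates are invented to mimic a long-lived turtle, not estimated
# from any real population.
A = np.array([
    [0.00, 0.00, 0.60],  # fecundity: female hatchlings per adult female per year
    [0.25, 0.60, 0.00],  # juvenile survival; subadults remaining subadult
    [0.00, 0.15, 0.93],  # subadults maturing; adult annual survival
])

# Dominant eigenvalue = asymptotic growth rate (lambda); the right and left
# eigenvectors give the stable stage distribution (w) and reproductive values (v).
vals, vecs = np.linalg.eig(A)
i = np.argmax(vals.real)
lam = vals.real[i]
w = np.abs(vecs[:, i].real)

vals_t, vecs_t = np.linalg.eig(A.T)
j = np.argmax(vals_t.real)
v = np.abs(vecs_t[:, j].real)

# Elasticity: proportional change in lambda per proportional change in a_ij;
# the entries sum to 1.
sens = np.outer(v, w) / (v @ w)
elas = A * sens / lam

print(f"lambda = {lam:.3f}")        # ~0.99 with these assumed rates
print(np.round(elas, 3))
# The adult-survival entry (bottom-right, ~0.78 here) dwarfs all others; this
# is the pattern behind "protect adults first" advice, but it changes as soon
# as the assumed, time-invariant vital rates change.
```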

How has success been measured?
As with other types of intervention, definitions of success vary considerably among headstarting projects, though they largely segregate into the attainment of either short- or long-term goals. Few programs have reported on long-term goals, though there is some limited evidence of headstarts surviving to adulthood (Bell et al. 2005), reproducing (Vander Haegen et al. 2009), and apparently altering the age structure of a population (Spinks et al. 2003). Details of rearing techniques may not be described in the literature (e.g., Haskell et al. 1996; Mogollones et al. 2010; Bona et al. 2012), and few studies explicitly compare the success of headstarted individuals with that of wild individuals (but see Spinks et al. 2003; Mitrus 2005). Rigorous, scientific evaluation of headstarting requires standardization, permanent marking of released headstarts, post-release monitoring, and accessible reporting of population parameters among projects, all of which are exceedingly limited. Above all, scientific evaluation of headstarting requires a data set with replication at the population level: evaluation of the intervention as a tool for general use cannot be accomplished by any study of a single population, no matter how excellent. Although individual projects have been successful at meeting their stated goals and outcomes, these vary among projects. Despite decades of turtle headstarting around the globe, comparable data from multiple headstarted and control populations are not available, and consequently, there are no published population-level meta-analyses of the effectiveness of headstarting for conserving turtle populations.

Effective evaluation
Effective evaluation of any intervention requires a priori consideration of the scale of the project. The perceived dichotomy of short- vs. long-term goals can discourage practitioners leading small-scale projects from data collection and reporting, even though those data can be used to inform meta-analyses and evaluations at a broad scale. Valuable information regarding individual performance metrics, rearing conditions, or even just the number of individuals released into a population each year can be collected and reported for most projects, even those with severe financial, temporal, or logistical constraints on post-release monitoring and population-level comparisons. Small-scale, population- or project-specific data can be used to inform broader scale evaluations and the development of best practices through incorporation into meta-analyses (Fig. 1). We emphasize the need for the recording and reporting of data at all project scales, and demonstrate this below, continuing with our example of turtle headstarting.

Impacts within individuals
How does an intervention affect the individual? Many headstarting programs accelerate hatchling growth rates beyond those of wild turtles to ensure large body sizes before release (e.g., Haskell et al. 1996; Vermont Fish & Wildlife Department 2009; Parks Canada 2012). The fine-scale relationship between body size and survivorship in wild hatchling turtles is unclear, with equivocal evidence both for and against the "bigger is better" hypothesis (Janzen 1993; Haskell et al. 1996; Congdon et al. 1999; Janzen et al. 2000; Delmas et al. 2007; Paterson et al. 2014; Canessa et al. 2016). Although sub-adult and adult freshwater turtles do experience lower mortality rates than juveniles (Enneson and Litzgus 2008), growth is energetically costly, and the physiological trade-offs of accelerated juvenile growth are not well understood (Wieser 1994; Bayne 2000; Dmitriew et al. 2009; Dmitriew 2011). Reported data on individual health and fitness outcomes in relation to captive-rearing conditions can inform best practices for optimal growth rates and husbandry of headstarted turtles, and can inform survivorship and individual growth rate estimates in population models.

Impacts within populations
Fig. 1. Illustration of how reportable data (e.g., population growth rate (λ)) at different program scales (incubation, rearing, post-release monitoring), from wild populations and populations undergoing an intervention, can inform evaluation and the development of best practices at all levels, using headstarting as an example.

Single-population studies can evaluate the fitness of wild-hatched relative to headstarted individuals in a common habitat, but cannot generalize about the effectiveness of headstarting on populations. If the objectives of a conservation intervention include evaluation of its impact on the target population ("success", as defined by the project), there must be a robust post-intervention monitoring program. This is not a novel suggestion (Seddon et al. 2007; IUCN/SSC 2013; Parker et al. 2013), yet post-intervention monitoring is still not incorporated into many projects that explicitly aim to stabilize their target populations. Tracking growth, behaviour, or survivorship of headstarts alone, although valuable, cannot inform demographic evaluations (unless the project is a reintroduction and there are no wild turtles besides headstarts), because wild juvenile freshwater turtles experience high mortality compared with adults (Congdon et al. 1993). Therefore, headstart survivorship is only informative at the population level in comparison with the survivorship of wild-hatched individuals in the target population, and an experimental approach should be used to compare the fitness of these two groups (Spinks et al. 2003; Mitrus 2005; Attum and Cutshall 2015). However, this comparison presumes that there are wild-hatched juveniles in the population, which may not be the case when natural recruitment is effectively zero (e.g., Ontario Wood Turtle Recovery Team 2010) or when re-establishing populations within a historical range (e.g., Amaral 2007). Furthermore, headstarted and wild-hatched individuals must be distinguishable in the field, e.g., through non-harmful marking such as passive integrated transponders (Gibbons and Andrews 2004), visible implant elastomer (Davy et al. 2010; Antwis et al. 2014; Simon and Dörner 2014; Kozłowski et al. 2017), or other tools (Auger-Méthé and Whitehead 2007; Parker et al. 2013; Schoen et al. 2015). For example, short-term survival rates have been compared in wild and headstarted plains garter snakes (Thamnophis radix) of similar size (King and Stanford 2006), Mona Island iguanas (Cyclura cornuta stejnegeri; Pérez-Buitrago et al. 2008), and European pond turtles (Emys orbicularis; Mitrus 2005). However, biological fitness is difficult to quantify in turtles, in which survivorship and fecundity require decades to measure directly. Proxies such as individual growth rate, body condition, physiological health, movement behaviour, or site fidelity are often used (Shaver 1999; Shaver and Rubio 2008; Canessa et al. 2016), though we suggest cautious interpretation of these proxies, as they are not all equally informative (e.g., Davy et al. 2014).
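Where headstarted and wild-hatched juveniles are both marked and monitored, even simple comparisons become reportable. A minimal sketch (Python) with invented outcome counts, not data from any real project:

```python
from scipy import stats

# Hypothetical one-year radio-tracking outcomes: [survived, died] for
# headstarted vs. wild-hatched juveniles in the same population.
headstarted = [18, 12]
wild_hatched = [14, 6]

odds_ratio, p = stats.fisher_exact([headstarted, wild_hatched])
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.3f}")
# Small samples like these are rarely decisive on their own, but reported
# alongside rearing conditions and release sizes they become usable inputs
# to among-population meta-analyses.
```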

Impacts among populations
Meta-analytic approaches can be used to identify the characteristics of interventions that have been applied successfully, with the results used to optimize the efforts of new and ongoing programs. We currently lack available, standardized data at the individual and within-population scales, limiting our capacity for an among-population, global review of turtle headstarting. Empirical data on the behaviour, ecology, and movement of wild juvenile turtles are extremely limited (Ultsch et al. 2007; Paterson et al. 2012; Whitear et al. 2017), but are necessary to inform best practices for release-site selection and release timing, as well as for the parameterization of population growth models evaluating headstarting. With advances in tracking technology towards smaller and lighter tags, we can now follow hatchling turtles and obtain comparative, size-specific survivorship estimates from wild populations (e.g., Paterson et al. 2012). Ideally, post-release monitoring of headstarted turtles would be paired with similar studies of wild juvenile survivorship and movement patterns.
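Such comparisons would be far easier if projects shared a common minimal record format. As a purely hypothetical sketch (the schema and field names are our suggestions, not an established reporting standard):

```python
from dataclasses import dataclass
from typing import Optional

# A hypothetical minimal record for one released cohort; fields reflect the
# data types discussed above. Optional fields let small projects report
# whatever they can.
@dataclass
class HeadstartCohortRecord:
    project_id: str
    species: str
    release_site: str                      # or a site code where locations are sensitive
    release_year: int
    n_released: int
    egg_source: str                        # "wild" or "captive"
    months_in_captivity: float
    overwintered_in_captivity: bool
    mean_release_mass_g: Optional[float] = None
    marking_method: Optional[str] = None   # e.g., "PIT tag", "shell notch", "elastomer"
    monitoring_years: int = 0
    annual_survival_estimate: Optional[float] = None
    annual_survival_se: Optional[float] = None
```

Even partially complete records of this kind, archived openly, would allow a future meta-analysis to weight projects by precision rather than count votes.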
To facilitate meta-analyses of turtle headstarting and other interventions such as translocations or predator removal, we provide examples of data collection requirements that would also facilitate evaluation of impacts within projects and help develop best practices for the target intervention (Fig. 1). We suggest that an openly available reporting platform (e.g., conservationevidence.com, open access academic journals), coupled with open standards for reporting data and methodological detail (e.g., conservationmeasures.org), would greatly improve comparison among projects. Although we emphasize the need for reporting of data at any project scale, we recognize that academic publication may present a substantial barrier to some conservation projects. We therefore encourage open access, online archiving of information on any platform (e.g., publication of year-end reports on a project website).

The way forward: Evidence-based best practices
No conservation intervention is suitable for all situations, but most interventions are useful when applied appropriately, using evidence-based best practices (e.g., Moehrenschlager and Moehrenschlager 2006; Soorae 2016). Thus, asking whether a conservation intervention such as turtle headstarting "works" is simply asking the wrong question. Projects working to conserve threatened populations should first define their goals (e.g., increased population size, juvenile survivorship, or reproductive output). They can then assess whether they have achieved these targets and, if so, which specific husbandry, release, and post-release strategies were most effective. An appropriate post-intervention monitoring program should ideally follow any conservation intervention. Most importantly, no population-specific study can be generalized to evaluate a conservation intervention for all other populations in need of applied conservation actions.
Standardized techniques and experimental design would allow individual headstarting projects to contribute not just to the persistence of their target populations, but also to the global effort to conserve turtle species. Incorporating experimental design into conservation interventions wherever possible, including explicitly defined control groups, allows comparison of outcomes with those of other, similar programs. With basic application of the scientific method, data collection can be targeted to address specific predictions, or goals, for the project (Fig. 2). Data collected with a predetermined purpose facilitate statistical analyses and provide a baseline for quality control. Coupling conservation interventions such as headstarting with rigorous and standardized data collection and reporting will inform the development of best practices documents, increasing the success of conservation projects through empirical evidence and shared experience. It is time to replace the problematic dichotomy of "does it work?" with the less succinct but more useful question "when and how does an intervention achieve the measurable, explicitly defined goals of the project?" This shift will allow continued improvement of conservation interventions to the benefit of global biodiversity.
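As one example of targeting data collection to a specific prediction, collaborating projects could ask a priori how many replicate populations an among-population comparison would require. A sketch, with an assumed (purely illustrative) effect size:

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical a priori power analysis: populations (not individuals) are the
# replicates. effect_size is Cohen's d for the difference in growth rate
# between headstarted and control populations; the value is an assumption.
n_per_group = TTestIndPower().solve_power(effect_size=1.2, alpha=0.05, power=0.8)
print(f"~{n_per_group:.0f} populations per group")  # ~12 under these assumptions
```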
We acknowledge that there are projects in which stakeholders do not feel the need to quantify the effects of their intervention, and many projects that are severely limited by available resources. As a group of authors that includes both conservation practitioners and academics, we understand that conservation practitioners' capacity to disseminate results varies, as does the academic incentive to publish applied projects. Whether a project is evaluated does not influence its outcome: unevaluated projects can ultimately have as great or as small a conservation impact as robustly evaluated projects, and may indeed be critical stabilizing forces for threatened populations. Thus, unreported results do not detract from a project's potential impact on its target population or species. However, making results available (including negative ones) allows the conservation community to learn from a project's experience, thus achieving broader impact beyond the target population. Peer-reviewed literature is not the only acceptable form of communication; reports summarizing available data can be posted online, making them publicly available for future, peer-reviewed meta-analyses. We strongly believe that we can all benefit from increased data collection and sharing in conservation interventions. We therefore invite practitioners and academics to work together to maximize the impacts of their conservation efforts by evaluating and optimizing the available tools, and by communicating their results.