Good Questions: Building Evaluation Evidence in a Competitive Policy Environment

Butts, Jeffrey A. and John K. Roman (2018). Good Questions: Building Evaluation Evidence in a Competitive Policy Environment. Justice Evaluation Journal, 1(1): 15-31.

Criminal justice evaluations are funded and executed in a policy environment increasingly characterized by intense competition that creates a premium on studies utilizing experimental methods. Randomization, however, is not a universally suitable method for answering all questions of criminal justice policy and practice. Experimental designs are particularly ill-suited for addressing key analytical challenges and exploiting important opportunities in justice, including discontinuity effects, interventions that depend on the perceptions and beliefs of individuals, and models of general policy change. Experiments are also rarely able to reliably measure and isolate the effects of very complex justice interventions. Policymakers and practitioners in the justice sector should consider evaluation research as a portfolio of strategic investments in knowledge development. Randomized controlled trials are merely one asset in a broader investment strategy.

Citation: Jeffrey A. Butts & John K. Roman (2018): Good Questions: Building Evaluation Evidence in a Competitive Policy Environment, Justice Evaluation Journal.


Dramatic expansions in information circulation since the 1980s have altered the professional environment for evaluation researchers. In all fields, including criminal justice, easily accessible evaluation resources allow any practitioner or policymaker (or citizen) to search for knowledge about “what works.” Everyone is an expert – with or without training in research, data analysis, causal logic, or statistical inference. Analyses of evaluation findings once were found mainly in peer-reviewed journals and at academic conferences. Today, evaluation results are disseminated and debated on social media platforms, in streaming video conferences, in third-tier journalistic outlets, and on the promotional websites of self-interested service providers and advocacy organizations. Virtually anyone may now participate in discussions about the most effective ways to change behavior, reform justice system operations, and protect public safety.

These developments could be positive in many ways. Democratizing information and analysis may improve the quality and impact of evaluation evidence. The same forces, however, expose the assessment of evaluation evidence to untrained and undisciplined competition over funding, influence, and reputation. Such competitions may be won by the loudest voices and not by the most informed or judicious voices. This open and increasingly competitive environment for analyzing criminal justice research presents serious challenges. One of the most pressing challenges is the fervor with which advocates and public officials inject discussions of evaluation evidence with a simplistic bias favoring experimental evaluation designs over other types of evaluation and research. Too many policy debates are driven by the competitive device of good methods when they should be driven by the research norm of good questions followed by well-reasoned and reliable answers.

Random rules

Experimental studies (or randomized controlled trials, RCTs) can produce high-quality evidence, but often within a relatively narrow frame. Randomization may be the method of choice for evaluating some interventions and practices in criminal justice, but other methods are essential as well, especially when researchers need to measure the effects of policy. The choice of evaluation method should be governed by the nature of the questions being asked and by the complexity and nuance of the answers required to apply the results, not by a reflexive preference for randomization. Researchers in the criminal justice sector sometimes appear to value method above all, perhaps reflecting the “experimental turn” taken by criminology in recent decades (Sampson, 2010).

The properties of optimal research designs are well known. Both the strengths and weaknesses of RCTs are clear in a criminal justice context. The primary goal of any statistical model for evaluation is to isolate the effect of a specific mechanism (intervention, policy change, etc.) on targeted outcomes. Along this narrow dimension, the RCT is unparalleled. If carefully implemented, randomized trials can allocate competing explanations of an observed effect equally to treatment and control conditions (i.e., received the intervention or did not receive the intervention). An effective RCT establishes the logic of causation, as the intervention is the only signal remaining in a sea of noise. This leaves the mechanism being tested as the only plausible explanation for a change in outcome.

This important property of RCTs is often mistaken as the only important attribute of an effective evaluation design. More properly, however, the powerful feature of minimizing the noise from competing explanations that is embedded in an RCT should be thought of as a necessary but not sufficient condition for research and analysis. An RCT design must also isolate the effects of the proposed causal mechanism and it must establish the causal mechanism as the agent of change. In addition, an RCT must describe the causal mechanism with great precision and draw the study sample from a population that is broadly generalizable.
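The balancing property described above can be illustrated with a small simulation. The following is a minimal sketch, not drawn from the article: the outcome model, the unobserved "risk" factor, the enrollment rule, and the effect size of -0.10 are all hypothetical values chosen for illustration. It compares the naive treatment-control difference when people select themselves into a program with the same difference when assignment is random.

```python
import random
from statistics import mean

def simulate(assign, n=20000, true_effect=-0.10, seed=0):
    """Return the raw treated-minus-control difference in recidivism rates."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        risk = rng.random()              # unobserved competing explanation
        is_treated = assign(risk, rng)
        # Hypothetical outcome model: risk raises recidivism, treatment lowers it.
        p = 0.2 + 0.5 * risk + (true_effect if is_treated else 0.0)
        outcome = 1 if rng.random() < p else 0
        (treated if is_treated else control).append(outcome)
    return mean(treated) - mean(control)

def random_assignment(risk, rng):
    # A coin flip that ignores every characteristic of the person.
    return rng.random() < 0.5

def self_selection(risk, rng):
    # Assumed rule: lower-risk people are more likely to end up treated.
    return rng.random() < (1.0 - risk)

rct = simulate(random_assignment)
obs = simulate(self_selection)
print(f"randomized estimate:    {rct:+.3f}")
print(f"self-selected estimate: {obs:+.3f}")
```

Under these assumptions the randomized estimate should land near the assumed effect, while the self-selected estimate mixes the treatment effect with the difference in risk between the groups. The sketch shows only the balancing property; it does not address the other conditions the text lists.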

One critical limitation of the RCT as an evaluation design is that the method requires a simple and observable causal mechanism. Detecting a difference between treatment and control conditions is not enough; the evaluation design must be able to credibly attribute the difference to an identified causal mechanism. RCTs in social science research are intended to mimic the principles of double-blind clinical trials in medicine. In such trials, the causal mechanism is usually a drug, or therapy, or medical device of some kind. The intervention is modeled as an on/off switch. In other words, receipt of the therapy or treatment is measured as a simple binary, yes or no, condition. An experimental design functions to isolate that effect from any number of competing explanations, such as age, gender, family history, and ideally, any other factor. While these factors may affect the causal agent, they are randomly distributed across both treatment and control groups by design.

The binary assumption about causal agents is not insurmountable. Block or stratified randomized trials may be used to study more nuanced causal mechanisms, generally executed as different dosage levels. Notably, however, when a mechanism is continuous and there are possible effects from many dosage levels, or when subgroups may respond differently to an intervention, these more complex effects are usually isolated in quasi-experiments that expand on the narrow frame of randomized controlled trials.

In the criminal justice context, the narrow dichotomous or categorical causal mechanism expected in RCTs is often too simplistic to capture real-world policy interventions. Causal mechanisms in social policy studies – particularly those intended to affect human behavior (which, in some way, is the subject of virtually all criminal justice interventions) – cannot be adequately described with dichotomous variables. This includes virtually all criminal justice policies and practices (often with multifaceted causal mechanisms), those with continuous outcomes (with interventions often arbitrarily targeting points along the continuum, often referred to as policy discontinuities), and a wide range of phenomena outside the researcher’s control, such as the American population that is growing older and more diverse with increasing income inequality.

As the criminal justice field begins to appreciate the limits of RCTs for policy applications, the number of critical voices grows (e.g., Sampson, 2010). Advocates and philanthropic funders may continue to characterize the RCT as the only form of evaluation evidence suitable for public policy, but at least some agencies in the federal government are (or perhaps were) demonstrating greater discernment. In 2016, the Science Advisory Board for the federal Office of Justice Programs within the U.S. Department of Justice developed a series of formal statements to guide DOJ’s investments in criminal justice research and evaluation. Representing a consensus view among the diverse membership of the board, the statements were later promulgated on the Justice Department’s website. One statement clarified the important but limited role of RCTs in the development of effective policies and practices.

Randomized controlled trials (RCT) generally provide the strongest or most defensible causal evidence for programs and practices, but it may not always be possible to implement successful RCT evaluations in the field. Many important questions in the field of justice are not answerable using RCT studies—either for practical, economic, political, or ethical reasons. Research questions that are very difficult or expensive to answer using experimental methods may merit the necessary investment if they have widespread or profound social consequences, just as research questions with only modest consequences still merit experimental investment if they can be answered easily and at little cost. Funding for RCT evaluations should be managed like an investment portfolio with resources concentrated on the most effective combinations of theoretical salience, research feasibility, and social benefit (Science Advisory Board, 2016a).

Many technical concerns about RCTs debated in the criminological and social science literature relate to their internal validity, but the power of RCTs to generate generalizable evidence with external validity is often overstated. There are practical trade-offs between external and internal validity. Internal validity is about isolating a causal effect within the sample. External validity is about the representativeness of a sample to the population it purports to describe, and critically, the representativeness of the population overall. Ideally, research samples are drawn at random from a known population and are thus unbiased in their representation of the target population. In practice, however, such samples are rare in criminal justice research, even in RCTs. Most study samples are actually convenience samples. Consider the example of a DUI diversion program. Most DUI diversion programs include important restrictions that lead to selection effects – participants must volunteer, and many DUI defendants are excluded because of their criminal history, previous treatment experience, or the nature of the incident that led to a new arrest (children in the car, accident with injuries, etc.). An RCT would account for this by randomizing after such restrictions are applied – meaning that the population being sampled comprises DUI defendants with limited histories and an expressed willingness to volunteer. A quasi-experimental design, on the other hand, might sample from a larger population by weighting and controlling for those differences rather than through exclusions. Further, even if the DUI defendant population were not limited in these ways, DUI arrest practices are typically (and perhaps by law) unlikely to represent a random draw from all DUI policy regimes. Thus, the randomization process may only appear to solve external validity problems through the design.
Randomized, experimental designs address some important challenges, but they cannot illuminate every dark corner of the research-to-policy enterprise.

The choice of an evaluation design should optimize the explanatory power of the research while recognizing the constraints of often mutually exclusive challenges. While the RCT may be optimal for narrow studies of relatively simple causal mechanisms, there are many instances when that choice is not optimal. Consider interventions designed to affect large numbers of eligible recipients, many of whom get close to, but not through, the front door of a treatment program. Is the most important question for a policy the later success among all people designated for treatment, or is it whether the treatment works when it is actually received, even if it is received only by some people, some of the time?

Randomized experiments are often assumed to be “intent to treat” (ITT) designs when in reality they are “treatment on the treated” (TOT) designs. How does an evaluation handle outcomes among people who were assigned to receive a treatment but who failed to enroll or participate? In a TOT design, they are simply excluded. If evaluation results are applied to policymaking, this is wholly unsatisfactory. Without a researcher-controlled mechanism assigning some eligible participants as non-enrollees, there is a clear selection effect. Whatever factors may predict this selection are almost certainly unobservable in the data and thus difficult to control. Even with random assignment to treatment, controlling for the effects of such post-assignment self-selection must be addressed with a quasi-experimental post-hoc design.

The ITT model is better along these lines, since it explicitly includes everyone who was assigned to be treated. People who chose not to enroll or to participate, for any number of reasons, still appear in the treatment results. Practitioners must be wary of unscrupulous evaluators who dismiss such concerns or who deride the inclusion of non-enrollees in an outcome analysis as unnecessarily diluting the effects of an intervention. Even the ITT design, however, does not address all problems with generalizing from RCTs. Applied experiments present many challenges that reveal the limits of the RCT.
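The difference between the two analyses can be made concrete with a simulated trial. This is a minimal sketch under stated assumptions, not a reproduction of any study: the unobserved "motivation" factor, the enrollment rule, and the effect of -0.10 for people who actually receive treatment are hypothetical. It contrasts an analysis of everyone assigned to treatment with an analysis that keeps only enrollees, the approach the text describes as excluding those who failed to enroll.

```python
import random
from statistics import mean

def simulate_trial(n=40000, effect_if_received=-0.10, seed=1):
    """Simulate random assignment followed by voluntary enrollment."""
    rng = random.Random(seed)
    control, assigned_all, enrolled_only = [], [], []
    for _ in range(n):
        motivation = rng.random()        # unobserved; drives enrollment AND outcomes
        assigned = rng.random() < 0.5    # researcher-controlled random assignment
        # Assumed rule: among those assigned, more motivated people enroll more often.
        enrolls = assigned and rng.random() < motivation
        p = 0.6 - 0.3 * motivation + (effect_if_received if enrolls else 0.0)
        outcome = 1 if rng.random() < p else 0
        if assigned:
            assigned_all.append(outcome)
            if enrolls:
                enrolled_only.append(outcome)
        else:
            control.append(outcome)
    itt = mean(assigned_all) - mean(control)        # everyone assigned vs. control
    enrollee = mean(enrolled_only) - mean(control)  # non-enrollees dropped
    return itt, enrollee

itt, enrollee = simulate_trial()
print(f"all assigned vs. control:   {itt:+.3f}")
print(f"enrollees only vs. control: {enrollee:+.3f}")
```

With these assumptions, about half of those assigned enroll, so the all-assigned estimate is roughly half the effect of treatment actually received. The enrollee-only estimate is larger in magnitude than the assumed effect because enrollees were more motivated to begin with, which is the post-assignment selection effect the passage warns about.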

Fortunately, experimental evidence is not the only type of evidence for answering questions in criminal justice. Other forms of evaluation will always be essential. First, there will never be enough funding to evaluate every decision made and every action taken by the wide range of individuals and organizations comprising the justice system. Justice system operations are vast, especially if the justice system is defined broadly to include the wide range of preventive efforts of community-based organizations, social service providers, and law enforcement agencies. Second, many important questions about justice effectiveness are not amenable to experimental evaluation studies. Due to an array of ethical and practical concerns, many policies and practices are best evaluated with research using non-randomized designs. When consumers of research treat randomized controlled trials (RCTs) as the only acceptable form of evidence for policy and practice decisions, they distort the process of knowledge development and restrict the reach of evaluation research (Sampson, 2010).

Evaluation in the digital age

In ancient times – before 1990 – evaluation research was a rarified sub-discipline of social science and public policy research. Academics and research practitioners labored in relative obscurity, known primarily to government agencies and the few large foundations that supported evaluation studies. Non-governmental research organizations dominated the field. From the 1960s through the 1980s, nonprofit firms such as the Urban Institute, Research Triangle Institute, RAND Corporation, MDRC, and Westat competed to secure a relatively small number of federal evaluation contracts. Their studies tested the effectiveness of policy and program demonstrations in housing, employment, urban development, children’s services, and income supports among others (Rossi, Lipsey, & Freeman, 2004). Criminal justice evaluations also grew during this time, usually focused on the relative effectiveness of behavioral interventions (Sherman et al., 1998) and comparisons of crime in cities and neighborhoods trying varying forms of public safety innovations (Skogan & Lurigio, 1991).

Another substantial body of criminal justice evaluation work was carried out by academics in colleges and universities. Their funding sources were more diverse and included smaller grants and contracts from foundations as well as state and local governments. Dissemination of this research, however, was hindered by the publication outlets favored by the academy. Research findings were effectively hidden in the pages of peer-reviewed journals read mainly by other academics. Only the most prominent studies were ever featured in journalistic and professional publications with a broad readership. Moreover, a large portion of academic evaluation work concentrated on evaluation methods themselves. Books and articles appeared throughout the 1970s and 1980s, providing advice for researchers using strategies such as utilization-focused evaluation (Patton, 1978) and practical evaluation (Patton, 1982). The number of evaluation experts in academia grew and academic journals devoted to evaluation proliferated – e.g., Evaluation Review: A Journal of Applied Social Research (Sage), Evaluation Practice (JAI Press), New Directions for Evaluation (Jossey-Bass), and Evaluation and Program Planning (Elsevier). Yet, the visibility of policy and program evaluation was fairly low.

After 1990, two forces combined to spark rapid growth in evaluation as a discipline and in the visibility and competitiveness of evaluation research – especially research on social programs and policies. First, political assaults on social spending by 1980s conservative governments (e.g., the Reagan Administration in the United States and the Thatcher government in the United Kingdom) sparked deep skepticism about the effectiveness of public policy in addressing social problems (Weiss, 1998). In both the US and the UK, new Center-Left governments in the 1990s (i.e., Clinton and Blair) made modest attempts to restore social programs but simultaneously imposed greater evidentiary requirements on such policies. In the US, once-foundational social programs such as income supports for impoverished families with children began to be curtailed, prompted at least in part by research casting doubt on their effectiveness and suggesting that benefits should be paired with work requirements (Gueron & Pauly, 1991). Suddenly, even long-established social programs were being questioned. Both critics and advocates demanded stronger evidence.

The second development leading to a new era in evaluation research was the rise of the internet and the rapid diffusion of all forms of information. In the evaluation field, one of the most significant effects of the new digital era was the creation of publicly accessible databases containing compendia of evaluation findings. In the past, anyone in search of evaluation results had to wade through original research materials or secondary summaries published in journals and books. Learning about the state of evidence for a program or intervention model took considerable effort, and without sufficient training to understand technical research details, justice practitioners were more likely simply to trust the opinions of researchers with whom they were personally acquainted. Worse, they may have been influenced by the beliefs of other nonresearchers or by the promotional claims of service providers and developers. Intervention entrepreneurs could advertise their effects to practitioners and policymakers and there was no Better Business Bureau or consumer watchdog organization to scrutinize their claims. Practitioners were on their own to sort through a confusing array of studies and findings.

With the maturation of the internet in the 1990s, organizations in the academic and governmental sectors began to create practitioner-friendly clearinghouses of evaluation findings. One of the first and best of these efforts was “Blueprints for Healthy Youth Development.” Launched at the University of Colorado in 1996, the searchable database of research findings about youth services (www.blueprintsprograms.com) was intended to provide assistance to officials in that state. It soon grew beyond Colorado in response to the surging appetite for research support among policymakers and practitioners. Non-researchers appreciated having access to evaluation information that was concise and translated for a non-technical audience.

With funding from the Office of Juvenile Justice and Delinquency Prevention in the U.S. Department of Justice, the Development Services Group, Inc. of Bethesda, MD developed another very successful database of evaluation results focusing on interventions for justice-involved young people. The DSG “Model Programs Guide” (www.ojjdp.gov/mpg/) was the first accessible platform that allowed juvenile justice professionals to obtain well-reviewed summaries of evaluation research about prominent program models. The database was populated with semi-detailed, analytic reviews conducted by well-trained researchers. The database was much more than a simple listing of studies. It was essentially a “Consumer Reports” for juvenile justice.

The popularity and success of the OJJDP Model Programs Guide soon inspired other agencies in the Department of Justice to build an even more ambitious database of research results. The National Institute of Justice developed “CrimeSolutions.gov,” which is now a primary resource for the criminal justice policy and practice community in the US. With detailed, high-quality reviews of hundreds of justice programs and practices, the database offers justice professionals a searchable and interactive method of sorting through decades of research findings on a wide variety of topics. Program profiles allow users to assess whether a particular intervention is likely to produce the results it promises when implemented properly. Practice profiles summarize the state of evidence for a general practice area, independently of any branded program model. Rather than exploring whether a specific program fared well in previous evaluations, users can review assessments of the general intervention strategy used by that program.

Together, these and other evaluation databases transformed the policy and practice environment in criminal justice. Where evaluation was once a special area of social research, it had become a more open and competitive field in which policymakers and justice practitioners debated the value of program models and where program developers worked with researchers to maximize their marketing advantage in various service sectors. Developers and service providers understood that they needed positive evaluation results to improve their rankings on digital clearinghouses. Brochures and testimonials were no longer sufficient.

Much like the ways standardized student test scores changed the climate of public school systems, and how the U.S. News & World Report’s prestige rankings increased university competition over faculty appointments and publications, the spread of searchable evaluation compendia turned the results of evaluation research into a valuable currency. became the of criminal justice, an effect magnified by the platform’s use of appealing and simplifying icons that represented the general state of evidence for a program or practice (Figure 1).

Figure 1. Rating icons used by Source:

The effect of these innovations and the increasingly competitive environment they brought to criminal justice evaluation can only be described as mixed. Certainly, programs and policies may be improved when evaluation findings are easily accessible and translated for readers outside the technical audience of academics and professional evaluators. Explaining evaluation findings in clear terms for policymakers and practitioners may help them to incorporate research knowledge into the conceptualization and design of justice interventions and justice policies. In this way, the democratization of criminal justice research may have contributed to improved public safety.

Other effects are more worrying. The research profession’s new audience awareness may create overt competition for positive results, leading evaluators to change their methods or even to modify the types of research questions they are willing to investigate. In an increasingly competitive funding environment, when the results of individual studies are more visible and when their evidentiary quality is summarized with happy green checkmarks or discouraging red null signs, it would be rational for researchers to shape their projects to generate strong, positive results. Evaluators might be tempted to concentrate their efforts on interventions suitable for randomized trials while avoiding those requiring other methods, even if those other interventions are innovative and potentially effective crime prevention strategies. Policymakers and practitioners, in turn, would naturally back crime reduction strategies based on strong evaluation results, but be unaware that criminal justice research was becoming increasingly narrow and redundant. Funding organizations might try to bolster their own reputations by focusing justice grant making on intervention models compatible with RCT evaluation designs while other public safety strategies end up starved for resources and support.

An increasingly competitive research environment dominated by RCTs could even limit the theoretical perspectives inherent in justice evaluations and justice interventions. The sample sizes and design controls required to achieve robust measures of effect are easier to reach with interventions delivered to individual offenders rather than to neighborhoods or communities. Individual-level models are also more likely to operate through a single causal mechanism rather than a more complicated web of influences. When evaluators experience powerful incentives to pursue the strongest findings with statistically significant outcomes and large effect sizes, they will naturally focus their research efforts on interventions testable at the individual level.

Other designs

If the competition for positive findings creates incentives strong enough that nonrandomized evaluations are less likely to be funded and completed, policy and practice in criminal justice would be diminished. All else being equal, the randomized controlled trial should be preferred over other study designs, but much like static market clearing models in beginning microeconomics, other factors can create a context where RCTs do not yield optimal results. One particularly flawed assumption embedded in well-meaning campaigns to encourage RCTs in criminal justice is that any causal mechanism can be understood as binary. Randomized social experiments mimic medical trials in their simplicity of design and their clear isolation of causal agents. In medical RCTs, classically, a drug or therapy is administered as either a dose of the test agent or as a placebo. Observed changes may then be attributed to the drug or treatment. In social science research and program evaluations, however, interventions are rarely administered with two clear and distinct outcome conditions: delivered in full versus not at all. Most justice interventions are complex, non-binary, and delivered in varying degrees. Interventions must be complex to affect the medley of human behaviors underlying crime rates. Practically, it is unusual for interventions in the criminal justice system to be delivered like an aspirin.

Four key characteristics of justice interventions highlight the limits of randomized controlled trials and illustrate how other evaluation designs are useful for causal inference:

  • Discontinuity. Criminal justice interventions are often implemented at a point of discontinuity in participant eligibility, where eligibility is an attribute such as age or legal status. Isolating effects at a moment of discontinuity may provide clear, causal insights in ways that RCTs cannot.
  • Perception. Systems of justice are predicated on the expectation that laws can be fairly and accurately communicated to subjects. Justice interventions seek to change behavior and do so within a framework of assumptions about the baseline perceptions of their recipients. Perceptions and beliefs, however, cannot be randomly assigned to facilitate research.
  • Policy. By definition, changes in policy, whether in income tax or criminal sentencing or healthcare eligibility, cannot be randomly assigned at scale. In addition, many policies are endogenous, where a policy change and the problem it is designed to affect are difficult to untangle.
  • Complexity. Justice interventions often braid or blend suites of services for complicated problems, and eligibility for services is often voluntary. In evaluating such interventions, researchers face multiple causal elements and critical unobserved attributes, often making it impossible to structure an evaluation as an RCT.


The value of discontinuity designs in justice evaluation is illustrated by studies of policies to reduce drunk driving. Lowering the incidence and prevalence of drivers operating motor vehicles while under the influence of alcohol or other drugs has been an important policy goal for generations. While police have always been able to make subjective judgments about the alcohol-related impairment of drivers, the invention and widespread adoption of the Breathalyzer to measure accurate levels of blood alcohol content (BAC) allowed states to establish more rigorous and transparent drunk driving deterrence policies beginning in the late 1950s. As with any law, there is a tension between setting a policy too strictly and over-intervening with minor instances or setting rules too loosely and allowing dangerous drivers on the road. The ramifications are substantial both in terms of the costs to victims of drunk drivers whose injuries might have been deterred and in terms of the implementation costs for police and for those who are arrested and who may suffer serious reputational and financial harm.

In the decades following the introduction of Breathalyzer tests, legal limits for blood alcohol content were set at varying levels between 0.08 and 0.15 without a definitive study determining the optimal level. In practice, there is substantial heterogeneity in operators’ ability to drive at different BAC levels and there is no level that satisfies all criteria for an optimal public policy. An RCT is of little use in settling the question – actual drunk drivers cannot be randomly assigned BAC levels, nor can legal authorities randomly arrest at different levels to satisfy a randomized design.

A recent study exploited the policy discontinuities in BAC law enforcement to study the differential effects of deterrence from different BAC thresholds (Hansen, 2015). In practice, drivers with BACs of 0.079 and 0.08 are indistinguishable in observed driver behavior. In several states, however, that small difference leads to substantially different legal consequences, with those exceeding 0.08 facing jail, fines, license suspensions and other punishments. In the state involved in this particular study (Washington), there was a second policy discontinuity at a BAC of 0.15, triggering stronger punishments including larger fines and longer jail sentences. Because drivers just below and just above each threshold were otherwise comparable, differences in their subsequent desistance could be attributed causally to the sanctions triggered at the threshold. The study found significant and substantial decreases in recidivism for those with a BAC over 0.08 and an additional deterrent effect over 0.15.

The study presented clear implications for evaluation design. In the presence of policies that exist along a continuum, a discontinuity design may be clearly understood and uniquely tailored to the situation. There are even more subtle benefits to regression-discontinuity designs. The BAC study found a difference in intercept (a change in recidivism for all levels above the BAC threshold) but also a change in slope for some interventions but not others within the same jail sentence, which includes a range of BAC from 0.08 to 0.15. At similar levels of jail confinement, increases in BAC yielded increasingly larger desistance effects, but at the same level of treatment the change was not affected by BAC. Such a finding is worthy of additional study and would likely not be noted in an RCT if it could hypothetically group subjects above or below a BAC threshold.
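The logic of a discontinuity design at a single threshold can be sketched in a few lines. The example below is an illustration on synthetic data, not a reproduction of Hansen (2015): the recidivism model, the size of the jump (-0.05), the bandwidth, and the helper names `fit_line` and `rd_jump` are hypothetical. It fits a separate straight line on each side of a 0.08 cutoff and compares the two predictions at the cutoff itself.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def rd_jump(data, cutoff, bandwidth):
    """Difference between the two fitted lines evaluated at the cutoff."""
    below = [(x, y) for x, y in data if cutoff - bandwidth <= x < cutoff]
    above = [(x, y) for x, y in data if cutoff <= x <= cutoff + bandwidth]
    a0, b0 = fit_line(*zip(*below))
    a1, b1 = fit_line(*zip(*above))
    return (a1 + b1 * cutoff) - (a0 + b0 * cutoff)

# Synthetic drivers: recidivism risk rises smoothly with BAC, except for an
# assumed drop of 0.05 for drivers at or above the legal threshold.
rng = random.Random(2)
CUTOFF = 0.08
data = []
for _ in range(200000):
    bac = rng.uniform(0.03, 0.13)
    p = 0.10 + 0.5 * bac - (0.05 if bac >= CUTOFF else 0.0)
    data.append((bac, 1 if rng.random() < p else 0))

jump = rd_jump(data, CUTOFF, bandwidth=0.05)
print(f"estimated change in recidivism at the threshold: {jump:+.3f}")
```

The intercept difference at the cutoff corresponds to the "difference in intercept" the passage mentions. Comparing the two fitted slopes would be the analogous check for a change in slope. A real analysis would also need to examine bandwidth choice and whether drivers can manipulate their measured BAC, which this sketch ignores.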


A second example of an influential line of research not amenable to RCT focuses on the age of criminal court jurisdiction – i.e., the age (usually 18) at which young people are handled by default in the criminal (adult) justice system rather than in a juvenile or family system. In justice systems, policies intended to deter crime presume an understanding of the law by those subject to it. If important laws are widely misunderstood by those expected to follow them, the legitimacy and effectiveness of laws are both at risk. Perceptions and beliefs, and any behaviors arising from those beliefs, should follow the objective reality of a law in order for the law to achieve its objectives. However, these psychological preconditions are difficult to observe and virtually impossible to manipulate by experimental design. Thus, a non-RCT design must be used if these attributes are to be evaluated.

Arguably, no legal discontinuity has a larger or broader effect on behavior than the change in case processing that occurs on the day a youth reaches the age of criminal court jurisdiction (Butts & Roman, 2014). On that day, youth who would previously have been processed in a separate legal system, where rehabilitation is supposedly preferred to punishment, are immediately subject to adult criminal case processing. Defendants in criminal court are held systematically more accountable for their criminal acts and often face more severe consequences. The idea, of course, is to use this threat of greater punishment to deter youth and young adults from criminal acts. The key question is whether individuals accurately change their beliefs about possible legal consequences upon achieving the age of criminal jurisdiction. Do young people actually perceive the reality of the new punishment regime they face? There are important consequences to their awareness, regardless of whether they over-estimate or under-estimate the legal jeopardy associated with criminal behavior.

Hjalmarsson (2009) used survey data from the 1997 cohort of the National Longitudinal Survey of Youth (NLSY), a project of NORC at the University of Chicago, to measure whether 18-year-olds accurately adjusted their beliefs or if their expectations about how they would be treated in the adult criminal justice system were out of sync with observed data. The study used official criminal records data to estimate the probability of confinement (jail) conditional on arrest for an adult and the chance of confinement (“placement”) conditional on arrest for a juvenile. The NLSY asked respondents (with their ages known at the time of response) about their perceived chances of jail if they were arrested for a crime (the NLSY asks specifically about motor vehicle theft)1 and about their criminal case history.

1 The NLSY asks, “Suppose you were arrested for stealing a car, what is the percent chance that you would serve time in jail?”

Hjalmarsson (2009) employed a difference-in-differences strategy to test whether observed changes in the views of youth from one age to the other were affected by changes in their real legal status – i.e., from juvenile to criminal jurisdiction. The difference-in-differences method is common in social science research and often used in quasi-experimental settings to examine before-and-after policy changes, where only one group is treated in the “after” period. It is especially helpful in evaluation designs that must accommodate a particular discontinuity – subjects are aging and maturing simultaneously as their advancing age exposes them to a policy change (in policy studies, many types of benefits and social policies are triggered by age changes). The study found that changes in the perceived likelihood of jail by 18-year-olds were substantially smaller than the objective changes in their actual risk of jail. The implications of this finding were substantial, and certainly worth studying by non-experimental methods.

“Given that the distinction between the juvenile and adult justice systems is arguably one of the best known features of the U.S. justice system, this finding certainly increases concerns that other (lesser known) policies designed to deter crime will not result in large changes in subjective measures and, thus, not be very successful” (Hjalmarsson, 2009:245).
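
The arithmetic behind a difference-in-differences comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hjalmarsson’s (2009) specification: the two groups, the function name did_estimate, and the perceived percent chances of jail are hypothetical values invented for the example and are not NLSY data. The estimate is the change observed in the group that crossed into criminal court jurisdiction minus the change observed in a comparison group that aged by the same amount without crossing.

```python
# Illustrative sketch only: hypothetical numbers, not NLSY data.
from statistics import mean


def did_estimate(treated_before, treated_after, comparison_before, comparison_after):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_after) - mean(treated_before)
    comparison_change = mean(comparison_after) - mean(comparison_before)
    return treated_change - comparison_change


# Perceived percent chance of jail if arrested for stealing a car.
# "crossed": respondents who reached the age of criminal jurisdiction
# between two interviews (hypothetical values).
crossed_before = [40, 35, 50, 45]
crossed_after = [50, 45, 55, 56]

# "not_crossed": respondents who aged by the same amount but remained
# under juvenile jurisdiction (hypothetical values).
not_crossed_before = [38, 42, 36, 44]
not_crossed_after = [42, 45, 41, 48]

perceived_shift = did_estimate(
    crossed_before, crossed_after, not_crossed_before, not_crossed_after
)
print(perceived_shift)
```

Subtracting the comparison group’s change removes the part of the shift that reflects aging and maturation alone. The remainder is attributed to the change in legal status only if the two groups would otherwise have followed parallel trends, which is the key assumption of the method.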


Changes in policy present critical challenges in criminal justice evaluation, and their effects are relatively impervious to testing with randomized designs. Policy, in this case, refers to any law, regulation, or system characteristic that affects an entire population. The limits of an RCT in evaluating policy change are clear. Consider drug laws that impose long mandatory minimum sentences and “three strikes” laws that escalate punishment based on a person’s criminal history. Since such laws are applied uniformly once enacted, it would obviously not be possible to randomly assign their effects to individuals (to test deterrence). Natural experiments, policy discontinuities, and intertemporal effects all provide quasi-experimental ways to approximate an RCT, but an RCT itself is not feasible.

Similarly, broad legal changes have sometimes altered the reach and impact of the criminal justice system. For example, U.S. Supreme Court decisions in the 1960s established stronger, formal rules for establishing the chain of custody over evidence in criminal cases and this likely helped to professionalize law enforcement (Giannelli, 1982). A more professional law enforcement orientation might reasonably be expected to affect any number of law enforcement outcomes, but such a sweeping change is not amenable to RCT evaluation. Other large-scale changes in criminal justice, such as the growing importance of racial disparities (Baumer, 2013) and the increased attention to adolescent development in juvenile justice (Monahan, Steinberg, & Piquero, 2015), have equally profound ramifications for public safety that cannot be studied using RCT designs. Evaluators are correct to focus on these areas of policy, but randomized trials will play a limited role until sufficient knowledge exists to formulate specific hypotheses.

Among the many challenges that limit the use of RCTs in policy evaluation, perhaps the most significant is what economists call endogeneity. In the simplest terms, the presence of endogeneity means that there is a muddled relationship between cause and effect. The problem of endogeneity takes many forms in justice evaluations, with simultaneity and reverse causality being the most intuitive. An example from policing illustrates the central challenge. At a time when crime is rising in a city, a traditional and reasonable response for public officials is to increase the number of law enforcement officers assigned to specific patrol areas. As more officers begin to patrol, however, there would naturally be a time lag before the appearance of any effects they may have on the crime rate and possibly a number of other outcomes, both wanted and unwanted (Braga, Papachristos, & Hureau, 2014). Meanwhile, all the remaining mechanisms that affect a rising crime rate would continue to operate. For researchers evaluating an increase in policing resources, it could even appear that the introduction of additional officers in an area actually “caused” more crime.

The problem is not particularly amenable to an RCT solution. Putting aside the impracticality of randomly assigning police levels across a city, an RCT study design does not inherently solve the endogeneity problem. Police do not only respond to calls for service; they also observe and detect crime. Thus, adding more police almost surely results in more reported crime. Although this is a measurement problem related to the definition of “crime” and not a design flaw, it is nevertheless a practical constraint in RCT studies of policing’s effect on crime. Of course, a city could decide to assign the number of added officers to different geographies randomly in order to evaluate the effect, but inherent differences in the effects of place-based attributes on crime rates could have as much explanatory power as the policing level itself. Since place-based attributes cannot be randomly assigned, such a study would require the introduction of many covariates, creating what is essentially a quasi-experimental study rather than a true RCT.
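
The measurement problem can be made concrete with simple arithmetic. In the Python sketch below, every number and both function names are hypothetical assumptions: each added officer is assumed to prevent some offenses while also raising the share of offenses that come to official attention. Under those assumptions, the number of offenses actually committed falls while the number recorded rises.

```python
# Illustrative sketch only: all parameter values are hypothetical.

def true_crimes(officers, baseline=1000.0, prevented_per_officer=1.0):
    """Offenses actually committed, assuming each officer prevents a fixed number."""
    return max(0.0, baseline - prevented_per_officer * officers)


def recorded_crimes(officers, base_detection=0.40, gain_per_officer=0.002, cap=0.95):
    """Offenses that appear in official data, assuming detection rises with staffing."""
    detection_rate = min(cap, base_detection + gain_per_officer * officers)
    return true_crimes(officers) * detection_rate


true_before, true_after = true_crimes(0), true_crimes(100)
recorded_before, recorded_after = recorded_crimes(0), recorded_crimes(100)

print("true offenses:", true_before, "->", true_after)
print("recorded offenses:", recorded_before, "->", recorded_after)
```

An evaluation that relied only on recorded offenses would reach the opposite conclusion from one that could observe offenses directly, regardless of how units were assigned to receive added officers.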

A similar problem emerges in studying the effect of incarceration on crime. Regardless of whether prison is hypothesized to reduce crime by deterring future offenders, incapacitating current offenders, or rehabilitating offenders so they desist upon release, the mechanical effect of increasing prison capacity is to reduce crime from expected levels at least temporarily.2 Thus, just as more police could be wrongly assumed to “cause” more crime, more prison could falsely appear to “cause” less crime. The effects cannot be easily disentangled. Again, an RCT is of limited help both due to practical constraints (constructing a prison in service of an experiment would be unlikely) and because the targets of crime prevention policies cannot be randomly assigned (a policy, by definition, affects everyone).3

2 Incarceration could be deemed by some as immoral or unjust. The only concern here, however, is whether more or less incarceration is associated with the amount of crime. In that scenario, it could be logical to expect that larger numbers of incarcerated people will reduce the number of crimes.

3 An experiment is not impossible in this setting, but in practice such research is rarely attempted.

To study this question, Levitt (1996) used an instrumental variables technique. The idea conceptually was to find a third variable that affects the outcome being studied (crime) only through the original variable (prison) but is not affected by the outcome (crime) in a confounding way. Thus, the third variable is used as an instrument to test the effects of the original variable without bias. Levitt posited that court orders for overcrowding serve as an instrumental variable because judicial decisions to release inmates from overcrowded facilities are not directly affected by crime levels. Once prison levels are reduced, their potential effects on crime might be observed directly. Levitt used state panel data (each state over several years) and interpreted a change in the level of crime after a court order as evidence for a positive effect of prison on crime. Levitt found a deterrent effect of prison that was larger than those found by most prior (and subsequent) studies using other methods, but generally within the same confidence bounds and thus accepted it as a reasonable benchmark.
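
A toy example may help show why an instrument matters. The Python sketch below uses a simple ratio (Wald-type) instrumental variables estimator on six synthetic observations; it is not Levitt’s (1996) panel specification, and the variable names, the unobserved factor, and all coefficients are assumptions made for illustration. An unobserved factor raises both prison levels and crime, so a naive regression of crime on prison is distorted, while the estimate based on the court-order instrument recovers the effect built into the data.

```python
# Illustrative sketch only: synthetic data, hypothetical coefficients.
from itertools import product


def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def ols_slope(x, y):
    """Naive regression slope of y on x."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Ratio IV estimator: effect of z on y divided by effect of z on x."""
    return covariance(z, y) / covariance(z, x)


TRUE_EFFECT = -1.5  # assumed effect of one unit of prison on crime

court_order, prison, crime = [], [], []
# z: court order in place (1) or not (0); u: unobserved factor that raises
# both prison and crime. Every (z, u) combination appears once, so the
# instrument is unrelated to the unobserved factor by construction.
for z, u in product([0, 1], [-1.0, 0.0, 1.0]):
    x = 10.0 - 2.0 * z + 3.0 * u          # court orders lower prison levels
    y = 50.0 + TRUE_EFFECT * x + 6.0 * u  # crime depends on prison and on u
    court_order.append(z)
    prison.append(x)
    crime.append(y)

naive = ols_slope(prison, crime)
instrumented = iv_slope(court_order, prison, crime)
print("naive slope:", naive)
print("IV slope:", instrumented)
```

In this constructed example the naive slope even has the wrong sign. The instrumented estimate is only as credible as the claim that court orders influence crime through prison levels alone, which no calculation of this kind can verify by itself.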


One of the most compelling reasons to evaluate criminal justice with designs other than randomized trials is that justice policies and practices are often highly complex with causal mechanisms that are not reducible to simple models. A recent example focused on the extent of variation in crime rates between U.S. cities over the past two decades. Sharkey (2018) examined administrative data and various organizational indicators to test whether the scale of nonprofit social services and citizen organizing in communities may have significant effects on public safety.

Conventional explanations of crime control attribute public safety largely to the efforts of law enforcement and other criminal justice entities. Sharkey tested an important, alternative explanation: perhaps the depth and diversity of social supports in a city help to mitigate the anti-social forces that generate high crime rates even after controlling for the efforts of the formal justice system. As he notes, “[u]nlike the studies showing the impact of policing and incarceration, virtually no research has been done on the effect of nonprofits on crime” (2018:52). Why has there been no analysis of this critical question? Partially because it is virtually impervious to individual-level, RCT evaluation designs and partially because it requires untangling a key endogeneity problem, “because places with more antiviolence organizations are, of course, places with higher levels of crime” (2018:52). To solve this technical challenge, Sharkey relies on instrumental variables in the manner of the Levitt prison analysis, and analyzes changes in nonprofit startup trends over time, controlling for any possible confounding effect. His analysis found convincing evidence for the neighborhood supports hypothesis.

Results of the study will probably exert a strong influence on policy and practice for many years, and it would have been highly unlikely for such evidence to emerge from an experimental evaluation. The sheer scale and complexity of the forces identified in the analysis would make it impossible to specify a small number of factors and then control for them sufficiently to model their combined effects on crime over time. If researchers and funding organizations consistently avoided studies not amenable to experimental designs, the type of evidence produced by Sharkey would never be available to help shape criminal justice policy and practice.


Randomized controlled trials are a critical part of the evidence base for criminal justice policy and practice, but their applications are limited. In their most basic form, RCTs require binary comparisons. Interventions must be completely present or completely absent among a group of subjects. In practice, stratified random assignments could be used to test multiple levels of dosage with equivalent rigor, but even modeling interventions at varying levels cannot capture the full complexity of actual criminal justice actions and their effects on individuals and communities. Criminal justice interventions vary dosage according to a variety of factors and in non-continuous ways. There may be tremendous heterogeneity within a single modality. Finding an RCT design, even a stratified design, that accommodates treatment heterogeneity is a substantial practical challenge.

Criminal justice evaluations also are rarely able to create clear control conditions. Control groups in justice RCTs are generally not true “no-treatment” groups akin to the subjects receiving the placebo in a medical trial. In practice, the control condition in a criminal justice evaluation is often described as a “business as usual” option. In other words, evaluation subjects may be randomly assigned to drug court versus regular criminal court, or to intensive probation rather than traditional supervision. The group receiving business as usual, however, is likely quite heterogeneous and experiencing a wide range of outcomes, thus confounding the interpretation of treatment effects. Disentangling the treatment mechanism and identifying the source of a treatment effect presents substantial challenges.

The RCT treatment or placebo model also anticipates that people, rather than places, are the appropriate units of analysis. In criminal justice studies, place attributes are equally, and often more, predictive of crime outcomes. Thus, place-based interventions can be very effective for crime prevention and crime reduction (Eck, 2002). Places, however, are not often randomly assigned to receive criminal justice interventions. Even if place-based studies were capable of randomizing communities to receive an intervention, there would usually be too few of them to generate strong statistical results. Random assignment of places would also be unlikely to control for potentially confounding factors equally across treatment and controls in the way randomly assigning a medical therapy to an individual does.

These and other obstacles to conducting RCTs in criminal justice are why true experimental evaluations are relatively rare in the crime prevention and crime reduction sector. Other forms of research and evaluation will always be a critical part of the development of sound policies and practices. When RCTs are portrayed as the only form of reliable evidence – whether such claims are made by advocates, policymakers, or even researchers themselves – the evidence agenda in criminal justice is harmed. Instead of carefully considering the most practical and rigorous ways to answer important questions about the effectiveness of justice interventions, practitioners are encouraged to fashion crime prevention strategies simply by looking at and locating whatever green checkmarks come close to their desired intervention models. When the operations of the criminal justice system are managed in this fashion, it endangers public safety by limiting the vision of justice professionals and their community partners.

Evaluation evidence is not simply a question of “what works” and “what doesn’t?” It’s also a matter of “what’s promising,” “what might work better,” and “what haven’t we even considered yet?” Justice practitioners should recognize that the current menu of intervention options in criminal justice does not represent the best of all possible program options. Today’s menu of evaluation knowledge is the result of yesterday’s investments in evaluation research – investments that were chosen consciously based on the goals, preferences, and values of individual researchers and the funding organizations that supported their work. The Science Advisory Board for the U.S. DOJ’s Office of Justice Programs understood this when it recommended that the research and evaluation agenda of the federal government be viewed as an investment portfolio and that the range and type of investments should be designed to advance future knowledge and not simply to reinforce the value of existing approaches to public safety.

“The strength of evidence required to judge the value of programs and practices in the justice field is a question of balance. Judgments should be based on the best available evidence, but the strength of evidence required for any decision is gauged by the costs of error and the burden of increasing evidentiary quality. Decisions with little consequence require less accurate evidence and less exhaustive evidence. Highly consequential decisions require more evidence. Navigating the continuum of evidence-supported decision-making is complex and subjective. The available evidence for any policy, program, or practice is not the product of a straightforward and untrammeled search for effectiveness. It emerges from a contentious and inherently political process that governs social investment in research” (Science Advisory Board, 2016b).

When RCT advocates attempt to persuade policymakers and the general public that experimental evaluations are the only respectable form of evidence, they often use the marketing term “gold standard” to describe the findings of randomized studies. As noted by Sampson (2010: 496), however, there is no such thing as a gold standard evaluation method, but the research community might be well-advised to work on developing gold standard questions. In other words, knowledge is best supported when researchers use the most rigorous methods available, whether RCT or others, to generate reliable and valid answers to well-crafted, practical questions about the full range of important issues in criminal justice policy and practice.



Baumer, Eric P. (2013). Reassessing and redirecting research on race and sentencing. Justice Quarterly, 30(2): 231–261.

Braga, Anthony A., Andrew V. Papachristos, & David M. Hureau (2014). The effects of hot spots policing on crime: An updated systematic review and meta-analysis. Justice Quarterly, 31(4): 633–663.

Butts, Jeffrey A., & John K. Roman (2014). Line drawing: Raising the minimum age of criminal court jurisdiction in New York. New York, NY: Research and Evaluation Center, John Jay College of Criminal Justice, City University of New York.

Eck, John E. (2002). Preventing crime at places. In Lawrence W. Sherman, David P. Farrington, Brandon C. Welsh, & Doris Layton MacKenzie (Eds.), Evidence-based crime prevention (Chap. 7, pp. 241–294). New York, NY: Routledge.

Giannelli, Paul C. (1982). Chain of custody and the handling of real evidence. American Criminal Law Review, 20, 527–568.

Gueron, Judith M., & Edward Pauly (1991). From welfare to work. New York, NY: Russell Sage Foundation.

Hansen, Benjamin (2015). Punishment and deterrence: Evidence from drunk driving. American Economic Review, 105(4), 1581–1617.

Hjalmarsson, Randi (2009). Crime and expected punishment: Changes in perceptions at the age of criminal majority. American Law and Economics Review, 11(1), 209–248.

Levitt, Steven D. (1996). The effect of prison population size on crime rates: Evidence from prison overcrowding litigation. Quarterly Journal of Economics, 111, 319–352.

Monahan, Kathryn, Laurence Steinberg, & Alex R. Piquero (2015). Juvenile justice policy and practice: A developmental perspective. In Michael Tonry (Ed.), Crime and justice: A review of research (Vol. 44, pp. 577–619). Chicago: University of Chicago Press.

Patton, Michael Quinn. (1978). Utilization-focused evaluation. Beverly Hills, CA: Sage Publications.

Patton, Michael Quinn. (1982). Practical evaluation. Beverly Hills, CA: Sage Publications.

Rossi, Peter H., Mark W. Lipsey, & Howard E. Freeman. (2004). Evaluation: A systematic approach (7th ed.). Thousand Oaks, CA: Sage Publications, Inc.

Sampson, Robert J. (2010). Gold standard myths: Observations on the experimental turn in quantitative criminology. Journal of Quantitative Criminology, 26(4): 489–500.

Science Advisory Board. (2016a). Research Methodology and Evidence Translation Subcommittee Advisory Statement #1. Washington, D.C.: Office of Justice Programs, U.S. Department of Justice.

Science Advisory Board. (2016b). Research Methodology and Evidence Translation Subcommittee Advisory Statement #2. Washington, D.C.: Office of Justice Programs, U.S. Department of Justice.

Sharkey, Patrick. (2018). Uneasy peace: The great crime decline, the renewal of city life, and the next war on violence. New York, NY: W. W. Norton & Company.

Sherman, Lawrence W., Denise C. Gottfredson, Doris L. MacKenzie, John Eck, Peter Reuter, & Shawn D. Bushway. (1998). Preventing crime: What works, what doesn’t, what’s promising. Research Brief [MCJ171676]. Washington, DC: National Institute of Justice, U.S. Department of Justice.

Skogan, Wesley G., & Arthur J. Lurigio (1991). Multisite evaluations in criminal justice settings: Structural obstacles to success. New Directions for Evaluation, No. 50. San Francisco, CA: Jossey-Bass.

Weiss, Carol H. (1998). Evaluation: Methods for studying programs and policies (2nd ed.). Upper Saddle River, NJ: Prentice-Hall.