Author: Howard White, Director, Evaluation and Evidence Synthesis, Global Development Network
Evaluations have two key functions: lesson learning and accountability. How well they fulfil these functions depends on the suitability of the evaluation design for addressing the evaluation questions of interest, and on the quality of the evaluations themselves. Unfortunately, many evaluations suffer from flaws which reduce the confidence we can have in their findings, and so their usefulness for both lesson learning and accountability.
This blog lists 10 flaws that I commonly come across. Not all evaluations have these flaws; there are many excellent evaluations. But these flaws are sufficiently common to be worth drawing attention to. After all, better evaluations can mean better lives for all.
1. Inadequate description of the intervention
Many evaluations give an inadequate description of the intervention. When I worked in the evaluation department of the World Bank, we reviewed all the process evaluations produced by operations – called completion reports. Often I’d find myself asking: you spent US$500 million on this project, so what did you spend it on? While the project components might be listed, there were scant details of the specific activities. More recently I read an evaluation of an intervention for people experiencing homelessness in a city in the UK which failed to name a single intervention being evaluated – though the authors did conclude that whatever it was that was being evaluated was working.
A new meta-evaluation I have been working on with Hugh Waddington and Hikari Umezawa for the Dutch Ministry of Foreign Affairs finds this problem, which Hugh labelled “missing beginnings” in the theory of change, to be very common. The evaluations claimed that policy changes were the result of Dutch support to CSOs – but as no information was provided on the activities undertaken, it is not possible to know whether such findings are credible. This flaw seriously undermines the lesson learning function of evaluation. If we don’t know what was done, then programme managers cannot learn what to do and what to avoid.
2. ‘Evaluation reports’ which are monitoring not evaluation
There are two versions of this flaw. The second appears as flaw number 9 below.
The first is simply to report outputs with no evaluative content. I have recently been reading evaluations of career guidance services for young people in various countries in Africa. Several evaluations simply state that such a service was offered, sometimes specifying the number of career centres which were set up with project support, and sometimes the number of people using the service. All those numbers are monitoring, not evaluation.
An evaluation requires some evaluative content: usually around success factors or challenges in implementation for a process evaluation, or the difference the career guidance makes to career choices and employment for an impact evaluation. There are such studies. From several process evaluations we learn, for example, that job fairs are always held in the capital city, and so are not accessible to most of the country’s young people. And an impact evaluation of job fairs in Ethiopia found that they led to few job offers, as young people had unrealistic expectations about the jobs and salaries they might get. Such findings provide useful lessons. So-called evaluations which are simply monitoring reports of outputs may serve an accountability function, but they are of no use for lesson learning.
3. Data collection is not a method
The methods section of many evaluations contains statements such as ‘Our method is a sample survey’ or ‘We will hold focus group discussions with participants’. This is how the data will be collected. It is not an evaluation method. The evaluation method is how the data will be analysed.
Methods sections would be better renamed evaluation design. Evaluation design has two components: data collection and data analysis. Both need to be described. The lack of explicit evaluation methods means that, at best, we do not know what confidence we can have in the evaluation findings, as we do not know how they were arrived at. Hence neither lesson learning nor accountability is served by the absence of clear evaluation methods. This flaw is closely associated with the fourth flaw.
4. Unsubstantiated evidence claims
Evaluations sometimes present findings without any supporting evidence – such as ‘the women’s groups made women feel more empowered’ or ‘the careers services were effective’. Unsubstantiated claims mean precisely that: no supporting evidence is presented whatsoever. We can place no confidence in such claims, so they are of no use for either lesson learning or accountability. And when evidence is presented it is not always sufficient, which is the next flaw.
5. Insufficient evidence
The evidence presented in some evaluations is not sufficient to support the evaluation finding. A generic example is ‘some participants said the programme helped them’. How many participants, out of how many asked, and how were those participants selected? Evaluations are often not clear on the balance of evidence. Evaluators often go where the project staff suggest they go, and speak only to the people presented to them, which is one possible source of positive bias (flaw 6). This flaw is also related to the flaws of methods-free evaluations (flaw 3) and the inappropriate use of monitoring data (flaw 9).
6. Positive bias in process evaluations
There are many sources of positive bias in evaluation. Impact evaluations recognize and address the main bias they face, namely selection bias. There is a wide range of possible biases in qualitative evaluations which are often not recognized, such as who evaluators speak to (flaw 7), confirmation bias, and asking leading questions with an excessive focus on the intervention. It is striking that qualitative evaluation approaches find 80% of interventions successful, whereas the consensus from effectiveness studies is that 80% of interventions don’t work.
An interview for a qualitative assessment of programme effectiveness often goes something like this: ‘I am conducting an evaluation of project X for agency Y. Was project X successful? How did it contribute to achieving outcome A?’ A more appropriate approach is: ‘I am a researcher from University Z interested in developments with respect to outcome A. What have been the main drivers of change in outcome A?’ The former approach clearly introduces a bias in favour of the intervention, which reduces its usefulness for lesson learning and accountability.
7. Limited perspectives: Who do evaluators speak to?
This problem arises in two forms. The first, mentioned above, is that evaluation teams allow themselves to be directed by project staff as to which sites to visit and who to speak to when they get there. Random sampling, including for qualitative studies, can be a useful way to combat this issue.
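As a concrete illustration of what random sampling might look like in practice, here is a minimal sketch in Python. The site and participant lists are hypothetical placeholders, not drawn from any of the evaluations discussed; the point is simply that the evaluation team, rather than project staff, draws the sample.

```python
# Minimal sketch (hypothetical data): the evaluation team draws a random
# sample of sites, and of respondents within each site, rather than
# visiting the sites and people suggested by project staff.
import random

sites = ["Site A", "Site B", "Site C", "Site D", "Site E", "Site F"]
participants_by_site = {
    site: [f"{site} participant {i}" for i in range(1, 51)] for site in sites
}

rng = random.Random(2024)  # fixed seed so the draw can be documented and reproduced

sampled_sites = rng.sample(sites, k=3)  # which sites to visit
sampled_respondents = {
    site: rng.sample(participants_by_site[site], k=5)  # who to interview at each site
    for site in sampled_sites
}

print(sampled_sites)
print(sampled_respondents)
```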
The second form – also a common source of positive bias – is who evaluators speak to, which is largely ‘people like them’. Evaluators in international development interventions typically speak to staff from the funding agency, the various layers of implementing agencies, and government counterparts. Important voices are left out. Evaluators of education projects don’t speak to teacher unions, evaluators of health projects don’t speak to frontline health workers, evaluators of policy initiatives don’t speak to politicians, and evaluators of interventions to address key social issues don’t speak to journalists.
8. Ignoring the role of others
Another important source of bias is ignoring the role of other agencies – or even of different projects from the same agency – in supporting the changes the evaluation claims for the intervention being evaluated. An example I came across recently was a country evaluation of World Bank support to the Philippines which highlighted the Bank’s role in the introduction of a sin tax – that is, taxing alcohol and tobacco to subsidise health insurance. The evaluation presents the Bank as the key player in getting the government to introduce the policy. But a Harvard case study of the policy change shows it to have been domestically driven, led by the President himself, with the formative research informing the policy conducted by many agencies, of which the Bank was just one.
Similarly, our meta-evaluation found many claims that Dutch support for a southern CSO contributed to a policy change, ignoring the fact that the same CSO was supported by many other agencies.
9. Causal claims based on monitoring data
Outcome evaluations present data on outcomes in the project area and claim that any observed changes are the result of the project. This is an inappropriate use of the monitoring data, which I have named the evidence Peter Principle.
For example, after a vocational training programme, 65% of trainees enter employment within six months of completing the training. This is presented as the impact of the training, or the intervention ‘result’. But suppose we know that 60% of a comparable group of youth who don’t get the training also find employment. It is this difference of five percentage points which is the true measure of the effect. Evaluations which simply report outcomes before and after the intervention are of no evaluative use whatsoever. They serve no purpose for either lesson learning or accountability.
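To make the arithmetic concrete, here is a minimal sketch in Python using the illustrative figures above (65% employment among trainees, 60% among a comparable untrained group). The numbers are hypothetical, as in the text.

```python
# Hypothetical figures from the vocational training example above.
employed_trainees = 0.65    # employment rate among trainees (monitoring data)
employed_comparison = 0.60  # employment rate in a comparable untrained group

monitoring_result = employed_trainees                       # what a 'results framework' reports
estimated_effect = employed_trainees - employed_comparison  # counterfactual-based estimate

print(f"Reported 'result': {monitoring_result:.0%}")
print(f"Estimated effect of the training: {estimated_effect * 100:.0f} percentage points")
```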
This problem has been greatly exacerbated by the results agenda. Agencies produce ‘results frameworks’ – so monitoring staff focus on collecting high-level outcomes which are recorded as ‘results’ – which lull managers into the mistaken belief that changes in outcomes tell us something about project achievements. They don’t. Monitoring needs to be rescued from results, and returned to its proper role of collecting data on activities and outputs, which managers can actually do something about.
Causal claims should be based on a counterfactual analysis, whether using large n or small n methods.
10. Global claims based on single studies
Evaluations provide evidence on an intervention with a specific design, implementation fidelity, context and treatment population. The main lessons to be learned apply to the intervention being evaluated. Of course, we hope that the study will have some external validity, so that its findings are transferable. But when considering which intervention would be most appropriate in a new setting, it is best to look at the global evidence base, preferably using a systematic review.
A controversial example is the case of deworming. A small number of studies, notably a study in western Kenya, have found effects on health, education and nutrition, which have been the basis for a Deworm the World movement and claims that deworming is amongst the best buys in development. But this seems to be an African effect. There are many more studies from elsewhere finding no such effects. The Campbell systematic review finds little or no effect across a range of outcomes. Hence claims that school-based deworming programmes in, say, India are evidence-based are misplaced. Studies from India find no effect.
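To illustrate why a single striking study should not carry global claims, here is a minimal, hypothetical sketch of fixed-effect (inverse-variance) pooling, the kind of synthesis a systematic review performs. The effect sizes and standard errors are invented for illustration; they are not results from the deworming literature.

```python
# Hypothetical studies: one large positive effect plus several near-null findings.
studies = [
    {"name": "Study 1", "effect": 0.30, "se": 0.10},
    {"name": "Study 2", "effect": 0.02, "se": 0.05},
    {"name": "Study 3", "effect": -0.01, "se": 0.04},
    {"name": "Study 4", "effect": 0.03, "se": 0.06},
]

# Fixed-effect meta-analysis: weight each study by the inverse of its variance.
weights = [1 / s["se"] ** 2 for s in studies]
pooled_effect = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)

print(f"Pooled effect: {pooled_effect:.3f}")  # close to zero, despite Study 1's 0.30
```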
Some further reading
The “80% rule” that 80% of programmes are ineffective is discussed in my paper on the evidence revolution, and in this blog from 80,000 Hours. Several of the examples come from this blog from Straight Talk on Evidence. The capture of monitoring by the results agenda is discussed in my paper on the unfinished evidence revolution.