Can we reduce facial biases?

Trait impressions from faces influence many consequential decisions even in situations in which decisions should not be based on a person's appearance. Here, we test (a) whether people rely on trait impressions when making legal sentencing decisions and (b) whether two types of interventions (educating decision-makers and changing the accessibility of facial information) reduce the influence of facial stereotypes. We first introduced a novel legal decision-making paradigm. Results of a pretest (n = 320) showed that defendants with an untrustworthy (vs. trustworthy) facial appearance were found guilty more often. We then tested the effectiveness of different interventions in reducing the influence of facial stereotypes. Educating participants about the biasing effects of facial stereotypes reduced explicit beliefs that personality is reflected in facial features, but did not reduce the influence of facial stereotypes on verdicts (Study 1, n = 979). In Study 2 (n = 975), we presented information sequentially to disrupt the intuitive accessibility of trait impressions. Participants indicated an initial verdict based on case-relevant information and a final verdict based on all information (including facial photographs). The majority of initial sentences were not revised and therefore unbiased. However, most revised sentences were in line with facial stereotypes (e.g., a guilty verdict for an untrustworthy-looking defendant). On average, this actually increased facial bias in verdicts. Together, our findings highlight the persistent influence of trait impressions from faces on legal sentencing decisions. People spontaneously


Facial stereotypes influence decision-making
While there are numerous studies demonstrating the effects of facial stereotypes, comparatively little is known about why people persistently rely on trait impressions from faces. Addressing this question is crucial, as an understanding of the underlying mechanism not only advances theory, but is also a requirement for designing effective interventions. Recently, two (non-mutually exclusive) hypotheses have been put forward to address this gap. One explanation posits that the widespread influence of trait impressions can be explained by lay beliefs in the diagnostic value of facial appearance for inferring personality traits (Jaeger, Evans, Stel, & van Beest, 2019b; Rezlescu, Duchaine, Olivola, & Chater, 2012; Todorov, 2017). Many people believe in physiognomy, the idea that personality traits are reflected in an individual's facial appearance (Jaeger et al., 2019b; Suzuki, Tsukamoto, & Takahashi, 2017). Such beliefs may drive reliance on facial stereotypes because how much people rely on a certain cue is usually not determined by how predictive the cue actually is (i.e., how accurate trait impressions are), but by how predictive people think the cue is (i.e., how accurate people think their trait impressions are; Brunswik, 1956; Hammond, Hursch, & Todd, 1964). In fact, individual differences in physiognomic belief predict reliance on trait impressions when making economic trust decisions (Jaeger et al., 2019b): People who more strongly believe that trustworthiness is reflected in facial features rely more on their counterpart's perceived trustworthiness when deciding whom to trust. Thus, reliance on trait impressions may be driven by beliefs in the diagnostic value of facial appearance for judging an individual's personality.
A second explanation posits that the intuitive accessibility of trait impressions from faces can account for their persistent effects (Jaeger, Evans, Stel, & van Beest, 2019a). Faces attract attention (Ro, Russell, & Lavie, 2001; Theeuwes & Van der Stigchel, 2006) and are processed quickly and efficiently (Stewart et al., 2012; Willis & Todorov, 2006). This processing advantage leads to an intuitive accessibility of trait impressions from faces. As a consequence, reliance on facial stereotypes is relatively fast and not influenced by the restriction of cognitive capacities (Bonnefon et al., 2013; Jaeger et al., 2019a; Mieth, Bell, & Buchner, 2016). Crucially, previous research has shown that people favor readily available cues as they reduce decision effort (Evans & Krueger, 2016; Gigerenzer, Hertwig, & Pachur, 2011; Shah, 2007; Shah & Oppenheimer, 2008). Thus, people may rely on trait impressions from faces because it allows them to make decisions relatively effortlessly.

Reducing reliance on facial stereotypes
To sum up, previous research suggests that the pervasive influence of facial stereotypes is driven by a combination of (a) beliefs in the diagnostic value of facial appearance for inferring personality traits and (b) the intuitive accessibility of trait impressions from faces. Crucially, similar mechanisms have been identified in other research areas that investigate decision biases. Theories in the field of judgment and decision-making often distinguish between two general sources of bias: false beliefs (i.e., misconceptions) and automatically activated associations (i.e., misleading intuitions; Morewedge & Kahneman, 2017; Soll, Milkman, & Payne, 2014; Wilson & Brekke, 1994). Moreover, social psychological theories of bias typically distinguish between explicit and implicit expressions of bias (Devine, 1989; Dovidio, Kawakami, & Gaertner, 2002; Greenwald & Banaji, 1995; Greenwald, McGhee, Jordan, & Schwartz, 1998). Due to these similarities, we draw on the extensive literature on debiasing techniques in judgment and decision-making (Morewedge et al., 2015; Soll et al., 2014) and social psychology (Forscher et al., 2019; Lai et al., 2014) to design interventions aimed at reducing reliance on facial stereotypes.
A prominent strategy for reducing biases caused by misconceptions is to challenge beliefs through education (Chan, Jones, Hall Jamieson, & Albarracín, 2017; Soll et al., 2014). For example, educating people about compound interest can increase saving behavior (McKenzie & Liersch, 2011), educating people about cognitive biases can lead to more rational clinical decision-making (Hershberger, Markert, Part, Cohen, & Finger, 1997), and raising awareness of prejudice based on social group affiliation can reduce discrimination (Axt, Casola, & Nosek, 2018). Directly confronting participants with their stereotypes (rather than just raising awareness about the existence of stereotypes in general) has also been shown to reduce biased behavior (Czopp, Monteith, & Mark, 2006; Parker, Monteith, Moss-Racusin, & Van Camp, 2018). In Study 1, we therefore test whether we can reduce reliance on trait impressions by educating people about the influence of facial stereotypes or by confronting them with the fact that their facial stereotypes are not accurate.
A prominent strategy for reducing biases caused by intuitively available information is to design decision environments in such a way that participants are nudged to rely on the "right" cues (Soll et al., 2014; Thaler & Sunstein, 2008). The primary and efficient processing of faces leads to a quick availability of face-based impressions (Freeman & Johnson, 2016; Todorov, Pakrashi, & Oosterhof, 2009; Willis & Todorov, 2006). Crucially, information that is available first often exerts a disproportionate influence on decisions (Asch, 1946; Dimov & Link, 2017; Sullivan, 2018). Initial response tendencies are not sufficiently adjusted based on subsequently processed information (producing anchoring effects; Tamir & Mitchell, 2012; Tversky & Kahneman, 1974) and people are sometimes not able or willing to exert the cognitive effort required to integrate all available information (Shah & Oppenheimer, 2008; Simon, 1955). As a consequence, people often make decisions based on the cue that was processed first in order to reduce decision effort (Gigerenzer et al., 2008). This implies that manipulating how deeply and in which order information is processed could reduce the influence of facial stereotypes (Ghaffari & Fiedler, 2018). In Study 2, we therefore test whether preventing the primary processing of faces by presenting information sequentially (with faces being displayed after more relevant information) reduces reliance on facial stereotypes. We also test the effectiveness of prompting participants to make reflective rather than intuitive decisions.
We are not the first to test how different factors influence reliance on facial stereotypes. Providing information on how trustworthy a person has been in the past (Rezlescu et al., 2012) or giving feedback about a person's trustworthiness in a repeated interaction (Yu, Saleem, & Gonzalez, 2014) has been shown to reduce reliance on facial trustworthiness. In a similar vein, simply omitting photos from the decision-making environment would obviously eliminate any influence of facial appearance. These strategies may be effective, but they are not viable interventions in most real-world situations. When deciding on the culpability of a defendant or on the suitability of a job candidate, decision-makers are often faced with a limited amount of ambiguous or contradicting pieces of information, and it may not be possible to provide additional information about past behavior. It might also not be possible to completely remove information about a person's appearance. For these reasons, and in contrast to previous work, we focused on interventions that do not omit or add any additional decision-relevant information. Our goal was to test the effectiveness of different interventions under conditions that resemble the real-world situations in which the biasing effect of trait impressions from faces is particularly prevalent and problematic (e.g., in criminal sentencing, personnel selection, or voting).

The current studies
Here, we examine the effectiveness of different interventions in reducing the effect of facial stereotypes on legal sentencing decisions. We focus on decision-making in a legal context, because sentencing decisions can be immensely consequential, making biased decision-making particularly problematic. Appearance-based stereotyping undermines people's right to a fair trial (Lown, 1977). Yet, a host of studies has shown that facial stereotypes influence many real-life legal outcomes (Berry & Zebrowitz-McArthur, 1988; Eberhardt, Davies, Purdie-Vaughns, & Johnson, 2006; Porter et al., 2010; Wilson & Rule, 2015, 2016; Zebrowitz & McDonald, 1991).
Similar to Zebrowitz and McDonald (1991), we focus on sentencing decisions in small claims court. Small claims court judges hear civil cases in which people can sue private citizens for relatively small amounts of money (e.g., up to $5000; the exact amount varies across countries). Plaintiffs and defendants often represent themselves and the evidence presented to the judge tends to be limited. However, the burden of proof is also relaxed in small claims cases: Plaintiffs do not need to present evidence that implicates the defendant "beyond reasonable doubt", but judges rule in favor of the party that presents the most credible and convincing arguments. Given that small claims rulings reflect a more subjective interpretation of the evidence by the judge, it is possible that sentences are influenced by facial stereotypes. In fact, Zebrowitz and McDonald (1991) showed that babyfacedness (a facial feature that is correlated with perceived trustworthiness; Berry & Zebrowitz McArthur, 1986; Zebrowitz & Montepare, 1992) predicted outcomes of small claims court rulings. Babyfaced defendants were found guilty less often (although this effect was only found for cases involving intentional, rather than negligent actions). Furthermore, when facing a babyfaced plaintiff, defendants that were found guilty had to pay a smaller fraction of the damages when they looked more babyfaced themselves. These results suggest that babyfaced individuals, who are generally seen as trustworthy, honest, and kind (Berry & Zebrowitz McArthur, 1986; Zebrowitz & Montepare, 1992), experience more leniency in court.
We present the results of three studies. All data, materials, preregistrations, and analysis scripts are available at the Open Science Framework (https://osf.io/h4yf3/). We report how our sample sizes were determined, all data exclusions, and all measures. In a pretest (n = 320), we develop and validate a legal sentencing paradigm that measures reliance on facial stereotypes. We examine whether the facial trustworthiness of plaintiffs and defendants influences sentencing decisions in small claims court cases. We then test the effectiveness of two types of interventions in reducing reliance on trait impressions in two preregistered studies. In Study 1 (n = 979), we educate participants about the low diagnostic value of facial appearance for inferring personality traits. In Study 2 (n = 975), we change the decision-making environment to disrupt the intuitive accessibility of trait impressions.

Pretest
We created a novel legal sentencing task, tailored to measure reliance on facial stereotypes. Previous experimental studies have predominantly taken two methodological approaches. In some studies, participants view a series of face images and indicate perceptions of culpability or sentencing decisions (e.g., Wilson & Rule, 2016). Multiple trials with within-subjects manipulations of facial appearance increase statistical power, but providing little or no background information on the cases limits the ecological validity of the task. In other studies, participants receive realistic case descriptions including relevant extenuating or aggravating facts (e.g., Berry & Zebrowitz-McArthur, 1988; Gunnell & Ceci, 2010). This approach more closely resembles the conditions in which decisions are made in real life. However, these studies usually consist of between-subject designs with few cases and face images, limiting statistical power and the generalizability of the results.
Here, we tried to incorporate advantages of the two approaches. Based on descriptions of real small claims court cases, we created ten fictitious case files, with plaintiffs filing suits against defendants. Cases included realistic evidence and we manipulated the perceived trustworthiness of plaintiffs and defendants in a within-subjects design. Participants indicated sentencing decisions for all ten cases. In line with previous studies (Berry & Zebrowitz-McArthur, 1988; Wilson & Rule, 2016), we expected participants to find defendants guilty more often when they look untrustworthy (vs. trustworthy). We also measured confidence in verdicts and, in case participants ruled in favor of the plaintiff, the damages they wished to award to the plaintiff. This allowed us to explore whether congruence between sentences and facial stereotypes (e.g., a guilty verdict for untrustworthy defendants) would increase confidence in verdicts. Moreover, we explored whether untrustworthy-looking defendants are punished twice, by being more likely to be found guilty and by receiving a harsher sentence (i.e., being ordered to pay more damages).

Participants
We recruited a total of 363 U.S. American workers from Amazon Mechanical Turk (MTurk; Paolacci & Chandler, 2014) who participated in exchange for $1.50. Data from 30 participants (8.26%) who failed an attention check at the end of the study and 8 participants (2.40%) who indicated having only a poor or basic English proficiency were excluded from analysis, leaving a final sample of 325 participants (50.46% female, M_age = 35.91, SD_age = 10.03).
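The exclusion percentages reported above appear to be computed sequentially, with each percentage taken relative to the participants remaining before that exclusion step. The following minimal sketch reproduces the arithmetic under that assumption; the function name and labels are illustrative and not taken from the authors' analysis scripts.

```python
def sequential_exclusions(recruited, exclusions):
    """Apply exclusion steps in order.

    Each percentage is computed relative to the number of participants
    still in the sample before that step (an assumption about how the
    reported percentages were derived).
    """
    remaining = recruited
    steps = []
    for label, count in exclusions:
        percent = round(100 * count / remaining, 2)
        steps.append((label, count, percent))
        remaining -= count
    return steps, remaining


steps, final_n = sequential_exclusions(
    363,
    [("failed attention check", 30), ("poor or basic English", 8)],
)
for label, count, percent in steps:
    print(f"{label}: {count} excluded ({percent:.2f}%)")
print("final sample:", final_n)
```

With the numbers reported for the pretest, the two steps come out at 8.26% and 2.40%, and 325 participants remain.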

Materials
We created case files for ten fictitious small claims court cases (see Fig. 1). Case files included a photo and demographic information on the plaintiff and the defendant. All individuals were White male U.S. citizens and had their first and last name redacted. Case files also included the size of the plaintiff's claim (ranging from $600 to $3600) and a case summary of approximately 130 words. Each summary mentioned the reason why the plaintiff was suing the defendant (e.g., seeking reimbursement for a damaged stereo system) and the evidence that was presented by the plaintiff and the defendant (e.g., photos of a broken speaker, a receipt confirming the purchase of a stereo system). In line with real-world small claims court cases, the evidence presented by both sides was relatively limited.
We selected 20 images of White male individuals from the Chicago Face Database (Ma, Correll, & Wittenbrink, 2015). The database includes ratings of all targets on various trait dimensions. We selected the ten individuals who received the lowest (M = 2.62, SD = 0.17) and the ten who received the highest (M = 3.78, SD = 0.09) ratings on perceived trustworthiness. Targets varied in perceived age with average age ratings ranging from 19.5 to 43.2 years (M = 28.60, SD = 6.90). Age ratings of the trustworthy-looking targets (M = 28.67, SD = 6.64) and untrustworthy-looking targets (M = 28.57, SD = 7.50) were very similar.
Next, we manipulated the perceived trustworthiness of all targets. Oosterhof and Todorov (2008) created a series of computer-generated face prototypes that reflect the typical facial appearance of targets varying on several trait dimensions (e.g., trustworthiness, dominance). We selected two face prototypes that reflect a high (i.e., three standard deviations above the mean) and low (i.e., three standard deviations below the mean) score on perceived trustworthiness. Using Psychomorph (Tiddeman, Burt, & Perrett, 2001), we transformed each target's face shape towards the face shape of the computer-generated prototype by 60%. Trustworthy-looking targets were morphed with the trustworthy-looking face prototype, whereas untrustworthy-looking targets were morphed with the untrustworthy-looking face prototype. This procedure somewhat exaggerated the facial features linked to perceptions of trustworthiness and allowed us to create prototypically (un-)trustworthy-looking individuals without compromising the realistic nature of the face stimuli.
Finally, we matched case files and face images. Each case featured a plaintiff and a defendant differing on perceived trustworthiness: One individual looked trustworthy while the other looked untrustworthy. We created four sets of stimuli. Each set contained all ten case files and all 20 face images. In each set, face images were randomly matched to a case and a role (i.e., plaintiff or defendant). Half of all cases featured a trustworthy-looking plaintiff and an untrustworthy-looking defendant, while the roles were reversed in the other half.
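To make the idea of a 60% shape transformation concrete, the sketch below moves a set of landmark coordinates 60% of the way toward a prototype by linear interpolation. This is only a simplified illustration: representing a face shape as a list of (x, y) landmarks and interpolating linearly are assumptions, and Psychomorph's actual procedure (including image warping) is more involved.

```python
def morph_shape(target, prototype, amount=0.6):
    """Move each target landmark `amount` of the way toward the prototype.

    `target` and `prototype` are equal-length lists of (x, y) landmark
    coordinates; this landmark representation is an illustrative assumption.
    """
    if len(target) != len(prototype):
        raise ValueError("target and prototype need the same landmarks")
    return [
        (x + amount * (px - x), y + amount * (py - y))
        for (x, y), (px, py) in zip(target, prototype)
    ]


# Hypothetical two-landmark example (coordinates are made up).
target_shape = [(0.0, 0.0), (10.0, 10.0)]
prototype_shape = [(10.0, 0.0), (10.0, 20.0)]
morphed = morph_shape(target_shape, prototype_shape, amount=0.6)
print(morphed)
```

An amount of 0 would leave the target unchanged and an amount of 1 would reproduce the prototype shape, so 0.6 yields a shape closer to the prototype than to the original face.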
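The matching constraints described above can be sketched as a small randomization routine. The face identifiers, the use of seeds, and the function name below are hypothetical; the sketch only illustrates one way to satisfy the stated constraints (all ten cases, all 20 faces, and a trustworthy-looking plaintiff in exactly half of the cases).

```python
import random


def build_stimulus_set(trustworthy, untrustworthy, seed=None):
    """Randomly match faces to cases and roles for one stimulus set."""
    rng = random.Random(seed)
    t_faces, u_faces = trustworthy[:], untrustworthy[:]
    rng.shuffle(t_faces)
    rng.shuffle(u_faces)
    n_cases = len(t_faces)
    # Half of the cases get a trustworthy-looking plaintiff.
    plaintiff_is_trustworthy = [True] * (n_cases // 2) + [False] * (
        n_cases - n_cases // 2
    )
    rng.shuffle(plaintiff_is_trustworthy)
    cases = []
    for case_id, (t, u, flag) in enumerate(
        zip(t_faces, u_faces, plaintiff_is_trustworthy), start=1
    ):
        plaintiff, defendant = (t, u) if flag else (u, t)
        cases.append(
            {
                "case": case_id,
                "plaintiff": plaintiff,
                "defendant": defendant,
                "plaintiff_trustworthy": flag,
            }
        )
    return cases


# Hypothetical face identifiers standing in for the 20 morphed images.
trustworthy_faces = [f"T{i:02d}" for i in range(1, 11)]
untrustworthy_faces = [f"U{i:02d}" for i in range(1, 11)]
stimulus_sets = [
    build_stimulus_set(trustworthy_faces, untrustworthy_faces, seed=s)
    for s in range(4)
]
```

Because every case pairs one face from each group, each case automatically features parties who differ in perceived trustworthiness.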

Procedure
Participants were randomly assigned to one of the four stimulus sets. To measure sentencing decisions, participants were instructed to carefully read each case and to indicate a sentence by ruling in favor of the plaintiff or the defendant. After each ruling, participants also indicated their confidence in the ruling on a scale that ranged from 1 (not confident at all) to 9 (extremely confident). In case participants ruled in favor of the plaintiff, they were asked to indicate the amount of damages that the plaintiff should be awarded on a scale that ranged from 50% to 100% (in steps of 10%) of the original claim.

Sensitivity analysis
We conducted a post hoc sensitivity analysis to determine the smallest effect size we were able to detect for our main effect of interest (the effect of facial trustworthiness on verdicts) with 80% power (and α = 5%). As software commonly used for sensitivity analyses, such as G*Power (Faul, Erdfelder, Lang, & Buchner, 2007), does not support multilevel data, we relied on the simr package (Green & Macleod, 2016) in R (R Core Team, 2019). The package provides power estimates for fixed effects in multilevel regression models. We systematically varied the effect of facial trustworthiness on verdicts and calculated power at each level, to test which effect size we were able to detect with at least 80% power. This showed that we had 80% power to detect an odds ratio of 1.27 for the effect of facial trustworthiness on verdict. To illustrate, an odds ratio of this size corresponds to a six percentage point difference in guilty verdicts (e.g., 50% vs. 56%) for trustworthy-looking versus untrustworthy-looking defendants.
Finally, regressing the amount of money that was awarded to the plaintiff in case of a guilty verdict on facial trustworthiness revealed a small positive effect, β = 1.655, SE = 0.727, t(137.4) = 2.28, p = .024, 95% CI [0.191, 3.078]. Participants awarded the plaintiff 1.63 percentage points more of their original claim when the defendant looked untrustworthy (85.28% vs. 83.64%).
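The conversion from an odds ratio to a percentage point difference can be checked with a few lines of arithmetic. The sketch below assumes a 50% baseline rate of guilty verdicts, as in the illustration above; the function name is illustrative.

```python
def shift_probability(baseline, odds_ratio):
    """Return the probability obtained by multiplying the baseline odds
    by `odds_ratio`."""
    odds = baseline / (1 - baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


baseline_rate = 0.50  # assumed guilty rate for trustworthy-looking defendants
p_untrustworthy = shift_probability(baseline_rate, 1.27)
difference_pp = 100 * (p_untrustworthy - baseline_rate)
print(f"{p_untrustworthy:.3f} ({difference_pp:.2f} percentage points higher)")
```

Note that the same odds ratio translates into a smaller percentage point difference when the baseline rate is further from 50%.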

Discussion
Results of the pretest showed that legal sentencing decisions were influenced by the facial trustworthiness of the involved parties. The rate of guilty verdicts was 8.03 percentage points higher when the defendant looked untrustworthy (vs. trustworthy). Facial trustworthiness also influenced how much money participants awarded to the plaintiff in case of a guilty verdict, with plaintiffs receiving 1.63 percentage points more when they were suing an untrustworthy-looking (vs. trustworthy-looking) defendant. We did not find any evidence that confidence in verdicts was influenced by facial trustworthiness. Thus, using a novel sentencing task with multiple cases and controlled manipulations of facial trustworthiness, we replicate prior work showing that people rely on trait impressions from faces when making legal sentencing decisions (Porter et al., 2010; Wilson & Rule, 2015, 2016). Our findings also replicate previous work by Zebrowitz and McDonald (1991) who found that babyfacedness (a facial feature that is correlated with perceived trustworthiness; Berry & Zebrowitz McArthur, 1986; Zebrowitz & Montepare, 1992) influenced verdicts and awarded damages in real-world small claims cases.

Study 1: belief interventions
In Study 1, we used the sentencing task that was developed and validated in the pretest to test the effectiveness of an intervention in reducing reliance on facial trustworthiness. Our goal was to reduce reliance on facial stereotypes by reducing explicit beliefs that personality can be judged from facial appearance (Jaeger et al., 2019b). In one condition, participants read a text that informed them about scientific research on facial stereotypes. The text mentioned the automatic accessibility of facial stereotypes, that facial stereotypes are usually not accurate, and that relying on them can result in worse decision-making outcomes. The intervention specifically focused on facial stereotypes, as previous work suggests that raising awareness of stereotypes in general may not be effective (Axt et al., 2018). Our manipulation was modelled after previous research in the domain of lay beliefs. For instance, Levy, Stroessner, and Dweck (1998) used fake scientific articles to manipulate beliefs in the innateness of personality traits and this influenced how strongly participants associated different social groups with stereotypical personality traits.
In a second intervention condition, we additionally confronted participants with the low diagnostic value of their facial stereotypes. Before reading the educational text, we showed participants ten pairs of faces. Their task was to identify which of the two individuals was a convicted felon. We told participants that they only guessed four out of ten correctly, meaning that their guesses were not better than chance. We measured physiognomic beliefs (i.e., participants' explicit beliefs that personality traits can be judged accurately from faces) in all conditions and hypothesized that, compared to a control condition in which participants were not exposed to a manipulation, both interventions would reduce physiognomic beliefs and reliance on facial trustworthiness when making sentencing decisions.

Power analysis
We conducted an a priori power analysis using the simr package in R, which allows one to test how power varies as a function of the number of levels of a random effect (in our case, the number of participants or the number of cases). As the number of cases was fixed, we tested how power varies across different numbers of participants. Calculating power across a wide range of sample sizes showed that 250 participants per condition are required to detect a 30% decrease in the effect of facial trustworthiness on verdicts with 80% power (and α = 5%). As a conservative measure, we decided to recruit 325 participants per condition.

Participants
We recruited a total of 1249 U.S. American workers from Amazon Mechanical Turk who participated in exchange for $2.50. Data from 227 participants (18.17%) who failed an attention check at the end of the study and from 42 participants (4.11%) who indicated poor or basic English proficiency were excluded from analysis, leaving a final sample of 979 participants (47.40% female, M_age = 36.14, SD_age = 11.24).

Materials & procedure
Participants were randomly allocated to one of three conditions. In all conditions, participants completed the legal sentencing task as described in the previous study. For each case, they ruled in favor of the plaintiff or the defendant and indicated their confidence in the ruling on a scale that ranged from 1 (not confident at all) to 9 (extremely confident). Next, to measure belief in the visibility of personality traits in facial appearance, participants completed the physiognomic belief scale (Jaeger et al., 2019b). Participants were prompted to imagine seeing the passport photo of a stranger. They were asked to indicate how much they agree with three statements (e.g., I can learn something about a person's personality just from looking at his or her face) on a scale from 1 (strongly disagree) to 7 (strongly agree). Average scores across the three items constituted our measure of physiognomic beliefs (Cronbach's α = 0.84).
The three conditions only differed in the texts participants were exposed to prior to completing the sentencing task. In the education condition (n = 332), participants read an educational text about personality impressions from faces that was approximately 300 words long. First, participants were told that people spontaneously form impressions of others' personality based on their facial appearance; that there is substantial agreement on what, for example, a trustworthy person looks like; and that these judgments are formed very quickly, sometimes without the perceiver's awareness. To illustrate these points, participants were shown two face images of a typical trustworthy-looking and untrustworthy-looking face (drawn from a database of computer-generated faces varying in perceived trustworthiness; Oosterhof & Todorov, 2008). Next, the text mentioned that trustworthiness impressions influence many important decisions even though research suggests that these impressions are often inaccurate. It was also highlighted that this is problematic because it leads to unfair treatment of people with a certain facial appearance (the exact text can be found in the online materials).
In the education-and-confrontation condition (n = 332), prior to reading the educational text, participants completed an additional task that was designed to demonstrate that their face-based impressions are inaccurate. Participants saw ten pairs of faces of male individuals that were taken from the 10k Faces Database (Bainbridge, Isola, Blank, & Oliva, 2013). Participants were told that each pair included one convicted felon and that their task was to identify that person. Feedback about accuracy was standardized across all participants. They were told that they only guessed four out of ten correctly, meaning that their guesses were not better than chance.
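As an illustration of how a three-item scale score and its internal consistency can be computed, the sketch below averages item responses per participant and calculates Cronbach's α from item and total-score variances. The response values are made-up toy data for illustration only and are unrelated to the study's data; the function name is likewise illustrative.

```python
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-participant item responses.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(responses[0])
    item_variances = [
        variance([row[i] for row in responses]) for i in range(k)
    ]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy responses (1-7 scale) for five hypothetical participants.
toy_responses = [
    [5, 4, 5],
    [2, 3, 2],
    [6, 6, 7],
    [3, 3, 4],
    [4, 5, 4],
]
belief_scores = [mean(row) for row in toy_responses]  # scale score per person
alpha = cronbach_alpha(toy_responses)
print(belief_scores, round(alpha, 2))
```

Higher values of α indicate that the items covary strongly relative to their individual variances, which is the sense in which the three statements form a coherent scale.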
In the control condition (n = 315), participants read a text about the geography of Scotland.
After reading the respective texts, participants answered three comprehension check questions (e.g., research shows that first impressions influence many important decisions). Participants could only proceed to the sentencing task after having answered all three questions correctly.

Physiognomic beliefs
First, we tested whether the interventions reduced beliefs that personality is reflected in facial appearance. Compared to participants in the control condition (M = 3.80, SD = 1.37), participants in the education condition (M = 3.59, SD = 1.33) indicated lower physiognomic beliefs, t(640.6) = 2.03, p = .042, d = 0.16, and so did participants in the education-and-confrontation condition (M = 3.50, SD = 1.38), t(643.5) = 2.80, p = .005, d = 0.22. Physiognomic beliefs did not significantly differ between the education and education-and-confrontation condition, t(661.2) = 0.82, p = .41, d = 0.06. These results show that both interventions were successful in reducing the belief that personality is reflected in facial features, although differences were relatively small.
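The standardized effect sizes can be approximated from the reported means and standard deviations. The sketch below assumes that Cohen's d is the mean difference divided by the root mean square of the two standard deviations; the exact formula used in the original analysis is not stated here, and values computed from rounded summary statistics can deviate slightly from those obtained from raw data.

```python
from math import sqrt


def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d using the root mean square of the two standard deviations
    as the standardizer (an assumption, see the text above)."""
    pooled_sd = sqrt((sd_1**2 + sd_2**2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Reported summary statistics: control vs. each intervention condition.
d_education = cohens_d(3.80, 1.37, 3.59, 1.33)
d_confrontation = cohens_d(3.80, 1.37, 3.50, 1.38)
print(round(d_education, 2), round(d_confrontation, 2))
```

Under this assumption, the reported means and standard deviations yield values that round to d = 0.16 and d = 0.22 for the two comparisons with the control condition.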

Exploratory analyses
To further probe the effects of the two interventions, we conducted Bayesian analyses using the BayesFactor package (Morey & Rouder, 2018) in R (R Core Team, 2019). Bayesian t-tests with default Cauchy priors yielded substantial support for the null hypothesis of no difference between the control condition and the education condition, BF_01 = 6.49, and strong support for the null hypothesis of no difference between the control condition and the education-and-confrontation condition, BF_01 = 10.66. These results support the conclusion that neither intervention significantly reduced reliance on facial trustworthiness.
The interventions were based on a proposed link between belief in the visibility of personality in a person's facial appearance and reliance on trait impressions when making decisions (Jaeger et al., 2019b). Even though the interventions somewhat reduced physiognomic beliefs, they did not reduce reliance on facial trustworthiness, raising the question whether physiognomic beliefs were actually related to reliance on facial trustworthiness. To test this, we extracted participant-specific slopes for the effect of facial trustworthiness from our multilevel regression models, as an indicator of how much each participant relied on trait impressions when making sentencing decisions. There was indeed a significant correlation between physiognomic beliefs and reliance on facial trustworthiness, r(977) = 0.200, p < .001. There was also a positive correlation between physiognomic beliefs and confidence in verdicts, r(977) = 0.204, p < .001. Participants who more strongly endorsed the belief that personality is reflected in facial features relied more on facial trustworthiness when making sentencing decisions and they were more confident in their verdicts. These results rule out the explanation that the observed reduction in physiognomic beliefs did not translate to less biased sentencing decisions because there was no link between beliefs and behavior.
Footnote 1. Comparing the control condition against a combination of both intervention conditions also yielded no significant difference in the effect of facial trustworthiness on verdicts, β = 0.038, SE = 0.116, z = 0.33, p = .74, 95% CI [−0.190, 0.267], OR = 1.04.

Discussion
Neither intervention successfully reduced the effect of facial stereotypes on sentencing decisions. Educating participants about the low accuracy of their trait impressions reduced explicit beliefs in the diagnostic value of facial appearance for inferring personality traits, but this effect was relatively small. Importantly, the intervention did not reduce reliance on facial stereotypes when making sentencing decisions and it did not reduce confidence in verdicts. The same pattern was observed for a second intervention: Even when participants were directly confronted with the low accuracy of their trait impressions, they continued to rely on them when making decisions in a subsequent task.

Study 2: accessibility interventions
In Study 2, we tested the effectiveness of an alternative intervention in reducing reliance on facial trustworthiness.Trait impressions from faces are intuitively accessible (Stewart et al., 2012;Todorov et al., 2009;Willis & Todorov, 2006) and accessible information often exerts a disproportionate influence on decisions (Shah, 2007;Simmons & Nelson, 2006;Tversky & Kahneman, 1974).To disrupt the primary processing of faces, we presented information sequentially.First, participants saw only case-relevant information and indicated an initial verdict.Then, they saw the entire case file (which also included face images of the plaintiff and defendant) and indicated their final verdict.We hypothesized that the majority of participants would not revise their initial verdicts.Reliance on intuitively available trait impressions constitutes a low-effort decision strategy and people might not be aware of the extent to which their decisions are influenced by facial stereotypes (Jaeger et al., 2019a).In our sequential design, participants have to actively revise their verdict (and ignore case-relevant information) if they want to base their decisions on the parties' facial appearance.They might be reluctant to do so because sticking with their initial verdict should reduce decision effort (Shah & Oppenheimer, 2008;Simon, 1955).Any initial verdict that is not revised reflects a verdict that is unbiased by facial stereotypes, as participants were not exposed to face images when deciding on their initial verdicts.Thus, if the majority of initial verdicts are not overturned, this should reduce the overall influence of facial stereotypes on verdicts compared to a control condition in which participants do not make decisions sequentially and are exposed to the face images right away.
In a second intervention condition, we tested whether the influence of intuitively available trait impressions would be further reduced by prompting participants to make reflective decisions (Newman, Gibb, & Thompson, 2017). To ensure that initial verdicts are based on a careful consideration of the case-relevant information, participants had to reflect on their initial verdicts for at least 30 s before they could indicate their decision.

Methods

Participants
Based on the results of the power analysis reported in Study 1, we again decided to recruit 325 participants per condition. We recruited a total of 1085 U.S. American workers from Amazon Mechanical Turk who participated in exchange for $2.50. Data from 93 participants (8.57%) who failed an attention check at the end of the study and from 17 participants (1.71%) who indicated a poor or basic English proficiency were excluded from analysis, leaving a final sample of 975 participants (49.74% female, M age = 35.86, SD age = 10.50).
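The exclusion figures can be retraced with a few lines of arithmetic. The sketch below is only illustrative: it assumes that the first percentage is expressed relative to all 1085 recruited workers and the second relative to the 992 who passed the attention check. That reading is inferred from the reported values, not stated in the text, and all variable names are hypothetical.

```python
# Retrace the reported sample sizes and exclusion rates for Study 2.
recruited = 1085
failed_attention_check = 93
low_english_proficiency = 17

# Assumption: the second exclusion rate is computed on the participants
# who remained after the attention-check exclusion.
remaining_after_check = recruited - failed_attention_check
final_sample = remaining_after_check - low_english_proficiency

pct_failed_check = 100 * failed_attention_check / recruited
pct_low_proficiency = 100 * low_english_proficiency / remaining_after_check

# Condition sizes reported in the procedure (sequential,
# sequential-and-reflection, control) should add up to the final sample.
condition_sizes = [319, 329, 327]

print(f"final sample: {final_sample}")
print(f"failed attention check: {pct_failed_check:.2f}%")
print(f"low English proficiency: {pct_low_proficiency:.2f}%")
print(f"sum of condition sizes: {sum(condition_sizes)}")
```

Under this reading, the percentages, the final sample size, and the three condition sizes are mutually consistent.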

Materials & procedure
Participants were randomly allocated to one of three conditions. In all conditions, participants completed the legal sentencing task as described in our Pretest. For each case, they ruled in favor of the plaintiff or the defendant and indicated their confidence in the ruling on a scale that ranged from 1 (not confident at all) to 9 (extremely confident).
In the sequential condition (n = 319), participants first saw the case files without any personal information about the plaintiff or defendant and were asked to indicate an initial ruling in favor of the plaintiff or the defendant. Next, participants saw the entire case files, including the images of the plaintiff and defendant, and were asked to indicate their final ruling and their confidence in the ruling on a scale that ranged from 1 (not confident at all) to 9 (extremely confident).
In the sequential-and-reflection condition (n = 329), participants followed the same procedure as in the sequential condition, but they were prompted to think carefully and make reflective decisions for all cases (Newman et al., 2017). They could only indicate an initial ruling after 30 s had passed and they were instructed to take at least this long to carefully study the case summary before indicating a ruling.
In the control condition (n = 327), participants completed the legal sentencing task without the order of stimuli being manipulated.

Response times
First, we analyzed response times for initial rulings to check whether instructions to reflect on decisions in the sequential-and-reflection condition actually led to longer decision times compared to the sequential condition. Response times were log10-transformed due to their right-skewed distribution. A t-test showed that participants in the sequential-and-reflection condition (M = 1.658, SD = 0.111) took longer to reach a decision compared to participants in the sequential condition (M = 1.527, SD = 0.273), t(417.5) = 7.93, p < .001, d = 0.62.
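The fractional degrees of freedom suggest an unequal-variance (Welch) t-test, although the text does not name the variant. As a rough check, the sketch below recomputes the statistic from the reported summary values. It assumes that the means and standard deviations are participant-level values, that the group sizes equal the condition sizes (n = 329 and n = 319), and that raw response times were recorded in seconds. The function name is hypothetical and this is not the authors' analysis script.

```python
import math


def welch_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Return Welch's t and the Welch-Satterthwaite df from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Reported log10 response times; group sizes are an assumption (see above).
t_stat, df = welch_from_summary(1.658, 0.111, 329, 1.527, 0.273, 319)

# Back-transforming a mean of log10 values gives a geometric mean.
geo_mean_reflection = 10 ** 1.658
geo_mean_sequential = 10 ** 1.527

print(f"t({df:.1f}) = {t_stat:.2f}")
print(f"geometric means: {geo_mean_reflection:.1f} vs. {geo_mean_sequential:.1f}")
```

With these inputs the recomputed values land close to the reported t(417.5) = 7.93, and the back-transformed means correspond to roughly 45 s versus 34 s per initial ruling if the raw unit was seconds.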

Exploratory analyses
To further probe the effects of the two interventions, we again conducted Bayesian analyses. Bayesian t-tests with default Cauchy priors yielded strong support for the alternative hypothesis that, compared to the control condition, reliance on facial trustworthiness was stronger in the sequential condition, BF10 = 1484, and in the sequential-and-reflection condition, BF10 = 188. These results support the conclusion that both interventions significantly increased reliance on facial trustworthiness.
Finally, we analyzed how often and under what conditions participants revised their initial decision to understand why the interventions increased rather than decreased reliance on facial trustworthiness. We hypothesized that most participants would not revise their initial decisions, which were made in the absence of face images and therefore unbiased by facial stereotypes. In fact, the majority of initial rulings in the sequential condition (89.78%) and in the sequential-and-reflection condition (90.61%) were not revised when participants saw the images of the plaintiff and defendant and had the chance to change their verdict. However, analyzing revision rates showed that participants were more likely to revise their initial ruling when it was not in line with face stereotypes (e.g., a trustworthy-looking defendant being found guilty; 15.4%) than when it was already in line with stereotypes (3.14%), χ²(1) = 310.2, p < .001. Of all revised rulings, 83.52% ended up being congruent with face stereotypes whereas only 16.48% were incongruent with face stereotypes. As a consequence, while only 51.11% of all initial rulings made in the absence of face images were in line with face stereotypes, 57.61% of all final rulings made in the presence of face images were, χ²(1) = 55.12, p < .001. In sum, in the absence of face images, both interventions were successful in producing unbiased rulings, which were seldom revised when participants did have access to the face images. However, the vast majority of revisions that did occur brought decisions in line with face stereotypes. This increased the overall effect of facial trustworthiness on sentencing decisions.
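The way a small number of asymmetric revisions shifts the overall share of stereotype-congruent rulings can be illustrated with a back-of-the-envelope calculation. The sketch below is not the authors' analysis. It assumes that the reported percentages are pooled across both sequential conditions and that, because rulings are binary, every revision flips a ruling between congruent and incongruent. Variable names are hypothetical.

```python
# Shares reported in the text, expressed as proportions.
initial_congruent = 0.5111      # initial rulings in line with face stereotypes
revise_if_incongruent = 0.154   # revision rate for stereotype-incongruent rulings
revise_if_congruent = 0.0314    # revision rate for stereotype-congruent rulings

initial_incongruent = 1 - initial_congruent

# Assumption: a revision always flips congruence (binary rulings).
became_congruent = initial_incongruent * revise_if_incongruent
became_incongruent = initial_congruent * revise_if_congruent

overall_revision_rate = became_congruent + became_incongruent
share_revised_congruent = became_congruent / overall_revision_rate
final_congruent = initial_congruent - became_incongruent + became_congruent

print(f"overall revision rate: {overall_revision_rate:.1%}")
print(f"revised rulings ending up congruent: {share_revised_congruent:.1%}")
print(f"final rulings in line with stereotypes: {final_congruent:.1%}")
```

Under these assumptions the final congruent share comes out at about 57%, within a percentage point of the reported 57.61%, and the share of revisions ending up congruent is close to the reported 83.52%. The remaining gaps cannot be resolved from the rounded figures given in the text.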

Discussion
Results of Study 2 showed that both interventions increased, rather than decreased, reliance on facial stereotypes. In order to disrupt the primary processing of faces (and the intuitive accessibility of trait impressions), we asked participants to indicate initial decisions that were solely based on case-relevant information. They were then shown the entire case file, which also included facial photographs of the plaintiff and defendant, and they could still revise their sentencing decisions. As intended, the majority of participants (ca. 90%) did not change their initial sentences, which means that most sentences reflected decisions that were made while being ignorant of the plaintiff's and defendant's facial appearance. However, participants who decided to change their initial decisions overwhelmingly did so to bring their final decisions in line with facial stereotypes (e.g., by finding an untrustworthy-looking defendant guilty). The same pattern was observed for a second intervention in which participants were additionally prompted to make reflective decisions. Overall, this increased the influence of facial stereotypes.
We also compared the rate of guilty verdicts in participants' initial sentencing decisions (i.e., when they were not exposed to facial photographs). Compared to the sequential condition (54.23%), participants in the sequential-and-reflection condition indicated slightly fewer guilty verdicts (51.67%), suggesting that instructing participants to reflect on their sentencing decisions slightly decreased their likelihood of indicating a guilty verdict. However, this difference was only marginally significant, β = −0.112, SE = 0.062, z = 1.82, p = .069, 95% CI [−0.246, 0.018], OR = 0.89.

Internal meta-analysis
To estimate the influence of facial trustworthiness on sentencing decisions more precisely, we calculated the meta-analytic effect across our three studies. We aggregated the data from all conditions that did not feature an intervention (the pretest and the control conditions from Study 1 and Study 2). This data set included almost 10,000 sentencing decisions (n = 967 participants, 48.09% female, M age = 35.85, SD age = 10.45). We estimated a multilevel regression model with random intercepts and slopes per participant, per case, and per study. This revealed a positive effect of facial trustworthiness on sentencing decisions, β = 0.284, SE = 0.052, z = 5.46, p < .001, 95% CI [0.171, 0.398], OR = 1.33. Defendants were more likely to be found guilty for the same transgression when they looked untrustworthy. The rate of guilty verdicts was 6.88 percentage points higher for untrustworthy-looking defendants (54.54% vs. 47.66%).
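The reported quantities are linked by standard identities for logistic models: the odds ratio is the exponentiated coefficient and z is the coefficient divided by its standard error. The sketch below retraces them from the rounded values in the text. It treats β as a log-odds coefficient, which the reported OR implies, and also computes a simple Wald interval for comparison. Variable names are hypothetical and this is not the authors' code.

```python
import math

beta = 0.284  # reported log-odds coefficient for facial trustworthiness
se = 0.052    # reported standard error

odds_ratio = math.exp(beta)  # odds ratio implied by the coefficient
z = beta / se                # Wald z statistic

# Simple Wald 95% interval on the log-odds scale, for comparison only.
wald_ci = (beta - 1.96 * se, beta + 1.96 * se)

# Difference in raw guilty-verdict rates, in percentage points.
rate_untrustworthy = 54.54
rate_trustworthy = 47.66
difference_points = rate_untrustworthy - rate_trustworthy

print(f"OR = {odds_ratio:.2f}, z = {z:.2f}")
print(f"Wald 95% CI = [{wald_ci[0]:.3f}, {wald_ci[1]:.3f}]")
print(f"difference = {difference_points:.2f} percentage points")
```

The odds ratio, z value, and percentage-point difference match the reported figures. The Wald interval is somewhat narrower than the reported [0.171, 0.398], so the published interval was presumably obtained by a different method that the text does not specify.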

General discussion
The aim of the current investigation was to test the effectiveness of different interventions in reducing the influence of facial stereotypes on legal decision-making. We created a novel legal sentencing paradigm in which participants indicated verdicts for multiple small claims court cases and we manipulated the facial trustworthiness of plaintiffs and defendants. In line with previous studies showing that trait impressions from faces influence legal decision-making (Porter et al., 2010; Wilson & Rule, 2016), we found that defendants were more likely to be found guilty when they looked untrustworthy (vs. trustworthy). This effect was observed in all three studies. In our pretest, we also examined whether facial trustworthiness influences the fraction of damages that defendants were ordered to pay in case of a guilty verdict. Again, untrustworthy-looking defendants experienced less leniency as they were ordered to pay slightly more damages. Our results replicate previous findings by Zebrowitz and McDonald (1991), who found that babyfacedness, a facial feature that is correlated with perceived trustworthiness (Berry & Zebrowitz McArthur, 1986; Zebrowitz & Montepare, 1992), predicted verdicts and awarded damages. Crucially, their findings were based on a large sample of real small claims cases, which suggests that the current results may generalize beyond our experimental design to real-world sentencing decisions.
We then tested the effectiveness of two debiasing techniques, educating decision-makers and changing the decision-making environment (Soll et al., 2014), in reducing the influence of facial trustworthiness on verdicts. In Study 1, we attempted to reduce the influence of facial stereotypes by educating people about the poor diagnostic value of their trait impressions. Specifically, we (a) educated participants about the biasing influence of inaccurate facial stereotypes and (b) confronted them with the low diagnostic value of their own trait impressions. Although both manipulations succeeded in lowering beliefs that personality traits can be accurately inferred from a person's facial appearance, they did not reduce the effect of facial stereotypes on sentencing decisions. Bayesian analyses indicated strong support for the null hypothesis of no difference between the control condition and the intervention conditions. Thus, regardless of whether or not participants were given clear information about the low diagnostic value of their trait impressions from faces, sentencing decisions were influenced by the facial trustworthiness of defendants.
In Study 2, we attempted to reduce the influence of facial stereotypes by disrupting the intuitive accessibility of trait impressions. To this end, we provided information sequentially. First, participants saw only case-relevant information and indicated a preliminary sentence. Then, participants saw the entire case file (including facial photographs) and indicated their final sentence. As intended, only a minority of initial sentences (< 10%) were changed. However, sentence revisions were strongly driven by facial stereotypes, with most revised decisions reflecting a stereotype-congruent verdict (e.g., untrustworthy-looking defendants being found guilty). On average, this actually increased the influence of facial stereotypes on verdicts. A similar pattern was observed when participants were additionally prompted to make reflective decisions.
Together, our results highlight the persistent influence of facial stereotypes on decision-making. Previous studies have shown that people rely on trait impressions even when other, more diagnostic cues are available (Jaeger et al., 2019a; Olivola et al., 2018; Olivola & Todorov, 2010). In a similar vein, the present results demonstrate that effects of trait impressions on decision-making persist even when participants receive clear information about how inaccurate facial stereotypes are (Study 1) and even when participants have to expend additional cognitive effort to rely on facial stereotypes (Study 2). Across all interventions, we consistently found that untrustworthy-looking defendants were more likely to be found guilty than trustworthy-looking defendants.

Limitations and future directions
We acknowledge that our education intervention in Study 1 may not have been strong enough to reduce behavioral reliance on facial stereotypes. However, other studies employing similar manipulations successfully reduced lay beliefs and related behaviors (Chiu, Hong, & Dweck, 1997; Levy et al., 1998). For example, Levy et al. (1998) exposed participants to short scientific articles written for a lay audience to manipulate beliefs in the malleability of personality traits. The manipulation successfully influenced lay beliefs, but also the extent to which participants relied on stereotypes when judging different social groups. Regardless, our intervention only had a small effect on beliefs and more intensive debiasing training might be necessary to change behavior (for examples, see Devine, Forscher, & Austin, 2013; Sellier, Scopelliti, & Morewedge, 2019).
One question that remains unanswered is why a later presentation of photographs increased the influence of facial stereotypes. Studies on the role of fluency in cue ordering (Dimov & Link, 2017), anchoring effects (Tamir & Mitchell, 2012; Tversky & Kahneman, 1974), and primacy effects in impression formation (Asch, 1946) all highlight the strong influence of information that is processed first. However, other investigations found a disproportionate influence of information that is processed last (i.e., recency effects; Sullivan, 2018). For example, when evaluating faces that display a series of expressions, trait impressions are more strongly influenced by the expression that was displayed last (Fang, van Kleef, & Sauter, 2018). In a similar vein, participants might have attributed more importance to facial photographs because they were the only new information that was displayed after they indicated their preliminary verdicts. To participants, this may imply that this information is relevant for their decisions (Clark & Haviland, 1977). More research is needed to systematically explore how the order in which facial appearance and other cues are processed affects the influence of facial stereotypes.
We do not doubt that certain manipulations could diminish or eliminate the effect of facial trustworthiness on verdicts. For example, providing unambiguous, outcome-relevant information has been shown to reduce reliance on stereotypes (Dovidio & Gaertner, 2000; Rezlescu et al., 2012). However, such decisive information (e.g., clear evidence that the defendant committed the crime) is often not available in real life. In many situations, such as legal sentencing, personnel selection, or voting, people have to make consequential decisions based on limited, ambiguous, or contradictory information. We therefore focused on testing the effectiveness of different interventions in a decision-making environment that resembles these conditions.
In the present studies, we always compared situations in which the plaintiff was trustworthy-looking and the defendant was untrustworthy-looking or vice versa. Thus, our data do not show whether sentencing decisions are more strongly influenced by the perceived trustworthiness of plaintiffs or defendants, or whether there are interaction effects between both parties' perceived trustworthiness (Zebrowitz & McDonald, 1991). Moreover, if sentencing decisions are driven by the difference in perceived trustworthiness of the plaintiff and defendant, manipulating both parties' perceived trustworthiness simultaneously would increase its overall effect. This suggests that the effect of facial trustworthiness on sentencing decisions may be smaller under different circumstances, such as when a trustworthy-looking plaintiff is suing a slightly less trustworthy-looking defendant.
Finally, it should be noted that some of our participants may have been exposed to the face stimuli before, which were taken from a publicly available face database (Ma et al., 2015). This could have reduced the effect of facial trustworthiness on sentencing decisions, as previous work suggests that non-naïveté in participants leads to smaller effect sizes (Chandler, Paolacci, Peer, Mueller, & Ratliff, 2015; Rand & Kraft-Todd, 2014). For example, one can imagine that participants who were familiar with the faces paid less attention to them when making sentencing decisions. However, our data do not suggest that a large number of participants responded carelessly. Participants did not click through the cases as fast as possible (the median response time for verdicts was around 35 s) and only a fraction of participants (ca. 1%) indicated the same verdict on all trials. We also excluded data from all participants who failed an attention check at the end of the study.
Evidence for the biasing influence of trait impressions is well-documented and researchers have called for attempts to curb this bias (Olivola et al., 2014; Porter et al., 2010; Wilson & Rule, 2015). We took a first step in this direction but, ultimately, we were unsuccessful in reducing the influence of facial stereotypes. To stimulate further research in this area, we have made all materials needed to implement the legal sentencing task that was used here publicly available. This task allows for within-subject manipulations of facial appearance (or of other cues such as race or gender), which is statistically powerful and provides an indicator of reliance on facial stereotypes at the participant level. We hope that our results will motivate others to design and test other kinds of interventions.

Conclusion
The present research replicates prior findings that legal sentencing decisions are influenced by facial stereotypes. Participants consistently found untrustworthy-looking defendants guilty more often than trustworthy-looking defendants. We also sought to curb this bias by educating people about how inaccurate their trait impressions are and by disrupting the intuitive accessibility of trait impressions. Crucially, neither attempt succeeded in reducing the effect of facial trustworthiness on sentencing decisions. The present findings show that people persistently rely on facial stereotypes when making decisions and that this bias is difficult to mitigate.

Open practices
All data, materials, preregistrations, and analysis scripts are available at the Open Science Framework (https://osf.io/h4yf3/).

Fig. 1. A case file with a trustworthy-looking plaintiff and an untrustworthy-looking defendant.

Fig. 2. Differences in rates of guilty verdicts for trustworthy-looking and untrustworthy-looking defendants as a function of condition. Dots denote predicted values. Error bars denote bootstrapped 95% confidence intervals.

Fig. 3. Differences in rates of guilty verdicts for trustworthy-looking and untrustworthy-looking defendants as a function of condition. Dots denote predicted values. Error bars denote bootstrapped 95% confidence intervals.