Test fairness is a critical concern in all high-stakes
testing programs. The ASVAB Testing Program attempts
to ensure that the ASVAB tests are fair for all examinees
by adhering to the principles outlined in a number of
test development guidelines:
- Standards for Educational and Psychological
Testing (American Educational Research Association,
American Psychological Association, and National Council
on Measurement in Education)
- Code of Fair Testing Practices in Education
(Joint Committee on Testing Practices)
- Principles for the Validation and Use of Personnel
Selection Procedures
(Society for Industrial and Organizational Psychology
Inc.)
- Guidelines for Computerized-Adaptive Test Development
and Use in Education
(American Council on Education)
- Guidelines for Computer-Based Testing
(Association of Test Publishers)
- International Guidelines on Computer-Based and
Internet Delivered Testing
(International Test Commission).
The ASVAB Testing Program also has an oversight group,
called the Defense Advisory Committee on Military
Personnel Testing, which is tasked with the mission
of ensuring that the ASVAB program adheres to professional
testing standards. Members of the Defense Advisory Committee
typically are experts in testing and measurement, industrial-organizational
psychology, and/or career psychology. Personnel from
the ASVAB Testing Program regularly brief the Defense
Advisory Committee to keep them apprised of all program
activities.
The ASVAB Testing Program takes active steps during
test development to uphold the following credos:
- The tests should not be biased.
- Adverse impact should be minimized.
- Test items should not be offensive to any examinee
subgroups.
- Items/tests should perform similarly across racial/ethnic/gender
groups.
- Test items should perform similarly across paper
and computer administrations.
- Lack of exposure to testing medium or testing conditions
should not adversely affect examinees’ performance.
- Scores should be comparable across administration
modes.
Impact occurs when subgroups that are not
matched on ability perform differentially on an item
or test. Adverse impact occurs when a subgroup
is disadvantaged by those performance differences. Bias
occurs when an item or test unfairly favors
one group over another.
The occurrence of bias is problematic, because it can
negatively affect test validity. Adverse impact does
not reflect bias if validity research shows that the
test is equally valid for relevant subgroups (i.e.,
if the regression line relating the test score and a
criterion is the same for each subgroup).
The figure below illustrates such a scenario, where
Group 2 performs better than Group 1 on a test [X],
yet the regression line for predicting performance on
the criterion measure [Y] is the same for both groups.
Thus, the performance difference does not reflect bias
because examinees with the same test scores will have
the same predicted performance, regardless of group
membership.
The occurrence of adverse
impact is often determined using the Four-Fifths Rule:
“A selection rate for any race, sex,
or ethnic group which is less than four-fifths (or eighty
percent) of the rate for the group with the highest
rate will generally be regarded by the Federal enforcement
agencies as evidence of adverse impact.”
— Code of Federal Regulations Pertaining to
the U.S. Department of Labor
Applications of the Four-Fifths Rule to examinees that
qualify for entry into the military (i.e., those generally
scoring in AFQT category IIIB or higher) and to examinees
that qualify for enlistment incentives (i.e., those
generally scoring in AFQT category IIIA or higher) suggest
some adverse impact is associated with the ASVAB. Specifically,
the qualification rates for African-American/Black applicants
are less than four-fifths of the qualification rates
for Caucasian/White applicants, for both entry into
the military and enlistment incentives. Likewise, the
qualification rates for Hispanic applicants are less
than four-fifths of the qualification rates for Non-Hispanic
White applicants, for both entry into the military and
enlistment incentives.
Click here to
learn more about AFQT scores.
Click here to
learn more about enlistment eligibility.
Although adverse impact does occur on the ASVAB, the
magnitude of Male/Female, White/Black, and Non-Hispanic
White/Hispanic score differences on the Mathematical
Knowledge, Arithmetic Reasoning, Word Knowledge, Paragraph
Comprehension, and General Science subtests are similar
to the magnitude of score differences between like groups
on tests of similar domains in other high-stakes test
batteries. The consistency of trends in effect sizes
observed across different test batteries suggests that
the adverse impact associated with the ASVAB reflects
a societal phenomenon, rather than bias in the test.
This conclusion is supported by a study of the sensitivity
and fairness of the ASVAB technical tests (Wise,
Welsh, Grafton, Foley, Earles, Sawin, & Divgi, 1992).
The study compared prediction of final school grades
for Air Force and Navy technical training courses and
scores on the Skill Qualification Test for first-term
Army recruits. Data from large samples of males, females,
whites, and blacks were analyzed, and the prediction
lines were similar for all groups.
Click here to learn
more about the ASVAB subtests.
Click here
to view a copy of the report from the 1992 study.
The ASVAB Testing Program routinely conducts analyses
to ensure that items and scores are fair and unbiased.
All new items undergo a rigorous review process that
includes (1) internal and external reviews of item content
by experts in the domain measured by the items, (2)
internal and external sensitivity reviews by trained
sensitivity reviewers, and (3) reviews of statistical
qualities of the items.
As part of the review of the statistical qualities
of new items, differential item functioning
(DIF) analyses are conducted. DIF occurs when items
perform differently across groups matched on ability.
DIF is a necessary, but not sufficient, condition for
bias. Because all statistical procedures have some rate
of false positive identification, items flagged for
DIF typically undergo further review to determine if
the statistical finding is supported by the item’s
content.
DIF analyses are conducted for the following pairs
of subgroups for which there are sufficient data to
provide stable results:
- White versus Blacks
- Males versus Females
- Non-Hispanic Whites versus Hispanics
An Empirical Bayes Mantel-Haenszel procedure (EBMH)
is used, and the severity of DIF is classified using
the following scheme:
| A = Negligible |
|EBMH| < 1.0 |
| B = Slight to moderate |
1.0 < |EBMH| < 1.5 |
| C = Moderate to severe |
|EBMH| > 1.5 |
The EBMH procedure and the classification
scheme are described further in Zwick,
Thayer, & Lewis (1999).
Items flagged for B or C DIF are reviewed to ascertain
whether they provide an unfair advantage for a given
group. Items flagged for linguistic DIF (i.e., Non-Hispanic
Whites versus Hispanics) are reviewed by linguists fluent
in both English and Spanish.
The ASVAB program uses two modes of administration;
paper and pencil (P&P) and computer adaptive (CAT-ASVAB).
Therefore, it is critical that items perform similarly
and scores be comparable across the two administration
modes. To ensure test fairness, the mode of administration
should not affect an examinee’s performance on
the ASVAB.
When the CAT-ASVAB was developed, extensive research
was conducted to evaluate item and score comparability
across P&P and CAT administrations of the ASVAB.
The research uniformly suggested that scores could be
treated interchangeably across the two administration
modes.
Click here
to learn more about the development of the CAT-ASVAB.
|