NAEP Limitations and Cautions

Every other year, when the NAEP results come out, there is a lot of discussion and debate about the national examination.  Many put a lot of stock in the data it provides, as it is pretty much the only national test with results from every state that can be used to compare student performance across the country.  However, the NAEP exam has some limitations, and some cautions are in order about how its results are used.

In 2009, researchers from the Buros Center for Testing at the University of Nebraska-Lincoln (my graduate school) and from the Center for Educational Assessment at the University of Massachusetts – Amherst published the final report of their congressionally mandated Evaluation of the National Assessment of Educational Progress.  The full report can be found here.

In the executive summary, the authors state the following:

Comparing student achievement on NAEP across states is complicated. To appreciate the challenges in making state-by-state comparisons, it is necessary to understand the sampling design adopted by NAEP and its potential impact on the results and their interpretations. In NAEP’s multistage cluster sampling procedure, not all students take the assessment, and those students who do take NAEP respond to a subset of the NAEP items in each content area. While this allows for a broad sampling of items from any one content domain, the extent to which subgroups of students are represented adequately in NAEP’s state samples is of concern.

As reported in the current evaluation, NAEP’s sampling procedures do not ensure adequate representation of various subgroups (including those defined by race and ethnicity) within some states, putting valid interpretations about subgroup performances within a state and across states at risk. Using NAEP to verify state results regarding the achievement of students with disabilities is also problematic because decisions about inclusion and allowable accommodations are made at the state level. Because states vary in their inclusion rates and in their treatment of accommodations for NAEP, the validity of state-by-state comparisons is debatable.
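
To make the sampling concern concrete, here is a minimal sketch in Python of how matrix sampling can leave a small subgroup with very few responses per item block. Every number in it (sample size, subgroup share, block structure) is an illustrative assumption, not NAEP's actual design:

```python
import random

random.seed(0)

# Illustrative assumptions (NOT NAEP's actual design): a state sample
# of 2,500 students, a subgroup that is 4% of the population, items
# split into 10 blocks, and each student answering only 2 blocks.
N_STUDENTS = 2500
SUBGROUP_RATE = 0.04
N_BLOCKS = 10
BLOCKS_PER_STUDENT = 2

subgroup_hits = [0] * N_BLOCKS  # subgroup responses landing on each block
for _ in range(N_STUDENTS):
    in_subgroup = random.random() < SUBGROUP_RATE
    assigned = random.sample(range(N_BLOCKS), BLOCKS_PER_STUDENT)
    if in_subgroup:
        for b in assigned:
            subgroup_hits[b] += 1

print("subgroup responses per block:", subgroup_hits)
# Expect roughly 2500 * 0.04 * (2/10) = 20 subgroup students per
# block -- a thin base for any per-block subgroup estimate.
```

Even with a respectable overall state sample, the combination of a small subgroup and each student seeing only a slice of the content leaves each item with very few subgroup responses, which is the representation risk the evaluators describe.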

Below are the researchers’ main concerns and recommendations regarding NAEP and the appropriate uses of its data.

  • There is no organized validity framework for the exam, which is needed given the complexity and multiple uses of NAEP.  According to the report, “An organized validity framework takes into account the history of the assessment program, current learning theory, and content-performance expectations from the subject-matter field and related professions. It also addresses contemporary issues in current interpretations and uses of the assessment and anticipates future appropriate and inappropriate uses and consequences of the assessment.”
  • Additional studies are warranted if NAEP is to be used to verify state assessment results.  According to the report, “As reported in the current evaluation, there are numerous factors that can jeopardize the validity of interpretations when using NAEP to verify state results. These include differences in content being assessed, differences in standard setting policies and procedures, differences in the definition of the achievement levels, and differences in the representation of the NAEP state samples. Additional alignment studies that evaluate the congruency between the content assessed by NAEP and state content standards and assessment are crucial. The sampling procedures for NAEP should also be studied. Representation of subgroups across states varies considerably as do the inclusion and exclusion rates for students with disabilities, impacting the validity of the use of NAEP results for state-by-state comparisons and for verifying state assessment results.”
  • Review processes for NAEP technical reports and manuals should be revised to facilitate their timely release.  “Currently, release of NAEP technical documentation can be years after results have been released, exceeding what testing programs should tolerate…  There are several reasons for releasing timely technical documentation; primarily, it assists users in understanding appropriate uses and limitations of NAEP scores.”
  • Other measures of U.S. students’ educational achievement do not provide strong sources of external validity evidence for NAEP achievement levels.  “It is a challenge to gather validity evidence from multiple sources outside a standard-setting study that can be used to evaluate achievement levels. Furthermore, external data are not perfect evaluation evidence due to potential differences in content, sample, and purpose. For example, some tests (like well-known college admissions tests—e.g., the SAT and ACT) involve self-selected samples of college-bound seniors, not a nationally representative sample. In many cases external tests serve purposes that are very different from NAEP. As the differences between what tests purport to do and what they measure increase, the utility of these measures as external evidence decreases.”
  • NAEP should continue to explore methodologies for setting achievement levels.  “Stakeholders continue to use achievement levels as one means of interpreting NAEP results. NAEP has engaged in extensive research on standard-setting since 1992 to improve its practice. Some of this research includes the pilot studies done on the new Mapmark method (Schulz and Mitzel, 2005). However, because this new methodology is not widely used, more research on whether it is appropriate for other NAEP subject areas is needed. Although we conclude that the new methodology worked well with the experts involved in the study on the 2005 grade 12 mathematics assessment, the degree to which the method will work with experts from other subject areas cannot be determined from this evaluation.”
  • NAEP should prioritize gathering external validity evidence that evaluates the intended uses and interpretations of its achievement levels.  “The validity evidence collected by NAEP from internal and procedural sources suggests that the methodology was implemented as intended and that panelists had a positive experience with the process. However, the reasonableness of the results is a judgmental decision by policymakers who should consider additional sources of information. External validity evidence is an additional source of information to help policymakers make the final policy decisions about NAEP achievement levels. Such evidence may include results from additional standard-setting methods, state university entrance levels at the high school level, and transcript studies that evaluate course performance. The extent to which the sources of evidence may converge is affected by the intended uses and interpretations of NAEP’s achievement levels as articulated in a validity framework.”
  • Current NAEP inclusion and participation policies and rates may not provide evidence to support intended uses and interpretations of NAEP.  “As mentioned earlier, the intended uses and interpretations of NAEP results should be defined in a validity framework and related to how different types of students and schools are included in the results. Unlike state assessment programs developed for NCLB, all students do not take NAEP. Further, those who take NAEP do not take a full assessment but rather a sample of its content. Thus, those included or excluded can influence the results and any score interpretations. This is particularly true for students with disabilities (SWD) and English language learners (ELL). Decisions about inclusion and accommodations of SWD and ELL are made at the state level… Beyond inclusion policies, participation is also an important consideration. NAEP remains a voluntary assessment for students. Therefore, nonresponse and refusal to participate represent potential threats to the validity of NAEP scores, particularly for grade 12 and private school samples. For example, Chromy (2005) noted that recent student participation rates for grade 12 (74 percent) were considerably lower than grade 4 (94 percent) and grade 8 (92 percent). It is also unclear whether current sampling plans include all potential subgroups of interest within a state, such as students with specific ethnicities, disabilities, varying language proficiencies, and free and reduced-price lunch program status.”  (The first sketch after this list illustrates how this kind of nonresponse can bias results.)
  • Intended users were not familiar with NAEP scale scores and had difficulty distinguishing between achievement levels on NAEP and those that were developed by states for NCLB reporting purposes.  “Most participants in our utility studies identified NAEP with state-level results. This represents a communications challenge for the future because of stakeholders’ familiarity with the reporting scales and achievement levels used for their state’s own NCLB assessment. For example, there was confusion among participants between state and NAEP achievement level results. This led to recognition that states’ definitions of Proficient are perhaps different from NAEP’s definition of Proficient. However, the nature of such differences is not readily apparent. Another source of confusion is that NAEP defines three achievement levels (i.e., basic, proficient, and advanced), yet often indirectly reports student performance at four levels (i.e., below basic, basic, proficient, and advanced). No policy definition for the achievement level below basic exists.”  (The second sketch after this list shows how three cut scores produce four reporting categories.)
  • Prioritize score reporting and interpretation as an area for research in the NAEP program.  “Systematic studies of methods to report NAEP scale scores and achievement levels should be carried out with stakeholder groups prior to their operational use. Although some of this research may include print media, a more critical focus for evaluation is the expanding presence of NAEP on the World Wide Web. Where appropriate, the NAEP elements on the Web should be revised to represent empirical findings about ease of use, stakeholder interests, and accepted Web site development practices. Because NAEP reporting continues to invest in the use of interactive, online tools, the utility of these features must also be assessed.”
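
On the participation concern flagged above, here is a minimal sketch of how differential nonresponse can shift an observed mean. The participation rates loosely echo the contrast in the Chromy figures quoted above, but the group structure and score distributions are pure assumptions invented for illustration:

```python
import random

random.seed(1)

# Illustrative assumptions: two student groups with different true
# mean scores AND different participation rates. Neither the scores
# nor the group shares come from NAEP data.
GROUPS = [
    ("higher-scoring", 260, 0.94),  # (label, true mean, participation rate)
    ("lower-scoring", 230, 0.74),
]

true_scores, observed_scores = [], []
for label, mean, p_participate in GROUPS:
    for _ in range(5000):
        score = random.gauss(mean, 30)
        true_scores.append(score)
        if random.random() < p_participate:  # nonresponse filter
            observed_scores.append(score)

print("true mean:     %.1f" % (sum(true_scores) / len(true_scores)))
print("observed mean: %.1f" % (sum(observed_scores) / len(observed_scores)))
# The observed mean runs higher than the true mean because the
# lower-scoring group participates less often -- the nonresponse
# threat the evaluators raise for grade 12 and private school samples.
```

Because participation correlates with the outcome, the observed average drifts upward even though no individual score changed, which is exactly why voluntary participation is listed as a validity threat.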
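And on the achievement-level confusion: three cut scores partition the score scale into four reporting categories, even though only three levels carry policy definitions. A minimal sketch follows; the cut scores are assumptions roughly in the range NAEP has published for grade 4 mathematics, so check current NAEP documentation for actual values:

```python
# Three cut scores yield four reporting categories; "Below Basic" is
# everything under the Basic cut and has no policy definition.
# Cut scores below are illustrative assumptions, not official values.
CUTS = [("Advanced", 282), ("Proficient", 249), ("Basic", 214)]

def achievement_level(scale_score: float) -> str:
    """Map a scale score to a reporting category via the cut scores."""
    for label, cut in CUTS:
        if scale_score >= cut:
            return label
    return "Below Basic"  # the undefined fourth category

for s in (200, 230, 260, 290):
    print(s, "->", achievement_level(s))
```

Running it prints one score in each category, making visible why stakeholders see four reported levels when NAEP defines only three.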

It is important to note that this report is almost ten years old, so it is possible that some of the concerns listed above have been addressed.  However, I could not find any research or reports describing changes like those suggested above in the years since it was published, and most sources indicate that very little has changed about how NAEP is administered, scored, and reported in the past several years.