
Spaan Fellow Working Papers in Second or Foreign Language Assessment
Volume 8: 1–30
Copyright 2010
English Language Institute
University of Michigan
www.lsa.umich.edu/eli/research/spaan

    Investigating the Construct Validity of a Speaking Performance Test

Hyun Jung Kim
Teachers College, Columbia University

ABSTRACT: With the increased demand for the integration of a performance component in second language (L2) testing, speaking performance assessments have focused on eliciting examinees' underlying language ability through their actual oral performance on a given task. Given the nature of performance assessments, many factors other than examinees' speaking ability are necessarily involved in the process of evaluation. Compared to the construct definition of speaking ability, however, relatively little attention has been given to tasks, which are regarded as a vehicle for assessment, although there is growing interest in authentic tasks for eliciting real-world language samples for evaluation. Thus, the present study investigates whether a speaking placement test provides empirical evidence that the effect of task, as well as examinees' attributes, should be considered in describing speaking ability in a performance assessment. An understanding of the underlying structure of the speaking placement test not only helps to identify the factors involved in the evaluation process and their relationships, but ultimately makes it possible to appropriately infer examinees' speaking ability.

In L2 testing, the notion of performance first emerged in the 1960s in response to practical needs, and since then, the demand to integrate examinees' actual performance into L2 assessment has increased (McNamara, 1996). Early testers who advocated the integration of a performance component focused on whether examinees could successfully fulfill a task in a simulated real-life language use context (e.g., Clark, 1975; Jones, 1985; Morrow, 1979; Savignon, 1972). McNamara (1996) classified this approach as a strong sense of performance assessment, in which the definition of the L2 ability construct is limited to examinees' task completion.

By contrast, new theories of communicative competence and communicative language ability in the 1980s and 1990s (e.g., Bachman, 1990; Bachman & Palmer, 1996; Canale, 1983; Canale & Swain, 1980) changed not only the perception of L2 language ability, but also the role of performance in language testing. They supported a weak sense of performance assessment (McNamara, 1996), in which the main interest was examinees' language ability rather than task completion. That is, L2 ability was determined based on various language components derived from the theoretical models of communicative competence and communicative language ability. Examinees' actual performance was elicited for evaluation of language ability; however, the role of performance was limited to a vehicle


to elicit examinees' underlying language ability. This approach to performance assessment, called a construct-centered approach (Bachman, 2002), has been widely accepted by L2 testers for most general-purpose language performance assessments (e.g., Brindley, 1994; Fulcher, 2003; Luoma, 2004; McNamara, 1996; Messick, 1994; Skehan, 1998).

While the construct-centered approach to performance assessment gives priority to definitions of L2 ability, a different perspective has recently been proposed. A task-centered approach focuses on what examinees can do with the language; that is, whether they can fulfill a given task (Brown, Hudson, Norris, & Bonk, 2002; Norris, Brown, Hudson, & Yoshioka, 1998). Although this approach provides more systematic criteria for the evaluation of examinees' task fulfillment than the approach of the early testers who first argued for the integration of performance in language testing, it basically shares the early testers' view of what performance assessments aim to measure (i.e., the strong version of performance assessment). According to the task-centered approach, test contexts or tasks play a crucial role in measuring L2 ability because examinees' performance is evaluated based on real-world criteria.

The two approaches to performance assessment appear to be contradictory in nature. Chapelle (1998), however, argued from an interactionalist perspective that both construct definitions and tasks should be considered together in defining L2 ability because the two interact during communication. As reviewed, different perspectives on L2 performance assessment have defined language ability distinctively, each with a different focus. What is important is not which approach is superior, but whether a test is validated before examinees' language ability is inferred from the test results. In other words, before an inference regarding an examinee's language ability is made from test scores, test developers and users need to establish what the test aims to measure (e.g., various language components, performance on tasks) and whether the test actually measures what it intends to measure. Although a test is designed for its intended purpose (e.g., following construct definitions, task characteristics, or both), there are still many factors that need to be considered in L2 performance assessments to understand examinees' performance and define their language ability. Examinees' performance may be affected by factors other than their language ability (McNamara, 1996, 1997). McNamara (1995) elaborated a schematic representation (Figure 1), first presented by Kenyon (1992), to conceptualize the performance dimension of L2 speaking performance tests. As presented in the figure, examinees' performance in L2 speaking tests is affected by many factors in the testing phase (i.e., candidates, tasks, interlocutors, and their interactions) as well as in the rating phase (i.e., raters and rating scales). Empirical studies have identified these factors that affect speaking performance test scores as effects of the: (1) candidate (Lumley & O'Sullivan, 2005; O'Loughlin, 2002); (2) task (Chalhoub-Deville, 1995; Clark, 1988; Elder, Iwashita, & McNamara, 2002; Farris, 1995; Malabonga, Kenyon, & Carpenter, 2005; Shohamy, 1994; Wigglesworth, 1997); (3) interlocutor (Brown, 2003; O'Sullivan, 2002); (4) rater (Barnwell, 1989; Bonk & Ockey, 2003; Brown, 1995; Eckes, 2005; Elder, 1993; Y. Kim, 2009; Lumley, 1998; Lumley & McNamara, 1995; Lynch & McNamara, 1998; Meiron & Schick, 2000; Orr, 2002; Wigglesworth, 1993); and (5) scale/criteria (M. Kim, 2001). It might be impossible to completely eliminate the effects of these factors on examinees' speaking performance. However, it is important to understand the relative contributions of these factors to examinees' performance and test scores in order to better estimate examinees' speaking ability and more appropriately interpret and use the test results.


[Figure 1 diagrams the testing phase (the candidate, the task, including other candidates, and the interlocutor, whose interactions shape the performance) and the rating phase (the rater applying a scale/criteria to produce a score).]

Figure 1. Interactions in Performance Assessment of Speaking Skills (McNamara, 1995, p. 173)

To sum up, examinees' speaking ability can be inferred only after a test is validated with respect to its constructs and the other factors involved in the process of evaluation. The focus of previous studies, however, has often been limited to the effects of individual factors on examinees' test performance. In other words, speaking performance tests have not been examined within a larger framework in which various factors (e.g., examinees' language ability, tasks, and rating criteria) interact with one another. Moreover, performance tests, especially those which do not involve high stakes, are oftentimes used without such validation. To this end, the current study explores the nature of a speaking placement test that has been used locally in a community English program. In order to determine whether the speaking test accurately measures speaking ability as intended, the underlying structure of the test is investigated. In other words, the study examines whether the hypothesized components of speaking ability (reflected in the scoring rubric) actually function as the operationalized constructs of the test. In addition, to better explain how the test works, the effects of other variables, such as rater perceptions and task characteristics, are also investigated. That is, factors that can have an effect on speaking performance are considered in addition to issues regarding construct definition.

    Research Questions

The current study addresses the following three research questions: (1) What is the factorial structure of the speaking test? (2) To what extent does the speaking test measure the hypothesized constructs of speaking ability as intended? (3) In addition to the measured variables, to what extent do other factors (i.e., raters and tasks) contribute to examinees' speaking performance?


    Method

    Context of the Current Study

The Community English Program (CEP) is an English as a second language (ESL) program offered by the Teaching English to Speakers of Other Languages (TESOL) and applied linguistics programs at Teachers College. The program targets adult ESL learners who wish to improve their communicative language ability. Therefore, the CEP curriculum emphasizes not only various language components (grammar, vocabulary, and pronunciation) but also the different language skills (listening, speaking, reading, and writing). To facilitate effective teaching and learning, all new students in the program are placed into one of 12 proficiency levels based on the results of a placement test, which consists of five sections (i.e., listening, grammar, reading, writing, and speaking).

A majority of the CEP teachers are MA students in the TESOL and applied linguistics programs; that is, they are student teachers practicing ESL classroom teaching. Therefore, their classrooms are regularly observed by faculty and colleagues, and follow-up feedback sessions are provided throughout the semester. The teachers also serve as raters of the writing and speaking placement tests. Through this rating experience, they not only become familiar with the CEP students' writing and speaking ability levels, but they also gain hands-on experience in evaluating ESL learners' writing and speaking ability. Therefore, the CEP functions as a teacher education program as well as an adult ESL program.

    Participants

Participants in the current study were 215 incoming CEP students who took the CEP speaking placement test. The majority of students in the program were adult immigrants from the surrounding neighborhood or family members of international students in the Columbia University community. Female students (73%) far outnumbered male students (27%). In terms of the participants' first language, three languages accounted for a large percentage: Japanese (36%), Korean (19%), and Spanish (15%). With regard to length of residence, the vast majority of the participants reported that they had been in English-speaking countries, including the United States, for fewer than three years: less than 6 months (40%), 6 months to 1 year (19%), and 1 to 3 years (20%). In terms of their motivation for studying English, many participants reported academic and job-related reasons, while over 50 percent gave priority to communication with friends as their reason for improving their English.

    Instruments

The instruments used in the current study included the CEP speaking placement test and an analytic scoring rubric. The speaking test was designed to measure speaking ability in various real-life language use situations. The test had six tasks: complaining about a catering service (Task 1), talking about a favorite movie (Task 2), narrating a story based on a sequence of pictures (Task 3), refusing a request from a landlord (Task 4), summarizing a radio commentary (Task 5), and summarizing a lecture (Task 6). The first three tasks (i.e., Tasks 1, 2, and 3) were independent-skills tasks, which required examinees to draw on their background knowledge to perform the tasks. The last three tasks (i.e., Tasks 4, 5, and 6) were integrated-skills tasks, which required examinees to use their listening skills in performing the tasks. That is, examinees were asked to listen to long


or short passages, which were provided as part of the tasks, and then formulate responses based on the content of the passages.

The speaking test was a semi-direct, computer-delivered test; that is, there was no interaction between an examinee and an interlocutor. Instead, the examinees listened to pre-recorded instructions and prompts delivered by a computer and then recorded their responses. The six tasks and the test format for each task (e.g., preparation time, response time) can be found in Appendix A.

An analytic scoring rubric consisting of five rating scales (see Appendix B) was used to score the examinees' recorded oral responses. The five scales were meaningfulness, grammatical competence, discourse competence, task completion, and intelligibility. Each of the five rating scales was scored on a six-point scale (0 for no control to 5 for excellent control). To analyze each scale in relation to the different tasks in this study, the five scales for each of the six tasks were treated as individual items, making a total of 30 items (6 tasks x 5 rating scales) on the test. That is, each cell in Table 1 represents an individual item of the test. For instance, the item MeanT1 represents meaningfulness for Task 1, while the item MeanT2 refers to meaningfulness for Task 2.

Table 1. Taxonomy of Items (Task x Rating Scale) on Speaking Ability

Rating scales           Nr of Items  Task 1   Task 2   Task 3   Task 4   Task 5   Task 6
Meaningfulness          6            MeanT1   MeanT2   MeanT3   MeanT4   MeanT5   MeanT6
Grammatical competence  6            GramT1   GramT2   GramT3   GramT4   GramT5   GramT6
Discourse competence    6            DiscT1   DiscT2   DiscT3   DiscT4   DiscT5   DiscT6
Task completion         6            TaskT1   TaskT2   TaskT3   TaskT4   TaskT5   TaskT6
Intelligibility         6            IntelT1  IntelT2  IntelT3  IntelT4  IntelT5  IntelT6
Total                   30
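The crossing of tasks and rating scales in Table 1 is easy to reproduce programmatically. A minimal sketch in Python (the variable names are illustrative, not from the original study):

    scales = ["Mean", "Gram", "Disc", "Task", "Intel"]        # the five rating scales
    items = [f"{s}T{t}" for s in scales for t in range(1, 7)]  # six tasks per scale
    assert len(items) == 30   # e.g., items[0] == "MeanT1", items[-1] == "IntelT6"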

    Procedures

    Test Administration

The speaking test was administered in a computer lab on the second day of a two-day placement test administration. The test was administered to groups of approximately 40 students. Each student was seated in front of a computer. They listened to the test instructions on a headset, read the instructions on the computer screen, and recorded their responses to the test items using a microphone. Since all computers were controlled from a central console, the examinees kept the same pace while taking the test. That is, the instructions and prompts were delivered at the same time, and the preparation and response times were also provided to all examinees at the same time.

Before the actual test began, the examinees were asked to fill in a background survey which asked for demographic information, prior English-learning experience, and plans for


future study. Once all examinees in a group had completed the survey, they were given a practice task so that they would be familiar with the test format. After a short intermission for any questions about the test format, the six tasks were played in sequence. For each task, the examinees first listened to or looked at an instruction and a prompt. They were allowed to prepare responses during a short preparation time, and lastly they recorded their responses during the given response time.

    Scoring

Each examinee's performance was scored by two independent raters. The raters were the CEP teachers, most of whom were MA or EdD students in the TESOL and applied linguistics programs at Teachers College. Prior to the actual rating, the raters attended a norming session in which the test tasks and the rubric were introduced and sample responses were provided for practice. Time was also given for discussion of analytic scores so that the raters had opportunities to monitor their decision-making processes by comparing the rationale behind their scores with other raters' opinions. Rating practice and discussion continued until the raters felt that they were well aware of the tasks and confident in assigning scores on the different rating scales. Following the norming session, each rater was assigned a certain number of examinees. Since examinees' performance on each of the six tasks was scored on the five rating scales, each examinee received 30 analytic ratings on 30 items. The maximum score for each item was five and the minimum was zero. The scores assigned by the two independent raters were later averaged to determine a speaking score for the placement test.
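To make the averaging step concrete, here is a minimal pandas sketch (the file name and the per-rater column layout, e.g., MeanT1_R1 and MeanT1_R2, are hypothetical; items is the label list from the sketch under Table 1):

    import pandas as pd

    ratings = pd.read_csv("cep_speaking_ratings.csv")  # hypothetical raw ratings file
    # Average the two independent raters' 0-5 ratings into one score per item.
    scores = pd.DataFrame({item: ratings[[f"{item}_R1", f"{item}_R2"]].mean(axis=1)
                           for item in items})
    # scores: 215 examinees x 30 averaged item scores, used in the analyses below.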

    Analyses

The data were analyzed using SPSS version 12.0 (SPSS Inc., 2001) and EQS version 6.1 (Bentler & Wu, 2005). Descriptive statistics (i.e., means, standard deviations, maximum/minimum raw scores, and skewness and kurtosis values) were calculated with SPSS for the entire test, for the 30 individual items, and for each of the five rating scales across the six tasks separately, in order to verify central tendency and variability. Reliability estimates were then calculated based on Cronbach's alpha to examine the degree of relatedness among the 30 items and among the six items under each of the five rating scales. Also, the degree of agreement between the two raters (i.e., inter-rater reliability) was investigated from various perspectives: for the examinees' total score, across the six tasks, and across the five rating scales. Since the composite scores comprised interval data converted from the original ordinal data, inter-rater reliability was estimated based on Pearson product-moment correlations.
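The same summary statistics can be reproduced outside SPSS. A minimal sketch (Python with pandas and SciPy, reusing the hypothetical ratings and scores frames from the scoring sketch above):

    import numpy as np
    from scipy import stats

    # Item-level descriptives: min, max, mean, SD, skewness, kurtosis (cf. Table 2).
    desc = scores.agg(["min", "max", "mean", "std", "skew", "kurt"]).T

    def cronbach_alpha(df):
        # alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
        k = df.shape[1]
        return k / (k - 1) * (1 - df.var(ddof=1).sum() / df.sum(axis=1).var(ddof=1))

    alpha_total = cronbach_alpha(scores)  # cf. 0.991 for the 30-item test (Table 3)
    alpha_mean = cronbach_alpha(scores[[f"MeanT{t}" for t in range(1, 7)]])  # cf. 0.960

    # Inter-rater reliability on total scores (cf. Table 4): Pearson r between
    # each rater's summed ratings across all 30 items.
    tot_r1 = ratings[[f"{i}_R1" for i in scores.columns]].sum(axis=1)
    tot_r2 = ratings[[f"{i}_R2" for i in scores.columns]].sum(axis=1)
    r, p = stats.pearsonr(tot_r1, tot_r2)  # cf. r = 0.837, p < .01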

After calculating descriptive statistics and reliability estimates, exploratory factor analyses (EFA) were conducted to determine the extent to which the 30 items clustered together. In other words, factor analyses were used to examine what patterns of correlations would be observed among the 30 items. Based on the correlation matrix, initial factors were extracted by principal-axis factoring (PAF) after the appropriateness of the correlation matrix for factor analysis was verified using three calculations: (1) Bartlett's test of sphericity; (2) the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy; and (3) the determinant of the correlation matrix. The initial factors were then rotated until the best solution was found to determine the number of underlying factors. Since the factors were assumed to be correlated with


one another, a direct oblimin rotation procedure was used after checking the factor correlation matrices each time.
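A comparable EFA pipeline can be sketched with the Python factor_analyzer package (an assumption of this sketch; the study itself used SPSS). It runs the same adequacy checks and a principal-axis extraction with direct oblimin rotation on the hypothetical scores frame:

    import numpy as np
    from factor_analyzer import FactorAnalyzer
    from factor_analyzer.factor_analyzer import (calculate_bartlett_sphericity,
                                                 calculate_kmo)

    # Adequacy of the correlation matrix for factoring.
    chi_square, p_value = calculate_bartlett_sphericity(scores)  # Bartlett's test
    kmo_per_item, kmo_total = calculate_kmo(scores)              # KMO measure
    determinant = np.linalg.det(scores.corr())                   # should be positive

    # Principal-axis factoring, three factors, direct oblimin (oblique) rotation.
    efa = FactorAnalyzer(n_factors=3, method="principal", rotation="oblimin")
    efa.fit(scores)
    pattern = efa.loadings_   # pattern matrix (cf. Table 7)
    phi = efa.phi_            # factor correlation matrix from the oblique rotation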

Finally, confirmatory factor analyses (CFA) were performed to establish a model of the speaking test. CFA was used to determine the extent to which the 30 items were measured in relation to the six tasks and five scoring criteria. Based on a review of the literature, a second-order multitrait-multimethod (MTMM) model was first hypothesized. After failing to find an appropriate solution with the hypothesized model, several other CFA models were attempted to find a final model that best explained the data. To assess the adequacy of the models, including the hypothesized model, several fit indices were used, such as the chi-square statistic, the chi-square/df ratio, the comparative fit index (CFI), and the root mean-square error of approximation (RMSEA). In addition, the distribution of standardized residuals was checked. The results of the Lagrange multiplier test and the Wald test were analyzed for each run in order to identify any necessary or unnecessary parameters in the model. In the end, however, the final speaking test model was chosen in accordance with substantive considerations while taking into account the issue of parsimony. In the process of model evaluation, the ML robust method was used each time due to the multivariate non-normality of the data.
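The two fit indices emphasized here can be computed directly from the model and independence (baseline) chi-square statistics. A small sketch of the standard formulas (not EQS's internal code):

    import math

    def cfi(chi2_m, df_m, chi2_i, df_i):
        # CFI = 1 - max(chi2_M - df_M, 0) / max(chi2_I - df_I, chi2_M - df_M, 0)
        return 1 - max(chi2_m - df_m, 0) / max(chi2_i - df_i, chi2_m - df_m, 0)

    def rmsea(chi2_m, df_m, n):
        # RMSEA = sqrt(max(chi2_M - df_M, 0) / (df_M * (N - 1)))
        return math.sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))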

    Results

    Descriptive Statistics

The descriptive statistics, calculated at the item level, at the rating-scale level, and for the entire 30-item test, are presented in Table 2. The item-level means ranged from 2.64 to 3.41 and the standard deviations from 1.01 to 1.57. Although the differences were not large, the means of the grammar-related items (i.e., GramT1 to GramT6) were lower than those of the other groups of items. On the other hand, the task completion-related items (i.e., TaskT1 to TaskT6) showed relatively higher means than the other items. Grammar-related items had the least variability (average SD = 1.04) while task completion-related items had the largest variability (average SD = 1.23). With regard to the task-related aspect, the Task 6 items (i.e., MeanT6, GramT6, DiscT6, TaskT6, and IntelT6) had the lowest means under each rating scale, but their standard deviations were the greatest compared to those of the other task items under the same rating scale. The skewness and kurtosis values, all within the acceptable range, indicated that the 30 items and five rating scales appeared to be normally distributed.

    Reliability Analyses

Reliability estimates for internal consistency were calculated for the five rating scales and for the entire test (see Table 3). The reliability estimate for the entire test was very high (0.991), signifying a high degree of homogeneity among the 30 items. The internal consistency reliability for each rating scale also showed a high degree of consistency among the six tasks under each of the five scales. The high reliability estimates, ranging from 0.936 to 0.963, suggested that the six tasks measured the same construct with a high degree of consistency within each rating scale.


Table 2. Descriptive Statistics (N = 215, K = 30)

Variable                        Minimum  Maximum  Mean  Std.  Skewness  Kurtosis
1. Meaningfulness (Mean)        0        5.00     3.10  1.14  -.87       .30
   MeanT1                       0        5.00     3.12  1.28  -.81       .24
   MeanT2                       0        5.00     3.12  1.18  -.93       .65
   MeanT3                       0        5.00     3.20  1.11  -.82       .60
   MeanT4                       0        5.00     3.10  1.30  -.84      -.07
   MeanT5                       0        5.00     3.20  1.24  -.91       .27
   MeanT6                       0        5.00     2.85  1.38  -.66      -.45
2. Grammar (Gram)               0        4.58     2.87  1.04  -.95       .52
   GramT1                       0        5.00     2.83  1.15  -.84       .46
   GramT2                       0        4.50     2.87  1.04  -1.11     1.16
   GramT3                       0        4.50     2.92  1.01  -.92       .83
   GramT4                       0        5.00     2.93  1.20  -.97       .35
   GramT5                       0        5.00     2.93  1.12  -.92       .46
   GramT6                       0        4.50     2.73  1.27  -.79      -.29
3. Discourse Competence (Disc)  0        4.50     2.86  1.08  -.91       .33
   DiscT1                       0        5.00     2.83  1.21  -.78       .17
   DiscT2                       0        5.00     2.84  1.09  -.91       .59
   DiscT3                       0        5.00     2.95  1.05  -.83       .75
   DiscT4                       0        5.00     2.93  1.26  -.85      -.04
   DiscT5                       0        5.00     2.97  1.18  -.84       .17
   DiscT6                       0        5.00     2.64  1.32  -.63      -.42
4. Task Completion (Task)       0        5.00     3.18  1.23  -.86       .08
   TaskT1                       0        5.00     3.07  1.41  -.52      -.44
   TaskT2                       0        5.00     3.36  1.33  -1.00      .32
   TaskT3                       0        5.00     3.41  1.23  -.92       .42
   TaskT4                       0        5.00     3.04  1.57  -.56      -1.01
   TaskT5                       0        5.00     3.37  1.41  -.82      -.25
   TaskT6                       0        5.00     2.86  1.48  -.52      -.73
5. Intelligibility (Intel)      0        4.92     3.02  1.09  -.92       .53
   IntelT1                      0        5.00     2.99  1.20  -.89       .50
   IntelT2                      0        5.00     3.00  1.15  -.96       .73
   IntelT3                      0        5.00     3.09  1.05  -.93       .86
   IntelT4                      0        5.00     3.07  1.25  -.90       .20
   IntelT5                      0        5.00     3.10  1.16  -.91       .67
   IntelT6                      0        5.00     2.89  1.30  -.75      -.16
Total (30 items)                0        4.73     3.01  1.10  -.94       .43


Table 3. Reliability Estimates (N = 215)

Construct               Items Used         Nr of Items  Reliability Estimate
Meaningfulness          MeanT1 - MeanT6    6            0.960
Grammatical Competence  GramT1 - GramT6    6            0.963
Discourse Competence    DiscT1 - DiscT6    6            0.958
Task Completion         TaskT1 - TaskT6    6            0.936
Intelligibility         IntelT1 - IntelT6  6            0.963
Total                                      30           0.991

Although the averages of the two raters' scores were used for the statistical analyses, inter-rater reliability was calculated to determine the degree of agreement between the two raters. The correlation between Rater 1 and Rater 2 was 0.837 for the examinees' total score (see Table 4), 0.71 to 0.80 across the six tasks (see Table 5), and 0.78 to 0.82 across the five rating scales (see Table 6). All correlations were significant at the alpha = 0.01 level, indicating that the first rater's score on each task, each rating scale, and the entire test correlated significantly with the second rater's score on the same task, rating scale, and entire test. As a result, it can be assumed that the two raters scored the examinees' speaking with similar criteria in mind.

Table 4. Inter-rater Reliability for the Entire Speaking Test (N = 215)

                 Rater 1 (TotR1)  Rater 2 (TotR2)
Rater 1 (TotR1)  1.00             0.837**
Rater 2 (TotR2)  0.837**          1.00

**p < 0.01 (2-tailed); R1 = Rater 1, R2 = Rater 2

Table 5. Inter-rater Reliability across Six Tasks (N = 215)

                   Task 1  Task 2  Task 3  Task 4  Task 5  Task 6
Rater 1 x Rater 2  0.80**  0.75**  0.71**  0.81**  0.80**  0.80**

**p < 0.01 (2-tailed); T1-T6: Task 1-Task 6; R1 = Rater 1, R2 = Rater 2


Table 6. Inter-rater Reliability across the Five Constructs (N = 215)

                   M       G       D       T       I
Rater 1 x Rater 2  0.78**  0.82**  0.80**  0.80**  0.81**

**p < 0.01 (2-tailed); M = Meaningfulness, G = Grammatical Competence, D = Discourse Competence, T = Task Completion, I = Intelligibility; R1 = Rater 1, R2 = Rater 2

    Results of Exploratory Factor Analysis

Once the appropriateness of the correlation matrix for factor analysis was verified (e.g., a significant chi-square for Bartlett's test, a positive determinant of the correlation matrix), an EFA was conducted as a preliminary step for the CFA in order to develop a factor structure for the 30 observed variables. The initial factor extraction showed a very different result from the hypothesized design of speaking ability, which assumed five underlying factors (i.e., the five rating scales). Two factors with eigenvalues greater than 1.0 were extracted, which together accounted for 83.7 percent of the variance. The variable communalities were all above 0.7, indicating that the common factors accounted for a very high proportion of the variance in the variables. The scree plot also suggested the extraction of two factors. Since the number of factors obtained from the initial extraction was quite different from the hypothesis set for the speaking test, solutions with different numbers of factors were compared. The three-factor oblique rotation was the best solution for achieving maximum parsimony (see Table 7). As observed in Table 7, the 30 items used to measure speaking ability clustered by type of task. For instance, items for Tasks 1, 2, and 3 loaded on Factor 1, items for Task 6 loaded on Factor 2, and items for Tasks 4 and 5 loaded on Factor 3. To illustrate, all five items for Task 6 (i.e., MeanT6, GramT6, DiscT6, TaskT6, and IntelT6) showed factor loadings above 0.3 on Factor 2.

Further analysis of the six tasks revealed a possible reason why the items clustered around task-type factors rather than around the rating scales. Since Tasks 1, 2, and 3 required examinees to speak with minimal input, the factor on which the items for these three tasks loaded was interpreted as a Speak factor. In contrast to Tasks 1, 2, and 3, Tasks 4 and 5 first required examinees to listen to a long message and then respond to or summarize it. Thus, Factor 3, which included the items for Tasks 4 and 5, was coded as a Listen and Speak factor. While Task 6 was a summary task (as was Task 5), it appeared that Task 6 required examinees to have topical knowledge in the process of listening to and summarizing a message. That is, examinees' familiarity with the topic of the task could help them approach the task easily. Whereas Task 5 was about a topic (an electric car) that might be more commonly discussed in everyday life, the listening prompt provided in Task 6 was a lecture with highly specialized content (the Barbizon School). Thus, Factor 2 was coded as Listen and


Speak with Topical Knowledge. In sum, the items did not cluster around the operationalized constructs of speaking ability (i.e., the rating scales), showing that examinees' speaking performance was better explained by task type than by the hypothesized five constructs of speaking ability. As a result, the two cross-loadings present (i.e., IntelT3 and GramT5) were not seen as problematic, since grammar and intelligibility could be involved in any task as long as factors were divided based on task type. The final three-factor solution is presented in Table 8.

Table 7. Pattern Matrix for Speaking Ability

Item     Factor 1  Factor 2  Factor 3
DiscT2   1.015      .054      .171
GramT2    .944      .142      .163
TaskT2    .890      .017      .014
GramT1    .874      .027     -.037
IntelT1   .873     -.019     -.054
DiscT1    .856     -.006     -.061
IntelT2   .852      .139      .069
MeanT2    .847      .135      .037
MeanT1    .837     -.052     -.138
TaskT3    .795     -.090     -.138
GramT3    .764     -.020     -.207
DiscT3    .720     -.030     -.236
MeanT3    .702      .020     -.212
TaskT1    .670      .028     -.190
IntelT3   .585      .015     -.337
MeanT6    .011      .962     -.004
TaskT6   -.044      .933     -.063
DiscT6    .056      .918     -.014
GramT6    .094      .847     -.053
IntelT6   .033      .804     -.143
MeanT4    .076     -.004     -.897
GramT4    .108      .058     -.809
TaskT4   -.057      .125     -.791
IntelT4   .090      .120     -.769
DiscT4    .135      .100     -.741
TaskT5    .066      .253     -.634
IntelT5   .189      .205     -.584
MeanT5    .224      .182     -.573
DiscT5    .288      .133     -.564
GramT5    .319      .179     -.484

Extraction Method: Principal Axis Factoring.
Rotation Method: Oblimin with Kaiser Normalization.
a. Rotation converged in 13 iterations.


Table 8. Revised Taxonomy of Speaking Ability (Based on Exploratory Factor Analysis)

Factor                                 Nr of Items  Items
Speak                                  15
  Task 1                               5            MeanT1, GramT1, DiscT1, TaskT1, IntelT1
  Task 2                               5            MeanT2, GramT2, DiscT2, TaskT2, IntelT2
  Task 3                               5            MeanT3, GramT3, DiscT3, TaskT3, IntelT3
Listen & Speak with Topical Knowledge  5
  Task 6                               5            MeanT6, GramT6, DiscT6, TaskT6, IntelT6
Listen & Speak                         10
  Task 4                               5            MeanT4, GramT4, DiscT4, TaskT4, IntelT4
  Task 5                               5            MeanT5, GramT5, DiscT5, TaskT5, IntelT5
Total                                  30

    Results of Confirmatory Factor Analysis

Bachman (2002) argued that a language test should be designed with task characteristics as well as the construct definition of language ability taken into account in order to achieve the intended purpose of the test. In an attempt to understand the speaking test of the current study in terms of both aspects (i.e., construct definition and task characteristics), a first MTMM model was hypothesized in which the 24 items loaded on both trait factors (i.e., the four rating scales) and method factors (i.e., the six tasks), while the four trait factors loaded on a second-order factor, speaking ability (see Figure 2). The rating scale of task completion was not included as a trait factor in the model, since it was considered redundant in relation to the other rating scales. As a result, the six items related to task completion (i.e., TaskT1, TaskT2, TaskT3, TaskT4, TaskT5, and TaskT6) were deleted from the analysis, leaving a total of 24 observed variables. Moreover, correlations among the six tasks were not specified in the first model, because the six different tasks were hypothesized to elicit different aspects of speaking ability.

The model-fit evaluation of this hypothesized model indicated an excellent fit, with a very high CFI (0.99) and a very low RMSEA (0.032, with a confidence interval of [0.015, 0.044]). In terms of fit indices, the model was ideal, since a CFI above 0.95 and an RMSEA below 0.05 are considered indications of a well-fitting model (Byrne, 2006). However, the results were not reliable due to a condition code for the variance of a factor error (Parameter: D2, D2), which caused an improper solution (e.g., a factor loading greater than 1.0 for Grammatical Competence).


Such a condition code, which is a common occurrence with MTMM data, might have arisen from the complexity of the model specification (Byrne, 2006). Thus, the initially hypothesized model was rejected.

In order to respecify the first model, several attempts were made. First, it was tested whether the four first-order factors (i.e., the four trait factors) would load on the second-order factor (i.e., speaking ability) without any method factors (see Figure 3). The data did not fit this model either, which showed problems similar to those of the first model (e.g., condition codes and factor loadings above 1.0). Moreover, the model showed a very poor fit, with a CFI of 0.715 and an RMSEA of 0.162. These results confirmed the need to consider both construct (i.e., rating scales) and task in interpreting test scores, since the model without the task factors did not represent the data. In addition, based on the results of this model, it was decided that the four trait factors should be correlated with one another instead of loading on a second-order factor.

Figure 2. The Hypothesized Second-Order MTMM Model of the CEP Speaking Placement Test
Mean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence, Intel: Intelligibility; T1-T6: Task 1-Task 6


Figure 3. The Second-Order Model without Method Factors
Mean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence, Intel: Intelligibility; T1-T6: Task 1-Task 6

Another attempt was made before deciding on a final model. A model was tested with two additional factors: Rating 1 and Rating 2. The model was run both with and without a correlation between the two ratings. However, both models were unsuccessful, which confirmed that the data were not explained by such models. Therefore, based on the examination of several possible models, the final MTMM model was established with four trait factors, which were correlated with each other, and six method factors (see Figure 4). This final model was obtained after statistically testing two assumptions that had been made in advance. The first assumption, regarding the deletion of the task completion factor, was confirmed, since the inclusion of a task completion factor in the final model lowered the overall fit of the data. To test the other assumption, related to a possible task effect, the final MTMM model was


also tested with correlations among the six method factors. Although the overall fit increased, the improvement was very small. Thus, it was concluded that the six different tasks measured different aspects of speaking ability, and the correlations were not included in the final model. Though all estimates were statistically significant, they are not shown in Figure 4 because they were not legible given the overabundance of arrows (refer to Table 10 for the estimates).

As shown in Figure 4, there were 24 dependent variables (i.e., the 24 observed variables) and 34 independent variables (i.e., 10 factors and 24 error terms). There were also 78 free parameters (i.e., 48 factor loadings, 6 factor covariances, and 24 error variances) and 34 fixed nonzero parameters (i.e., 10 factor variances and 24 error regression paths). The structure of these factors and variables as specified in the model was tested based on the covariance matrix. Following the summary of the model, model identification was confirmed in the output.
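For readers who want to experiment with a comparable specification outside EQS, the structure of the final model can be written in lavaan-style syntax, sketched here with the Python semopy package. This is an illustration of the structure described above, not the author's EQS setup; in particular, matching the final model exactly would also require constraining the method-factor covariances to zero (e.g., T1 ~~ 0*T2), and the exact constraint syntax depends on the package:

    import semopy

    # Four trait factors (rating scales) and six method factors (tasks); each item
    # loads on one trait and one task. Task-completion items are excluded (24 items).
    desc = """
    Mean  =~ MeanT1 + MeanT2 + MeanT3 + MeanT4 + MeanT5 + MeanT6
    Gram  =~ GramT1 + GramT2 + GramT3 + GramT4 + GramT5 + GramT6
    Disc  =~ DiscT1 + DiscT2 + DiscT3 + DiscT4 + DiscT5 + DiscT6
    Intel =~ IntelT1 + IntelT2 + IntelT3 + IntelT4 + IntelT5 + IntelT6
    T1 =~ MeanT1 + GramT1 + DiscT1 + IntelT1
    T2 =~ MeanT2 + GramT2 + DiscT2 + IntelT2
    T3 =~ MeanT3 + GramT3 + DiscT3 + IntelT3
    T4 =~ MeanT4 + GramT4 + DiscT4 + IntelT4
    T5 =~ MeanT5 + GramT5 + DiscT5 + IntelT5
    T6 =~ MeanT6 + GramT6 + DiscT6 + IntelT6
    """

    model = semopy.Model(desc)
    model.fit(scores.drop(columns=[f"TaskT{t}" for t in range(1, 7)]))
    print(semopy.calc_stats(model).T)  # chi-square, CFI, RMSEA, and other indices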

The model was first assessed as a whole. In terms of residuals, the off-diagonal elements were examined, since they play a major role in the chi-square statistic. The standardized residual values were evenly distributed, and the average off-diagonal absolute standardized residual was quite small, at 0.0156. In addition, the distribution of standardized residuals was symmetric and centered around zero. As a result, very little discrepancy was found between Σ(θ) (the covariance matrix implied by the specified structure of the hypothesized model) and S (the sample covariance matrix of the observed variable scores). With regard to the goodness-of-fit statistics, the independence chi-square statistic was 5189.140 with 276 degrees of freedom. Although the chi-square/df ratio was much greater than 2, implying a poor model-data fit, it was discounted due to the chi-square statistic's sensitivity to sample size. Instead, fit indices were used for further model-fit evaluation (see Table 9).

Table 9. EQS Output Goodness of Fit Statistics

GOODNESS OF FIT SUMMARY FOR METHOD = ROBUST

ROBUST INDEPENDENCE MODEL CHI-SQUARE = 5189.140 ON 276 DEGREES OF FREEDOM
INDEPENDENCE AIC = 4637.140    INDEPENDENCE CAIC = 3430.844
MODEL AIC = -179.250           MODEL CAIC = -1149.532

SATORRA-BENTLER SCALED CHI-SQUARE = 264.7500 ON 222 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.02603

FIT INDICES
BENTLER-BONETT NORMED FIT INDEX = 0.949
BENTLER-BONETT NON-NORMED FIT INDEX = 0.989
COMPARATIVE FIT INDEX (CFI) = 0.991
BOLLEN'S (IFI) FIT INDEX = 0.991
MCDONALD'S (MFI) FIT INDEX = 0.905
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.030
90% CONFIDENCE INTERVAL OF RMSEA (0.011, 0.043)
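As a sanity check, the reported CFI and RMSEA follow from the chi-square statistics above via the helper functions sketched in the Analyses section:

    # Satorra-Bentler scaled chi-square vs. robust independence model, N = 215.
    print(round(cfi(264.75, 222, 5189.14, 276), 3))   # 0.991
    print(round(rmsea(264.75, 222, 215), 3))          # 0.030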


Figure 4. The Final MTMM Model
Mean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence, Intel: Intelligibility; T1-T6: Task 1-Task 6; F1-F10: Factor 1-Factor 10; V2-V31: Observed Variables 2-31


As shown in Table 9, the CFI was 0.991 and the RMSEA was 0.030 with a confidence interval of [0.011, 0.043], both of which indicated an excellent fit. The final indicator of overall model fit was the number of iterations. According to the iterative summary in the output, only five iterations were needed to reach convergence, which meant that the data fit the model relatively easily. Thus, the analyses of residuals and fit indices revealed that the 24-variable data fit the 10-factor MTMM model well as a whole.

After confirming the good fit of the model as a whole, the fit of the individual parameters was also assessed. The statistical significance of the parameter estimates was first checked based on the unstandardized estimates. All parameter estimates were statistically significant. Therefore, all parameters could be considered important to the model, and none needed to be deleted from the model. Following the unstandardized estimates, the standardized solution was considered (see Table 10).

As shown in Table 10, the trait factor loadings (i.e., F1 to F4), ranging from 0.849 to 0.926, were much higher than the method factor loadings (i.e., F5 to F10), which ranged from 0.265 to 0.466. This signified that the four traits (i.e., rating scales) were much stronger indicators than the six tasks, although both needed to be considered. Since the regression coefficients of the errors were quite small, ranging from 0.207 to 0.309, it can be concluded that the contribution of errors to the variables was low and the variables were mainly explained by the factors. The very high R-squared values, which refer to the proportions of variance accounted for by the related factors, confirmed that all 24 items were explained by the model fairly well. Moreover, as assumed above, the correlations between the trait factors were quite high, at around 0.98. The four factors were all operationalizations of a single construct of speaking ability. However, such extremely high correlations were not considered ideal for analytic scoring, since they indicated that the four rating scales were almost indistinguishable.

    Discussion and Conclusion

The present study examined the underlying structure of the CEP speaking placement test based on a confirmatory factor analysis. The analysis was conducted with four trait factors (i.e., meaningfulness, grammatical competence, discourse competence, and intelligibility) and six method factors (i.e., Tasks 1 to 6), with the four trait factors correlated with one another. Although these four traits were assumed to be related by virtue of being aspects of the same ability, correlations over 0.90 were unexpected. These high correlations may indicate that speaking ability cannot be separated into several analytic aspects, or that the raters failed to understand and differentiate among the analytic scoring criteria. For example, raters may have given similar scores on the four rating scales of each task based on their own overall impression rather than going over the different criteria carefully, or they may not have been accustomed to the different criteria because of the short norming period. Further research on raters' rating processes may be required to explain the relationship among these components of speaking ability.


Table 10. EQS Output Standardized Solution

[Table values not legible in the source. As summarized in the text, the trait factor loadings (F1 to F4) ranged from 0.849 to 0.926, the method factor loadings (F5 to F10) from 0.265 to 0.466, the error regression coefficients from 0.207 to 0.309, and the trait factor correlations were approximately 0.98.]


The final MTMM model explained the current test data very well, as evidenced by the high fit indices. In particular, the four operationalized constructs (i.e., the four rating scales) primarily explained the data, with higher factor loadings than the six tasks. In other words, examinees' performance on the test was mainly explained by the four constructs of speaking ability; however, the characteristics of the six tasks had a non-negligible effect on the examinees' performance. Therefore, the results of the current study empirically support the interactionalist perspective, in which examinees' speaking ability is determined in terms of both the constructs (traits) and the task characteristics of the test.

Although the current study contributes to the recent discussion concerning the importance of both construct definitions and test task characteristics in L2 performance assessments, it has a number of limitations. First, due to the limited sample size, it was not possible to include a rating factor as part of the underlying structure of the speaking test, although multiple ratings were available for all examinees' responses. It has been argued that raters are one of the factors that affect examinees' performance (Kenyon, 1992; Linacre, 1989; McNamara, 1995, 1996, 1997). Indeed, previous studies on raters, which analyzed raters' rating behaviors both quantitatively and qualitatively, showed rater effects on performance assessments (e.g., Bonk & Ockey, 2003; Brown, 2005; Chalhoub-Deville, 1995; Eckes, 2005; Meiron & Schick, 2000; Orr, 2002). Therefore, the inclusion of a rater/rating factor might change the underlying structure of the speaking test.

The other limitation is that structural equation modeling is a data-specific statistical tool. In other words, the results of the current analyses cannot be generalized to other CEP speaking data collected from different participants. Likewise, other data sets might be explained by different factors or different factorial structures. Therefore, in order to generalize the structure of the CEP speaking placement test, repeated analyses of test data with larger sample sizes are required across different test administrations. Only then can the nature of the CEP speaking placement test be understood and, ultimately, can inferences made about examinees' speaking ability be considered reliable.

    Acknowledgements

I would like to express my appreciation to the English Language Institute at the University of Michigan for giving me the opportunity to carry out this research. I am also very grateful to Professor James Purpura and my colleagues at Teachers College for their insightful comments and suggestions throughout this study.

    References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Barnwell, D. (1989). Naive native speakers and judgments of oral proficiency in Spanish. Language Testing, 6(2), 152–163.


Bentler, P. M., & Wu, E. (2005). EQS 6.1 for Windows user's guide. Encino, CA: Multivariate Software, Inc.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110.
Brindley, G. (1994). Task-centred assessment in language learning: The promise and the challenge. In N. Bird, P. Falvey, A. Tsui, D. Allison, & A. McNeill (Eds.), Language and learning: Papers presented at the Annual International Language in Education Conference (Hong Kong, 1993) (pp. 73–94). Hong Kong: Hong Kong Education Department.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1–15.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.
Brown, A. (2005). Interviewer variability in oral proficiency interviews. Frankfurt, Germany: Peter Lang.
Brown, J. D., Hudson, T., Norris, J. M., & Bonk, W. (2002). An investigation of second language task-based performance assessments. Honolulu: University of Hawaii Press.
Byrne, B. M. (2006). Structural equation modeling with EQS. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 333–342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12(1), 16–33.
Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32–70). Cambridge: Cambridge University Press.
Clark, J. L. D. (1975). Theoretical and technical considerations in oral proficiency testing. In R. L. Jones & B. Spolsky (Eds.), Testing language proficiency (pp. 10–28). Arlington, VA: Center for Applied Linguistics.
Clark, J. L. D. (1988). Validation of a tape-mediated ACTFL/ILR-scale based test of Chinese speaking proficiency. Language Testing, 5(2), 187–205.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221.
Elder, C. (1993). How do subject specialists construe classroom language proficiency? Language Testing, 10(3), 235–254.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19(4), 347–368.
Farris, C. S. (1995). A semiotic analysis of sajiao as a gender marked communication style in Chinese. In M. Johnson & F. Y. L. Chiu (Eds.), Unbound Taiwan: Close-ups from a distance. Selected Papers Vol. 8 (pp. 1–29). Chicago: Center for East Asian Studies, University of Chicago.
Fulcher, G. (2003). Testing second language speaking. London: Longman.


Jones, R. L. (1985). Second language performance testing: An overview. In P. C. Hauptman, R. LeBlanc, & M. B. Wesche (Eds.), Second language performance testing (pp. 15–24). Ottawa: University of Ottawa Press.
Kenyon, D. M. (1992). Introductory remarks at symposium on Development and use of rating scales in language testing, 14th Language Testing Research Colloquium, Vancouver, February 27th–March 1st.
Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18(1), 89–114.
Kim, Y. (2009). An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach. Language Testing, 26(2), 187–217.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Lumley, T. (1998). Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency. English for Specific Purposes, 17, 347–367.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54–71.
Lumley, T., & O'Sullivan, B. (2005). The effect of test-taker gender, audience and topic on task performance in tape-mediated assessment of speaking. Language Testing, 22(4), 415–437.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15(2), 158–180.
Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a computerized oral proficiency test. Language Testing, 22(1), 59–92.
McNamara, T. F. (1995). Modelling performance: Opening Pandora's box. Applied Linguistics, 16(2), 159–179.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
McNamara, T. F. (1997). Interaction in second language performance assessment: Whose performance? Applied Linguistics, 18(4), 446–466.
Meiron, B., & Schick, L. (2000). Ratings, raters and test performance: An exploratory study. In A. J. Kunnan (Ed.), Fairness and validation in language assessment. Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 60–81). Cambridge: Cambridge University Press.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Morrow, K. (1979). Communicative language testing: Revolution or evolution? In C. J. Brumfit & K. Johnson (Eds.), The communicative approach to language teaching (pp. 143–157). Oxford: Oxford University Press.
Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing second language performance assessments (Technical Report No. 18). Honolulu: University of Hawaii, Second Language Teaching & Curriculum Center.
O'Loughlin, K. K. (2002). The impact of gender in oral proficiency testing. Language Testing, 19(2), 169–192.
Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores. System, 30, 143–154.


Savignon, S. J. (1972). Communicative competence: An experiment in foreign language teaching. Philadelphia: The Center for Curriculum Development.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2), 99–123.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
SPSS Inc. (2001). SPSS Base 12.0 for Windows [Computer software]. Chicago, IL: SPSS Inc.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–335.
Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral test discourse. Language Testing, 14(1), 85–106.


    Appendix A. Speaking Test Tasks

Task 1. Catering service
In this task, you need to complain about something. Imagine you have ordered food from Party Planners Inc. for your boss's birthday party. But there was not enough food, and it was delivered late. You spent a week planning the party, but it was ruined because of the food. You were extremely upset that it happened. Call the caterer to complain about it. You have 20 seconds to plan.

Prompt (Audio)
[phone ringing] (Answering Machine) Hi! You've reached Party Planners Inc. We're sorry, but we're not available to take your call right now. Please leave a detailed message after the beep, and we'll get back to you as soon as possible. [Beep]
Test-taker: (45 sec response time)

Task 2. Favorite movie
In this task, you will be asked to talk about a movie. Think about a movie that you liked and tell your friend about it. You have 20 seconds to plan.

Prompt (Video)
Your friend: So, what was that movie you liked? What is it about?
Test-taker: (60 sec response time)

Task 3. Fly in soup
In this task, you need to tell the story in the pictures. Look at the pictures (pictures are shown on the screen). Imagine this happened yesterday while you were having dinner at the next table. Tell your friend what you saw. You have 60 seconds to plan your response.

Prompt (Video)
Your friend: So, what happened last night at the restaurant?
Test-taker: (60 sec response time)


Task 4. Moving out
In this task, you need to refuse a request. Imagine you are renting an apartment from a nice old couple in New York City. You have been living there for over a year. Now, listen to a telephone message from the couple.

Hi, this is Mary, your landlady. Tom and I have been trying to contact you, but you never seem to be home. I guess you're really busy these days. Anyway... well, I don't know how to say this, but... our granddaughter is moving to the City next month. She's gonna study at Columbia... and, as you know, living in the city is expensive, and the rents are really high. So, she asked us if she could live in the apartment you have now. I know we just renewed your lease, and we have no right to ask you to move out, and, we really like you, too. But, do you think you can possibly look for a different apartment? We're really sorry about this, but we have to do this for our granddaughter. Since there's not much time, we'd like to hear from you as soon as possible, so we can let our granddaughter know too. Again, we're sorry... Call and let us know, ok? Thanks. (162 words)

(Q) Politely tell your landlady that you can't move out and explain why. You have 30 seconds to plan.

Prompt (Audio)
Landlady: Hi. Come on in. Did you get our message? Have you thought about moving out?
Test-taker: (45 sec response time)

Task 5. Electric cars
In this task, you will be asked to summarize a radio commentary for a friend. Imagine your friend, Jim, is thinking about buying an electric car. Now, listen to the radio commentary.

(Host of the radio commentary) Today, we're talking about electric cars. As you're well aware, the conventional cars we drive every day use a lot of gasoline. You know how the price of gasoline is going up… and more importantly, there's the issue of global warming… these cars release harmful pollutants, like carbon monoxide. So, in reaction to this, engineers have been working on cars that run on electric batteries, so let's hear about the current state of the technology. We have a pre-recorded commentary by Ben Smith from General Autos.

Well, despite high expectations, the first generation of electric cars turned out to be a complete failure. Why? The first problem is the battery… I mean, current battery technology is still very limited. So an electric car can only travel a short distance before its battery needs recharging. What this means is you can't make long trips without worrying about the battery running out. They're only good for short trips like going to the supermarket or picking up the kids from school. And when you turn the air conditioner or the radio on, the battery is used up even quicker.

Then, you might say, we can just recharge the battery when it's used up. Well… there's a serious problem with recharging, too. To recharge a battery, we need an electric outlet, right? But there aren't many charging stations… which means the driver might get stuck… without being able to find a charging station nearby. Well, it gets even more frustrating. Even if you can find a station, it takes up to 3 hours to fully recharge a battery. It's way too long. Well, with these many limitations, does it make sense that anyone would want to buy an electric car, even if it is environmentally friendly?


(Q) Summarize what you heard on the radio for Jim. Be sure to include two main problems with electric cars. You have 30 seconds to plan.

    Prompt (Video)

Jim: Did I tell you I'm thinking about buying an electric car?

    Test-taker: (60 sec response time)

Task 6. Barbizon school
In this task, you will be asked to summarize a lecture for a classmate. Imagine your classmate, Jennifer, missed today's lecture about the Barbizon school. Now, listen to the lecture.

Today, we'll talk about a group of artists called the Barbizon School. The Barbizon School is a group of French artists who lived in the French town of Barbizon and who developed the genre of landscape painting. So, what are their characteristics?

The Barbizon painters tried to find comfort in nature. I mean, they moved away from all the commotion and disruption happening in then-revolutionary Paris, and sought solace in nature. And nature was the main theme of their paintings… they painted landscapes and scenes of rural life as true to life as possible. And they rejected the idea of manipulating or beautifying nature. Instead, they tried to achieve a true representation of the countryside. OK?

Second, in addition to the efforts to paint nature as realistically as possible, they also tried to establish landscape as an independent, legitimate genre in France. Traditionally, landscape painting wasn't appreciated as a separate genre, but only considered as a background. But Barbizon artists reacted against this convention of classical landscape, and painted landscape for its own sake. With their huge success and recognition, the painters of the Barbizon school established landscape and themes of country life as vital subjects for French artists.

Now, let's look at an example… a painting by Rousseau. This one is called "The Forest in Winter at Sunset." [Show the painting on screen.] It shows the ancient forest near the village of Barbizon. Rousseau is the best-known member of the group. Each Barbizon painter had his own style and specific interests, and Rousseau's vision was melancholic and sad. Can you feel the depressing mood of the painting? At the top, a tangle of tree limbs, and birds flying into the cloudy, dark, sunset sky. After the sun sets, the forest will be freezing cold. Rousseau worked on this painting off and on for twenty years. He considered this his most important painting and refused to sell it during his lifetime.

(Q) Summarize the lecture for Jennifer. Be sure to include two main characteristics of the school and the example shown. You have 30 seconds to plan.

Prompt (Video)
Jennifer: So, what was the lecture about? What did I miss?
Test-taker: (60 sec response time)
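
Each task fixes its own planning time, response time, and prompt medium, and these delivery parameters are easy to mis-transcribe across the six tasks. The sketch below simply collects the values stated in the task instructions above in one data structure; it is illustrative only, and the names TaskSpec and SPEAKING_TASKS are hypothetical, not part of the original test materials.

    from dataclasses import dataclass

    @dataclass
    class TaskSpec:
        """Delivery parameters for one speaking task (illustrative only)."""
        number: int
        title: str
        prompt_medium: str  # "audio" or "video", per the Prompt line
        planning_sec: int   # planning time announced in the instructions
        response_sec: int   # response time allowed after the prompt

    # Values as stated in the Appendix A task instructions.
    SPEAKING_TASKS = [
        TaskSpec(1, "Catering service", "audio", 20, 45),
        TaskSpec(2, "Favorite movie",   "video", 20, 60),
        TaskSpec(3, "Fly in soup",      "video", 60, 60),
        TaskSpec(4, "Moving out",       "audio", 30, 45),
        TaskSpec(5, "Electric cars",    "video", 30, 60),
        TaskSpec(6, "Barbizon school",  "video", 30, 60),
    ]

    # Total planned speaking time across the six tasks: 330 seconds.
    assert sum(t.response_sec for t in SPEAKING_TASKS) == 330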


Grammatical Competence: Accuracy, Complexity, and Range

5 (Excellent): The response is grammatically accurate; displays a wide range of syntactic structures and lexical form; displays complex syntactic structures (relative clause, embedded clause, passive voice, etc.) and lexical form.

4 (Good): The response is generally grammatically accurate, without any major errors (e.g., article usage, subject/verb agreement, etc.) that obscure meaning; displays a relatively wide range of syntactic structures and lexical form; displays relatively complex syntactic structures and lexical form.

3 (Adequate): The response rarely displays major errors that obscure meaning, and only a few minor errors (what the speaker wants to say can be understood); displays a somewhat narrow range of syntactic structures, with too many simple sentences; displays use of somewhat simple or inaccurate lexical form.

2 (Fair): The response displays several major errors as well as frequent minor errors, sometimes causing confusion; displays a narrow range of syntactic structures, limited to simple sentences; displays use of simple and inaccurate lexical form.

1 (Limited): The response is almost always grammatically inaccurate, which causes difficulty in understanding what the speaker wants to say; displays lack of basic sentence structure knowledge; displays generally basic lexical form.

0 (No): The response displays no grammatical control; displays severely limited or no range and sophistication of grammatical structure and lexical form; or contains not enough evidence to evaluate.


Discourse Competence: Organization and Cohesion

5 (Excellent): The response is completely coherent; is logically structured, with logical openings and closures and logical development of ideas; displays smooth connection and transition of ideas by means of various cohesive devices (logical connectors, a controlling theme, repetition of key words, etc.).

4 (Good): The response is generally coherent; displays generally logical structure; displays good use of cohesive devices that generally connect ideas smoothly, though it at times displays somewhat loose connection of ideas.

3 (Adequate): The response is occasionally incoherent; contains parts that display somewhat illogical or unclear organization, although as a whole it is in general logically structured; displays use of simple cohesive devices.

2 (Fair): The response is loosely organized, resulting in generally disjointed discourse; often displays illogical or unclear organization, causing some confusion; displays repetitive use of simple cohesive devices, and the use of cohesive devices is not always effective.

1 (Limited): The response is generally incoherent; displays illogical or unclear organization, causing great confusion; displays attempts to use cohesive devices, but they are either quite mechanical or inaccurate, leaving the listener confused.

0 (No): The response is incoherent; displays virtually non-existent organization; or contains not enough evidence to evaluate.


Task Completion: To what extent does the speaker complete the task?

5 (Excellent): The response fully addresses the task; displays completely accurate understanding of the prompt, without any misunderstood points; completely covers all main points, with complete details discussed in the prompt.

4 (Good): The response addresses the task well; includes no noticeably misunderstood points; completely covers all main points, with a good amount of the details discussed in the prompt (e.g., Electric Cars: two problems with the current technology, the battery running out quickly and the inconvenience in recharging; Barbizon School: two characteristics of the school and one example, painted nature and established landscape as an independent genre, with the Forest in Winter at Sunset as the example).

3 (Adequate): The response adequately addresses the task; includes minor misunderstanding(s) that do not interfere with task fulfillment; touches upon all main points but leaves out details, OR completely covers one (or two) main points with details but leaves the rest out.

2 (Fair): The response insufficiently addresses the task; displays some major incomprehension/misunderstanding(s) that interferes with successful task completion; touches upon only bits and pieces of the prompts.

1 (Limited): The response barely addresses the task; displays major incomprehension/misunderstanding(s) that interferes with addressing the task.

0 (No): The response shows no understanding of the prompt, or contains not enough evidence to evaluate.


Intelligibility: Pronunciation and prosodic features (intonation, rhythm, and pacing)

5 (Excellent): The response is completely intelligible, although an accent may be present; is almost always clear, fluid, and sustained; does not require listener effort.

4 (Good): The response may include minor difficulties with pronunciation or intonation but is generally intelligible; is generally clear, fluid, and sustained, though the pace may vary at times; does not require much listener effort.

3 (Adequate): The response may lack intelligibility in places, impeding communication; exhibits some difficulties with pronunciation, intonation, or pacing; exhibits some fluidity but may not be sustained at a consistent level throughout; may require some listener effort at times.

2 (Fair): The response often lacks intelligibility, impeding communication; frequently exhibits problems with pronunciation, intonation, or pacing; contains frequent pauses and hesitations; may require significant listener effort at times.

1 (Limited): The response generally lacks intelligibility; is generally unclear, choppy, fragmented, or telegraphic; contains consistent pronunciation and intonation problems; requires considerable listener effort.

0 (No): The response completely lacks intelligibility, or contains not enough evidence to evaluate.
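
All four analytic categories above share the same 0–5 band structure, so the scale lends itself to a simple encoding for score capture or rater-agreement analysis. The sketch below is one possible encoding, not part of the published rating scale: the identifiers RATING_CATEGORIES, BAND_LABELS, and composite_score are hypothetical, and summing analytic bands into a composite (rather than reporting each category separately) is an assumption made here purely for illustration.

    # The four analytic categories of the rating scale above.
    RATING_CATEGORIES = (
        "grammatical_competence",  # accuracy, complexity, and range
        "discourse_competence",    # organization and cohesion
        "task_completion",
        "intelligibility",         # pronunciation and prosodic features
    )

    # Band labels shared by all four categories.
    BAND_LABELS = {5: "Excellent", 4: "Good", 3: "Adequate",
                   2: "Fair", 1: "Limited", 0: "No"}

    def composite_score(ratings: dict) -> int:
        """Sum one 0-5 band per category (maximum 20 per task response)."""
        total = 0
        for category in RATING_CATEGORIES:
            band = ratings[category]
            if band not in BAND_LABELS:
                raise ValueError(f"band {band!r} out of range for {category}")
            total += band
        return total

    # Example: one rater's bands for a single task response.
    print(composite_score({"grammatical_competence": 4,
                           "discourse_competence": 3,
                           "task_completion": 5,
                           "intelligibility": 4}))  # prints 16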