INCREASING OBSERVATION AND FEEDBACK EFFICIENCY TO IMPROVE INSTRUCTIONAL QUALITY IN SMALL GROUP INTERVENTION SETTINGS

by

RONDA C. FRITZ

A DISSERTATION

Presented to the Department of Special Education and Clinical Sciences and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

September 2016

DISSERTATION APPROVAL PAGE

Student: Ronda C. Fritz

Title: Increasing Observation and Feedback Efficiency to Improve Instructional Quality in Small Group Intervention Settings

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Special Education and Clinical Sciences by:

Beth Harn, Chairperson
Brigid Flannery, Core Member
Audrey Lucero, Core Member
Gina Biancarosa, Institutional Representative

and

Scott L. Pratt, Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School.

Degree awarded September 2016

© 2016 Ronda C. Fritz

DISSERTATION ABSTRACT

Ronda C. Fritz
Doctor of Philosophy
Special Education and Clinical Sciences
September 2016

Title: Increasing Observation and Feedback Efficiency to Improve Instructional Quality in Small Group Intervention Settings

The current study investigated the reliability and validity of using short observations with an observation tool designed to measure implementation of small group interventions. Intervention lessons for eight instructional groups from two schools were video recorded for nine weeks, and post-test assessments of reading decoding were administered to 31 at-risk kindergarten students. Videos of intervention instruction from weeks two, five, and eight, each representing a phase in the intervention period, were used within this study for measuring implementation. Each video was divided into three ten-minute segments representing the beginning, middle, and end of each intervention lesson. Video segments were coded for implementation using the Quality of Intervention Delivery and Receipt tool (QIDR; Harn, Forbes-Spear, Fritz, & Berg, 2012). Overall, the results of this study indicate that a) reliability can be achieved when using 10-minute observations, b) QIDR scores obtained from 10-minute segments are strongly correlated with scores obtained from full-length observations, c) there is no statistical difference in scores obtained from full-length observations and those obtained in 10-minute segments, and d) QIDR scores obtained from both full-length and 10-minute segments accounted for group differences in student outcomes, with lesson segments obtained from the end of lessons accounting for the most variance. Implications for research and practice are discussed, including the importance of thorough training and calibration to maintain reliability, as well as the feasibility and utility of providing frequent observation and feedback through shorter observations.

CURRICULUM VITAE

NAME OF AUTHOR: Ronda C. Fritz
GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR
Boise State University, Boise, ID

DEGREES AWARDED:
Doctor of Philosophy, Special Education, 2016, University of Oregon
Master of Arts, Reading Education, 2001, Boise State University
Bachelor of Arts, Elementary Education, 1992, Boise State University

AREAS OF SPECIAL INTEREST:
Teacher Preparation; Early Reading Intervention; Student Engagement; Reading Methods; Elementary Teacher Education

PROFESSIONAL EXPERIENCE:
Assistant Professor of Education, Eastern Oregon University, La Grande, OR; September 2014-Present
Title I Reading Specialist, North Powder Charter School, North Powder, OR; August 2004-July 2011
4th/5th Grade Classroom Teacher, North Powder School District, North Powder, OR; August 1996-June 1998
Middle School Language Arts/Mathematics Teacher, North Powder School District, North Powder, OR; August 1996-June 1998
Title I Reading Teacher, North Powder School District, North Powder, OR; August 1993-June 1996
Kindergarten/Middle School Mathematics Teacher, Ukiah School District, Ukiah, Oregon; August 1992-June 1993

GRANTS, AWARDS, AND HONORS:
BASES Leadership Grant, University of Oregon, 2011-2016
William N. & Patsy A. Wilber Scholarship, University of Oregon, 2013-2014
Sammie Barker McCormack Scholarship, University of Oregon, 2013-2014
Dynamic Measurement Group Scholarship, University of Oregon, 2014-2015

PUBLICATIONS:
Harn, B., Basaraba, D., Chard, D., & Fritz, R. (2015). The Impact of Schoolwide Prevention Efforts: Lessons Learned from Implementing Independent Academic and Behavior Support Systems. Learning Disabilities: A Contemporary Journal, 13(1), 3-20.
Fritz, R. (2014). Book review of: Fingon, Joan C. & Ulanoff, Sharon H. (2012). Learning from culturally and linguistically diverse classrooms: Using inquiry to inform practice. Bilingual Research Journal, 37(1). DOI: 10.1080/15235882.2014.893269
Harn, B. A., Fritz, R., & Berg, T. (2014). How do we deliver high quality literacy and reading instruction in inclusive schools? In McLeskey, J., Waldron, N., Spooner, F., & Algozzine, B. (Eds.), Handbook of Research and Practice for Effective Inclusive Schools.
Fritz, R. (2001). Accelerated Reader: A valuable tool for increasing reading achievement and motivation of at-risk fourth and fifth graders? Submitted to fulfill requirements of Master's Thesis, Boise State University.

ACKNOWLEDGMENTS

I wish to express sincere appreciation to my doctoral committee, Dr. Audrey Lucero, Dr. Brigid Flannery, Dr. Gina Biancarosa, and Dr. Beth Harn, for providing support throughout my doctoral program, and especially through the dissertation process. I especially want to thank my advisor and dissertation committee chair, Dr. Beth Harn, for providing endless hours of support and feedback throughout my doctoral program. Thank you for knowing exactly how much encouragement and support I needed at just the right times. In addition, I would like to thank Tricia Berg, Tiffany Beattie, Manuel Monzalve, Kendra Carmen, Samantha Fritz, and Christine Aldrich for their endless dedication as part of my coding team, and Dr. Lina Shanley for her expertise and support during the analysis phase. I would also like to thank Dr. James Sinclair, Dr. Allison Baker-Wilson, Dr. Ruby Batz, Dr. Kara Hirano, and Tricia Berg for making this doctoral journey bearable and even sometimes enjoyable!
I also want to thank my former students and colleagues at North Powder Charter Elementary School for providing me with inspiration and encouragement to explore questions and answers for the benefit of all of our students. Finally, I want to thank my husband, Shane, who endured countless hours in the car for visits, listened during endless hours of phone calls, and provided unending support throughout.

Dedicated to my son, my husband, and my father, who all provided the inspiration for my further studies; and to my daughter and mother for being my biggest cheerleaders. I love you all and appreciate your support and encouragement along the way.

TABLE OF CONTENTS

I. INTRODUCTION
   Tools for Measuring Quality
      Measuring Instructional Quality in General Education
      Measuring Instructional Quality in Intervention
   Maximizing Time: Can We Measure Quality More Efficiently to Provide Regular Feedback?
   Purpose of the Study
   Research Questions

II. LITERATURE REVIEW
   Classroom Observation
   History of Classroom Observation
   General Education Observation Tools
      Classroom Assessment Scoring System (CLASS)
         Purpose
         Content
         Training
         Observation Duration
      Framework for Teaching (FFT)
         Purpose
         Content
         Training
         Observation Duration
   Special Education and Intervention Observation Tools
      Recognizing Effective Special Education Teachers Observation Tool (RESET)
         Purpose
         Content
         Training
         Observation Duration
      Quality of Intervention Delivery and Receipt (QIDR)
         Purpose
         Content
         Training
         Observation Duration
   Maximizing Time for Observation and Feedback
      Proximal and Distal Measures of Quality Using Short Observations
      Maximizing the Efficiency of Observation
   Summary and Conclusions

III. RESEARCH METHODS
   Setting and Participants
      Setting
      Student Participants
      Interventionists
      Observers
   Intervention Programs
      Super K Intervention
   Measures
      Instructional Implementation Measure
      Student Outcome Measure
   Video Data Set
      Full-length Videos
      Video Segment Selection
   Training and Observation Procedures
      Training Procedures
      Observation Procedures
      Inter-rater Reliability (IRR)
      Confidentiality
   Experimental Design and Analytic Approach
      Can Adequate Inter-rater Reliability (IRR) Be Obtained After Observing 10 Minutes of 30-minute Full Length Intervention Lessons?
      Using the QIDR, What is the Relationship Between Scores Obtained Watching the Full Lesson Versus Sampling 10 Minutes of the Lesson?
      Which QIDR Ratings (Full Lesson vs. 10-minute Sample; Beginning, Middle, End; Intervention Phase) Account for the Most Variance in Student Outcomes?

IV. RESULTS
   Descriptive Analysis
      Descriptive Statistics
      Testing of Model Assumptions
   Results
      Research Question 1: Can Adequate Inter-rater Reliability (IRR) Be Obtained After Observing 10 Minutes of 30-minute Full Length Intervention Lessons?
      Research Question 2: Using the QIDR, What is the Relationship Between Scores Obtained Watching the Full Lesson Versus Sampling 10 Minutes of the Lesson?
      Research Question 3: To What Extent Does the Relationship Between QIDR Ratings Obtained Watching the Full Lesson, Versus Sampling Ten Minutes of the Lesson, Depend on Lesson Segment or Intervention Phase?
      Research Question 4: Which QIDR Ratings (Full Lesson vs. 10-Minute Lesson Segment; Beginning, Middle, End; Intervention Phase) Account for the Most Variance in Student Outcomes?
         Null Model
         Full-length QIDR Measure
         Lesson Segment QIDR Measures
         Intervention Phase Measures

V. DISCUSSION
   Primary Findings
      Inter-rater Reliability
         Lesson Segment Length
         Multifaceted Nature of QIDR
         Coder Characteristics
      Relationship between Lesson Segment and Full-length QIDR Scores
      Relationship between Scores Obtained During Various Lesson Segments and Intervention Phases
      Association between QIDR and Student Outcomes
   Limitations
      Sample Size
      Student Outcome Measure
      Lesson Segment Numbers
      Observer Reliability
   Implications
      Reliability Can be Demonstrated with Abbreviated Observations
         Challenges in Achieving Reliability in School Settings
      Equivalence of Implementation Regardless of Lesson Segment
      Implementation is Related to Student Outcomes
   Future Research
   Conclusions

APPENDIX: QUALITY OF INTERVENTION DELIVERY AND RECEIPT TOOL

REFERENCES CITED

LIST OF FIGURES

1. Boxplots of group QIDR scores by lesson segment
2. Boxplots of group QIDR scores by intervention phase

LIST OF TABLES

1. Overview of Observation Tools
2. Student outcome descriptive statistics
3. Descriptive statistics of overall QIDR scores by lesson segment and phase
4. Descriptive statistics of QIDR by group and lesson segment
5. Descriptive statistics of QIDR by group and intervention phase
6. Bivariate correlational analysis of group differences between full and restricted sample
7. One-way, random effects, absolute agreement ICCs for assessment of inter-rater agreement by segment and overall
8. Bivariate correlations for QIDR ratings between full-length observations and lesson segments
9. Bivariate correlations for QIDR ratings between full-length observations and intervention phases
10. One-way, within-subjects, repeated measures ANOVA summary table for the effects of lesson segment and intervention phase on QIDR scores
11. Fixed and random effects estimates for models of WAT posttest scores by lesson segment and intervention phase

CHAPTER I

INTRODUCTION

Without adequate skill in reading, a child is likely to face many obstacles throughout education and life. Reading is a skill of profound social significance because it opens the door for subsequent education, which in turn expands opportunities for greater employment, enrichment, and entertainment (Saunders, 2011). During the past decades, great strides have been made to ensure the success of all children in early reading. There is substantial research documenting the effectiveness of many interventions in reducing the number of children with long-term reading difficulties (Gersten, Vaughn, Deshler, & Schiller, 1997; Simmons et al., 2011; Swanson, 1999); however, some students remain poor readers even after receiving highly intensive interventions (Denton, Fletcher, Anthony, & Francis, 2006).

Multiple factors may influence a child's responsiveness to high-quality, evidence-based interventions, including neurological, biological, and environmental factors (Shaywitz, 2008; Wolf, 2007). While these factors may be relevant, they focus solely on the student and are not readily malleable or changeable by educational personnel. One factor that is modifiable, and has received increased attention recently, is quality of instructional delivery. While most attention to this group of non-responders has focused on specific student characteristics (e.g., language status, ethnicity; Shaywitz, 2008; Torgesen, 2002), recent efforts have documented variability in instructional delivery and its impact on learning even when evidence-based programs are used (Cook & Odom, 2013).

Many have documented that an effective teacher is an important factor in a child's achievement (Chetty, Friedman, & Rockoff, 2011; Darling-Hammond, 2010; Hanushek & Rivkin, 2010; Hanushek et al., 2010). However, it is not uncommon for the students most at risk to receive supplemental intervention from personnel with no formalized training, such as educational assistants (Causton-Theoharis, Doyle, Giangreco, & Vadasy, 2007). This lack of training and support may impact instructional quality and be at least partially responsible for the non-response of some students for whom reading is most difficult. Fixsen, Blase, Metz, and Van Dyke (2013) posit that improved outcomes can only be achieved when effective interventions are coupled with effective implementation. The authors also contend that effective implementation is obtained when adequate pre-service and in-service training, coaching, and performance assessment are provided. The reality of providing these types of supports to improve instruction in school settings is often far from this ideal. Coaching and supervision of the interventionist may be sparse in many school settings due to limited resources and/or an erroneous belief that evidence-based programs are "plug and play" and do not require preparation or resources for follow-up support (Fixsen et al., 2005, as cited in Fixsen et al., 2013).
There are two major challenges to providing this type of implementation support in schools: 1) identifying tools for measuring instructional quality for interventions, and 2) having time to complete the instructional evaluations. In the next sections, each of these challenges will be explored in more detail.

Tools for Measuring Quality

Studies focused on measuring the effectiveness of interventions have placed emphasis on evaluating implementation of the specific practice, or treatment fidelity, as part of a research project (Harn, Parisi, & Stoolmiller, 2013). Generally, treatment fidelity refers to the degree to which a treatment or intervention is delivered as intended (Yeaton & Sechrest, 1981). Measurement of treatment fidelity in educational research is often focused on structural and process fidelity (Gersten, Fuchs, et al., 2005; Odom, 2008). Structural fidelity refers to adherence to the central components of an intervention, dosage, and intervention completion (Durlak & DuPre, 2008; Gersten, Fuchs, et al., 2005) and is usually measured through direct observation or self-report by interventionists (Harn et al., 2013). Process fidelity refers to the quality of intervention delivery and student-teacher interactions (Justice, Mashburn, Hamre, & Pianta, 2008). Some researchers have suggested that process fidelity is more difficult to define and measure, but may be more directly related to student outcomes than structural fidelity (Gersten, Fuchs, et al., 2005; Mowbray, Holter, Teague, & Bybee, 2003). Holdheide, Browder, Warren, Buzick, and Jones (2012) stress the importance of measuring process components within school-level implementation because the goal in schools is to improve instructional delivery and quality, rather than to document intervention fidelity/adherence.

The tools used to evaluate instructional quality have focused primarily on evaluating quality within general education classrooms (e.g., Cameron, Connor, & Morrison, 2005; Kane & Staiger, 2012; Pianta, Cox, Taylor, & Early, 2013) rather than on tools that can be used to formatively evaluate quality over time (e.g., Hagan-Burke et al., 2013; Johnson & Semmelroth, 2013). Educators and educational researchers acknowledge that evaluation of instructional quality within general education is necessary to ensure that all children are receiving high quality instruction. Although other means of measuring quality have been used, such as value-added models (VAMs), teacher self-report, and student evaluations (Kane & Staiger, 2012), observation remains one of the most widely used methods for gaining a more direct measure of classroom interactions that may be impacting student outcomes (Chomat-Mooney et al., 2008). As a result, multiple observation tools have been developed for this purpose and have been found reliable and valid for observation and evaluation in the general education setting (e.g., Danielson, 1996; Fish & Dane, 2000; La Paro, Pianta, & Stuhlman, 2004; Maxwell, McWilliam, Hemmeter, Ault, & Schuster, 2001; Waxman, Huang, Anderson, & Weinstein, 1997). These tools are designed around a definition of quality as it applies in a whole-class setting and may not accurately measure quality of instruction in a small-group intervention setting (Johnson & Semmelroth, 2013).

Measuring instructional quality in general education.
In an attempt to differentiate instruction for a diverse group of learners, teachers in general education classrooms may need to use a variety of instructional approaches adapted to those varied needs (Yopp & Yopp, 2000). Therefore, the tools designed to measure instructional quality in this context often measure a broad sampling of teacher-student interactions and classroom environment factors. For instance, Pianta, La Paro, and Hamre (2008) developed the Classroom Assessment Scoring System (CLASS), which measures multiple domains of classroom interaction, including emotional support, classroom organization, and instructional support, through ten dimensions (e.g., positive and negative climate, teacher sensitivity, behavior management, concept development, language modeling). Danielson (1996) also developed a system for measuring classroom quality called the Framework for Teaching. This system includes 22 subscales for measuring planning and preparation, classroom environment, instruction, and professional responsibilities. Research regarding these and other observation tools in general education has provided insight into effective ways to measure instructional quality (e.g., Danielson, 2011; Darling-Hammond, 2010; Kane & Staiger, 2012); however, the field needs to shift its focus from examining only what quality looks like in general education to understanding how to measure the quality of instruction designed for our most at-risk students receiving intervention supports (Johnson & Semmelroth, 2012; Semmelroth & Johnson, 2013).

Measuring instructional quality in intervention.

A tool designed for use in intervention settings must reflect the differences between instructional quality in general education and intervention settings. The definition of instructional quality is apt to be considerably different in these two contexts. While differentiated instruction in a general education classroom calls for varying instructional approaches adapted to the needs of diverse students (Hall, Vue, Strangman, & Meyer, 2014), intervention settings are likely to need much more specific approaches to meeting the needs of individual students (Zigmond & Kloo, 2011). Intervention instruction is designed to accelerate learning by individualizing instruction (Justice, 2006). To maximize learning, intervention efforts are designed and implemented very differently than general education instruction. Intervention is delivered in small groups, under a more focused time constraint (i.e., 20 to 30 minutes in length), and must be intensive, focused, and explicit (Foorman & Torgesen, 2001; Torgesen et al., 1999). Specific, systematic, direct instruction coupled with explicit strategy instruction has shown positive effect sizes for student achievement (Gersten et al., 1997; Swanson, 1999). This type of instruction is often focused on basic skills and may not lend itself to some of the interactions measured by tools designed for general education (Forbes-Spear, 2014). For instance, one of the variables measured within the CLASS tool (Pianta et al., 2008) is concept development. While this may be an important construct in general education instructional contexts, this type of interaction may not be essential within a curriculum that is concentrated on basic skill development (Semmelroth & Johnson, 2013).
The short duration and intensive, specific focus of an intervention session limit the variety of interactions, which should be reflected in the type of tool used to measure quality. Tools designed for use in intervention contexts must specifically measure the skills essential for accelerated learning. Johnson and Semmelroth (2012) have begun to explore this idea through development of the Recognizing Effective Special Education Teachers (RESET) tool for measuring instructional quality within special education settings. This tool reflects some of the differences in what is considered "instructional quality" between general education and intervention contexts. The RESET tool is based on the instructional domain of the Danielson (1996) Framework for Teaching (FFT), but was adapted to clearly delineate instructional components necessary for delivering evidence-based practices to students with disabilities, rather than the more constructivist approach to instructional delivery reflected in the FFT (Semmelroth, 2013). The RESET observation tool contains between 28 and 67 items (depending on the number of instructional practices being observed) and consists of three main parts: Lesson Overview (introduction), Lesson Components (instructional practices), and Lesson Summary (conclusion; Johnson & Semmelroth, 2012). The tool is designed to be used with videotaped lesson footage to provide feedback to special education teachers.

The development of the RESET tool for measuring the quality of interventions delivered by special education teachers moves the field of education closer to providing a solution for measuring instructional quality in alternative settings. However, as was mentioned previously, intervention at the tier two and tier three levels within a response to intervention (RtI) framework is commonly delivered by educational assistants (Causton-Theoharis et al., 2007). While the RESET tool provides a promising avenue for evaluating special education teachers, it does not necessarily provide a tool that can be used in any intervention setting (e.g., general education, special education, or Title I classroom) with any interventionist (e.g., instructional assistants, volunteers, or general education teachers). Although the tool was designed with an underlying intent to improve special education instruction, the length of the tool limits the efficiency necessary to provide frequent formative feedback to improve instruction.

Maximizing Time: Can We Measure Quality More Efficiently to Provide Regular Feedback?

Maximizing the efficiency of measurement of instructional quality is especially important when considering the importance of frequent formative assessment to improve instruction. While research on using a formative evaluation approach to the timely measurement of instructional quality is relatively uncommon, formative assessment of student achievement, used to determine whether or not instruction is producing desired outcomes, is common practice (Stecker, Fuchs, & Fuchs, 2005). The heightened accountability in education in recent years has put an emphasis on teachers using formative assessments, such as curriculum-based measures (CBMs), to determine if instruction is producing desired outcomes with students.
The use of data-based decision-making has been shown to improve student achievement as teachers attend more carefully to classroom-level data and modify instruction as needed to produce desired results (Fuchs, Deno, & Mirkin, 1984; Stecker et al., 2005). Potentially, taking a similarly responsive approach to evaluating instructional quality would have the same benefit of improving student outcomes by focusing on the teacher.

Assessment of instructional quality must share other characteristics with CBMs in order to be effective and useful. CBMs are designed to be not only efficient, but sensitive enough to measure student growth across a short period of time (Good, Gruba, & Kaminski, 2002; Stecker, Lembke, & Foegen, 2008). In the same way, tools for formative assessment of instructional quality must also be efficient and sensitive in order to be effective and useful. The efficiency of the tool is essential because multiple researchers have found that frequent observation and feedback produce greater achievement gains (Chomat-Mooney et al., 2008). To provide this level of support in school settings, tools that can provide targeted, specific feedback to improve instructional practice are necessary (Cook & Odom, 2013; Feng, Figlio, & Sass, 2010; Goe, Biggers, & Croft, 2012; Greenwood, Horton, & Utley, 2002; Kretlow & Bartholomew, 2010). Existing tools often focus on a multitude of variables, which typically requires a longer observation (the whole class or intervention period) and may negatively impact the ability to provide timely, frequent, and targeted feedback. It may be possible to provide an even more intense focus on fewer essential skills to maximize the efficiency of the tool, making observations of full lessons unnecessary.

Developing a more concise tool for evaluating instructional quality in an intervention context may allow for a shorter observation period, enabling coaches and supervisors to conduct more frequent observations. Current tools designed for use in general education settings not only measure multiple aspects of quality that may not be applicable in intervention settings, but also require observation periods that are much longer than may be necessary in an intervention setting. Standard observation protocol using the CLASS (Pianta et al., 2008) requires at least four 30-minute cycles, for a total of two hours of observation time, to obtain reliable and valid scores of quality. Although the Framework for Teaching (Danielson, 1996) does not specify a time frame for observation, an entire lesson is required to measure all components of the tool, which is likely to mean no less than 20 minutes of observation. Even the RESET tool, designed specifically for intervention (Johnson & Semmelroth, 2012), requires the duration of an entire lesson (i.e., at least 15 minutes) to obtain information about all of the components within the tool.

The reality of providing numerous opportunities for feedback to interventionists is prohibitive, as current tools are too lengthy or not focused on the facets of instructional quality specific and most critical to the intervention context. By further delineating the specific skills in intervention delivery that are responsible for improved student outcomes, shorter observation tools and observation periods may be possible, which would permit more frequent observation and feedback opportunities to improve overall instructional quality.
Given that administrators and coaches responsible for evaluating interventionists often have limited availability to conduct observations, provide feedback, and follow up to ensure improvement in instruction (Knight, 2007), efficient tools specifically designed for formative evaluation in intervention settings may be more practical than current tools. Greater clarification is needed to determine the essential features of quality intervention delivery, as well as the amount of observation time sufficient to capture instructional quality.

Within the context of general education, one research team has begun to address the need for a more concise measurement tool to ensure more frequent and timely feedback. Gargani and Strong (2014) have developed a tool called the Rapid Assessment of Teacher Effectiveness (RATE). The premise behind this tool is to provide a means to identify successful teachers better, faster, and more cheaply than current observation tools allow. The RATE tool has only ten items addressing general classroom practices (e.g., lesson objective, multiple delivery mechanisms, providing examples/non-examples, pacing), each rated on a scale of one to three, with three being the highest level of implementation. The RATE tool was specifically designed to predict the ability of teachers to raise the achievement of their students, using shorter segments of instruction, fewer observations, and less training (Gargani & Strong, 2014). Multiple studies of predictive validity have shown that the tool can successfully identify, after only one twenty-minute observation, the teachers whose students will have the best achievement outcomes (Strong, 2011). While the tool has been effective at distinguishing effective from non-effective teachers, its concise nature was not intended to provide enough information to be used for formative assessment for improving instruction (Gargani & Strong, 2014).

Another research team has begun to explore concise measurement of intervention implementation by using an observation tool called Snippets to study teachers' use of specific comprehension supports in a general education, whole-classroom setting (Pratt & Logan, 2014). In one study, two six-minute segments of a 90-minute observation video documenting the 90-minute reading block were coded using the Snippets tool. Within the 90-minute reading block, teachers were instructed to use a supplemental curriculum called Let's Know (Language and Reading Research Consortium, in press) for 30 minutes and continue with regular language and reading activities for the remaining 60 minutes. One of the six-minute segments was extracted from the 30-minute Let's Know lesson, while the other was taken from the remaining 60 minutes of language and reading instruction. The Snippets tool was designed to measure very discrete language-focused supports related to comprehension development in primary-grade children. The focused nature of this tool allowed for reliable measurement of the use of comprehension supports within the short time frame of six-minute observations. The tool was able to reliably measure significant differences in the comprehension supports utilized during Let's Know instruction compared to language arts instruction outside of the Let's Know curriculum. Kappa, calculated for 14% of the coded segments, indicated 86% reliability with a second observer.

Time is precious, particularly for students already performing below expectations.
In the same way that the Snippets tool is specifically focused on one facet of reading instruction, a tool that focuses only on the most critical instructional practices for improving student outcomes within intervention would give educators the ability to formatively evaluate intervention to improve outcomes. A tool that allows for relatively short observation periods (i.e., 5-10 minutes) that can be repeated frequently (i.e., 3-4 times per month) could allow for a more responsive observational cycle and provide expedited improvement of instruction. An evaluation and support system that could provide this level of ongoing feedback for improving instructional delivery could ensure that student learning is accelerated and long-term outcomes for our most at-risk students are improved.

Purpose of the Study

Harn, Forbes-Spear, Fritz, and Berg (2012) developed the Quality of Intervention Delivery and Receipt (QIDR) tool to measure quality within small group intervention. The instructional elements in this tool have been shown through previous research to be important components of quality instruction in the intervention setting. The tool was originally designed to determine the relationship between instructional quality and student outcomes in early reading. Preliminary evidence indicates that the tool can be used reliably and that scores obtained are predictive of academic outcomes (Forbes-Spear, 2014). Using videos capturing small-group intervention instruction, the tool has been found to be reliable when measuring quality of a full-length lesson of 20-30 minutes (Harn, Forbes-Spear, Fritz, Berg, & Basaraba, 2014; Forbes-Spear, 2014).

The purpose of this study is to determine the relationship between shorter segments of intervention and the overall intervention session. The study is designed to determine whether one can reliably measure instructional quality more efficiently by comparing a sub-sample of the intervention time (i.e., 10 minutes) to the overall intervention (i.e., 25-30 minutes). In addition, this study examined whether a specific period of the intervention is more related to the overall intervention delivery (e.g., beginning, middle, or end). These findings could assist schools in utilizing their supervisory personnel more efficiently, which could maximize time to allow for a responsive observational cycle that would improve instructional quality for those students who need high-quality instruction most. Investigation of these issues was guided by the following research questions:

Research Questions

1) Can adequate inter-rater reliability (IRR) be obtained after observing 10 minutes of full-length intervention lessons?

2) Using the QIDR, what is the relationship between scores obtained watching the full lesson versus sampling ten minutes of the lesson?

3) To what extent does the relationship between QIDR ratings obtained watching the full lesson, versus sampling ten minutes of the lesson, depend on time segment of the lesson (i.e., beginning, middle, end) or on phase within the intervention (i.e., 2nd week, 5th week, 8th week)? In other words, are correlations between the ratings systematically stronger or weaker based on time segment or intervention phase?

4) Which QIDR ratings (full lesson vs. 10-minute sample; beginning, middle, end; intervention phase) account for the most variance in student outcomes?
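To make the statistics behind research questions 1 and 2 concrete, the sketch below shows one way a one-way random-effects, absolute-agreement intraclass correlation (for inter-rater reliability) and a Pearson correlation between segment-based and full-lesson scores could be computed. It is a minimal illustration under assumed inputs: the ratings, the number of groups, and all variable names are hypothetical and do not reproduce the study's data or its actual analysis procedures.

```python
import numpy as np
from scipy import stats


def icc_one_way(ratings: np.ndarray) -> float:
    """ICC(1): one-way random-effects, absolute-agreement, single-rater
    intraclass correlation. `ratings` is an (n_segments, n_raters) array."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    seg_means = ratings.mean(axis=1)
    # Between-segment and within-segment mean squares from a one-way ANOVA.
    ms_between = k * np.sum((seg_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - seg_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical QIDR ratings: eight 10-minute segments, each scored by two raters.
segment_ratings = np.array([
    [3.2, 3.4], [2.8, 2.9], [3.9, 3.7], [2.1, 2.3],
    [3.5, 3.5], [2.6, 2.4], [3.0, 3.1], [3.8, 3.6],
])
print(f"ICC(1) for 10-minute segments: {icc_one_way(segment_ratings):.2f}")

# Hypothetical comparison of segment-based scores with full-lesson scores
# for the same eight groups (research question 2).
segment_scores = segment_ratings.mean(axis=1)
full_lesson_scores = np.array([3.3, 2.7, 3.8, 2.4, 3.6, 2.5, 3.2, 3.7])
r, p = stats.pearsonr(segment_scores, full_lesson_scores)
print(f"Segment vs. full-lesson QIDR scores: r = {r:.2f} (p = {p:.3f})")
```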
CHAPTER II

LITERATURE REVIEW

One of the most influential factors in determining student achievement is the quality of the teacher and, specifically, the quality of instruction received (Darling-Hammond, 2010; Kane, Staiger, & McCaffrey, 2012). As a result, recent national policy has placed renewed emphasis on developing systems to evaluate teacher and instructional quality for the purpose of ensuring high-quality teachers and improving instruction to maximize student outcomes (McGuinn, 2012; National Council on Teacher Quality, 2012). The implementation of these new policies has brought about many challenges. Determining what constitutes a quality teacher and quality instruction, and the best way to measure that quality, remains an ongoing struggle in the field of education (Goe, Bell, & Little, 2008; Johnson & Semmelroth, 2012). The definition of quality instruction, and which aspects of instruction most impact student outcomes, continues to be pursued through research (e.g., Cameron et al., 2005; Carlisle, Kelcey, Berebitsky, & Phelps, 2011; Foorman & Torgesen, 2001; Gargani & Strong, 2014; Gersten, Baker, Haager, & Graves, 2005; Hagan-Burke et al., 2013; Pratt & Logan, 2014). The measurement of instructional quality is further complicated by the variation in instructional contexts and the differences in instructional expectations for each of these settings (i.e., general vs. special education; Foorman & Torgesen, 2001; Zigmond & Kloo, 2011), as well as the resource-intensive nature of measurement when using current observation tools (Gargani & Strong, 2014).

Chapter two begins with a brief history of the use of classroom observation tools for research and school-based purposes, followed by an overview of the observation tools most commonly used to measure instructional quality in general education classrooms. For each tool, the specific purpose(s), content, and training and observation time requirements will be reviewed. Next, the literature review will focus on observation tools designed for the purpose of measuring instructional quality in alternative settings (i.e., intervention and special education), and the ways in which these tools can and do differ from those designed for general education. Then, a discussion will present recent research investigating the use of shorter observation periods to measure instructional quality and teacher effectiveness. Finally, the chapter concludes with a discussion of the need to determine more efficient and valid ways to evaluate instructional quality to support more effective intervention delivery in school settings.

Classroom Observation

Classroom observation has become a common component in the measurement of instructional quality and teacher effectiveness, both as an element of evaluation in applied settings and for purposes of research (Chomat-Mooney et al., 2008; Goe et al., 2012; Pianta, Mashburn, Downer, Hamre, & Justice, 2008; Semmelroth & Johnson, 2013). In applied settings, observation is sometimes used as part of high-stakes employment decisions (i.e., value-added measurement, raises, termination), but has also been found to be helpful for providing administrators with information that can guide professional development of teachers (Goe & Croft, 2009; Pianta, Mashburn, et al., 2008).
Observation for the purpose of informing professional development is arguably the most important use, and standardized observation systems can provide a means for systematically determining needs for professional development for each teacher and school (Danielson, 2011; Pianta, 2003). The next section will highlight the historical context of observation research before delving into specific tools designed for classroom observation.

History of Classroom Observation

Classroom observation has been a part of educational research for over forty years (Gage & Needels, 1989). Much of the initial research followed a process-product approach, meaning that researchers were trying to identify which teacher processes (i.e., instruction and interactions) produced the best student learning. This was also an attempt to delineate what distinguished effective from ineffective teachers so that effective teaching could be emulated across classrooms (Brophy & Good, 1986; Brophy, 1986). This research was often quantitative in nature and focused on frequency counts of discrete classroom behaviors such as the number of pages of curriculum presented, time allocated for instruction, or classroom management behaviors (Brophy & Good, 1986). This method of observation brought about various criticisms, including that there was too much emphasis on discrete teacher behaviors as "causes" and student achievement as "effects," with no acknowledgement of various other classroom factors that might affect student achievement, including the reciprocal effect of teacher-student interactions (Gage & Needels, 1989; Gage, 1989). In addition to criticism regarding the content of observations, methodological concerns were also raised. Some believed that there was too much reliance on correlational research and advocated for an experimental approach to determine causality (Macmillan & Garrison, 1984).

As a result of these criticisms, researchers in the 1990s began to avoid use of quantitative methods and instead employed more qualitative approaches to observation (Chomat-Mooney et al., 2008; Gudmundsdottir, 1997). These methods provided rich descriptions of complex classroom interactions and provided an avenue for creating hypotheses on what constituted high-quality classrooms, but findings were difficult to generalize and did little to provide definitive understanding of which teacher-student interactions allowed for the greatest student achievement (Chomat-Mooney et al., 2008).
These included such observational tools as the Early Childhood Environment Rating Scale (ECERS; Harms & Clifford, 1980) and the Observational Record of the Caregiving Environment (ORCE; NICHD Early Child Care Research Network (ECCRN), 1996). Both of these measures were developed to measure the interactions of child care providers with children, as well as the overall quality of childcare settings. The revised edition of the ECERS (ECERS-R; Harms, Clifford, & Cryer, 1998) is still widely used as a global 19 measure of quality (Cassidy, Hestenes, Hegde, Hestenes, & Mims, 2005), including teacher-student interactions, but has been found to have a greater focus on classroom environment than on interactions between teachers and students (Sammons et al., 2002). In contrast, the ORCE was developed to specifically measure interactions between the caregiver and individual children (NICHD ECCRN, 1996). The ORCE has been used in a large-scale longitudinal study to determine how various aspects of childcare quality impacted later student outcomes. One of the major findings of this study was that teacher’s use of language (e.g., asking questions and responding to children’s talking) was linked to better cognitive and language development (National Institute of Child Health and Human Development Early Child Care Research Network, 2000; NICHHD SECCYD). An upward extension of the ORCE was later developed to measure quality in kindergarten and was called the Classroom Observation System for Kindergarten (COS- K; (National Center for Early Development and Learning [NCEDL], 1997) and was later adapted for first (COS-1; NICHD Early Child Care Research Network, 2002), and third and fifth grades (COS-3/5; NICHD Early Child Care Research Network, 2004). Both the ORCE and COS measure classroom features found to be related to students’ academic and social development through time-sampling of discrete behaviors, coupled with more global rating scales, to capture quality of teacher-student interactions (Hamre & Pianta, 2005; NICHD Early Child Care Research Network, 2002b; Pianta, Belsky, Houts, Morrison, & National Institute of Child Health and Human Development Early Child Care Research Network, 2007; Rimm-Kaufman et al., 2002). The discrete behaviors were recorded during 10-minute periods of 30-second observation intervals and included 20 measures of setting and activities (e.g., teacher-directed activity, individual activity, unstructured activity, recess) and teacher behaviors (e.g., read-alouds, teacher-child interactions, teacher affect, and teaching of social and academic skills). Scoring of discrete behaviors was followed by global measures of classroom quality based on observations outside of the time-sampling of behaviors. Global ratings consisted of ratings of classroom dimensions such as overcontrol/intrusiveness, emotional climate, classroom management, literacy instruction, feedback, and child behavior. These dimensions were rated on a seven point scale, ranging from adequate to excellent (Pianta et al., 2002). The COS was a precursor to the Classroom Assessment Scoring System (CLASS; Pianta, La Paro, et al., 2008) which will be discussed in detail in a later section of this chapter. The ORCE and COS were among the first to consider both discrete and global measures of classroom quality, elucidating the importance of considering global classroom quality in the elementary grades as a factor in student outcomes. 
These elementary measures were able to capture features of elementary classrooms that were related to academic and social development of students while being content-independent (Hamre & Pianta, 2005; NICHD Early Child Care Research Network, 2002b, 2004; Pianta et al., 2007; Rimm-Kaufman et al., 2002). An analysis of results using data from the SECCYD study indicated that global ratings of the classroom using the ORCE were more related to student academic outcomes than the time-sampled teacher behaviors (Chomat-mooney et al., 2008). This early work in observation research has moved the field toward a greater understanding of the need for measures of classroom quality to systematically examine 21 global measures of quality through observation. Although the majority of such tools have been developed and validated for use in early childhood general education classroom settings (Chomat-mooney et al., 2008; Maxwell et al., 2001), other observation tool development has explored the use of observation in a wider range of classroom levels (e.g., Danielson, 2011; Fish & Dane, 2000; La Paro et al., 2004; Waxman et al., 1997). The following section is an overview of the more commonly-used observational instruments designed for use in general education, followed by a review of tools designed for observation in alternative settings, (i.e., special education and intervention). The review of each observation tool will include a discussion of the purpose and content of each tool as well as a discussion of the logistical requirements for the use of each measure. Table 1 provides a summary of the review for each tool. 22 Table 1 Overview of Observation Tools Observation Tool Setting Focus Training Requirements Length of Observation Special Qualifications for Observers Classroom Assessment Scoring System (CLASS; Pianta, et al., 2005) General education Emotional support, classroom organization, instructional support 16 hours Four 30- minute cycles of observation (total: 2 hours) None specified Framework for Teaching (FFT; Danielson, 1997) Primarily general education; utility in special education claimed Planning and preparation, classroom environment, instruction (constructivist approach), professional responsibilities 12-24 hours 30 minutes- 1 hour, plus time to examine pertinent artifacts None specified, but typically designed for use by supervisors Recognizing Effective Special Education Teachers (RESET; Johnson & Semmelroth, 2012) Special education Evaluating instructional practices that are evidence- based for use with students with disabilities ½ day One lesson (15-75 minutes) Special education teachers Quality of Intervention Delivery and Receipt (QIDR; Harn, et al., 2011) Small group intervention Explicit instruction principles, instructional and behavioral management 4-6 hours One lesson None specified 23 General Education Observation Tools In an attempt to elucidate and standardize observation methods, various observation tools have been developed and researched. Although some tools continue to focus on discrete teacher behaviors or are content-dependent, many have been developed with an emphasis on more global measures of quality that can be used in various content- independent settings (Chomat-Mooney, et al., 2008). This section will review some of the more commonly used global, content-independent tools in general education and will demonstrate the need for valid and efficient tools designed for use in intervention settings. 
The tools that are included in this review are among those used in the Measures of Effective Teaching (MET) Project, a large-scale teacher effectiveness study (Bill and Melinda Gates Foundation, 2009). While other tools were used in that study, those included in this review are the ones considered content-independent, global measures, which facilitates greater usability in school settings.

Classroom Assessment Scoring System (CLASS). One observation tool that is commonly used in various preschool and elementary settings is the CLASS. In fact, the CLASS has been adopted by Head Start as its primary tool for evaluating teacher quality as part of this federal initiative (Head Start Act, 2007). It was originally designed to assess classroom processes found to be related to student outcomes in pre-kindergarten through 3rd grade (La Paro et al., 2004; Pianta et al., 2005). The criterion and predictive validity of the CLASS have been established through multiple studies associating it with other similar tools and associating scores on the CLASS with student outcomes such as gains on standardized assessments and improved social skills (e.g., La Paro et al., 2004; Mashburn et al., 2008; Pianta et al., 2005).

Purpose. The CLASS was originally designed to be used for research purposes to understand the social-emotional climate of the classroom and how it relates to student achievement (Pianta, La Paro, et al., 2008). The instrument has been specifically used to develop "empirically-based theories of teaching and learning that serve as the foundation for understanding education and developing new solutions" (Hamre, Pianta, Mashburn, & Downer, 2007, p. 3). Although the MET project has indicated that an ultimate purpose for the measure might be providing feedback to improve instruction, this purpose was not within the scope of the original MET project (Joe, Tocci, Holtzman, & Williams, 2013) and has not been the focus of most other research conducted using the instrument (Hamre, Pianta, Mashburn, & Downer, 2007; La Paro, Pianta, & Stuhlman, 2004; Mashburn et al., 2008; Pianta et al., 2005). In 2008, however, Pianta, Mashburn, Downer, Hamre, and Justice published a study using the CLASS as part of a professional development and training cycle to improve classroom quality. The study examined the effectiveness of a professional development system entitled My Teaching Partner (MTP). Within the study, two treatment conditions were utilized. The first used a system that linked videos of high-quality teacher-student interactions (based on the CLASS framework) with a consultation process using components of the CLASS as a common language for instructional quality and as a framework for providing professional development to improve instruction. The second condition provided teachers only with videos depicting high-quality instruction according to the CLASS, but did not include personal consultation. Pianta et al. (2008) reported that teachers who received video exemplars along with personal consultation made greater improvements in the categories of teacher sensitivity and instructional learning formats than did teachers receiving only the video exemplars. The authors also found that the effect was greater for teachers of classrooms in which 100% of the children were classified as experiencing poverty, and effect sizes were small to moderate across the two categories.

Content.
The CLASS Framework (Hamre & Pianta, 2007) describes a theory of classroom practice derived from earlier theoretical and empirical research in educational and psychological literatures (e.g., Brophy & Good, 1986; Brophy, 1999; Gage, 1989; Pressley et al., 2003). It is framed around three broad domains of classroom interactions that are hypothesized to be important for promoting learning and social development of students: Emotional Support, Classroom Organization, and Instructional Support. This framework is supported by previous research on classroom observation and teacher effectiveness including the work of Brophy (1999) who outlines 12 principles of effective teaching including classroom climate, opportunities to learn, curricular alignment, and student engagement. Work by Pressley and colleagues (2003), including organizing teaching strategies into creation of a motivational atmosphere, classroom management, and curricular and instructional decisions, also supports the foundational framework of the CLASS. The creators of the CLASS consider their tool to be more comprehensive than these other frameworks, however, because of a greater emphasis on social and emotional components of the classroom, specifically teacher-student interactions and relationships, as well as emphasis on instruction to enable higher-order thinking skills (Hamre, Pianta, Mashburn, & Downer, 2007). Each of the domains (Emotional Support, Classroom Organization, and Instructional Support) is further subcategorized into dimensions that are explicitly 26 described through behavioral, observable classroom interactions between teachers and students, or among students. Emotional Support is broken down into four dimensions which include Classroom Climate (Positive and Negative), Teacher Sensitivity, and Regard for Student Perspectives. Classroom Organization includes three dimensions: Behavior Management, Productivity, and Instructional Learning Formats. Instructional Support includes dimensions for Concept Development, Quality of Feedback, and Language Modeling. Each of these dimensions is provided descriptors for low, middle, and high implementation of the subcategory. Ratings using the CLASS are made on a seven-point scale, ranging from “Low” to “High” on each of the ten dimensions. As an example, the “High” level of implementation of the flexibility and student focus subcategory has a behavioral description that indicates: “The teacher is flexible in his or her plans, goes along with students’ ideas, and organizes instruction around students’ interests.” If a rater considered this an accurate description of the interactions being observed, a rating of “high” (six or seven) would be warranted. Conversely, an observer could rate a teacher as “low” and score a one or two if “The teacher is rigid, inflexible, and controlling in his plans and/or rarely goes along with students’ ideas; most classroom activities are teacher-driven.” If a rater observes that “the teacher may follow the students’ lead during some periods and be more controlling during others,” he or she may rate them in the “middle” category and assign a score of three, four, or five. This same approach is used across the other dimensions as well. For the complete rating scale descriptors on the CLASS, see Pianta, La Paro, and Hamre (2008). Training. To ensure that raters can provide reliable and accurate scores using the CLASS, developers indicate that sixteen hours of training is required. 
Developers of the CLASS indicate that observers should have some teaching experience; however, it has been found that teachers and administrators with the most experience are often less reliable, due to preconceived notions regarding effective teaching that may not align with elements of the CLASS (Hamre, Goffin, & Kraft-Sayre, 2009). A manual containing descriptions of each of the domains and dimensions, to be read prior to training, is provided to trainees. The two-day workshop consists of guided practice with videotaped classroom footage and an extensive videotaped reliability test involving either five or six cycles of 20- to 44-minute observations. With this level of training, an average interrater reliability (within one point of master coders) of .87 has been reported (Pianta, Mashburn, et al., 2008).

Observation duration. Developers of the CLASS recommend a minimum of two hours of observation, in the form of four 30-minute cycles, in order to obtain a reliable measure of classroom quality (Chomat-Mooney et al., 2008). In general, it is also recommended that multiple observation cycles of each classroom, across different points in the school year, be obtained in order to confidently determine a level of quality within the classroom (Kane & Staiger, 2012).

Framework for Teaching (FFT). The Framework for Teaching (FFT; Danielson, 1997) is another observation tool widely used within general education settings. Twenty states and various school districts in the United States have adopted the FFT as a means for evaluating teachers (Hansen, Lemke, & Sorensen, 2013). Numerous studies have indicated predictive validity of the measure for student learning outcomes (Borman, Kimball, Borman, & Kimball, 2005; Heneman, Milanowski, Kimball, & Odden, 2006; Holtzapple, 2003; Kane, Taylor, Tyler, & Wooten, 2010; Milanowski, 2004). For example, Kane, Taylor, Tyler, and Wooten (2010) found that a one-point increase in FFT scores accounted for achievement gains of one-fifth and one-sixth of a standard deviation for reading and math, respectively. In another example, Heneman et al. (2006) used correlational research to determine the relationship between teacher performance on the FFT and student achievement in both reading and math across four sites. Scores on the FFT correlated with reading achievement at an average of .29, with correlations ranging from .22 to .37; correlations of FFT scores with mathematics achievement averaged .23, ranging from .11 to .32 across the four sites.

Purpose. The FFT was first published by the Association for Supervision and Curriculum Development in 1996. It was an extension of the Praxis III: Classroom Performance Assessments, which had been developed over a period of six years (1987-1993) by Educational Testing Service (ETS) as an observation-based method to evaluate the quality of pre-service teachers for the purpose of licensure. The FFT expanded on ETS' work by including skills of teaching required of all teachers, not just pre-service teachers (Danielson, 2011). Danielson (2007) maintains that an evaluation system must serve two purposes: to a) ensure teacher quality and b) inform professional development. The FFT was designed to reflect current notions of "best practices" and to function as both a formative and summative evaluation tool (Danielson & McGreal, 2000).

Content.
The FFT is considered by developers to be a contemporary form of observation that focuses on constructive approaches to teaching (Danielson, 1996). The framework is based upon an underlying notion that teachers honor and nurture the 29 students’ natural impulse to construct new understandings (Brooks & Brooks, 1999). The knowledge base for the original ETS version of the framework for teaching was developed around three information sources: wisdom of experienced teachers, theory and data of educational researchers, and requirements for licensure from various states (Danielson, 2007). Surveys were used to access information from experienced teachers to perform job analyses of teachers from elementary, middle, and high school. Extensive literature searches were used to review and synthesize research on effective teaching and requirements of state licensing agencies were analyzed and incorporated within the ETS version of what would later become the FFT. In accordance with state licensing agencies, the developers designed the FFT to be aligned with the Interstate New Teachers Assessment and Support Consortium (InTASC; Council of Chief State School Officers, 2011), a set of standards used to measure competency of pre-service teachers in many teacher preparation programs throughout the United States. The latest edition of the FFT has also been modified in an effort to reflect the instructional implications of the Common Core State Standards (CCSS; Danielson, 2013). The FFT is organized around four broad domains: Planning and Preparation, Classroom Environment, Instruction, and Professional Responsibilities. Each of these domains consists of five or six components. These components are further defined through elements related to each component. For instance, within the Planning and Preparation domain, there are six components: demonstrating knowledge of content and pedagogy, demonstrating knowledge of students, setting instructional outcomes, demonstrating knowledge of resources, designing coherent instruction, and designing student assessments. Each of these components is further defined with additional 30 elements. The scoring rubric contains four possible levels of implementation: Level 1, Unsatisfactory; Level 2, Basic; Level 3, Proficient; and Level 4, Distinguished. Within this rubric, specific examples and detailed explanations are provided to aid in assigning scores during observation. Research informing the first domain (Planning and Preparation) was derived from multiple sources and highlights organizational skills, planning, content and pedagogical knowledge, using students’ prior knowledge, having high expectations, and establishing clear goals (Brooks & Brooks, 1999; Jackson & Davis, 2000; Marzano, 2004; Schmoker, 1999; Shulman, 1987; Sykes & Bird, 1992; Wiggins & McTighe, 1998). The second domain, Classroom Environment, draws upon research indicating that teachers must master at least basic levels of classroom management (i.e., creating routines and procedures, building an efficient and functional physical environment, and establishing norms and expectations for student behavior) prior to becoming skilled at providing instruction (Evertson & Harris, 1992; Jackson & Davis, 2000; Jensen, 1998; Tomlinson, 1999). Instruction, the third domain of the FFT is designed to reflect the emphasis on teaching for understanding and conceptual learning and is based on the premise that children benefit most when allowed to “construct” new learning based on prior knowledge (Danielson, 2007). 
This domain was informed by research highlighting the importance of communicating expectations and goals, a need for flexibility, questioning and discussion skills, and assessment practices (Brooks & Brooks, 1999; Skowron, 2001; Tomlinson, 1999). The final domain, Professional Responsibilities, is an attempt to measure the full range of responsibilities that constitute teaching, including commitment to student learning, systematic reflection of teaching practice, collaboration 31 in a learning community, and effective parent involvement (Colton & Sparks-Langer, 1992; Danielson, 2007; Jackson & Davis, 2000; Ross & Regan, 1993; Stronge, 2005). Training. According to McClellan, Atkinson, and Danielson (2012), training should include a minimum of 3-4 hours of an introduction to the tool, including the process for observation and an overview of the tool, as well as training to overcome potential bias. An in-depth training of the content of the tool requires between 12 and 24 hours. Embedded within this training is an additional 12 hours for practice scoring of clips for each of the domains. Lastly, observers should spend between eight and ten hours scoring full-length practice videos. Overall, the training should be between forty and fifty hours in order to ensure reliability. The authors indicate that inter-rater reliability should be at a level of .80 or higher following training. The authors do not offer suggestions on levels of experience preferred for observers. Observation duration. The FFT was designed to be used for full-length observations of lessons, ranging from 30 minutes to one hour. However, some of the components (e.g., planning and preparation, and professional responsibilities) require additional time to examine artifacts such as lesson plans, inspect evidence of participation in professional development opportunities, and investigate the nature of interactions with colleagues (Danielson, 2007). Special Education and Intervention Observation Tools Measuring teacher effectiveness within the context of special education and other intervention settings can be quite complex (Brownell et al., 2009). Since the goal of special education/intervention instruction is to provide more targeted and/or individualized instruction, tools designed for use in general education settings may be 32 inappropriate (Johnson & Semmelroth, 2012; Jones & Brownell, 2013). The FFT claims to have utility within the context of special education, acknowledging that there might be slight variations in the delivery and responsibilities of specialists, but that, “fundamentally, they are all teachers of students” (Danielson, 2007; p. 109), making the framework applicable to a variety of settings. However, as Jones and Brownell (2014) explain, instruction in a special education or intervention environment must be designed to focus on skills that are likely very difficult for the student to grasp requiring teacher- directed, intensive, and repetitive tasks for students to acquire the knowledge and skills being taught. This teacher-directed approach is in direct contrast to the more constructivist framework that the CLASS and FFT tools advocate and measure. Because of this difference of definition of effective instruction, some researchers have sought to develop and validate tools measuring the types of instruction that are more likely seen in intervention settings (Harn, Forbes-Spear, Fritz, & Berg, 2011; Johnson & Semmelroth, 2013). 
This section will outline two tools specifically developed to measure instruction within the special education or intervention context, the Recognizing Effective Special Education Teachers Observation Tool (RESET; Johnson & Semmelroth, 2013) and the Quality of Intervention Delivery and Receipt (QIDR; Harn et al., 2011). Recognizing Effective Special Education Teachers Observation Tool (RESET). The RESET Observation Tool (Johnson & Semmelroth, 2012) was specially designed to measure effectiveness of special education teachers and take into account the more varied settings and instructional strategies used by special education teachers. The developers set out to design an observation tool that was a systematic observation 33 approach, aligned with evidence-based practices for students with disabilities, and that could serve as an alternative to the FFT, which the authors contend may not be aligned with the research base around best practices for students with disabilities, and may endorse practices that do not lead to improved outcomes for students with disabilities (Johnson & Semmelroth, 2012; Semmelroth & Johnson, 2013). Purpose. Following the lead of Danielson (2007), the developers of the RESET observation tool sought to develop a tool that could provide feedback that could serve the same purposes as the FFT (i.e., to ensure teacher quality and promote professional development), but specifically in the special education context. The developers aimed to develop a tool that addressed the diversity found within special education classrooms and acknowledged the unique struggles found in the special education profession (Semmelroth, 2013). The RESET system was also designed to provide feedback on specific instructional practices to allow special education teachers to improve their practice (Semmelroth, 2013). Content. The content of the RESET observation tool is based on Danielson’s (2007) framework with a focus on Domain 3, Instruction. It differs from the FFT, however, in that it includes more explicit criteria for evaluating evidence-based instructional practices appropriate for students with disabilities (Semmelroth, 2013). The tool was developed within a framework that defines special education teachers as those who are able to identify a student’s needs, implement evidence-based instructional practices and interventions, and demonstrate student growth (Johnson & Semmelroth, 2012). 34 The RESET observation tool was developed through an extensive review of research within special education. Three sources informed the content of the tool: a) Danielson’s FFT (2007), Domain 3: Instruction; b) Council for Exceptional Children (CEC) professional Standards for Special Education Teachers (2009); and c) a meta- review of literature on effective special education instructional practice (Semmelroth, 2013). Through this review of research, the developers created a tool designed to be flexible enough to be used across various special education settings (e.g., inclusive settings, small-group direct instruction, team-teaching) and addressing the needs of students with various disabilities (Semmelroth & Johnson, 2013). The initial version of the RESET consists of three main parts: Lesson Overview (introduction), Lesson Components (instructional practices), and Lesson Summary (conclusion). Three different evidence-based instructional practices are included in the RESET tool: direct, explicit instruction, whole-group instruction, and discrete trial teaching. 
There are between 28 and 67 items on the RESET depending on the number of instructional practices being observed. The tool is web-based, operating on a direct logic system (i.e., some questions only appear if previous questions have been answered in a pre-defined way; Johnson & Semmelroth, 2012). For instance, if the observer indicates that the lesson being observed is employing direct instruction, only scoring related to direct instruction is revealed to the observer (Johnson & Semmelroth, 2012). In that instance, observers would be using the Explicit, Direct Instruction component of Subscale 2: EBP Implementation. Within the Explicit, Direct Instruction component, more specific sub-headings are evaluated: a) Organized Instruction; b) Sequenced Instruction; c) Student Participation; d) Scaffolding; and e) Assessment. The second 35 evidence-based instructional practice included in the RESET tool is the “Whole Group Instruction” component which includes subheadings of a) Individualized Instruction, b) Skill Development, c) Student Engagement, and d) Feedback and Assessment. The third evidence-based instructional practice included in the RESET tool is the Discrete Trial Teaching component including subheadings of Antecedent, Response, Consequence, and Intertrial Interval (ITI). The rubric for scoring of the RESET is based on Danielson’s (2007) four-point scale: zero (unsatisfactory), one (basic), two (proficient), three (distinguished). Within the rubric, developers have included behavioral descriptors to aid observers in assigning a score. For example, within the Whole Group Instruction component: Student Engagement, two descriptors for the score of zero are provided: “The teacher provides little to no opportunities for guided and independent practice for students,” and “The teacher provides little to no opportunities for students to participate in classroom activities.” Conversely, a score of three on this component indicates “The teacher provides for individualized opportunities for guided and independent student practice for all students,” and “The teacher has created a learning environment that encourages active participation from all students, as well as maintains active levels of self-determination and self-advocacy.” For more excerpts from the RESET observation tool, see Semmelroth (2013). Training. For training, observers are provided a manual outlining the components of the RESET observation tool. A half-day training presentation is provided to orient observers to the tool and provide opportunities for explaining the manual and the observation tool (Semmelroth & Johnson, 2013). Following the presentation, observers 36 are given the opportunity to view a practice video as a group activity. Observers rate the video and differences in scores are discussed until consensus is reached (Johnson & Semmelroth, 2012). Following this, observers rate a second video and scores are reviewed in a whole group activity. Developers reported an interrater agreement of .72 to .95 during training, measured both as a holistic score and by each subscale (Semmelroth & Johnson, 2013). The developers of the RESET tool sought only special education teachers for the initial training during this pilot version of the RESET. The teachers ranged in experience from five to thirty years with an average of twelve years of teaching experience. Observation duration. Developers designed the RESET observation tool to be used with video recordings of single lessons. 
The mean length of the videos used during development of the tool was 25 minutes, with videos ranging from 17 to 72 minutes. Regardless of video length, the videos were representative of one lesson and are to be observed and rated in their entirety (Semmelroth & Johnson, 2013).

Quality of Intervention Delivery and Receipt (QIDR). Similar to the RESET observation tool, the QIDR is also designed to be used in settings other than general education classrooms (Harn et al., 2011). Unlike the RESET, however, the QIDR is designed to measure only the small group, direct, explicit instruction that is typically found in intervention settings. It was not developed specifically for use in special education, but for all intervention settings that involve small group instruction, independent of content area (i.e., reading, math).

Purpose. The QIDR tool (Harn, Forbes-Spear, Fritz, & Berg, 2012) was developed for two main purposes. The first was to measure the quality of small group intervention delivery in order to identify specific elements of instruction that are related to outcomes and accelerate student learning. For each of the 15 specific instructional skills measured on the QIDR, a rubric was created to assess the quality with which that instructional skill was delivered on a scale of 0-3. By measuring targeted instructional behaviors with a qualitative lens and in a systematic manner, specific instructional behaviors could be examined to identify potential research areas to focus on to better support students. The second, more applied, purpose for developing the QIDR was to provide a tool that principals and coaches can use to give specific feedback to interventionists and drive instructional improvement. Although the tool had dual purposes, each purpose required that the tool measure multiple facets of instructional delivery and student behavior.

Content. To meet these purposes, developers looked to various sources to determine what aspects of instruction were most related to improved student outcomes. The content of the QIDR observation tool is not dependent on specific academic instructional content (i.e., reading, math), but instead is based on instructional principles that have evidence of increasing student achievement. In an intervention setting, instructional behaviors related to systematic, explicit instruction have shown positive effect sizes (Gersten et al., 1997; Swanson, 1999), indicating that instruction in these settings must be explicit, intensive, and supportive (Torgesen, 2002). Therefore, items within the QIDR are derived from instructional principles commonly used in intervention settings to accelerate academic achievement for students who are at risk or in need of remediation. The main elements of the QIDR were developed to reflect key instructional elements necessary for providing explicit instruction (Archer & Hughes, 2011; Brophy & Good, 1986; Brophy, 1986; Rosenshine & Stevens, 1986). Similar to how the CLASS was developed from the effective teaching research completed during the 1970s and 1980s, specific instructional behaviors were identified as common among teachers who consistently obtained positive outcomes from their students (Brophy & Good, 1986; Brophy, 1986; Dunkin & Biddle, 1974; Medley, 1979; Rosenshine & Stevens, 1986; Rosenshine, 1971). Through systematic classroom observations, common instructional behaviors were found to correlate with student outcomes.
For example, Brophy and Good (1986), through a synthesis of previous research, found that the most consistently replicated finding in observation research linked student achievement to opportunity to learn material and the degree to which teachers provided that opportunity through active participation and moving students through curriculum at a brisk pace. Achieving these opportunities was found to be related to high teacher expectations and classroom management that provided organized environments that maximized student engaged time (Brophy & Good, 1986). Items on the QIDR related to the work of Brophy and Good (1986) include providing specific feedback for correct and incorrect responses, using clear and consistent wording, and modulation of lesson pacing. Further items were informed by the work of Rosenshine and Stevens (1986) including providing multiple and varied opportunities for guided and independent practice, providing frequent modeling, and ensuring students achieve mastery before moving on to new concepts. Archer and Hughes (2011) indicate that there are 16 elements of explicit instruction that are important for ensuring positive 39 student outcomes. Many of these elements are informed by the previously-mentioned work, but some items in the QIDR reflect Archer and Hughes’ (2011) extension of these instructional principles, including organizing instruction systematically, and declaring academic and behavioral expectations. In addition to research-based instructional elements, the development team included elements on the QIDR related to student and group management of behavior. Research indicates that organization and management, along with positive social- emotional climate, may help increase engagement and opportunities to learn, thus positively impacting student academic outcomes (Brophy, 1986; Cameron et al., 2005; Connor et al., 2009; Hamre & Pianta, 2005). Based on these findings, items related to organization and emotional support, including organization of materials, familiarity and preparation of lessons, smooth transitions, and teacher’s responsiveness to the emotional needs of children are also measured. The wording of items on the QIDR was also carefully developed for two purposes: 1) to systematically and reliably measure across observers and 2) to give precise feedback to interventionists that could guide instructional improvement. Each instructional and behavioral element is behaviorally operationalized across the four levels of rating. For example, one of the items on the QIDR related to management is “Teacher appropriately responds to problem behavior.” The item is described in detail with examples such as “including off task behaviors; emphasizes success while providing descriptive, corrective feedback; positively reinforces to get students back on track.” In other words, coaches and interventionists could use descriptions on the scoring rubric to 40 determine specific actions to improve a score on a certain item, thus potentially improving instructional quality over time. For each of the items on the QIDR, observers provide scores using a Likert scale, ranging from zero, “Not implemented,” to 3, “Expert implementation.” Each of the levels of implementation is behaviorally operationalized with use of examples and frequency (when appropriate) to distinguish one level from the others. 
For instance, for the item related to responding appropriately to problem behavior, an interventionist would receive a score of zero to three based on the observer's perception relative to the following descriptors:

0 = "Teacher does not appropriately respond to problem behavior across multiple students. Teacher primarily provides negative feedback or ignores problem behavior for extended period of time (resulting in limited student participation, e.g., more than 20% of activity)";

1 = "Teacher sometimes appropriately responds to problem behavior. Teacher provides some positive or corrective feedback but does not regularly emphasize success. Teacher may have difficulty consistently responding to one student's problem behavior but sometimes responds appropriately to other students";

2 = "Teacher typically responds appropriately to problem behavior by emphasizing success and providing neutral corrective feedback for most students. Or no problem behavior occurs during the instruction"; and

3 = "Teacher consistently responds appropriately to problem behavior by emphasizing success and providing descriptive corrective feedback as needed for all students. For example, teacher 'catches' students engaging in appropriate behavior and provides descriptive positive feedback to encourage appropriate behavior."

All items and their operationalized definitions are included in the rubric (see Appendix). Preliminary evidence indicates that the QIDR is significantly related to, and predictive of, student outcomes (Forbes-Spear, 2014).

Training. The initial training on the use of the QIDR requires four to six hours. The training consists of an overview of the observation tool, explanation and examples of each element, guided practice scoring each element using segments of videos of small group instruction, and feedback on scoring accuracy for each element. All training videos were originally scored by the original QIDR team, whose members independently coded each training video, discussed any disagreements, and used a consensus-building approach to come to agreement on "true" scores. Raters are then provided the opportunity to practice scoring using all elements of the QIDR while watching a recorded 30-minute intervention session. This guided practice is followed by immediate feedback on scoring accuracy across all elements. Raters whose scores are no more than one point off the true scores (adjacent scores) on the Likert scale for each item are considered on track to obtain acceptable reliability and are given the opportunity to independently score three check-out videos. If raters do not provide adjacent scores on all items, retraining is provided, including discussion, revisiting key elements of the scoring rubric, and additional guided practice. After raters score each check-out video, their scores are compared to the true scores. Raters who obtained an intraclass correlation (ICC) of .6 or higher with the derived true scores (a correlation of .7 or higher between rater and true score) after each video were cleared to independently score the next check-out video. Those who fell below this cut-off were provided retraining followed by additional check-out videos until an acceptable ICC was reached. No coder needed more than two additional checks to obtain reliability. Once an observer demonstrated consistency and agreement in scoring with the true-scored videos, he or she could score videos independently.
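To make the check-out comparison concrete, the sketch below compares one rater's item scores on a check-out video with the consensus "true" scores, computing the within-one-point (adjacent) agreement and the rater/true-score correlation. The item scores are hypothetical, and the ICC itself would ordinarily be computed with standard statistical software; this is an illustration of the comparison logic rather than the project's actual analysis code.

```python
# Hypothetical check-out comparison: a rater's item scores versus consensus "true"
# scores on the 0-3 QIDR scale. Shows the adjacent-score check and the rater/true
# correlation; the ICC criterion would be computed with standard statistical software.

import numpy as np

true_scores  = np.array([3, 2, 2, 1, 3, 2, 0, 2, 3, 1, 2, 2])  # hypothetical items
rater_scores = np.array([3, 2, 1, 1, 3, 3, 0, 2, 2, 1, 2, 2])

# Proportion of items scored within one point of the true score (adjacent agreement).
adjacent_agreement = (np.abs(rater_scores - true_scores) <= 1).mean()

# Pearson correlation between the rater's scores and the true scores.
r = np.corrcoef(rater_scores, true_scores)[0, 1]

print(f"Within-one-point agreement: {adjacent_agreement:.0%}")
print(f"Rater/true-score correlation: r = {r:.2f}")
```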
The training can be delivered in a face-to-face setting, which allows for discussion and support as needed, or through an online training module which allows raters to train at their own pace followed by telephone or email support as needed for retraining. Raters using the QIDR during the pilot phase of the development of the tool had various backgrounds and levels of experience within the field of education. Some observers were undergraduate and graduate students with little or no teaching experience, while others were teachers with multiple years of teaching experience. Regardless of level of experience, all raters were able to be successfully trained to use the QIDR reliably which indicates that the range of observers could potentially be quite diverse and still obtain reliable measures of instructional quality. Despite the range of backgrounds within pilot training, need for retraining was relatively limited. Only one of three coders required informal retraining through discussion to gain reliability. Formal retraining was not required for any team members. Observation Duration. The QIDR was initially designed and used to measure instructional quality across an entire intervention lesson. Given the intensive nature of intervention instruction, these sessions typically run between 15 and 30 minutes. Maximizing Time for Observation and Feedback Each of the observation tools discussed in the previous section has the potential to provide important insight into instructional quality. A major challenge in using many of these tools is the extensive time necessary to train raters and complete observations (Gargani & Strong, 2014). The extended time required to carry out these tasks may make 43 the use of some of these tools prohibitive for the purpose of providing the regular feedback necessary to improve instruction. Two groups of researchers have begun to investigate alternatives to the current observation systems that can maximize efficiency of observation while balancing the validity of the data (Gargani & Strong, 2014; Pratt & Logan, 2014). Proximal and distal measures of quality using short observations. Pratt and Logan (2014) conducted a study to investigate the effects of Let’s Know (Language and Reading Research Consortium, in press), a supplemental curriculum for pre-kindergarten through 3rd grade, on teachers’ use of language-related comprehension supports. Researchers set out to examine two questions: 1) how the Let’s Know curriculum impacted teachers’ use of language-related comprehension supports; and 2) how the Let’s Know curriculum impacted quality of general teacher instructional delivery. To examine the first question, researchers developed an observation tool, Snippets, as a proximal measure of teachers’ use of language-focused comprehension supports. The tool was designed to observe very short samples, or “snippets”, of instruction (i.e., six-minute segments). Snippets contains eighteen items related to reading comprehension skills known to be important for pre-kindergarten through third grade students (e.g., prediction, inference, summarizing, main idea). Within the study, two six-minute segments of video, recorded during a 90-minute reading block, were coded using the Snippets tool. One of the six-minute segments was extracted from a 30-minute lesson in which the teacher was delivering the supplemental Let’s Know (Language and Reading Research Consortium, in press) curriculum. 
The other six-minute segment was taken from another part of the remaining hour of the same reading block to determine whether these language-based comprehension supports were being used during regular language-arts instruction outside of the Let's Know lessons. Each six-minute video segment was coded using an interval-based scheme in which observers coded twelve 30-second intervals for the presence or absence of any of the 18 language-focused comprehension supports. Therefore, in one six-minute segment, scores for each support could range from 0 to 12. Although the authors do not indicate how much training was required to attain reliability, reliability was assessed for 14% of the coded segments, with overall exact agreement of 98%, which translated into a kappa of .86. To measure the quality of the instruction being delivered, Pratt and Logan (2014) also used the CLASS (Pianta, Mashburn, et al., 2008) as a global measure of quality of implementation, using time segments half the length specified by the CLASS protocol. Observers rated four 15-minute segments of the 90-minute reading block (two during the Let's Know lesson, two outside of the Let's Know lesson) rather than the minimum of four 30-minute cycles indicated by the CLASS protocol. The instructional support (IS) domain of the CLASS was used to code the 15-minute segments to determine whether the Let's Know curriculum also impacted ratings on that domain. Even though observers were coding only a portion of the reading block, they were able to obtain 89% agreement based on the percentage of within-one agreement (the predominant approach used to assess reliability with the CLASS). During Let's Know lessons, teachers scored significantly higher in instructional support than they did in segments scored outside Let's Know lessons. This difference provided evidence of discriminant validity, indicating that the differentiated instruction and use of language-based comprehension supports present during the Let's Know intervention yielded increased quality of instructional supports, as measured by the CLASS, that was not present outside of the intervention.
Their main goal in the development of RATE was to determine if six deliberately chosen items were sufficient to predict student learning on standardized tests (Gargani & Strong, 2014). The items were derived, in part, from items on the CLASS (Pianta et al., 2008) as well as from previous work in which raters were asked to classify teachers according to whether they believed the observed teacher's students achieved above or below average (Strong, Gargani, & Hacifazlioglu, 2011). As a part of this early work, raters were polled to determine what factors most influenced their judgments. The raters cited student engagement, teaching strategies, and math knowledge as the most important factors, with student engagement being the most frequently cited. The results of this polling informed additional items within the rubric. The items within the rubric relate to lesson objectives, instructional delivery, questioning strategies, clarity of presentation, time on task, and level of student understanding. Each of the items is scored on a scale of one to three, with behavioral descriptions for each level of implementation. To provide scores for teacher quality, raters viewed only the first 20 minutes of a videotaped lesson and were allowed ten additional minutes to assign scores using the rubric. One of the purposes of the development of RATE was to provide a tool that requires minimal training while still producing reliable scores that are predictive of student outcomes (Gargani & Strong, 2014). For this reason, training is limited to a single two-hour session. Throughout the series of validation studies, researchers purposely chose observers with widely varied backgrounds. Some were undergraduate and graduate research assistants with no teaching experience, while others were teachers with experience using other rating systems. The validation studies were designed to determine whether the tool provided scores that were predictive of increases in student learning as assessed using value-added measures (VAMs) and whether it could be used reliably. The researchers reported that across five of the studies, observers were able to classify teachers as either high or low performing, according to their VAMs, between 70 and 78 percent of the time. Across five separate studies, interrater reliability was calculated using intraclass correlations (ICCs) in the same way as in the MET study. The range of ICCs for independent scoring was .31 to .92, with an average of .65 across the five studies. This average places the interrater reliability for this tool in the good category (between .60 and .74) according to commonly cited cutoffs for qualitative ratings of agreement (Cicchetti, 1994). Although these studies differ in purpose from one another and from the current study, this work can help inform the next series of research studies examining how to more efficiently and validly assess instructional quality that is related to student outcomes. The current study utilized the QIDR (Harn et al., 2011) and systematically observed short (i.e., 10-minute) segments of instruction (i.e., beginning, middle, end) to determine how these segments relate to the overall intervention lesson as well as to student outcomes.
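As an illustration of this sampling approach, the sketch below shows one way ten-minute windows might be drawn from the beginning, middle, and end of a recorded lesson: the lesson is divided into thirds and a window begins at a randomly chosen offset within each third. The function name, the offset range, and the example lesson length are assumptions made for illustration only; the segment-selection procedure actually used in this study is described in Chapter III.

```python
# Illustrative sketch (not the study's actual tooling): draw a ten-minute window
# from the beginning, middle, and end thirds of a recorded lesson. The lesson
# length, offset range, and function name are assumptions for the example.

import random

SEGMENT_MINUTES = 10

def sample_segments(lesson_minutes, max_offset_minutes=4, seed=None):
    """Return (label, start, end) tuples, in minutes, for one window per third."""
    rng = random.Random(seed)
    third = lesson_minutes / 3
    segments = []
    for i, label in enumerate(["beginning", "middle", "end"]):
        # Window starts at a random offset within the first few minutes of the third;
        # for shorter lessons the window may overlap the next third or be truncated.
        start = i * third + rng.randint(0, max_offset_minutes)
        end = min(start + SEGMENT_MINUTES, lesson_minutes)
        segments.append((label, round(start, 1), round(end, 1)))
    return segments

# Example: a 25-minute lesson video.
for label, start, end in sample_segments(25, seed=1):
    print(f"{label}: minutes {start}-{end}")
```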
Summary and Conclusions The current climate in education has placed greater emphasis on evaluating the effectiveness of teachers to ensure students are receiving the highest quality instruction (McGuinn, 2012; National Association of Teacher Quality, 2012). Evaluating teacher effectiveness is viewed as a means for ensuring teacher quality and determining needs for professional development (Danielson, 2007). While multiple methods for evaluation have been researched and used in schools, observation remains one of the most direct ways to evaluate the quality of classrooms and instruction (Chomat-mooney et al., 2008). Multiple tools have been developed and tested, primarily with a focus on general education, but these tools are often complex and cumbersome, creating many challenges in implementation, particularly in using these tools in intervention settings (Gargani & Strong, 2014). 48 One major challenge with using general education observation tools within intervention settings is the focus on measurement of instructional strategies that may not be appropriate or effective in settings in which a different type of instruction is expected, valued, and needed (i.e., direct, explicit instruction; Johnson & Semmelroth, 2012; Jones & Brownell, 2014). If these general education tools are used in intervention settings, interventionists may be misidentified as being in need of professional development when in fact the type of instruction they are providing is simply not a match for the observation tool, yet effective at improving student outcomes. Another major difficulty with current observation tools is the resource-intensive nature of both training and observation (Gargani & Strong, 2014). Due to the large scope of instruction that these tools are attempting to measure, they require extensive, time-consuming training to ensure reliability of raters. In addition, observations themselves can be time-consuming leaving little time for providing necessary feedback to improve instruction (The New Teacher Project, 2013). In the current state of education, resources within schools are already stretched thin so requiring administrators and coaches to complete extensive training (e.g., approximately 16 hours) and spend considerable time carrying out observations (e.g., four 30-minute observation cycles) and providing feedback to improve instruction is prohibitive. Students who are receiving intervention are arguably in need of the best quality instruction, but because of these challenges, interventionists are unlikely to be provided feedback that is frequent enough to improve instruction. To optimize the utility of observation tools to ensure greater student outcomes in intervention settings, they must be specific enough to provide feedback that can improve instruction, yet efficient enough to allow for succinct training and relatively short 49 observation periods. Currently, there is no observation tool that can efficiently provide specific feedback that can be used in a responsive instructional cycle. The development of the QIDR (Harn, et al., 2012) is an important step in providing a tool that can provide specific feedback to improve intervention instruction. The purpose of the current study is to determine if the QIDR tool can be used to garner reliable and valid measures of instructional quality without having to watch an entire instructional session (e.g., 25-33% of intervention). 
The next chapter will describe the methods as well as the population and data set that were used to explore these questions.

CHAPTER III
RESEARCH METHODS

Implementation science has revealed that one of the factors that can improve implementation of an intervention is providing frequent feedback to interventionists (Fixsen et al., 2013; Odom, 2008). Providing feedback is likely to improve intervention instruction, which can, in turn, help ensure that the students most in need are receiving the highest quality instruction (Connor, 2013). To provide this frequent feedback, it will be necessary to make the process more efficient while still maintaining the quality of the feedback. In an earlier study, the QIDR observation tool was used to measure the instructional quality of entire lessons (i.e., 20-30 minutes) delivered by seven interventionists to 35 kindergarten students considered at risk for reading difficulties. Sixty-four videotaped full-length lessons were used in the earlier study and coded using the QIDR. The current study utilized the same instructional videos and the QIDR to measure quality of delivery from systematically sampled lesson segments (i.e., beginning, middle, and end) of the same videos. These samples were then compared to the previously coded full-length lessons. This study employed Classical Test Theory (which states that all observed scores are composed of a true score and error, both random and systematic) to determine whether an observation tool designed to measure intervention implementation, the Quality of Intervention Delivery and Receipt (QIDR), could be used reliably on short samples (i.e., ten minutes) of a lesson. This chapter provides an overview of the existing data set, as well as the methods that were used to analyze the data to answer the following questions:

1) Can adequate inter-rater reliability (IRR) be obtained after observing ten minutes of full-length intervention lessons?

2) Using the QIDR, what is the relationship between scores obtained watching the full lesson versus sampling ten minutes of the lesson?

3) To what extent does the relationship between QIDR ratings obtained watching the full lesson, versus sampling ten minutes of the lesson, depend on lesson segment (i.e., beginning, middle, or end) or on intervention phase (i.e., 2nd week, 5th week, or 8th week)? In other words, are correlations between the ratings systematically stronger or weaker based on lesson segment or intervention phase?

4) Which QIDR ratings (full lesson vs. ten-minute sample; beginning, middle, end; intervention phase) account for the most variance in student outcomes?

Setting and Participants

Setting. The original study, from which the videos for the current study were obtained, used data collected from two elementary schools in a school district in a mid-size city in the Pacific Northwest. The first school involved in the study had 646 kindergarten through 8th grade students in the 2011-2012 school year. According to the state department of education school report card for 2011-2012, 58% of students were considered economically disadvantaged, 17% were classified as students with disabilities, and 11% were considered limited English proficient and participated in English as a second language programs. The demographic make-up of this school included 69% White (not of Hispanic origin), 19% Hispanic origin, 4% Asian/Pacific Islander, 3% American Indian/Alaskan Native, 2% Black (not of Hispanic origin), and 3% multi-racial/multi-ethnic.
The second school had an enrollment of 285 students in kindergarten through fifth 52 grade during the 2011-2012 school year. According to the state department of education report card, 79% percent of these students were considered economically disadvantaged, 15% were students with disabilities, and 15% were limited English proficient, with 13% of students participating in English as a second language programs. The demographic make-up included 63% White (not of Hispanic origin), 24% Hispanic origin, 4% Black (not of Hispanic origin), 3% Asian/Pacific Islander, and 5% multi-racial/multi-ethnic. Student participants. Kindergarten children in the two schools received half-day kindergarten and those who were identified as at-risk for reading difficulties were given the opportunity to participate in an intervention program that was entitled Super K. Students in the Super K program either stayed after or arrived early for their regular classroom day to receive approximately 30 minutes of reading intervention. Students were selected for the Super K program through use of Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good et al., 2002). DIBELS Letter Naming Fluency (LNF) and Initial Sounds Fluency (ISF) were administered to all kindergarten students in the fall to identify those in need of intervention. LNF presents students with a page of randomly-arranged upper and lowercase letters and asks students to name as many as possible in one minute. One-month, alternate-form reliability of LNF is .88 in kindergarten (Good et al., 2004). Criterion-related validity with the Woodcock- Johnson Psycho-Educational Battery-Revised Readiness Cluster standard score is .70 in kindergarten (Good, et al., 2004) and predictive validity of kindergarten LNF with first grade Woodcock-Johnson Psycho-Educational Battery-Revised Readiness Cluster standard score is .65 (Good, et al., 2004). 53 Those students who scored less than three sounds or letters per minute on ISF and LNF, respectively, were considered most at-risk and invited to participate in the Super K program. As a result of the screening process, 37 students across the two schools were included in the Super K program. Consent was obtained from 35 students, but four students moved prior to the end of the intervention, leaving a remaining final sample of 31 students. Seven of the 31 students received special education services and eleven were considered English language learners (ELLs). Interventionists. In the original study, seven instructional assistants delivered instruction during the Super K intervention program. In these schools, it was typical for instructional assistants to deliver intervention programs under the guidance of a certified teacher. The interventionists involved in this study had between nine and 15 years of experience as instructional assistants and three to 14 years of experience using the specific reading programs delivered during the Super K program. Observers. During the original study involving the Super K program, a team of seven observers coded all 64 videos within the data set in their full-length form using the QIDR. For the current study, an additional team of observers were trained to use the QIDR to code ten-minute samples of the same videos for comparison. This team was originally comprised of seven observers, with two from the original observation team, but reliability issues reduced the final number of coders to five, maintaining the two experienced coders. 
Given that observers in the original study had a wide variety of backgrounds and levels of experience in the field of education, observers were recruited for this study from both graduate students and practicing teachers. The eliminated coders included one graduate student and one practicing general education teacher. The final set 54 of coders included three graduate students with former elementary-level teaching experience, one graduate student with no teaching experience, and one practicing general education elementary teacher. Intervention Programs Super K intervention. Students in the Super K program received instruction using either Early Reading Intervention (ERI; Simmons & Kame’enui, 2003), Reading Mastery (Engelmann et al., 2002), or a combination of both programs. Both of these programs are scripted and use explicit instruction principles (Archer & Hughes, 2011; Brophy, 1988; Brophy & Good, 1986; Rosenshine & Stevens, 1986) focusing on development of phonological awareness and alphabetic principle skills. The intervention was supplemental to the school’s core reading program, and occurred either before or after students’ regular instructional day. Intervention instruction was delivered in small groups of 3-5 students, with an average of 34 intervention sessions (range 28-41) provided to participants. Measures Instructional implementation measure. The QIDR (Harn et al., 2012) was used to measure instructional implementation across the videos. The QIDR is designed to be used to evaluate overall quality of an intervention based on key elements of explicit instruction. The QIDR is an observation instrument with a global approach to measuring instructional quality that is looking at multiple facets of instruction within the context of small group intervention. Items within the QIDR reflect key instructional elements of explicit instruction which are commonly used in intervention settings to accelerate academic achievement in students who are at-risk or in need of remediation (e.g., Gersten 55 et al., 1997; Swanson, 1999). See Chapter 2 for a complete description of the QIDR observation tool, as well as preliminary validation information. Student outcome measure. The student outcome measure that was used to investigate the fourth research question for this project was the Word Attack (WAT) subtest of the Woodcock Reading Mastery Tests—Revised (WMRT-R; Woodcock, 1987). Scores were obtained at the beginning and end of the intervention, but for purposes of this investigation, only post-test scores were used. The WAT subtest assesses phonetic decoding skills by presenting real and nonsense words of increasing difficulty for students to read aloud. The publisher reports a split-half reliability range of 0.86-0.94 for the WAT subtest. Concurrent validity ranges for total reading of the WRMT-R are reported to be from 0.85-0.91 when compared to the full scale reading test of the Woodcock-Johnson Psycho-Educational Battery for grades 1, 3, and 5 (Woodcock & Johnson, 1989). Video Data Set Full-length videos. During the initial study, intervention sessions for a total of eight groups were recorded once each week, with one interventionist delivering instruction to two different groups. Each group recorded between seven and nine sessions resulting in a video data set of 64 videos. The average length of the videos was 25 minutes. Video segment selection. 
From the data set of 64 videos, segments were systematically sampled from all videos collected during weeks two, five, and eight of the intervention study. These weeks were chosen to ensure that instruction was being observed throughout all phases of the intervention. These particular weeks were selected 56 because one school began recording during the first week of the study, but the other school didn’t begin recording until the second week, so by beginning with the second week, the time of year in school was held constant across sites. The eighth week was chosen because it is the last common week across both sites, and the fifth week was equidistant from the 2nd and 8th weeks. Videos from these three weeks range in length from 15 minutes to 29 minutes with the average video length for these three weeks being approximately 25 minutes. Allowing for three observations of each group resulted in nine separate ten-minute segments for each interventionist, except for the interventionist who instructed two groups, for whom there were 18 ten-minute segments. The original intent of the study was to examine the use of six-minute lesson segments to measure implementation. Due to reliability issues (which will be addressed within the observation procedures section of this chapter and in the reliability section of chapter 4), the decision was made to increase the length of video segments to ten minutes. To gather the ten-minute samples from the lesson, each video was evenly divided into three sections (beginning, middle, and end) and ten consecutive minutes were randomly-selected from each of these three sections. Random selection occurred through use of a random number generator to choose the starting point for the video segment from minute one through minute four. The following consecutive ten minutes from that starting point were used. Each segment was randomly-coded to ensure that observers were blind to segment of the lesson (beginning, middle, end) or intervention phase (2nd week, 5th week, 8th week). Due to length of some full-length videos, overlap in segments sometimes occurred. Selection of ten-minute segments was informed by the work of Pratt and Logan (2014), discussed in the previous chapter, and Giraletto and Weitzman (2002) 57 who were able to accurately and reliably code pre-school teacher-child language interactions shorter segments of instruction. Although Giraletto and Weitzman (2002) used slightly different segment lengths, this study and that of Pratt and Logan (2014) demonstrated that reliable and valid subsamples could be collected from an overall lesson. Training and Observation Procedures Training procedures. Observers were trained to use the QIDR following the same protocol as the initial study. Each observer was required to attend two 3-hour training sessions conducted by the principal investigator. During these sessions, observers were introduced to general procedures involved in observation and rating of videos, as well as the QIDR coding scale, general characteristics of the QIDR rubric, and tools to aid in note-taking during observation. Coders were then introduced to the items in the rubric in subsections. The subsections are groupings of items within the rubric that have some commonalities. The first part of rubric training involved the first four items which are related to organization and management (e.g., organization of materials, smooth transitions). 
The next segment of training involved the three items on the rubric most related to provision of emotional and behavioral support during instruction (i.e., positive reinforcement, response to problem behaviors, response to emotional needs). The final section of the rubric involves four items related to instructional practices (i.e., consistent wording, clear signals, modeling, and error corrections). Observers were asked to explore the differences between different levels of implementation for each of the items within a subsection and were then presented with a video segment (approximately five minutes in length) and asked to score the instruction based on only the items for the subsection being 58 discussed (e.g., organization, management). After independent scoring for each subsection, an opportunity for comparison with “true” scores (derived using a consensus- building approach during development of the QIDR), and discussion regarding justification for those scores, was provided. A different video segment was presented for each of the different segments of the rubric. Once all three subsections of the rubric had been presented with opportunities for practice scoring, a full-length practice video was presented and observers scored using all 19 items contained in the rubric. “True” scores were then presented to observers for comparison along with opportunities for discussion of discrepancies. Additional examples and practice were provided if observers had multiple discrepancies of more than one point off of “true” scores for practice videos. Once initial training was complete, observers coded a practice video independently. If observers were able to obtain 80% within one-point agreement with “true” scores, they were assigned their first set of video segments to score. All coders achieved this level of reliability at the onset of coding cycles. Observation procedures. Video segments were randomly assigned to observers, with an eye toward ensuring balance across coders (i.e., one observer is not coding videos from the same interventionist, lesson segment, or intervention phase disproportionately). Thirty-six percent of the video segments were coded by two or more observers for purposes of measuring inter-rater reliability. Although other studies have relied on double-coding of a lower percentage (e.g., Pratt & Logan: 14%; Pianta, et al., 2008: <10%), these studies typically involve much larger data sets than were utilized for this study. Therefore, a higher percentage was selected to ensure reliability of observations. 59 Assignment of each video segment was stratified so that a different coder was randomly-assigned to each segment taken from each full-length video. Once video segment assignment had been completed, each observer received a file containing only the video segments that they would be scoring using the QIDR. Inter-rater reliability (IRR). IRR was assessed on a randomly-selected 36% (n=26) of video segments. Coders were systematically assigned to distribute reliability videos evenly across coders to ensure all possible coder pairs were utilized in the analyses. Reliability checks were completed weekly during data collection to ensure that rater drift was not present and so that re-calibration could occur as needed. This process, described in more detail in Chapter IV, led to both the increase in video lengths from six to ten minutes, and the elimination of two coders when re-training and calibration efforts were unsuccessful for those two coders. Confidentiality. 
Informed consent was collected from all student and instructional assistant participants at the beginning of the original Super K project. Additional measures were taken to protect participant confidentiality throughout this project. All observers were required to complete CITI training and sign a confidentiality agreement instructing observers to not share videos, the observation tool (i.e., QIDR), and data from the project. All observer records were collected and destroyed once analysis was complete, and videos were deleted from observer computers at the end of the project. Experimental Design and Analytic Approach The current study examined the relationship between QIDR scores obtained from full-length instructional sessions with scores obtained after viewing only 10-minute 60 segments of the same sessions. Classical test theory techniques for estimating reliability were performed on all factor scores. The factor scores within this study include group, full-length lessons, segment of the lesson (e.g., beginning, middle, end), and intervention phase (2nd week, 5th week, 8th week). The goal when using classical test theory is to determine the degree to which variations in the conditions of measurement (e.g., different observers and different lessons) affect the consistency with which a construct is measured (Briesch et al., 2014). This theory assumes that any observed score is composed of a true score and some degree of measurement error. Since measurement of a true score is not possible, classical test theory allows the researcher to estimate the true score by determining the average score obtained across the administration of a hypothetically infinite number of parallel measurements (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Each observed score represents an attempt to estimate the true score, but the presence of some degree of random and predictable error is always assumed (Osterlind, 2006). There are inherent limitations with using classical test theory within this study. The most notable limitation is that classical test theory only explains one general source of error, not taking into account the possible various sources of error (Brennan, 2010). For the purposes of this study, however, classical test theory is a sufficient method for estimating the reliability of measurements under the specified test conditions given that the purpose is simply to determine levels of reliability within each condition. The following section will outline the analysis methods that were used to answer each research question. 61 Can adequate inter-rater reliability (IRR) be obtained after observing 10 minutes of 30-minute full-length intervention lessons? Research involving the CLASS (e.g., Pianta, et al., 2008; Pratt & Logan, 2014) has used percent agreement within one point between observers. Hallgren (2012) criticized the use of this method given that percentage of agreements does not correct for agreements expected by chance and consequently overestimates the level of agreement. Because of this, the first research question was addressed using one-way random, single-measures ICCs with absolute agreement. Adequate inter-rater reliability was defined as an ICC value of .60 or above as this is a commonly-cited cutoff for a good rating of agreement based on ICC values (Cichetti, 1994). Recommendations by Hallgren (2012) guided the selection of this particular method. 
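To make the selected statistic concrete, the following minimal sketch, assuming a complete segment-by-rater matrix with the same number of raters per segment and purely hypothetical QIDR totals, shows how a one-way random, single-measures, absolute-agreement ICC (the form sometimes labeled ICC(1,1)) can be computed from the standard one-way ANOVA mean squares. It is an illustration only, not the procedure or software used in this study.

```python
import numpy as np

def icc_one_way_single(ratings):
    """One-way random, single-measures, absolute-agreement ICC.

    ratings: 2-D array-like, rows = rated video segments,
             columns = raters (same number of raters per segment).
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape                       # n segments, k raters per segment
    grand_mean = ratings.mean()
    segment_means = ratings.mean(axis=1)

    # One-way ANOVA decomposition: between-segment vs. within-segment mean squares
    ms_between = k * np.sum((segment_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - segment_means[:, None]) ** 2) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical QIDR sum scores: six segments, each rated by two coders
example_ratings = [[36, 38], [42, 40], [25, 28], [50, 47], [31, 33], [44, 46]]
print(round(icc_one_way_single(example_ratings), 2))  # ~0.96 for this toy data, well above the .60 criterion
```

In the actual study the double-coded subset was not rated by the same raters throughout, which is precisely why the one-way (rather than two-way) model was selected; the sketch simply assumes a balanced matrix for brevity.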
The first consideration when choosing a method for calculating IRR using ICCs involved how many observers would code each video segment. Although having all coders code all video segments is theoretically preferred, the time-intensive nature of carrying out this design was prohibitive (Hallgren, 2012). Therefore, all observers did not view all video segments, calling for a one-way model rather than a two- way model in which all observers would code all segments. For the purposes of this study, a subset of videos was rated by two or more randomly-selected observers (36%; N = 26), while the remainder were coded by randomly-selected single observers. According to Hallgren (2012), the next consideration involves whether or not good IRR is achieved by absolute agreement or consistency in ratings (i.e., rank ordering). Within this study, the purpose was to investigate the ability of coders to provide similar ratings with each other, not whether each observer’s ratings remained consistent across the study, therefore absolute agreement was necessary. 62 Hallgren (2012) also recommends that the unit of analysis be considered when selecting a method for measuring IRR using ICCs. If all videos were being coded by multiple observers with the average of their ratings being used for hypothesis testing, average-measures ICCs could be used. Given that this study involved a subset of videos coded by two or more observers that were meant to generalize to the videos rated by only one observer, a single-measures ICC was used. Using the QIDR, what is the relationship between scores obtained watching the full lesson versus sampling ten minutes of the lesson? The second research question involved the relationship between scores obtained after observations of the full lesson versus the ten-minute samples and was calculated using Pearson product-moment bivariate correlations (Field, 2013; Miles & Banyard, 2007). Scores obtained after viewing each ten-minute segment were compared to scores obtained after viewing the corresponding full-length video to determine the strength of the relationship between the segments and the full-length videos. The analysis for the third research question, which also examined the relationship between scores obtained in the various time segments involved a two-step process. First, general descriptives were obtained including means, standard deviations, and correlations. Next, a two-way, within-subject, repeated factors ANOVA was used to test for equivalence. Within-subject, repeated measures analysis was used because the analysis was conducted comparing repeated measures of the separate lesson segments and intervention phases for each interventionist using the same measurement tool across all conditions. Repeated measures ANOVA allows a comparison of several means obtained from the same subjects (Field, 2012). The two predictor variables were QIDR 63 scores obtained for the full-length lessons, the three lesson segments (beginning, middle, and end), and the three intervention phases (2nd week, 5th week, and 8th week). The dependent variable for this analysis was sum scores of the QIDR using the first nineteen items within the tool. Which QIDR ratings (full lesson vs. 10-minute sample; beginning, middle, end; intervention phase) account for the most variance in student outcomes? This research question was answered using hierarchical linear modeling (HLM) analyses. Multi-level modeling such as HLM is appropriate due to the nested nature of the data (Luke, 2004; Field, 2012). 
In this study, students are nested within groups, which are nested within schools. Because each segment is taken from a common full lesson, independence of the data is unlikely; multi-level modeling is therefore appropriate because it models the relationship between residuals that are dependent in nature (Field, 2012). For the purposes of this study, only two levels were considered, students and groups. To examine reading achievement and determine the effect of group membership on student WAT scores, students are level one in the model because they are nested within groups, which are level two. The following equations were used to build the model at the student level:

Level one: $Y_{ij} = \beta_{0j} + r_{ij}$

Level two: $\beta_{0j} = \gamma_{00} + u_{0j}$

In the level one equation for this model, $Y_{ij}$ represents the WAT score for student i in group j, $\beta_{0j}$ represents the mean WAT score for group j, and $r_{ij}$ represents the residual for the individual student. In level two of this model, $\gamma_{00}$ is the grand mean and $u_{0j}$ represents the variability of WAT scores between groups.

Next, to examine the effect of average QIDR scores in each condition (i.e., full-length, lesson segment, or intervention phase) on student outcomes as measured by WAT, an additional multilevel model was built. A different model was used to measure the effect of each particular time segment or phase: Model 2 included the full-length QIDR scores as the predictor, Models 3-5 addressed the lesson segments, and Models 6-8 addressed the intervention phases. The QIDR scores for each specific time segment or phase for each group were entered as separate analyses, but each analysis followed the same general equation model:

Level one: $Y_{ij} = \beta_{0j} + r_{ij}$

Level two: $\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{QIDR}_{j}) + u_{0j}$

In level one of this model, $Y_{ij}$ represents the WAT score for student i in group j, $\beta_{0j}$ represents the mean WAT score for group j, and $r_{ij}$ represents the error term. In the level two equation, $\gamma_{00}$ is the grand mean, $\gamma_{01}$ represents the effect of the group's QIDR score on the group mean WAT score (slope), and $u_{0j}$ is the variability of reading scores between groups.

Once the relationship between QIDR scores and student outcomes had been examined separately, an analysis of the relationship between specific time segments and their ability to explain variance in student outcomes was employed. An examination of model statistics and calculations of pseudo-R² were used to identify the variance explained by each model. The results of these calculations were then compared to determine which models explained more or less variance in student outcomes by group.

CHAPTER IV

RESULTS

Descriptive Analysis

Before performing any statistical analyses, raw student data for the Word Attack (WAT) subtest of the Woodcock Reading Mastery Tests—Revised (WRMT-R; Woodcock, 1987), as well as group-level Quality of Intervention Delivery and Receipt (QIDR; Harn et al., 2012) observation data for each video segment and full-length video, were examined using SPSS 21.0 for Windows. Final lesson segments were ten minutes in length and were obtained from full-length intervention lessons. As discussed in Chapter 3, lesson segments were increased from six minutes to ten minutes due to issues of reliability, which will be discussed in more detail later in this chapter. Lesson segments were selected from the beginning, middle, and end of each lesson. Results regarding intervention phases will also be discussed in this chapter.
Intervention phases are defined as periods of time within the entire ten-week intervention: Phase A is instruction that occurred during the 2nd week of intervention, Phase B during the 5th week, and Phase C during the 8th week.

Descriptive statistics. Descriptive data for student WAT outcomes are provided in Table 2. The WAT Standard Score (SS), obtained after the intervention, was used for analysis. Only students with complete data were included in the analytic sample. Out of the 35 children in the original study, complete data were available for 31 students, so those 31 students comprised the final analytic sample for all analyses. The average WAT standard score for these at-risk kindergarten students was 99.9, with a range of 94-114.

Scores obtained from observations of the 10-minute segments and full-length videos, using the QIDR tool, were also examined. These scores consisted of the sum of the 15 items on the QIDR pertaining to instruction, as well as the four items pertaining to student response. Mean scores obtained using the QIDR tool varied: full-length observations yielded the highest mean score (M = 36.80, SD = 11.33), whereas lesson segment observations conducted at the end of the lessons had the lowest mean value; the middle lesson segment mean was most similar to the full-length observation value, and the beginning lesson segment mean fell between the end and middle values. All segment score standard deviations were quite similar, ranging from 11.15 for the middle segment to 11.48 for the beginning segment. See Table 3 for complete descriptive statistics.

During each intervention phase, a QIDR score was also obtained for each group, and the mean of the three segment scores within each intervention phase was examined. In general, mean scores decreased across intervention phases, while variability increased. Mean QIDR scores ranged from 32.26 to 38.27, and standard deviations ranged from 8.00 to 13.08. Descriptive data for overall QIDR scores, lesson segments, and intervention phases are found in Table 3, descriptive statistics presented by group and lesson segment are provided in Table 4, and descriptive statistics by group and intervention phase are provided in Table 5. For a visual representation of the lesson segment and intervention phase data by group, boxplots are provided in Figures 1 and 2.

Table 2
Student Outcome Descriptive Statistics

                            N     Min    Max    M      SD
WRMT-R Word Attack (SS)     31    94     114    99.9   7.47

Note.
WRMT-R = Woodcock Reading Mastery Tests—Revised; SS = Standard Score.

Table 3
Descriptive Statistics of Overall Quality of Intervention Delivery and Receipt Scores by Lesson Segment and Intervention Phase (N = 24)

Overall         M       SD      Min     Max     ICCs
Beg Segments    35.75   11.48   16.00   56.00   .72
Mid Segments    36.23   11.15   15.00   52.00   .62
End Segments    35.41   11.87    9.00   50.00   .77
Full-Length     36.80   11.33   17.00   56.00   .81
Phase A         38.27    8.00   26.00   53.00
Phase B         36.86   11.90   15.00   56.00
Phase C         32.26   13.08    9.00   50.50

Table 4
Descriptive Statistics of Quality of Intervention Delivery and Receipt by Group and Lesson Segment (N = 3)

Group                 M       SD      Min     Max
1   Beg Segment       42.67    6.11   36.00   48.00
    Mid Segment       40.43    3.60   36.80   44.00
    End Segment       48.00    2.65   45.00   50.00
    Full-Length       47.33    8.08   38.00   52.00
2   Beg Segment       44.50    3.28   41.50   48.00
    Mid Segment       39.83    4.48   37.00   45.00
    End Segment       42.50    6.76   35.50   49.00
    Full-Length       38.67   10.26   30.00   50.00
3   Beg Segment       42.33    7.77   36.00   51.00
    Mid Segment       38.83    6.45   33.50   46.00
    End Segment       45.27    5.83   41.80   52.00
    Full-Length       40.40    8.12   32.00   48.20
4   Beg Segment       37.00    3.12   34.50   40.50
    Mid Segment       39.00    6.25   34.00   46.00
    End Segment       42.33    4.51   38.00   47.00
    Full-Length       36.80    8.91   27.00   44.40
5   Beg Segment       49.33    7.02   42.00   56.00
    Mid Segment       53.17    2.75   50.50   56.00
    End Segment       47.54    2.65   44.50   49.33
    Full-Length       47.67    2.08   46.00   50.00
6   Beg Segment       26.67    0.58   26.00   27.00
    Mid Segment       20.83    5.01   16.00   26.00
    End Segment       20.11    8.00   15.00   29.33
    Full-Length       23.42   12.76    9.00   33.25
7   Beg Segment       27.00    3.61   24.00   31.00
    Mid Segment       25.00    6.00   19.00   31.00
    End Segment       25.33    4.51   21.00   30.00
    Full-Length       25.33    3.06   22.00   28.00
8   Beg Segment       20.33    4.93   17.00   26.00
    Mid Segment       26.67    9.87   20.00   38.00
    End Segment       26.33    7.37   18.00   32.00
    Full-Length       23.00    8.19   16.00   32.00

Table 5
Descriptive Statistics of Quality of Intervention Delivery and Receipt by Group and Intervention Phase (N = 3)

Group             M       SD      Min     Max
1   Phase A       40.60    7.29   36.00   49.00
    Phase B       44.50    3.77   40.50   48.00
    Phase C       46.00    3.47   44.00   50.00
2   Phase A       36.17    0.76   35.50   37.00
    Phase B       48.00    2.65   45.00   50.00
    Phase C       36.83    6.53   30.00   43.00
3   Phase A       48.73    3.04   46.00   52.00
    Phase B       42.80    5.19   37.00   47.00
    Phase C       35.83    5.39   32.00   42.00
4   Phase A       38.33    4.04   34.00   42.00
    Phase B       39.00    6.25   34.00   46.00
    Phase C       37.00    9.54   27.00   46.00
5   Phase A       49.17    4.31   44.50   53.00
    Phase B       50.44    5.09   46.00   56.00
    Phase C       48.77    1.75   47.00   50.50
6   Phase A       29.53    3.63   26.00   33.25
    Phase B       21.17    6.53   15.00   28.00
    Phase C       13.67    4.04    9.00   16.00
7   Phase A       29.67    1.53   28.00   31.00
    Phase B       24.00    1.73   22.00   25.00
    Phase C       22.00    3.61   19.00   26.00
8   Phase A       34.00    3.46   32.00   38.00
    Phase B       24.00    4.36   21.00   29.00
    Phase C       18.00    2.00   16.00   20.00

Figure 1. Boxplots of group QIDR scores by lesson segment.

Figure 2. Boxplots of group QIDR scores by intervention phase.

Normality was examined for all variables in the data set. Although the WAT scores demonstrated distribution statistics that were within the acceptable range based on various rules of thumb (i.e., skewness +/- 1 and kurtosis +/- 2; George & Mallery, 2010; Tabachnick & Fidell, 2013), with skewness of 0.69 (SE = 0.42) and kurtosis of -1.18 (SE = 0.82), an examination of the WAT score histogram suggested that these data were positively skewed. As such, it was determined that normality could not be assumed within this data set, and additional steps were taken to ensure the data were appropriate for the proposed analyses.
Specifically, because fifty-eight percent of students (n = 18) obtained the minimum standard score of 94 (raw score = 0), floor effects for the WAT measure were examined. Previous research has noted that it is common to see floor effects with measures of early literacy, particularly with children who have had little or no exposure to early literacy instruction (Catts, Petscher, Schatschneider, Bridge, & Mendoza, 2009). The children included in this data set, however, had received approximately 10 weeks of literacy intervention on top of their general education literacy instruction prior to the post-test WAT measure; therefore, the floor effect in this data set is likely due to the at-risk nature of the kindergarten students included in the intervention (Forbes-Spear, 2014). Forbes-Spear (2014) determined that when the standard scores of 94 were removed from the data set, bivariate correlations were systematically higher, and that including the scores of 94 would therefore provide a more conservative estimate of the relationships between WAT and the implementation variables that were examined.

To verify this relationship within the current study, bivariate correlations were run for both the full sample of scores (n = 31) and the restricted sample with the scores of 94 removed. These correlations are displayed in Table 6. Similar to Forbes-Spear's (2014) findings, correlations between the full sample and the QIDR lesson segment measures are lower than those of the restricted sample, with the exception of the end segments, for which the full- and restricted-sample correlations were 0.50 and 0.49, respectively. It should be noted that all correlations using the full sample of WAT scores were significant at p < .05, with the exception of the middle segment correlation, which was not statistically significant. No correlations were statistically significant for the restricted sample; however, the lack of significance within the restricted sample may be a result of the reduced sample size. As Forbes-Spear (2014) determined, the fact that the correlations using the full sample were smaller indicates that using the full sample provides a more conservative estimate of the relationship between QIDR scores and WAT scores. Multi-level models are also more effective at accommodating violations of normality (Maas & Hox, 2004). For these reasons, the full sample of WAT scores was included for all analyses.

Table 6
Bivariate Correlational Analysis of Group Differences Between Full and Restricted Samples

QIDR Score            Full Sample WAT_SS (n = 31)    Restricted Sample WAT_SS (n = 13)
Beginning Segments    0.41*                          0.46
Middle Segments       0.29                           0.52
End Segments          0.50**                         0.49
Full Length           0.45*                          0.49

Note. WAT_SS = Word Attack Standard Score; QIDR = Quality of Intervention Delivery and Receipt.
*p < .05; **p < .01

Data for each of the segments, as well as the full-length measures, of the QIDR were also examined for skewness, kurtosis, and severe outliers; all fell within the normal distribution range, with no severe outliers, skew, or kurtosis. Bivariate scatterplots of WAT scores and QIDR scores, including lesson segment and full-length observations, were examined and revealed no significant outliers and no notable differences between each lesson segment and the full-length measures. Bivariate scatterplots were also generated to compare segment scores with full-length measures of QIDR.
These scatterplots indicated that all segments and the full-length measure had clear linear relationships. Testing of model assumptions. Assumptions were assessed for each multi-level model by examining the final model residuals using HLM version 7.01 for Windows (Raudenbush, Bryk, & Congdon, 2013) and SPSS 21.0 for Windows. Even considering the floor effects on the outcome variable (WAT), residuals were normally distributed and independent. Residuals obtained from analyses of each lesson segment length of the QIDR were also normally distributed and independent. Results Research Question 1: Can adequate inter-rater reliability (IRR) be obtained after observing only 10 minutes of full-length intervention lessons? Thirty-six percent of the video segments (n = 26) were selected through stratified random sampling, controlling for both lesson segment and intervention phase, to assess inter-rater reliability. Seven coders were initially trained to observe and code video segments. As discussed in the methods section of this document (Chapter 3), the original intent of the study was to explore the use of six-minute lesson segments for scoring using the QIDR. After training, these seven coders were assigned the first set of six-minute lesson 75 segments for coding, and reliability was assessed. Cichetti (1994) provided the following guidelines for acceptable ICC ratings: values between .60 and .74 classified as good, and between .75 and 1.0 as excellent agreement. The reliability for the first week was extremely low, with an ICC of .06. Retraining and recalibration was attempted, but even after these efforts, an ICC of .20 was achieved, which was also much lower than the “good” rating suggested by Cichetti (1994). An examination of scores on specific items, as well as specific raters, was conducted to determine the source of issues of reliability. With the data collected to this point, no clear patterns emerged, suggesting that there was not a correctable problem. It was hypothesized that the multi-faceted nature of the QIDR tool may be impacting observation with six-minute segments. Although the original six- minute length was informed by the Snippets research (Pratt & Logan, 2014), the Snippets tool had a much narrower focus (i.e., looking for use of discrete reading comprehension strategies) than the QIDR, so it was hypothesized that increasing the length to account for the more complex and multi-faceted nature of the QIDR observation tool might increase the possibility of gaining a more acceptable level of agreement. Even with this increase to ten-minute segment lengths, reliability was variable across coding weeks. After the first week of coding of ten-minute videos, the reliability achieved was just under the acceptable level of .60 (ICC = .53), which warranted a re- train conversation with all coders. The next week’s coding elicited a much higher reliability rating (ICC = .80), so the third week’s coding assignments were distributed. The third week also elicited a good level of agreement with an ICC of .64. The final week’s coding assignments were then assigned and the level of agreement for this week was far below an acceptable level (ICC = .23), resulting in an overall reliability of .53. 76 This prompted an in-depth exploration of the scoring patterns of individual coders. To do this, five segments were coded by all coders, one was coded by six of the seven coders, and six were coded by different combinations of five coders. This allowed for a comparison of all possible pairs of coders. 
Through this comparison, it was discovered that two of the seven coders were systematically unreliable with other coders. When these two coders were eliminated, ICCs increased to .70 overall. For this reason, all lesson segments coded by these two coders were eliminated from the sample and randomly re-assigned to the remaining coders. The elimination of these two coders resulted in acceptable levels of reliability across the study.

Final IRR was assessed using a one-way, random-effects, absolute-agreement intra-class correlation (ICC; McGraw & Wong, 1996) to determine the degree to which coders agreed upon ratings of lesson segments. As seen in Table 7, the resulting average ICC for all video segments was in the good range, ICC = .71, indicating that coders had moderate to high agreement. ICCs were also calculated by lesson segment to determine if beginning, middle, or end segments elicited higher rates of inter-rater agreement. The highest level of agreement between raters occurred within end segments (ICC = .77), while the lowest agreement occurred within middle segments (ICC = .62); however, as seen in Table 7, all segments and overall agreement fell within the good or excellent range (Cichetti, 1994). These ratings indicate that measurement error introduced by the final five independent coders was minimal, regardless of observation length and segment of the lesson, and that QIDR ratings were suitable for use in additional analyses in the present study.

Table 7
One-way, Random-effects, Absolute Agreement Intra-class Correlations for Assessing Inter-rater Agreement by Segment and Overall

          Beginning (n = 9)   Middle (n = 10)   End (n = 7)   Overall (n = 26)
ICCs      .72                 .62               .77           .71

Research Question 2: Using the QIDR, what is the relationship between scores obtained watching the full lesson versus sampling ten minutes of the lesson? Pearson product-moment bivariate correlations (Field, 2013; Miles & Banyard, 2007) were used to calculate the relationship between observation scores obtained from full-length lessons and those obtained from ten-minute segments. In addition, lesson segment scores obtained across phases were averaged and correlated with the scores obtained from full-length lessons. Given that all observations were obtained from the same set of videos, strong correlations between scores were expected.

Relationships between all segments and the full-length observations were strong, positive, and statistically significant at the p < .01 level. Full-length observations were most strongly correlated with beginning segments, followed by middle segments and end segments, with correlations ranging from .72 to .81. Table 8 provides an overview of correlational analyses between lesson segments and full-length observations. Intervention phase scores derived from lesson segments were also strongly and significantly correlated with full-length observation scores. The weakest correlation was between Phase A lesson segment scores and full-length observations (r = .77, p < .05), and Phase B and Phase C lesson segment scores were similarly highly correlated with the full-length lesson scores (r = .94, p < .01 and r = .95, p < .01, respectively). Table 9 provides an overview of correlational analyses between intervention phases and full-length lessons.

Table 8
Bivariate Correlations for QIDR Ratings Between Full-length Observations and Lesson Segments (N = 24)

Lesson Segment               1       2       3       4
1. Full-length Observation   -
2. Beginning Segment         .81**   -
3. Middle Segment            .74**   .88**   -
4. End Segment               .72**   .82**   .84**   -

Note. **p < .01

Table 9
Bivariate Correlations for QIDR Ratings Between Full-length Observations and Intervention Phases (N = 24)

Phase                        1       2       3       4
1. Full-length Observation   -
2. Phase A Segment Average   0.77*   -
3. Phase B Segment Average   0.94**  0.75*   -
4. Phase C Segment Average   0.95**  0.80*   0.95**  -

Note. **p < .01; *p < .05.

Research Question 3: To what extent does the relationship between QIDR ratings obtained watching the full lesson, versus sampling ten minutes of the lesson, depend on lesson segment or on intervention phase? Data were analyzed with a two-way, within-subjects, repeated measures ANOVA to test for equivalence. The two within-subjects predictor variables were lesson segment (beginning, middle, end, and full lesson) and intervention phase (2nd week, 5th week, 8th week, and the average overall full lesson). The dependent variable was the total score (i.e., the sum of the first 19 items) of the QIDR. The average of each group's full-length QIDR score was used to calculate the differences. Unadjusted p-values were used to evaluate within-subjects effects because the assumption of sphericity, evaluated with Mauchly's test of sphericity, was found to be tenable for both lesson segment and intervention phase.

The analysis of variance results are reported in Table 10. There was not a statistically significant effect of lesson segment, F(3, 69) = 0.34, p = .80, or of intervention phase, F(3, 21) = 2.85, p = .06. Although the effect of phase was nearing statistical significance, neither lesson segment nor intervention phase statistically significantly explained the variance in scores on the QIDR.

Table 10
One-Way, Within-subjects, Repeated Measures Analysis of Variance Summary Table for the Effects of Lesson Segment and Intervention Phase on QIDR Scores

Source                   df     SS        MS      F      p     ηp²
Within subjects
  Lesson Segment          3      26.73     8.91   0.34   .80   .02
  Error within           69    1814.14    26.29
Within subjects
  Intervention Phase      3     164.19    54.73   2.85   .06   .29
  Error within           21     404.03    19.24

Research Question 4: Which QIDR ratings (full lesson vs. 10-minute lesson segment; beginning, middle, end; intervention phase) account for the most variance in student outcomes? For these analyses, hierarchical linear modeling (HLM) was employed to analyze the variance in WAT scores that could be explained by the full-length model, the lesson segments (beginning, middle, and end), and the intervention phases within the intervention period (2nd week, 5th week, and 8th week). Table 11 provides an overview of the estimates for each model.

Null model. To begin, the null model was used to estimate the variance at each level, with no predictors entered into the model. This analysis indicated that there was significant variance at the student level, t(7) = 48.05, p < .001, as well as significant variance at the group level (p < .001; see Table 11). When ICCs were calculated, it was determined that 56% of the variance in WAT scores occurred at the student level, while 44% of the variance occurred between groups. These results corroborated those found by Forbes-Spear (2014) with the same data set, and led to the same conclusion that multi-level models were the appropriate analyses, given the large variance at the group level.

Full-length QIDR measure.
Next, the average of each group's full-length QIDR scores, obtained from the 2nd, 5th, and 8th weeks of instruction, was entered into the model to determine how much variance could be explained by a score obtained from observing a full-length intervention lesson. The coefficient for the full-length QIDR was not significant, t(6) = 1.63, p = 0.15. Because the small sample size resulted in an underpowered study, it is not unexpected that this relationship was not significant. For this reason, additional parameters were examined in each model to explore how well each lesson segment or intervention phase measure predicted student outcomes. Therefore, for each model, level two ICCs and level two pseudo-R² values were calculated. In this full-length QIDR measure model, the distribution of variance shifted from the null model, with 65% of the variance now at the individual level and 35% of the variance at the group level. This level two variance was significant (p < .01; see Table 11), and adding the full-length QIDR score as a level two predictor accounted for 30% of the variance at level two, pseudo-R² = 0.30.

Lesson segment QIDR measures. Next, each lesson segment was entered into the model separately to determine if a particular lesson segment explained more or less variance in student WAT performance. When the average of each group's QIDR score was entered for each lesson segment, none of the coefficients for the lesson segments (beginning, middle, or end) were statistically significant, t(6) = 1.41, 0.78, and 2.04, respectively, with all p-values greater than .05. It is important to note, however, that the end segment had a p-value that was approaching statistical significance, p = .09.

An examination of model statistics for each lesson segment model revealed slight shifts in the variance explained at each level. When the beginning segment was entered as the individual predictor, the amount of variance at level one was 62%, and 38% at level two. This represented a shift from the null model, but the shift in variance was similar to that observed when the full-length QIDR scores were entered as the predictor. The amount of variance explained at each level was similar to the null model when the middle segment was entered as the individual predictor, with 55% of the variance explained at level one and 45% at level two. However, when the end segment was entered as the individual predictor, there was a more pronounced shift in variance explained when compared to the null model, with 70% of the variance explained at level one and only 30% at level two.

The level two variance in each of the lesson segment models (beginning, middle, and end) was statistically significant. Model 4, in which the middle segment was entered, was significant at the p < .001 level, χ² = 23.47. When pseudo-R² was calculated for this model, the amount of variance explained was negligible, pseudo-R² = -0.05. This finding indicates that middle segments did not provide any explanation of variance in group WAT scores. The other two segments, beginning and end, provided stronger models for predicting group WAT scores. Level two variance for beginning segments was significant, χ² = 19.02, p < .01. By adding the beginning segment as a level two predictor, 20% of the variance at level two was accounted for by the beginning segment QIDR score, pseudo-R² = .20.
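For context, the reported pseudo-R² values are consistent with the standard proportional-reduction-in-variance calculation, in which each conditional model's group-level variance estimate is compared against the null model's; the following worked illustration uses the estimates shown in Table 11 and is offered only as a reading aid, not as a formula taken from the original analysis output:

\[
\text{pseudo-}R^2_{\text{level 2}} = \frac{\tau_{00}^{\,\text{null}} - \tau_{00}^{\,\text{conditional}}}{\tau_{00}^{\,\text{null}}}, \qquad
\text{e.g., } \frac{25.98 - 18.17}{25.98} \approx 0.30 \ \text{(full-length)}, \qquad
\frac{25.98 - 27.39}{25.98} \approx -0.05 \ \text{(middle segments)}.
\]

Viewed this way, the negative value for the middle-segment model simply reflects a group-level variance estimate slightly larger than the null model's, which is why that segment is described as explaining essentially no group-level variance.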
The model with the best fit was the one in which the end segment scores were entered as the individual predictor, χ² = 14.96, p < .01. This model explained 45% of the variance in group-level WAT scores, pseudo-R² = 0.45, explaining more variance in WAT scores than the model in which full-length QIDR scores were entered as the individual predictor (pseudo-R² = 0.30). Model deviance decreased for each of the lesson segment models, but most markedly for the end-lesson segment. This, coupled with the higher pseudo-R² for the end-lesson segment, indicates that the end-lesson segment may be the strongest predictor of group differences in students' WAT scores, while the middle segment of each lesson appears to be a less effective predictor of group differences in students' WAT scores.

Intervention phase measures. Following the examination of the lesson segments as individual predictors, mean QIDR scores for intervention phases were individually entered into the model to determine if average 10-minute QIDR scores within a particular intervention phase were more or less predictive of student outcomes on the WAT. Once again, when the average of each group's QIDR scores was entered for each intervention phase (2nd week, 5th week, 8th week), none of the coefficients for the phases were significant, t(6) = 1.12, 1.34, and 1.36, respectively, with all p-values greater than .05. Level two variance for each of the intervention phases (2nd week, 5th week, 8th week) was significant, χ² = 21.22, 19.33, and 19.03, respectively, p < .05 for all models. Variance explained at each level for each phase of intervention also shifted somewhat from the null model. When the first phase of the intervention was entered into the model, 58% of the variance was explained at level one and 42% at the group level, which was the smallest shift from the null model. The second and third phases both revealed 61% of the variance explained at the individual level, with 39% of the variance at level two. The level two variances were significant, p < .01, for all three models in which phase was entered as the predictor; however, based on the pseudo-R² calculations, none predicted group differences in student outcomes as well as the previous models involving beginning and end segments. For the 2nd-week phase of intervention, 8% of the variance could be explained by QIDR scores for the phase, pseudo-R² = 0.08, while during the 5th week of the intervention, 17% of the variance was explained by QIDR scores, pseudo-R² = 0.17, and during the 8th week of intervention, 19% of the variance
These findings indicate that scores obtained later in the intervention explained the most level 2 variance Table 11 Fixed and Random Effects Estimates Models WAT Posttest Scores by Lesson Segment and Intervention Phase Parameter Model 1 (Null) Model 2 (Full- length) Model 3 (Beg Segment) Model 4 (Mid Segment) Model 5 (End Segment) Model 6 (Phase A) Model 7 (Phase B) Model 8 (Phase C) Fixed Effects Intercept 100.37*** (2.09) 89.65*** (6.79) 100.50*** (1.93) 100.45*** (2.13) 100.56*** (1.57) 100.47*** (2.02) 100.50*** (1.95) 100.49** * (1.94) QIDR Score 0.30 (0.18) 0.19 (0.20) 0.17 (0.21) 0.32 (0.14) 0.33 (0.28) 0.23 (0.17) 0.22 (0.16) Random Effects Group (intercept) 25.98*** 18.17** 20.81** 27.39*** 14.21* 23.74** 21.45** 21.15** Student residual 33.15 33.59 33.45 33.31 33.65 33.36 33.46 33.49 Model Statistics ICC—Level 1 .5606 .6489 .6167 .5488 .7031 .5842 .6093 .6129 ICC—Level 2 .4394 .3511 .3833 .4512 .2969 .4158 .3907 .3871 Pseudo Level 2 0.3004 0.1988 - 0.0542 0.4534 0.0862 0.1743 0.1858 Level 1 -0.0135 -0.0092 - 0.0050 -0.0152 -0.0064 -0.0094 - 0.0104 Deviance 203.23 200.50 200.90 202.03 199.75 200.76 201.32 201.46 Parameters 2 2 2 2 2 2 2 2 Deviance Change -- -2.73 -2.33 -1.20 -3.48 -2.47 -1.91 -1.77 Note. Parentheses denote standard errors. Level two predictors are group centered. *p < .05, **p < .01, ***p < .001 86 CHAPTER V DISCUSSION Implementation science indicates that quality of implementation can only be improved through frequent feedback to interventionists/teachers (Fixsen, Blase, Metz, & Van Dyke, 2013). While multiple observation tools have been developed demonstrating the relation of specific instructional practices and student outcomes in general education (e.g., La Paro, Pianta, & Stuhlman, 2004), few have been developed for use in monitoring special education or small group interventions (Johnson & Semmelroth, 2013). Furthermore, current tools often require extended observation periods (e.g., over 60 minutes, multiple observations across days, etc.) that limit the practicality of use in schools. Additionally, there is a limited understanding of how much of a lesson needs to be observed to determine overall quality. Practitioners need tools that are reliable, efficient, and target key intervention instructional practices that are related to improved student outcomes. This study examines the use of an observation tool, the Quality of Intervention Delivery and Receipt (QIDR; Harn, Forbes-Spear, Fritz, & Berg, 2011), an implementation measure specifically designed for monitoring small group intervention. Prior efforts have demonstrated that the QIDR correlates with other commonly used measures (i.e., CLASS, and opportunities to respond) and accounts for significant variance in student outcomes (Forbes-Spear, 2014). The focus of this study was to examine issues related to how long an observation needs to be, as well as how scores from observations conducted during specific portions of a lesson or intervention are related to student outcomes using the same data set. This study addressed the following research questions: 87 1) Can adequate inter-rater reliability (IRR) be obtained after observing 10 minutes of full-length intervention lessons? 2) Using the QIDR, what is the relationship between scores obtained watching the full lesson versus sampling ten minutes of the lesson? 
3) To what extent does the relationship between QIDR ratings obtained watching the full lesson, versus sampling ten minutes of the lesson, depend on time segment of the lesson (i.e., beginning, middle, end) or on phase within the intervention (i.e., 2nd week, 5th week, 8th week)? In other words, are correlations between the ratings systematically stronger or weaker based on lesson segment or intervention phase? 4) Which QIDR ratings (full lesson vs. ten-minute sample; beginning, middle, end; intervention phase) account for the most variance in student outcomes? The initial section of this chapter will relate findings of this study to prior research, and then the chapter will conclude with a discussion on implications for future research and practice. Primary Findings Inter-rater reliability. To answer the first research question, observers were randomly assigned to code lesson segments using the QIDR. Acceptable inter-rater reliability (IRR) was achieved when using ten-minute observations (ICC = .71). While the intra-class correlations (ICCs) obtained from the lesson segments was not as high as those obtained from observations of the full-length videos (ICC = .81) in the original study, the reliability ratings for each of the segments fell within the good range for 88 agreement (ICC > .60), with the reliability for end segments being in the excellent range (ICC = .77; Cichetti, 1994). These results indicate that individuals can be trained to reliably measure implementation on a multifaceted tool (i.e., QIDR) using a 10-minute segment, and that this measure of implementation is highly correlated with the score obtained from watching the entire lesson (Forbes-Spear, 2014). These findings are similar to Snippets research (Pratt & Logan, 2014), which found that the Snippets tool could reliably measure the use of specific comprehension instructional strategies within a reading lesson. A MET follow-up study also found that the reliability of 15-minute observations (representing 33% of full-length lessons) was comparable to full length lessons in general education classrooms when using the Framework for Teaching (FfT; Danielson, 1996), also a multifaceted observation tool (Ho & Kane, 2013). For instructional coaches and administrators in schools, these findings suggest that brief observations are highly related to what they would see if they had the opportunity to watch an overall lesson. Knowing this may actually encourage coaches/administrators to conduct more frequent observations to identify interventionists that may need additional professional development. It should be noted, however, that the process of reaching an acceptable level of reliability using ten-minute segments was more challenging than it was to achieve adequate reliability for full-length QIDR observations in the previous study (Forbes Spear, 2014). The challenges in gaining reliability may have arisen from the length of the lesson segments, the multifaceted nature of the QIDR measure, and/or individual coder characteristics. More research is necessary to determine the specific sources of variance in reliability, and each is discussed below. 89 Lesson segment length. The original intent of this study (see chapters III and IV) was to use six-minute segments similar to Pratt and Logan (2014), but achieving reliability with segments of that length proved difficult. 
One reason reliability may not have been as challenging within the Pratt and Logan study was that the Snippets tool was measuring the presence or absence of much more discrete comprehension strategies rather than multiple elements of implementation on the QIDR. Therefore, the decision was made to increase the length of the lesson segments to ten minutes to determine if additional time would allow for more opportunities for coders to observe various implementation and response behaviors. Multifaceted nature of QIDR. As found in other studies, the cognitive load required to attend to multiple dimensions during an observation may have affected observer reliability (Jerald, 2012; Joe, McClellan, & Holtzman, 2016). During the large- scale Measures of Effective Teaching (MET; Bill & Melinda Gates Foundation) project, a follow-up study was conducted to determine if complexity of an instrument had an effect on inter-rater reliability scores of segments of full-length lessons in a general education classroom (Joe, McClellan, & Holtzman, 2016). To do this, researchers compared reliability achieved when observers used only a subset of items of the FfT (Danielson, 1996), with reliability achieved when using all elements of the FfT. Lengths of observations ranged from 22 to 30 minutes and reflected approximately half of a full lesson. Researchers found that inter-rater reliability decreased significantly when observers were required to use all items within each of the observation instruments compared to when they used only a subset of items (Joe, McClellan, & Holtzman, 2016). The findings of this MET study support the notion that reliability issues within the 90 current study may have been a function of the complex, multifaceted nature of the tool. Whether the multifaceted nature of the QIDR is more influential on inter-rater reliability than the length of the lesson segments cannot be determined from this study, but is worth considering. For example, with the QIDR, observers were trained to watch for numerous elements of instruction within a shortened lesson that may not have occurred during that sample, such as emotional responsiveness, partner opportunity to respond, or teacher responding appropriately to problem behavior. While these instructional behaviors are important, their frequency of use is dependent on the context of the situation and may have impacted reliability in the current study. Coder characteristics. Individual coder issues may have also affected reliability within this study. Two coders who had difficulty with reliability were removed from the original pool of coders (see Chapters III and IV for additional information regarding this removal). The reasons for their difficulties with attaining reliability are unclear, but some possibilities are discussed here. Researchers have determined that multiple factors can affect coder accuracy (Repp, Nieminen, Linger & Brusca, 1988) including the setting of the interventions, complexity of the observation tool, and observer bias. Expectations of subject performance, or bias, may affect the observer’s ability to accurately score an observation (Repp et al., 1988). In the case of this study, observers were aware of the general background of the interventionists and one of the eliminated coders may have been susceptible to closely identifying and sympathizing with the situations being observed, making that coder more likely to score the interventionists more leniently. 
This coder had multiple years of experience in both general education and intervention settings. During 91 training and calibration activities, this coder often commented that she understood the actions of the interventionist (“and may have reacted in the same way”) even when the interventionist being scored was exhibiting less-than-desirable implementation behaviors. This coder seemed to rely more on emotional reactions and her own beliefs about teaching than actual element descriptors within the QIDR. The second coder who had difficulty with reliability was an English language learner with multiple years of teaching experience outside of the United States (U.S.), but no experience teaching or observing instruction in the U.S. Experience differences may have presented some bias within this study. It is possible that different expectations regarding teacher and student behavior may have prompted this coder to score interventionists more leniently. In addition, nuances in the rubric language may have made interpretation of the rubric more difficult for this coder. Interestingly, a post-hoc analysis was conducted on the reliability of the initial set of six-minute lesson segments with the same two coders eliminated. Again, it was found that reliability increased within that sample of observations, and reached an acceptable level of agreement on the two six-minute lesson segments assigned after re-training (ICC = .69). This may indicate that the largest issue impacting reliability was individual coder characteristics, rather than segment length or the multifaceted nature of the tool. Future studies should continue to investigate short segment lengths to increase efficiency further. Relationship between lesson segment and full-length QIDR scores. To examine the second research question, QIDR scores from the full-length lessons were compared with QIDR scores obtained from lesson segments of the same lessons and average phase scores across the 10-week intervention. Results indicated that all segments 92 were strongly correlated to the full lesson (r > .70), with beginning lesson segments most strongly correlated with the full-length scores (r = .81). This finding was similar to a MET study comparing FfT scores from the first 15 minutes of a lesson with scores obtained for the whole lesson (Ho & Kane, 2013). Ho and Kane found there was little difference between the 15-minute score and the full-length observation. Correlations obtained comparing lesson segment phase scores with full-length phase scores were also strong and significant, (r > .77), with the lesson segment scores from phase C being most strongly correlated with full-length lesson phase scores (r = .95, p < .01). These findings indicate that an observer can get a similar measure of implementation using the QIDR regardless of whether you watch a whole lesson, or any10-minute segment. The correlations between overall phase scores also indicate that a similar measure can be obtained regardless of phase within the intervention. The reason this result was attainable may relate to the targeted nature of the QIDR and the fact that it was developed to identify the use of very specific elements of explicit instruction found in the intervention programs used in this study. Fixsen (2013) posits that assessment systems that directly relate to the philosophy and critical elements of a program or practice can provide opportunities for repeated assessment and feedback. 
Hill and Grossman (2013) contend that observation tools that provide specific, readily implementable feedback are more successful in improving instructional quality. The findings of the current study indicate that the nature of the QIDR may allow for flexibility in terms of what segment of a lesson is observed, while also providing enough specificity in key instructional elements to guide discussion and feedback for interventionists that can improve instruction and student outcomes in an efficient manner.
Relationship between scores obtained during various lesson segments and intervention phases. To further explore the relationship between scores obtained during the various lesson segments and intervention phases, a two-way, within-subject, repeated measures ANOVA was used to determine the equivalency of the scores. Results indicated that there was no statistically significant difference between scores obtained in full-length observations and those generated during specific lesson segments or phases of intervention. This finding further supports the notion that any time segment will provide a similar measure of implementation, allowing coaches and administrators to conduct implementation checks as needed (e.g., when student response is limited or a prior observation showed low scores) rather than in a procedural manner (e.g., once a year or quarter). Related to responsive monitoring of implementation, an interesting, though not statistically significant, finding was noted regarding the phase of the intervention. As observed in Figure 2 (p. 65), mean QIDR scores for all interventionists decreased across phases within this study. Others have found that observers tend to rate more harshly across time (Casabianca, Lockwood, & McCaffrey, 2015; Congdon & McQueen, 2000); however, the design of the current study should have corrected for this phenomenon because coders were blind to lesson time and segments were randomly assigned to coders, controlling for segment and phase. When scores were closely examined, it was found that three of the four interventionists whose scores decreased most drastically across phases also had the lowest mean QIDR scores in Phase A. In general, if an interventionist's average score was above 35 during Phase A, the scores for subsequent phases also remained at or above 35. Those whose initial scores were below 35 in Phase A showed declining scores across the remaining phases. This finding may indicate that those interventionists who are most in need of support at the beginning of the intervention period will show continued decreases in instructional quality without feedback and coaching supports. This issue will be further discussed within the implications section of this chapter. It is important to note, however, that the changes in scores reflect only three points in time within the intervention (2nd, 5th, and 8th week) and may not be representative of scores occurring across entire intervention phases. Forbes-Spear (2014) included all weekly full-length lesson measures of the QIDR throughout the 10-week intervention period within her study and found that there were no significant changes across time, on average. It was also noted within the Forbes-Spear (2014) study that QIDR scores were variable across the entire intervention period.
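As a concrete illustration of the equivalence analysis described above, and not the study's actual code or data, the sketch below arranges hypothetical QIDR totals in the long format required by a two-way within-subject repeated-measures ANOVA, with instructional group as the subject factor and lesson segment and intervention phase as within-subject factors. The use of statsmodels' AnovaRM is an assumption for illustration.

```python
"""Layout and fit of a two-way within-subject repeated-measures ANOVA on QIDR totals."""
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
groups = [f"G{i}" for i in range(1, 9)]          # eight instructional groups
segments = ["beginning", "middle", "end"]
phases = ["A", "B", "C"]

# One (invented) QIDR total per group for each segment-by-phase cell,
# kept in the plausible 0-45 range of the delivery rubric total.
rows = [
    {"group": g, "segment": s, "phase": p,
     "qidr": float(np.clip(rng.normal(33, 6), 0, 45))}
    for g in groups for s in segments for p in phases
]
df = pd.DataFrame(rows)

# Within-subject factors: lesson segment and intervention phase.
res = AnovaRM(data=df, depvar="qidr", subject="group",
              within=["segment", "phase"]).fit()
print(res.anova_table)
```

A design like this requires a balanced layout (one score per group per cell); in practice, missing or extra observations would need to be averaged or handled with a mixed-effects approach instead.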
Given the variability observed across the intervention period, the findings of this study should be approached with caution, and it may be necessary to obtain multiple measures across time to achieve a more accurate measure of overall implementation within a specific phase of an intervention.
Association between QIDR and student outcomes. To address the final research question, multi-level modeling was used to determine if scores obtained using the QIDR were predictive of student outcomes. Due to the small sample size and issues with a floor effect on the DV (see discussion in the limitations section), these results also need to be interpreted with caution and are considered exploratory in nature. Scores on the QIDR, regardless of lesson segment or phase, were not significant predictors of group differences in student outcomes. However, when model statistics were examined for the full-length scores, each lesson segment, and each intervention phase, the findings indicated that there were differences in the variance explained by each predictor. The model using full-length lesson QIDR scores as predictors explained a substantial amount of variance in WAT scores at the group level (30%; pseudo-R² = 0.30). This was slightly different from the relationship found in Forbes-Spear's (2014) study, in which QIDR scores accounted for 36% of the variance in WAT scores (pseudo-R² = 0.36). The differences in variance explained may be attributed to differences between the two studies. In the current study, the 15 items that address instructional elements, as well as the four items related to student response, were included for analysis, while Forbes-Spear omitted the student response items from her analysis. In addition, Forbes-Spear used QIDR scores across all weeks of intervention, whereas this study employed QIDR scores obtained from the 2nd, 5th, and 8th weeks of the intervention period. Of all predictors, QIDR scores for end lesson segments accounted for the most variance in WAT scores at the group level (45%; pseudo-R² = 0.45). This indicates that QIDR scores obtained while observing the end of a lesson may be the best predictor of student outcomes. The fact that end segments actually explained more variance than a full-length lesson suggests that some element of instruction or student response that is or is not occurring at the end of a lesson may be key to impacting student outcomes. Notably, the end lesson segments also averaged the lowest QIDR scores across all interventionists and had the most variability, ranging from 9 to 50. The lower scores and large variability of implementation during the end segments of lessons may help to explain the differences in outcomes within groups. It is possible that those interventionists most skilled at teaching, including sustaining student engagement throughout the entire lesson, are more likely to have better student outcomes than those who do not possess the same skills. While end lesson segments explained the most variance in student outcomes, beginning segments explained 20% of the variance in group WAT outcomes (pseudo-R² = 0.20), and middle segments explained essentially none of the variance in WAT scores at the group level, making middle segments a very poor predictor of student outcomes (pseudo-R² = -0.05). Although previously discussed findings revealed that similar scores of implementation were obtained across segments, given the differences in variance explained across lesson segments, the best choice for an administrator or coach may be to observe the end of the lesson.
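The pseudo-R² values reported above reflect the proportional reduction in between-group variance when a QIDR predictor is added to an unconditional multilevel model. The sketch below, which uses invented data and should not be read as the study's actual model specification, shows one common way this computation is carried out with students nested in intervention groups.

```python
"""Proportional reduction in between-group variance (pseudo-R^2) in a two-level model."""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
records = []
for g in range(8):                       # eight instructional groups
    qidr = rng.uniform(20, 42)           # invented group-level QIDR mean
    u_g = rng.normal(0, 0.8)             # unexplained group-level variation
    for _ in range(4):                   # roughly four students per group
        wat = 2 + 0.15 * qidr + u_g + rng.normal(0, 1.5)   # invented WAT score
        records.append({"group": g, "qidr": qidr, "wat": max(wat, 0.0)})
df = pd.DataFrame(records)

# Unconditional (intercept-only) model: estimates between-group variance tau_00.
null = smf.mixedlm("wat ~ 1", df, groups=df["group"]).fit(reml=False)
# Conditional model adding the group-level QIDR predictor.
full = smf.mixedlm("wat ~ qidr", df, groups=df["group"]).fit(reml=False)

tau_null = float(null.cov_re.iloc[0, 0])
tau_full = float(full.cov_re.iloc[0, 0])
pseudo_r2 = (tau_null - tau_full) / tau_null
print(f"pseudo-R^2 (group level) = {pseudo_r2:.2f}")
```

Computed this way, the statistic can take small negative values when adding a predictor does not reduce, or slightly inflates, the estimated between-group variance, which is one way a value such as -0.05 for the middle segments can arise.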
Feedback provided from such end-of-lesson observations could better assist the interventionist in improving instruction in such a way as to sustain elements of quality implementation throughout the entire lesson, thus impacting student outcomes most profoundly. The phase of intervention also revealed an interesting pattern regarding the variance explained across the models. While the variance explained across the phases was not as substantial as in most of the models with lesson segments or full-length scores as predictors, the QIDR scores appeared to become stronger predictors across the intervention phases, with the final intervention phase explaining the most variance in WAT scores (19%) at the group level. This finding indicates that the QIDR score received by interventionists closest to the time of post-test may be more predictive of student outcomes than scores from other phases of the intervention. Interestingly, when looking across scores, the average QIDR score obtained from intervention phases decreased across time, while variability in scores increased. The schools involved in this study had multiple years of experience providing tiered supports for all students in a fully implemented RTI model. The interventionists included in this study had been previously trained in intervention delivery and had multiple years of experience using the intervention programs involved in the study, but the schools did not provide formal ongoing support and coaching. The finding that overall instructional quality decreased across phases, coupled with the final phase of the intervention explaining the most variance in student outcomes, provides additional support for ensuring that interventionists are provided frequent feedback and supports so that implementation improves, rather than deteriorates, across the intervention period (Fixsen, Blase, Metz, & Van Dyke, 2013; Hill & Grossman, 2013; Pianta, Mashburn, Downer, Hamre, & Justice, 2008).
Limitations
Several limitations within this study must be considered alongside these findings and may help to inform future research.
Sample size. First, with only 31 students nested within eight groups, the small sample size contributed to the underpowered nature of the study, particularly for examining the relationship between implementation and student outcomes. The insufficient power within this study makes it difficult to identify statistically significant effects (Maxwell, 2004) and may increase Type II error.
Student outcome measure. Another limitation within this study involved the use of the Word Attack subtest of the Woodcock Reading Mastery Tests-Revised (WAT; WRMT-R; Woodcock, 1987) as the outcome measure. Because of the developmental age and at-risk nature of the students within this study, the WAT was not sensitive enough to detect individual differences in student reading outcomes. The WAT does target the skills being taught in the interventions; however, a more sensitive curriculum-based measure, such as DIBELS or easyCBM, may have more accurately captured the differences across students.
Lesson segment numbers. An additional limitation should be noted regarding the number of lesson segments included in the study.
While 72 lesson segments may have been adequate for answering the research questions related to reliability and the relationship to full-length lesson scores, as the analyses began to explore the relation of lesson segment and intervention phase at the group level, the n needed to accurately answer some of the questions may have been too small. For instance, the 72 lesson segments were derived from only 24 full-length lessons, which represented only three lessons from each group. Therefore, when lesson segments were examined, only nine lesson segments were analyzed for each instructional group: three beginning, three middle, and three end segments per group. The study also took into account only three weeks out of the ten-week intervention period. Therefore, the average of one beginning, one middle, and one end lesson segment (all from the same full-length lesson) comprised the score for each intervention phase. Considering the multifaceted nature of the QIDR tool, as well as the variability across time found by Forbes-Spear (2014) in an earlier study, a more accurate measure of implementation within a phase may have been achieved with multiple lesson measures rather than being derived from only one lesson within the intervention phase. For this reason, some findings, particularly those involving phase and specific lesson segments by group, should be approached with some caution.
Observer reliability. The difficulty with achieving reliability among observers provided insight into aspects of observation that may impact reliability, but it also presented some limitations that must be noted. It is important to point out that a select pool of coders was necessary to achieve reliability and that two coders had to be eliminated. This issue is addressed further in the implications section of this chapter.
Implications
The overall purpose of this study had a practical focus. Providing teachers with regular observations and feedback has been found to improve student outcomes when incorporated into a responsive instructional cycle (Fixsen, 2013; Pianta et al., 2008). The focus of the current study was to determine if the QIDR could be used to measure implementation efficiently so that it might be effectively used for providing frequent feedback to improve intervention instruction.
Reliability can be demonstrated with abbreviated observation. The issue of achieving reliability with these abbreviated, ten-minute observations has important implications for both research and practice. Observation and feedback are only useful when they are accurate and specific enough to improve instruction (Hill & Grossman, 2013). If administrators and coaches have the ability to visit classrooms to perform short observations using a tool that can inform feedback, they may be more likely to perform these observations on a regular basis. This frequent observation and feedback loop is especially essential for interventionists who demonstrate weaker skills in early observations. Results of this study indicated that those interventionists who had lower QIDR scores at the beginning of the study continued to have lower scores throughout. Therefore, shorter observations could allow the administrator or coach to visit those interventionists most in need of feedback on a much more regular basis, increasing the likelihood that instruction will improve over time.
Challenges in achieving reliability in school settings.
Providing specific, targeted feedback using a multifaceted tool such as the QIDR may present unique challenges for training and ongoing support for observers (Jones, Reid, & Patterson, 1975; Taplin & Reid, 1973). Therefore, providing adequate training, as well as frequent calibration checks, is vital for establishing and maintaining reliability within school settings. The goal of initial training should be to ensure that observers adopt a view of teaching that is consistent with the tool being used for measurement (Bell et al., 2016). Initial training using the QIDR must include a thorough explanation of the instructional components most important for improving student outcomes within intervention settings so that coaches and administrators are able to complete observations that are free of their own personal biases about instruction. To maintain measures of implementation that are reliable and aligned with the intent of the tool, it is also necessary to provide regular check-ins across time to ensure that coaches and administrators continue to provide accurate measures of implementation and student response. The fact that two of the seven coders within this study were found to have issues with reliability indicates that having only one or two observers within a school may be problematic if the biases of the coder or coders prevent them from providing accurate assessment and feedback to interventionists. Calibration against a set of "master" scores (i.e., scores obtained and agreed upon by a group of experienced coders), initially and at multiple times throughout a study, may be necessary in order to ensure that particular coders are, and remain, reliable (an illustrative sketch of such a check appears at the end of this section). Given the difficulty with achieving reliability with the original seven coders, observer characteristics may also be an important consideration. It may be necessary for researchers and administrators to carefully select observers who can provide objective measures of instructional quality. One MET study found that a survey of teacher beliefs could predict which teachers were most likely to be successfully trained to use the CLASS observation tool reliably (Ho & Kane, 2013). Developing a system for screening coders prior to initial training may help to conserve training resources by identifying those not likely to be reliable coders.
Equivalence of implementation regardless of lesson segment. The finding that there was no significant difference in the measures of implementation across lesson segments or intervention phases, along with the strong correlations with full-length lessons, provides a great deal of flexibility for observers in school settings. The knowledge that the time of observation has little effect on measures of implementation, along with the earlier finding of reliability across shorter segments of lessons, gives administrators and coaches the flexibility to fit observations and feedback into busy schedules at their convenience. Shorter observations, with the added benefit of scheduling flexibility, may mean that more frequent observations and feedback can be provided to interventionists, thus allowing greater opportunity to improve instruction and subsequent student outcomes (Hill & Grossman, 2013; Jerald, 2012). Although there were limitations in regard to the number of segments included in each phase, the findings regarding implementation across intervention phases may also be very important for maximizing resources for instructional supports within a school.
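As referenced earlier in this section, a calibration check against master scores might be operationalized along the lines of the minimal sketch below. The item scores, agreement thresholds, and decision rule are invented assumptions for illustration, not procedures used in this study.

```python
"""Illustrative calibration check of one candidate coder against master scores."""
import numpy as np
from scipy.stats import pearsonr

# Item-level master scores (0-3) for one 10-minute calibration segment,
# alongside one candidate coder's scores for the same 15 QIDR delivery items.
master = np.array([3, 2, 2, 3, 1, 2, 2, 3, 2, 1, 2, 2, 3, 2, 2])
coder = np.array([3, 2, 1, 3, 2, 2, 2, 3, 2, 1, 1, 2, 3, 2, 3])

exact = np.mean(coder == master)                    # exact item agreement
within_one = np.mean(np.abs(coder - master) <= 1)   # agreement within one point
bias = float(np.mean(coder - master))               # positive values = more lenient
r, _ = pearsonr(coder, master)

print(f"exact agreement: {exact:.0%}, within-one agreement: {within_one:.0%}")
print(f"mean leniency bias: {bias:+.2f}, correlation with master: r = {r:.2f}")

# Invented decision rule: require, for example, at least 80% within-one
# agreement and a small absolute leniency bias before independent coding.
needs_retraining = within_one < 0.80 or abs(bias) > 0.5
print("re-training recommended" if needs_retraining else "calibrated")
```

Tracking the leniency bias separately from agreement is one simple way to surface the kind of systematic over-scoring attributed to the eliminated coders in this study.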
With respect to maximizing those resources, and similar to the idea of tiered instructional supports used in RTI, it may be possible to tailor the frequency and intensity of observation and feedback depending upon the needs of each interventionist. Myers, Simonsen, and Sugai (2011) used this approach with teachers implementing elements of a system of Positive Behavior Intervention and Supports (PBIS). Those teachers who were nonresponsive to schoolwide PBIS training (tier 1) in using specific and contingent praise were offered targeted training supports (tier 2), followed by more individualized training for those who were not responsive to targeted training (tier 3). Through their investigation, they found that all teachers benefited from additional supports, but that the level of support needed differed, meaning that some teachers required more intensive supports before a change in their behavior was observed. This approach could be applied using the QIDR, or other implementation tools, as well. Within this study, interventionists who scored high (above 35) in the first intervention phase maintained high implementation across the intervention, so it may be possible to provide less frequent observations and less intensive coaching supports for them. In contrast, interventionists scoring below 35 during the initial intervention phase may require much more frequent coaching and feedback to ensure that implementation not only improves, but does not worsen across time.
Implementation is related to student outcomes. The elements of explicit instruction on which the QIDR was based have been shown to impact student outcomes and emphasize explicit, intensive, and supportive instructional methods (Gersten et al., 1997; Swanson, 1999; Torgesen, 2002). Although the small sample size, coupled with the floor effect in the outcome measure, limited the ability to definitively say that QIDR scores are predictive of student outcomes, the variance explained by each of the models indicates that the QIDR may be effective for this purpose. In addition, the variance explained by QIDR scores within end segments indicates that instructional behaviors at the end of a lesson may have a particular impact on student outcomes. Although previously discussed findings indicated that observing an intervention lesson at any time can give a reliable measure of implementation, the considerable additional variance explained by end lesson segments might provide further guidance on optimal observation times, if the opportunity to choose an observation time is available. The ability of an interventionist to sustain high levels of implementation across an entire lesson may be the best predictor of student outcomes. In addition, feedback based on end lesson segment observations may better target the skills most in need of improvement for that interventionist.
Future Research
The most challenging aspect of this study involved the reliability of coders. Future research needs to address training methods and how to guard against observer bias when training observers to use a multifaceted tool such as the QIDR. In addition, investigation into the observer characteristics that are optimal for use in both research contexts and school-based contexts is important. Understanding what traits are essential in observers may help future researchers to avoid reliability issues and could optimize the utility of the QIDR as a tool for providing useful feedback to interventionists in school settings.
Future research is necessary to determine if there is a screening measure that could be used to identify optimal coder characteristics for a tool such as the QIDR, similar to what was done with the CLASS (Ho & Kane, 2013). Finally, given the post-hoc analysis revealing that a smaller subset of coders was able to achieve reliability with six-minute segments, additional research investigating the possibility of using even shorter segments is necessary. Future research should also address the elements of instruction and student response within the QIDR tool that might overlap in the construct being measured. Reducing the number of elements that observers must discern may help to increase reliability by reducing the cognitive load required of observers (Joe, McClellan, & Holtzman, 2016). One possible approach is to combine certain elements. For example, it may be possible to combine the "modulating lesson pacing" element with the "teacher ensuring students are firm on content" element; both address the interventionist's ability to adjust instruction based on student performance, so a single element to that effect may be sufficient to capture this construct. In addition, there are two elements within the instruction portion of the QIDR that can be scored based on student response. The first item, "Teacher is familiar with the lesson," discusses teacher fluency with lesson formats, but also includes an element regarding whether or not students follow procedures. The other item, "Teacher expectations are clearly communicated and understood by students," states that either "the teacher explicitly reviews expectations, or it is clear expectations have been taught because all students demonstrate knowledge of expectations for behavior and academic routines, and meet or exceed expectations." These items seem to overlap with three items found on the group student response rubric, which address whether or not students demonstrate behaviors consistent with knowledge of expectations for routines, on-task behavior, and following directions. This could potentially reduce the number of items necessary for scoring by three elements, giving the observer fewer elements to consider, which may in turn improve reliability. Given the practical intent of the current study, it would be remiss not to suggest the need for research to elucidate the utility of the QIDR for providing feedback to improve instruction. Research needs to be conducted to determine whether coaching that uses the QIDR as a prompt for guiding discussion and feedback with interventionists does, in fact, improve implementation over time, and whether improved implementation also results in improved student outcomes.
Conclusions
Quality instruction is especially important for students who are at risk for failure. Unfortunately, due to limited resources in schools, interventionists providing instruction for these students are often the least likely to receive ongoing supports to ensure high-quality implementation of interventions (Al-Otaiba, Wagner, & Miller, 2014). Following the lead of RtI, a responsive instructional cycle can be used to provide the needed supports to interventionists in such a way that resources can be maximized for those requiring the most intensive supports.
For those interventionists requiring more targeted and frequent supports, administrators and coaches must utilize tools that can help them to provide support and feedback that is useful in improving instruction and student outcomes on a more regular basis (Myers, Simonsen, & Sugai, 2011). The challenge is in creating a tool that can accurately and reliably measure implementation, provide enough information to guide specific, targeted feedback to improve instruction, but be streamlined enough that it can be used for frequent observation and feedback (Fixsen, et al., 2013; Hill & Grossman, 2013). Findings from this study provide initial support for 106 the use of the QIDR as a tool that meets these criteria. While additional research is needed to fine-tune the QIDR and confirm its utility as a coaching tool, the current study indicates that shorter, more frequent observations are feasible and that there is promise in the efficient use of the QIDR for just such a purpose. APPENDIX Quality of Intervention Delivery and Receipt Item Not implemented: 0 points <50% Inconsistent implementation: 1 point >50% Effective implementation: 2 points >80% Expert implementation: 3 points >95% a) Teacher is familiar with the lesson (e.g., it is evident that teacher has previewed the lesson and demonstrates fluency with the formats and lesson activities). Teacher does not demonstrate fluency with formats and lesson activities and students do not follow the procedures. Teacher occasionally demonstrates fluency with formats and lesson activities and students only sometimes follow the procedures. Teacher typically demonstrates fluency with formats and lesson activities and most students typically follow the procedures. Teacher consistently demonstrates fluency with formats and lesson activities and all students consistently follow the procedures. b) Instructional materials are organized (e.g., instructional materials are prepped before starting the lesson including worksheets, pencils for easy distribution; organization supports rather than detracts from effective instruction, smooth transitions, etc.). Instructional materials are not organized. Instructional materials are partially organized. Instructional materials are completely organized. All instructional materials are organized specifically by lesson or student name. c) Transitions between activities are efficient and smooth (e.g., well-established routines are in place, “teacher talk” is minor between lesson components, less than 1-2 minutes). Excluding factors outside teacher control such as fire drill. Teacher does not implement well- established routines to minimize interruptions. (e.g., transitions often take longer than 2 minutes, excluding outside factors). Teacher occasionally implements well-established routines to minimize interruptions but “Teacher Talk” may occur, or transitions are inconsistent (e.g., transitions occasionally take longer than 2 minutes, excluding outside factors). Teacher implements well- established routines to minimize interruptions. “Teacher talk” between transitions is minimal (e.g., transitions typically take less than 1-2 minutes, excluding outside factors). Teacher implements well- established routines to minimize interruptions. All transitions consistently occur and activities flow nearly seamlessly (e.g., transitions consistently take about a minute excluding outside factors). 
d) Teacher expectations are clearly communicated and understood by students (e.g., teacher reviews academic and behavior expectations, uses clearly established routines, precorrects for challenging activities, etc.). Teacher does not explicitly state expectations and students do not demonstrate knowledge of expectations for behavior and academic routines. Teacher states expectations but students only occasionally demonstrate knowledge of expectations for behavior and academic routines. Teacher explicitly reviews expectations or it is clear expectations have been taught because most students typically demonstrate knowledge of expectations for behavior and academic routines. Teacher explicitly reviews expectations or it is clear expectations have been taught because all students consistently demonstrate knowledge of expectations for behavior and academic routines and meet or exceed those expectations. 108 Item Not implemented: 0 points <50% Inconsistent implementation: 1 point >50% Effective implementation: 2 points >80% Expert implementation: 3 points >95% e) Teacher positively reinforces correct responses and behavior as appropriate (group and individual) (e.g., teacher inserts affirmations, specific praise, and confirmations either overtly or in an unobtrusive way). Teacher does not use positive reinforcement to reinforce correct responses and appropriate behavior through verbal and nonverbal feedback when appropriate. Teacher occasionally uses positive reinforcement to reinforce correct responses and appropriate behavior through verbal and nonverbal feedback when appropriate. Teacher typically uses targeted positive reinforcement (specific and general) to reinforce correct responses and appropriate behavior through verbal and nonverbal feedback when appropriate Teacher consistently and effectively uses positive reinforcement (specific and general, individual and group) to reinforce correct responses and appropriate behavior through verbal and nonverbal feedback when appropriate. f) Teacher appropriately responds to problem behaviors (e.g., including off task; emphasizes success while providing descriptive, corrective feedback; positively reinforces to get students back on track). Teacher does not appropriately respond to problem behavior across multiple students. Teacher primarily provides negative feedback or ignores problem behavior for extended period of time (resulting in limited student participation, e.g., more than 20% of activity). Teacher sometimes appropriately responds to problem behavior. Teacher provides some positive or corrective feedback but does not regularly emphasize success. Teacher may have difficulty consistently responding to one student’s problem behavior but sometimes responds appropriately to other students. Teacher typically responds appropriately to problem behavior by emphasizing success and providing neutral corrective feedback for most students. Or no problem behavior occurs during the instruction. Teacher consistently responds appropriately to problem behavior by emphasizing success and providing descriptive corrective feedback as needed for all students. For example, teacher “catches” students engaging in appropriate behavior and provides descriptive positive feedback to encourage appropriate behavior. g) Teacher is responsive to the emotional needs of the students (e.g., teacher connects not only academically but personally to students, calls them by name, jokes with them, asks about their day, etc.). 
Teacher provides limited/no positive feedback, may use sarcasm, and is unresponsive/unaware of students’ emotional needs. Teacher is generally neutral, may provide positive feedback but is directed toward academic content (i.e., no demonstration of being aware of students’ emotional needs). Teacher is typically positive, responsive and aware of most students’ emotional needs. Teacher greets students by name, makes students feel welcome, respects their individuality, makes an effort to make a connection, and appears to enjoy students. Teacher is consistently very positive, responsive and aware of all students’ emotional needs. Teacher greets students by name, makes students feel welcome, respects their individuality, makes an effort to make a connection, and appears to enjoy students. 109 Item Not implemented: 0 points <50% Inconsistent implementation: 1 point >50% Effective implementation: 2 points >80% Expert implementation: 3 points >95% h) Teacher uses clear and consistent lesson wording (e.g., using the exact wording or a close approximation of the language of the program consistently across activities). Teacher does not use guide including script or format. Wording is inconsistent, and there appears to be excessive “teacher talk”. Teacher partially uses guide including script or format. Wording is sometimes consistent (during particular activities or instructional components). Teacher typically uses guide including script or format. Wording is consistent and directions are clear and easy to follow across activities. Teacher consistently uses guide including script or format. Wording is always consistent, and directions are clear and easy to follow across all activities. i) Teacher uses clear and consistent auditory or visual signals (e.g., it is clear to students when and how to respond appropriately during individual, partner and group responses, across all components of lesson). Teacher does not use clear auditory or visual signals to ensure students respond appropriately. Teacher occasionally uses clear auditory or visual signals to ensure students respond appropriately. Teacher typically uses clear auditory or visual signals to ensure students respond appropriately. Teacher consistently uses clear auditory or visual signals to ensure students respond appropriately. j) Teacher models skills/strategies during introduction of activity (e.g., shows students examples that demonstrate how to complete the academic skill/strategy, which all students can easily see, during teaching). Teacher does not clearly demonstrate skills/strategies prior to student practice opportunities. Teacher occasionally clearly demonstrates skills/strategies prior to student practice opportunities. Teacher typically clearly demonstrates skills/strategies prior to student practice opportunities. Or no modeling is used but all students are successful with activities. Teacher consistently demonstrates skills/strategies prior to student practice opportunities. k) Teacher uses clear and consistent error corrections that demonstrates the correct response and has students practice the correct answer (e.g., use of corrective feedback procedures is evident and student(s) have the opportunity to respond correctly). Teacher does not use corrective feedback procedures, including giving students an opportunity to practice the correct response. Teacher occasionally uses corrective feedback procedures, including giving students an opportunity to practice the correct response. 
Teacher typically uses corrective feedback procedures, including giving students an opportunity to practice the correct response or fewer than three errors occur during the entire lesson. Teacher consistently uses corrective feedback procedures, including giving students an opportunity to practice the correct response. 110 Item Not implemented: 0 points <50% Inconsistent implementation: 1 point >50% Effective implementation: 2 points >80% Expert implementation: 3 points >95% l) Teacher provides a range of systematic group or partner opportunities to respond (e.g., offers students practice by partner, choral and/or written responses). Teacher does not provide opportunities for group or partner opportunities to respond. Teacher provides some opportunities for group or partner opportunities to respond. Teacher provides a range of systematic group or partner opportunities to respond. Teacher regularly provides a range of systematic group or partner opportunities to respond. m) Teacher presents individual turns systematically (e.g., students are given opportunities to respond individually but using a varied approach to keep students engaged, provides additional opportunities for students making regular errors). Teacher does not present individual turns when appropriate. Teacher occasionally presents individual turns when appropriate (round robin and turns are predictable). Teacher presents individual turns when appropriate, purposely varied across students during some portions of the instruction. (All students are given opportunities to respond individually on a random basis.) Teacher presents individual turns when appropriate purposely and strategically across students. (All students are given opportunities to respond individually on a random basis.) Individual turns are strategically incorporated throughout the instructional time. n) Teacher systematically modulates lesson pacing/provides adequate think time (e.g., appropriate to learner performance). Teacher makes no attempt to adjust pacing in response to student performance. Teacher adjusts pacing/wait time occasionally in accordance with student responses. Teacher typically anticipates and adjusts pacing/wait time between question and student response. Teacher consistently anticipates and adjusts pacing/wait time between question and student response. o) Teacher ensures students are firm on content prior to moving forward (e.g., holds students to a high criterion/mastery level of performance on each task, reteaches and retests as needed). Teacher moves on before most students are firm on content. Teacher moves on when some of the students are firm on the content or sometimes moves on when students are firm on content but other times moves on before students are firm on content. Teacher typically ensures most students are firm on content before moving on to new material. Teacher consistently moves on when most students are firm on the content or continues to practice when students are not firm on content. (if only one student persists in errors and the teacher moves on after attempting correction, this is ok) **If one activity goes particularly poorly, the teacher cannot receive a rating of 3 on the following items: familiarity with the lesson, clear and consistent wording, modeling, clear signals and correction procedures. 
111 Student Response During Intervention Group Student Behavior Item None or One 0 points <50% Some 1 point >50% Most 2 points >80% All 3 points >95% a) Students are familiar with group routines (e.g., students demonstrate they know procedures). Students do not demonstrate knowledge of group routines. Students occasionally demonstrate knowledge of group routines. Most students typically demonstrate knowledge of group routines. All students demonstrate knowledge of group routines consistently during the instruction. b) Students are actively engaged with the lesson (e.g., students are listening, on task and responding). Students are not actively engaged during the lesson. Students are actively engaged during part of the lesson. Most students are actively engaged for the majority of the lesson. All students are actively engaged for the majority of the lesson. c) Students follow teacher directions (e.g., students are listening and responding to teacher requests). Students do not follow teacher’s directions when asked. Students occasionally follow teacher’s directions when asked. Most students typically follow teacher’s directions when asked. All students consistently follow all teacher’s directions when asked. d) Students are emotionally engaged with the teacher (e.g., students connect with teacher beyond schoolwork and are excited to be there). Students don’t appear to want to be in the group (e.g., students direct negative comments/behavior toward teacher, etc.). Students seem complacent/compliant with the group (e.g., student “going through the motions” in group but not negative). Most students appear to genuinely want to be in the group (e.g., students smile when joining the group, say hi to teacher, etc.). All students appear to genuinely want to be in the group (e.g., students smile when joining the group, say hi to teacher, etc.). 112 Individual Student Response Item 0 points <50% 1 point >50% 2 points >80% 3 points >95% Emotional Engagement Student appears to be disconnected from the teacher. Student responds to teacher attention with negative comments or behaviors. Student appears to be somewhat connected with the teacher, but appears to be complacent with teacher attention. Student may not actively seek out teacher attention, but does not respond negatively to the teacher. Student typically appears to be connected with the teacher and seems to seek interactions with teacher. Student smiles when joining group, appears happy to be there, seeks teacher attention, and appears to want to work with teacher. Student consistently appears to be highly connected with the teacher and seems to seek interactions with teacher. Student smiles when joining group, appears happy to be there, seeks teacher attention, and appears to want to work with teacher. Self-Regulated Behavior Student demonstrates limited attention. Across the instructional observation, engagement is dependent upon significant teacher prompting. Consistently needs to be redirected to complete tasks. Student demonstrates occasional attention to tasks (and may be able to maintain attention during one or certain type of tasks), but engagement is often dependent upon significant teacher prompting (e.g., at least 2 prompts in 1 task). Consistently needs to be redirected to complete tasks. After prompting, will comply. Student demonstrates moderate engagement. Student is typically engaged but is sometimes dependent on teacher prompting (e.g., <2 within a task). Completes work/answers on signal, asks questions when appropriate. 
Appears to be trying hard. Sometimes volunteers to participate. Student demonstrates consistent sustained attention. Able to stay engaged in lesson regardless of amount of teacher attention. Completes work/answers on signal, asks questions when appropriate. Appears to be trying hard. Student actively initiates and regularly volunteers to participate. *Only code student individual behaviors if they are visible for the majority of the session (i.e., more than 50% of time). Student Responsiveness Descriptors:  Responsive: Student may or may not visibly demonstrate awareness of feedback, but attempts to incorporate feedback (i.e., accuracy improves, self-corrects) later in lesson.  Non-responsive: Student may or may not demonstrate overt awareness of feedback, but demonstrates consistent error patterns across lesson.  Group ID: _____________ Date of Video/Observation: _________ Observer Name: __________________________ Number of Minutes of Lesson:___________________ Number of Students Observed: _______ Approximate time per activity type: Whole group:_______ Independent work: _____ Partner work:_______ Criteria for Level of Implementation Ratings (see developed rubric for each rating of implementation): 3 = Expert; 2 = Effective; 1 = Inconsistent; 0 = Element absent or not observed Quality of Intervention Delivery If one activity goes particularly poorly, the teacher cannot receive a rating of 3 on the following item: teacher familiarity of lesson, clear and consistent wording, modeling, clear signals and correction procedures. Item Level of Implementation Comments a) Teacher is familiar with the lesson 0 1 2 3 b) Instructional materials are organized 0 1 2 3 c) Transitions from one activity to another are efficient and smooth (i.e., less than 2-3 minutes) 0 1 2 3 d) Teacher expectations are clearly communicated and understood by students 0 1 2 3 e) Teacher positively reinforces correct responses and behavior as appropriate (group and individual) 0 1 2 3 f) Teacher appropriately responds to problem behavior (including off task) 0 1 2 3 g) Teacher is responsive to the emotional needs of the students 0 1 2 3 h) Teacher uses clear and consistent lesson wording 0 1 2 3 i) Teacher uses clear auditory or visual signals 0 1 2 3 j) Teacher models skills/strategies to introduce an activity 0 1 2 3 k) Teacher uses clear and consistent error corrections that includes the correct response and has students practice the correct answer 0 1 2 3 l) Teacher provides a range of systematic group or partner opportunities to respond 0 1 2 3 m) Teacher presents individual turns systematically 0 1 2 3 n) Teacher systematically modulates lesson pacing/provides adequate think time 0 1 2 3 o) Teacher ensures students are firm on content prior to moving forward 0 1 2 3 Overall Quality of Intervention Delivery Total /45 114 Overall Intervention Delivery Overall effectiveness takes into consideration quality of delivery, understanding of the program, and student engagement and management. 
Ineffective Needs Improvement Proficient Effective Highly Effective 1 3 5 7 9 0 1 2 3 4 5 6 7 8 9 10 Student Response During Intervention Group Student Behavior Item Level of Implementation Comments a) Students are familiar with group routines 0 1 2 3 b) Students are actively engaged with the lesson 0 1 2 3 c) Students follow teacher directions 0 1 2 3 d) Students are emotionally engaged with the teacher 0 1 2 3 Overall Group Student Behavior /12 Individual Student Response (Record students from left to right from your perspective) Stud Emotional Engagement Self-Regulated Behavior Responsiveness S1 0 1 2 3 0 1 2 3 Responsive Non-Resp S2 0 1 2 3 0 1 2 3 Responsive Non-Resp S3 0 1 2 3 0 1 2 3 Responsive Non-Resp S4 0 1 2 3 0 1 2 3 Responsive Non-Resp S5 0 1 2 3 0 1 2 3 Responsive Non-Resp **If student performance was unclear due to camera angle, indicate by placing an X over the student number. Only code student individual behaviors if they are visible for the majority of the session (i.e., more than 50% of time). REFERENCES CITED Archer, A. L., & Hughes, C. A. (2011). Explicit instruction: Effective and efficient teaching. New York, NY: The Guilford Press. Bell, C. A., Qi, Y., Croft, A. J., Leusner, D., McCaffrey, D. F., gitomer, D. H. & Pianta, R.C. Improving observational score quality: challenges in observer thinking. In T. Kane, K. Kerr, and R. Pianta (Eds.), Designing teacher evaluation systems: New guidance from the measures of effective teaching project. (pp. 415-443). Retrieved from http://k12education.gatesfoundation.org/wp- content/uploads/2015/11/Designing-Teacher-Evaluation-Systems_freePDF.pdf Borman, G. D., Kimball, S. M., Borman, G. D., & Kimball, S. M. (2005). Teacher quality and educational equality : Do teachers with higher standards-ratings close student achievement gaps? The Elementary School Journal, 106(1), 3–20. Brennan, R. L. (2010). Generalizability theory and classical test theory. Applied Measurement in Education, 24(1), 1-21. Brooks, M. G., & Brooks, J. G. (1999). The courage to be constructivist. The Constructivist Classroom, 57(3), 18–24. Brophy, J. (1986). Teacher influences on student achievement. American Psychologist, 41(10), 1069–1077. doi:10.1037//0003-066X.41.10.1069 Brophy, J. (1999). Teaching (Vol. 37). Geneva, Switzerland. doi:10.1016/S0167- 8922(00)80004-8 Brophy, J., & Good, T. L. (1986). Teacher behavior and student achievement. In M. C. Wittrock (Ed.), Handbook of Research on Teaching (3rd ed., pp. 328–375). New York, NY, US: Macmillan Publishing Company. 116 Brownell, M. T., Bishop, A. G., Gersten, R., Klingner, J., Penfield, R., Dimino, J., … Sindelar, P. T. (2009). The role of domain expertise in beginning special education teacher quality. Exceptional Children, 75(4), 391–411. Cameron, C. E., Connor, C. M., & Morrison, F. J. (2005). Effects of variation in teacher organization on classroom functioning. Journal of School Psychology, 43(1), 61–85. doi:10.1016/j.jsp.2004.12.002 Carlisle, J., Kelcey, B., Berebitsky, D., & Phelps, G. (2011). Embracing the complexity of instruction: A Study of the effects of teachers’ instruction on students' reading comprehension. Scientific Studies of Reading, 15(5), 409–439. doi:10.1080/10888438.2010.497521 Casabianca, J. M., Lockwood, J. R., & McCaffrey, D. F. (2015). Trends in classroom observation scores. Educational and Psychological Measurement, 75(2), 311-337. Cassidy, D. J., Hestenes, L. L., Hegde, A., Hestenes, S., & Mims, S. (2005). 
Measurement of quality in preschool child care classrooms: An exploratory and confirmatory factor analysis of the early childhood environment rating scale-revised. Early Childhood Research Quarterly, 20(3), 345–360. doi:10.1016/j.ecresq.2005.07.005 Catts, H. W., Petscher, Y., Schatschneider, C., Bridges, M. S., & Mendoza, K. (2008). Floor effects associated with universal screening and their impact on the early identification of reading disabilities. Journal of Learning Disabilities. Causton-Theoharis, J. N., Doyle, M. B., Giangreco, M. F., & Vadasy, P. F. (2007). The “sous-chefs” of literacy instruction. Teaching Exceptional Children, 40(1), 56–62. 117 Chetty, R., Friedman, J. N., & Rockoff, J. E. (2011). The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood. Cambridge, MA. Chomat-mooney, L. I., Pianta, R. C., Hamre, B. K., Mashburn, A. J., Luckner, A. E., Grimm, K. J., … Downer, J. T. (2008). A practical guide for conducting classroom observations: A summary of issues and evidence for researchers. Charlottesville. Cichetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed. Psychological Assessment, 6(4), 284. Colton, A. B., & Sparks-Langer, G. M. (1992). Restructuring student teaching experiences. In Supervision in Transition (pp. 155–168). Alexandria, VA: Association for Supervision and Curriculum Development. Congdon, P. J., & MeQueen, J. (2000). The Stability of Rater Severity in Large‐Scale Assessment Programs. Journal of Educational Measurement,37(2), 163-178. Connor, C. M. (2013). Commentary on two classroom observation systems: Moving toward a shared understanding of effective teaching. School Psychology Quarterly, 28(4), 342–6. doi:10.1037/spq0000045 Connor, C. M., Piasta, S. B., Fishman, B., Glasney, S., Schatschneider, C., Crowe, E., … Morrison, F. J. (2009). Individualizing student instruction precisely: effects of Child x Instruction interactions on first graders’ literacy development. Child Development, 80(1), 77–100. doi:10.1111/j.1467-8624.2008.01247.x Cook, B. G., & Odom, S. L. (2013). Evidence-based practices and implementation science in special education. Exceptional Children, 79(2), 135–144. Danielson, C. (1996). Enhancing professional practice: A framework for teaching. Alexandria, VA: Association of Supervision and Curriculum Development. 118 Danielson, C. (2007). Enhancing professional practice: A framework for teaching (2nd ed.). Alexandria, VA: Association for Supervision and Curriculum Development. Danielson, C. (2011). Enhancing professional practice: A framework for teaching. Association of Supervision and Curriculum Development. Danielson, C. (2013). The Framework for Teaching evaluation instrument, 2013 edition: The newest rubric enhancing the links to the Common Core State Standards, with clarity of language for ease of use and scoring. Danielson, C., & McGreal, T. L. (2000). Teacher evaluation to enhance professional practice. Association of Supervision and Curriculum Development. Darling-Hammond, L. (2010). Evaluating teacher effectiveness: How teacher performance assessments can measure and improve teaching. Retrieved from www.americanprogress.org Denton, C. a., Fletcher, J. M., Anthony, J. L., & Francis, D. J. (2006). An evaluation of intensive intervention for students with persistent reading difficulties. Journal of Learning Disabilities, 39(5), 447–466. doi:10.1177/00222194060390050601 Dunkin, M. J., & Biddle, B. J. (1974). The study of teaching. Holt, Rinehart & Winston. Durlak, J. 
a, & DuPre, E. P. (2008). Implementation matters: a review of research on the influence of implementation on program outcomes and the factors affecting implementation. American Journal of Community Psychology, 41(3-4), 327–50. doi:10.1007/s10464-008-9165-0 Engelmann, S., Arbogast, A., Bruner, E., Lou Davis, K., Engelmann, O., Hanner, S., & Al., E. (2002). SRA Reading Mastery Plus. DeSoto, TX: SRA/McGraw-Hill. 119 Evertson, C., & Harris, A. (1992). What we know about managing classrooms. Educational Leadership, 49(7), 74–78. Feng, L., Figlio, D. N., & Sass, T. R. (2010). School accountability and teacher mobility. Field, A. P. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Sage Publications Inc. Fish, M. C., & Dane, E. (2000). The Classroom Systems Observation Scale : Development of an instrument to assess classrooms using a systems perspective. Learning Environments Research, 3, 67–92. TNTP (2013). Fixing classroom observations: How common core will change the way we look at teaching. Retrieved from http://tntp.org/assets/documents/TNTP_FixingClassroomObservations_2013.pdf Fixsen, D. L., Blase, K., Metz, A., & Van Dyke, M. (2013). Statewide implementation of evidence-based programs. Exceptional Children, 79(2), 213–230. Foorman, B. R., & Torgesen, J. (2001). Critical Elements of Classroom and Small-Group Instruction Promote Reading Success in All Children. Learning Disabilities Research and Practice, 16(4), 203–212. doi:10.1111/0938-8982.00020 Forbes-Spear, C. (2014). Examining the relationship between implementation and student outcomes: The application of an implementation measurement framework. University of Oregon. Foundation, B. and M. G. (2009). Measures of effective teaching project (MET). Fuchs, L. S., Deno, S. L., & Mirkin, P. K. (1984). The effects of frequent curriculum- based measurement and evaluation on pedagogy , student achievement , and student awareness of learning. American Educational Research Journal, 21(2), 449–460. 120 Gage, N. L. (1989). The paradigm wars and their aftermath: A “historical” sketch of research on teaching since 1989. Educational Research and Evaluation : An International Journal on Theory and Practice, 18(7), 4–10. Gage, N. L., & Needels, M. C. (1989). Process-product research on teaching : A review of criticisms. Elementary School Journal, 89(3), 253–300. Gargani, J., & Strong, M. (2014). Can we identify a successful teacher better, faster, and cheaper? Evidence for innovating teacher observation systems. Journal of Teacher Education, 65(5), 389–401. doi:10.1177/0022487114542519 Gay, L. R., Mills, G. E., & Airasian, P.W. (2009). Educational research: Competencies for analysis and applications. Upper Saddle River, NJ: Pearson Higher Education. George, D., & Mallery, P. (2010). SPSS for windows step by step: A simple guide and reference. 18.0 update (11th ed.). Boston: Allyn & Bacon. Gersten, R., Baker, S. K., Haager, D., & Graves, A. (2005). Exploring the role of teacher quality in predicting reading outcomes for first-grade English learners. Remedial and Special Education, 26(4), 197–206. Gersten, R., Fuchs, L., Compton, D., Coyne, M., Greenwood, C., & Innocenti, M. (2005). Quality Indicators for Group Experimental and Quasi-Experimental Research in Special Education. Exceptional Children, 71(2), 149–164. Gersten, R., Vaughn, S., Deshler, D., & Schiller, E. (1997). What we know about using research findings: Implications for improving Special Education practice. Journal of Learning Disabilities, 30(5), 466–476. 
121 Girolametto, L., & Weitzman, E. (2002). Responsiveness of child care providers in interactions with toddlers and preschoolers. Language, Speech, and Hearing Services in Schools, 33(October), 268–281. Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness : A research synthesis. Retrieved from http://files.eric.ed.gov/fulltext/ED521228.pdf Goe, L., Biggers, K., & Croft, A. (2012). Linking teacher evaluation to professional development: Focusing on improving teaching and learning. Goe, L., & Croft, A. (2009). Methods of evaluating teacher effectiveness (pp. 1–12). Good, R. H., Gruba, J., & Kaminski, R. A. (2002). Best Practices in Using Dynamic Indicators of Basic Early Literacy Skills (DIBELS) in an Outcomes-Driven Model. In Best practices in school psychology IV (Vol. 1, Vol. 2). (pp. 699–720). Washington, DC, US: National Association of School Psychologists. Good, R. H., Kaminski, R. A., Shinn, M., Bratten, J., Laimon, L., Smith, S., & Flindt, N. (2004). Technical adequacy and decision making utility of DIBELS (No. 7). Greenwood, C. R., Horton, B. T., & Utley, C. A. (2002). Academic engagement: Current perspectives on research and practice. School Psychology Review, 31(3), 328–349. Gudmundsdottir, S. (1997). Introduction to the theme issue of “narrative perspectives on research on teaching and teacher education.” Teaching and Teacher Education, 13(I), 1–3. 122 Hagan-Burke, S., Coyne, M. D., Kwok, O.-M., Simmons, D. C., Kim, M., Simmons, L. E., … McSparran Ruby, M. (2013). The effects and interactions of student, teacher, and setting variables on reading outcomes for kindergarteners receiving supplemental reading intervention. Journal of Learning Disabilities, 46(3), 260–77. doi:10.1177/0022219411420571 Hall, T., Vue, G., Strangman, N., & Meyer, A. (2014). Differentiated instruction and implications for UDL implementation. Effective Classroom Practices Report. Retrieved November 07, 2014, from http://aim.cast.org/learn/historyarchive/backgroundpapers#.VF0K8jTF9Og Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutor Quant Methods Psychol, 8(1), 23–34. Hamre, B. K., Goffin, S. G., & Kraft-Sayre, M. (2009). Classroom Assessment Scoring System (CLASS) implementation guide: Measuring and improving classroom interactions in early childhood settings. Charlottesville, VA. Retrieved from teachstone.com/wp-content/uploads/.../CLASSImplementationGuide.pdf Hamre, B. K., & Pianta, R. C. (2005). Can instructional and emotional support in the first-grade classroom make a difference for children at risk of school failure? Child Development, 76(5), 949–967. 123 Hamre, B. K., Pianta, R. C., Mashburn, A. J., & Downer, J. T. (2007). Building a science of classrooms: Application of the CLASS framework in over 4,000 U.S. early childhood and elementary classrooms (pp. 1–35). Retrieved from http://www.researchgate.net/profile/Jason_Downer/publication/237728991_Building _a_Science_of_Classrooms_Application_of_the_CLASS_Framework_in_over_400 0_U.S._Early_Childhood_and_Elementary_Classrooms/links/0046352cc1bf3e4168 000000.pdf Hansen, M., Lemke, M., & Sorensen, N. (2013). Combining multiple performance measures: Do common approaches undermine districts’ personnel evaluation systems?. Washington, D.C. Hanushek, E. A., & Rivkin, S. G. (2010). Using value-added measures of teacher quality (pp. 1–6). doi:10.1037/e722242011-001 Hanushek, E. A., Rivkin, S. G., The, S., Economic, A., The, P. O. F., & Rivkin, G. (2010). 
Harms, T., & Clifford, R. (1980). Early childhood environment rating scale. New York: Teachers College Press.
Harms, T., Clifford, R., & Cryer, D. (1998). Early childhood environment rating scale–revised. New York: Teachers College Press.
Harn, B. A., Forbes-Spear, C., Fritz, R., & Berg, T. (2011). Quality of Intervention Delivery and Receipt (QIDR) observation tool. Eugene, OR.
Harn, B., Parisi, D., & Stoolmiller, M. (2013). Balancing fidelity with flexibility and fit: What do we really know about fidelity of implementation in schools? Exceptional Children, 79(2), 181–193.
Heneman, H., Milanowski, A., Kimball, S. M., & Odden, A. (2006). Standards-based teacher evaluation as a foundation for knowledge- and skill-based pay. CPRE Policy Briefs. Retrieved from http://repository.upenn.edu/cpre_policybriefs/33
Hill, H., & Grossman, P. (2013). Learning from teacher observations: Challenges and opportunities posed by new teacher evaluation systems. Harvard Educational Review, 83(2), 371–384.
Holdheide, L., Browder, D., Warren, S., Buzick, H., & Jones, N. (2012). Using student growth to evaluate educators of students with disabilities: Issues, challenges, and next steps.
Holtzapple, E. (2003). Criterion-related validity evidence for a standards-based teacher evaluation system. Journal of Personnel Evaluation in Education, 17(3), 207–219.
Jackson, A. W., & Davis, G. A. (2000). Turning points 2000: Educating adolescents in the 21st century. Williston, VT: Teachers College Press.
Jerald, C. (2012). Ensuring accurate feedback from observations: Perspectives on practice. Retrieved from https://docs.gatesfoundation.org/documents/ensuring-accuracy-wp.pdf
Jensen, E. (1998). Teaching with the brain in mind. Alexandria, VA: Association for Supervision and Curriculum Development.
Joe, J. N., McClellan, C. A., & Holtzman, S. L. (2016). Reliability and the length and focus of classroom observations. In T. Kane, K. Kerr, & R. Pianta (Eds.), Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project (pp. 415–443). Retrieved from http://k12education.gatesfoundation.org/wp-content/uploads/2015/11/Designing-Teacher-Evaluation-Systems_freePDF.pdf
Joe, J., Tocci, C., Holtzman, S., & Williams, J. (2013). Foundations of observation. Princeton, NJ.
Johnson, E. S., & Semmelroth, C. L. (2012). Examining interrater agreement analyses of a pilot special education observation tool. Journal of Special Education Apprenticeship, 1(2).
Johnson, E., & Semmelroth, C. L. (2013). Special education teacher evaluation: Why it matters, what makes it challenging, and how to address these challenges. Assessment for Effective Intervention, 39(2), 71–82. doi:10.1177/1534508413513315
Jones, N. D., & Brownell, M. T. (2013). Examining the use of classroom observations in the evaluation of special education teachers. Assessment for Effective Intervention, 39(2), 112–124. doi:10.1177/1534508413514103
Justice, L. M. (2006). Evidence-based practice, response to intervention, and the prevention of reading difficulties. Language, Speech, and Hearing Services in Schools, 37, 284–298.
Justice, L. M., Mashburn, A., Hamre, B., & Pianta, R. (2008). Quality of language and literacy instruction in preschool classrooms serving at-risk pupils. Early Childhood Research Quarterly, 23(1), 51–68. doi:10.1016/j.ecresq.2007.09.004
Ho, A. D., & Kane, T. J. (2013). The reliability of classroom observations by school personnel. Retrieved from http://k12education.gatesfoundation.org/wp-content/uploads/2015/12/MET_Reliability-of-Classroom-Observations_Research-Paper.pdf
Jones, R. R., Reid, J. B., & Patterson, G. R. (1975). Naturalistic observation in clinical assessment. Advances in Psychological Assessment, 3, 42–95.
Kane, T. J., & Staiger, D. O. (2012). Gathering feedback for teaching. Measures of Effective Teaching Project: Bill and Melinda Gates Foundation.
Kane, T. J., Staiger, D. O., & McCaffrey, D. F. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Retrieved from http://www.metproject.org/downloads/MET_Gathering_Feedback_Research_Paper.pdf
Kane, T. J., Taylor, E. S., Tyler, J. H., & Wooten, A. L. (2010). Identifying effective classroom practices using student achievement data. Cambridge, MA.
Knight, J. (2007). Instructional coaching. Thousand Oaks, CA.
Kretlow, A. G., & Bartholomew, C. C. (2010). Using coaching to improve the fidelity of evidence-based practices: A review of studies. Teacher Education and Special Education: The Journal of the Teacher Education Division of the Council for Exceptional Children, 33(4), 279–299. doi:10.1177/0888406410371643
La Paro, K., Pianta, R. C., & Stuhlman, M. (2004). Classroom assessment scoring system (CLASS): Findings from the pre-k year. The Elementary School Journal, 104, 409–426.
National Center for Early Development & Learning. (1997). Classroom observation system-kindergarten. Charlottesville: University of Virginia.
Macmillan, C. J. B., & Garrison, J. (1984). Using the "new philosophy of science" in criticizing current research traditions in education. Educational Researcher, 13(10), 15–21.
Marzano, R. J. (2004). Building background knowledge for academic achievement: Research on what works in schools. Association for Supervision and Curriculum Development.
Mashburn, A. J., Pianta, R. C., Barbarin, O. A., Bryant, D., Hamre, B. K., Downer, J. T., … Howes, C. (2008). Measures of classroom quality in prekindergarten and children's development of academic, language, and social skills. Child Development, 79(3), 732–749.
Maxwell, K. L., McWilliam, R. A., Hemmeter, M. L., Ault, M. J., & Schuster, J. W. (2001). Predictors of developmentally appropriate classroom practices in kindergarten through third grade. Early Childhood Research Quarterly, 16, 431–452.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147–163.
McClellan, C., Atkinson, M., & Danielson, C. (2012). Teacher evaluator training and certification: Lessons learned from the Measures of Effective Teaching project. San Francisco, CA.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.
McGuinn, P. (2012). The state of teacher evaluation reform: State education agency capacity and the implementation of new teacher-evaluation systems.
Medley, D. M. (1979). The effectiveness of teachers. In Research on teaching: Concepts, findings, and implications (pp. 11–27).
Milanowski, A. (2004). The relationship between teacher performance evaluation scores and student achievement: Evidence from Cincinnati. Peabody Journal of Education, 79(4), 33–53.
Miles, J., & Banyard, P. (2007). Understanding and using statistics in psychology. Sage.
Mowbray, C. T., Holter, M. C., Teague, G. B., & Bybee, D. (2003). Fidelity criteria: Development, measurement, and validation. American Journal of Evaluation, 24(3), 315–340. doi:10.1177/109821400302400303
Myers, D. M., Simonsen, B., & Sugai, G. (2011). Increasing teachers' use of praise with a response-to-intervention approach. Education and Treatment of Children, 34(1), 35–59.
National Council on Teacher Quality. (2012). State of the states 2012: Teacher effectiveness policies. Washington, DC. Retrieved from http://www.nctq.org/dmsView/State_of_the_States_2012_Teacher_Effectiveness_Policies_NCTQ_Report
National Institute of Child Health and Human Development Early Child Care Research Network. (2000). The relation of child care to cognitive and language development. Child Development, 71(4), 960–980.
NICHD Early Child Care Research Network, & Duncan, G. (2003). Modeling the impacts of child care quality on children's preschool cognitive development. Child Development, 74, 1454–1475.
NICHD Early Child Care Research Network. (1996). Characteristics of infant child care: Factors contributing to positive caregiving. Early Childhood Research Quarterly, 11(3), 269–306. doi:10.1016/S0885-2006(96)90009-5
NICHD Early Child Care Research Network. (2002a). Classroom observation system-first grade. Charlottesville, VA: University of Virginia.
NICHD Early Child Care Research Network. (2002b). Early child care and children's development prior to school entry: Results from the NICHD Study of Early Child Care. American Educational Research Journal, 39(1), 133–164.
NICHD Early Child Care Research Network. (2004). Classroom observation system-fifth grade. Charlottesville, VA: University of Virginia.
Odom, S. L. (2008). The tie that binds: Evidence-based practice, implementation science, and outcomes for children. Topics in Early Childhood Special Education, 29(1), 53–61. doi:10.1177/0271121408329171
Council of Chief State School Officers. (2011). InTASC model core teaching standards: A resource for state dialogue. Washington, DC.
Pianta, R. C. (2003). Standardized classroom observations from pre-k to third grade: A mechanism for improving quality classroom experiences during the P-3 years. Retrieved from http://fcd-us.org/sites/default/files/StandardizedClassroomObservations.pdf
Pianta, R. C., Belsky, J., Houts, R., & Morrison, F. J. (2007). Opportunities to learn in America's elementary classrooms. Science, 315, 1795–1796.
Pianta, R. C., Belsky, J., Houts, R., Morrison, F., & National Institute of Child Health and Human Development Early Child Care Research Network. (2007). Opportunities to learn in America's elementary classrooms. Science, 315, 9–10.
Pianta, R. C., Cox, M. J., Taylor, L., & Early, D. (2013). Kindergarten teachers' practices related to the transition to school: Results of a national survey. The Elementary School Journal, 100(1), 71–86.
Pianta, R. C., La Paro, K., & Hamre, B. (2008). Classroom Assessment Scoring System.
Pianta, R. C., Mashburn, A. J., Downer, J. T., Hamre, B. K., & Justice, L. (2008). Effects of web-mediated professional development resources on teacher-child interactions in pre-kindergarten classrooms. Early Childhood Research Quarterly, 23, 431–451.
Pianta, R. C., La Paro, K. M., Payne, C., Cox, M. J., & Bradley, R. (2002). The relation of kindergarten classroom environment to teacher, family, and school characteristics and child outcomes. The Elementary School Journal, 102(3), 225–238.
Pianta, R., Howes, C., Burchinal, M., Bryant, D., Clifford, R., Early, D., & Barbarin, O. (2005). Features of pre-kindergarten programs, classrooms, and teachers: Do they predict observed classroom quality and child-teacher interactions? Applied Developmental Science, 9(3), 144–159.
Pratt, A., & Logan, J. (2014). Improving language-focused comprehension instruction in primary-grade classrooms: Impacts of the Let's Know! experimental curriculum. Educational Psychology Review, July.
Pressley, M., Roehrig, A. D., Raphael, L., Dolezal, S., Bohn, C. M., Mohan, L., & Hogan, K. (2003). Teaching processes in elementary and secondary education. In Handbook of psychology (Vol. 1). doi:10.1037/005272
Raudenbush, S., Bryk, A., & Congdon, R. (2013). HLM 7.01 for Windows [Hierarchical linear and nonlinear modeling software]. Skokie, IL: Scientific Software International.
Reid, J. B., Skindrud, K. D., Taplin, P. S., & Jones, R. R. (1973, August). The role of complexity in the collection and evaluation of observation data. Paper presented at the meeting of the American Psychological Association, Montreal, Canada.
Repp, A. C., Nieminen, G. S., Olinger, E., & Brusca, R. (1988). Direct observation: Factors affecting the accuracy of observers. Exceptional Children, 55(1), 29–36.
Rimm-Kaufman, S. E., Early, D. M., Cox, M. J., Saluja, G., Pianta, R. C., Bradley, R. H., & Payne, C. (2002). Early behavioral attributes and teachers' sensitivity as predictors of competent behavior in the kindergarten classroom. Journal of Applied Developmental Psychology, 23(4), 451–470. doi:10.1016/S0193-3973(02)00128-4
Rosenshine, B. (1971). Teaching behaviours and student achievement.
Rosenshine, B., & Stevens, R. (1986). Teaching functions. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 376–390). New York, NY: Macmillan.
Ross, J., & Regan, E. (1993). Sharing professional experience: Its impact on professional development. Teaching and Teacher Education, 9(1), 91–106.
Sammons, P., Sylva, K., Melhuish, E., Siraj-Blatchford, I., Taggart, B., & Elliott, K. (2002). The effective provision of pre-school education (EPPE) project: Measuring the impact of pre-school on children's cognitive process over the pre-school period (Vol. 44). London, UK.
Saunders, K. J. (2011). Designing instructional programming for early reading skills. In W. Fisher & C. Piazza (Eds.), Handbook of applied behavior analysis (pp. 92–109). New York, NY: Guilford Press.
Schmoker, M. J. (1999). Results: The key to continuous school improvement. Association for Supervision and Curriculum Development.
Semmelroth, C. L., & Johnson, E. (2013). Measuring rater reliability on a special education observation tool. Assessment for Effective Intervention, 39(3), 131–145. doi:10.1177/1534508413511488
Shaywitz, S. (2008). Overcoming dyslexia: A new and complete science-based program for reading problems at any level. New York: Random House.
Shulman, L. S. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 57(1), 1–23.
Simmons, D. C., Coyne, M. D., Hagan-Burke, S., Kwok, O.-M., Simmons, L. E., Johnson, C., … Crevecoeur, Y. C. (2011). Effects of supplemental reading interventions in authentic contexts: A comparison of kindergarteners' response. Exceptional Children, 77(2), 207–228.
Simmons, D. C., & Kame'enui, E. J. (2003). Scott Foresman Early Reading Intervention. Glenview, IL: Scott Foresman.
Skowron, J. (2001). Powerful lesson planning models. Arlington Heights, IL: Skylight Publishing.
Stecker, P. M., Fuchs, L. S., & Fuchs, D. (2005). Using curriculum-based measurement to improve student achievement: Review of research. Psychology in the Schools, 42(8), 795–819. doi:10.1002/pits.20113
Stecker, P. M., Lembke, E. S., & Foegen, A. (2008). Using progress-monitoring data to improve instructional decision making. Preventing School Failure: Alternative Education for Children and Youth, 52(2), 48–58. doi:10.3200/PSFL.52.2.48-58
Strong, M., Gargani, J., & Hacifazlioglu, O. (2011). Do we know a successful teacher when we see one? Experiments in the identification of effective teachers. Journal of Teacher Education, 62(4), 367–382. doi:10.1177/0022487110390221
Stronge, J. H. (2005). Evaluating teaching: A guide to current thinking and best practice. Corwin Press.
Swanson, H. L. (1999). Reading research for students with LD: A meta-analysis of intervention outcomes. Journal of Learning Disabilities, 32(6), 504–532. doi:10.1177/002221949903200605
Sykes, G., & Bird, T. (1992). Teacher education and the case idea. Review of Research in Education, 18, 457–521.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th international ed.). Pearson.
Tomlinson, C. A. (1999). Mapping a route toward differentiated instruction. Educational Leadership, 57(1), 12–16.
Torgesen, J. K. (2002). The prevention of reading difficulties. Journal of School Psychology, 40(1), 7–26. doi:10.1016/S0022-4405(01)00092-9
Torgesen, J. K., Wagner, R. K., Rashotte, C. A., Rose, E., Lindamood, P., Conway, T., & Garvan, C. (1999). Preventing reading failure in young children with phonological processing disabilities: Group and individual responses to instruction. Journal of Educational Psychology, 91(4), 579–593. doi:10.1037/0022-0663.91.4.579
Waxman, H. C., Huang, S.-Y. L., Anderson, L., & Weinstein, T. (1997). Classroom process differences in inner-city elementary schools. The Journal of Educational Research, 91(1), 49–59. doi:10.1080/00220679709597520
Wiggins, G., & McTighe, J. (1998). Understanding by design (pp. 1–34). Association for Supervision and Curriculum Development.
Wolf, M. (2007). Proust and the squid: The story and science of the reading brain (C. J. Stoodley, Illus.). New York, NY: HarperCollins. Retrieved from http://www.loc.gov/catdir/toc/fy0803/2008297333.html
Woodcock, R., & Johnson, M. (1989). Woodcock-Johnson Psycho-Educational Battery–Revised. Allen, TX: DLM Teaching Resources.
Woodcock, R. W. (1987). Woodcock Reading Mastery Tests–Revised. Circle Pines, MN: American Guidance Service.
Yeaton, W. H., & Sechrest, L. (1981). Critical dimensions in the choice and maintenance of successful treatments: Strength, integrity, and effectiveness. Journal of Consulting and Clinical Psychology, 49(2), 156–167. doi:10.1037/0022-006X.49.2.156
Yopp, H. K., & Yopp, R. H. (2000). Supporting phonemic awareness development in the classroom. The Reading Teacher, 54(2), 130–143. Retrieved from http://www.jstor.org/stable/20204888
Zigmond, N., & Kloo, A. (2011). General and special education are (and should be) different. In J. M. Kauffman (Ed.), Handbook of special education (pp. 160–172).