<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">OJEpi</journal-id><journal-title-group><journal-title>Open Journal of Epidemiology</journal-title></journal-title-group><issn pub-type="epub">2165-7459</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/ojepi.2024.141005</article-id><article-id pub-id-type="publisher-id">OJEpi-130627</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Medicine&amp;Healthcare</subject></subj-group></article-categories><title-group><article-title>
 
 
  Cautionary Remarks When Testing Agreement between Two Raters for Continuous Scale Measurements: A Tutorial in Clinical Epidemiology with Implementation Using R
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mohamed</surname><given-names>M. Shoukri</given-names></name><xref ref-type="aff" rid="aff1"><sub>1</sub></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><label>1</label><addr-line>Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Canada</addr-line></aff><pub-date pub-type="epub"><day>26</day><month>12</month><year>2023</year></pub-date><volume>14</volume><issue>01</issue><fpage>56</fpage><lpage>74</lpage><history><date date-type="received"><day>8,</day>	<month>December</month>	<year>2023</year></date><date date-type="rev-recd"><day>19,</day>	<month>January</month>	<year>2024</year>	</date><date date-type="accepted"><day>22,</day>	<month>January</month>	<year>2024</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Background: When continuous scale measurements are available, agreements between two measuring devices are assessed both graphically and analytically. In clinical investigations, Bland and Altman proposed plotting subject-wise differences between raters against subject-wise averages. In order to scientifically assess agreement, Bartko recommended combining the graphical approach with the statistical analytic procedure suggested by Bradley and Blackwood. The advantage of using this approach is that it enables significance testing and sample size estimation. We noted that the direct use of the results of the regression is misleading and we provide a correction in this regard. 
  Methods: Graphical methods and linear models are used to assess agreement for continuous scale measurements. We demonstrate that the regression output of statistical software should not be used directly, and we provide the correct analytic procedures. The degrees of freedom of the F-statistic are incorrectly reported by the software, and we overcome this problem by introducing the correct analytic form of the F-statistic. Methods for sample size estimation using R functions are also given. 
  Results: We believe that the tutorial and the R-codes are useful tools for testing and estimating agreement between two rating protocols for continuous scale measurements. The interested reader may use the codes and apply them to their available data when the issue of agreement between two raters is the subject of interest.
 
</p></abstract><kwd-group><kwd>Limits of Agreement</kwd><kwd>Pitman and Morgan Tests</kwd><kwd>Test of Parallelism</kwd><kwd>The Arcsine Variance Stabilizing Transformation</kwd><kwd>Sample Size Estimation</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>The subject of agreement between two or more raters is of interest to investigators who work in medical research as well as in the physical sciences. When continuous scale measurements are available, agreement between two measuring devices or medical diagnostic tools is assessed both graphically and analytically. In clinical investigations, Bland and Altman [<xref ref-type="bibr" rid="scirp.130627-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.130627-ref2">2</xref>] suggested plotting subject-wise differences between raters against subject-wise averages. Bartko [<xref ref-type="bibr" rid="scirp.130627-ref3">3</xref>] recommended combining the graphical approach with the statistical analytic procedure based on linear regression models suggested by Bradley and Blackwood [<xref ref-type="bibr" rid="scirp.130627-ref4">4</xref>].</p><p>According to Stephenson &amp; Babiker [<xref ref-type="bibr" rid="scirp.130627-ref5">5</xref>], “Clinical epidemiology can be defined as the investigation and control of the distribution and determinants of disease”. Last [<xref ref-type="bibr" rid="scirp.130627-ref6">6</xref>] felt that the term was an oxymoron, and that its increasing popularity in many different medical schools was a serious issue.</p><p>Clinical epidemiology aims to optimize the diagnostic, treatment and prevention processes for an individual patient, based on an assessment of the diagnostic and treatment process using epidemiological research data [<xref ref-type="bibr" rid="scirp.130627-ref7">7</xref>]. A central tenet of clinical epidemiology is that every clinical decision must be based on rigorous, evidence-based science. 
The objectives of clinical epidemiology are primarily to develop epidemiologically sound clinical guidelines and standards for diagnosis, disease progression, prognosis, treatment and prevention. The data obtained in epidemiological studies are also applicable to the epidemiological justification of preventive programs for communicable and noncommunicable diseases [<xref ref-type="bibr" rid="scirp.130627-ref8">8</xref>].</p><p>A key aspect of clinical epidemiology is the evaluation of the effectiveness of treatment and prevention medicines [<xref ref-type="bibr" rid="scirp.130627-ref8">8</xref>]. To deliver reliable results, the diagnoses must be reported error-free. Measures of reliability and agreement among diagnostic tools play an important role in this regard.</p><p>Reliability and agreement are important issues in disease diagnosis and classification, the development of screening tools, quality assurance, and the evaluation of diagnostic tools for clinical investigations (Kottner et al. [<xref ref-type="bibr" rid="scirp.130627-ref9">9</xref>]).</p><p>When the responses are interval scale measurements, reliability is quantified by the intraclass correlation coefficient (ICC). When the measured responses are categorical, agreement between raters is quantified by the well-known “Kappa” coefficient. Agreement between two raters when the responses are interval scale measurements is quantified by assessing both the bias and the precision of the rating devices. The approach proposed by Bradley and Blackwood [<xref ref-type="bibr" rid="scirp.130627-ref4">4</xref>] tests for bias and precision simultaneously. Their test is obtained from the simple regression of the case-wise differences between the raters on the case-wise means of the ratings. 
In other words, we say that agreement between two measuring devices or two raters exists if three conditions are satisfied: the two sets of measurements are highly correlated; the two methods are equally precise; and the two methods are unbiased relative to each other. The approach tests the intercept and the slope jointly. Testing that the intercept equals zero is equivalent to testing for the absence of bias, while testing that the slope equals zero is equivalent to testing equality of precisions. This joint test of the intercept and slope coefficients in simple linear regression is not straightforward. Our main objective in this paper is to caution against relying on the automatic results produced by commercial statistical programs for regression analysis and to present alternative approaches. Issues of sample size estimation are discussed as well.</p></sec><sec id="s2"><title>2. Methods</title><sec id="s2_1"><title>2.1. Wilks’ Tests</title><p>Let (x<sub>i1</sub>, x<sub>i2</sub>), i = 1, 2, ⋯, n denote a random sample of size n drawn from a bivariate normal distribution whose parameters are (μ<sub>1</sub>, μ<sub>2</sub>, σ<sub>1</sub><sup>2</sup>, σ<sub>2</sub><sup>2</sup>, ρ<sub>12</sub>).</p><p>The summary statistics of the data are: X̄<sub>1</sub> = mean(X<sub>1</sub>), X̄<sub>2</sub> = mean(X<sub>2</sub>), S<sub>1</sub><sup>2</sup> = variance(X<sub>1</sub>), S<sub>2</sub><sup>2</sup> = variance(X<sub>2</sub>), and ρ<sub>12</sub> is the correlation between X<sub>1</sub> and X<sub>2</sub>.</p><p>The ultimate goal is to test the simultaneous null hypothesis H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub> ∩ σ<sub>1</sub><sup>2</sup> = σ<sub>2</sub><sup>2</sup>, evaluate its power, and determine approximately the sample size needed to achieve prespecified levels of power.</p><p>The above hypothesis has two components. The first is H<sub>0</sub>: σ<sub>1</sub><sup>2</sup> − σ<sub>2</sub><sup>2</sup> = 0, which states that the two raters have equal precision. The second is H<sub>0</sub>: μ<sub>1</sub> − μ<sub>2</sub> = 0, which states that the two raters are unbiased relative to each other.</p><p>The null hypothesis H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub> ∩ σ<sub>1</sub><sup>2</sup> = σ<sub>2</sub><sup>2</sup> is an extension of the parallel test. 
Bradley and Blackwood [<xref ref-type="bibr" rid="scirp.130627-ref4">4</xref>] proposed a simple statistic to test the above hypothesis. This test has applications in agreement studies. Separate tests for the equality of two means or of two variances of dependent variables are well documented in the statistical literature. To control for multiplicity when both are used for a simultaneous test, researchers apply a Bonferroni correction: the test size α is split into α/2 for the test of equality of means (using the paired t-test) and α/2 for the test of equality of two correlated variances (Morgan [<xref ref-type="bibr" rid="scirp.130627-ref10">10</xref>] and Pitman [<xref ref-type="bibr" rid="scirp.130627-ref11">11</xref>]), known as the Morgan-Pitman test.</p><p>The null hypothesis of equality of means is tested using the following statistic:</p><p><disp-formula><label>(1)</label><tex-math>Z_m = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left(S_1^2 + S_2^2 - 2 S_1 S_2 \rho_{12}\right)/(n - 1)}}</tex-math></disp-formula></p><p>which has a t-distribution with (n − 1) degrees of freedom when H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub> is true.</p><p>On the other hand, the null hypothesis of equality of variances (equality of precisions) is tested using the statistic:</p><p><disp-formula><label>(2)</label><tex-math>Z_v = \frac{\sqrt{n - 2}\,\left(S_1^2 - S_2^2\right)}{2 S_1 S_2 \sqrt{1 - \rho_{12}^2}}</tex-math></disp-formula></p><p>which has a t-distribution with (n − 2) degrees of freedom when H<sub>0</sub>: σ<sub>1</sub><sup>2</sup> = σ<sub>2</sub><sup>2</sup> is true.</p><p>Earlier, Wilks [<xref ref-type="bibr" rid="scirp.130627-ref11">11</xref>] [<xref ref-type="bibr" rid="scirp.130627-ref12">12</xref>] suggested a simultaneous test of equality of correlated means and correlated variances using the statistic:</p><p><disp-formula><label>(3)</label><tex-math>Z_{mv} = \frac{S_1^2 S_2^2 \left(1 - \rho_{12}^2\right)}{S^2 \left(1 + \rho_{12}\right)\left[S^2 \left(1 - \rho_{12}\right) + C\right]}</tex-math></disp-formula></p><p>where</p><p><disp-formula><tex-math>S^2 = \tfrac{1}{2}\left[S_1^2 + S_2^2\right], \qquad C = \left(\bar{X}_1 - \bar{X}_2\right)^2 / 2</tex-math></disp-formula></p><p>Then Q = −2 log(Z<sub>mv</sub>) ~ χ<sup>2</sup><sub>(2)</sub>.</p></sec><sec id="s2_2"><title>2.2. Example</title><p>We apply the methodology presented in this paper to serum alanine aminotransferase (ALT). ALT is a critical parameter for both the assessment and the follow-up of patients with liver disease. Therefore, establishing the repeatability and the precision of ALT measurements as a diagnostic marker is of paramount importance. Regardless of gender or body mass index (BMI) [<xref ref-type="bibr" rid="scirp.130627-ref13">13</xref>], the normal range was most often estimated from a population that included patients with subclinical liver disease, including non-alcoholic fatty liver disease (NAFLD), which is now documented as the most prevalent cause of chronic liver disease worldwide [<xref ref-type="bibr" rid="scirp.130627-ref14">14</xref>]. Recent studies have recommended establishing normal ranges for ALT separately in males and females [<xref ref-type="bibr" rid="scirp.130627-ref15">15</xref>].</p><p>In a large tertiary hospital-based registry, the available data were collected from 30 males. 
The ALT levels were evaluated twice, once in the department of laboratory medicine (rater 1, with values denoted by X<sub>i1</sub>) and once in the department of pathology (rater 2, with values denoted by X<sub>i2</sub>).</p><p>Rater 1: Department of laboratory medicine.</p><p>Rater 2: Department of pathology.</p><p>ALT1&lt;-c(6, 6, 67, 97, 57, 63, 55, 192, 212, 182, 317, 303, 62, 64, 64, 54, 54, 67, 68, 135, 68, 191, 262, 151, 70, 75, 76, 5, 6, 61, 74)</p><p>ALT2&lt;-c(8, 8, 69, 99, 59, 63, 57, 191, 211, 184, 319, 305, 64, 66, 66, 56, 56, 69, 70, 137, 70, 193, 261, 153, 72, 77, 78, 5, 8, 63, 73)</p><p>The ALT data have the following summary statistics:</p><p>X̄<sub>1</sub> = 106.967, X̄<sub>2</sub> = 108.500, S<sub>1</sub> = 81.91, S<sub>2</sub> = 81.56 and ρ<sub>12</sub> = 0.999, and the sample size n = 30.</p><p>Therefore</p><p>Z<sub>m</sub> = −7.686 and p-value = 0.00001,</p><p>This means that the data support the hypothesis that the two raters are biased relative to each other. On the other hand:</p><p>Z<sub>v</sub> = 1.82, p-value = 0.078</p><p>This means that, at the 5% level, the hypothesis that the two raters are equally precise is not rejected.</p><p>The omnibus test of equality of the two means and the two variances gives:</p><p>Z<sub>mv</sub> = 0.469 and Q = 1.52, with p-value = 0.468. Therefore, we would accept the hypothesis that the two raters are unbiased relative to each other and are equally precise. Together with the fact that ρ<sub>12</sub> is quite high, we may be tempted to conclude that there is strong agreement between the two raters. This conclusion is flawed, since the separate test of the means shows that the two raters are not unbiased relative to each other.</p></sec></sec><sec id="s3"><title>3. Bland &amp; Altman’s and Bradley-Blackwood (1989) Methodologies</title><p>Bradley and Blackwood [<xref ref-type="bibr" rid="scirp.130627-ref4">4</xref>] proposed using the F-statistic for testing the significance of the simple regression parameters in order to assess agreement between the two raters. 
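Here we summarize their methods.</p><p>Let y = x<sub>1</sub> − x<sub>2</sub>, and x = ½(x<sub>1</sub> + x<sub>2</sub>).</p><p>From multivariate normal theory, the regression of y on x is given by the conditional expectation:</p><p><disp-formula><label>(4)</label><tex-math>E[y \mid x] = \mu_y + \rho_{xy}\,\frac{\sigma_y}{\sigma_x}\,(x - \mu_x)</tex-math></disp-formula></p><p>Moreover,</p><p><disp-formula><label>(5)</label><tex-math>\mathrm{var}[y \mid x] = \sigma_y^2\left(1 - \rho_{xy}^2\right)</tex-math></disp-formula></p><p>The regression Equation (4) has parameters that can be easily expressed as functions of the bivariate normal parameters BVN(μ<sub>1</sub>, μ<sub>2</sub>, σ<sub>1</sub><sup>2</sup>, σ<sub>2</sub><sup>2</sup>, ρ<sub>12</sub>), where BVN stands for bivariate normal.</p><p>From the algebra of the bivariate normal distribution we have:</p><p>E(y) ≡ μ<sub>y</sub> = μ<sub>1</sub> − μ<sub>2</sub></p><p>var(y) ≡ σ<sub>y</sub><sup>2</sup> = σ<sub>1</sub><sup>2</sup> + σ<sub>2</sub><sup>2</sup> − 2ρ<sub>12</sub>σ<sub>1</sub>σ<sub>2</sub></p><p>E(x) ≡ μ<sub>x</sub> = ½(μ<sub>1</sub> + μ<sub>2</sub>)</p><p>var(x) ≡ σ<sub>x</sub><sup>2</sup> = ¼[σ<sub>1</sub><sup>2</sup> + σ<sub>2</sub><sup>2</sup> + 2ρ<sub>12</sub>σ<sub>1</sub>σ<sub>2</sub>].</p><p>We can also show that the correlation between x and y is given by:</p><p><disp-formula><label>(6)</label><tex-math>\rho_{xy} \equiv \mathrm{corr}(x, y) = \frac{\sigma_1^2 - \sigma_2^2}{\left[\left(\sigma_1^2 + \sigma_2^2\right)^2 - 4\rho_{12}^2\sigma_1^2\sigma_2^2\right]^{1/2}}</tex-math></disp-formula></p><p>We also note that:</p><p><disp-formula><label>(7)</label><tex-math>\frac{\rho_{xy}^2}{1 - \rho_{xy}^2} = \frac{\left(\sigma_1^2 - \sigma_2^2\right)^2}{4\sigma_1^2\sigma_2^2\left(1 - \rho_{12}^2\right)}</tex-math></disp-formula></p><p>The quantity in (7) mimics the effect size, or the non-centrality parameter, which is usually used to evaluate the power of the test of significance on the regression parameters when the null hypothesis does not hold. Writing Equation (4) in simple linear regression format we get:</p><p><disp-formula><label>(8)</label><tex-math>E[y \mid x] = \alpha + \beta\,(x - \mu_x)</tex-math></disp-formula></p><p>Comparing (4) and (8) we have:</p><p><disp-formula><label>(9)</label><tex-math>\alpha = \mu_1 - \mu_2</tex-math></disp-formula></p><p><disp-formula><label>(10)</label><tex-math>\beta = \rho_{xy}\left[\frac{\sigma_y}{\sigma_x}\right]</tex-math></disp-formula></p><p>In terms of the bivariate normal population parameters we can write:</p><p><disp-formula><label>(11)</label><tex-math>\beta = \frac{2\left(\sigma_1^2 - \sigma_2^2\right)\left[\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2\right]^{1/2}}{\left[\left(\sigma_1^2 + \sigma_2^2\right)^2 - 4\rho_{12}^2\sigma_1^2\sigma_2^2\right]^{1/2}\left[\sigma_1^2 + \sigma_2^2 + 2\rho_{12}\sigma_1\sigma_2\right]^{1/2}}</tex-math></disp-formula></p><p>As can be seen from (8), the two raters are deemed unbiased relative to each other whenever:</p><p>α = μ<sub>1</sub> − μ<sub>2</sub> = 0.</p><p>That is, when the intercept of the linear regression equation is 0. 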
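</p><p>The way the slope depends on the two variances can be made explicit with a short derivation. This is a sketch that uses only the definitions of x and y above and the standard result that a least-squares slope equals cov(x, y)/var(x):</p><p><disp-formula><tex-math>\mathrm{cov}(x, y) = \mathrm{cov}\left(\tfrac{1}{2}(x_1 + x_2),\; x_1 - x_2\right) = \tfrac{1}{2}\left(\sigma_1^2 - \sigma_2^2\right)</tex-math></disp-formula></p><p><disp-formula><tex-math>\beta = \frac{\mathrm{cov}(x, y)}{\sigma_x^2} = \frac{2\left(\sigma_1^2 - \sigma_2^2\right)}{\sigma_1^2 + \sigma_2^2 + 2\rho_{12}\sigma_1\sigma_2}</tex-math></disp-formula></p><p>The cross-covariance terms cancel, so the numerator involves only the difference of the two variances. The second expression is the simplified form of (11), because the first bracket in the denominator of (11) factors into the product of the other two brackets.</p><p>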
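From Equation (11), the slope β of the regression model is identically 0 when σ<sub>1</sub><sup>2</sup> = σ<sub>2</sub><sup>2</sup>, that is, when the two raters are equally precise. Hence, testing the null hypothesis H<sub>0</sub>: μ<sub>1</sub> − μ<sub>2</sub> = 0 ∩ σ<sub>1</sub><sup>2</sup> − σ<sub>2</sub><sup>2</sup> = 0 is equivalent to testing:</p><p><disp-formula><label>(12)</label><tex-math>H_0: \alpha = 0 \,\cap\, \beta = 0</tex-math></disp-formula></p><p>We shall test this hypothesis against the general alternative:</p><p>H<sub>1</sub>: α = α<sub>1</sub> ≠ 0 ∩ β = β<sub>1</sub> ≠ 0.</p><p>The analytic expression of the statistic used to test the omnibus null hypothesis (12) was derived in [<xref ref-type="bibr" rid="scirp.130627-ref16">16</xref>] and is given by:</p><p><disp-formula><label>(13)</label><tex-math>F = \frac{n}{2\hat{\sigma}_y^2}\left[\hat{\alpha}^2 + 2\hat{\alpha}\hat{\beta}\bar{x} + \hat{\beta}^2\left(S_x^2 + \bar{x}^2\right)\right]</tex-math></disp-formula></p><p>The elements of the right-hand side of (13) are:</p><p><disp-formula><tex-math>S_x^2 = SS_x / n</tex-math></disp-formula></p><p><disp-formula><label>(14)</label><tex-math>\hat{\sigma}_y^2 = \frac{1}{n - 2}\left[SS_y - \left(SS_{xy}\right)^2 / SS_x\right]</tex-math></disp-formula></p><p><disp-formula><tex-math>\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i</tex-math></disp-formula></p><p><disp-formula><tex-math>SS_x = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad SS_y = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2</tex-math></disp-formula></p><p><disp-formula><tex-math>\hat{\beta} = SS_{xy} / SS_x</tex-math></disp-formula></p><p><disp-formula><tex-math>SS_{xy} = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)</tex-math></disp-formula></p><p>and α̂ = ȳ − β̂x̄, where n is the sample size.</p><p>In the context of agreement between two raters, Bland and Altman [<xref ref-type="bibr" rid="scirp.130627-ref2">2</xref>] proposed a graphical plot in which the horizontal axis represents each subject’s mean of the two raters’ measurements, ½(x<sub>1</sub> + x<sub>2</sub>), and the vertical axis represents the difference between the two ratings, y = x<sub>1</sub> − x<sub>2</sub>, for each individual. 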
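</p><p>Equations (13) and (14) can be evaluated directly from the paired measurements. The following is a minimal sketch in R and not the authors’ code: the function name bb_F and its arguments x1 and x2 are hypothetical, and the function simply transcribes the formulas as printed above.</p><p>bb_F &lt;- function(x1, x2) {</p><p>stopifnot(length(x1) == length(x2)) # paired ratings are assumed</p><p>y &lt;- x1 - x2 # subject-wise differences</p><p>x &lt;- (x1 + x2) / 2 # subject-wise averages</p><p>n &lt;- length(y)</p><p>SSx &lt;- sum((x - mean(x))^2)</p><p>SSy &lt;- sum((y - mean(y))^2)</p><p>SSxy &lt;- sum((x - mean(x)) * (y - mean(y)))</p><p>b &lt;- SSxy / SSx # slope estimate</p><p>a &lt;- mean(y) - b * mean(x) # intercept estimate</p><p>s2 &lt;- (SSy - SSxy^2 / SSx) / (n - 2) # residual variance, Equation (14)</p><p>Sx2 &lt;- SSx / n</p><p>Fstat &lt;- n * (a^2 + 2 * a * b * mean(x) + b^2 * (Sx2 + mean(x)^2)) / (2 * s2) # Equation (13)</p><p>c(F = Fstat, p_value = pf(Fstat, 2, n - 2, lower.tail = FALSE))</p><p>}</p><p>For two numeric vectors of equal length, bb_F(x1, x2) returns the statistic of (13) together with its upper-tail p-value from the central F distribution with 2 and n − 2 degrees of freedom.</p><p>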
Bartko [<xref ref-type="bibr" rid="scirp.130627-ref3">3</xref>] recommended that, in agreement studies where measurements are reported on a continuous scale, both the graphical approach and the ANOVA of regression be used as a formal test of the absence of bias and of equal precision.</p><p>The null hypothesis is rejected when the test statistic exceeds the critical value of the F<sub>2, n−2</sub> distribution. That is, H<sub>0</sub> is rejected at significance level α if F &gt; F<sub>α, 2, n−2</sub>, where F<sub>α, 2, n−2</sub> is the 100(1 − α)th percentile of the F<sub>2, n−2</sub> distribution.</p><p>When the null hypothesis does not hold, the distribution of the test statistic is a non-central F-distribution (F<sub>2, n−2, λ</sub>) with non-centrality parameter</p><p><disp-formula><label>(15)</label><tex-math>\lambda = \frac{\left(\alpha_1 + \beta_1 E(x)\right)^2 + \beta_1^2 \sigma_x^2}{\sigma_y^2}</tex-math></disp-formula></p><p>The elements of λ are given by:</p><p>α<sub>1</sub> = μ<sub>1</sub> − μ<sub>2</sub></p><p>E(x) = ½(μ<sub>1</sub> + μ<sub>2</sub>)</p><p>σ<sub>y</sub> = [σ<sub>1</sub><sup>2</sup> + σ<sub>2</sub><sup>2</sup> − 2ρ<sub>12</sub>σ<sub>1</sub>σ<sub>2</sub>]<sup>1/2</sup></p><p>σ<sub>x</sub> = ½[σ<sub>1</sub><sup>2</sup> + σ<sub>2</sub><sup>2</sup> + 2ρ<sub>12</sub>σ<sub>1</sub>σ<sub>2</sub>]<sup>1/2</sup>,</p><p>and</p><p>β<sub>1</sub> = ρ<sub>xy</sub>σ<sub>y</sub>/σ<sub>x</sub>.</p><p>The power of the test, that is, the probability of rejecting a false null hypothesis, is Pr[F<sub>2, ν, λ</sub> &gt; F<sub>α, 2, ν</sub>], with ν = n − 2 being the degrees of freedom of the denominator of the F statistic.</p><p>We propose the following flowchart (<xref ref-type="fig" rid="fig1">Figure 1</xref>) to guide the testing of agreement.</p><p>We illustrate the methodology using the biological data given in Example 1.</p><p>Example 1 continued: Unified approach to testing agreement using the ALT data:</p><p>We have two data sets of ALT measurements from the same 30 subjects. 
We shall use R to construct the Bland and Altman limits-of-agreement plot and use ANOVA to analyze the simple linear regression of the pair-wise difference on the pair-wise average.</p><p>df=data.frame(ALT1,ALT2)</p><p>x1=as.numeric(df$ALT1)</p><p>x2=as.numeric(df$ALT2)</p><p>df=data.frame(x1,x2)</p><p>head(df)</p><p>df$x=(df$x1+df$x2)/2 # subject-wise average</p><p>df$y=df$x1-df$x2 # subject-wise difference</p><p>N=nrow(df)</p><p>N</p><p>Analysis:</p><p>Step 1: Bland and Altman graphical representation (R code)</p><p>ssy=(N-1)*var(df$y) # corrected sum of squares of y; var() divides by N-1</p><p>ssy</p><p>ssxy=sum((df$x-mean(df$x))*(df$y-mean(df$y)))</p><p>ssxy</p><p>ssx=(N-1)*var(df$x) # corrected sum of squares of x</p><p>ssx</p><p>sig=(ssy-(ssxy^2/ssx))/(N-2)</p><p>sig # residual mean square, Equation (14)</p><p>library(ggplot2)</p><p>library(sadists)</p><p>df$x&lt;-rowMeans(df[,c("x1","x2")]) # average of the two ratings only</p><p>df$y&lt;-df$x1-df$x2</p><p>head(df)</p><p>cor(df$x,df$y)</p><p>mean_diff&lt;-mean(df$y)</p><p>lower&lt;-mean_diff-1.96*sd(df$y) # lower limit of agreement</p><p>lower</p><p>upper&lt;-mean_diff+1.96*sd(df$y) # upper limit of agreement</p><p>upper</p><p>ggplot(df,aes(x=x,y=y))+ # plot the differences against the averages</p><p>geom_point(size=5)+</p><p>geom_hline(yintercept=mean_diff)+</p><p>geom_hline(yintercept=lower, color="red",linetype="dashed")+</p><p>geom_hline(yintercept=upper, color="red",linetype="dashed")+</p><p>ggtitle("Bland-Altman Plot")+</p><p>ylab("Difference Between X1 and X2")+</p><p>xlab("Average of X1 and X2")</p><p>From <xref ref-type="fig" rid="fig2">Figure 2</xref>, one may conclude that there is strong agreement between the two sets of readings, since all the points fall within the limits of agreement.</p><p>Step 2: Testing for agreement using the ANOVA of regression and setting the Type I error rate at 25%.</p><p>Residual standard error: 1.032 on 28 degrees of freedom.</p><p>ANOVA of Regression:</p><p><xref ref-type="table" rid="table1"><xref ref-type="table" rid="table">Table </xref>1</xref> provides the results of the regression analysis produced by R, and <xref ref-type="table" rid="table2">Table 2</xref> summarizes the ANOVA results of the regression 
model.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1"><xref ref-type="table" rid="table">Table </xref>1</xref></label><caption><title> The results of the regression of the difference “y” on the pairwise mean “x”</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >Std. Error</th><th align="center" valign="middle" >t value</th><th align="center" valign="middle" >Pr (&gt;|t|)</th></tr></thead><tr><td align="center" valign="middle" >(Intercept)</td><td align="center" valign="middle" >−1.99848</td><td align="center" valign="middle" >0.313822</td><td align="center" valign="middle" >−6.368</td><td align="center" valign="middle" >6.83E−07</td></tr><tr><td align="center" valign="middle" >df$x</td><td align="center" valign="middle" >0.005784</td><td align="center" valign="middle" >0.003121</td><td align="center" valign="middle" >1.853</td><td align="center" valign="middle" >0.0744</td></tr></tbody></table></table-wrap><table-wrap id="table2" ><label><xref ref-type="table" rid="table2"><xref ref-type="table" rid="table">Table </xref>2</xref></label><caption><title> The results of the ANOVA of regression</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Df</th><th align="center" valign="middle" >Sum Sq</th><th align="center" valign="middle" >Mean Sq</th><th align="center" valign="middle" >F-value</th><th align="center" valign="middle" >Pr (&gt;F)</th></tr></thead><tr><td align="center" valign="middle" >df$x</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >3.6566</td><td align="center" valign="middle" >3.6566</td><td align="center" valign="middle" >3.4346</td><td align="center" valign="middle" >0.07441</td></tr><tr><td align="center" valign="middle" >Residual</td><td align="center" valign="middle" >28</td><td align="center" valign="middle" 
>29.8101</td><td align="center" valign="middle" >1.0646</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td></tr></tbody></table></table-wrap><p>The results of the hand calculation of the F-statistic and the corresponding p-value:</p><p>model_test=lm(df$y~df$x,data=df) # full model of Step 2</p><p>error=sum((df$y-predict(model_test))^2) # residual sum of squares</p><p>MSE=error/(N-2)</p><p>total=sum((df$y-mean(df$y))^2) # total corrected sum of squares</p><p>reg=total-error</p><p>MREG=reg/2</p><p>F_full_model=MREG/MSE</p><p>F_full_model = 1.717, which is identical to F-ANOVA/2 = 3.4346/2</p><p>SSE_full_model = 29.8</p><p>Error mean square = 29.8/28 = 1.065</p><p>When we use any of the regression procedures available in R, SAS or SPSS, we obtain exactly the same output as shown in <xref ref-type="table" rid="table1"><xref ref-type="table" rid="table">Table </xref>1</xref> and <xref ref-type="table" rid="table2"><xref ref-type="table" rid="table">Table </xref>2</xref>. The correct value of the F statistic is 1.717. This can be verified by direct calculation of F from the analytic expression in (13).</p><p>Therefore, the F-statistic and the corresponding p-value produced by the software are not correct. We can then obtain the correct p-value using the function:</p><p>p_value=pf(1.717, 2, 28, ncp=0, lower.tail = FALSE, log.p = FALSE)</p><p>p_value= 0.198</p><p>Based on the above p-value of the global test of agreement, one may conclude that there is agreement between the two sets of ALT measurements.</p><p>However, when we examine the equality of precisions and the equality of means separately, we reach different conclusions. It is of interest now to see whether the two methods are equally precise.</p><p>That is, we would like to test the hypothesis H<sub>01</sub>: σ<sub>1</sub><sup>2</sup> = σ<sub>2</sub><sup>2</sup>, or equivalently:</p><p>H<sub>01</sub>: β = 0.</p><p>To test this hypothesis, we fit a regression model without intercept, where the dependent variable is the difference (y) and the independent variable is the mean of the two observations per subject (x). 
The R code to fit a linear model without intercept is given as:</p><p>model_noint=lm(df$y~0+df$x,data=df)</p><p>summary(model_noint)</p><p>anova(model_noint)</p><p>The R-output of the regression model without intercept:</p><p>Analysis of Variance Table</p><p>Residual standard error: 1.586 on 29 degrees of freedom. From the ANOVA table, the F-statistic: 12.32 on 1 and 29 DF, p-value: 0.001483.</p><p>We need to pay close attention to the results of <xref ref-type="table" rid="table3"><xref ref-type="table" rid="table">Table </xref>3</xref> and <xref ref-type="table" rid="table4"><xref ref-type="table" rid="table">Table </xref>4</xref>. Analytically, the residual sum of squares carries 28 degrees of freedom, not 29 as given by the R-output. This means that the F-statistic and the corresponding p-value produced by R are not correct. The correct residual mean square is 72.986/28 = 2.6066, and the F-statistic is 31.014/2.6066 = 11.898. Consequently, the p-value of the ANOVA test of the hypothesis of equality of precisions is:</p><p>p_value=pf(11.898, 1, 28, ncp=0, lower.tail = FALSE, log.p = FALSE)</p><p>p_value= 0.0018. We conclude then that the two methods are not equally precise.</p><p>We now proceed to test the hypothesis that the two methods are unbiased relative to each other. That is, to test H<sub>02</sub>: μ<sub>1</sub> − μ<sub>2</sub> = 0, or equivalently to test H<sub>02</sub>: α = 0. 
We use R to test for the significance of the intercept, using a regression model that does not have a slope parameter:</p><p>model_noslo=lm(df$y~1,data=df)</p><p>summary(model_noslo)</p><p>anova(model_noslo)</p><table-wrap id="table3" ><label><xref ref-type="table" rid="table3"><xref ref-type="table" rid="table">Table </xref>3</xref></label><caption><title> The output of the regression model without intercept coefficient</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >Std. Error</th><th align="center" valign="middle" >t value</th><th align="center" valign="middle" >Pr (&gt;|t|)</th></tr></thead><tr><td align="center" valign="middle" >df$x</td><td align="center" valign="middle" >−0.01011</td><td align="center" valign="middle" >0.002881</td><td align="center" valign="middle" >−3.51</td><td align="center" valign="middle" >0.00148 **</td></tr></tbody></table></table-wrap><table-wrap id="table4" ><label><xref ref-type="table" rid="table4"><xref ref-type="table" rid="table">Table </xref>4</xref></label><caption><title> ANOVA of the regression model that has no intercept parameter</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Df</th><th align="center" valign="middle" >Sum Sq</th><th align="center" valign="middle" >Mean Sq</th><th align="center" valign="middle" >F-value</th><th align="center" valign="middle" >Pr (&gt;F)</th></tr></thead><tr><td align="center" valign="middle" >df$x</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >31.014</td><td align="center" valign="middle" >31.0142</td><td align="center" valign="middle" >12.323</td><td align="center" valign="middle" >0.001483 **</td></tr><tr><td align="center" valign="middle" >Residual</td><td align="center" valign="middle" >29</td><td align="center" valign="middle" >72.986</td><td align="center" valign="middle" >2.5168</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td></tr></tbody></table></table-wrap><p>The results of <xref ref-type="table" rid="table5"><xref ref-type="table" rid="table">Table </xref>5</xref> and <xref ref-type="table" rid="table6"><xref ref-type="table" rid="table">Table </xref>6</xref> need to be adjusted. We note that the residual degrees of freedom produced by either R or SAS are wrong; they should be n − 2 = 28. Moreover, the results of this test cannot be accepted because the program fails to produce an F-statistic. This was also the case when we used the SAS program.</p><p>It is recommended to test the hypothesis of the absence of relative bias using the paired t-test on the original data (x<sub>1</sub>, x<sub>2</sub>).</p><p>Paired t-test as an alternative test of relative unbiasedness:</p><p>t = −7.5692, df = 30, p-value = 1.934e−08.</p><p>Alternative hypothesis: the true mean difference is not equal to 0.</p><p>95 percent confidence interval:</p><p>−1.884241 −1.083501.</p><p>That is, the two raters are not unbiased relative to each other. This is similar to the result of Wilks’ asymptotic test.</p><p>As we can see, there is a contradiction between the results based on the omnibus test, where agreement was confirmed, and the results based on the individual tests of the components of agreement. However, this contradiction can be resolved if we declare a priori that agreement holds only if the p-value of the omnibus F-statistic exceeds 25%.</p><p>Example 2: Agreement between two sets of “Area under receiver operating characteristics” (AUROC):</p><p>Accurate diagnosis of a disease is in many situations the first step toward its therapy. The performance of a diagnostic test is commonly compared to an infallible or reference test, usually called a “gold standard”, and then measured by a pair of indices such as sensitivity (Se) and specificity (Sp). 
Sensitivity is defined as the probability of testing positive given a person is diseased, and specificity is defined as the probability of testing negative given a person is disease-free. Other frequently used indices include positive and negative predictive values (PPV and NPV), and positive and negative diagnostic likelihood ratios (LR+ and LR−). PPV is defined as the probability of being diseased given a positive index test result, and NPV is defined as the probability of being disease-free given a negative index test result. An important measure of diagnostic accuracy which combines both sensitivity and specificity is the Area under the Receiver Operating Characteristics curve, (AUROC).</p><table-wrap id="table5" ><label><xref ref-type="table" rid="table5"><xref ref-type="table" rid="table">Table </xref>5</xref></label><caption><title> Fitting linear regression model without slope parameter</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >Std. 
Error</th><th align="center" valign="middle" >t value</th><th align="center" valign="middle" >Pr (&gt;|t|)</th></tr></thead><tr><td align="center" valign="middle" >(Intercept)</td><td align="center" valign="middle" >−1.5333</td><td align="center" valign="middle" >0.1961</td><td align="center" valign="middle" >−7.818</td><td align="center" valign="middle" >1.27e−08 ***</td></tr><tr><td align="center" valign="middle"  colspan="5"  >Residual standard error: 1.074 on 29 degrees of freedom</td></tr></tbody></table></table-wrap><table-wrap id="table6" ><label><xref ref-type="table" rid="table6"><xref ref-type="table" rid="table">Table </xref>6</xref></label><caption><title> Analysis of variance table of the linear model without slope parameter</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Df</th><th align="center" valign="middle" >Sum Sq</th><th align="center" valign="middle" >Mean Sq</th><th align="center" valign="middle" >F-value</th><th align="center" valign="middle" >Pr (&gt;F)</th></tr></thead><tr><td align="center" valign="middle" >Residuals</td><td align="center" valign="middle" >29</td><td align="center" valign="middle" >33.467</td><td align="center" valign="middle" >1.154</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td></tr></tbody></table></table-wrap><p>One of the diagnostic tools whose diagnostic accuracy, combined across various studies, we intend to examine is the FibroScan. FibroScan is the name of a medical device used to help determine the health of a patient’s liver. The term FibroScan, which is often confused with “fiber scan,” “fibro scan” or even “fibro liver scan,” is also used to refer to the FibroScan liver test itself. 
If the physician is recommending a FibroScan of the liver, the likely reason is to assess the health of the liver and detect liver fibrosis, which can indicate the presence and extent of liver damage or liver disease. FibroScan uses advanced ultrasound technology called transient elastography to measure liver stiffness.</p><p>The diagnostic accuracy parameters of the non-invasive tests were estimated by comparison with liver biopsy used as the gold standard. Our aim here is to provide a methodology to confirm the agreement between the set of AUROC values reported in 2006 and that reported in 2008 [<xref ref-type="bibr" rid="scirp.130627-ref17">17</xref>] [<xref ref-type="bibr" rid="scirp.130627-ref18">18</xref>].</p><p>One should note that the measurements lie in the interval x ∈ (0, 1). To analyze this type of data it is recommended to start by applying a variance stabilizing transformation. For this type of data, the commonly used transformation is the arcsine square root, u = sin<sup>−1</sup>(√x), which is the transformation applied in the R code of <xref ref-type="table" rid="table">Table </xref>7. In this case, var(u) = &#188;. This means that, for this type of data and after applying the variance stabilizing transformation, the two raters are deemed to be equally precise. We also recommend that if the data are reported as counts, the square root transformation should be applied to the data in order to stabilize the variance.</p><p>The summary statistics of the transformed data given in <xref ref-type="table" rid="table">Table </xref>7 are:</p><p>mean(AUROC_a) = 0.969, var(AUROC_a) = 0.041.</p><p>mean(AUROC_b) = 0.971, var(AUROC_b) = 0.044, and cor(AUROC_a, AUROC_b) = 0.988.</p><p>In <xref ref-type="fig" rid="fig3">Figure 3</xref>, we show the Bland-Altman plot.</p><p>R-code:</p><p>model1&lt;-lm(df$diff~df$avg,data=df)</p><p>summary(model1)</p><p>The results of the omnibus tests are given in <xref ref-type="table" rid="table">Table </xref>8 and <xref ref-type="table" rid="table">Table </xref>9. The actual F statistic is 1.175/2 = 0.587. 
To find the correct p-value we use R:</p><p>p_value=pf(0.587, 2, 18, ncp=0, lower.tail = FALSE, log.p = FALSE)</p><p>which gives p_value = 0.566.</p><table-wrap id="table7" ><label><xref ref-type="table" rid="table">Table </xref>7</label><caption><title> AUROC data</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >AUROC measurement in 2006</th></tr></thead><tr><td align="center" valign="middle" >AUROC_a = c(0.57, 0.39, 0.64, 0.81, 0.85, 0.67, 0.33, 0.80, 0.57, 0.39, 0.64, 0.81, 0.85, 0.67, 0.33, 0.80, 0.91, 0.81, 0.85, 0.67)</td></tr><tr><td align="center" valign="middle" >AUROC_a=asin(sqrt(AUROC_a))</td></tr><tr><td align="center" valign="middle" >AUROC measurement in 2008</td></tr><tr><td align="center" valign="middle" >AUROC_b = c(0.58, 0.34, 0.61, 0.85, 0.82, 0.69, 0.35, 0.82, 0.58, 0.34, 0.61, 0.85, 0.82, 0.69, 0.35, 0.82, 0.90, 0.82, 0.86, 0.69)</td></tr><tr><td align="center" valign="middle" >AUROC_b=asin(sqrt(AUROC_b))</td></tr></tbody></table></table-wrap><table-wrap id="table8" ><label><xref ref-type="table" rid="table">Table </xref>8</label><caption><title> Regression output for the AUROC data</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >Std. 
Error</th><th align="center" valign="middle" >t value</th><th align="center" valign="middle" >Pr (&gt;|t|)</th></tr></thead><tr><td align="center" valign="middle" >(Intercept)</td><td align="center" valign="middle" >0.03605</td><td align="center" valign="middle" >0.03620</td><td align="center" valign="middle" >0.996</td><td align="center" valign="middle" >0.333</td></tr><tr><td align="center" valign="middle" >df$avg</td><td align="center" valign="middle" >−0.03962</td><td align="center" valign="middle" >0.03655</td><td align="center" valign="middle" >−1.084</td><td align="center" valign="middle" >0.293</td></tr></tbody></table></table-wrap><table-wrap id="table9" ><label><xref ref-type="table" rid="table">Table </xref>9</label><caption><title> Analysis of variance table of the regression model given in <xref ref-type="table" rid="table">Table </xref>8</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Df</th><th align="center" valign="middle" >Sum Sq</th><th align="center" valign="middle" >Mean Sq</th><th align="center" valign="middle" >F-value</th><th align="center" valign="middle" >Pr (&gt;F)</th></tr></thead><tr><td align="center" valign="middle" >df$avg</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.001263</td><td align="center" valign="middle" >0.0012626</td><td align="center" valign="middle" >1.1753</td><td align="center" valign="middle" >0.2926</td></tr><tr><td align="center" valign="middle" >Residuals</td><td align="center" valign="middle" >18</td><td align="center" valign="middle" >0.019338</td><td align="center" valign="middle" >0.0010743</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td></tr></tbody></table></table-wrap><p>Therefore, we may conclude that there is agreement between the two sets of ratings, since the p-value exceeds 0.25.</p><p>## MODEL NO INTERCEPT: Test for equality of 
precisions.</p><p>model2&lt;-lm(df$diff~0+df$avg,data=df)</p><p>summary(model2)</p><p>anova(model2)</p><p>lm(formula = df$diff ~ 0 + df$avg, data = df)</p><p>Again, we must caution against using the residual degrees of freedom as given in the R output. The correct residual degrees of freedom are in fact 18. Therefore, the F statistic has an F distribution with numerator and denominator degrees of freedom (1, 18), and not as shown in <xref ref-type="table" rid="table10"><xref ref-type="table" rid="table">Table </xref>10</xref> and <xref ref-type="table" rid="table11"><xref ref-type="table" rid="table">Table </xref>11</xref>. Hence the correct p-value is obtained using the following R code.</p><table-wrap id="table10" ><label><xref ref-type="table" rid="table10"><xref ref-type="table" rid="table">Table </xref>10</xref></label><caption><title> Output of model without intercept to test equality of precisions</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >Std. 
Error</th><th align="center" valign="middle" >t value</th><th align="center" valign="middle" >Pr (&gt;|t|)</th></tr></thead><tr><td align="center" valign="middle" >df$avg</td><td align="center" valign="middle" >−0.003980</td><td align="center" valign="middle" >0.007398</td><td align="center" valign="middle" >−0.538</td><td align="center" valign="middle" >0.597</td></tr></tbody></table></table-wrap><table-wrap id="table11" ><label><xref ref-type="table" rid="table11"><xref ref-type="table" rid="table">Table </xref>11</xref></label><caption><title> The ANOVA table for the model without intercept</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Df</th><th align="center" valign="middle" >Sum Sq</th><th align="center" valign="middle" >Mean Sq</th><th align="center" valign="middle" >F-value</th><th align="center" valign="middle" >Pr (&gt;F)</th></tr></thead><tr><td align="center" valign="middle" >df$avg</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.000311</td><td align="center" valign="middle" >0.0003108</td><td align="center" valign="middle" >0.2894</td><td align="center" valign="middle" >0.5968</td></tr><tr><td align="center" valign="middle" >Residuals</td><td align="center" valign="middle" >19</td><td align="center" valign="middle" >0.0204030</td><td align="center" valign="middle" >0.0010738</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td></tr></tbody></table></table-wrap><p>p_value=pf(0.2894, 1, 18, ncp=0, lower.tail = FALSE, log.p = FALSE)</p><p>which gives p_value = 0.597.</p><p>The equality of variances should come as no surprise, since the variance stabilizing transformation produced a constant variance of 1/4 for both raters.</p><p>As in Example 1, the ANOVA of the regression without slope, shown in <xref ref-type="table" rid="table12"><xref ref-type="table" rid="table">Table </xref>12</xref> and 
<xref ref-type="table" rid="table13"><xref ref-type="table" rid="table">Table </xref>13</xref>, does not produce an F-statistic. We can test the equality of means of the two sets of measurements using the paired t-test:</p><p>t.test(AUROC_a, AUROC_b, paired=TRUE)</p><p>Paired t-test results are summarized as follows:</p><p>t = −0.32361, df = 19, p-value = 0.7498; alternative hypothesis: true mean difference is not equal to 0. The 95 percent confidence interval is (−0.01779321, 0.01302792), with mean difference = −0.002382647.</p><p>Note that the p-value associated with the paired t-test is identical to the p-value produced by the regression model without slope.</p><p>Other asymptotic tests for equality of precision and absence of relative bias:</p><p>Let SSE<sub>s</sub> denote the residual sum of squares of the model with slope only (no intercept), SSE<sub>i</sub> the residual sum of squares of the model with intercept only (no slope), and SSE<sub>g</sub> the residual sum of squares of the full regression model. We can avoid the incorrect assignment of degrees of freedom by the software and use an asymptotic approach suggested in [<xref ref-type="bibr" rid="scirp.130627-ref19">19</xref>]. The two tests are defined as follows.</p><p>Test_1 (testing of equal precision): if</p><p>Q1 = n[log(SSE<sub>s</sub>) − log(SSE<sub>g</sub>)] ≥ chi-square(1, 1−α),</p><p>then we reject the hypothesis of equal precision.</p><p>Test_2 (testing of no interrater bias): if</p><p>Q2 = n[log(SSE<sub>i</sub>) − log(SSE<sub>g</sub>)] ≥ chi-square(1, 1−α),</p><p>then we reject the hypothesis of absence of bias.</p><p>The residual sums of squares of the three models are summarized as follows.</p><p>Full model: SSE<sub>g</sub> = 0.0193; no intercept (test of equal precision): SSE<sub>s</sub> = 0.0204; no slope (test of unbiasedness): SSE<sub>i</sub> = 0.0206.</p><p>For the AUROC data, Q1 = 1.1086, and Q2 = 1.3037. 
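</p><p>The two asymptotic statistics can be reproduced from the three residual sums of squares quoted in the text. What follows is only a minimal sketch in R, assuming n = 20 pairs (the number of values listed in Table 7), natural logarithms, and a 5% significance level; the variable names (n, sse_g, sse_s, sse_i, q1, q2, crit) are illustrative and are not part of the original analysis. Assignments use = rather than the arrow operator so that the listing needs no escaping inside this document.</p><preformat>
```r
# Residual sums of squares as quoted in the text (AUROC example)
n = 20          # number of paired measurements (Table 7)
sse_g = 0.0193  # full model (intercept and slope)
sse_s = 0.0204  # model without intercept (test of equal precision)
sse_i = 0.0206  # model without slope (test of absence of bias)

# Asymptotic statistics Q1 and Q2, using natural logarithms
q1 = n * (log(sse_s) - log(sse_g))
q2 = n * (log(sse_i) - log(sse_g))

# Upper 5% critical value of the chi-square distribution with 1 df
crit = qchisq(0.95, df = 1)

# A value of TRUE would indicate rejection of the corresponding hypothesis
print(round(c(Q1 = q1, Q2 = q2, critical = crit), 4))
print(c(reject_equal_precision = q1 >= crit, reject_no_bias = q2 >= crit))
```
</preformat><p>Under these assumptions the sketch returns the values of Q1 and Q2 quoted above, both of which fall below the critical value. 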
The critical value chi-square(1, 1−α) at α = 0.05 is 3.8414, so neither hypothesis is rejected.</p><table-wrap id="table12" ><label><xref ref-type="table" rid="table12"><xref ref-type="table" rid="table">Table </xref>12</xref></label><caption><title> Model without slope to test absence of bias</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >Std. Error</th><th align="center" valign="middle" >t value</th><th align="center" valign="middle" >Pr (&gt;|t|)</th></tr></thead><tr><td align="center" valign="middle" >(Intercept)</td><td align="center" valign="middle" >−0.00238</td><td align="center" valign="middle" >0.007363</td><td align="center" valign="middle" >−0.324</td><td align="center" valign="middle" >0.75</td></tr></tbody></table></table-wrap><table-wrap id="table13" ><label><xref ref-type="table" rid="table13"><xref ref-type="table" rid="table">Table </xref>13</xref></label><caption><title> ANOVA of the regression</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Df</th><th align="center" valign="middle" >Sum Sq</th><th align="center" valign="middle" >Mean Sq</th><th align="center" valign="middle" >F-value</th><th align="center" valign="middle" >Pr (&gt;F)</th></tr></thead><tr><td align="center" valign="middle" >Residuals</td><td align="center" valign="middle" >19</td><td align="center" valign="middle" >0.0206</td><td align="center" valign="middle" >0.001084</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td></tr></tbody></table></table-wrap><p>Therefore, we reach the same conclusions: both raters have equal precision, and they are unbiased relative to each other. 
In other words, there is high agreement between the two raters.</p><p>The issue of sample size within the context of agreement</p><p>At the early stage of designing any clinical investigation, one has to decide on the number of subjects to enroll in the study to ensure validity and generalizability. In this section we shall use the R package pwr to find estimates of the sample sizes for the three situations discussed.</p><p>1) Sample size estimation to test the null hypothesis H<sub>0</sub>: α = 0 ∩ β = 0 against the general alternative hypothesis H<sub>1</sub>: α = α<sub>1</sub> ≠ 0 ∩ β = β<sub>1</sub> ≠ 0.</p><p>We shall base the estimation on the ANOVA F-statistic. We use the R function pwr.f2.test, which requires specifying the Type I error rate, the power, the numerator degrees of freedom of the F-statistic (u = 2), and the value of the non-centrality parameter λ given in (15), which is supplied through the argument f2 of this function. Cohen [<xref ref-type="bibr" rid="scirp.130627-ref20">20</xref>] demonstrated that the sample size needed for regression analysis depends on the chosen value of λ, which depends on the non-null values of the regression parameters. Values of λ around 20 are considered low, around 35 medium, and around 50 high.</p><p>Using the results of the AUROC example as the values for the regression parameters, we get f2 = 0.232. We force the numerator degrees of freedom of the F-statistic, u, to be equal to 2. Therefore, for a Type I error rate of 0.05 and power of 0.80, we can use the function “pwr.f2.test” to determine the denominator degrees of freedom of the F-statistic, v. 
Since the sample size equals the denominator degrees of freedom plus 2, we get the following results:</p><p>library(pwr)</p><p>pwr.f2.test(u=2, f2=0.232, sig.level=0.05, power=0.80)</p><p>u = 2 (numerator degrees of freedom of the F-statistic)</p><p>v = n − 2 = 41.66 (denominator degrees of freedom of the F-statistic)</p><p>Hence, the sample size is n = round(v) + 2 = 44.</p><p>2) Sample size requirements for testing equality of precisions, that is, testing the null hypothesis H<sub>0</sub>: σ<sub>1</sub><sup>2</sup> = σ<sub>2</sub><sup>2</sup>, or equivalently H<sub>01</sub>: β = 0 (see Equation (11)), against the general alternative:</p><p><inline-formula><inline-graphic xlink:href="/html.scirp.org/file/5-1890719x97.png" xlink:type="simple"/></inline-formula>.</p><p>Note that to test the equality of precision we used the F-statistic of the ANOVA of the regression model without intercept. The numerator degrees of freedom are u = 1, and the denominator degrees of freedom are v = n − 2. If we arbitrarily select the effect size, or non-centrality parameter, f2 = 0.20, the R code for the sample size is therefore:</p><p>pwr.f2.test(u=1, f2=0.2, sig.level=0.05, power=0.80)</p><p>We get v = 39.25602. 
Hence the sample size is:</p><p><inline-formula><inline-graphic xlink:href="/html.scirp.org/file/5-1890719x98.png" xlink:type="simple"/></inline-formula>.</p><p>3) Sample size requirement to test the absence of bias.</p><p>As we have indicated, testing the hypothesis that the two raters are unbiased relative to each other is equivalent to testing <inline-formula><inline-graphic xlink:href="//html.scirp.org/file/5-1890719x99.png" xlink:type="simple"/></inline-formula> against <inline-formula><inline-graphic xlink:href="//html.scirp.org/file/5-1890719x99.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="//html.scirp.org/file/5-1890719x100.png" xlink:type="simple"/></inline-formula>.</p><p>Because the ANOVA of the regression without slope does not produce an F-statistic, we tested the equality of correlated means using the paired t-test. The pwr package can still be used, with a different parameter set-up. For example, the mean difference that we need to detect, expressed as an effect size, is denoted by “d”. Therefore, for d = 0.2, power = 0.80, and level of significance = 0.05, the code is:</p><p>pwr.t.test(d=0.2, power=0.8, sig.level=0.05, type="paired", alternative="two.sided")</p><p>n = 198, which is the required number of subjects, that is, the number of pairs.</p></sec><sec id="s4"><title>4. Discussion</title><p>Statistical analyses of the measurement of agreement are presented both graphically and analytically. There is a great deal of research on the subject of agreement, but to our knowledge, there is no document focusing on a unified approach to the numerical evaluation and reporting of agreement studies in the medical field [<xref ref-type="bibr" rid="scirp.130627-ref21">21</xref>] [<xref ref-type="bibr" rid="scirp.130627-ref22">22</xref>]. The fundamental aim of our research was to provide a unified and robust approach to properly estimate and test agreement within healthcare settings. It is not out of place to mention that Hayes et al. 
[<xref ref-type="bibr" rid="scirp.130627-ref16">16</xref>] claimed that the omnibus F-statistic reported in the ANOVA of the regression model, which has numerator and denominator degrees of freedom (2, n − 2), is the average of two F-statistics, each with (1, n − 2) degrees of freedom. Due to lack of mathematical rigor, we did not use their results.</p></sec><sec id="s5"><title>5. Conclusion</title><p>We have proposed specific guidelines for reporting the results of tests in agreement studies. The guidelines are broadly useful and applicable to most diagnostic issues. To properly report the results, the user may use standard statistical packages such as SAS, R, and SPSS. However, proper adjustment of the results reported by the packages is needed. We have outlined the appropriate techniques to ascertain the agreement of paired numerical data sets when assessing agreement is the subject of interest. We also provided two worked examples to illustrate these techniques, together with the complete R [<xref ref-type="bibr" rid="scirp.130627-ref23">23</xref>] code, which may be readily used for data analyses of similar studies.</p></sec><sec id="s6"><title>Acknowledgements</title><p>The author acknowledges the constructive comments made by anonymous reviewers.</p></sec><sec id="s7"><title>Conflicts of Interest</title><p>None declared by the author.</p></sec><sec id="s8"><title>Cite this paper</title><p>Shoukri, M.M. (2024) Cautionary Remarks When Testing Agreement between Two Raters for Continuous Scale Measurements: A Tutorial in Clinical Epidemiology with Implementation Using R. Open Journal of Epidemiology, 14, 56-74. https://doi.org/10.4236/ojepi.2024.141005</p></sec></body><back><ref-list><title>References</title><ref id="scirp.130627-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Bland, J.M. and Altman, D.G. 
(1986) Statistical Methods for Assessing Agreement between Two Methods of Clinical Measurement. The Lancet, 327, 307-310. https://doi.org/10.1016/S0140-6736(86)90837-8</mixed-citation></ref><ref id="scirp.130627-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Bland, J.M. and Altman, D.G. (1995) Comparing Methods of Measurement: Why Plotting Difference against Standard Method Is Misleading. The Lancet, 346, 1085-1087. https://doi.org/10.1016/S0140-6736(95)91748-9</mixed-citation></ref><ref id="scirp.130627-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Bartko, J.J. (1994) Measures of Agreement: A Single Procedure. Statistics in Medicine, 13, 737-745. https://doi.org/10.1002/sim.4780130534</mixed-citation></ref><ref id="scirp.130627-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Bradley, E.L. and Blackwood, L.G. (1989) Comparing Paired Data: A Simultaneous Test for Means and Variances. The American Statistician, 43, 234-235. https://doi.org/10.1080/00031305.1989.10475665</mixed-citation></ref><ref id="scirp.130627-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Stephenson, J.M. and Babiker, A. (2000) Overview of Study Design in Clinical Epidemiology. Sexually Transmitted Infections, 76, 244-247. https://doi.org/10.1136/sti.76.4.244</mixed-citation></ref><ref id="scirp.130627-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Last, J.M. (1988) What Is “Clinical Epidemiology”? Journal of Public Health Policy, 9, 159-163. https://doi.org/10.2307/3343001</mixed-citation></ref><ref id="scirp.130627-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Sackett, D.L. (2002) Clinical Epidemiology. Journal of Clinical Epidemiology, 55, 1161-1166. 
https://doi.org/10.1016/S0895-4356(02)00521-8</mixed-citation></ref><ref id="scirp.130627-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Spitzer, W.O. (1986) Clinical Epidemiology. Journal of Chronic Diseases, 39, 411-415. https://doi.org/10.1016/0021-9681(86)90107-4</mixed-citation></ref><ref id="scirp.130627-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Kottner, J., et al. (2011) Guidelines for Reporting Reliability and Agreement Studies (GRRAS). Journal of Clinical Epidemiology, 64, 96-106. https://doi.org/10.1016/j.jclinepi.2010.03.002</mixed-citation></ref><ref id="scirp.130627-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Morgan, W.A. (1939) A Test for the Significance of the Difference between Two Variances in a Sample from a Normal Bivariate Population. Biometrika, 31, 13-19. https://doi.org/10.1093/biomet/31.1-2.13</mixed-citation></ref><ref id="scirp.130627-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Pitman, E.J.G. (1939) A Note on Normal Correlation. Biometrika, 31, 9-12. https://doi.org/10.1093/biomet/31.1-2.9</mixed-citation></ref><ref id="scirp.130627-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Gulliksen, H. and Wilks, S.S. (1950) Regression Tests for Several Samples. Psychometrika, 15, 91-114. https://doi.org/10.1007/BF02289195</mixed-citation></ref><ref id="scirp.130627-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Lazo, M. and Clark, J.M. (2008) The Epidemiology of Nonalcoholic Fatty Liver Disease: A Global Perspective. Seminars in Liver Disease, 28, 339-350. https://doi.org/10.1055/s-0028-1091978</mixed-citation></ref><ref id="scirp.130627-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Prati, D., Taioli, E., Zanella, A., Della Torre, E., Butelli, S., Del Vecchio, E. and Conte, D. 
(2002) Updated Definitions of Healthy Ranges for Serum Alanine Aminotransferase Levels. Annals of Internal Medicine, 137, 1-10. https://doi.org/10.7326/0003-4819-137-1-200207020-00006</mixed-citation></ref><ref id="scirp.130627-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Sanai, F.M., Helmy, A., Dale, C., Al-Ashgar, H., Abdo, A.A., Katada, K. and Hashem, A. (2011) Updated Thresholds for Alanine Aminotransferase Do Not Exclude Significant Histological Disease in Chronic Hepatitis C. Liver International, 31, 1039-1046. https://doi.org/10.1111/j.1478-3231.2011.02551.x</mixed-citation></ref><ref id="scirp.130627-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Hayes, K., O’Brian, K. and Kinsella, A. (2017) A Decomposition of the Bradley-Blackwood Paired-Samples Omnibus Test. Communications in Statistics-Theory and Methods, 46, 9892-9896. https://doi.org/10.1080/03610926.2016.1222439</mixed-citation></ref><ref id="scirp.130627-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Friedrich-Rust, M., Ong, M.F., Martens, S., Sarrazin, C., Bojunga, J., Zeuzem, S. and Herrmann, E. (2008) Performance of Transient Elastography for the Staging of Liver Fibrosis: A Meta-Analysis. Gastroenterology, 134, 960-974. https://doi.org/10.1053/j.gastro.2008.01.034</mixed-citation></ref><ref id="scirp.130627-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Friedrich-Rust, M., Rosenberg, W., Parkes, J., Herrmann, E., Zeuzem, S. and Sarrazin, C. (2010) Comparison of ELF, FibroTest and FibroScan for the Non-Invasive Assessment of Liver Fibrosis. BMC Gastroenterology, 10, Article No. 103. https://doi.org/10.1186/1471-230X-10-103</mixed-citation></ref><ref id="scirp.130627-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Carroll, R.J. and Ruppert, D. (1988) Transformation and Weighting in Regression. 
Chapman and Hall, New York. https://doi.org/10.1007/978-1-4899-2873-3</mixed-citation></ref><ref id="scirp.130627-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Cohen, J. (1992) A Power Primer. Psychological Bulletin, 112, 155-159. https://doi.org/10.1037/0033-2909.112.1.155</mixed-citation></ref><ref id="scirp.130627-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Shoukri, M.M. (2010) Measures of Interobserver Agreement and Reliability. 2nd Edition, Chapman &amp; Hall/CRC, Boca Raton. https://doi.org/10.1201/b10433</mixed-citation></ref><ref id="scirp.130627-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Shoukri, M.M. (2015) Agreement. Encyclopedia of Biostatistics. Wiley, New York.</mixed-citation></ref><ref id="scirp.130627-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">https://cran.r-project.org/bin/windows/base/</mixed-citation></ref></ref-list></back></article>