<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">OJMI</journal-id><journal-title-group><journal-title>Open Journal of Medical Imaging</journal-title></journal-title-group><issn pub-type="epub">2164-2788</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/ojmi.2014.44028</article-id><article-id pub-id-type="publisher-id">OJMI-52593</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Medicine&amp;Healthcare</subject></subj-group></article-categories><title-group><article-title>
 
 
  Medical Image Acquisition and Processing: Clinical Validation
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Goris</surname><given-names>Michael L.</given-names></name><xref ref-type="aff" rid="aff1"><sub>1</sub></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><label>1</label><addr-line>Stanford University School of Medicine, Stanford, USA</addr-line></aff><author-notes><corresp id="cor1">* E-mail: <email>mlgoris@stanford.edu</email></corresp></author-notes><pub-date pub-type="epub"><day>17</day><month>11</month><year>2014</year></pub-date><volume>04</volume><issue>04</issue><fpage>205</fpage><lpage>209</lpage><history><date date-type="received"><day>27</day><month>October</month><year>2014</year></date><date date-type="rev-recd"><day>26</day><month>November</month><year>2014</year></date><date date-type="accepted"><day>22</day><month>December</month><year>2014</year></date></history><permissions><copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  The validation of medical imaging (processing and acquisition) can be achieved in multiple ways, somewhat influenced by the context. There are three traps to avoid: first, reliance on ground truth requires that the truth be known before the end of the trial; second, comparison to a gold standard cannot show improvement; and finally, one needs to deal with confirmation bias. In this paper we discuss these topics and alternative validation schemes.
 
</p></abstract><kwd-group><kwd>Medical Imaging</kwd><kwd>Clinical Validation</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>When new imaging technologies begin to be used in clinical settings, little is known about their potential to improve care. They are usually “sold” on technological or impressionistic criteria. Clinical validation is rare, because the consensus is that it would delay the application. In this paper we discuss the different methods for validation, not all of which would cause implementation delay.</p><p>Ideally, for the sake of improving the appropriateness of medical imaging, one would hope for more rapid progression to what can be called scientific clinical research or technology assessment.</p><p>The first level of evaluation can be called diagnostic efficacy<sup>1</sup>. The appropriate research question to ask in this phase, while the technology is just beginning to diffuse into clinical practice or when a substantive advance in the technology occurs, is “How well does the new technology detect specific disease conditions?” The measures of efficacy are the operating characteristics (sensitivity, specificity). Effectiveness includes the positive and negative predictive values for a given prevalence of the disease in the study population and receiver-operating characteristic (ROC) analysis [<xref ref-type="bibr" rid="scirp.52593-ref1">1</xref>] . The test has diagnostic efficacy if it classifies the patient in the correct category<sup>2</sup>. The correct category is called the ground truth in earth sciences or, in medicine, the defining diagnosis. One type of defining diagnostic technique is based on the analysis (under the microscope) of tissues obtained from a lesion (e.g. histology, from a biopsy or at autopsy) or on microorganism detection in the case of infection. 
The diagnosis is not changed if the patient dies earlier or later than expected, or responds differently to therapy.</p><p>The defining diagnosis is really a type of taxonomy<sup>3</sup> and defines the ground truth<sup>4</sup>. The assumption is that the result of the taxonomy is related to outcome or to the result of a specific therapy. While diagnostic efficacy is principally of interest to radiologists, referring clinicians may be more interested in how the information derived from an imaging test affects how they care for patients, represented by the concept of therapeutic thinking efficacy. For therapeutic thinking, the corresponding question relates to the effect of imaging on considerations of treatment.</p><p>The validation of a defining diagnostic technique would be solipsistic. We will concentrate on non-defining diagnostic techniques and, as much as we can, on techniques based on computer processing of medical images. But why would we want a new diagnostic procedure? The global answer is that the existing one or the combination of existing ones is too costly or lacks efficacy, effectiveness or efficiency. Costly should be understood as combining expenses in material and personnel, pain, danger, lack of accuracy (strictly speaking, unfavorable operating characteristics for the examined population, e.g. false positive rates) and lack of predictive value.</p><p>Not all diagnostic techniques aim to classify patients properly; some aim rather to predict either the outcome or the best therapy to obtain the desired outcome (measurements of plasma cholesterol do not provide a diagnosis but a prognosis; staging is not diagnostic, but predictive). Ultimately, the diagnostic technique should be evaluated in its role in the management of the patient, or the outcome<sup>5</sup>.</p></sec><sec id="s2"><title>2. Validation Approaches</title><sec id="s2_1"><title>2.1. Outcome Analysis</title><p>Outcome analysis is actually based on large population studies. 
Originally it was presented as the method to evaluate cost-effectiveness (e.g. did people live longer if there were more MRI scanners in the region?). More pointedly, it has been used to look at the efficacy of screening studies. The analysis of those data is complicated by the concatenation of increased detection (an increase in incidence) and the fact that early detection may not always predict progression. For breast cancer and mammography it took a long time to show a beneficial outcome [<xref ref-type="bibr" rid="scirp.52593-ref2">2</xref>] .</p><p>Imaging usually represents only one or a few steps in a chain of diagnostic and therapeutic interventions, so how can we ascribe an outcome to any one of these? The performance of an imaging test may be excellent, but patients might still have adverse outcomes because the treatment was inappropriate or no adequate treatment exists. As a result of all of these factors, outcome evaluations of imaging technologies are rare. In the case of screening for early detection, the outcome is affected only to the extent that the treatment is (relatively) effective in early stages and not in more advanced disease. Outcome studies measure effectiveness rather than efficacy. Outcome analysis as validation would take too long and prevent the introduction of new techniques.</p></sec><sec id="s2_2"><title>2.2. Predictive Power</title><p>An exact taxonomic diagnosis may not be predictive. Consider that in some diseases the median survival time is n years: 50% die earlier, 50% later. The prognosis is not necessarily well defined by the diagnosis. Staging refines the prognosis, or the expected response to a particular therapy. Other techniques can be used to predict earlier which therapy will fail or succeed so that alternatives can be used [<xref ref-type="bibr" rid="scirp.52593-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.52593-ref4">4</xref>] . Early response to therapy may predict the long-term results. 
There is a time lag between development and the definition of predictive power, but this approach is less burdensome than outcome analysis.</p></sec><sec id="s2_3"><title>2.3. Predicting the Taxonomy</title><p>This is the most common type of validation for diagnostic techniques. The most relevant aspects of this approach are 1) that a ground truth is assumed to exist and be known and 2) that at some point there has to be a defining test or the next best thing (e.g. a gold standard)<sup>6</sup>.</p><p>The major problem is verification bias: always in the first case, and mostly in the second. An example of verification bias is the evaluation of myocardial perfusion scintigraphy (MPS). The gold standard for the presence of (significant) coronary artery disease (CAD) was originally the coronary arteriogram (CA). The MPS study would select patients more likely to need the CA. However, trust came too early. A value was ascribed to MPS (prematurely) and soon the probability of a CA being performed following a negative MPS decreased while the probability of a CA being performed following a positive MPS increased. The result was verification bias, with an over-estimation of the sensitivity and an under-estimation of the specificity. The proper validation would have been to perform MPS only on patients who had already had a CA, whether positive or negative, and to do so blindly.</p><p>There are ways to overcome verification bias: one of them is to look at populations with a known prevalence [<xref ref-type="bibr" rid="scirp.52593-ref5">5</xref>] or stratified populations; another is to correct the bias on the assumption of a neutral pre-selection [<xref ref-type="bibr" rid="scirp.52593-ref6">6</xref>] . 
The former is based on the fact that if groups are known to have a given prevalence of CAD, without defining which individuals actually have it, it is axiomatic that the prevalence of positive tests should correspond with the prevalence of the disease in the group.</p><p>If populations with known prevalence exist, this is a fairly direct approach, but expensive to implement.</p></sec><sec id="s2_4"><title>2.4. Discriminating Power</title><p>There are two concatenated conditions for a test to be discriminating: the metric has to have intrinsic discriminating value and the measurement has to be precise enough, so that variability in testing does not reach the magnitude of the difference between affected and unaffected.</p><sec id="s2_4_1"><title>2.4.1. Patient Study A</title><p>The study comprised 25 patients with “early” cystic fibrosis (CF) and 10 control cases. Patients are defined by genetics or sweat test. Controls are non-affected siblings in the same age range and with the same sex distribution. Did the pulmonary function tests discriminate between the two groups [<xref ref-type="bibr" rid="scirp.52593-ref7">7</xref>] (<xref ref-type="table" rid="table1">Table 1</xref> &amp; <xref ref-type="table" rid="table2">Table 2</xref>)?</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>Pulmonary function tests in 25 children with cystic fibrosis, compared to unaffected siblings: RV = residual volume; TLC = total lung capacity; IC% = inspiratory capacity; SVC = slow vital capacity; FVC = forced vital capacity; FEV1 = forced expiratory volume in the first (3, 5) second following maximal inhalation; FEF = forced expiratory flow</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >RV/TLC</th><th align="center" valign="middle" >IC%</th><th align="center" valign="middle" >FVC%</th><th align="center" valign="middle" >SVC%</th><th align="center" valign="middle" >FEV1%</th><th align="center" 
valign="middle" >FEV1/FVC</th><th align="center" valign="middle" >FEF25-75%</th><th align="center" valign="middle" >FEFMax%</th></tr></thead><tr><td align="center" valign="middle" >Mean CF</td><td align="center" valign="middle" >28.2</td><td align="center" valign="middle" >102.3</td><td align="center" valign="middle" >115.4</td><td align="center" valign="middle" >110.7</td><td align="center" valign="middle" >104.0</td><td align="center" valign="middle" >80.4</td><td align="center" valign="middle" >83.6</td><td align="center" valign="middle" >99.6</td></tr><tr><td align="center" valign="middle" >STDV CF</td><td align="center" valign="middle" >10.2</td><td align="center" valign="middle" >18.9</td><td align="center" valign="middle" >18.5</td><td align="center" valign="middle" >19.8</td><td align="center" valign="middle" >17.1</td><td align="center" valign="middle" >7.6</td><td align="center" valign="middle" >29.5</td><td align="center" valign="middle" >25.6</td></tr><tr><td align="center" valign="middle" >Mean Nl</td><td align="center" valign="middle" >21.6</td><td align="center" valign="middle" >97.0</td><td align="center" valign="middle" >111.6</td><td align="center" valign="middle" >112.3</td><td align="center" valign="middle" >106.6</td><td align="center" valign="middle" >84.1</td><td align="center" valign="middle" >103.9</td><td align="center" valign="middle" >100.5</td></tr><tr><td align="center" valign="middle" >STDV NL</td><td align="center" valign="middle" >4.0</td><td align="center" valign="middle" >6.3</td><td align="center" valign="middle" >15.1</td><td align="center" valign="middle" >13.3</td><td align="center" valign="middle" >13.7</td><td align="center" valign="middle" >6.4</td><td align="center" valign="middle" >26.5</td><td align="center" valign="middle" >20.9</td></tr><tr><td align="center" valign="middle" >T-test</td><td align="center" valign="middle" >0.011</td><td align="center" valign="middle" >0.226</td><td align="center" valign="middle" 
>0.568</td><td align="center" valign="middle" >0.814</td><td align="center" valign="middle" >0.668</td><td align="center" valign="middle" >0.189</td><td align="center" valign="middle" >0.068</td><td align="center" valign="middle" >0.926</td></tr></tbody></table></table-wrap><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> The two groups are better discriminated by looking at quantitative air-trapping</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" ></th><th align="center" valign="middle" >A1</th><th align="center" valign="middle" >A2</th><th align="center" valign="middle" >A3</th></tr></thead><tr><td align="center" valign="middle" >25 CF</td><td align="center" valign="middle" >Mean</td><td align="center" valign="middle" >16.16</td><td align="center" valign="middle" >9.83</td><td align="center" valign="middle" >4.50</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >SD</td><td align="center" valign="middle" >14.71</td><td align="center" valign="middle" >10.30</td><td align="center" valign="middle" >5.11</td></tr><tr><td align="center" valign="middle" >10 NL</td><td align="center" valign="middle" >Mean</td><td align="center" valign="middle" >5.22</td><td align="center" valign="middle" >2.27</td><td align="center" valign="middle" >0.82</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >SD</td><td align="center" valign="middle" >3.64</td><td align="center" valign="middle" >1.72</td><td align="center" valign="middle" >0.62</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >T-test</td><td align="center" valign="middle" >0.0013</td><td align="center" valign="middle" >0.0012</td><td align="center" valign="middle" >0.0013</td></tr></tbody></table></table-wrap></sec><sec id="s2_4_2"><title>2.4.2. 
Patient Study B</title><p>The aim was to evaluate quantitative air trapping measurements in children with mild cystic fibrosis (CF) lung disease during a one-year double-blind placebo-controlled rhDNase intervention trial, and to compare results from quantitative air trapping with those from spirometry or visually scored HRCT scans of the chest [<xref ref-type="bibr" rid="scirp.52593-ref8">8</xref>] (<xref ref-type="table" rid="table3">Table 3</xref>).</p><p>In a certain sense discriminating power is the El Dorado of image processing: although the exactitude of a measurement can often be determined (e.g. the shape and size of a hearing aid determined from an image of the external auditory canal), this is not true in all cases. There is no in vivo verification of early air trapping in children with cystic fibrosis and there is no in vivo diagnosis of Alzheimer disease in the elderly. In these cases discriminating power overcomes the lack of ground truth and shows efficacy.</p></sec></sec><sec id="s2_5"><title>2.5. Internalized Validation</title><p>The prototype of internalization is automation of region of interest (ROI) definition. The creator or user of the routine assumes that the user knows whether an ROI is correctly placed and limited. The question is whether the automated program yields a result acceptable to the observer, and how frequently.</p><p>Another example is image processing that extracts an image feature, and hence not only facilitates interpretation, but makes it more reproducible. Again, the validation is an agreement with the observer.</p><p>The validation is internalized, not because of clinical criteria, but because it standardizes interpretation to the satisfaction of the user. It is a weak validation, but immediate and cheap.</p></sec><sec id="s2_6"><title>2.6. Equivalence</title><p>Equivalence is based on the comparison with an established diagnostic procedure. The established procedure is sometimes referred to as the “gold standard”, even if it is not a perfect procedure. 
More precisely, this approach is referred to as a “no worse than” design. The “no worse than” denomination refers to the fact that at best the evaluated diagnostic procedure perfectly matches the “gold standard”, but cannot be shown to be better: every discrepancy counts as worse performance.</p><p>It takes different forms. In MPS the gold standard was the CA; however, the metric was not the same: the arteriographic measurement of stenosis does not necessarily determine the relative decrease of flow in the dependent myocardium. In addition, a normal MPS predicts a lowering of risk for myocardial ischemic events independently of the CA findings [<xref ref-type="bibr" rid="scirp.52593-ref9">9</xref>] .</p><p>In imaging a common study design is to compare the automatic analysis to the judgment of a panel of (experienced) experts (see also internalized validation). Again, the new modality cannot be shown to perform better, since the human observers define the truth.</p><p>The equivalence design is not altogether valueless since the new procedure may globally decrease the cost (expenses in material and personnel, pain and danger). What can be demonstrated is an improvement in reproducibility of the interpretation if the method is automated or quantitative.</p><p>However, the use of gold standards, while easily performed, may be dangerous if the metrics differ in a physiologically important manner.</p></sec></sec><sec id="s3"><title>3. Conclusion</title><p>Cost-effectiveness or efficiency evaluation would be the next step: assuming that we reach a satisfactory clinical validation, we would also like to know how much a successful technology will cost. This allows us to see, regardless of the health benefit, whether society can afford to implement it on a broad scale. 
Since cost is relative, researchers generally relate it to how much benefit is obtained for how much money. They have developed the ratio of cost per year of life saved and refer to this ratio as a technology’s cost-effectiveness [<xref ref-type="bibr" rid="scirp.52593-ref10">10</xref>] . The next step in the evaluation must include the result in the targeted population or efficiency, which is also a function of the disease prevalence in that population.</p><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Discriminating the effect of treatment</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Metric</th><th align="center" valign="middle" >Pulmozyme<sup>TM</sup></th><th align="center" valign="middle" >Placebo</th><th align="center" valign="middle" >P</th></tr></thead><tr><td align="center" valign="middle" >N</td><td align="center" valign="middle" >11</td><td align="center" valign="middle" >14</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >A1</td><td align="center" valign="middle" >−2.10 &#177; 44.80</td><td align="center" valign="middle" >34.40 &#177; 62.10</td><td align="center" valign="middle" >0.102</td></tr><tr><td align="center" valign="middle" >A2</td><td align="center" valign="middle" >−9.30 &#177; 42.90</td><td align="center" valign="middle" >43.40 &#177; 73.20</td><td align="center" valign="middle" >0.035</td></tr><tr><td align="center" valign="middle" >A3</td><td align="center" valign="middle" >−13.10 &#177; 40.50</td><td align="center" valign="middle" >48.20 &#177; 81.20</td><td align="center" valign="middle" >−0.02</td></tr></tbody></table></table-wrap></sec><sec id="s4"><title>NOTES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.52593-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Hanley, J.A. and McNeil, B.J. 
(1982) The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 143, 29-36. http://dx.doi.org/10.1148/radiology.143.1.7063747</mixed-citation></ref><ref id="scirp.52593-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Berry, D.A., Cronin, K.A., Plevritis, S.K., Fryback, D.G., Clark, L.C., Zelen, M., Mandelblatt, J.S., Yakovlev, A.Y., Habbema, J.D.F. and Feuer, E.J. (2005) Contributions of Screening and Adjuvant Treatment to Reduction in Breast Cancer Mortality in the US from 1975 to 2000. New England Journal of Medicine, 353, 1784-1792.  
http://dx.doi.org/10.1056/NEJMoa050518</mixed-citation></ref><ref id="scirp.52593-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Zhu, H.J. and Halkar, P.K. (2007) An Evaluation of the Predictive Value of During Treatment 18F-Fluorodeoxyglucose PET/CT Scans in Pediatric Lymphomas. RSNA Scientific Assembly and Annual Meeting Program, 954.</mixed-citation></ref><ref id="scirp.52593-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Zhu, H.J., Halkar, R., Alavi, A. and Goris, M.L. (2013) An Evaluation of the Predictive Value of Mid-Treatment 18F-FDG PET/CT Scans in Pediatric Lymphomas and Undefined Criteria of Abnormality in Quantitative Analysis. Hellenic Journal of Nuclear Medicine, 16, 169-174.</mixed-citation></ref><ref id="scirp.52593-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Goris, M.L., Bretille, J., Askienazy, S., Purcell, G.P. and Savelli, V. (1989) The Validation of Diagnostic Procedures on Stratified Populations: Application on the Quantification of Thallium Myocardial Perfusion Scintigraphy. American Journal of Physiological Imaging, 4, 11-15.</mixed-citation></ref><ref id="scirp.52593-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Diamond, G.A., Rozanski, A., Forrester, J.S., Morris, D., Pollock, B.H., Staniloff, H.M., Berman, D.S. and Swan, H.J.C. (1986) A Model for Assessing the Sensitivity and Specificity of Tests Subject to Selection Bias: Application to Exercise Radionuclide Ventriculography for Diagnosis of Coronary Artery Disease. Journal of Chronic Diseases, 39, 343-355. http://dx.doi.org/10.1016/0021-9681(86)90119-0</mixed-citation></ref><ref id="scirp.52593-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Goris, M.L., Zhu, H.J., Blankenberg, F., Chan, F. and Robinson, T.E. (2003) An Automated Approach to Quantitative Air Trapping Measurements in Mild Cystic Fibrosis. 
Chest, 123, 1655-1663.  
http://dx.doi.org/10.1378/chest.123.5.1655</mixed-citation></ref><ref id="scirp.52593-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Robinson, T.E., Goris, M.L., Zhu, H.J., Chen, X., Bhise, P., Sheikh, F. and Moss, R.B. (2005) Dornase Alfa Reduces Air Trapping in Children with Mild Cystic Fibrosis Lung Disease: A Quantitative Analysis. Chest, 128, 2327-2335.  
http://dx.doi.org/10.1378/chest.128.4.2327</mixed-citation></ref><ref id="scirp.52593-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Hachamovitch, R., Berman, D.S., Shaw, L.J., Kiat, H., Cohen, I., Cabico, J.A., Friedman, J. and Diamond, G.A. (1998) Incremental Prognostic Value of Myocardial Perfusion Single Photon Emission Computed Tomography for the Prediction of Cardiac Death: Differential Stratification for Risk of Cardiac Death and Myocardial Infarction. Circulation, 97, 535-543. http://dx.doi.org/10.1161/01.CIR.97.6.535</mixed-citation></ref><ref id="scirp.52593-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Beinfeld, M.T., Wittenberg, E. and Gazelle, S.G. (2005) Cost-Effectiveness of Whole-Body CT Screening. Radiology, 234, 415-422. http://dx.doi.org/10.1148/radiol.2342032061</mixed-citation></ref></ref-list></back></article>