<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article">
 <front>
  <journal-meta>
   <journal-id journal-id-type="publisher-id">
    jamp
   </journal-id>
   <journal-title-group>
    <journal-title>
     Journal of Applied Mathematics and Physics
    </journal-title>
   </journal-title-group>
   <issn pub-type="epub">
    2327-4352
   </issn>
   <issn publication-format="print">
    2327-4379
   </issn>
   <publisher>
    <publisher-name>
     Scientific Research Publishing
    </publisher-name>
   </publisher>
  </journal-meta>
  <article-meta>
   <article-id pub-id-type="doi">
    10.4236/jamp.2025.139181
   </article-id>
   <article-id pub-id-type="publisher-id">
    jamp-146160
   </article-id>
   <article-categories>
    <subj-group subj-group-type="heading">
     <subject>
      Articles
     </subject>
    </subj-group>
    <subj-group subj-group-type="Discipline-v2">
     <subject>
      Physics 
     </subject>
     <subject>
       Mathematics
     </subject>
    </subj-group>
   </article-categories>
   <title-group>
    <article-title>Analysis of Risk Factors and Segment-Specific Strategies for Diabetes Prevention</article-title>
   </title-group>
   <contrib-group>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Konan
      </surname>
      <given-names>
       Aya Patricia
      </given-names>
     </name> 
     <xref ref-type="aff" rid="aff1"> 
      <sup>1</sup>
     </xref>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Coulibaly
      </surname>
      <given-names>
       Adama
      </given-names>
     </name> 
     <xref ref-type="aff" rid="aff2"> 
      <sup>2</sup>
     </xref>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Saha
      </surname>
      <given-names>
       Kouassi Bernard
      </given-names>
     </name> 
     <xref ref-type="aff" rid="aff3"> 
      <sup>3</sup>
     </xref>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Oumtanaga
      </surname>
      <given-names>
       Souleymane
      </given-names>
     </name> 
     <xref ref-type="aff" rid="aff4"> 
      <sup>4</sup>
     </xref>
    </contrib>
   </contrib-group> 
   <aff id="aff1">
    <addr-line>
     Faculty of Mathematics and Computer Science, Félix Houphouët-Boigny University, Abidjan, Côte d’Ivoire
    </addr-line> 
   </aff> 
   <aff id="aff2">
    <addr-line>
     Institute for Mathematical Research (IRMA), Abidjan, Côte d’Ivoire
    </addr-line> 
   </aff> 
   <aff id="aff3">
    <addr-line>
     Higher Teacher Training School, National Polytechnic Institute Félix Houphouët-Boigny, Yamoussoukro, Côte d’Ivoire
    </addr-line> 
   </aff> 
   <aff id="aff4">
    <addr-line>
     Laboratory of Computer Science and Telecommunications, National Polytechnic Institute, Abidjan, Côte d’Ivoire
    </addr-line> 
   </aff> 
   <pub-date pub-type="epub">
    <day>
     09
    </day> 
    <month>
     09
    </month>
    <year>
     2025
    </year>
   </pub-date> 
   <volume>
    13
   </volume> 
   <issue>
    09
   </issue>
   <fpage>
    3186
   </fpage>
   <lpage>
    3201
   </lpage>
   <history>
    <date date-type="received">
     <day>
      2
     </day>
     <month>
      September
     </month>
     <year>
      2025
     </year>
    </date>
    <date date-type="published">
     <day>
      25
     </day>
     <month>
      September
     </month>
     <year>
      2025
     </year> 
    </date> 
    <date date-type="accepted">
     <day>
      25
     </day>
     <month>
      September
     </month>
     <year>
      2025
     </year> 
    </date>
   </history>
   <permissions>
    <copyright-statement>
     © Copyright 2025 by authors and Scientific Research Publishing Inc. 
    </copyright-statement>
    <copyright-year>
     2025
    </copyright-year>
    <license>
     <license-p>
      This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/
     </license-p>
    </license>
   </permissions>
   <abstract>
    <p>This study proposes a segmented approach to analyzing diabetes risk factors using the dataset diabete_custom.xlsx (150 individuals, 14 medical and behavioral variables). The combination of KMeans with logistic regression and KMeans with decision tree enabled the definition of three clusters corresponding to low, moderate, and high risk, while identifying key variables such as blood glucose, BMI, and heredity. The hybrid models improve accuracy and interpretability compared to KMeans alone, with the decision tree being slightly more effective in unbalanced clusters. These findings provide a foundation for personalized interventions, including targeted screening, glycemic and nutritional monitoring, physical activity, and educational campaigns tailored to each risk profile.</p>
   </abstract>
   <kwd-group> 
    <kwd>
     Diabetes
    </kwd> 
    <kwd>
      KMeans
    </kwd> 
    <kwd>
      Logistic Regression
    </kwd> 
    <kwd>
      Decision Tree
    </kwd> 
    <kwd>
      Segmentation
    </kwd>
   </kwd-group>
  </article-meta>
 </front>
 <body>
  <sec id="s1">
   <title>1. Introduction</title>
   <p>Diabetes is a multifactorial chronic disease influenced by genetic, biological, and behavioral factors. Each individual presents a unique risk profile, making prevention particularly complex. Existing literature often focuses on global diabetes prediction or the extraction of general rules without considering population segmentation. This is the case, for example, in Sarra S. (2024) “AI-Based Approach for a Diabetes Prediction System” <xref ref-type="bibr" rid="scirp.146160-1">
     [1]
    </xref>, Mohebbi M. A. (2021) “A Machine Learning Approach to Treatment Improvement in Diabetes” <xref ref-type="bibr" rid="scirp.146160-2">
     [2]
    </xref>, and Amani Hamada Bonheur (2024) “Design and Implementation of an Intelligent Web Application for Diabetes Diagnosis” <xref ref-type="bibr" rid="scirp.146160-3">
     [3]
    </xref>. However, diabetes risk varies according to individual profiles, and a uniform approach does not allow for targeted prevention strategies.</p>
   <p>Segmenting the population into homogeneous subgroups enables the identification of key factors and guides personalized interventions. This study adopts a hybrid approach combining KMeans, logistic regression, and decision trees, applied to the diabete_custom.xlsx dataset. The objective is to identify the determinant variables for each cluster and extract simple rules to inform clinical and behavioral prevention.</p>
  </sec><sec id="s2">
   <title>2. Dataset Description</title>
   <p>The diabete_custom.xlsx dataset comprises 150 individuals and 14 variables, derived and enriched from the Pima Indians Diabetes Dataset <xref ref-type="bibr" rid="scirp.146160-4">
     [4]
    </xref>. It includes medical factors (age, BMI, blood glucose, HbA1c, blood pressure, cholesterol, waist circumference, family history) as well as behavioral factors (physical activity, smoking, alcohol consumption, BMI category). The target variable, Diabetes, is binary (0 = absence, 1 = presence). The Excel file is directly compatible with standard data analysis tools.</p>
  </sec><sec id="s3">
   <title>3. Methodology</title>
   <sec id="s3_1">
    <title>3.1. Data Preprocessing</title>
     <p>Preprocessing first separates the explanatory variables X from the target variable Y. Each variable x<sub>j</sub> is then standardized (z-score normalization) so that all features are on a comparable scale, according to the formula:</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <msubsup> 
        <mi>
          x 
        </mi> 
        <mi>
          j 
        </mi> 
        <mrow> 
         <mi>
           n 
         </mi> 
         <mi>
           o 
         </mi> 
         <mi>
           r 
         </mi> 
         <mi>
           m 
         </mi> 
        </mrow> 
       </msubsup> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <msub> 
          <mi>
            x 
          </mi> 
          <mi>
            j 
          </mi> 
         </msub> 
         <mo>
           − 
         </mo> 
         <msub> 
          <mi>
            μ 
          </mi> 
          <mi>
            j 
          </mi> 
         </msub> 
        </mrow> 
        <mrow> 
         <msub> 
          <mi>
            σ 
          </mi> 
          <mi>
            j 
          </mi> 
         </msub> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math></p>
    <p>where μ<sub>j</sub> and σ<sub>j</sub> represent the mean and standard deviation of variable x<sub>j</sub>, respectively. This normalization is essential for distance-based methods such as KMeans, in order to prevent any single variable from dominating the others due to differences in scale <xref ref-type="bibr" rid="scirp.146160-5">
      [5]
     </xref>.</p>
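     <p>As an illustration, the standardization formula above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library; the helper name standardize and the sample values are hypothetical and are not taken from the study's dataset or code.</p>
     <preformat>
```python
from statistics import fmean, pstdev


def standardize(column):
    """Return the z-scores (x - mean) / std of one feature column."""
    mu = fmean(column)
    sigma = pstdev(column, mu)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical values on very different scales (illustration only).
glucose = [90.0, 110.0, 130.0, 150.0]
bmi = [22.0, 26.0, 30.0, 34.0]

glucose_z = standardize(glucose)
bmi_z = standardize(bmi)
```
     </preformat>
     <p>After this step both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance merely because of its unit.</p>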
     <p>The optimal number of clusters k is chosen using two complementary criteria. The elbow method relies on the intra-cluster inertia, defined as:</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <msub> 
        <mi>
          W 
        </mi> 
        <mi>
          k 
        </mi> 
       </msub> 
       <mo>
         = 
       </mo> 
       <mstyle displaystyle="true"> 
        <munderover> 
         <mo>
           ∑ 
         </mo> 
         <mrow> 
          <mi>
            i 
          </mi> 
          <mo>
            = 
          </mo> 
          <mn>
            1 
          </mn> 
         </mrow> 
         <mi>
           k 
         </mi> 
        </munderover> 
        <mrow> 
         <mstyle displaystyle="true"> 
          <munder> 
           <mo>
             ∑ 
           </mo> 
           <mrow> 
            <mi>
              x 
            </mi> 
            <mo>
              ∈ 
            </mo> 
            <msub> 
             <mi>
               C 
             </mi> 
             <mi>
               i 
             </mi> 
            </msub> 
           </mrow> 
          </munder> 
          <mrow> 
           <msup> 
            <mrow> 
             <mrow> 
              <mo>
                ‖ 
              </mo> 
              <mrow> 
               <mi>
                 x 
               </mi> 
               <mo>
                 − 
               </mo> 
               <msub> 
                <mi>
                  μ 
                </mi> 
                <mi>
                  i 
                </mi> 
               </msub> 
              </mrow> 
              <mo>
                ‖ 
              </mo> 
             </mrow> 
            </mrow> 
            <mn>
              2 
            </mn> 
           </msup> 
          </mrow> 
         </mstyle> 
        </mrow> 
       </mstyle> 
      </mrow> 
     </math></p>
    <p>where μ<sub>i</sub> is the centroid of cluster C<sub>i</sub> and ∥⋅∥ denotes the Euclidean norm. The total inertia decreases as k increases, and the “elbow” of the curve helps identify a trade-off between the number of clusters and the compactness of the groups <xref ref-type="bibr" rid="scirp.146160-6">
      [6]
     </xref>.</p>
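     <p>The inertia defined above can be computed directly for a given partition. The sketch below is illustrative only: it evaluates W<sub>k</sub> for two hand-made partitions of hypothetical 2-D points rather than running KMeans itself, and the function names are assumptions introduced for the example.</p>
     <preformat>
```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def inertia(clusters):
    """W_k: sum over clusters of squared distances to the cluster centroid."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        total += sum(squared_distance(x, mu) for x in points)
    return total


# Hypothetical points forming two well-separated groups.
one_cluster = [[(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]]
two_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]

w1 = inertia(one_cluster)   # k = 1
w2 = inertia(two_clusters)  # k = 2
```
     </preformat>
     <p>Going from k = 1 to k = 2 removes most of the inertia in this toy example, which is the kind of sharp drop the elbow method looks for.</p>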
    <p>The silhouette method complements this analysis by evaluating the cohesion and separation of clusters for each individual i, according to:</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mi>
         S 
       </mi> 
       <mrow> 
        <mo>
          ( 
        </mo> 
        <mi>
          i 
        </mi> 
        <mo>
          ) 
        </mo> 
       </mrow> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mi>
           b 
         </mi> 
         <mrow> 
          <mo>
            ( 
          </mo> 
          <mi>
            i 
          </mi> 
          <mo>
            ) 
          </mo> 
         </mrow> 
         <mo>
           − 
         </mo> 
         <mi>
           a 
         </mi> 
         <mrow> 
          <mo>
            ( 
          </mo> 
          <mi>
            i 
          </mi> 
          <mo>
            ) 
          </mo> 
         </mrow> 
        </mrow> 
        <mrow> 
         <mtext>
           max 
         </mtext> 
         <mrow> 
          <mo>
            ( 
          </mo> 
          <mrow> 
           <mi>
             a 
           </mi> 
           <mrow> 
            <mo>
              ( 
            </mo> 
            <mi>
              i 
            </mi> 
            <mo>
              ) 
            </mo> 
           </mrow> 
           <mo>
             , 
           </mo> 
           <mi>
             b 
           </mi> 
           <mrow> 
            <mo>
              ( 
            </mo> 
            <mi>
              i 
            </mi> 
            <mo>
              ) 
            </mo> 
           </mrow> 
          </mrow> 
          <mo>
            ) 
          </mo> 
         </mrow> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math></p>
     <p>where a(i) is the mean distance between individual i and the other members of its own cluster, and b(i) is the mean distance between i and the members of the nearest neighboring cluster. The value of S(i) ranges from −1 to 1: a score close to 1 indicates that the individual is well assigned to its cluster, while a negative score suggests that it has probably been assigned to the wrong cluster <xref ref-type="bibr" rid="scirp.146160-7">
      [7]
     </xref>.</p>
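     <p>A per-individual silhouette value can be sketched as follows. This is a minimal standard-library sketch under stated assumptions: the points and labels are hypothetical, the function name silhouette is introduced for the example, and a singleton cluster is given the conventional value 0.</p>
     <preformat>
```python
from math import dist
from statistics import fmean


def silhouette(i, points, labels):
    """Silhouette value (b - a) / max(a, b) of the point with index i."""
    own = labels[i]
    same = [dist(points[i], points[j])
            for j in range(len(points))
            if j != i and labels[j] == own]
    if not same:
        return 0.0  # convention for a cluster with a single member
    a = fmean(same)  # mean distance to the other members of the own cluster
    b = min(         # mean distance to the nearest other cluster
        fmean([dist(points[i], points[j])
               for j in range(len(points)) if labels[j] == other])
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)


# Hypothetical 1-D data; the third point sits next to cluster 0
# but is labelled as belonging to cluster 1.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
labels = [0, 0, 1, 1]

s_first = silhouette(0, points, labels)
s_third = silhouette(2, points, labels)
```
     </preformat>
     <p>Here the first point obtains a clearly positive value, whereas the third point obtains a negative one, reflecting its questionable assignment.</p>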
    <p>
     <xref ref-type="fig" rid="fig1">
      Figure 1
     </xref> and <xref ref-type="fig" rid="fig2">
      Figure 2
     </xref> show the elbow plot and the silhouette plot, respectively.</p>
    <fig id="fig1" position="float">
     <label>Figure 1</label>
     <caption>
      <title>
       Figure 1. Elbow curve.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1724330-rId21.jpeg?20250928025147" />
    </fig>
    <fig id="fig2" position="float">
     <label>Figure 2</label>
     <caption>
      <title>
       Figure 2. Silhouette curve.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1724330-rId22.jpeg?20250928025147" />
    </fig>
    <p>The elbow method shows a sharp decrease in inertia up to k = 3, followed by stabilization, indicating an optimal point. Simultaneously, the silhouette score reaches its highest value at k = 3, reflecting strong internal cohesion and clear separation between groups. These convergent results justify the choice of three clusters, consistent with the expected typology of diabetes profiles (non-diabetic, pre-diabetic, diabetic).</p>
   </sec>
   <sec id="s3_2">
    <title>3.2. Hybrid Model: KMeans + Logistic Regression</title>
    <p>For each cluster C<sub>k</sub>, logistic regression is applied:</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mi>
         P 
       </mi> 
       <mrow> 
        <mo>
          ( 
        </mo> 
        <mrow> 
         <mi>
           y 
         </mi> 
         <mo>
           = 
         </mo> 
         <mn>
           1 
         </mn> 
         <mo>
           | 
         </mo> 
         <mi>
           x 
         </mi> 
        </mrow> 
        <mo>
          ) 
        </mo> 
       </mrow> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mn>
          1 
        </mn> 
        <mrow> 
         <mn>
           1 
         </mn> 
         <mo>
           + 
         </mo> 
         <msup> 
          <mtext>
            e 
          </mtext> 
          <mrow> 
           <mo>
             − 
           </mo> 
           <mrow> 
            <mo>
              ( 
            </mo> 
            <mrow> 
             <msub> 
              <mi>
                β 
              </mi> 
              <mn>
                0 
              </mn> 
             </msub> 
             <mo>
               + 
             </mo> 
             <mstyle displaystyle="true"> 
              <msubsup> 
               <mo>
                 ∑ 
               </mo> 
               <mrow> 
                <mi>
                  j 
                </mi> 
                <mo>
                  = 
                </mo> 
                <mn>
                  1 
                </mn> 
               </mrow> 
               <mi>
                 p 
               </mi> 
              </msubsup> 
              <mrow> 
               <msub> 
                <mi>
                  β 
                </mi> 
                <mi>
                  j 
                </mi> 
               </msub> 
               <msub> 
                <mi>
                  x 
                </mi> 
                <mi>
                  j 
                </mi> 
               </msub> 
              </mrow> 
             </mstyle> 
            </mrow> 
            <mo>
              ) 
            </mo> 
           </mrow> 
          </mrow> 
         </msup> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math></p>
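     <p>where β<sub>0</sub> is the intercept, β<sub>j</sub> is the coefficient associated with variable x<sub>j</sub>, and p is the number of explanatory variables.</p>
     <p>Interpretation: a positive coefficient β<sub>j</sub> means that the estimated probability of diabetes increases with x<sub>j</sub>, whereas a negative coefficient means that it decreases.</p>
     <p>Simplified rules by cluster: if x<sub>j</sub> increases by one unit, the log-odds of diabetes change by β<sub>j</sub>, so the direction and magnitude of the effect on risk are governed by β<sub>j</sub>.</p>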
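     <p>The logistic model above can be evaluated numerically as sketched below. The coefficients are hypothetical values chosen for illustration, not estimates from the study, and the function name predict_proba is an assumption introduced for the example.</p>
     <preformat>
```python
from math import exp


def predict_proba(x, beta0, beta):
    """P(y = 1 | x) = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j)))."""
    z = beta0 + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients for two standardized variables
# (for instance blood glucose and BMI); not fitted on real data.
beta0 = -1.0
beta = [2.0, 0.5]

# An individual at the cluster mean on both variables ...
p_average = predict_proba([0.0, 0.0], beta0, beta)
# ... and one whose first variable is one standard deviation higher.
p_high_first = predict_proba([1.0, 0.0], beta0, beta)
```
     </preformat>
     <p>Because the first coefficient is positive, raising the first variable raises the predicted probability; the log-odds increase by exactly that coefficient.</p>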
   </sec>
   <sec id="s3_3">
    <title>3.3. Hybrid Model: KMeans + Decision Tree</title>
     <p>Within each cluster, the decision tree partitions the data into homogeneous subgroups <xref ref-type="bibr" rid="scirp.146160-8">
       [8]
      </xref>. The homogeneity of a node is measured by the Gini impurity:</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         Gini 
       </mtext> 
       <mo>
         = 
       </mo> 
       <mn>
         1 
       </mn> 
       <mo>
         − 
       </mo> 
       <munderover> 
        <mstyle mathsize="140%" displaystyle="true"> 
         <mo>
           ∑ 
         </mo> 
        </mstyle> 
        <mrow> 
          <mi>
            c 
          </mi> 
          <mo>
            = 
          </mo> 
         <mn>
           1 
         </mn> 
        </mrow> 
        <mi>
          C 
        </mi> 
       </munderover> 
       <mtext>
           
       </mtext> 
       <msubsup> 
        <mi>
          p 
        </mi> 
        <mi>
          c 
        </mi> 
        <mn>
          2 
        </mn> 
       </msubsup> 
      </mrow> 
     </math></p>
     <p>where p<sub>c</sub> is the proportion of observations of class c in the node and C is the number of classes <xref ref-type="bibr" rid="scirp.146160-9">
      [9]
     </xref>.</p>
     <p>Example: “If Blood Glucose &gt; 140 mg/dL and BMI &gt; 30 kg/m² → Diabetes.”</p>
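     <p>To make the Gini criterion concrete, the sketch below computes the impurity of a node and the size-weighted impurity after a candidate split. The labels are hypothetical and the helper names gini and weighted_gini are assumptions introduced for this example.</p>
     <preformat>
```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of the class labels in a node."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Size-weighted impurity of the two children produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical node with three non-diabetic (0) and three diabetic (1) cases.
node = [0, 0, 0, 1, 1, 1]
# Hypothetical split on a threshold, e.g. a blood glucose cut-off.
left = [0, 0, 0, 1]   # at or below the threshold
right = [1, 1]        # above the threshold

impurity_before = gini(node)
impurity_after = weighted_gini(left, right)
```
     </preformat>
     <p>A split is useful when the weighted impurity of the children is lower than that of the parent node, as is the case here.</p>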
   </sec>
   <sec id="s3_4">
    <title>3.4. Model Evaluation</title>
    <p>The evaluation of the hybrid models’ performance is based on two complementary aspects: quantitative indicators and the relevance of the extracted rules.</p>
    <p>To assess the quality of predictions, accuracy is used, which corresponds to the proportion of correct predictions relative to the total number of predictions:</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         Accuracy 
       </mtext> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           TN 
         </mtext> 
        </mrow> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           TN 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           FP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           FN 
         </mtext> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math></p>
     <p>where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively <xref ref-type="bibr" rid="scirp.146160-10">
      [10]
     </xref>.</p>
    <p>Recall assesses the model’s ability to correctly identify individuals who are actually positive (diabetic):</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         Recall 
       </mtext> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
        </mrow> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           FN 
         </mtext> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math></p>
     <p>Finally, the F1-score is the harmonic mean of precision, defined as TP / (TP + FP), and recall, providing a single metric that balances the two:</p>
    <p>
     <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         F 
       </mtext> 
       <mn>
         1 
       </mn> 
       <mo>
         = 
       </mo> 
       <mn>
         2 
       </mn> 
       <mo>
         ⋅ 
       </mo> 
       <mfrac> 
        <mrow> 
         <mtext>
           Precision 
         </mtext> 
         <mo>
           ⋅ 
         </mo> 
         <mtext>
           Recall 
         </mtext> 
        </mrow> 
        <mrow> 
         <mtext>
           Precision 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           Recall 
         </mtext> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math></p>
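     <p>The three metrics can be obtained from the four confusion-matrix counts as sketched below. The counts are hypothetical and do not correspond to any table in this study; the function name classification_metrics is an assumption introduced for the example.</p>
     <preformat>
```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for the diabetic (positive) class.
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
```
     </preformat>
     <p>When a model never predicts the positive class, precision, recall, and F1 for that class are reported as 0 in this sketch, which is one common convention.</p>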
    <p>Beyond numerical metrics, it is essential to evaluate the readability and clinical and behavioral applicability of the rules extracted by the models. The rules should be interpretable, consistent with known risk factors such as blood glucose, body mass index, heredity, and physical activity, and directly actionable to guide prevention and intervention strategies.</p>
   </sec>
  </sec><sec id="s4">
   <title>4. Results</title>
   <sec id="s4_1">
    <title>4.1. Comparison of the Performance of KMeans + Logistic Regression and KMeans + Decision Tree</title>
    <p>
     <xref ref-type="table" rid="table1">
      Table 1
     </xref> and <xref ref-type="table" rid="table2">
      Table 2
      </xref> present the baseline performance of KMeans and its cluster-wise breakdown, while <xref ref-type="table" rid="table3">
       Table 3
      </xref> and <xref ref-type="table" rid="table4">
       Table 4
      </xref> show the improved per-cluster performance obtained with logistic regression and the decision tree.</p>
    <table-wrap id="table1">
     <label>
      <xref ref-type="table" rid="table1">
       Table 1
      </xref></label>
     <caption>
      <title>
        Table 1. Overall results of KMeans + mapping → classes.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="17.24%"><p style="text-align:center">Metric</p></td> 
       <td class="custom-bottom-td acenter" width="9.15%"><p style="text-align:center">Value</p></td> 
       <td class="custom-bottom-td acenter" width="73.60%"><p style="text-align:center">Comment</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="17.24%"><p style="text-align:center">Accuracy</p></td> 
       <td class="custom-top-td acenter" width="9.15%"><p style="text-align:center">0.71</p></td> 
       <td class="custom-top-td acenter" width="73.60%"><p style="text-align:center">Fair performance but can be improved, especially for diabetic individuals.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">Precision (0)</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.84</p></td> 
       <td class="acenter" width="73.60%"><p style="text-align:center">The non-diabetic class is well predicted.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">Recall (0)</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.70</p></td> 
       <td class="acenter" width="73.60%"><p style="text-align:center">70% of non-diabetic individuals are correctly identified.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">F1-score (0)</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.76</p></td> 
       <td class="acenter" width="73.60%"><p style="text-align:center">Good balance between precision and recall for non-diabetics.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">Precision (1)</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.54</p></td> 
       <td class="acenter" width="73.60%"><p style="text-align:center">Low precision for diabetic individuals.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">Recall (1)</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.71</p></td> 
        <td class="acenter" width="73.60%"><p style="text-align:center">Acceptable recall; some diabetics are still misclassified.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">F1-score (1)</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.61</p></td> 
       <td class="acenter" width="73.60%"><p style="text-align:center">Average score for the diabetic class.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">Macro avg</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.69</p></td> 
       <td class="acenter" width="73.60%"><p style="text-align:center">Balanced average across both classes.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="17.24%"><p style="text-align:center">Weighted avg</p></td> 
       <td class="acenter" width="9.15%"><p style="text-align:center">0.74</p></td> 
       <td class="acenter" width="73.60%"><p style="text-align:center">Acceptable weighted average.</p></td> 
      </tr> 
     </table>
    </table-wrap>
     <p>Legend 1: Overall performance of KMeans after cluster mapping, with class-specific metrics and comments on reliability.</p>
    <table-wrap id="table2">
     <label>
      <xref ref-type="table" rid="table2">
       Table 2
      </xref></label>
     <caption>
      <title>
        Table 2. Cluster-wise results—Unsupervised KMeans.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="10.01%"><p style="text-align:center">Cluster</p></td> 
       <td class="custom-bottom-td acenter" width="11.71%"><p style="text-align:center">Accuracy</p></td> 
       <td class="custom-bottom-td acenter" width="13.20%"><p style="text-align:center">Precision (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="12.77%"><p style="text-align:center">Recall (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="12.99%"><p style="text-align:center">F1-score (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="39.32%"><p style="text-align:center">Comment</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="10.01%"><p style="text-align:center">0</p></td> 
       <td class="custom-top-td acenter" width="11.71%"><p style="text-align:center">0.68</p></td> 
       <td class="custom-top-td acenter" width="13.20%"><p style="text-align:center">0.68/0.00</p></td> 
       <td class="custom-top-td acenter" width="12.77%"><p style="text-align:center">1.00/0.00</p></td> 
       <td class="custom-top-td acenter" width="12.99%"><p style="text-align:center">0.81/0.00</p></td> 
       <td class="custom-top-td acenter" width="39.32%"><p style="text-align:center">Class 1 (diabetic) not predicted, low performance for this class.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="10.01%"><p style="text-align:center">1</p></td> 
       <td class="acenter" width="11.71%"><p style="text-align:center">0.98</p></td> 
       <td class="acenter" width="13.20%"><p style="text-align:center">0.98/0.00</p></td> 
       <td class="acenter" width="12.77%"><p style="text-align:center">1.00/0.00</p></td> 
       <td class="acenter" width="12.99%"><p style="text-align:center">0.99/0.00</p></td> 
        <td class="acenter" width="39.32%"><p style="text-align:center">Very good for non-diabetics, but almost no diabetics present to evaluate.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="10.01%"><p style="text-align:center">2</p></td> 
       <td class="acenter" width="11.71%"><p style="text-align:center">0.54</p></td> 
       <td class="acenter" width="13.20%"><p style="text-align:center">0.00/0.54</p></td> 
       <td class="acenter" width="12.77%"><p style="text-align:center">0.00/1.00</p></td> 
       <td class="acenter" width="12.99%"><p style="text-align:center">0.00/0.70</p></td> 
        <td class="acenter" width="39.32%"><p style="text-align:center">Conversely, class 0 is never predicted; overall performance is low.</p></td> 
      </tr> 
     </table>
    </table-wrap>
    <p>Legend 2: Cluster-wise performance for unsupervised KMeans, indicating the model’s ability to predict each class and the observed limitations for certain classes.</p>
    <table-wrap id="table3">
     <label>
      <xref ref-type="table" rid="table3">
       Table 3
      </xref></label>
     <caption>
      <title>
        Table 3. Cluster-wise results—KMeans + logistic regression.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="10.86%"><p style="text-align:center">Cluster</p></td> 
       <td class="custom-bottom-td acenter" width="12.14%"><p style="text-align:center">Accuracy</p></td> 
       <td class="custom-bottom-td acenter" width="14.69%"><p style="text-align:center">Precision (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="12.77%"><p style="text-align:center">Recall (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="13.63%"><p style="text-align:center">F1-score (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="35.92%"><p style="text-align:center">Comment</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="10.86%"><p style="text-align:center">0</p></td> 
       <td class="custom-top-td acenter" width="12.14%"><p style="text-align:center">0.97</p></td> 
       <td class="custom-top-td acenter" width="14.69%"><p style="text-align:center">1.00/0.93</p></td> 
       <td class="custom-top-td acenter" width="12.77%"><p style="text-align:center">0.96/1.00</p></td> 
       <td class="custom-top-td acenter" width="13.63%"><p style="text-align:center">0.98/0.96</p></td> 
       <td class="custom-top-td acenter" width="35.92%"><p style="text-align:center">Very good performance for both classes.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="10.86%"><p style="text-align:center">1</p></td> 
       <td class="acenter" width="12.14%"><p style="text-align:center">0.98</p></td> 
       <td class="acenter" width="14.69%"><p style="text-align:center">0.98/0.00</p></td> 
       <td class="acenter" width="12.77%"><p style="text-align:center">1.00/0.00</p></td> 
       <td class="acenter" width="13.63%"><p style="text-align:center">0.99/0.00</p></td> 
       <td class="acenter" width="35.92%"><p style="text-align:center">Diabetic class almost absent, minority class evaluation not possible.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="10.86%"><p style="text-align:center">2</p></td> 
       <td class="acenter" width="12.14%"><p style="text-align:center">0.98</p></td> 
       <td class="acenter" width="14.69%"><p style="text-align:center">1.00/0.97</p></td> 
       <td class="acenter" width="12.77%"><p style="text-align:center">0.97/1.00</p></td> 
       <td class="acenter" width="13.63%"><p style="text-align:center">0.98/0.99</p></td> 
       <td class="acenter" width="35.92%"><p style="text-align:center">Excellent performance in this cluster, very balanced.</p></td> 
      </tr> 
     </table>
    </table-wrap>
    <p>Legend 3: Cluster-wise performance for the KMeans + Logistic Regression model, showing overall effectiveness and evaluation limitations when certain classes are underrepresented.</p>
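    <p>As a minimal sketch of the hybrid procedure summarized in Table 3, the following Python code fits KMeans on standardized features and then trains one logistic regression per cluster. The synthetic data from make_classification, the 70/30 split, and all variable names are assumptions made for illustration; they are not the data or the code of this study.</p>
    <preformat preformat-type="code">
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the study data (assumption: 6 numeric features).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize with statistics learned on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 1: segment the population into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_s)
train_clusters = kmeans.labels_
test_clusters = kmeans.predict(X_test_s)

# Step 2: fit and evaluate one classifier per cluster.
results = {}
for c in range(3):
    tr = train_clusters == c
    te = test_clusters == c
    if te.sum() == 0:
        results[c] = None  # no test individuals fall in this cluster
        continue
    classes = np.unique(y_train[tr])
    if classes.size == 1:
        # Only one class in the training part: use a constant prediction.
        y_pred = np.full(te.sum(), classes[0])
    else:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train_s[tr], y_train[tr])
        y_pred = model.predict(X_test_s[te])
    results[c] = accuracy_score(y_test[te], y_pred)

print(results)
```
    </preformat>
    <p>Clusters whose training labels contain a single class receive a constant prediction in this sketch. This is only one possible fallback, chosen because a logistic regression cannot be fitted on a single class.</p>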
    <table-wrap id="table4">
     <label>
      <xref ref-type="table" rid="table4">
       Table 4
      </xref></label>
     <caption>
      <title>
       Table 4. Cluster-wise results: KMeans + decision tree.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="10.86%"><p style="text-align:center">Cluster</p></td> 
       <td class="custom-bottom-td acenter" width="11.92%"><p style="text-align:center">Accuracy</p></td> 
       <td class="custom-bottom-td acenter" width="14.48%"><p style="text-align:center">Precision (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="13.20%"><p style="text-align:center">Recall (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="14.26%"><p style="text-align:center">F1-score (0/1)</p></td> 
       <td class="custom-bottom-td acenter" width="35.28%"><p style="text-align:center">Comment</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="10.86%"><p style="text-align:center">0</p></td> 
       <td class="custom-top-td acenter" width="11.92%"><p style="text-align:center">0.92</p></td> 
       <td class="custom-top-td acenter" width="14.48%"><p style="text-align:center">1.00/0.83</p></td> 
       <td class="custom-top-td acenter" width="13.20%"><p style="text-align:center">0.86/1.00</p></td> 
       <td class="custom-top-td acenter" width="14.26%"><p style="text-align:center">0.92/0.91</p></td> 
       <td class="custom-top-td acenter" width="35.28%"><p style="text-align:center">Good balance; precision is slightly lower for class 1.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="10.86%"><p style="text-align:center">1</p></td> 
       <td class="acenter" width="11.92%"><p style="text-align:center">0.86</p></td> 
       <td class="acenter" width="14.48%"><p style="text-align:center">1.00/0.00</p></td> 
       <td class="acenter" width="13.20%"><p style="text-align:center">0.86/0.00</p></td> 
       <td class="acenter" width="14.26%"><p style="text-align:center">0.92/0.00</p></td> 
       <td class="acenter" width="35.28%"><p style="text-align:center">Difficult cluster: no diabetics present, recall and F1 for class 1 not computable.</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="10.86%"><p style="text-align:center">2</p></td> 
       <td class="acenter" width="11.92%"><p style="text-align:center">1.00</p></td> 
       <td class="acenter" width="14.48%"><p style="text-align:center">1.00/1.00</p></td> 
       <td class="acenter" width="13.20%"><p style="text-align:center">1.00/1.00</p></td> 
       <td class="acenter" width="14.26%"><p style="text-align:center">1.00/1.00</p></td> 
       <td class="acenter" width="35.28%"><p style="text-align:center">Perfect cluster for the decision tree, all classes correctly predicted.</p></td> 
      </tr> 
     </table>
    </table-wrap>
    <p>Legend 4: Cluster-wise performance for the KMeans + Decision Tree model, highlighting overall effectiveness, class balance, and limitations related to clusters where certain classes are absent.</p>
   </sec>
   <sec id="s4_2">
    <title>4.2. Interpretation of Results</title>
    <p>Global KMeans with class mapping shows fair performance (accuracy 0.71): non-diabetic individuals are predicted well, but precision for diabetics is limited. At the cluster level, unsupervised KMeans varies strongly: some clusters contain almost exclusively one class, which makes evaluation of the other class impossible, and prediction of the minority class is poor. KMeans + Logistic Regression achieves excellent performance in balanced clusters, but evaluation is limited when a class is nearly absent. KMeans + Decision Tree predicts both classes accurately in balanced clusters, whereas clusters lacking a class show limitations in recall and F1-score. Overall, the hybrid models improve prediction compared to KMeans alone, particularly in well-represented clusters.</p>
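    <p>The class mapping applied to global KMeans can be sketched as a majority vote, under the assumption (not spelled out in the text) that each cluster is assigned the most frequent class among its members. The toy arrays and the helper name map_clusters_to_classes below are hypothetical.</p>
    <preformat preformat-type="code">
```python
import numpy as np


def map_clusters_to_classes(cluster_labels, y_true):
    """Assign to each cluster the most frequent true class of its members."""
    mapping = {}
    for c in np.unique(cluster_labels):
        members = y_true[cluster_labels == c]
        mapping[int(c)] = int(np.bincount(members).argmax())
    return mapping


# Toy example: 10 individuals, 3 clusters, binary diabetes label.
clusters = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y = np.array([1, 1, 0, 0, 0, 0, 1, 1, 1, 1])

mapping = map_clusters_to_classes(clusters, y)
y_pred = np.array([mapping[int(c)] for c in clusters])
accuracy = float((y_pred == y).mean())

print(mapping)   # {0: 1, 1: 0, 2: 1}
print(accuracy)  # 0.8
```
    </preformat>
    <p>With such a mapping, every minority-class individual inside a cluster is misclassified by construction, which is consistent with the limited precision reported for diabetics.</p>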
   </sec>
   <sec id="s4_3">
    <title>4.3. Graphs</title>
    <p>After applying KMeans, individuals are grouped into three clusters, whose risk labels are listed after Figure 3.</p>
    <p>The PCA projection in <xref ref-type="fig" rid="fig3">
      Figure 3
     </xref> shows good separation of the groups for both hybrid models.</p>
    <fig id="fig3" position="float">
     <label>Figure 3</label>
     <caption>
      <title>
       Figure 3. KMeans model on PCA projection.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1724330-rId33.jpeg?20250928025156" />
    </fig>
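    <p>A projection of this kind can be obtained along the following lines. This is a sketch under stated assumptions: the data are synthetic blobs from make_blobs, the six features and all variable names are illustrative, and the plotting calls are left as comments so that the block runs without a graphics backend.</p>
    <preformat preformat-type="code">
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the study data (assumption: 6 numeric features).
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X_s = StandardScaler().fit_transform(X)

# Cluster in the full feature space, then project to 2D for display only.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_s)
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_s)
explained = float(pca.explained_variance_ratio_.sum())

print(coords.shape)
print(f"variance retained by the 2 components: {explained:.2f}")

# Optional plot (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(coords[:, 0], coords[:, 1], c=labels)
# plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()
```
    </preformat>
    <p>The share of variance retained by the two components indicates how faithfully the 2D picture reflects separation in the original feature space.</p>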
    <p>Cluster 0—High Risk</p>
    <p>Cluster 1—Very Low Risk</p>
    <p>Cluster 2—Very High Risk</p>
    <p>General Remarks:</p>
    <p>The KMeans model combined with logistic regression allowed the identification of three distinct clusters, each exhibiting specific diabetes-related risk profiles. This approach highlights the key variables unique to each group and facilitates targeted analysis.</p>
    <p>Cluster 0 (Logistic Regression)</p>
    <p>Interpretation:</p>
    <p>The model shows excellent performance (accuracy 0.97). The diabetic class (1) is very well detected (recall 1.00) with good precision (0.93). The most influential variables are heredity, BMI, and blood glucose, which significantly increase the risk of diabetes. Paradoxically, HbA1c appears as a protective factor (negative coefficient).</p>
    <p>Explanation:</p>
    <p>This cluster groups a population where heredity and excess weight play a major role. The counterintuitive result for HbA1c is related to a particular distribution: some individuals with lower HbA1c may still be classified as diabetic due to very high BMI and blood glucose. This illustrates the limitation of global coefficients within a restricted cluster.</p>
    <p>Cluster 1 (Logistic Regression)</p>
    <p>Interpretation:</p>
    <p>The model achieves excellent overall accuracy (0.98), but it fails to detect the diabetic class: no class 1 individual is predicted, so precision and recall for that class are 0.00. The determining factors are heredity, blood glucose, BMI, and cholesterol, which increase risk. As in Cluster 0, HbA1c paradoxically appears as a protective factor.</p>
    <p>Explanation:</p>
    <p>This cluster is unbalanced: almost no diabetics are present (class 1 support = 1). The model is therefore biased toward non-diabetics, which explains the skewed performance. The near absence of diabetics prevents a true assessment of model robustness. This cluster mainly illustrates the protective role of normal values (low blood glucose and HbA1c) but highlights the statistical limitations of regression on minority groups.</p>
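    <p>The effect of a single diabetic individual on the metrics can be reproduced with a small worked example. The cluster size of 50 is hypothetical (the text only reports a class 1 support of 1), and the predictions are assumed to be all non-diabetic.</p>
    <preformat preformat-type="code">
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical test cluster: 49 non-diabetics and 1 diabetic.
y_true = [0] * 49 + [1]
# A model that predicts "non-diabetic" for everyone.
y_pred = [0] * 50

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0)

print(f"accuracy  : {acc:.2f}")                                 # 0.98
print(f"precision : {precision[0]:.2f}/{precision[1]:.2f}")     # 0.98/0.00
print(f"recall    : {recall[0]:.2f}/{recall[1]:.2f}")           # 1.00/0.00
print(f"f1-score  : {f1[0]:.2f}/{f1[1]:.2f}")                   # 0.99/0.00
print(f"support   : {support[0]}/{support[1]}")                 # 49/1
```
    </preformat>
    <p>High accuracy here reflects only the majority class. The zero_division=0 argument reports the undefined class 1 precision as 0.00 instead of raising a warning.</p>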
    <p>Cluster 2 (Logistic Regression)</p>
    <p>Interpretation:</p>
    <p>Excellent performance (accuracy 0.98). The model clearly distinguishes between diabetic and non-diabetic individuals. Blood glucose is by far the dominant factor (very high coefficient). Some results appear counterintuitive: age, BMI, and heredity appear as protective factors (negative coefficients), whereas HbA1c and cholesterol increase the risk.</p>
    <p>Explanation:</p>
    <p>This cluster appears to be characterized by a younger population with an overall high BMI. The predominant weight of blood glucose masks the other variables: even a young obese individual may be classified as non-diabetic if their blood glucose is normal. The “protective” effect of age and BMI does not reflect clinical reality but rather an internal correlation effect: in this cluster, true diabetics are younger and have very high blood glucose/HbA1c, hence this paradox.</p>
    <p>
     <xref ref-type="fig" rid="fig4">
      Figure 4
     </xref> shows the variables ranked by importance, followed by a summary of the simplified explanatory rules in <xref ref-type="fig" rid="fig5">
      Figure 5
     </xref>, which indicate the positive or negative influence of each factor on the probability of belonging to a diabetic profile in Cluster 0.</p>
    <fig id="fig4" position="float">
     <label>Figure 4</label>
     <caption>
      <title>
       Figure 4. Key variables/cluster 0.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1724330-rId34.jpeg?20250928025157" />
    </fig>
    <fig id="fig5" position="float">
     <label>Figure 5</label>
     <caption>
      <title>
       Figure 5. Explanatory rules/cluster 0.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1724330-rId35.jpeg?20250928025157" />
    </fig>
    <p>This model highlights simple and interpretable rules to differentiate risk profiles. The decision tree primarily relies on heredity, BMI category, and blood glucose to classify individuals as diabetic or non-diabetic.</p>
    <p>Cluster 0 (Decision Tree)</p>
    <p>Interpretation: This cluster shows good classification performance (accuracy 92%). The rules indicate that diabetes is mainly associated with BMI category (encoded value above 1.5, indicating overweight/obesity) and blood glucose (&gt;146.5), especially in cases of positive heredity.</p>
    <p>Explanation: Risk is modulated by two key factors: high BMI and blood glucose above the critical threshold increase the probability of diabetes, particularly for individuals with a family history. Conversely, normal BMI and lower blood glucose act as protective factors.</p>
    <p>Cluster 1 (Decision Tree)</p>
    <p>Interpretation: Classification shows perfect precision for non-diabetic individuals (1.00) with an accuracy of 86%, but the cluster contains no diabetic cases (zero support). The tree indicates that blood glucose ≤ 140 is associated with non-diabetics.</p>
    <p>Explanation: This cluster groups a predominantly healthy population, characterized by normal blood glucose. The absence of diabetic cases prevents fine-tuning predictions for positives but confirms that blood glucose remains the main discriminating indicator.</p>
    <p>Cluster 2 (Decision Tree)</p>
    <p>Interpretation: The results are perfect (accuracy 100%). The main rule is based on blood glucose: ≤125 for non-diabetics, &gt;125 for diabetics.</p>
    <p>Explanation: This cluster illustrates a clear and robust separation based exclusively on blood glucose. It represents a group where glycemic value is the predominant factor, making classification highly reliable and directly interpretable clinically.</p>
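    <p>Threshold rules of this type can be read directly from a fitted tree. The sketch below uses hypothetical glucose values, chosen so that the learned split falls at 125, and a depth-1 tree; it illustrates the mechanism only and does not reproduce the trees of this study.</p>
    <preformat preformat-type="code">
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical blood glucose values and diabetes labels.
glucose = np.array([[90.0], [100.0], [110.0], [120.0],
                    [130.0], [140.0], [150.0], [160.0]])
diabetic = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A depth-1 tree learns a single threshold on the single feature.
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(glucose, diabetic)

# The split is placed midway between 120 and 130.
threshold = float(tree.tree_.threshold[0])
print(threshold)  # 125.0

# Human-readable rules, in the spirit of the extracted decision rules.
rules = export_text(tree, feature_names=["glucose"])
print(rules)
```
    </preformat>
    <p>Because the threshold is the midpoint between two observed values, its exact position depends on the sample. With small clusters, a cut-off such as 125 should be read as approximate.</p>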
    <p>The rules extracted from <xref ref-type="fig" rid="fig6">
      Figure 6
     </xref> and <xref ref-type="fig" rid="fig7">
      Figure 7
     </xref> illustrate the decision-making process and the key variables of Cluster 0.</p>
    <fig id="fig6" position="float">
     <label>Figure 6</label>
     <caption>
      <title>
       Figure 6. Decision tree: KMeans + Decision tree/Cluster 0.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1724330-rId36.jpeg?20250928025158" />
    </fig>
    <fig id="fig7" position="float">
     <label>Figure 7</label>
     <caption>
      <title>
       Figure 7. Key variables/cluster 0.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1724330-rId37.jpeg?20250928025158" />
    </fig>
   </sec>
  </sec><sec id="s5">
   <title>5. Discussion</title>
   <sec id="s5_1">
    <title>5.1. Performance Analysis: Strengths and Weaknesses of Each Model</title>
    <p>
     <xref ref-type="table" rid="table5">
      Table 5
     </xref> summarizes the performance of the KMeans-based hybrid models, highlighting their strengths and limitations. This synthesis, without providing detailed numerical metrics, allows a quick assessment of the overall effectiveness of each approach, while emphasizing the influence of minority classes and the impact of unbalanced clusters on the results.</p>
    <table-wrap id="table5">
     <label>
      <xref ref-type="table" rid="table5">
       Table 5
      </xref></label>
     <caption>
      <title>
       Table 5. Strengths and weaknesses of each model.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="26.82%"><p style="text-align:center">Model</p></td> 
       <td class="custom-bottom-td acenter" width="36.59%"><p style="text-align:center">Strengths</p></td> 
       <td class="custom-bottom-td acenter" width="36.59%"><p style="text-align:center">Weaknesses</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="26.82%"><p style="text-align:center">Global KMeans</p><p style="text-align:center">(mapping → classes)</p></td> 
       <td class="custom-top-td acenter" width="36.59%"><p style="text-align:center">Good identification of the majority class; acceptable balance between the two classes</p></td> 
       <td class="custom-top-td acenter" width="36.59%"><p style="text-align:center">Low precision for the minority class; some diabetics misclassified</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="26.82%"><p style="text-align:center">Unsupervised KMeans by cluster</p></td> 
       <td class="acenter" width="36.59%"><p style="text-align:center">Some clusters predict the majority class very well</p></td> 
       <td class="acenter" width="36.59%"><p style="text-align:center">Minority class often absent or poorly predicted; overall poor performance in unbalanced clusters</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="26.82%"><p style="text-align:center">KMeans + Logistic Regression</p></td> 
       <td class="acenter" width="36.59%"><p style="text-align:center">Excellent performance in balanced clusters; both classes correctly predicted</p></td> 
       <td class="acenter" width="36.59%"><p style="text-align:center">Evaluation impossible for the minority class in unbalanced clusters; sensitive to class distribution</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="26.82%"><p style="text-align:center">KMeans + Decision Tree</p></td> 
       <td class="acenter" width="36.59%"><p style="text-align:center">Excellent performance in balanced clusters; captures complex variable interactions; good balance in some partially unbalanced clusters</p></td> 
       <td class="acenter" width="36.59%"><p style="text-align:center">Clusters with absence of a class prevent evaluation of some metrics; sensitive to small clusters and risk of overfitting</p></td> 
      </tr> 
     </table>
    </table-wrap>
    <p>Legend 5: Summary of the strengths and limitations of KMeans alone and hybrid models, depending on cluster balance and diabetic prediction.</p>
   </sec>
   <sec id="s5_2">
    <title>5.2. Best Method</title>
    <p>Comparing the two hybrid approaches shows that both models perform very well on balanced clusters but are limited when a class is underrepresented <xref ref-type="bibr" rid="scirp.146160-11">
      [11]
     </xref>. KMeans + Logistic Regression offers a good precision/recall trade-off on balanced clusters but becomes limited when the minority class (diabetics) is almost absent <xref ref-type="bibr" rid="scirp.146160-12">
      [12]
     </xref>. KMeans + Decision Tree remains highly effective on balanced clusters, sometimes achieving perfect balance, but struggles on unbalanced clusters where a class is absent <xref ref-type="bibr" rid="scirp.146160-13">
      [13]
     </xref>. In this study, the KMeans clusters are unbalanced, making the Decision Tree slightly preferable: it excels on the main cluster, remains generally robust, and provides interpretable rules, whereas Logistic Regression, although reliable, is less suited to highly unbalanced clusters.</p>
   </sec>
   <sec id="s5_3">
    <title>5.3. Educational and Medical Actions Based on the KMeans + Logistic Regression Method</title>
    <p>Intervention guidelines by cluster, according to clinical, practical, behavioral, and educational dimensions, are as follows:</p>
    <p>Cluster 0—Risk modulated by BMI, heredity, and blood glucose:</p>
    <p>Cluster 1—Predominantly healthy population with low diabetes prevalence:</p>
    <p>Cluster 2—High-risk population with clear distinction based on blood glucose:</p>
   </sec>
   <sec id="s5_4">
    <title>5.4. Limitations</title>
    <p>Methodologically, the small sample size and the arbitrary choice of the number of clusters may affect the stability of the results, with a risk of overfitting in the decision tree. Clinically, the averages used to define the clusters do not reflect individual variability, and some groups (Cluster 1) are underrepresented. Practically, implementing the recommendations, particularly for intensive monitoring of Cluster 2, may be challenging in resource-limited settings. Behaviorally, the study does not account for actual adherence or social and cultural factors. Educationally, the interventions remain general, and their effectiveness has not been evaluated.</p>
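    <p>One common way to make the choice of the number of clusters less arbitrary is to compare silhouette scores across candidate values. The sketch below shows the idea on synthetic data; the candidate range of 2 to 6 and all names are assumptions, and the text does not state that this criterion was applied in the study.</p>
    <preformat preformat-type="code">
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the study data (assumption: 6 numeric features).
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X_s = StandardScaler().fit_transform(X)

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_s)
    scores[k] = float(silhouette_score(X_s, labels))

# Higher is better; the score lies between -1 and 1.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("selected k:", best_k)
```
    </preformat>
    <p>On a small sample, repeating the procedure over several random seeds or resamples gives an indication of how stable the selected value is.</p>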
    <p>After implementing cluster-specific interventions, the perspectives are as follows:</p>
   </sec>
  </sec><sec id="s6">
   <title>6. Conclusion</title>
   <sec id="s6_1">
    <title>6.1. Key Points</title>
    <p>This study highlights the value of a segmented approach for analyzing diabetes risk factors. The combination of KMeans with logistic regression and decision tree methods allowed classification of the population into three distinct clusters, corresponding to very low-, high-, and very high-risk profiles. The hybrid models improved both accuracy and interpretability compared to KMeans alone, emphasizing cluster-specific key variables such as blood glucose, BMI, and heredity. Nevertheless, some unbalanced clusters limited the evaluation of minority classes, highlighting methodological and statistical constraints related to sample size. In this context, the cluster-wise results make the hybrid KMeans + Decision Tree method slightly preferable.</p>
   </sec>
   <sec id="s6_2">
    <title>6.2. Value of the Dataset</title>
    <p>The dataset diabete_custom.xlsx, derived from the Pima Indians Diabetes Dataset and enriched with behavioral and medical variables, enabled the application of hybrid methods on realistic tabular data. Its multidimensional richness facilitated the identification of risk profiles and the detection of key diabetes factors across different clusters, while providing support for targeted and reproducible analyses.</p>
   </sec>
   <sec id="s6_3">
    <title>6.3. Recommendations</title>
    <p>Based on the results, the following recommendations are proposed:</p>
    <p>In conclusion, this segmented approach provides a powerful tool for guiding personalized prevention strategies, better adapted to individual profiles, and constitutes a solid foundation for future research aimed at optimizing diabetes prevention and management.</p>
   </sec>
  </sec>
 </body><back>
  <ref-list>
   <title>References</title>
   <ref id="scirp.146160-ref1">
    <label>1</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Sarra, S. (2024) Approche basée ia pour un système de prédiction du diabète. Thèse de Doctorat, Université Larbi Tébessi.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref2">
    <label>2</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Mohebbi, M.A. (2021) A Machine Learning Approach to Treatment Improvement in Diabetes. Doctoral Thesis, Technical University of Denmark.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref3">
    <label>3</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Hamada Bonheur, A. (2024) Conception et réalisation d’une application web de diagnostic intelligent du diabète. Thèse de Doctorat, Université des Sciences et Technologies de l’Université de Constantine 2.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref4">
    <label>4</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Dua, D. and Graff, C. (2017) UCI Machine Learning Repository: Pima Indians Diabetes Dataset. University of California.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref5">
    <label>5</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Tan, P.-N., Steinbach, M. and Kumar, V. (2019) Introduction à la fouille de données. 2e Edition, Pearson, 120.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref6">
    <label>6</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Kaufman, L. and Rousseeuw, P.J. (2005) Clustering par partition et méthodes de validation. Dunod, 82.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref7">
    <label>7</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Rousseeuw, P.J. (1987) Silhouettes: une méthode graphique pour interpréter et valider des clusters. Masson, 53.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref8">
    <label>8</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Quinlan, J.R. (1993) C4.5: Arbres de décision pour la classification. Eyrolles, 45.
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref9">
    <label>9</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Hastie, T., Tibshirani, R. and Friedman, J. (2011) Apprentissage statistique: Avec applications en R. Springer, 85. 
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref10">
    <label>10</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Chicco, D. and Jurman, G. (2020) The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genomics, 21, Article No. 6. https://doi.org/10.1186/s12864-019-6413-7
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref11">
    <label>11</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Gupta, S.L., Khandelwal, V., Katria, V., Sharma, D.A. and Pandey, A. (2024) Analyzing the Efficacy of K-Means Clustering and Logistic Regression for Diabetes Prediction. South Eastern European Journal of Public Health, 25, 1255-1262. https://doi.org/10.70135/seejph.vi.2454
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref12">
    <label>12</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     ElSeddawy, A.I., Karim, F.K., Hussein, A.M. and Khafaga, D.S. (2022) Predictive Analysis of Diabetes-Risk with Class Imbalance. Computational Intelligence and Neuroscience, 2022, Article ID: 3078025. https://doi.org/10.1155/2022/3078025
    </mixed-citation>
   </ref>
   <ref id="scirp.146160-ref13">
    <label>13</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Aliyu, H.A. (2024) Optimizing Machine Learning Algorithms for Diabetes Data. ScienceDirect, 5.
    </mixed-citation>
   </ref>
  </ref-list>
 </back>
</article>