<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JDAIP</journal-id><journal-title-group><journal-title>Journal of Data Analysis and Information Processing</journal-title></journal-title-group><issn pub-type="epub">2327-7211</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jdaip.2023.113012</article-id><article-id pub-id-type="publisher-id">JDAIP-126207</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject><subject> Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Application of Regularized Logistic Regression and Artificial Neural Network Model for Ozone Classification across El Paso County, Texas, United States
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Callistus</surname><given-names>Obunadike</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Adekunle</surname><given-names>Adefabi</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Somtobe</surname><given-names>Olisah</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>David</surname><given-names>Abimbola</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Kunle</surname><given-names>Oloyede</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Department of Computer Science, Austin Peay State University, Clarksville, USA</addr-line></aff><pub-date pub-type="epub"><day>11</day><month>07</month><year>2023</year></pub-date><volume>11</volume><issue>03</issue><fpage>217</fpage><lpage>239</lpage><history><date date-type="received"><day>25</day><month>April</month><year>2023</year></date><date date-type="rev-recd"><day>8</day><month>July</month><year>2023</year></date><date date-type="accepted"><day>11</day><month>July</month><year>2023</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). 
http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  This paper focuses on predicting atmospheric ozone using a machine learning approach. We use air pollutant and meteorological datasets from the El Paso area to classify daily ozone levels as high or low. Logistic regression (LR) and artificial neural network (ANN) algorithms are trained on the data. The ANN and LR models achieve classification accuracies of 89.3% and 88.4%, respectively, in predicting the ozone level on a given day. The AUC values of the two models are comparable, with the ANN achieving 95.4% and the LR 95.2%. A lower cross-entropy loss (log loss) indicates better model performance; our ANN model yields a log loss of 3.74, while the LR model yields a log loss of 6.03. The prediction time for the ANN model is approximately 0.00 seconds, whereas the LR model takes 0.02 seconds. Our odds ratio analysis indicates that the features “Solar radiation”, “Std. Dev. Wind Direction”, “outdoor temperature”, “dew point temperature”, and “PM10” contribute to high ozone levels in El Paso, Texas. Based on accuracy, error rate, log loss, and prediction time, the ANN model proves to be faster and more suitable for ozone classification in the El Paso, Texas area.
 
</p></abstract><kwd-group><kwd>Machine Learning</kwd><kwd>Ozone Prediction</kwd><kwd>Pollutants Forecasting</kwd><kwd>Atmospheric Monitoring</kwd><kwd>Air Quality</kwd><kwd>Logistic Regression</kwd><kwd>Artificial Neural Network</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Ozone is created in the atmosphere from gases that are released through smokestacks, tailpipes, and a variety of other sources. These gases react when exposed to sunlight, thereby creating ozone pollution. Ozone is a key component of the Earth’s atmosphere; it plays a vital role in protecting life on our planet by absorbing harmful ultraviolet radiation. However, excessive levels of ozone can have negative impacts on human health and the environment. Ozone prediction is an important task that helps us to better understand and manage the effects of ozone on our planet. Machine learning is a powerful tool for predicting ozone levels in the atmosphere: a model trained on historical data can predict future ozone levels from factors such as temperature, wind speed, and emissions. These predictions can then be used to inform decision making and mitigate the negative effects of excessive ozone levels. Ozone starts off as an invisible pollutant; when not properly monitored, it combines with other contaminants to cause many health problems [<xref ref-type="bibr" rid="scirp.126207-ref1">1</xref>]. Ozone is one of the most dangerous air pollutants. For the past several decades, researchers have been examining how ozone affects human health. El Paso, Texas, has been recorded as the city most affected by ozone across the United States. Three oxygen atoms make up the gas molecule known as ozone (O<sub>3</sub>). Ozone, also known as “smog”, is dangerous to breathe. 
By chemically interacting with lung tissue, ozone actively damages it.</p></sec><sec id="s2"><title>2. Literature Review</title><p>Three oxygen atoms make up the gas molecule known as O<sub>3</sub> (see <xref ref-type="fig" rid="fig1">Figure 1</xref>). Another name for ozone (O<sub>3</sub>) is “smog”, which is very dangerous when inhaled. Ozone becomes very harmful when it chemically interacts with lung tissue, causing severe damage. <xref ref-type="fig" rid="fig1">Figure 1</xref> illustrates ozone molecules.</p><sec id="s2_1"><title>2.1. Formation of Ozone (O<sub>3</sub>)</title><p>The processes that produce ozone also produce other dangerous pollutants. Although we are protected from the majority of the sun’s UV radiation by the ozone layer, which is located high in the stratosphere (i.e., the upper atmosphere), O<sub>3</sub> air pollution poses major health risks when it is present at ground level (i.e., within the troposphere), where we may breathe it. Nitrogen oxides (NOx) and volatile organic compounds (VOCs) are the two main raw materials that produce ozone. In addition, the burning of fossil fuels such as gasoline, oil, or coal and the evaporation of certain chemicals such as solvents contribute to ozone production. Power plants, automobiles, and other high-heat combustion sources all emit NOx, whereas vehicles, chemical plants, refineries, factories, petrol stations, paint, and other sources all release VOCs [<xref ref-type="bibr" rid="scirp.126207-ref1">1</xref>]. <xref ref-type="fig" rid="fig2">Figure 2</xref> shows the reaction pattern that leads to ozone formation.</p></sec><sec id="s2_2"><title>2.2. Risk of Ozone Exposure</title><p>Anyone who spends time outside in an area with high levels of ozone pollution could be in danger. 
The effects of inhaling ozone are particularly harmful to five groups of people:</p><p>o Children and teenagers [<xref ref-type="bibr" rid="scirp.126207-ref2">2</xref>].</p><p>o Everyone above the age of 65 [<xref ref-type="bibr" rid="scirp.126207-ref2">2</xref>].</p><p>o Those who already have lung conditions, including asthma and chronic obstructive pulmonary disease (COPD), which also encompasses emphysema and chronic bronchitis [<xref ref-type="bibr" rid="scirp.126207-ref2">2</xref>].</p><p>o Those who work or exercise outside [<xref ref-type="bibr" rid="scirp.126207-ref2">2</xref>].</p><p>o People living with obesity [<xref ref-type="bibr" rid="scirp.126207-ref2">2</xref>].</p></sec><sec id="s2_3"><title>2.3. Implications of Ozone Exposure</title><p>People with allergies may respond more strongly to allergens after inhaling ozone. According to a study published in 2009, children were more likely to experience hay fever and respiratory allergies when ozone and PM2.5 levels were high [<xref ref-type="bibr" rid="scirp.126207-ref3">3</xref>].</p><sec id="s2_3_1"><title>2.3.1. Premature Death</title><p>Exposure to ozone may shorten one’s life. Several studies carried out in cities across the U.S., Europe, and Asia show that ozone has a devastating effect on people’s health and life span. Over time, researchers have discovered that exposure to increasing ozone levels raises the chance of premature death [<xref ref-type="bibr" rid="scirp.126207-ref4">4</xref>]. According to more recent research, ozone raises the chance of premature mortality even when other pollutants are also present [<xref ref-type="bibr" rid="scirp.126207-ref1">1</xref>].</p></sec><sec id="s2_3_2"><title>2.3.2. 
Inhalation Problems</title><p>In major counties across the United States, such as El Paso, Texas, ozone levels increase over the summer, leading to an increase in health problems [<xref ref-type="bibr" rid="scirp.126207-ref5">5</xref>]. In addition to a higher risk of premature mortality, ozone exposure causes inhalation problems such as wheezing, coughing, and shortness of breath; triggers asthma episodes; increases the need for hospitalization and medical care for persons with lung disorders, including asthma and chronic obstructive pulmonary disease (COPD); and raises the risk of respiratory infections and the susceptibility to pulmonary inflammation [<xref ref-type="bibr" rid="scirp.126207-ref2">2</xref>].</p></sec><sec id="s2_3_3"><title>2.3.3. Risk from Long-Term Exposure</title><p>Recent research alerts us to the negative consequences of prolonged exposure to ozone. Scientists are discovering that prolonged exposure (i.e., exposure lasting more than 8 hours, or days, months, or years) increases the chance of premature mortality. Researchers have discovered that high levels of ozone are linked to an increased risk of respiratory disease, which leads to a high mortality rate [<xref ref-type="bibr" rid="scirp.126207-ref4">4</xref>]. Also, New York researchers examined hospital data for pediatric asthma patients and discovered that exposure to ozone over an extended period increased the probability of hospital admission. Recent studies show that children from low-income households were more likely to be hospitalized due to high levels of ozone exposure than children from high-income households [<xref ref-type="bibr" rid="scirp.126207-ref6">6</xref>].</p></sec><sec id="s2_3_4"><title>2.3.4. US Environmental Protection Agency (EPA) Findings</title><p>In February 2013, the EPA published a comprehensive review of its most recent findings on ozone pollution [<xref ref-type="bibr" rid="scirp.126207-ref7">7</xref>]. 
The EPA had asked the “Clean Air Scientific Advisory Committee”, a group of distinguished scientists, to assist in evaluating the evidence that the EPA had gathered; in particular, they looked at research published between 2006 and 2012. The EPA and the committee’s experts concluded that ozone pollution posed numerous, substantial health risks. Based on that evaluation, in 2015 the EPA firmly supported the “National Ambient Air Quality Standard” (i.e., the official acceptable ozone limit). However, recent studies show that ozone can be dangerous even at much lower concentrations. In a scientific paper published in 2017, researchers presented additional evidence that older adults face a higher risk of premature death even at ozone levels below the national acceptable level [<xref ref-type="bibr" rid="scirp.126207-ref8">8</xref>].</p></sec></sec><sec id="s2_4"><title>2.4. Features or Variable Types</title><p>According to [<xref ref-type="bibr" rid="scirp.126207-ref9">9</xref>], predictor variables are also known as “PIE (predictor, independent or explanatory) variables”, while response variables are also termed “DORT (dependent, observatory, response or target) variables”. Selecting features (variables) by importance enables the ML algorithm to train faster and reduces the cost and time required for training on the dataset, while making the model simpler to interpret. It also reduces the variance of the model and improves the accuracy, provided the right subset is chosen [<xref ref-type="bibr" rid="scirp.126207-ref9">9</xref>].</p><p><bold>Odds Ratio</bold></p><p>Generally, the magnitude of the odds ratio is called the “strength of the association”. The further away an odds ratio is from 1, the more likely it is that the relationship between the exposure and the disease is causal. 
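</p><p>As a rough illustration of how an odds ratio is read from a fitted logistic regression, the short Python sketch below converts a coefficient (a change in log-odds) into an odds ratio by exponentiation. The function name and the example coefficients are hypothetical and are not taken from this study, which was carried out in R.</p><preformat preformat-type="code">
```python
import math

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a predictor, given its
    logistic regression coefficient (a change in log-odds)."""
    return math.exp(coefficient)

# Hypothetical coefficients, chosen only for illustration.
for beta in (0.0, math.log(1.25), math.log(9.5)):
    print(round(odds_ratio(beta), 2))
```
</preformat><p>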
For instance, an odds ratio of 1.25 is above 1 but does not indicate a strong association, whereas an odds ratio above 9.5 suggests a much stronger association [<xref ref-type="bibr" rid="scirp.126207-ref9">9</xref>].</p></sec><sec id="s2_5"><title>2.5. Selection of Logistic Regression and Artificial Neural Network Model</title><p>The choice between LR and ANN models depends on the specific problem, dataset, and desired outcome. LR is suitable for simpler tasks and when interpretability is crucial, while ANN models excel in more complex problems where high accuracy is the priority.</p><sec id="s2_5_1"><title>2.5.1. Advantages of LR and ANN</title><p>The Logistic Regression model is straightforward and interpretable. It is easy to understand and implement, making it a good choice for simple classification problems [<xref ref-type="bibr" rid="scirp.126207-ref10">10</xref>]. Training an LR model is computationally efficient compared to complex ANN models, and it can handle large datasets with relative ease [<xref ref-type="bibr" rid="scirp.126207-ref11">11</xref>]. LR provides meaningful insights into the impact of each feature on the predicted outcome. It assigns weights to features, indicating their importance in the decision-making process [<xref ref-type="bibr" rid="scirp.126207-ref12">12</xref>].</p><p>Artificial Neural Networks (ANNs) can model complex and nonlinear relationships between features and the target (DORT) variable. They can learn intricate patterns that may be difficult for LR models to capture. ANN models can automatically extract relevant features from raw data, reducing the need for manual feature engineering [<xref ref-type="bibr" rid="scirp.126207-ref13">13</xref>]. 
ANN models, especially deep learning models, have achieved state-of-the-art performance on various tasks, including image and speech recognition, natural language processing, and recommendation systems [<xref ref-type="bibr" rid="scirp.126207-ref14">14</xref>].</p></sec><sec id="s2_5_2"><title>2.5.2. Disadvantages of LR and ANN</title><p>The Logistic Regression model assumes a linear relationship between the features and the log-odds of the target variable. It may struggle to capture complex patterns and nonlinear relationships in the data [<xref ref-type="bibr" rid="scirp.126207-ref15">15</xref>]. Logistic Regression relies heavily on manual feature engineering, so choosing relevant features and transforming them appropriately is crucial for its performance. Although LR performs well in certain scenarios, it may underperform when faced with highly complex datasets or problems that require high predictive accuracy [<xref ref-type="bibr" rid="scirp.126207-ref16">16</xref>].</p><p>Artificial Neural Network models, especially deep neural networks, require significant computational resources, can be time-consuming to train, and often require specialized hardware such as GPUs [<xref ref-type="bibr" rid="scirp.126207-ref17">17</xref>]. ANN models can be challenging to interpret: understanding how the model arrives at its predictions can be difficult, making them less transparent than LR models [<xref ref-type="bibr" rid="scirp.126207-ref18">18</xref>]. In addition, ANN models are prone to overfitting, especially when working with limited training data [<xref ref-type="bibr" rid="scirp.126207-ref19">19</xref>]. Regularization techniques and careful hyperparameter tuning are necessary to mitigate this risk [<xref ref-type="bibr" rid="scirp.126207-ref5">5</xref>].</p></sec></sec><sec id="s2_6"><title>2.6. 
Factors that Influence Accuracy of LR and ANN</title><p>The following factors should be considered and carefully optimized to achieve the best classification accuracy for LR and ANN models.</p><p>o Dataset quality and size: The quality and size of the dataset used for training and evaluation play a crucial role. A larger dataset with a diverse range of samples can help both LR and ANN models generalize better and achieve higher accuracy [<xref ref-type="bibr" rid="scirp.126207-ref18">18</xref>].</p><p>o Feature selection and engineering: The choice and preparation of input features can significantly affect model performance. Proper feature selection and engineering can improve the discriminative power of the features and lead to better accuracy for both LR and ANN models [<xref ref-type="bibr" rid="scirp.126207-ref20">20</xref>].</p><p>o Model complexity: The complexity of the model can impact classification accuracy. LR assumes a linear relationship, while ANN models, especially deep neural networks, can capture complex nonlinear relationships [<xref ref-type="bibr" rid="scirp.126207-ref17">17</xref>]. In general, more complex models like ANNs have the potential to achieve higher accuracy, but they are also more prone to overfitting [<xref ref-type="bibr" rid="scirp.126207-ref19">19</xref>].</p><p>o Regularization techniques: Regularization methods, such as L1 or L2 regularization, can help prevent overfitting in both LR and ANN models. Regularization adds a penalty term to the model’s objective function, discouraging overly complex models and improving generalization [<xref ref-type="bibr" rid="scirp.126207-ref5">5</xref>].</p><p>o Hyperparameter tuning: Both LR and ANN models have various hyperparameters that need to be tuned for optimal performance. Examples include learning rate, regularization strength, number of hidden layers, and number of neurons. 
Proper hyperparameter tuning can significantly affect classification accuracy [<xref ref-type="bibr" rid="scirp.126207-ref18">18</xref>].</p><p>o Training duration and convergence: The duration and convergence of the training process can impact final accuracy. Training for too few iterations may result in underfitting, while training for too many iterations may lead to overfitting [<xref ref-type="bibr" rid="scirp.126207-ref19">19</xref>]. Finding the right balance and ensuring convergence is essential for achieving high accuracy.</p><p>o Class imbalance: Class imbalance occurs when one class has significantly more or fewer samples than others. This can affect the model’s ability to accurately predict the minority class. Techniques like oversampling, undersampling, or class weighting can help address class imbalance and improve accuracy [<xref ref-type="bibr" rid="scirp.126207-ref21">21</xref>].</p><p>o Preprocessing and normalization: Proper preprocessing steps, such as handling missing values, scaling features, and handling outliers, can impact the accuracy of both LR and ANN models. Different preprocessing techniques may be more suitable for different models, and their proper application can enhance accuracy [<xref ref-type="bibr" rid="scirp.126207-ref22">22</xref>].</p><p>o Model evaluation and validation: The choice of appropriate evaluation metrics and validation techniques can affect the reported accuracy. Metrics such as accuracy, precision, recall, and F1 score provide different perspectives on model performance, and using appropriate validation methods like cross-validation can give a more reliable estimate of the model’s accuracy [<xref ref-type="bibr" rid="scirp.126207-ref23">23</xref>].</p><p>o Computational resources: ANN models, especially deep learning models, can be computationally intensive and may require specialized hardware, such as GPUs, for efficient training. 
The availability of computational resources can impact the size and complexity of the ANN models used, which can, in turn, affect their accuracy [<xref ref-type="bibr" rid="scirp.126207-ref17">17</xref>].</p></sec></sec><sec id="s3"><title>3. Methodology</title><p>The aim of this research is to examine the ability of Logistic Regression and Artificial Neural Network models to correctly classify ozone levels into high and low categories, given the other predictor variables. The dataset consists of 973 rows and 14 variables (features). Among these variables, ozone was selected as the response (dependent) variable, with “low ozone” assigned as 0 and “high ozone” assigned as 1. Therefore, we are dealing with a binary classification problem. The dataset was analyzed using the R programming language. The first step in the analysis involved checking the variable types, identifying missing values, outliers, and potentially incorrect records, and conducting exploratory data analysis (EDA), including the frequency distribution of the target variable and the association between the variables.</p><sec id="s3_1"><title>3.1. Descriptive Analysis of the Dataset</title><p>The dataset contains 14 variables (features). “Ozone” is the target variable, otherwise known as DORT or Y (dependent, observatory, response, or target variable), while the remaining 13 variables are the PIE or X (predictor, independent, or explanatory) variables (see <xref ref-type="table" rid="table1">Table 1</xref>).</p></sec><sec id="s3_2"><title>3.2. Data Pre-Processing</title><p>The following steps were adopted during data pre-processing to ensure the accuracy of our dataset. We performed exploratory data analysis on the dataset by cleaning the data and checking for missing values (refer to Section 3.2.1). 
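</p><p>A missing-value check of this kind could be sketched as follows. This is a minimal illustration in plain Python rather than the R code used in the study; the function names, the set of markers treated as missing, and the sample records are assumptions made for the example.</p><preformat preformat-type="code">
```python
import math

# Markers treated as missing in this sketch (an assumption, not the study's list).
MISSING_MARKERS = {"", "na", "NA"}

def is_missing(value):
    """Return True for None, NaN, or a string that is one of the markers."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in MISSING_MARKERS

def missing_summary(rows):
    """Count total and missing entries per column for a list of dict records."""
    summary = {}
    for row in rows:
        for column, value in row.items():
            counts = summary.setdefault(column, {"ncom": 0, "nmiss": 0})
            counts["ncom"] += 1
            if is_missing(value):
                counts["nmiss"] += 1
    return summary

# Hypothetical records; the real dataset has 973 rows and 14 variables.
sample = [
    {"Wind Speed": 4.2, "PM10": 31.0, "Ozone": 1},
    {"Wind Speed": "NA", "PM10": float("nan"), "Ozone": 0},
]
for column, counts in missing_summary(sample).items():
    print(column, counts)
```
</preformat><p>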
Additionally, we applied a for-loop to iterate over the dataset and cross-check for other types of missing values, such as “na”, “NA”, or an empty string (refer to <xref ref-type="fig" rid="fig3">Figure 3</xref>). Based on the output, our dataset showed no missing values.</p><sec id="s3_2_1"><title>3.2.1. Checking for Missing Values</title><p>The anyNA() function was used to check for missing values in our dataset. The outcome was “FALSE”, which implies that we did not have any missing data. In addition, we used the naniar package to visualize any missing data. <xref ref-type="fig" rid="fig4">Figure 4</xref> shows a bar plot confirming that there are no missing values in our dataset.</p></sec><sec id="s3_2_2"><title>3.2.2. Intensive Cross Checking of Other Missing Values</title><p>To ensure that our analysis and model are free from errors, it is very important to loop through the whole dataset to check for missing values that may occur in forms other than “NA”.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>Description of the variables’ data types</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >S/No</th><th align="center" valign="middle" >Variables</th><th align="center" valign="middle" >Data Type</th></tr></thead><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >Nitric Oxide</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >2</td><td align="center" valign="middle" >Nitrogen Dioxide</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >3</td><td align="center" valign="middle" >Oxides of Nitrogen</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >4</td><td align="center" valign="middle" >Wind Speed</td><td align="center" 
valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >5</td><td align="center" valign="middle" >Resultant Wind Speed</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >6</td><td align="center" valign="middle" >Resultant Wind Direction</td><td align="center" valign="middle" >Integer</td></tr><tr><td align="center" valign="middle" >7</td><td align="center" valign="middle" >Maximum Wind Gust</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >8</td><td align="center" valign="middle" >Std. Dev. Wind Direction</td><td align="center" valign="middle" >Integer</td></tr><tr><td align="center" valign="middle" >9</td><td align="center" valign="middle" >Outdoor Temperature</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >10</td><td align="center" valign="middle" >Dew Point Temperature</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >11</td><td align="center" valign="middle" >Relative Humidity</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >12</td><td align="center" valign="middle" >Solar Radiation</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >13</td><td align="center" valign="middle" >PM10</td><td align="center" valign="middle" >Numeric</td></tr><tr><td align="center" valign="middle" >14</td><td align="center" valign="middle" >Ozone</td><td align="center" valign="middle" >Integer</td></tr></tbody></table></table-wrap><p><xref ref-type="table" rid="table2">Table 2</xref> shows V.name or the variable names, Mode (data types), N. level (number of occurrences out of the total observations), Ncom (number of total observations), Nmiss (number of missing observations), and Miss. Prop (percentage of missing observations).</p></sec><sec id="s3_2_3"><title>3.2.3. 
Frequency Distribution of Target Variable (Ozone)</title><p>Analyzing the ozone level, transformed the ozone level from binary-numerical to</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Iteration through the dataset using for loop to check for other missing values</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Col.num</th><th align="center" valign="middle" >V. name</th><th align="center" valign="middle" >Mode</th><th align="center" valign="middle" >N.level</th><th align="center" valign="middle" >ncom</th><th align="center" valign="middle" >nmiss</th><th align="center" valign="middle" >Miss.prop</th></tr></thead><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >Nitric Oxide</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >82</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >2</td><td align="center" valign="middle" >Nitrogen Dioxide</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >228</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >3</td><td align="center" valign="middle" >Oxides of Nitrogen</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >238</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >4</td><td align="center" valign="middle" >Wind Speed</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >146</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" 
valign="middle" >0</td></tr><tr><td align="center" valign="middle" >5</td><td align="center" valign="middle" >Resultant Wind Speed</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >153</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >6</td><td align="center" valign="middle" >Resultant Wind Direction</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >270</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >7</td><td align="center" valign="middle" >Maximum Wind Gust</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >242</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >8</td><td align="center" valign="middle" >Std. Dev. 
Wind Direction</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >68</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >9</td><td align="center" valign="middle" >Outdoor Temperature</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >367</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >10</td><td align="center" valign="middle" >Dew Point Temperature</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >442</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >11</td><td align="center" valign="middle" >Relative Humidity</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >459</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >12</td><td align="center" valign="middle" >Solar Radiation</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >549</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >13</td><td align="center" valign="middle" >PM10</td><td align="center" valign="middle" >numeric</td><td align="center" valign="middle" >451</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >14</td><td align="center" valign="middle" >Ozone</td><td align="center" 
valign="middle" >numeric</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" >973</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr></tbody></table></table-wrap><p>binary-categorical such that values of 1 are labeled “high level” while values of 0 are labeled “low level”. The frequency distribution plot shows that days with low ozone levels occur more frequently than days with high ozone levels. The difference between the two frequencies is not large, however, so the two classes are reasonably balanced (see <xref ref-type="fig" rid="fig5">Figure 5</xref>).</p></sec></sec></sec><sec id="s4"><title>4. Results</title><p>The first step toward building a model with high accuracy is to properly investigate data quality, coherence, association, and correlations between the DORT and PIE variables. This in turn enables us to predict high and low ozone levels effectively. Analyzing the output, we observe that all the variables are continuous (quantitative) except for the target variable (ozone), which is binary (categorical).</p><sec id="s4_1"><title>4.1. Data Exploration and Correlation</title><p>It is important to understand the degree of correlation and association between the predictor variables and the target variable (ozone). The computed correlations range from strong negative to strong positive (see <xref ref-type="fig" rid="fig3">Figure 3</xref>). Of the 13 predictor variables, only “solar radiation”, “outdoor temperature” and “std. dev. wind direction” are positively correlated with the target variable (ozone). <xref ref-type="fig" rid="fig6">Figure 6</xref> shows which predictor variables are positively and which are negatively correlated with the target variable (ozone).</p></sec><sec id="s4_2"><title>4.2. 
Box Plots of Predictor Variables</title><p><xref ref-type="fig" rid="fig7">Figure 7</xref> and <xref ref-type="fig" rid="fig8">Figure 8</xref> show the box plots of the predictor variables that are positively and negatively correlated with the target variable (ozone). In general, the negatively correlated predictor variables have outliers.</p><sec id="s4_2_1"><title>4.2.1. Box Plots of Ozone and Selected Predictor Variables</title><p>The box plots show how the selected predictor features (i.e., “Solar Radiation”, “Nitric Oxide”, “Nitrogen Dioxide” and “PM10”) differ between days with high and low ozone levels (see Figures 9-12). In most of these distributions, the “high ozone” group is nearly normal, whereas the “low ozone” group is negatively skewed (see Figures 13-16). The histograms of “Solar Radiation”, “Nitric Oxide”, “Nitrogen Dioxide” and “PM10” show right-skewed distributions (see Figures 13-16), which already suggests that these variables are not normally distributed. Aside from Solar Radiation, the selected variables also show outliers.</p><p>Since some of the box plots for the selected predictor variables show outliers and skewness, we proceed to test for normality using the Anderson-Darling and Shapiro-Wilk tests (see <xref ref-type="table" rid="table3">Table 3</xref>).</p></sec><sec id="s4_2_2"><title>4.2.2. Normality Test and Wilcoxon Test</title><p>The normality tests show that the p-values of the Anderson-Darling and Shapiro-Wilk tests are less than the significance level (0.05), which signifies that the distributions are not normal. 
Since the normality assumption of the t-test is violated, we apply the Wilcoxon rank sum test (a non-parametric alternative) to examine the association of Solar Radiation, Nitric Oxide, Nitrogen Dioxide and PM10 with ozone level (see <xref ref-type="table" rid="table3">Table 3</xref>).</p><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Association between the target variable (ozone) and selected predictor variables</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Solar Radiation vs Ozone Level</th><th align="center" valign="middle" >Nitric Oxide vs Ozone Level</th><th align="center" valign="middle" >Nitrogen Dioxide vs Ozone Level</th><th align="center" valign="middle" >PM10 vs Ozone Level</th></tr></thead><tr><td align="center" valign="middle" >Anderson-Darling normality test</td><td align="center" valign="middle" >A = 54.466, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >A = 210.08, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >A = 76.434, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >A = 63.252, p-value &lt; 2.2e−16</td></tr><tr><td align="center" valign="middle" >Shapiro-Wilk normality test</td><td align="center" valign="middle" >W = 0.84817, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >W = 0.31033, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >W = 0.7364, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >W = 0.64781, p-value &lt; 2.2e−16</td></tr><tr><td align="center" valign="middle" >Wilcoxon rank sum test</td><td align="center" valign="middle" >W = 177795, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >W = 61823, p-value &lt; 2.2e−16</td><td align="center" valign="middle" >W = 84314, p-value = 9.622e−11</td><td align="center" valign="middle" >W = 100056, p-value = 0.005453</td></tr><tr><td align="center" valign="middle" >Comments</td><td align="center" 
valign="middle"  colspan="4"  >The p-values of the Wilcoxon rank sum tests above are lower than the alpha value (0.05), indicating that there is a significant relationship between ozone level and each of Solar Radiation, Nitric Oxide, Nitrogen Dioxide and PM10.</td></tr><tr><td align="center" valign="middle"  colspan="5"  >Alternative hypothesis (H<sub>A</sub>): true location shift is not equal to 0</td></tr></tbody></table></table-wrap></sec></sec><sec id="s4_3"><title>4.3. Data Splitting</title><p>The dataset was partitioned into two parts with a ratio of 2:1: the training data (D<sub>1</sub>) holds 67% of the observations and the test data (D<sub>2</sub>) holds 33%. Logistic regression was applied to the training data to build a predictive model. First, we adopted lasso (L<sub>1</sub>) regularization and used cross-validation to select the tuning parameter (λ) of the penalty. The logistic regression model was then fitted with lasso regularization using our training data, D<sub>1</sub>. The lasso method was chosen because our aim is to build a parsimonious model that properly explains our target (ozone) feature.</p><sec id="s4_3_1"><title>4.3.1. MSE and Tuning Parameter</title><p>The best lambda for regularizing our model was evaluated using the MSE and the misclassification rate. The first six rows of lambda values, with their misclassification rates and mean square errors, are shown in <xref ref-type="table" rid="table4">Table 4</xref>. Using MSE (Mean Square Error) as the evaluation metric, our best lambda (tuning parameter) is 0.0026.</p></sec><sec id="s4_3_2"><title>4.3.2. Mean Square Error vs Lambda</title><p>The plot indicates that as the tuning parameter increases, the Mean Square Error increases as well. Therefore, it is important to keep lambda small to obtain a low MSE. However, at a lambda of 0.3, the MSE becomes constant (see <xref ref-type="fig" rid="fig17">Figure 17</xref>).</p></sec></sec><sec id="s4_4"><title>4.4. 
Model Fitting and Odds Ratio of LR Model</title><p>Having obtained the best lambda, we fit the final lasso logistic regression model with the training and validation data pooled together. The important features are listed in <xref ref-type="table" rid="table5">Table 5</xref>. With the best lambda, the coefficients of “Nitrogen Dioxide” and “Resultant Wind Speed” are shrunk to zero (shown as “.”), so these are the unimportant variables in our LR model (see <xref ref-type="table" rid="table5">Table 5</xref>).</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Lambda values, misclassification rates and MSE</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Lambda</th><th align="center" valign="middle" >Misclassification rate</th><th align="center" valign="middle" >MSE</th></tr></thead><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.00010</td><td align="center" valign="middle" >0.355</td><td align="center" valign="middle" >0.2500</td></tr><tr><td align="center" valign="middle" >2</td><td align="center" valign="middle" >0.00261</td><td align="center" valign="middle" >0.126</td><td align="center" valign="middle" >0.0891</td></tr><tr><td align="center" valign="middle" >3</td><td align="center" valign="middle" >0.00512</td><td align="center" valign="middle" >0.135</td><td align="center" valign="middle" >0.0930</td></tr><tr><td align="center" valign="middle" >4</td><td align="center" valign="middle" >0.00764</td><td align="center" valign="middle" >0.138</td><td align="center" 
valign="middle" >0.0982</td></tr><tr><td align="center" valign="middle" >5</td><td align="center" valign="middle" >0.01015</td><td align="center" valign="middle" >0.145</td><td align="center" valign="middle" >0.1017</td></tr><tr><td align="center" valign="middle" >6</td><td align="center" valign="middle" >0.01266</td><td align="center" valign="middle" >0.145</td><td align="center" valign="middle" >0.1041</td></tr></tbody></table></table-wrap><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> Coefficients of important predictors using the lasso (L<sub>1</sub>) logistic regression model</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Variables</th><th align="center" valign="middle" >Coefficients</th></tr></thead><tr><td align="center" valign="middle" >Nitric Oxide</td><td align="center" valign="middle" >−1.98759</td></tr><tr><td align="center" valign="middle" >Nitrogen Dioxide</td><td align="center" valign="middle" >.</td></tr><tr><td align="center" valign="middle" >Oxides of Nitrogen</td><td align="center" valign="middle" >−0.00638</td></tr><tr><td align="center" valign="middle" >Wind Speed</td><td align="center" valign="middle" >−0.25833</td></tr><tr><td align="center" valign="middle" >Resultant Wind Speed</td><td align="center" valign="middle" >.</td></tr><tr><td align="center" valign="middle" >Resultant Wind Direction</td><td align="center" valign="middle" >−0.00391</td></tr><tr><td align="center" valign="middle" >Maximum Wind Gust</td><td align="center" valign="middle" >−0.01733</td></tr><tr><td align="center" valign="middle" >Std. Dev. 
Wind Direction</td><td align="center" valign="middle" >0.06886</td></tr><tr><td align="center" valign="middle" >Outdoor Temperature</td><td align="center" valign="middle" >−0.00652</td></tr><tr><td align="center" valign="middle" >Dew Point Temperature</td><td align="center" valign="middle" >0.04620</td></tr><tr><td align="center" valign="middle" >Relative Humidity</td><td align="center" valign="middle" >−0.13025</td></tr><tr><td align="center" valign="middle" >Solar Radiation</td><td align="center" valign="middle" >1.23665</td></tr><tr><td align="center" valign="middle" >PM10</td><td align="center" valign="middle" >0.01135</td></tr></tbody></table></table-wrap><p>The negative coefficient of “Nitric Oxide” indicates that a unit increase in “Nitric Oxide” multiplies the odds of a high ozone level by a number less than 1, which increases the probability of the output being labeled as low ozone level (0). In addition, the positive coefficient of “Solar Radiation” indicates that a unit increase in “Solar Radiation” multiplies the odds by a number greater than 1, which increases the probability of the output being labeled as high ozone level (1). We then apply the best-fitted model to our test data. <xref ref-type="table" rid="table6">Table 6</xref> presents the odds ratios of the important predictor variables based on the best-fit model.</p><sec id="s4_4_1"><title>4.4.1. Logistic Regression Model Evaluation</title><p>From <xref ref-type="table" rid="table7">Table 7</xref>, the AUC of 0.952 indicates that our fitted model discriminates well between high and low ozone levels: it ranks a randomly chosen high-ozone day above a randomly chosen low-ozone day about 95% of the time. The 95% confidence interval for the AUC is (0.929, 0.975). From <xref ref-type="table" rid="table7">Table 7</xref>, we also obtained an MSE value of 0.0833, which generally indicates good performance for our model. 
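</p><p>The confidence interval quoted for the AUC can be checked against the cvAUC and SE values in <xref ref-type="table" rid="table7">Table 7</xref>. The following is a minimal Python sketch, assuming the interval is a symmetric normal-approximation interval (cvAUC ± z × SE); that assumption and the function name normal_ci are ours and are not taken from the original analysis. Under this assumption, the sketch reproduces the reported interval.</p><preformat>
```python
from statistics import NormalDist


def normal_ci(estimate, std_error, confidence=0.95):
    """Two-sided normal-approximation interval: estimate +/- z * SE."""
    # z is the standard normal quantile for the chosen confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * std_error, estimate + z * std_error


# cvAUC and SE as reported for the LR model in Table 7
lower, upper = normal_ci(0.952, 0.0115)
print(f"({lower:.3f}, {upper:.3f})")  # prints (0.929, 0.975)
```
</preformat><p>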
We further computed the misclassification rate, since MSE is not the best evaluation metric for a logistic regression model. The misclassification rate is 0.116, which means that our model correctly classifies ozone levels as high or low at a rate of 88.4%, suggesting that our model performs well.</p></sec><sec id="s4_4_2"><title>4.4.2. Receiver Operating Characteristic (ROC) Curve for LR Model</title><p>ROC stands for Receiver Operating Characteristic. The ROC curve is a graphical representation used in evaluating the performance of binary classification models. It illustrates the relationship between the hit rate (also known as sensitivity or true positive rate) and the false alarm rate (also known as the false positive rate). The hit rate refers to the proportion of correctly identified positive instances (true positives) out of all actual positive instances. It represents the model’s ability to correctly classify positive cases. On the other hand, the false alarm rate represents the proportion of negative instances incorrectly identified as positive (false positives) out of all actual negative instances.</p><p>The ROC curve plots the hit rate on the y-axis and the false alarm rate on the x-axis. It shows how the trade-off between these two rates changes as the classification threshold of the model varies. The threshold determines the point at which the model classifies instances as positive or negative based on the predicted probabilities or scores. Ideally, a good classification model would achieve a high hit rate and a low false alarm rate, resulting in a curve that hugs the upper-left corner of the ROC space. 
The closer the curve is to this corner, the better the model’s performance. The diagonal line from (0, 0) to (1, 1) represents the performance of a random classifier.</p><p>In addition to the ROC curve itself, the area under the curve (AUC) is often calculated to provide a single metric summarizing the model’s performance. The AUC represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.</p><table-wrap id="table6" ><label><xref ref-type="table" rid="table6">Table 6</xref></label><caption><title> Odds ratio of important predictor variables based on the best fit model</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  ></th><th align="center" valign="middle"  rowspan="2"  >Odds Ratio</th><th align="center" valign="middle"  rowspan="2"  >Implications</th><th align="center" valign="middle"  colspan="2"  >Target Variable (Ozone)</th></tr></thead><tr><td align="center" valign="middle" >Low (0)</td><td align="center" valign="middle" >High (1)</td></tr><tr><td align="center" valign="middle" >Nitric Oxide</td><td align="center" valign="middle" >0.16</td><td align="center" valign="middle" >Nitric Oxide might not be a protective factor for high ozone level</td><td align="center" valign="middle" >✓</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >Oxides of Nitrogen</td><td align="center" valign="middle" >0.981</td><td align="center" valign="middle" >Oxides of Nitrogen might lead to high ozone level subsequently</td><td align="center" valign="middle" >✓</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >Wind Speed</td><td align="center" valign="middle" >0.733</td><td align="center" valign="middle" >Wind speed might lead to high ozone level subsequently</td><td align="center" valign="middle" >✓</td><td align="center" valign="middle" ></td></tr><tr><td align="center" 
valign="middle" >Resultant Wind Direction</td><td align="center" valign="middle" >0.997</td><td align="center" valign="middle" >Resultant wind direction might lead to high ozone level subsequently</td><td align="center" valign="middle" >✓</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >Maximum Wind Gust</td><td align="center" valign="middle" >0.976</td><td align="center" valign="middle" >Maximum wind gust might lead to high ozone level subsequently</td><td align="center" valign="middle" >✓</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >Std. Dev. Wind Direction</td><td align="center" valign="middle" >1.06</td><td align="center" valign="middle" >Std. Dev. wind direction is a risk factor for high ozone level</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >✓</td></tr><tr><td align="center" valign="middle" >Outdoor Temperature</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >Outdoor temperature is a risk factor for high ozone level</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >✓</td></tr><tr><td align="center" valign="middle" >Dew Point Temperature</td><td align="center" valign="middle" >1.04</td><td align="center" valign="middle" >Dew point temperature is a risk factor for high ozone level</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >✓</td></tr><tr><td align="center" valign="middle" >Relative Humidity</td><td align="center" valign="middle" >0.885</td><td align="center" valign="middle" >Relative humidity might lead to high ozone level subsequently</td><td align="center" valign="middle" >✓</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >Solar Radiation</td><td align="center" valign="middle" >4.19</td><td align="center" valign="middle" >Solar radiation is certainly a major risk factor for high ozone level</td><td 
align="center" valign="middle" ></td><td align="center" valign="middle" >✓</td></tr><tr><td align="center" valign="middle" >PM10</td><td align="center" valign="middle" >1.01</td><td align="center" valign="middle" >Particulate matter 10 is a risk factor for high ozone level</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >✓</td></tr></tbody></table></table-wrap><table-wrap id="table7" ><label><xref ref-type="table" rid="table7">Table 7</xref></label><caption><title> Results of the LR model evaluation</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Misclassification rate</th><th align="center" valign="middle" >MSE</th><th align="center" valign="middle" >cvAUC</th><th align="center" valign="middle" >SE</th><th align="center" valign="middle" >CI</th><th align="center" valign="middle" >Confidence</th></tr></thead><tr><td align="center" valign="middle" >0.116</td><td align="center" valign="middle" >0.0833</td><td align="center" valign="middle" >0.952</td><td align="center" valign="middle" >0.0115</td><td align="center" valign="middle" >0.929, 0.975</td><td align="center" valign="middle" >0.95</td></tr></tbody></table></table-wrap><p>A higher AUC value indicates better discrimination power of the model. Our model has very high discriminatory power for distinguishing high from low ozone levels on any given day. <xref ref-type="fig" rid="fig18">Figure 18</xref> shows the ROC curve (i.e., the trade-off between sensitivity (TPR) and the false positive rate [1 – specificity]). It further indicates that the model performs better than the random-classifier benchmark (50%), with an area under the curve of 0.952 (95.2%).</p></sec><sec id="s4_4_3"><title>4.4.3. Performance of LR Model using Confusion Matrix</title><p>Metrics such as accuracy, precision (positive predictive value), recall (sensitivity) and F1 score provide different perspectives on model performance. The confusion matrix also helps in the interpretation of model performance. 
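</p><p>These metrics all derive from the four cells of a binary confusion matrix, with high ozone treated as the positive class. A minimal Python sketch follows. The counts passed in are illustrative values chosen by us to give rates close to those reported for the LR model; the raw confusion-matrix counts are not listed in this paper, and the function name confusion_metrics is hypothetical.</p><preformat>
```python
def confusion_metrics(tp, fn, fp, tn):
    """Return standard rates derived from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)   # positive predictive value
    recall = tp / (tp + fn)      # sensitivity, true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "recall": recall,
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative counts only (an assumption, not data from the paper)
metrics = confusion_metrics(tp=99, fn=14, fp=23, tn=182)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
</preformat><p>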
The sensitivity or recall (TP rate) of 0.8761 (87.6%) indicates that the model detects a high percentage of the high-ozone days. The specificity (TN rate) of 0.8878 (88.8%), which is relatively high, indicates that the model also detects a high percentage of the low-ozone days. Overall, our fitted model has an accuracy of 88.4% and a precision of 81.2%, which implies that our model has a low FP rate. The confusion matrix statistics and other prediction parameters for the logistic regression model are shown in <xref ref-type="table" rid="table8">Table 8</xref>.</p><table-wrap id="table8" ><label><xref ref-type="table" rid="table8">Table 8</xref></label><caption><title> Confusion matrix and other statistical prediction parameters for LR</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  colspan="2"  >Confusion Matrix and Statistics for LR</th></tr></thead><tr><td align="center" valign="middle" >Accuracy</td><td align="center" valign="middle" >0.884</td></tr><tr><td align="center" valign="middle" >95% CI</td><td align="center" valign="middle" >(0.843, 0.917)</td></tr><tr><td align="center" valign="middle" >Sensitivity/Recall</td><td align="center" valign="middle" >0.876</td></tr><tr><td align="center" valign="middle" >Specificity (True Negative Rate/TNR)</td><td align="center" valign="middle" >0.888</td></tr><tr><td align="center" valign="middle" >Pos Pred Value/Precision</td><td align="center" valign="middle" >0.811</td></tr><tr><td align="center" valign="middle" >F1 Score</td><td align="center" valign="middle" >0.843</td></tr><tr><td align="center" valign="middle" >Prediction Time</td><td align="center" valign="middle" >0.02 secs</td></tr><tr><td align="center" valign="middle" >Binary Cross Entropy</td><td align="center" valign="middle" >6.03</td></tr></tbody></table></table-wrap></sec></sec><sec id="s4_5"><title>4.5. 
Artificial Neural Network (ANN) Model</title><p>To fit an Artificial Neural Network (ANN) model to our training dataset D<sub>1</sub>, it is first necessary to scale the training data and create a data frame that includes the target variable. After scaling, we build our ANN structure, which has 4 hidden layers containing 9, 7, 5, and 3 neurons respectively, together with the input and output layers.</p><sec id="s4_5_1"><title>4.5.1. Scaling and MSE of ANN Model</title><p>After scaling the test data D<sub>2</sub>, we proceeded to predict the target “ozone” variable using our ANN model. Computing the Mean Squared Error (MSE) of our model, we obtained a value of 0.0833, which indicates that the model performed well. However, MSE alone is not an optimal evaluation metric for our model, so we further calculate the misclassification rate and the confusion matrix.</p></sec><sec id="s4_5_2"><title>4.5.2. Misclassification Rate of ANN Model</title><p>The misclassification rate is 0.107, which means that our model correctly classifies high and low ozone levels at a rate of 89.3%. This suggests that our model performs well. Nevertheless, for a better evaluation, we further calculate the AUC and the confusion matrix of our ANN model (see <xref ref-type="table" rid="table9">Table 9</xref>).</p></sec><sec id="s4_5_3"><title>4.5.3. ANN Model Evaluation</title><p>The AUC of 0.954 means that our best-fitted model ranks a randomly chosen high-ozone day above a randomly chosen low-ozone day about 95.4% of the time. The 95% confidence interval indicates that the true AUC falls within (0.929, 0.979).</p></sec><sec id="s4_5_4"><title>4.5.4. ROC Curve</title><p>The ROC curve of the ANN model has an area under the curve of 0.954. The C.I. also indicates that the true AUC falls within the interval (0.929, 0.979). 
Therefore, we are 95% confident that the true AUC falls within this interval (see <xref ref-type="fig" rid="fig19">Figure 19</xref>).</p></sec><sec id="s4_5_5"><title>4.5.5. Performance of ANN Model using Confusion Matrix</title><p>The accuracy of our model is 0.893 (89.3%), which is relatively high, indicating that our model performs well in predicting the ozone level for a day. The sensitivity or recall (TP rate) of 0.802 (80.2%) indicates that the model detects a high percentage of the high-ozone days. The specificity (TN rate) of 0.957 (95.7%), which is relatively high, indicates that the model detects a high percentage of the low-ozone days. Together with the high accuracy, the precision of 92.9% implies that our model has a low False Positive (FP) rate. The confusion matrix statistics and other prediction parameters for the artificial neural network model are shown in <xref ref-type="table" rid="table10">Table 10</xref>.</p><table-wrap id="table9" ><label><xref ref-type="table" rid="table9">Table 9</xref></label><caption><title> Results of the ANN model</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Misclassification rate</th><th align="center" valign="middle" >MSE</th><th align="center" valign="middle" >cvAUC</th><th align="center" valign="middle" >SE</th><th align="center" valign="middle" >CI</th><th align="center" valign="middle" >Confidence</th></tr></thead><tr><td align="center" valign="middle" >0.107</td><td align="center" valign="middle" >0.0833</td><td align="center" valign="middle" >0.954</td><td align="center" valign="middle" >0.0126</td><td align="center" valign="middle" >0.929, 0.979</td><td align="center" valign="middle" >0.95</td></tr></tbody></table></table-wrap><table-wrap id="table10" ><label><xref ref-type="table" rid="table10">Table 10</xref></label><caption><title> Confusion matrix and other statistical prediction parameters for ANN</title></caption><table><tbody><thead><tr><th align="center" 
valign="middle"  colspan="2"  >Confusion Matrix and Statistics</th></tr></thead><tr><td align="center" valign="middle" >Accuracy</td><td align="center" valign="middle" >0.893</td></tr><tr><td align="center" valign="middle" >95% CI</td><td align="center" valign="middle" >(0.854, 0.925)</td></tr><tr><td align="center" valign="middle" >Sensitivity/Recall</td><td align="center" valign="middle" >0.802</td></tr><tr><td align="center" valign="middle" >Specificity (True Negative Rate/TNR)</td><td align="center" valign="middle" >0.957</td></tr><tr><td align="center" valign="middle" >Pos Pred Value/Precision</td><td align="center" valign="middle" >0.929</td></tr><tr><td align="center" valign="middle" >F1 Score</td><td align="center" valign="middle" >0.861</td></tr><tr><td align="center" valign="middle" >Prediction Time</td><td align="center" valign="middle" >0.00 secs</td></tr><tr><td align="center" valign="middle" >Binary Cross Entropy/Log Loss</td><td align="center" valign="middle" >3.74</td></tr></tbody></table></table-wrap></sec></sec><sec id="s4_6"><title>4.6. F1 Score</title><p>The F1 score is a metric commonly used in classification tasks to evaluate the overall performance of a model. It combines both precision and recall into a single value, providing a balanced measure of the model’s accuracy.</p><p><disp-formula><tex-math>\mathrm{F1\ Score} = \frac{2 \times (\mathrm{precision} \times \mathrm{recall})}{\mathrm{precision} + \mathrm{recall}}</tex-math></disp-formula></p><p>The F1 score considers both FP (false positives) and FN (false negatives), making it a useful metric when dealing with imbalanced datasets or when both precision and recall are equally important. When comparing different models or algorithms, a higher F1 score indicates better performance in terms of both precision and recall. Based on our results, ANN performs better than LR, with an F1 score of 0.861 compared with 0.843 for LR.</p></sec></sec><sec id="s5"><title>5. 
Conclusion and Recommendations</title><p>The accuracy of our ANN model is 89.3%, which is relatively high and indicates that the model performs well in predicting the ozone level for a given day. The sensitivity or recall (TP rate) of 80.2% indicates that the model has a high chance of detecting a high ozone level on a particular day, and the specificity (TN rate) of 95.7% indicates that it has a high chance of detecting a low ozone level on a given day. Together with the high accuracy stated above, the precision of 92.9% implies that our model has a low False Positive (FP) rate.</p><p>In addition, across our evaluation metrics, the ANN model performs slightly better than the LR model: it has higher accuracy (89.3% compared to LR’s 88.4%) and higher AUC (95.4% compared to LR’s 95.2%), as well as a lower misclassification rate (10.7% compared to LR’s 11.6%).</p><p>Furthermore, when we consider precision and recall, both models perform well, with high precision and high recall, meaning that they have a high true positive (TP) rate and a low false positive (FP) rate. High sensitivity also implies a low false negative rate, meaning that the models are unlikely to wrongly predict a negative (low ozone level) outcome on a high-ozone day.</p><p>With regard to prediction time, while both models predict very quickly, the ANN model has the lower prediction time. 
The ANN model also has the lower binary cross entropy, indicating that it performed better than the LR model in terms of classification.</p><p>We recommend that subsequent research consider the following points:</p><p>Application of other types of supervised machine learning models, such as the Random Forest Model, Support Vector Machine, K-Nearest Neighbors, Decision Trees, and Na&#239;ve Bayes, for the classification of ozone.</p><p>Other researchers could expand the scope of the paper by using datasets from other regions affected by ozone pollution.</p></sec><sec id="s6"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s7"><title>Cite this paper</title><p>Obunadike, C., Adefabi, A., Olisah, S., Abimbola, D. and Oloyede, K. (2023) Application of Regularized Logistic Regression and Artificial Neural Network Model for Ozone Classification across El Paso County, Texas, United States. Journal of Data Analysis and Information Processing, 11, 217-239. https://doi.org/10.4236/jdaip.2023.113012</p></sec></body><back><ref-list><title>References</title><ref id="scirp.126207-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Di, Q., Wang, Y., Zanobetti, A., Wang, Y., Koutrakis, P., Choirat, C., Dominici, F. and Schwartz, J.D. (2017) Air Pollution and Mortality in the Medicare Population. The New England Journal of Medicine, 376, 2513-2522. https://doi.org/10.1056/NEJMoa1702747</mixed-citation></ref><ref id="scirp.126207-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Lin, S., Liu, X., Le, L.H. and Hwang, S.-A. (2008) Chronic Exposure to Ambient Ozone and Asthma Hospital Admissions among Children. Environmental Health Perspectives, 116, 1725-1730. 
https://doi.org/10.1289/ehp.11184</mixed-citation></ref><ref id="scirp.126207-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Jerrett, M., Burnett, R.T., Pope, C.A., Ito, K., Thurston, G., Krewski, D., Shi, Y., Calle, E. and Thun, M. (2009) Long-Term Ozone Exposure and Mortality. The New England Journal of Medicine, 360, 1085-1095. https://doi.org/10.1056/NEJMoa0803894</mixed-citation></ref><ref id="scirp.126207-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Parker, J.D., Akinbami, L.J. and Woodruff, T.J. (2009) Air Pollution and Childhood Respiratory Allergies in the United States. Environmental Health Perspectives, 117, 140-147. https://doi.org/10.1289/ehp.11497</mixed-citation></ref><ref id="scirp.126207-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Bhuiyan, M.A.M., Sahi, R.K., Islam, M.R. and Mahmud, S. (2021) Machine Learning Techniques Applied to Predict Tropospheric Ozone in a Semi-Arid Climate Region. Mathematics, 9, Article No. 2901. https://doi.org/10.3390/math9222901</mixed-citation></ref><ref id="scirp.126207-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">U.S. EPA. Nonattainment Areas for Criteria Pollutants (Green Book). https://www.epa.gov/green-book</mixed-citation></ref><ref id="scirp.126207-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">U.S. Environmental Protection Agency. Integrated Science Assessment (ISA) for Ozone and Related Photochemical Oxidants. https://www.epa.gov/isa/integrated-science-assessment-isa-ozone-and-related-photochemical-oxidants</mixed-citation></ref><ref id="scirp.126207-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Medina-Ramón, M. and Schwartz, J. (2008) Who Is More Vulnerable to Die from Ozone Air Pollution? Epidemiology, 19, 672-679. 
https://doi.org/10.1097/EDE.0b013e3181773476</mixed-citation></ref><ref id="scirp.126207-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Olufemi, I., Obunadike, C., Adefabi, A. and Abimbola, D. (2023) Application of Logistic Regression Model in Prediction of Early Diabetes across United States. International Journal of Scientific and Management Research, 6, 34-48. https://doi.org/10.37502/IJSMR.2023.6502</mixed-citation></ref><ref id="scirp.126207-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Tran, B., Sudusinghe, C., Nguyen, S. and Alahakoon, D. (2023) Building Interpretable Predictive Models with Context-Aware Evolutionary Learning. Applied Soft Computing, 132, Article ID: 109854. https://doi.org/10.1016/j.asoc.2022.109854</mixed-citation></ref><ref id="scirp.126207-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Issitt, R.W., Cortina-Borja, M., Bryant, W., Bowyer, S., Taylor, A.M. and Sebire, N. (2022) Classification Performance of Neural Networks versus Logistic Regression Models: Evidence from Healthcare Practice. Cureus, 14, e22443. https://doi.org/10.7759/cureus.22443</mixed-citation></ref><ref id="scirp.126207-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Valluri, C., Raju, S. and Patil, V.H. (2022) Customer Determinants of Used Auto Loan Churn: Comparing Predictive Performance Using Machine Learning Techniques. Journal of Marketing Analytics, 10, 279-296. https://doi.org/10.1057/s41270-021-00135-6</mixed-citation></ref><ref id="scirp.126207-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Xie, X., Wang, L. and Wang, A. (2010) Artificial Neural Network Modeling for Deciding If Extractions Are Necessary Prior to Orthodontic Treatment. The Angle Orthodontist, 80, 262-266. 
https://doi.org/10.2319/111608-588.1</mixed-citation></ref><ref id="scirp.126207-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Abiodun, O.I., Jantan, A., Omolara, A.E., Dada, K.V., Mohamed, N.A. and Arshad, H. (2018) State-of-the-Art in Artificial Neural Network Applications: A Survey. Heliyon, 4, e00938. https://doi.org/10.1016/j.heliyon.2018.e00938</mixed-citation></ref><ref id="scirp.126207-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Sarker, I.H. (2021) Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Computer Science, 2, Article No. 160. https://doi.org/10.1007/s42979-021-00592-x</mixed-citation></ref><ref id="scirp.126207-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Couronné, R., Probst, P. and Boulesteix, A.-L. (2018) Random Forest versus Logistic Regression: A Large-Scale Benchmark Experiment. BMC Bioinformatics, 19, Article No. 270. https://doi.org/10.1186/s12859-018-2264-5</mixed-citation></ref><ref id="scirp.126207-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Sarker, I.H. (2021) Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions. SN Computer Science, 2, Article No. 420. https://doi.org/10.1007/s42979-021-00815-1</mixed-citation></ref><ref id="scirp.126207-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M.A., Al-Amidie, M. and Farhan, L. (2021) Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. Journal of Big Data, 8, Article No. 53. 
https://doi.org/10.1186/s40537-021-00444-8</mixed-citation></ref><ref id="scirp.126207-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Montesinos López, O.A., Montesinos López, A. and Crossa, J. (2022) Multivariate Statistical Machine Learning Methods for Genomic Prediction. Springer, Cham. https://doi.org/10.1007/978-3-030-89010-0</mixed-citation></ref><ref id="scirp.126207-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Albaradei, S., Thafar, M., Alsaedi, A., Van Neste, C., Gojobori, T., Essack, M. and Gao, X. (2021) Machine Learning and Deep Learning Methods That Use Omics Data for Metastasis Prediction. Computational and Structural Biotechnology Journal, 19, 5008-5018. https://doi.org/10.1016/j.csbj.2021.09.001</mixed-citation></ref><ref id="scirp.126207-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Duan, F., Zhang, S., Yan, Y. and Cai, Z. (2022) An Oversampling Method of Unbalanced Data for Mechanical Fault Diagnosis Based on MeanRadius-SMOTE. Sensors, 22, Article No. 5166. https://doi.org/10.3390/s22145166</mixed-citation></ref><ref id="scirp.126207-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Karrar, A.E. (2022) The Effect of Using Data Pre-Processing by Imputations in Handling Missing Values. Indonesian Journal of Electrical Engineering and Informatics, 10, 375-384. https://doi.org/10.52549/ijeei.v10i2.3730</mixed-citation></ref><ref id="scirp.126207-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Bin Rafiq, R., Modave, F., Guha, S. and Albert, M.V. (2020) Validation Methods to Promote Real-World Applicability of Machine Learning in Medicine. 2020 3rd International Conference on Digital Medicine and Image Processing, Kyoto, 6-9 November 2020, 13-19. https://doi.org/10.1145/3441369.3441372</mixed-citation></ref></ref-list></back></article>