<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2023.112010</article-id><article-id pub-id-type="publisher-id">JCC-123363</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Machine Learning-Based Approach for Identification of SIM Box Bypass Fraud in a Telecom Network Based on CDR Analysis: Case of a Fixed and Mobile Operator in Cameroon
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Eric</surname><given-names>Michel Deussom Djomadji</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Kabiena</surname><given-names>Ivan Basile</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Tchapga</surname><given-names>Tchito Christian</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ferry</surname><given-names>Vaneck Kouam Djoko</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Michael</surname><given-names>Ekonde Sone</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib></contrib-group><aff id="aff3"><addr-line>Division of Information and Communications Technology, National School of Posts, Telecommunications and Information and Communication Technologies, University of Yaoundé I, Yaoundé, Cameroon</addr-line></aff><aff id="aff2"><addr-line>Department of Computer Engineering and Telecommunications, National Advanced School of Engineering, University of Douala, Douala, Cameroon</addr-line></aff><aff id="aff1"><addr-line>Department of Electrical and Electronic Engineering, College of Technology, University of Buea, Buea, Cameroon</addr-line></aff><pub-date pub-type="epub"><day>15</day><month>02</month><year>2023</year></pub-date><volume>11</volume><issue>02</issue><fpage>142</fpage><lpage>157</lpage><history><date date-type="received"><day>29</day><month>January</month>
<year>2023</year></date><date date-type="rev-recd"><day>25</day><month>February</month><year>2023</year></date><date date-type="accepted"><day>28</day><month>February</month><year>2023</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  In the telecommunications sector, companies suffer serious losses due to fraud, especially in Africa. One of the main types of fraud is SIM box bypass fraud, which consists of using SIM cards to divert incoming international calls away from mobile operators, causing massive losses of revenue. To address this problem, which affects almost all network operators, we developed machine learning models that exploit large volumes of operator data and detect fraud by analyzing the CDRs of voice calls. In this paper we used three classification techniques, Random Forest, Support Vector Machine (SVM) and XGBoost, to detect this type of fraud, and we compared their performance using data collected from an operator’s network in Cameroon. The algorithm with the best performance was Random Forest, with 92% accuracy; we then used it to detect the fraudulent numbers present on the telecommunications operator’s network.
 
</p></abstract><kwd-group><kwd>CDR</kwd><kwd> Fraud Detection</kwd><kwd> Machine Learning</kwd><kwd> Voice Calls</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Cameroon’s economy pays a high price for international telephone calls routed through SIM Box fraud. In 2015, the loss of revenue reached 22.2 billion FCFA. Of this, 18 billion FCFA was borne by the four local telephone operators, namely CAMTEL, MTN Cameroon, Orange Cameroon, and Nexttel; this is the bill for 100 million minutes of calls made from abroad. The state, for its part, lost 4.2 billion FCFA in uncollected taxes. In 2014, the overall losses were 9.3 billion FCFA, of which operators bore 7.5 billion and the state 1.8 billion [<xref ref-type="bibr" rid="scirp.123363-ref1">1</xref>]. SIM Box fraud consists of passing off an international call as a local call by routing it over the internet: the receiver sees a local number displayed although the call comes from abroad. Common especially in Africa and Asia, this fraud causes financial losses of between 2.3 and 7 billion dollars worldwide [<xref ref-type="bibr" rid="scirp.123363-ref2">2</xref>]. More generally, fraud is a major problem for mobile network operators worldwide, costing them more than 38 billion U.S. dollars per year [<xref ref-type="bibr" rid="scirp.123363-ref3">3</xref>]. In many countries, the international routing rate (ITR) is considerably higher than the rate for routing local calls. Fraudsters make considerable profit by bypassing the route of the licensed international operator to terminate calls in the country: they pay the local rate, which is lower than the ITR. This practice is illegal in most countries and is an important issue for many operators because of the associated loss of revenue.</p><p>In the context of this research, we worked on the case of a fixed and mobile operator in Cameroon. 
The operator has implemented several solutions to reduce SIM Box fraud, but so far these methods have not been effective and the operator continues to suffer financial losses. In particular, the existing solution does not provide real-time analysis of CDRs for the detection of SIM Box fraud. Therefore, we propose a machine learning based approach for real-time CDR analysis and efficient SIM Box fraud detection. The proposed method can be used in any telecommunications network; we apply it to this operator as a case study because the real data was collected there.</p><p>In SIM Box fraud, local SIM cards are used to bypass the international routes of mobile network operators: international calls are transferred over the Internet and delivered back, by means of a VoIP gateway device called a SIM Box, as local calls to the operator’s cellular network [<xref ref-type="bibr" rid="scirp.123363-ref4">4</xref>]. <xref ref-type="fig" rid="fig1">Figure 1</xref> and <xref ref-type="fig" rid="fig2">Figure 2</xref> respectively present the case of a normal international call and the case of fraud using a SIM Box.</p><p>Several studies have applied different tools and machine learning techniques to the problem of SIM Box detection.</p><p>D. I. Ighneiwa and H. S. Mohamed in [<xref ref-type="bibr" rid="scirp.123363-ref7">7</xref>] used unsupervised learning algorithms to cluster SIMs and gain insights into how the designed algorithm could be improved; different models were trained to detect SIMs used in SIM boxes.</p><p>A. Krenker, M. Volk, U. Sedlar, J. Bešter, and A. 
Kos in [<xref ref-type="bibr" rid="scirp.123363-ref8">8</xref>] showed that using a bidirectional artificial neural network (bi-ANN) to predict generic cell phone fraud in real time yielded high accuracy. The bi-ANN is used to predict the time series of subscriber call duration in order to identify any unusual behavior. The results show that the bi-ANN is able to predict these time series with a rate of 90% in an optimal network configuration.</p><p>A. H. Elmi, R. Sallehuddin, S. Ibrahim, and A. M. Zain in [<xref ref-type="bibr" rid="scirp.123363-ref9">9</xref>] analyzed a set of 234,324 calls made by 6415 subscribers of a single cell ID over a two-month period. The dataset included 2126 fraudulent subscribers and 4289 normal subscribers, i.e., about two-thirds legitimate subscribers and one-third fraudulent SIM Box subscribers. The researchers extracted 9 features, such as the total number of calls, the total number of minutes and the average number of minutes. They then used the extracted features to train an artificial neural network (ANN) classifier. They found that the best architecture was the one with two hidden layers, each with five hidden neurons, and a learning rate of 0.6. Accuracy reached 98.7%, with only 20 cases misclassified as false positives.</p><p>Deussom et al. in [<xref ref-type="bibr" rid="scirp.123363-ref10">10</xref>] detect fraud by analyzing CDRs and internet traffic. The Differential Privacy model was used to encrypt users’ personal information, and the k-means and DBSCAN algorithms were used to group users into different clusters. Using a two-dimensional representation, they were able to visualize the users suspected of fraud: these were the users located very far from the different cluster centres.</p><p>S. Subudhi and S. 
Panigrahi in [<xref ref-type="bibr" rid="scirp.123363-ref11">11</xref>] presented a new approach to detect fraudulent activities in mobile telecommunications networks using possibilistic fuzzy c-means clustering. First, the optimal values of the clustering parameters were estimated experimentally. The modelling of the subscriber behaviour profile is then performed by applying the clustering algorithm to two relevant call features selected from the subscriber’s historical call records. Any sign of intrusive activity is detected by comparing the most recent call activity with the subscriber’s normal profile. The works reviewed above show that machine learning can be used in many use cases, such as fraud detection and network maintenance [<xref ref-type="bibr" rid="scirp.123363-ref12">12</xref>]. The rest of this paper is organized as follows: Section 2 presents the materials and methods, Section 3 presents the results and comments, and Section 4 concludes.</p></sec><sec id="s2"><title>2. Materials and Methods</title><p>We used machine learning to analyze CDRs and develop a collaborative model capable of identifying SIM Box fraud, using three machine learning algorithms: Random Forest, SVM and XGBoost. Since the CDR data is labelled (each record belongs to either a fraudster or a non-fraudster), supervised classification is a natural way to distinguish between fraudulent and non-fraudulent numbers. These three algorithms have several advantages: they are simple, fast and easy to understand, and above all they give results with good accuracy.</p><p>In this work, we implemented three learning algorithms that have been shown to work well with unbalanced datasets.</p><sec id="s2_1"><title>2.1. 
The Random Forest Algorithm</title><p>The Random Forest algorithm is a classification algorithm that reduces the variance of the predictions of a single decision tree, and thus improves performance, by combining multiple decision trees in a bagging approach. In its most classical form, it trains in parallel multiple randomly constructed decision trees, each on a different subset of the data. The random forest algorithm is known to be one of the most efficient “out-of-the-box” classifiers (i.e., requiring little data pre-processing) [<xref ref-type="bibr" rid="scirp.123363-ref13">13</xref>]. The random forest algorithm proceeds in four steps, which are not detailed here. <xref ref-type="fig" rid="fig3">Figure 3</xref> presents an illustration of the random forest.</p></sec><sec id="s2_2"><title>2.2. The SVM Algorithm</title><p>Support vector machines (SVM) are a set of supervised learning techniques designed to solve classification and regression problems. They were developed in the 1990s from Vladimir Vapnik’s theoretical work on a statistical theory of learning, the Vapnik-Chervonenkis theory. They were quickly adopted for their ability to work with high-dimensional data, their low number of hyperparameters, their theoretical guarantees, and their good results in practice [<xref ref-type="bibr" rid="scirp.123363-ref15">15</xref>].</p></sec><sec id="s2_3"><title>2.3. The XGBoost Algorithm</title><p>XGBoost started as a research project by Tianqi Chen within the Distributed (Deep) Machine Learning Community (DMLC) group. It is a popular and efficient open-source implementation of the gradient boosted tree algorithm. 
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models [<xref ref-type="bibr" rid="scirp.123363-ref16">16</xref>].</p><p>For the present work, Python version 3.8 was used as the programming language for running the machine learning algorithms. Anaconda is the Python distribution used; it ships with the tools and libraries needed for machine learning, such as NumPy, Matplotlib, scikit-learn, Jupyter and Spyder.</p></sec><sec id="s2_4"><title>2.4. Data Collection and Preparation</title><p>&#183; Data collection</p><p>Recall that the purpose of this study is to contribute to the creation of an effective fraud detection model for a telecommunication network, in order to reduce or eliminate the losses caused by fraud. We therefore need to develop a model that can identify each fraudster so that their activities can be stopped. To do this, we started by collecting and processing data. We obtained CDR files with 60,000 call records, which we sorted before selecting the fields needed to build our model. We were granted special permission to use this data while preserving the confidentiality of users’ information.</p><p>&#183; Data preparation</p><p>It is our responsibility to understand, analyze and determine what data can be used to build our model.</p><p>&#183; Description of the data</p><p>The CDR data we collected from the MSOFTX3000 is in .csv format.</p><p>The CDRs from the MSOFTX3000 are dated April 2021. 
<xref ref-type="table" rid="table1">Table 1</xref> lists the fields present:</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Overview of MSOFTX3000 CDR fields</title></caption><table><thead><tr><th align="center" valign="middle" >Element</th><th align="center" valign="middle" >Description</th></tr></thead><tbody><tr><td align="center" valign="middle" >Calling number</td><td align="center" valign="middle" >Phone number of the caller</td></tr><tr><td align="center" valign="middle" >Called number</td><td align="center" valign="middle" >Phone number of the called party</td></tr><tr><td align="center" valign="middle" >Answer date</td><td align="center" valign="middle" >Date the call was picked up</td></tr><tr><td align="center" valign="middle" >Answer time</td><td align="center" valign="middle" >Time the call was picked up</td></tr><tr><td align="center" valign="middle" >Release date</td><td align="center" valign="middle" >Date the call was hung up</td></tr><tr><td align="center" valign="middle" >Release time</td><td align="center" valign="middle" >Time the call was hung up</td></tr><tr><td align="center" valign="middle" >Call duration</td><td align="center" valign="middle" >Duration of the call</td></tr><tr><td align="center" valign="middle" >Cause for term</td><td align="center" valign="middle" >Cause of call termination</td></tr></tbody></table></table-wrap><p>&#183; Exploring the Data</p><p>It is important to visualize the data as it was collected and to show how the different fields relate to each other. The choice of the datasets to be manipulated is crucial. Below is an image of some of the data fields.</p><p>&#183; Investigations of fraudulent numbers</p><p>The fraudulent SIM Box accounts were investigated by the operator’s fraud department and cancelled due to their malicious activity. 
As a result, we obtained labelled data for the month of April 2021. <xref ref-type="fig" rid="fig4">Figure 4</xref> and <xref ref-type="fig" rid="fig5">Figure 5</xref> present a sample of the data collected from the Huawei MSOFTX3000, which is the core network switching equipment.</p><p>Note: In <xref ref-type="fig" rid="fig4">Figure 4</xref> and <xref ref-type="fig" rid="fig5">Figure 5</xref> subscribers’ phone numbers were blurred intentionally to protect their privacy.</p></sec><sec id="s2_5"><title>2.5. Evaluation Method</title><p>Confusion matrix</p><p>The confusion matrix is a commonly used way to describe the performance of a classification model in a fraud detection system. It summarizes the prediction results for a particular classification problem by comparing the actual values of a target variable with those predicted by the model. Correct and incorrect predictions are counted and broken down by class. The results of a confusion matrix fall into four broad categories: true positives, true negatives, false positives, and false negatives [<xref ref-type="bibr" rid="scirp.123363-ref17">17</xref>].</p><p>Different metrics can be calculated from the confusion matrix in <xref ref-type="table" rid="table2">Table 2</xref> to facilitate interpretation, for example the error rate, accuracy, precision, recall and F1 score. These indicators allow a better appreciation of the quality of the model’s predictions.</p></sec><sec id="s2_6"><title>2.6. 
Construction and Training of the Model</title><p>In this part, we built the columns representing the volume of incoming and outgoing calls of each number, and built the fraud target variable (a binary variable equal to 1 if the call is fraudulent and 0 otherwise). <xref ref-type="fig" rid="fig6">Figure 6</xref> presents the construction of the new columns.</p><p>&#183; Labelling:</p><p>We labelled the SIM Box numbers in the column “is_fraudulent”. To each line we apply a lambda function that checks whether the line contains a fraudulent number; it returns 1 if so and 0 otherwise, as represented in <xref ref-type="table" rid="table3">Table 3</xref>.</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Confusion matrix</title></caption><table><thead><tr><th align="center" valign="middle"  colspan="2"   rowspan="2"  ></th><th align="center" valign="middle"  colspan="2"  >Actual values</th></tr><tr><th align="center" valign="middle" >P</th><th align="center" valign="middle" >N</th></tr></thead><tbody><tr><td align="center" valign="middle"  rowspan="2"  >Predictions</td><td align="center" valign="middle" >P</td><td align="center" valign="middle" >TP</td><td align="center" valign="middle" >FP</td></tr><tr><td align="center" valign="middle" >N</td><td align="center" valign="middle" >FN</td><td align="center" valign="middle" >TN</td></tr></tbody></table></table-wrap><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Labelling format</title></caption><table><thead><tr><th align="center" valign="middle" >Target</th><th align="center" valign="middle" >Code</th></tr></thead><tbody><tr><td align="center" valign="middle" >Non-fraudulent numbers (Not Fraud)</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >Fraudulent numbers (Fraud)</td><td align="center" valign="middle" 
>1</td></tr></tbody></table></table-wrap><p>&#183; Data transformation</p><p>We determined the outgoing call volume for each number:</p><p>Outcoming_call_volume: in our dataset, we select the numbers that appear several times, group the records by calling number and, for each group, count the number of times the number appears in the calling number column.</p><p>We determined the incoming call volume for each number:</p><p>Incoming_call_volume: for each number in the aggregated list of calling numbers, we count the number of times it appears in the called number column of the initial dataset.</p><p>We determined the average duration of the calls made by each number:</p><p>Mean_call_duration: the ratio of the total duration of calls to the total number of calls.</p><p>&#183; Data normalization</p><p>Since machine learning algorithms cannot work directly on strings, we had to encode the classes of the “Cause for term” variable using one-hot encoding, which is a very common approach. This encoding creates new binary columns indicating the presence of each possible value from the original data: “normalRelease” for a normal hang-up, “partialRecord” for a partial record and “unsuccessfulCallAttempt” for an unsuccessful call.</p><p>&#183; Search for dependency between variables</p><p>The dependency between variables is examined through their correlation coefficients. The closer the coefficient is to 1 (solid red), the stronger the positive correlation. A coefficient close to 0 indicates little or no linear correlation, and a negative coefficient indicates a negative correlation; dark blue marks the low end of the colour scale. This is presented in <xref ref-type="fig" rid="fig7">Figure 7</xref>.</p><p>&#183; Training the model</p><p>We used the usual dataset splitting rule, provided by the scikit-learn library, with 80% of our dataset set aside for training and 20% for testing. 
The function used to split the dataset into training data and test data is presented below.</p><p>from sklearn.model_selection import train_test_split</p><p>X_train, X_test, y_train, y_test = train_test_split(X_train_res, y_train_res, random_state = 40, test_size = 0.2)</p></sec></sec><sec id="s3"><title>3. Results and Discussion</title><sec id="s3_1"><title>3.1. Learning and Creating Prediction Models</title><p>&#183; Prediction with the Random Forest</p><p>To do this, we imported the algorithm from the scikit-learn library via the following code:</p><p>from sklearn.ensemble import RandomForestClassifier</p><p>Then we created a Random Forest classifier of 100 trees via the following code:</p><p>rf = RandomForestClassifier(n_estimators = 100, random_state = 40)</p><p>And we launched the training on our training dataset with the following Python code:</p><p>rf.fit(X=X_train, y=y_train)</p><p>After training our Random Forest model, we obtained the results presented in <xref ref-type="fig" rid="fig8">Figure 8</xref>.</p><p>In training, the precision was 0.86 for fraudsters, with a recall of 0.94, and 0.95 for non-fraudsters, with a recall of 0.88, for an overall accuracy of 0.91, so our model predicted well in training.</p><p>After testing the Random Forest model, we obtained the result presented in <xref ref-type="fig" rid="fig9">Figure 9</xref>.</p><p>For the test, the precision was 0.88 for fraudsters, with a recall of 0.96, and 0.96 for non-fraudsters, with a recall of 0.89, for an overall accuracy of 0.92, so our model performed well on the test data.</p><p>&#183; Prediction with the SVM</p><p>To do this, we imported the algorithm from the scikit-learn library via the following code:</p><p>from sklearn.svm import SVC</p><p>Then we created an SVM classifier, whose C value determines the penalty applied to classification errors. 
This is done via the following code:</p><p>svc = SVC(random_state = 40, C = 20)</p><p>And we launched the training on our training dataset with the following Python code:</p><p>svc.fit(X = X_train, y = y_train)</p><p>After training our SVM model, we obtained the result presented in <xref ref-type="fig" rid="fig10">Figure 10</xref>.</p><p>In training, the precision was 0.74 for fraudsters, with an f1-score of 0.83, and 0.95 for non-fraudsters, with an f1-score of 0.83, for an overall accuracy of 0.83. Here our model performed with lower accuracy than the Random Forest in training.</p><p>Testing our SVM model, we obtained the result in <xref ref-type="fig" rid="fig11">Figure 11</xref>:</p><p>For the test, the precision was 0.89 for fraudsters, with an f1-score of 0.53, and 0.64 for non-fraudsters, with an f1-score of 0.76, for an overall accuracy of 0.69, so our model did not perform well on the test data.</p><p>&#183; Prediction with the XGBoost</p><p>To do this, we imported the algorithm from the xgboost library via the following code:</p><p>from xgboost import XGBClassifier</p><p>Then we created an XGBoost classifier via the following code:</p><p>xgb = XGBClassifier()</p><p>And we launched the training on our training dataset with the following Python code:</p><p>xgb.fit(X = X_train, y = y_train)</p><p>Training our XGBoost model, we got the following result in <xref ref-type="fig" rid="fig12">Figure 12</xref>:</p><p>In training, the precision was 0.71 for fraudsters, with an f1-score of 0.82, and 0.96 for non-fraudsters, with an f1-score of 0.80, for an overall accuracy of 0.81, so our model predicted reasonably well in training.</p><p>Testing the XGBoost model, we obtained the results presented in <xref ref-type="fig" rid="fig13">Figure 13</xref>.</p><p>For the test, the precision was 0.72 for fraudsters, with an f1-score of 0.83, and
0.96 for non-fraudsters, with an f1-score of 0.79, for an overall accuracy of 0.81, so our model performed acceptably on the test data.</p></sec><sec id="s3_2"><title>3.2. Evaluation of the Model by the Confusion Matrix</title><p>&#183; Random Forest algorithm</p><p><xref ref-type="fig" rid="fig14">Figure 14</xref> presents the Random Forest confusion matrix. The number of false negatives is 23: we predicted “no” but they are fraudsters. The number of false positives is 66: we predicted “yes” but they are not fraudsters. The number of true negatives is 519: we predicted that they are not fraudsters and indeed they are not. The number of true positives is 492: we predicted that they are fraudsters and indeed they are.</p><p>&#183; SVM algorithm</p><p><xref ref-type="fig" rid="fig15">Figure 15</xref> presents the SVM confusion matrix. The number of false negatives is 322: we predicted that they are not fraudsters but they are. The number of false positives is 24: we predicted “yes” but they are not fraudsters. The number of true negatives is 561: we predicted that they are not fraudsters and indeed they are not. The number of true positives is 193: we predicted that they are fraudsters and indeed they are.</p><p>&#183; XGBoost algorithm</p><p><xref ref-type="fig" rid="fig16">Figure 16</xref> presents the XGBoost confusion matrix. The number of false negatives is 16: we predicted that they are not fraudsters but they are. The number of false positives is 192: we predicted “yes” but they are not fraudsters. The number of true negatives is 393: we predicted that they are not fraudsters and indeed they are not. The number of true positives is 499: we predicted that they are fraudsters and indeed they are.</p></sec><sec id="s3_3"><title>3.3. 
Discussion</title><p>Following the experiments described in the previous sections, the machine learning model we propose is the Random Forest model. This model is retained because it gives the best SIM Box fraud detection results, with an accuracy of 0.92 and an F1-score of 0.92, as presented in <xref ref-type="table" rid="table4">Table 4</xref>:</p><p>We then made predictions with our best performing model, to determine whether the Random Forest model is able to correctly identify a case of SIM Box fraud. <xref ref-type="fig" rid="fig17">Figure 17</xref> presents the command which can be used.</p><p>Then we tested each line of the dataset to bring out the fraudulent and non-fraudulent numbers. We obtained the dataset with the list of fraudulent and non-fraudulent numbers.</p><p>Finally, we filtered the dataset to keep only the fraudulent numbers, and obtained the list of fraudulent numbers. For that we used the following code; the results are presented in <xref ref-type="fig" rid="fig18">Figure 18</xref> and <xref ref-type="fig" rid="fig19">Figure 19</xref>:</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Comparison of the models</title></caption><table><thead><tr><th align="center" valign="middle"  colspan="2"  ></th><th align="center" valign="middle" >Precision</th><th align="center" valign="middle" >Accuracy</th><th align="center" valign="middle" >F1-score</th></tr></thead><tbody><tr><td align="center" valign="middle"  rowspan="2"  >Random Forest</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0.96</td><td align="center" valign="middle"  rowspan="2"  >0.92</td><td align="center" valign="middle" >0.92</td></tr><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.88</td><td align="center" valign="middle" >0.92</td></tr><tr><td align="center" 
valign="middle"  rowspan="2"  >SVM</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0.64</td><td align="center" valign="middle"  rowspan="2"  >0.69</td><td align="center" valign="middle" >0.76</td></tr><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.89</td><td align="center" valign="middle" >0.53</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >XGBoost</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0.96</td><td align="center" valign="middle"  rowspan="2"  >0.81</td><td align="center" valign="middle" >0.79</td></tr><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.71</td><td align="center" valign="middle" >0.83</td></tr></tbody></table></table-wrap><p>Df_fraudulents = dataframe_predictions_with_numbers[dataframe_predictions_with_numbers["is_fraudulent"] == "fraudulent"]</p><p>Due to the rapid evolution of SIM Box fraud, we think it is necessary to refresh the detection model periodically, for example every quarter, and always to use the most accurate model for fraud detection.</p></sec></sec><sec id="s4"><title>4. Conclusions</title><p>The objective of this paper was to design and implement a SIM Box fraud detection system for a telecommunications network operator, with a case study based on data collected from a fixed and mobile network operator in Cameroon. The project aims to quickly identify SIM Box fraud and to reduce or eliminate the financial loss it causes to the company’s turnover.</p><p>We used machine learning techniques to effectively identify SIM Box fraud based on CDR analysis and prevent it from harming telecom companies in terms of revenue, quality of service and security. In order to detect SIM Box fraud, since the dataset is labelled and unbalanced, we used classification algorithms. 
After this step, we compared the incoming and outgoing call rates and then determined the total call duration in a day. Thus, an individual not detected during the first hours may be detected in the following hours. We ran the data through different supervised machine learning models in order to compare their performance based on accuracy and select the best one for fraud detection. From the experiment, we found that Random Forest, SVM and XGBoost are all able to detect SIM Box bypass fraud. The experimental results showed that Random Forest has the best accuracy of the three: Random Forest gave 92% accuracy, while the SVM model gave 76% and XGBoost gave 84%. Therefore, the Random Forest approach is the most suitable classification model for SIM Box fraud detection. This model was then used to successfully identify the fraudulent numbers in the mobile operator’s network.</p></sec><sec id="s5"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s6"><title>Cite this paper</title><p>Deussom Djomadji, E.M., Basile, K.I., Christian, T.T., Djoko, F.V.K. and Michael, S.E. (2023) Machine Learning-Based Approach for Identification of SIM Box Bypass Fraud in a Telecom Network Based on CDR Analysis: Case of a Fixed and Mobile Operator in Cameroon. Journal of Computer and Communications, 11, 142-157. https://doi.org/10.4236/jcc.2023.112010</p></sec></body><back><ref-list><title>References</title><ref id="scirp.123363-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Agence Ecofin (2015) Cameroun: 22,2 milliards FCfa de pertes en 2015 sur les appels téléphoniques frauduleux par Simbox. Agence Ecofin. 
https://www.agenceecofin.com/gestion-publique/0910-32980-cameroun-22-2-milliards-fcfa-de-pertes-en-2015-sur-les-appels-telephoniques-frauduleux-par-simbox</mixed-citation></ref><ref id="scirp.123363-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Cameroun: La fraude par simbox peut coûter 18 milliards de F CFA aux opérateurs. JeuneAfrique. https://www.jeuneafrique.com/271158/economie/cameroun-la-fraude-par-simbox-peut-couter-18-milliards-de-f-cfa-aux-operateurs/</mixed-citation></ref><ref id="scirp.123363-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Revenue Assurance and Fraud: Risque et fraud management (revenue assurance). http://risquefraudtelecom.blogspot.com/2012/05/introduction-au-risque-et-fraud.html</mixed-citation></ref><ref id="scirp.123363-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Karunathilaka, A.V.V.S. (2020) Fraud Detection on International Direct Dial Calls. University of Colombo School of Computing, Colombo. https://dl.ucsc.cmb.ac.lk/jspui/bitstream/123456789/4451/1/2016%20MIT%20029.pdf</mixed-citation></ref><ref id="scirp.123363-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Reaves, B., Shernan, E., Bates, A., Carter, H. and Traynor, P. (2015) Boxed Out: Blocking Cellular Interconnect Bypass Fraud at the Network Edge. Proceedings of the 24th USENIX Security Symposium, Washington DC, 12-14 August 2015. https://www.usenix.org/system/files/conference/usenixsecurity15/sec15-paper-reaves-boxed.pdf</mixed-citation></ref><ref id="scirp.123363-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Witten, I.H., Frank, E. and Hall, M.A. (2016) Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington.</mixed-citation></ref><ref id="scirp.123363-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Ighneiwa, I. 
and Mohamed, H.S. (2017) Bypass Fraud Detection: Artificial Intelligence Approach. https://www.researchgate.net/publication/321070233_Bypass_Fraud_Detection_Artificial_Intelligence_Approach</mixed-citation></ref><ref id="scirp.123363-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Krenker, A., Volk, M., Sedlar, U., Bešter, J. and Kos, A. (2009) Bidirectional Artificial Neural Networks for Mobile-Phone Fraud Detection. ETRI Journal, 31, 92-94. https://doi.org/10.4218/etrij.09.0208.0245</mixed-citation></ref><ref id="scirp.123363-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Sallehuddin, R., Ibrahim, S., Mohd Zain, A. and Hussein Elmi, A. (2015) Detecting SIM Box Fraud by Using Support Vector Machine and Artificial Neural Network. Jurnal Teknologi, 74, 131-143. https://doi.org/10.11113/jt.v74.2649</mixed-citation></ref><ref id="scirp.123363-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Deussom Djomadji, E.M., Matemtsap Mbou, B., Tchagna Kouanou, A., Ekonde Sone, M. and Bayonbog, P. (2022) Machine Learning-Based Approach for Designing and Implementing a Collaborative Fraud Detection Model through CDR and Traffic Analysis. Transactions on Engineering and Computing Sciences, 10, 46-58. https://doi.org/10.14738/tmlai.104.12854</mixed-citation></ref><ref id="scirp.123363-ref11"><label>11</label><mixed-citation publication-type="book" xlink:type="simple">Subudhi, S. and Panigrahi, S. (2017) Use of Possibilistic Fuzzy C-Means Clustering for Telecom Fraud Detection. In: Behera, H., Mohapatra, D., Eds., Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, Vol. 556, Springer, Singapore, 633-641. https://doi.org/10.1007/978-981-10-3874-7_60</mixed-citation></ref><ref id="scirp.123363-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Bernabe, B., Michel, D., Marie, C. and Fabrice, M. 
(2022) Comparing Machine Learning Algorithms for Improving the Maintenance of LTE Networks Based on Alarms Analysis. Journal of Computer and Communications, 10, 125-137. https://doi.org/10.4236/jcc.2022.1012010</mixed-citation></ref><ref id="scirp.123363-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Random Forest. Data Analytics Post. https://dataanalyticspost.com/Lexique/random-forest/</mixed-citation></ref><ref id="scirp.123363-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Schott, M. (2019) Random Forest Algorithm for Machine Learning. Capital One Tech. https://medium.com/capital-one-tech/random-forest-algorithm-for-machine-learning-c4b2c8cc9feb</mixed-citation></ref><ref id="scirp.123363-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Support Vector Machine. Wikipedia. https://fr.wikipedia.org/wiki/Machine_%C3%A0_vecteurs_de_support</mixed-citation></ref><ref id="scirp.123363-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">XGBoost. Machine Learning Book. https://vatsalparsaniya.github.io/ML_Knowledge/XGBoost/Readme.html</mixed-citation></ref><ref id="scirp.123363-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Matrice de confusion: Comment la lire et l’interpréter? https://www.jedha.co/formation-ia/matrice-confusion</mixed-citation></ref></ref-list></back></article>