NadaMarei/Fraudulent-Transaction-Detection-

Fraudulent-Transaction-Detection-

A machine learning model for detecting fraudulent transactions.

Importing the necessary libraries

import numpy as np  # support multi-dimensional arrays and matrices
import pandas as pd # provide high-performance, easy-to-use data structures and data analysis tools

Reading the CSV file

data = pd.read_csv('Fraud.csv')

You can find the dataset here:

https://drive.google.com/drive/folders/1kC2b0_rDrb5HpJFNXP3ZFlD2ePeMzZ9Y?usp=drive_link

Understanding the data

Show the first 5 rows of the DataFrame

data.head()

Show the last 5 rows of the DataFrame

data.tail()

Show the dimensions of the DataFrame as a (rows, columns) tuple

data.shape

Cleaning the data

isna().sum() returns a Series in which each value is the number of missing values in the corresponding column.

data.isna().sum()
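To illustrate what this call returns, here is a minimal sketch on a toy DataFrame. The column values are made up for the example and are not taken from Fraud.csv.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; NaN/None mark missing entries.
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 250.0, np.nan],
    "type": ["PAYMENT", "TRANSFER", None, "CASH_OUT"],
    "isFraud": [0, 0, 1, 0],
})

# Count of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells across the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

If every count is zero, as one would hope for the real dataset, no imputation or row dropping is needed before modeling.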

Count the occurrences of each unique value in the column "isFraud"

data.isFraud.value_counts()

Count the occurrences of each unique value in the column "isFlaggedFraud"

data.isFlaggedFraud.value_counts()
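Fraud labels are usually heavily imbalanced, so it can help to look at class shares as well as raw counts. The sketch below uses a made-up label Series (2 fraud cases out of 1000 rows, an assumption for illustration only) and the normalize option of value_counts.

```python
import pandas as pd

# Hypothetical labels: 2 fraud cases out of 1000 transactions.
labels = pd.Series([0] * 998 + [1] * 2, name="isFraud")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class

fraud_share = shares.get(1, 0.0)
print(counts)
print(f"Fraud share: {fraud_share:.2%}")
```

On the real data the same two calls would be data.isFraud.value_counts() and data.isFraud.value_counts(normalize=True).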

Feature Reduction

Dropping unnecessary columns

data = data.drop(['nameOrig', 'nameDest'], axis=1)
data.shape
data.head()

scikit-learn (sklearn) provides tools for data preprocessing, modeling, and evaluation.

Import the preprocessing module, which contains the label encoder

from sklearn import preprocessing

LabelEncoder encodes categorical (non-numerical) labels as numerical labels: it assigns a unique integer to each category in the data.

label_encoder = preprocessing.LabelEncoder()

The 'type' column in the DataFrame data is replaced with the numerical labels generated by the LabelEncoder.

data['type'] = label_encoder.fit_transform(data['type'])
data.head()
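To see which integer each category receives, the fitted encoder's classes_ attribute can be inspected. The sketch below uses a toy 'type' column with hypothetical category values. It also shows pd.get_dummies as one possible alternative, since integer codes imply an order between categories that a linear model may pick up; this is an illustration, not the approach used in this project.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for illustration.
toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"]})

encoder = LabelEncoder()
codes = encoder.fit_transform(toy["type"])

# classes_ holds the sorted categories; position i maps to integer i.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)

# Alternative: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(toy["type"], prefix="type")
print(one_hot.columns.tolist())
```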

Splitting the data

X: This variable contains the features or independent variables used for prediction.

  • It includes all columns from the DataFrame data except for the column 'isFraud'.
  • X will represent the input data for training the model.

y: This variable contains the target variable or dependent variable that we want to predict.

  • It corresponds to the column 'isFraud' from the DataFrame data.
  • y will represent the output or labels for training the model.

Splitting the data into X and y

data.loc[:, data.columns != 'isFraud'] selects all rows and all columns from data except for the column 'isFraud'.

data['isFraud'] selects only the column 'isFraud'.

X, y = data.loc[:, data.columns != 'isFraud'], data['isFraud']
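The boolean-mask selection above can also be written with DataFrame.drop. The sketch below checks on a toy frame (made-up columns and values) that both forms select the same features.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "step": [1, 1, 2],
    "amount": [10.0, 20.0, 30.0],
    "isFraud": [0, 1, 0],
})

# Boolean-mask selection: every column whose name is not 'isFraud'.
X_mask = toy.loc[:, toy.columns != "isFraud"]

# Equivalent selection: drop the target column by name.
X_drop = toy.drop(columns="isFraud")
y_toy = toy["isFraud"]

print(X_mask.equals(X_drop))
```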

Importing the necessary libraries

train_test_split is used for splitting a dataset into training and testing sets

from sklearn.model_selection import train_test_split

StandardScaler is a preprocessing technique that standardizes features by removing the mean and scaling them to unit variance. This prevents features with larger scales from dominating the algorithm's learning process.

from sklearn.preprocessing import StandardScaler

Define Train and Test sets

40% of the data will be used for testing, and the remaining 60% will be used for training.

Setting a random seed for reproducibility ensures that the data split is consistent across runs.

X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.40,random_state=42)
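With a rare positive class, a purely random split can leave the two sets with different fraud rates. One option, shown here as a sketch on synthetic data rather than as the split used in this project, is the stratify argument of train_test_split, which keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1000 rows, 20 positives (2%).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 20 + [0] * 980)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=42, stratify=y_toy
)

# With stratification both splits keep the 2% positive rate.
print(ys_train.mean(), ys_test.mean())
```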

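Standardization of the features in both the training and test sets

The scaler is fitted on the training set only, so each training feature ends up with mean 0 and standard deviation 1. The same transformation is then applied to the test set, whose features end up with mean and standard deviation only approximately 0 and 1.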

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
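The difference between fit_transform on the training set and transform on the test set can be checked numerically. This is a minimal sketch on synthetic features; the scales and sample sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
train = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(600, 2))
test = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(400, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.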

Create an instance of the Gaussian Naive Bayes model and use it to train and make predictions on the data.

Gaussian Naive Bayes assumes that the likelihood of each feature given the target variable is normally distributed.

from sklearn.naive_bayes import GaussianNB

Import scikit-learn metrics module for accuracy calculation

from sklearn import metrics

Create a Gaussian Naive Bayes classifier instance, useful for classification tasks

gnb = GaussianNB()

Training the model

Train the model using the training sets

gnb.fit(X_train, y_train)

Predict the response for the test dataset. The predict method takes the test features as input and returns the predicted class labels.

y_pred = gnb.predict(X_test)

accuracy_score returns the proportion of correctly predicted labels out of the total number of labels in the test set; multiplying by 100 reports it as a percentage.

print("Accuracy:",metrics.accuracy_score(y_test, y_pred)*100)
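Accuracy alone can look very good on imbalanced data. As a sketch under the assumption of 10 fraud cases among 1000 test rows (made-up numbers), a "model" that always predicts non-fraud still scores 99% accuracy while catching no fraud at all, which is why precision and recall are worth reporting alongside it.

```python
import numpy as np
from sklearn import metrics

# Hypothetical test labels: 10 fraud cases among 1000 transactions.
y_true = np.array([1] * 10 + [0] * 990)

# A baseline that always predicts the majority class (non-fraud).
y_always_zero = np.zeros_like(y_true)

accuracy = metrics.accuracy_score(y_true, y_always_zero)
# zero_division=0 avoids a warning when no positives are predicted.
recall = metrics.recall_score(y_true, y_always_zero, zero_division=0)
precision = metrics.precision_score(y_true, y_always_zero, zero_division=0)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```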

Using LogisticRegression

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print ("Accuracy : ", metrics.accuracy_score(y_test, y_pred)*100)
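Besides hard labels, LogisticRegression also exposes class probabilities through predict_proba. The sketch below, on a synthetic one-feature dataset (an assumption for illustration), shows that predict corresponds to thresholding the positive-class probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Hypothetical 1-D feature: positives tend to have larger values.
X_toy = np.concatenate(
    [rng.normal(0.0, 1.0, 200), rng.normal(2.0, 1.0, 200)]
).reshape(-1, 1)
y_toy = np.array([0] * 200 + [1] * 200)

clf = LogisticRegression(random_state=42).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)   # shape (n_samples, 2): P(class 0), P(class 1)
labels = clf.predict(X_toy)        # hard 0/1 labels

# predict() matches thresholding P(class 1) at 0.5.
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, labels_from_proba))
```

Working with probabilities makes it possible to choose a threshold other than 0.5, for example a lower one when missing a fraud case is costlier than a false alarm.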

Explaining the Visualization

  • confusion_matrix: Counts the true negatives, false positives, false negatives, and true positives of the logistic regression model's predictions on the test set.
  • roc_curve: Visualizes the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) across different threshold values.
  • AUC: The area under the ROC curve (AUC) quantifies the model's ability to discriminate between positive and negative cases.

from sklearn.metrics import confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt

# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.xticks([0, 1], ['Non-Fraud', 'Fraud'])
plt.yticks([0, 1], ['Non-Fraud', 'Fraud'])
plt.tight_layout()

for i in range(2):
    for j in range(2):
        plt.text(j, i, format(conf_matrix[i, j], 'd'),
                 horizontalalignment="center",
                 color="white" if conf_matrix[i, j] > conf_matrix.max() / 2. else "black")

plt.show()
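The four cells of a binary confusion matrix can also be unpacked into named counts and turned into precision and recall. This is a minimal sketch with made-up labels and predictions, not results from this model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes,
# so ravel() yields tn, fp, fn, tp for labels [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted frauds that are real
recall = tp / (tp + fn)     # share of real frauds that were caught

print(tn, fp, fn, tp, precision, recall)
```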

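# Compute ROC curve and AUC from the predicted probability of the fraud class
# (hard 0/1 predictions would give only a single threshold point)
y_score = classifier.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)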

# Plot Receiver Operating Characteristic curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
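A ROC curve is traced by sweeping a threshold over continuous scores, so what is passed to roc_curve matters. The sketch below, on made-up labels and scores, compares the AUC computed from probabilities with the AUC computed from labels already thresholded at 0.5.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9])

# Curve and AUC from continuous scores: one point per useful threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_from_scores = roc_auc_score(y_true, y_score)

# Thresholding at 0.5 first leaves a single operating point
# between the corners (0, 0) and (1, 1).
y_hard = (y_score >= 0.5).astype(int)
fpr_hard, tpr_hard, _ = roc_curve(y_true, y_hard)
auc_from_labels = roc_auc_score(y_true, y_hard)

print(auc_from_scores, auc_from_labels)
```

Here the scores give an AUC of 0.875, while the hard labels give 0.75, because the ranking information within each predicted class is lost.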
