Dataset Link:
Context
It is essential that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
Content
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
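To make the imbalance concrete, here is a minimal sketch of the arithmetic using only the two counts quoted above. The variable names are my own and are not part of the dataset.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284_807

# Share of transactions that are fraudulent (the positive class).
fraud_rate = n_fraud / n_total
print(f"fraud share: {fraud_rate:.4%}")

# How many legitimate transactions there are for each fraudulent one.
legit_per_fraud = (n_total - n_fraud) / n_fraud
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

With so few positives, any model evaluation has to focus on how well the rare class is found, not on overall hit rate.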
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
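As an illustration of what example-dependent cost-sensitive learning can mean for ‘Amount’, here is a rough sketch of one possible cost function: a missed fraud costs the transaction amount, and every flagged transaction costs a fixed investigation fee. The fee value, the function names, and the toy data are all assumptions for illustration; they do not come from the dataset.

```python
# Hypothetical fixed cost of investigating one flagged transaction.
ADMIN_COST = 2.0


def transaction_cost(is_fraud: int, flagged: int, amount: float) -> float:
    """Cost of one decision under the assumed cost scheme."""
    if flagged:
        return ADMIN_COST  # flagged (rightly or wrongly): pay for the investigation
    if is_fraud:
        return amount      # missed fraud: the full amount is lost
    return 0.0             # legitimate and not flagged: no cost


def total_cost(classes, predictions, amounts) -> float:
    """Sum the per-transaction costs for a set of predictions."""
    return sum(
        transaction_cost(c, p, a)
        for c, p, a in zip(classes, predictions, amounts)
    )


# Toy data: 'Class' values and 'Amount' values for five made-up transactions.
classes = [0, 0, 1, 1, 0]
amounts = [12.5, 80.0, 250.0, 5.0, 40.0]

flag_nothing = [0] * len(classes)
flag_large = [1 if a >= 50 else 0 for a in amounts]

print(total_cost(classes, flag_nothing, amounts))
print(total_cost(classes, flag_large, amounts))
```

The point of such a scheme is that two models with the same number of errors can have very different costs, depending on which amounts they get wrong.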
Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
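The sketch below shows why plain accuracy is misleading here and how the area under the precision-recall curve can be estimated. It uses a step-wise estimate (often called average precision) written in plain Python and assumes all scores are distinct. The labels and scores are made up for illustration; in practice a library routine such as scikit-learn's `average_precision_score` would normally be used instead.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Assumes binary labels (1 = fraud) and distinct scores.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at the point recall increases
    return total / n_pos


# Made-up example: 1,000 transactions, 2 of them fraudulent.
labels = [1 if i in (3, 40) else 0 for i in range(1000)]
# Made-up model scores that simply decrease with the index.
scores = [1.0 - i / 1000 for i in range(1000)]

# A "model" that never flags anything still looks excellent on accuracy.
always_negative = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)

print(f"accuracy of always predicting non-fraud: {accuracy:.3f}")
print(f"average precision of the scored model:  {average_precision(labels, scores):.3f}")
```

The always-negative baseline catches no fraud at all, yet its accuracy is close to 1, which is why a precision-recall based metric is the more honest summary.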
Analysis:
- The total transaction amount in this dataset is 91.2 million, and about 55% of transactions were made by male customers.
- Of that total, 3.99 million is fraud and the rest is non-fraud. Male customers also account for the larger share of the fraud.
- In fraud transactions, the category shopping_nets has the highest total, worth 1.7 million, and travel has the lowest.
- The state of New York has the highest total of fraud transactions, worth 0.3 million, and Hawaii has the lowest.
- Looking at demographics, customers born between 1950 and 1980 account for the most fraud, followed by customers born after 2000, while those born between 1980 and 2000 account for the least.
- Across all categories, New York City is the top city for fraud transactions.
- In all cities, the category shopping_nets has the highest amount of fraud activity and travel has the lowest.
- In non-fraud transactions, the category grocery_pos has the highest number of transactions and grocery_net has the lowest.
- In non-fraud transactions, Texas has the highest number of transactions and Rhode Island has the lowest.
- In non-fraud transactions, the total amount is 87.23 million, and male customers again have a larger share of transactions than female customers.
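The breakdowns above are group-by totals, and they can be sketched in a few lines of plain Python. The rows, the column names (`category`, `amount`, `state`, `birth_year`, `is_fraud`), the cohort boundaries, and the helper names are all assumptions for illustration; they are not the real data or the exact method behind the numbers above.

```python
from collections import defaultdict

# Made-up rows standing in for the real transactions.
rows = [
    {"category": "shopping_nets", "amount": 900.0, "state": "NY", "birth_year": 1965, "is_fraud": 1},
    {"category": "travel", "amount": 20.0, "state": "HI", "birth_year": 1990, "is_fraud": 1},
    {"category": "shopping_nets", "amount": 300.0, "state": "NY", "birth_year": 2002, "is_fraud": 1},
    {"category": "grocery_pos", "amount": 55.0, "state": "TX", "birth_year": 1975, "is_fraud": 0},
    {"category": "grocery_pos", "amount": 70.0, "state": "TX", "birth_year": 1985, "is_fraud": 0},
    {"category": "travel", "amount": 15.0, "state": "RI", "birth_year": 1960, "is_fraud": 0},
]


def birth_cohort(row):
    """Bucket a row by birth year (boundaries are an assumption)."""
    year = row["birth_year"]
    if year < 1950:
        return "before 1950"
    if year < 1980:
        return "1950-1980"
    if year < 2000:
        return "1980-2000"
    return "after 2000"


def total_amount_by(rows, key_fn, is_fraud):
    """Sum amounts per group, keeping only rows with the given fraud flag."""
    totals = defaultdict(float)
    for row in rows:
        if row["is_fraud"] == is_fraud:
            totals[key_fn(row)] += row["amount"]
    return dict(totals)


fraud_by_category = total_amount_by(rows, lambda r: r["category"], is_fraud=1)
fraud_by_cohort = total_amount_by(rows, birth_cohort, is_fraud=1)
legit_by_state = total_amount_by(rows, lambda r: r["state"], is_fraud=0)

print("top fraud category:", max(fraud_by_category, key=fraud_by_category.get))
print("fraud by birth cohort:", fraud_by_cohort)
print("non-fraud totals by state:", legit_by_state)
```

The same pattern extends to any other column, such as gender or city, by changing the key function.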
This is my first step in analyzing and understanding the data; the next step is to build a machine-learning model.
GitHub link
Thank you for reading. Please feel free to connect on LinkedIn.