Fraud Detection — From Kaggle Dataset to Full-Stack App

Building an end-to-end fraud detection system: a Random Forest trained on 6.3 million synthetic transactions, served through a FastAPI backend and a Next.js frontend.

machine-learning · FastAPI · Next.js · random-forest · python · pandas

Introduction

Fraud detection sounds like a straightforward classification problem until you actually try it. The dataset is wildly imbalanced because for every fraudulent transaction, there are hundreds of legitimate ones. A model that just says "not fraud" every time would be right 99.87% of the time. That's a great accuracy score and a completely useless system.

Then there's the evaluation trap. Accuracy doesn't matter here. What matters is recall (how many actual fraud cases the model catches) and precision (how many of its fraud alerts are real). Get precision wrong and your system drowns analysts in false alarms. Get recall wrong and actual fraud slips through undetected.

This project is a full-stack fraud detection app: a Random Forest classifier trained on a synthetic financial dataset, a FastAPI backend serving predictions, and a Next.js frontend for both batch analysis and single-transaction testing. The goal wasn't to build a production-ready fraud system; it was to build a clean end-to-end pipeline where every trade-off is visible and every decision has a reason behind it.

The Data

PaySim and the Kaggle Dataset

The dataset comes from PaySim, a mobile money transaction simulator built by Edgar Lopez-Rojas. PaySim was calibrated against real transaction logs from a mobile financial service operating in Africa, so while the data is synthetic, the statistical patterns (transaction sizes, timing, flow between accounts) are modeled after real-world behavior.

The numbers: roughly 6.3 million transactions spanning 30 simulated days, where each time step represents one hour. There are five transaction types (CASH-IN, CASH-OUT, DEBIT, PAYMENT, and TRANSFER) and only about 0.13% of transactions are fraudulent. That imbalance ratio (roughly 1 fraud per 800 legitimate transactions) is actually realistic for financial systems, which is part of what makes this dataset interesting to work with.

One important pattern baked into the simulation: fraud only ever occurs on TRANSFER and CASH-OUT transactions. This makes sense intuitively: these are the two types that move money out of an account.
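This is easy to verify directly with pandas. A quick sketch, assuming the Kaggle CSV sits at a placeholder path:

```python
import pandas as pd

df = pd.read_csv("paysim.csv")  # placeholder path for the Kaggle PaySim export

# Frauds per transaction type: only TRANSFER and CASH_OUT come back non-zero.
print(df.groupby("type")["isFraud"].sum())
```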

Exploring the Data — Gabriel Preda's Notebook

Before touching any model, it's worth understanding what the data actually looks like. Gabriel Preda's Synthetic Financial Datasets - Data Exploration notebook on Kaggle is a solid reference for this.

A few key findings from the exploration phase. First, fraudulent transactions tend to drain the sender's account completely (the balance goes to zero). This is a strong signal and explains why oldbalanceOrg (the sender's balance before the transaction) ends up being the most important feature. Second, the isFlaggedFraud column, a rule-based flag built into the simulator, fires on almost no transactions. It was designed to flag transfers over 200,000 in a single transaction, but fraud in the dataset doesn't always follow that pattern, so a column that sounds like it should be a cheat code turns out to be nearly useless. Third, columns like nameOrig and nameDest are anonymized identifiers with millions of unique values and no real predictive power for a tree-based model, so they get dropped.
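The first two findings are just as easy to check. A sketch under the same assumptions as above:

```python
import pandas as pd

df = pd.read_csv("paysim.csv")  # same placeholder path as above
frauds = df[df["isFraud"] == 1]

# Fraction of frauds that leave the sender's balance at exactly zero.
print((frauds["newbalanceOrig"] == 0).mean())

# The simulator's rule-based flag fires on a tiny handful of rows.
print(int(df["isFlaggedFraud"].sum()), "flagged vs", int(df["isFraud"].sum()), "actual frauds")
```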

From Exploration to a Trainable Dataset

After exploration, the preprocessing pipeline keeps six features:

  • oldbalanceOrg: sender's balance before the transaction
  • amount: the transaction amount
  • type: transaction type, ordinal-encoded (DEBIT=1, TRANSFER=2, CASH_IN=3, PAYMENT=4, CASH_OUT=5)
  • step: the hour of the simulation
  • oldbalanceDest: receiver's balance before the transaction
  • isFlaggedFraud: the rule-based flag (kept despite low signal, since it costs nothing)

The dropped columns include newbalanceOrig and newbalanceDest (post-transaction balances), along with the name identifiers. All six retained features are then standardized using a StandardScaler (zero-centered and scaled to unit variance). Strictly speaking, a tree-based model doesn't need scaling, since threshold splits are insensitive to feature range, but it keeps the serving pipeline consistent and costs nothing.
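A minimal sketch of that preprocessing step (the helper name make_features and its exact shape are mine, not the project's; the type strings assume the Kaggle CSV):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

TYPE_CODES = {"DEBIT": 1, "TRANSFER": 2, "CASH_IN": 3, "PAYMENT": 4, "CASH_OUT": 5}
FEATURES = ["oldbalanceOrg", "amount", "type", "step", "oldbalanceDest", "isFlaggedFraud"]

def make_features(df: pd.DataFrame, scaler: StandardScaler | None = None):
    """Ordinal-encode `type`, keep the six features, and standardize them.

    Pass the scaler fitted on training data when transforming new rows,
    so serving applies the exact same scaling as training.
    """
    X = df.assign(type=df["type"].map(TYPE_CODES))[FEATURES]
    if scaler is None:
        scaler = StandardScaler().fit(X)  # fit on training data only
    return scaler.transform(X), scaler
```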

The Model

Random Forest — Why It Fits

Random Forest is an ensemble method: it trains many decision trees on random subsets of the data, then aggregates their votes for the final prediction. Each tree sees a slightly different slice of the data and a random subset of features, which makes the ensemble resistant to overfitting on any single pattern.

For tabular financial data with mixed feature types and heavy class imbalance, Random Forests are a strong practical choice. They handle non-linear relationships without feature engineering, they're relatively fast to train even on millions of rows, and they give you feature importances out of the box, which matters when you need to explain to someone why the model flagged a transaction.

Training — Arjun Joshua's Notebook

The model was trained following the approach in Arjun Joshua's Predicting Fraud in Financial Payment Services notebook on Kaggle, using the same dataset and a comparable Random Forest configuration: 100 estimators, gini criterion for splits, fully grown trees with no depth cap.

To deal with the extreme class imbalance, the training pipeline used resampling via imbalanced-learn (SMOTE and random oversampling of the minority class) to give the model enough fraud examples to learn meaningful patterns rather than defaulting to "always predict legitimate."
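Put together, the training loop is short. A sketch assuming imbalanced-learn is installed and that X, y come out of the preprocessing step above (the pipeline also tried plain random oversampling; SMOTE is shown here):

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stratify so the tiny fraud class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample only the training set; the test set must keep the real imbalance.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
model.fit(X_res, y_res)
```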

The estimated performance on test data:

  • Accuracy: ~99.9%
  • Precision: ~97–99%
  • Recall: ~76–86%

The accuracy number is misleading for the reasons mentioned earlier — it's high because the dataset is 99.87% legitimate. The meaningful numbers are precision and recall on the fraud class. A precision of ~98% means most alerts are real fraud. A recall of ~80% means the model catches about four out of five fraudulent transactions. That's a solid baseline, though the recall gap means roughly one in five fraud cases slips through.
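Getting those per-class numbers is one call. A sketch continuing from the split above:

```python
from sklearn.metrics import classification_report

# The fraud class (label 1) is the row that matters; accuracy alone is
# dominated by the 99.87% legitimate majority.
print(classification_report(y_test, model.predict(X_test), digits=3))
```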

Feature Importance

After training, the Random Forest exposes how much each feature contributed to its decisions. The results tell a clear story:

Random Forest — Feature Importances

Three features carry roughly 80% of the model's total signal: oldbalanceOrg (29.6%), amount (26.3%), and type (24.8%). The sender's pre-transaction balance is the single strongest predictor: fraudulent transfers systematically drain accounts to zero, so a high balance followed by a transfer of that exact amount is a strong fraud signal.

step (the hour) contributes 11%, which likely captures some temporal pattern in when fraud tends to occur. oldbalanceDest adds 8.2%. And isFlaggedFraud sits at effectively 0% — confirming what the exploration phase suggested: the simulator's built-in rule system was catching almost nothing.
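The numbers come straight off the fitted estimator. A sketch reusing the FEATURES list from the preprocessing step:

```python
# Pair each feature name with its importance, largest first.
for name, score in sorted(zip(FEATURES, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:>16}: {score:.1%}")
```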

The App

Backend

The backend is a FastAPI application that serves the trained model through a REST API. The core endpoints handle three things: batch prediction from uploaded CSV files, single-transaction prediction from a form, and model metadata.

For batch prediction, a user uploads a CSV of transactions. The server reads it, runs each row through the preprocessing pipeline (ordinal encoding, scaling with the same fitted StandardScaler), feeds the features to the Random Forest, and returns predictions with fraud probabilities — paginated so the frontend isn't trying to render thousands of rows at once.
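A sketch of what that batch endpoint can look like (route, parameter names, page size, and artifact filenames are illustrative, not the project's exact API; make_features is the helper sketched earlier):

```python
import io

import joblib
import pandas as pd
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = joblib.load("fraud_rf.joblib")   # placeholder artifact names,
scaler = joblib.load("scaler.joblib")    # loaded once at startup

@app.post("/predict/batch")
async def predict_batch(file: UploadFile, page: int = 1, page_size: int = 50):
    df = pd.read_csv(io.BytesIO(await file.read()))
    X, _ = make_features(df, scaler)      # same pipeline as training
    probs = model.predict_proba(X)[:, 1]  # P(fraud) for every row
    start = (page - 1) * page_size
    return {
        "total": len(probs),
        "page": page,
        "results": [
            {"row": start + i, "fraud_probability": float(p)}
            for i, p in enumerate(probs[start : start + page_size])
        ],
    }
```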

For single predictions, the API accepts the six input features, validates them using Pydantic schemas (which reject bad input with clear error messages instead of crashing), and returns both the risk label and the raw probability score. The risk thresholds are: ≤50% is clear, 50–85% is warning (manual review recommended), and >85% is fraud.
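Continuing the sketch, the single-transaction path, with Pydantic doing the validation and the thresholds mapped to labels (field and label names mirror the description above; the exact schema is an assumption):

```python
from pydantic import BaseModel, Field

class Transaction(BaseModel):
    # Constraint violations come back as a 422 with a clear message.
    oldbalanceOrg: float = Field(ge=0)
    amount: float = Field(gt=0)
    type: str  # e.g. "TRANSFER"
    step: int = Field(ge=0)
    oldbalanceDest: float = Field(ge=0)
    isFlaggedFraud: int = Field(ge=0, le=1)

def risk_label(prob: float) -> str:
    """Map a fraud probability to the three-tier risk label."""
    if prob > 0.85:
        return "fraud"
    if prob > 0.50:
        return "warning"  # manual review recommended
    return "clear"

@app.post("/predict")
def predict(tx: Transaction):
    X, _ = make_features(pd.DataFrame([tx.model_dump()]), scaler)
    prob = float(model.predict_proba(X)[0, 1])
    return {"risk": risk_label(prob), "fraud_probability": prob}
```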

The /model-info endpoint exposes the algorithm name, hyperparameters, feature importances, and preprocessing details — so the frontend can render all of that dynamically instead of hardcoding it.

The model loads once when the server starts and stays in memory for all subsequent requests, which is a basic but important architectural decision. Model deserialization is expensive, and doing it per-request is the kind of thing that works fine when you're the only user but collapses under any concurrent load.
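The same load-once idea can also be expressed with FastAPI's lifespan hook instead of the module-level load shown earlier; a minimal sketch:

```python
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Deserialize exactly once, before the first request is served.
    app.state.model = joblib.load("fraud_rf.joblib")
    yield

app = FastAPI(lifespan=lifespan)
```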

Frontend

The frontend is a Next.js application with two main flows accessible from the landing page.

Landing page

Both flows live on the same upload page: drop a CSV for batch analysis, or fill in the six fields manually for a single prediction.

Upload page — batch CSV upload and single transaction form

Batch analysis runs the full CSV through the model and returns a paginated results table. Each row shows the fraud probability and a status badge. The pagination keeps things responsive even for large files.

Batch results — page with mixed clear and warning transactions

Batch results — page with flagged fraud transactions at 100% probability

Single transaction returns an inline result with the raw probability score. A binary "fraud/not fraud" label throws away the most useful information — a transaction at 51% and one at 98% are both flagged, but they mean very different things for whoever is reviewing them. The risk thresholds make that explicit:

Output thresholds — clear ≤50%, warning 50–85%, fraud >85%

There's also a dedicated model info page that pulls from the /model-info endpoint — showing the feature importance bars, the hyperparameters, and the preprocessing steps. It keeps the model transparent rather than treating it as a black box.

Closing

This project taught me that the interesting part of ML isn't the model; it's everything around it. The data exploration that tells you which features actually matter. The API design that determines whether your model is usable or just a pickle file on disk. The frontend decisions about what to show the user, like a probability score instead of a binary label.

If you're looking at this as a portfolio piece, the thing I'd want you to take away isn't that I trained a Random Forest; it's that I learned how a machine learning model is actually built end to end, and that data preparation is the most important phase.


The model was trained using approaches from Arjun Joshua's notebook and data exploration from Gabriel Preda's notebook, both on the PaySim Kaggle dataset.