This repository contains the code for training a machine learning model for phishing URL detection. The dataset used and the latest model are hosted on Hugging Face:
- Dataset: https://huggingface.co/datasets/pirocheto/phishing-url
- Model: https://huggingface.co/pirocheto/phishing-url-detection
ℹ️ You can test the model on the demo page here.
The model combines TF-IDF vectorization (character n-grams and word n-grams) with a linear SVM classifier.
✅ Lightweight: The model is easy to handle and can be embedded directly in your application, with no remote server needed to host it.
✅ Fast: Inference runs locally, so your application incurs no network latency from model calls.
✅ Works Offline: Because the model uses only tokens from the URL itself, it works without an internet connection.
On the other hand, it may be less accurate than more complex models or models that use external features.
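As a rough illustration of the architecture described above, the following sketch combines character and word TF-IDF features and feeds them to a linear SVM using scikit-learn. It is not the repository's actual training code: the toy URLs, the label names, the n-gram ranges, and the default SVM settings are all assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Made-up URLs and labels for illustration only (not taken from the real dataset).
urls = [
    "https://www.example.com/docs/getting-started",
    "https://blog.example.org/2023/05/release-notes",
    "http://secure-login.example-bank.verify-account.example/update.php",
    "http://account-support.example.net/signin/confirm?id=123",
]
labels = ["legitimate", "legitimate", "phishing", "phishing"]

# Two TF-IDF views of the same URL string, concatenated side by side.
# The n-gram ranges are assumptions, not the project's tuned values.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])

# The vectorizer and the linear SVM are chained so raw URLs can be passed in directly.
model = Pipeline([("tfidf", features), ("svm", LinearSVC())])
model.fit(urls, labels)

predictions = model.predict(urls)
print(list(predictions))
```

Because the whole model is a single scikit-learn pipeline operating on the URL string, it needs no network access or external lookups at prediction time.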
# 1. Clone the repository
git clone https://github.com/pirocheto/phishing-url-detection.git
# 2. Go inside the project
cd phishing-url-detection
# 3. Install dependencies
poetry install --no-root
# 4. Run the pipeline
dvc repro -s download_data
dvc repro -s train
For more details, see the pipeline in the dvc.yaml file.
- live: Artifacts created during pipeline execution
- notebooks: Contains the code for the exploration phase
- ressources: Miscellaneous resources used by scripts
- tests: Test files
- src: Python scripts
- params.yaml: Parameters for the DVC experiment
- dvc.yaml: Pipeline to run the experiment and reproduce executions
- DVC: Version data and experiments
- CML: Post a comment to the pull request showing the metrics and parameters of an experiment
- Scikit-Learn: Framework to train the model
- Optuna: Find the best hyperparameters for the model