This repository provides tools to gather, preprocess, and cluster articles based on their content. The main focus is to retrieve article data from web sources, clean and embed the text, and apply clustering techniques to group similar articles. This project can be useful for research, content management, or recommendation systems.
The project workflow includes:
- Web Scraping: Extract links to articles, download content, and parse it.
- Data Preprocessing: Normalize, tokenize, and clean article titles.
- Embedding: Convert titles into embeddings for clustering.
- Clustering: Apply clustering algorithms on the embeddings to group articles.
To run this project, clone the repository and install the dependencies:
git clone https://github.com/Bahram03/article-clustering.git
cd article-clustering
pip install -r requirements.txt
Follow these steps to perform article clustering.
The script find_articles_links.py is used to collect links to articles from a specified website. Run this file to extract links to the various volumes and issues.
python find_articles_links.py
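The sketch below illustrates the kind of link collection this step performs, using requests and BeautifulSoup. The archive URL and the "/article/" path pattern are placeholders for the example; the actual URL and selectors live in find_articles_links.py.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical archive page; the real URL is configured in find_articles_links.py.
ARCHIVE_URL = "https://example-journal.org/archive"

def collect_article_links(archive_url: str) -> list[str]:
    """Fetch the archive page and return absolute links that look like article pages."""
    response = requests.get(archive_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumed pattern: article pages contain "/article/" in their path.
        if "/article/" in href:
            links.append(urljoin(archive_url, href))
    return links

if __name__ == "__main__":
    for link in collect_article_links(ARCHIVE_URL):
        print(link)
```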
After obtaining the article links, use get_articles.py to download and parse each article, including its title, authors, abstract, and keywords. The data is saved to a CSV file (articles.csv).
python get_articles.py
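For illustration, here is a hedged sketch of parsing a single article page and writing the results to articles.csv. The tag names and CSS classes (h1, .author, .abstract, .keyword) are assumptions for the example, not the selectors the script actually uses.

```python
import csv
import requests
from bs4 import BeautifulSoup

def parse_article(url: str) -> dict:
    """Download one article page and extract its metadata (selectors are assumed)."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return {
        "title": soup.find("h1").get_text(strip=True),
        "authors": "; ".join(a.get_text(strip=True) for a in soup.select(".author")),
        "abstract": soup.find("div", class_="abstract").get_text(strip=True),
        "keywords": "; ".join(k.get_text(strip=True) for k in soup.select(".keyword")),
    }

def save_articles(urls: list[str], path: str = "articles.csv") -> None:
    """Parse every article link and write the rows to a CSV file."""
    rows = [parse_article(u) for u in urls]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "authors", "abstract", "keywords"])
        writer.writeheader()
        writer.writerows(rows)
```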
Clean and tokenize the article titles using title_cleaner.py. This script normalizes the titles, tokenizes them, and removes unnecessary words based on their parts of speech. The cleaned titles are saved for later embedding.
python title_cleaner.py
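A minimal sketch of this cleaning step, assuming hazm is used for the Persian text (it is listed in the requirements). The POS model path and the set of tags that are kept are assumptions; title_cleaner.py may filter differently.

```python
from hazm import Normalizer, POSTagger, word_tokenize

# The model path and the kept tag set are assumptions for this sketch;
# the exact tag labels also depend on the installed hazm version.
normalizer = Normalizer()
tagger = POSTagger(model="pos_tagger.model")
KEEP_TAGS = {"NOUN", "ADJ"}  # assumed: keep content-bearing parts of speech

def clean_title(title: str) -> list[str]:
    """Normalize a title, tokenize it, and keep only selected parts of speech."""
    tokens = word_tokenize(normalizer.normalize(title))
    return [word for word, tag in tagger.tag(tokens) if tag.split(",")[0] in KEEP_TAGS]
```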
Run embedding.py to generate embeddings from the cleaned titles using Word2Vec. These embeddings are then used to cluster similar articles.
python embedding.py
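Since the titles are embedded with gensim's Word2Vec, a common approach is to train on the tokenized titles and average each title's word vectors to get a title-level embedding. The sketch below follows that approach; the hyperparameters and the averaging step are assumptions for the example, not necessarily what embedding.py does.

```python
import numpy as np
from gensim.models import Word2Vec

def embed_titles(tokenized_titles: list[list[str]], vector_size: int = 100) -> np.ndarray:
    """Train Word2Vec on the cleaned titles and average word vectors per title."""
    model = Word2Vec(sentences=tokenized_titles, vector_size=vector_size,
                     window=5, min_count=1, workers=4)
    embeddings = []
    for tokens in tokenized_titles:
        vectors = [model.wv[t] for t in tokens if t in model.wv]
        # Titles whose tokens were all filtered out fall back to a zero vector.
        embeddings.append(np.mean(vectors, axis=0) if vectors else np.zeros(vector_size))
    return np.vstack(embeddings)
```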
Use the clustering.ipynb notebook to load the embeddings, apply clustering algorithms, and visualize the resulting groups of articles. Open the notebook in Jupyter to explore the clustering interactively.
jupyter notebook clustering.ipynb
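If you want a non-interactive starting point, the following sketch clusters the embeddings with k-means from scikit-learn and prints a silhouette score. The choice of k-means and the number of clusters are placeholders; the notebook may use a different algorithm or pick the cluster count differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_embeddings(embeddings: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Group title embeddings with k-means and report a simple quality score."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(embeddings)
    print("silhouette score:", silhouette_score(embeddings, labels))
    return labels
```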
- find_articles_links.py: Scrapes article links from the specified volumes and issues on the website.
- get_articles.py: Downloads article details (title, abstract, authors, keywords) and stores them in articles.csv.
- title_cleaner.py: Normalizes and tokenizes article titles, filtering out unnecessary parts of speech.
- embedding.py: Generates embeddings for the titles using Word2Vec and stores them for clustering.
- clustering.ipynb: A Jupyter notebook for performing and visualizing clustering on the article embeddings.
- Python 3.x
- gensim
- pandas
- numpy
- beautifulsoup4
- requests
- hazm (for Persian text processing)
- scikit-learn (for clustering)
Install dependencies with:
pip install gensim pandas numpy beautifulsoup4 requests hazm scikit-learn
This project is licensed under the MIT License. See the LICENSE file for details.