This repository provides tools to gather, preprocess, and cluster articles based on their content. The main focus is to retrieve article data from web sources, clean and embed the text, and apply clustering techniques to group similar articles. This project can be useful for research, content management, or recommendation systems.
The project workflow includes:
- Web Scraping: Extract links to articles, download content, and parse it.
- Data Preprocessing: Normalize, tokenize, and clean article titles.
- Embedding: Convert titles into embeddings for clustering.
- Clustering: Apply clustering algorithms on the embeddings to group articles.
To run this project, clone the repository and install the dependencies:
git clone https://github.com/Bahram03/article-clustering.git
cd article-clustering
pip install -r requirements.txt
Follow these steps to perform article clustering.
The script find_articles_links.py is used to collect links to articles from a specified website. Run this file to extract links to the various volumes and issues.
python find_articles_links.py
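The sketch below illustrates the kind of link collection this step performs, using requests and BeautifulSoup. The archive URL and the "/article/" path pattern are placeholders for the example; the actual URL and selectors live in find_articles_links.py.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical archive page; the real URL is configured in find_articles_links.py.
ARCHIVE_URL = "https://example-journal.org/archive"

def collect_article_links(archive_url: str) -> list[str]:
    """Fetch the archive page and return absolute links that look like article pages."""
    response = requests.get(archive_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumed pattern: article pages contain "/article/" in their path.
        if "/article/" in href:
            links.append(urljoin(archive_url, href))
    return links

if __name__ == "__main__":
    for link in collect_article_links(ARCHIVE_URL):
        print(link)
```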
After obtaining the article links, use get_articles.py to download and parse each article, including its title, authors, abstract, and keywords. The data is saved to a CSV file (articles.csv).
python get_articles.py
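For illustration, here is a hedged sketch of parsing a single article page and writing the results to articles.csv. The tag names and CSS classes (h1, .author, .abstract, .keyword) are assumptions for the example, not the selectors the script actually uses.

```python
import csv
import requests
from bs4 import BeautifulSoup

def parse_article(url: str) -> dict:
    """Download one article page and extract its metadata (selectors are assumed)."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return {
        "title": soup.find("h1").get_text(strip=True),
        "authors": "; ".join(a.get_text(strip=True) for a in soup.select(".author")),
        "abstract": soup.find("div", class_="abstract").get_text(strip=True),
        "keywords": "; ".join(k.get_text(strip=True) for k in soup.select(".keyword")),
    }

def save_articles(urls: list[str], path: str = "articles.csv") -> None:
    """Parse every article link and write the rows to a CSV file."""
    rows = [parse_article(u) for u in urls]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "authors", "abstract", "keywords"])
        writer.writeheader()
        writer.writerows(rows)
```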
Clean and tokenize the article titles using title_cleaner.py. This script normalizes the titles, tokenizes them, and removes unnecessary words based on their parts of speech. The cleaned titles are saved for later embedding.
python title_cleaner.py
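A minimal sketch of this cleaning step, assuming hazm is used for the Persian text (it is listed in the requirements). The POS model path and the set of tags that are kept are assumptions; title_cleaner.py may filter differently.

```python
from hazm import Normalizer, POSTagger, word_tokenize

# The model path and the kept tag set are assumptions for this sketch;
# the exact tag labels also depend on the installed hazm version.
normalizer = Normalizer()
tagger = POSTagger(model="pos_tagger.model")
KEEP_TAGS = {"NOUN", "ADJ"}  # assumed: keep content-bearing parts of speech

def clean_title(title: str) -> list[str]:
    """Normalize a title, tokenize it, and keep only selected parts of speech."""
    tokens = word_tokenize(normalizer.normalize(title))
    return [word for word, tag in tagger.tag(tokens) if tag.split(",")[0] in KEEP_TAGS]
```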
Run embedding.py to generate embeddings from the cleaned titles using Word2Vec. These embeddings are then used to cluster similar articles.
python embedding.py
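Since the titles are embedded with gensim's Word2Vec, a common approach is to train on the tokenized titles and average each title's word vectors to get a title-level embedding. The sketch below follows that approach; the hyperparameters and the averaging step are assumptions for the example, not necessarily what embedding.py does.

```python
import numpy as np
from gensim.models import Word2Vec

def embed_titles(tokenized_titles: list[list[str]], vector_size: int = 100) -> np.ndarray:
    """Train Word2Vec on the cleaned titles and average word vectors per title."""
    model = Word2Vec(sentences=tokenized_titles, vector_size=vector_size,
                     window=5, min_count=1, workers=4)
    embeddings = []
    for tokens in tokenized_titles:
        vectors = [model.wv[t] for t in tokens if t in model.wv]
        # Titles whose tokens were all filtered out fall back to a zero vector.
        embeddings.append(np.mean(vectors, axis=0) if vectors else np.zeros(vector_size))
    return np.vstack(embeddings)
```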
Use the clustering.ipynb notebook to load the embeddings, apply clustering algorithms, and visualize the resulting groups of articles. Open the notebook in Jupyter to explore the clustering interactively.
jupyter notebook clustering.ipynb
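If you want a non-interactive starting point, the following sketch clusters the embeddings with k-means from scikit-learn and prints a silhouette score. The choice of k-means and the number of clusters are placeholders; the notebook may use a different algorithm or pick the cluster count differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_embeddings(embeddings: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Group title embeddings with k-means and report a simple quality score."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(embeddings)
    print("silhouette score:", silhouette_score(embeddings, labels))
    return labels
```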
- find_articles_links.py: Scrapes article links from the specified volumes and issues on the website.
- get_articles.py: Downloads article details (title, abstract, authors, keywords) and stores them in articles.csv.
- title_cleaner.py: Normalizes and tokenizes article titles, filtering out unnecessary parts of speech.
- embedding.py: Generates embeddings for the titles using Word2Vec and stores them for clustering.
- clustering.ipynb: A Jupyter notebook for performing and visualizing clustering on the article embeddings.
- Python 3.x
- gensim
- pandas
- numpy
- beautifulsoup4
- requests
- hazm (for Persian text processing)
- scikit-learn (for clustering)
Install dependencies with:
pip install gensim pandas numpy beautifulsoup4 requests hazm scikit-learn
This project is licensed under the MIT License. See the LICENSE file for details.