feat: Analyze CWE rank trends (1999–2024)
- Added detailed analysis of the top 10 CWEs by cumulative rank over time.
- Visualized CWE rank trends with line plots, markers, and year highlights.
- Extracted key insights on rising, falling, and persistent CWEs.
- Refined dataset filtering for accuracy, excluding non-informative CWEs.
cak committed Jan 11, 2025
1 parent 6492afe commit 71e1f86
Showing 5 changed files with 794 additions and 9 deletions.
147 changes: 147 additions & 0 deletions markdown/cve_data_stories/cwe_trends/02_data_processing.md
@@ -0,0 +1,147 @@
---
jupyter:
jupytext:
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.16.6
kernelspec:
display_name: Python 3
language: python
name: python3
---

# CVE Data Stories: CWE Trends - Data Processing


```python
import csv
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path

import pandas as pd
```

# Paths Setup and Data Directories

We start by defining the paths to the raw NVD JSON feeds (2002–2024) and the target directory for processed data. A dictionary maps each feed year to its file name, and the output directory is created if it does not already exist.

```python
# Paths
DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}
data_folder = Path("../../../data/cve_data_stories/raw")

# Target directory for processed data
DATA_DIR = Path("../../../data/cve_data_stories/cwe_trends/processed")
DATA_DIR.mkdir(parents=True, exist_ok=True)

output_csv_yearly = DATA_DIR / "cwe_yearly_counts.csv"
output_csv_cumulative = DATA_DIR / "cwe_yearly_cumulative.csv"
```

# Collecting CWE Yearly Counts

This section processes the raw JSON datasets to extract CWE IDs and their associated publication years.

The key steps include:
1. Reading the JSON files.
2. Extracting CWE IDs and publication years from each CVE item.
3. Counting occurrences of each CWE ID by year.

The resulting yearly counts are stored in a dictionary keyed by `(CWE_ID, year)` for further processing.
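
For reference, each entry in the 1.1 feeds nests the CWE assignment several levels deep. A trimmed sketch of the structure the function below traverses (field values are illustrative, not taken from a real feed):

```python
# Trimmed, illustrative shape of one NVD 1.1 feed entry (values are made up).
feed_excerpt = {
    "CVE_Items": [
        {
            "publishedDate": "2024-03-15T14:30Z",
            "cve": {
                "problemtype": {
                    "problemtype_data": [
                        {"description": [{"lang": "en", "value": "CWE-79"}]}
                    ]
                }
            },
        }
    ]
}
```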

```python
def collect_cwe_yearly_counts(json_file, year_counts):
    """Tally CWE occurrences by publication year from a single NVD 1.1 JSON feed.

    Increments `year_counts[(cwe_id, year)]` in place for every CWE referenced
    by a CVE item in `json_file`.
    """
    try:
        with open(json_file, 'r') as f:
            data = json.load(f)

        for item in data.get('CVE_Items', []):
            published_date = item.get('publishedDate', None)

            # Parse year from the published date
            if published_date:
                pub_year = datetime.strptime(published_date, "%Y-%m-%dT%H:%MZ").year
            else:
                continue  # Skip if no published date

            # Extract CWE IDs
            cwe_ids = item.get('cve', {}).get('problemtype', {}).get('problemtype_data', [])
            for cwe_entry in cwe_ids:
                for desc in cwe_entry.get('description', []):
                    cwe = desc.get('value', '')  # Get CWE ID (e.g., CWE-79)
                    if cwe:
                        year_counts[(cwe, pub_year)] += 1

    except FileNotFoundError:
        print(f"File not found: {json_file}")
    except json.JSONDecodeError:
        print(f"Error decoding JSON: {json_file}")
    except Exception as e:
        print(f"An error occurred: {e}")


# Initialize defaultdict to hold CWE yearly counts
cwe_yearly_counts = defaultdict(int)

# Process each dataset
for year, file_name in DATASETS.items():
    input_file = data_folder / file_name
    print(f"Processing {input_file}")
    collect_cwe_yearly_counts(input_file, cwe_yearly_counts)

# Write CWE yearly counts to a CSV
with open(output_csv_yearly, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["CWE_ID", "Year", "Count"])  # Header row
    for (cwe_id, year), count in sorted(cwe_yearly_counts.items()):
        writer.writerow([cwe_id, year, count])

print(f"Yearly CWE counts written to {output_csv_yearly}")
```




# Preparing Yearly and Cumulative Counts

The yearly counts are loaded and preprocessed to ensure continuity in the timeline for each CWE ID. Missing years are filled with zero counts, and cumulative counts are calculated for each CWE over time.

The final dataset includes:
1. CWE ID
2. Year
3. Yearly Count
4. Cumulative Count

The processed data is saved to a CSV file for further analysis and visualization.
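
To make the fill-and-accumulate step concrete, here is a toy example on a single hypothetical CWE (the IDs, years, and counts are made up): missing years get a zero count, and the cumulative column carries the running total forward.

```python
# Toy demonstration of the fill-and-accumulate step (illustrative data only).
toy = pd.DataFrame({"CWE_ID": ["CWE-79", "CWE-79"], "Year": [2003, 2005], "Count": [2, 1]})

# Build the complete year index for this CWE and fill the gap (2004) with 0.
idx = pd.MultiIndex.from_product([["CWE-79"], range(2003, 2006)], names=["CWE_ID", "Year"])
toy = pd.merge(pd.DataFrame(index=idx).reset_index(), toy,
               on=["CWE_ID", "Year"], how="left").fillna({"Count": 0})

toy["Cumulative_Count"] = toy.groupby("CWE_ID")["Count"].cumsum().astype(int)
print(toy)  # 2004 appears with Count 0, and the cumulative column reads 2, 2, 3.
```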

```python
# Load the yearly counts CSV
df = pd.read_csv(output_csv_yearly)

# Generate all years for each CWE
cwes = df["CWE_ID"].unique()
years = list(range(df["Year"].min(), df["Year"].max() + 1))

# Create a complete index for CWEs and years
full_index = pd.MultiIndex.from_product([cwes, years], names=["CWE_ID", "Year"])
df_full = pd.DataFrame(index=full_index).reset_index()

# Merge with original data, filling missing counts with 0
df = pd.merge(df_full, df, on=["CWE_ID", "Year"], how="left").fillna({"Count": 0})

# Sort by CWE ID and year
df = df.sort_values(by=["CWE_ID", "Year"])

# Calculate cumulative counts
df["Cumulative_Count"] = df.groupby("CWE_ID")["Count"].cumsum().astype(int)

# Save the final dataset
df.to_csv(output_csv_cumulative, index=False)

print(f"Cumulative counts saved to {output_csv_cumulative}")

```
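
As a quick sanity check (a sketch that reuses `pd` and `output_csv_cumulative` from the cells above), we can confirm that every CWE now has a gap-free year range and that its cumulative count never decreases:

```python
# Sanity checks on the cumulative dataset (sketch; reuses variables defined above).
check = pd.read_csv(output_csv_cumulative)

# Every CWE should have an entry for every year in the covered range.
expected_years = set(range(check["Year"].min(), check["Year"].max() + 1))
missing = check.groupby("CWE_ID")["Year"].apply(lambda y: len(expected_years - set(y)))
print(f"CWEs with missing years: {(missing > 0).sum()}")

# Cumulative counts should never decrease within a CWE.
decreasing = (
    check.sort_values(["CWE_ID", "Year"])
    .groupby("CWE_ID")["Cumulative_Count"]
    .apply(lambda s: not s.is_monotonic_increasing)
)
print(f"CWEs with non-monotonic cumulative counts: {decreasing.sum()}")
```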
151 changes: 151 additions & 0 deletions markdown/cve_data_stories/cwe_trends/03_analysis.md
@@ -0,0 +1,151 @@
---
jupyter:
jupytext:
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.16.6
kernelspec:
display_name: Python 3
language: python
name: python3
---

# CVE Data Stories: CWE Trends - Data Analysis

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```

## Preparing the Top 10 CWE Dataset (1999–2024)

This dataset focuses on the **Top 10 CWEs** based on cumulative counts up to 2024, providing a clear view of the most prevalent vulnerabilities over time. The preparation process includes:

- **Data Filtering**:
- Excluded `NVD-CWE-noinfo` and `NVD-CWE-Other` for cleaner analysis.
- Focused on data between **1999 and 2024**, explicitly excluding 2025.
- **Top 10 CWEs Selection**: Identified CWEs with the highest cumulative counts in 2024.
- **Streamlined Dataset**: Retained only relevant entries for the Top 10 CWEs across the years.

This refined dataset is saved for further analysis, enabling impactful visualizations and insights into long-term CWE trends.


```python
# Load the dataset
data = pd.read_csv("../../../data/cve_data_stories/cwe_trends/processed/cwe_yearly_cumulative.csv")

# Filter out `NVD-CWE-noinfo` and `NVD-CWE-Other` CWEs
data = data[~data["CWE_ID"].str.startswith("NVD-CWE")]

# Filter years after 1999 and before 2025
data = data[(data["Year"] >= 1999) & (data["Year"] < 2025)]

# Filter for the top 10 CWEs by cumulative count in 2024
top_cwes_2024 = data[data["Year"] == 2024].sort_values("Cumulative_Count", ascending=False).head(10)
top_cwes_ids = top_cwes_2024["CWE_ID"].tolist()

# Filter dataset for only the top 10 CWEs and exclude 2025 explicitly
filtered_data = data[(data["CWE_ID"].isin(top_cwes_ids)) & (data["Year"] < 2025)].copy()

# Save the final dataset
filtered_data.to_csv("../../../data/cve_data_stories/cwe_trends/processed/top_10_cwe_yearly_cumulative.csv",
                     index=False)
```
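
A quick check (a sketch reusing the `filtered_data` frame built above) confirms the dataset contains exactly ten CWEs spanning 1999–2024, and shows their cumulative counts as of 2024:

```python
# Sanity check on the Top 10 dataset (sketch; reuses `filtered_data` from above).
assert filtered_data["CWE_ID"].nunique() == 10, "Expected exactly 10 CWEs"
print(filtered_data["Year"].min(), filtered_data["Year"].max())  # Expected: 1999 2024

# Cumulative counts of the Top 10 CWEs as of 2024, highest first.
print(
    filtered_data[filtered_data["Year"] == 2024]
    .sort_values("Cumulative_Count", ascending=False)[["CWE_ID", "Cumulative_Count"]]
    .to_string(index=False)
)
```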

## Top 10 CWE Rank Trends (1999–2024)

This plot visualizes the **Top 10 CWEs** by rank over time, highlighting their evolution from 1999 to 2024. Each CWE’s line is color-coded, with key features to enhance clarity and impact:

- **End Markers & Labels**: Final ranks in 2024 are highlighted with markers and labeled directly for easy interpretation.
- **Inverted Y-Axis**: Rank 1 sits at the top, so the most frequently reported CWEs appear highest on the chart.
- **Highlighted Years**: Dashed vertical lines mark notable years (2007, 2017, 2018, 2022, 2024).
- **Readable Design**: A vibrant color palette, clear gridlines, and padding ensure visual appeal and clarity for sharing.

```python
# Step 1: Calculate ranks for each year
filtered_data["Rank"] = (
filtered_data.groupby("Year")["Cumulative_Count"]
.rank(method="dense", ascending=False)
.astype(int)
)

# Sort CWEs by their final rank (2024)
final_ranks = (
    filtered_data[filtered_data["Year"] == 2024]
    .sort_values("Rank")[["CWE_ID", "Rank"]]
    .set_index("CWE_ID")
)
filtered_data["Final_Rank"] = filtered_data["CWE_ID"].map(final_ranks["Rank"])

# Step 2: Sort data by final rank
filtered_data = filtered_data.sort_values(["Final_Rank", "Year"])

# Step 3: Plotting the rank trends
plt.figure(figsize=(16, 10)) # Larger figure size for better readability
sns.set_style("whitegrid")

# Use a vibrant color palette
palette = sns.color_palette("husl", len(filtered_data["CWE_ID"].unique()))

# Plot each CWE line with markers
for i, (cwe_id, cwe_data) in enumerate(filtered_data.groupby("CWE_ID")):
    plt.plot(
        cwe_data["Year"],
        cwe_data["Rank"],
        color=palette[i],
        label=cwe_id,
        linewidth=5.5,
        alpha=0.9,
    )
    # Add markers at the end of each line
    plt.scatter(
        cwe_data["Year"].iloc[-1],  # Last year
        cwe_data["Rank"].iloc[-1],  # Last rank
        color=palette[i],
        edgecolor="black",
        s=100,  # Marker size
        zorder=5,
    )
    # Add right-side labels with additional spacing
    plt.text(
        cwe_data["Year"].iloc[-1] + 0.25,  # Offset for label spacing
        cwe_data["Rank"].iloc[-1],
        cwe_id,
        fontsize=12,
        weight="bold",
        color=palette[i],
        verticalalignment="center",
    )

# Invert y-axis to show rank 1 at top
plt.gca().invert_yaxis()

# TITLES: Main title + optional subtitle for clarity
plt.title("Top 10 CWE Rank Trends Over Time\n(1999–2024)", fontsize=26, weight="bold", pad=20)

# Axis labels and ticks
plt.xticks(ticks=range(1999, 2025), fontsize=12)
plt.yticks(range(1, 11), fontsize=12) # showing ranks 1 to 10

# Adjust x-axis limits to provide padding for dots and labels
plt.xlim(1999, 2025)

# Remove legend since lines are labeled directly on the right
plt.legend([], [], frameon=False)

# Gridlines
plt.grid(visible=True, linestyle="--", linewidth=0.5, alpha=0.7)

# Highlight specific years with vertical lines
highlight_years = [2007, 2017, 2018, 2022, 2024]
for year in highlight_years:
    plt.axvline(x=year, color="gray", linestyle="--", linewidth=1, alpha=0.4)

plt.tight_layout()
plt.savefig("../../../data/cve_data_stories/cwe_trends/processed/top_10_cwe_rank_trends.png", dpi=300,
            bbox_inches="tight")
plt.show()
```
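
To surface the rising, falling, and persistent CWEs noted in the commit message, one simple option (a sketch reusing the `filtered_data` frame and its `Rank` column from the plotting step; variable names here are illustrative) is to compare each CWE's rank at the start and end of the window:

```python
# Rank movement across the window; positive values mean the CWE climbed the ranking.
# Note: ranks here are computed within the Top 10 subset, not across all CWEs.
first_year = filtered_data["Year"].min()
last_year = filtered_data["Year"].max()

start_ranks = filtered_data.loc[filtered_data["Year"] == first_year].set_index("CWE_ID")["Rank"]
end_ranks = filtered_data.loc[filtered_data["Year"] == last_year].set_index("CWE_ID")["Rank"]

movement = (start_ranks - end_ranks).rename("Rank_Change")
print(movement.sort_values(ascending=False).to_string())
```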
18 changes: 9 additions & 9 deletions notebooks/cve_data_stories/01_data_collection.ipynb
@@ -9,8 +9,8 @@
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-04T17:52:27.003359Z",
"start_time": "2025-01-04T17:52:26.999031Z"
"end_time": "2025-01-11T10:58:00.353617Z",
"start_time": "2025-01-11T10:58:00.217081Z"
}
},
"cell_type": "code",
@@ -22,7 +22,7 @@
],
"id": "f0ea410ba01c8838",
"outputs": [],
"execution_count": 5
"execution_count": 1
},
{
"metadata": {},
@@ -39,8 +39,8 @@
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-04T17:52:27.026069Z",
"start_time": "2025-01-04T17:52:27.022020Z"
"end_time": "2025-01-11T10:58:00.363251Z",
"start_time": "2025-01-11T10:58:00.359895Z"
}
},
"cell_type": "code",
@@ -51,7 +51,7 @@
],
"id": "99e5bc4542e6d1d7",
"outputs": [],
"execution_count": 6
"execution_count": 2
},
{
"cell_type": "markdown",
@@ -73,8 +73,8 @@
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-04T17:52:37.079374Z",
"start_time": "2025-01-04T17:52:27.049230Z"
"end_time": "2025-01-11T10:58:08.702411Z",
"start_time": "2025-01-11T10:58:00.637617Z"
}
},
"cell_type": "code",
@@ -198,7 +198,7 @@
]
}
],
"execution_count": 7
"execution_count": 3
}
],
"metadata": {
