feat: Analyze CWE rank trends (1999–2024)
- Added detailed analysis of the top 10 CWEs by cumulative rank over time.
- Visualized CWE rank trends with line plots, markers, and year highlights.
- Extracted key insights on rising, falling, and persistent CWEs.
- Refined dataset filtering for accuracy, excluding non-informative CWEs.
cak committed Jan 11, 2025
1 parent 6492afe commit 71e1f86
Showing 5 changed files with 794 additions and 9 deletions.
147 changes: 147 additions & 0 deletions markdown/cve_data_stories/cwe_trends/02_data_processing.md
@@ -0,0 +1,147 @@
---
jupyter:
jupytext:
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.16.6
kernelspec:
display_name: Python 3
language: python
name: python3
---

# CVE Data Stories: CWE Trends - Data Processing


```python
import csv
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path

import pandas as pd
```

# Paths Setup and Data Directories

We start by defining the paths to the raw NVD JSON feeds (2002–2024) and the target directory for processed data. A dictionary maps each feed year to its file name, and the output directory is created if it does not already exist.

```python
# Paths
DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}
data_folder = Path("../../../data/cve_data_stories/raw")

# Target directory for processed data
DATA_DIR = Path("../../../data/cve_data_stories/cwe_trends/processed")
DATA_DIR.mkdir(parents=True, exist_ok=True)

output_csv_yearly = DATA_DIR / "cwe_yearly_counts.csv"
output_csv_cumulative = DATA_DIR / "cwe_yearly_cumulative.csv"
```

# Collecting CWE Yearly Counts

This section processes the raw JSON datasets to extract CWE IDs and their associated publication years.

The key steps include:
1. Reading the JSON files.
2. Extracting CWE IDs and publication years from each CVE item.
3. Counting occurrences of each CWE ID by year.

The resulting yearly counts are stored in a dictionary keyed by `(CWE_ID, year)` for further processing.
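
For reference, each entry in the 1.1 feeds nests the CWE assignment several levels deep. A trimmed sketch of the structure the function below traverses (field values are illustrative, not taken from a real feed):

```python
# Trimmed, illustrative shape of one NVD 1.1 feed entry (values are made up).
feed_excerpt = {
    "CVE_Items": [
        {
            "publishedDate": "2024-03-15T14:30Z",
            "cve": {
                "problemtype": {
                    "problemtype_data": [
                        {"description": [{"lang": "en", "value": "CWE-79"}]}
                    ]
                }
            },
        }
    ]
}
```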

```python
def collect_cwe_yearly_counts(json_file, year_counts):
    """Tally CWE occurrences by publication year from a single NVD 1.1 JSON feed.

    Increments `year_counts[(cwe_id, year)]` in place for every CWE referenced
    by a CVE item in `json_file`.
    """
    try:
        with open(json_file, 'r') as f:
            data = json.load(f)

        for item in data.get('CVE_Items', []):
            published_date = item.get('publishedDate', None)

            # Parse year from the published date
            if published_date:
                pub_year = datetime.strptime(published_date, "%Y-%m-%dT%H:%MZ").year
            else:
                continue  # Skip if no published date

            # Extract CWE IDs
            cwe_ids = item.get('cve', {}).get('problemtype', {}).get('problemtype_data', [])
            for cwe_entry in cwe_ids:
                for desc in cwe_entry.get('description', []):
                    cwe = desc.get('value', '')  # Get CWE ID (e.g., CWE-79)
                    if cwe:
                        year_counts[(cwe, pub_year)] += 1

    except FileNotFoundError:
        print(f"File not found: {json_file}")
    except json.JSONDecodeError:
        print(f"Error decoding JSON: {json_file}")
    except Exception as e:
        print(f"An error occurred: {e}")


# Initialize defaultdict to hold CWE yearly counts
cwe_yearly_counts = defaultdict(int)

# Process each dataset
for year, file_name in DATASETS.items():
    input_file = data_folder / file_name
    print(f"Processing {input_file}")
    collect_cwe_yearly_counts(input_file, cwe_yearly_counts)

# Write CWE yearly counts to a CSV
with open(output_csv_yearly, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["CWE_ID", "Year", "Count"])  # Header row
    for (cwe_id, year), count in sorted(cwe_yearly_counts.items()):
        writer.writerow([cwe_id, year, count])

print(f"Yearly CWE counts written to {output_csv_yearly}")
```




# Preparing Yearly and Cumulative Counts

The yearly counts are loaded and preprocessed to ensure continuity in the timeline for each CWE ID. Missing years are filled with zero counts, and cumulative counts are calculated for each CWE over time.

The final dataset includes:
1. CWE ID
2. Year
3. Yearly Count
4. Cumulative Count

The processed data is saved to a CSV file for further analysis and visualization.
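
To make the fill-and-accumulate step concrete, here is a toy example on a single hypothetical CWE (the IDs, years, and counts are made up): missing years get a zero count, and the cumulative column carries the running total forward.

```python
# Toy demonstration of the fill-and-accumulate step (illustrative data only).
toy = pd.DataFrame({"CWE_ID": ["CWE-79", "CWE-79"], "Year": [2003, 2005], "Count": [2, 1]})

# Build the complete year index for this CWE and fill the gap (2004) with 0.
idx = pd.MultiIndex.from_product([["CWE-79"], range(2003, 2006)], names=["CWE_ID", "Year"])
toy = pd.merge(pd.DataFrame(index=idx).reset_index(), toy,
               on=["CWE_ID", "Year"], how="left").fillna({"Count": 0})

toy["Cumulative_Count"] = toy.groupby("CWE_ID")["Count"].cumsum().astype(int)
print(toy)  # 2004 appears with Count 0, and the cumulative column reads 2, 2, 3.
```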

```python
# Load the yearly counts CSV
df = pd.read_csv(output_csv_yearly)

# Generate all years for each CWE
cwes = df["CWE_ID"].unique()
years = list(range(df["Year"].min(), df["Year"].max() + 1))

# Create a complete index for CWEs and years
full_index = pd.MultiIndex.from_product([cwes, years], names=["CWE_ID", "Year"])
df_full = pd.DataFrame(index=full_index).reset_index()

# Merge with original data, filling missing counts with 0
df = pd.merge(df_full, df, on=["CWE_ID", "Year"], how="left").fillna({"Count": 0})

# Sort by CWE ID and year
df = df.sort_values(by=["CWE_ID", "Year"])

# Calculate cumulative counts
df["Cumulative_Count"] = df.groupby("CWE_ID")["Count"].cumsum().astype(int)

# Save the final dataset
df.to_csv(output_csv_cumulative, index=False)

print(f"Cumulative counts saved to {output_csv_cumulative}")

```
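
As a quick sanity check (a sketch that reuses `pd` and `output_csv_cumulative` from the cells above), we can confirm that every CWE now has a gap-free year range and that its cumulative count never decreases:

```python
# Sanity checks on the cumulative dataset (sketch; reuses variables defined above).
check = pd.read_csv(output_csv_cumulative)

# Every CWE should have an entry for every year in the covered range.
expected_years = set(range(check["Year"].min(), check["Year"].max() + 1))
missing = check.groupby("CWE_ID")["Year"].apply(lambda y: len(expected_years - set(y)))
print(f"CWEs with missing years: {(missing > 0).sum()}")

# Cumulative counts should never decrease within a CWE.
decreasing = (
    check.sort_values(["CWE_ID", "Year"])
    .groupby("CWE_ID")["Cumulative_Count"]
    .apply(lambda s: not s.is_monotonic_increasing)
)
print(f"CWEs with non-monotonic cumulative counts: {decreasing.sum()}")
```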
151 changes: 151 additions & 0 deletions markdown/cve_data_stories/cwe_trends/03_analysis.md
@@ -0,0 +1,151 @@
---
jupyter:
jupytext:
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.16.6
kernelspec:
display_name: Python 3
language: python
name: python3
---

# CVE Data Stories: CWE Trends - Data Analysis

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```

## Preparing the Top 10 CWE Dataset (1999–2024)

This dataset focuses on the **Top 10 CWEs** based on cumulative counts up to 2024, providing a clear view of the most prevalent vulnerabilities over time. The preparation process includes:

- **Data Filtering**:
- Excluded `NVD-CWE-noinfo` and `NVD-CWE-Other` for cleaner analysis.
- Focused on data between **1999 and 2024**, explicitly excluding 2025.
- **Top 10 CWEs Selection**: Identified CWEs with the highest cumulative counts in 2024.
- **Streamlined Dataset**: Retained only relevant entries for the Top 10 CWEs across the years.

This refined dataset is saved for further analysis, enabling impactful visualizations and insights into long-term CWE trends.


```python
# Load the dataset
data = pd.read_csv("../../../data/cve_data_stories/cwe_trends/processed/cwe_yearly_cumulative.csv")

# Filter out `NVD-CWE-noinfo` and `NVD-CWE-Other` CWEs
data = data[~data["CWE_ID"].str.startswith("NVD-CWE")]

# Filter years after 1999 and before 2025
data = data[(data["Year"] >= 1999) & (data["Year"] < 2025)]

# Filter for the top 10 CWEs by cumulative count in 2024
top_cwes_2024 = data[data["Year"] == 2024].sort_values("Cumulative_Count", ascending=False).head(10)
top_cwes_ids = top_cwes_2024["CWE_ID"].tolist()

# Filter dataset for only the top 10 CWEs and exclude 2025 explicitly
filtered_data = data[(data["CWE_ID"].isin(top_cwes_ids)) & (data["Year"] < 2025)].copy()

# Save the final dataset
filtered_data.to_csv("../../../data/cve_data_stories/cwe_trends/processed/top_10_cwe_yearly_cumulative.csv",
                     index=False)
```
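
A quick check (a sketch reusing the `filtered_data` frame built above) confirms the dataset contains exactly ten CWEs spanning 1999–2024, and shows their cumulative counts as of 2024:

```python
# Sanity check on the Top 10 dataset (sketch; reuses `filtered_data` from above).
assert filtered_data["CWE_ID"].nunique() == 10, "Expected exactly 10 CWEs"
print(filtered_data["Year"].min(), filtered_data["Year"].max())  # Expected: 1999 2024

# Cumulative counts of the Top 10 CWEs as of 2024, highest first.
print(
    filtered_data[filtered_data["Year"] == 2024]
    .sort_values("Cumulative_Count", ascending=False)[["CWE_ID", "Cumulative_Count"]]
    .to_string(index=False)
)
```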

## Top 10 CWE Rank Trends (1999–2024)

This plot visualizes the **Top 10 CWEs** by rank over time, highlighting their evolution from 1999 to 2024. Each CWE’s line is color-coded, with key features to enhance clarity and impact:

- **End Markers & Labels**: Final ranks in 2024 are highlighted with markers and labeled directly for easy interpretation.
- **Inverted Y-Axis**: Rank 1 sits at the top, so the most frequently reported CWEs appear highest on the chart.
- **Highlighted Years**: Dashed vertical lines mark notable years (2007, 2017, 2018, 2022, 2024).
- **Readable Design**: A vibrant color palette, clear gridlines, and padding ensure visual appeal and clarity for sharing.

```python
# Step 1: Calculate ranks for each year
filtered_data["Rank"] = (
filtered_data.groupby("Year")["Cumulative_Count"]
.rank(method="dense", ascending=False)
.astype(int)
)

# Sort CWEs by their final rank (2024)
final_ranks = (
    filtered_data[filtered_data["Year"] == 2024]
    .sort_values("Rank")[["CWE_ID", "Rank"]]
    .set_index("CWE_ID")
)
filtered_data["Final_Rank"] = filtered_data["CWE_ID"].map(final_ranks["Rank"])

# Step 2: Sort data by final rank
filtered_data = filtered_data.sort_values(["Final_Rank", "Year"])

# Step 3: Plotting the rank trends
plt.figure(figsize=(16, 10)) # Larger figure size for better readability
sns.set_style("whitegrid")

# Use a vibrant color palette
palette = sns.color_palette("husl", len(filtered_data["CWE_ID"].unique()))

# Plot each CWE line with markers
for i, (cwe_id, cwe_data) in enumerate(filtered_data.groupby("CWE_ID")):
    plt.plot(
        cwe_data["Year"],
        cwe_data["Rank"],
        color=palette[i],
        label=cwe_id,
        linewidth=5.5,
        alpha=0.9,
    )
    # Add markers at the end of each line
    plt.scatter(
        cwe_data["Year"].iloc[-1],  # Last year
        cwe_data["Rank"].iloc[-1],  # Last rank
        color=palette[i],
        edgecolor="black",
        s=100,  # Marker size
        zorder=5,
    )
    # Add right-side labels with additional spacing
    plt.text(
        cwe_data["Year"].iloc[-1] + 0.25,  # Offset for label spacing
        cwe_data["Rank"].iloc[-1],
        cwe_id,
        fontsize=12,
        weight="bold",
        color=palette[i],
        verticalalignment="center",
    )

# Invert y-axis to show rank 1 at top
plt.gca().invert_yaxis()

# TITLES: Main title + optional subtitle for clarity
plt.title("Top 10 CWE Rank Trends Over Time\n(1999–2024)", fontsize=26, weight="bold", pad=20)

# Axis labels and ticks
plt.xticks(ticks=range(1999, 2025), fontsize=12)
plt.yticks(range(1, 11), fontsize=12) # showing ranks 1 to 10

# Adjust x-axis limits to provide padding for dots and labels
plt.xlim(1999, 2025)

# Remove legend since lines are labeled directly on the right
plt.legend([], [], frameon=False)

# Gridlines
plt.grid(visible=True, linestyle="--", linewidth=0.5, alpha=0.7)

# Highlight specific years with vertical lines
highlight_years = [2007, 2017, 2018, 2022, 2024]
for year in highlight_years:
    plt.axvline(x=year, color="gray", linestyle="--", linewidth=1, alpha=0.4)

plt.tight_layout()
plt.savefig("../../../data/cve_data_stories/cwe_trends/processed/top_10_cwe_rank_trends.png", dpi=300,
            bbox_inches="tight")
plt.show()
```
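
To surface the rising, falling, and persistent CWEs noted in the commit message, one simple option (a sketch reusing the `filtered_data` frame and its `Rank` column from the plotting step; variable names here are illustrative) is to compare each CWE's rank at the start and end of the window:

```python
# Rank movement across the window; positive values mean the CWE climbed the ranking.
# Note: ranks here are computed within the Top 10 subset, not across all CWEs.
first_year = filtered_data["Year"].min()
last_year = filtered_data["Year"].max()

start_ranks = filtered_data.loc[filtered_data["Year"] == first_year].set_index("CWE_ID")["Rank"]
end_ranks = filtered_data.loc[filtered_data["Year"] == last_year].set_index("CWE_ID")["Rank"]

movement = (start_ranks - end_ranks).rename("Rank_Change")
print(movement.sort_values(ascending=False).to_string())
```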
18 changes: 9 additions & 9 deletions notebooks/cve_data_stories/01_data_collection.ipynb
@@ -9,8 +9,8 @@
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-04T17:52:27.003359Z",
"start_time": "2025-01-04T17:52:26.999031Z"
"end_time": "2025-01-11T10:58:00.353617Z",
"start_time": "2025-01-11T10:58:00.217081Z"
}
},
"cell_type": "code",
@@ -22,7 +22,7 @@
],
"id": "f0ea410ba01c8838",
"outputs": [],
"execution_count": 5
"execution_count": 1
},
{
"metadata": {},
@@ -39,8 +39,8 @@
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-04T17:52:27.026069Z",
"start_time": "2025-01-04T17:52:27.022020Z"
"end_time": "2025-01-11T10:58:00.363251Z",
"start_time": "2025-01-11T10:58:00.359895Z"
}
},
"cell_type": "code",
@@ -51,7 +51,7 @@
],
"id": "99e5bc4542e6d1d7",
"outputs": [],
"execution_count": 6
"execution_count": 2
},
{
"cell_type": "markdown",
@@ -73,8 +73,8 @@
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-04T17:52:37.079374Z",
"start_time": "2025-01-04T17:52:27.049230Z"
"end_time": "2025-01-11T10:58:08.702411Z",
"start_time": "2025-01-11T10:58:00.637617Z"
}
},
"cell_type": "code",
@@ -198,7 +198,7 @@
]
}
],
"execution_count": 7
"execution_count": 3
}
],
"metadata": {
