Commit e40388f

final commit added eda to streamlit and updated Readme.md file #406
1 parent b88e6bf commit e40388f

3 files changed: +190 -599 lines changed

opensource_analysis/OpenSourceEda.ipynb

-596
This file was deleted.

opensource_analysis/README

+79
@@ -18,3 +18,82 @@ streamlit run app.py

## Access the App
Open the URL http://localhost:8501 in your web browser to access the Streamlit app.

# Survey Data EDA and Machine Learning App

This repository contains a **Streamlit** application for exploring and analyzing developer survey data. The app performs **Exploratory Data Analysis (EDA)** and visualizes key features of the data. It can also be extended to support **machine learning** tasks such as prediction.

## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Dataset](#dataset)
- [Features](#features)
  - [1. Data Loading](#1-data-loading)
  - [2. Basic Information](#2-basic-information)
  - [3. Categorical Value Counts](#3-categorical-value-counts)
  - [4. Visualizations](#4-visualizations)
  - [5. Correlation Heatmap](#5-correlation-heatmap)
  - [6. Cumulative Distribution](#6-cumulative-distribution)
- [Usage](#usage)
- [Future Improvements](#future-improvements)
- [License](#license)

## Overview
This project provides an **interactive web application** for exploring a dataset of developer survey results. Built with **Streamlit**, it includes several exploratory data analysis features, such as visualizing the distributions of variables (e.g., salary, job satisfaction, age). It also shows relationships between factors, such as job satisfaction and company size, and can be extended to machine learning tasks.

## Dataset
The dataset used in this project is a sample from the **2018 Developer Survey Results**. It includes columns such as:
- `Country`: Respondent's country.
- `Employment`: Employment status of the respondent.
- `ConvertedSalary`: Salary converted into USD.
- `DevType`: Developer types (e.g., web developer, data scientist).
- `LanguageWorkedWith`: Programming languages the respondent has worked with.
- `CompanySize`: Size of the company the respondent works for.
- `JobSatisfaction`: Job satisfaction rating on a scale.
- `CareerSatisfaction`: Career satisfaction rating.

## Features

### 1. Data Loading
The application loads the dataset from a CSV file and fills in missing values where necessary. If the file is not found, the app displays an error message.
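
The loading step described above could look roughly like the following. This is a minimal sketch, not the app's actual code: the function name `load_survey`, the median fill for numeric columns, and the `"Unknown"` placeholder for text columns are all assumptions made for illustration.

```python
import io

import pandas as pd


def load_survey(source, fill_value="Unknown"):
    """Read a CSV and fill missing values; return None if the file is missing.

    `load_survey` and its fill strategy are hypothetical, not taken from app.py.
    """
    try:
        df = pd.read_csv(source)
    except FileNotFoundError:
        return None
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Assumed strategy: numeric gaps get the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Assumed strategy: text gaps get a placeholder label.
            df[col] = df[col].fillna(fill_value)
    return df


# Tiny in-memory CSV standing in for the survey file.
sample_csv = io.StringIO("Country,ConvertedSalary\nIndia,1000\n,3000\nKenya,\n")
df = load_survey(sample_csv)
```

In a Streamlit app, a `None` result could be reported with `st.error(...)` instead of letting the exception propagate.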

### 2. Basic Information
Displays essential information about the dataset, including:
- General structure of the data (`df.info()`).
- Descriptive statistics (`df.describe()`).
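
One detail worth illustrating: `df.info()` prints to standard output and returns `None`, so its text has to be captured before it can be shown in a web page. The sketch below uses made-up sample data and captures the output through the `buf` parameter. How the app itself does this is not shown in the commit.

```python
import io

import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame(
    {
        "Country": ["India", "Kenya", None],
        "ConvertedSalary": [1000.0, 3000.0, 2000.0],
    }
)

# Capture the info() report as a string instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# describe() summarizes numeric columns by default.
summary = df.describe()
```

The captured `info_text` could then be passed to `st.text(...)`, and `summary` to `st.dataframe(...)`.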

### 3. Categorical Value Counts
For the categorical columns (`Country`, `Employment`, `DevType`, `LanguageWorkedWith`), the app shows the distribution of values as counts and percentages.
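
Counts and percentages can both be derived from `Series.value_counts`. The helper below is a sketch; the name `value_summary` and the one-decimal rounding are assumptions.

```python
import pandas as pd


def value_summary(series):
    """Return a table of counts and percentages per category (hypothetical helper)."""
    counts = series.value_counts()
    # normalize=True yields fractions, converted here to percentages.
    percent = (series.value_counts(normalize=True) * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})


employment = pd.Series(["Full-time", "Full-time", "Part-time", "Full-time"])
table = value_summary(employment)
```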

### 4. Visualizations
The app provides the following visualizations to explore the data:
- **Salary Distribution**: A histogram with kernel density estimation (KDE) to visualize salary distribution.
- **Job Satisfaction Analysis**: Bar charts for `JobSatisfaction` and `CareerSatisfaction`.
- **Programming Languages**: The top 10 most-used programming languages among respondents.
- **Job Satisfaction by Company Size**: A box plot showing the relationship between company size and job satisfaction.
- **Age Distribution**: A histogram with KDE to show the age distribution of respondents.
- **Country Distribution**: A line plot showing the top 10 countries by the number of respondents.
- **Employment Status**: A pie chart showing the employment status distribution.
- **Database Usage**: A bar chart of the top 10 databases used by respondents.
- **Job Satisfaction by Gender**: A bar chart comparing job satisfaction across genders.
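
The "top 10" charts for languages and databases depend on columns that hold several values per respondent, separated by semicolons. The commit counts them with `str.split(';', expand=True).stack()`. The sketch below shows an equivalent approach using `explode`; the helper name `top_values` is an assumption.

```python
import pandas as pd


def top_values(series, n=10, sep=";"):
    """Count individual entries in a delimiter-separated column (hypothetical helper)."""
    return (
        series.dropna()
        .str.split(sep)   # each cell becomes a list of entries
        .explode()        # one row per entry
        .str.strip()
        .value_counts()
        .head(n)
    )


langs = pd.Series(["Python;SQL", "Python;JavaScript", None, "SQL;Python"])
top = top_values(langs, n=2)
```

The resulting series can be plotted directly, for example with `top.plot(kind='bar', ax=ax)`.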

### 5. Correlation Heatmap
Displays a heatmap showing the correlation between numerical variables in the dataset.
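
The matrix behind such a heatmap can be computed as sketched below, using made-up data. The commit selects columns with `include=['int64', 'float64']`; the sketch uses `include="number"`, which also picks up other numeric widths such as `int32`.

```python
import pandas as pd

# Made-up data: two perfectly correlated numeric columns and one text column.
df = pd.DataFrame(
    {
        "Salary": [1, 2, 3, 4],
        "Years": [2, 4, 6, 8],
        "Country": ["A", "B", "A", "B"],
    }
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
```

The `corr` frame is what would be passed to `sns.heatmap(corr, annot=True, ax=ax)`.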

### 6. Cumulative Distribution
Provides an **Empirical Cumulative Distribution Function (ECDF)** plot for the first numerical column in the dataset.
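
For readers unfamiliar with the ECDF: at each observed value x it gives the proportion of observations less than or equal to x. The app relies on `sns.ecdfplot` for this. The pure-Python sketch below only illustrates the definition, and the function name `ecdf` is an assumption.

```python
def ecdf(values):
    """Return (value, cumulative proportion) pairs for the non-missing values."""
    clean = sorted(v for v in values if v is not None)
    n = len(clean)
    # The i-th smallest value (0-based) has (i + 1) / n of the data at or below it.
    return [(x, (i + 1) / n) for i, x in enumerate(clean)]


points = ecdf([30, 10, None, 20, 40])
```

With repeated values, the last pair for a given x carries the ECDF value at that point.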

## Usage
After launching the app:
1. The app loads the dataset and displays key information and visualizations on the home page.
2. Navigate through the sections to explore different parts of the dataset interactively.
3. The app is designed to be modular, allowing for future extensions, such as adding machine learning models for prediction tasks.

## Future Improvements
- Implement a **machine learning model** to predict job satisfaction or salary based on features like `Country`, `Employment`, `DevType`, etc.
- Enhance the **EDA** with more detailed visualizations and insights.
- Allow users to **upload their own dataset** for customized analysis.

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.

opensource_analysis/app.py

+111-3
@@ -64,8 +64,8 @@
 
     # Evaluate the model
     y_pred = model.predict(X_test)
-    classification_rep = classification_report(y_test, y_pred)
-    roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
+    classification_rep = classification_report(y_test, y_pred, zero_division=1)
+    roc_auc = roc_auc_score(pd.get_dummies(y_test).values[:, 1], model.predict_proba(X_test)[:, 1])
 
     # Get feature importance
     importances = model.named_steps['classifier'].feature_importances_
@@ -94,7 +94,7 @@
 
     # Plot ROC Curve
     st.header('ROC Curve')
-    y_test_binary = y_test.map({'No': 0, 'Yes': 1})
+    y_test_binary = pd.get_dummies(y_test).values[:, 1]  # Convert to binary
     fpr, tpr, _ = roc_curve(y_test_binary, model.predict_proba(X_test)[:, 1])
     roc_auc = auc(fpr, tpr)
     fig, ax = plt.subplots()
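
Review note: the hunk above replaces the hard-coded `{'No': 0, 'Yes': 1}` mapping with `pd.get_dummies(y_test).values[:, 1]`. That expression takes whichever label sorts second as the positive class, and it raises an `IndexError` if `y_test` happens to contain only one class. A possible alternative is to name the positive label explicitly. The sketch below assumes a helper called `to_binary`, which is not part of the commit.

```python
import pandas as pd


def to_binary(labels, positive_label):
    """Map labels to 1 for the positive class and 0 otherwise (hypothetical helper)."""
    return (pd.Series(labels) == positive_label).astype(int)


y = pd.Series(["No", "Yes", "Yes", "No"])
binary = to_binary(y, "Yes")

# Still well defined when only one class is present.
single_class = to_binary(["No", "No"], "Yes")
```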
@@ -151,5 +151,113 @@
     except Exception as e:
         st.error(f"An error occurred during prediction: {e}")
 
+    # ================== EDA Enhancements ==================
+    st.header('Enhanced Exploratory Data Analysis (EDA)')
+
+    # Load full dataset for EDA
+    eda_data = pd.read_csv(file_path)
+
+    # Salary Analysis
+    st.subheader('Salary Distribution')
+    eda_data['ConvertedSalary'] = pd.to_numeric(eda_data['ConvertedSalary'], errors='coerce')
+    fig, ax = plt.subplots()
+    sns.histplot(eda_data['ConvertedSalary'].dropna(), kde=True, ax=ax)
+    ax.set_title('Distribution of Salaries')
+    ax.set_xlabel('Salary (USD)')
+    st.pyplot(fig)
+
+    # Job Satisfaction Analysis
+    satisfaction_cols = ['JobSatisfaction', 'CareerSatisfaction']
+    for col in satisfaction_cols:
+        st.subheader(f'Distribution of {col}')
+        fig, ax = plt.subplots()
+        eda_data[col].value_counts().plot(kind='bar', ax=ax)
+        ax.set_title(f'Distribution of {col}')
+        ax.set_xlabel('Satisfaction Level')
+        ax.set_ylabel('Count')
+        st.pyplot(fig)
+
+    # Programming Languages Analysis
+    st.subheader('Top 10 Programming Languages')
+    languages = eda_data['LanguageWorkedWith'].str.split(';', expand=True).stack()
+    fig, ax = plt.subplots()
+    languages.value_counts().head(10).plot(kind='bar', ax=ax)
+    ax.set_title('Top 10 Programming Languages')
+    ax.set_xlabel('Language')
+    ax.set_ylabel('Count')
+    st.pyplot(fig)
+
+    # Job Satisfaction by Company Size
+    st.subheader('Job Satisfaction by Company Size')
+    fig, ax = plt.subplots()
+    sns.boxplot(x='CompanySize', y='JobSatisfaction', data=eda_data, ax=ax)
+    ax.set_title('Job Satisfaction by Company Size')
+    ax.set_xlabel('Company Size')
+    ax.set_ylabel('Job Satisfaction')
+    st.pyplot(fig)
+
+    # Age Distribution
+    st.subheader('Age Distribution of Respondents')
+    fig, ax = plt.subplots()
+    sns.histplot(eda_data['Age'], kde=True, ax=ax)
+    ax.set_title('Age Distribution of Respondents')
+    ax.set_xlabel('Age')
+    st.pyplot(fig)
+
+    # Top 10 Countries of Respondents
+    st.subheader('Top 10 Countries of Respondents')
+    country_counts = eda_data['Country'].value_counts().head(10)
+    fig, ax = plt.subplots()
+    ax.plot(country_counts.index, country_counts.values, marker='o')
+    ax.set_title('Top 10 Countries of Respondents')
+    ax.set_xlabel('Country')
+    ax.set_ylabel('Number of Respondents')
+    st.pyplot(fig)
+
+    # Employment Status Distribution
+    st.header("Employment Status Distribution")
+    employment_counts = eda_data['Employment'].value_counts()
+    fig, ax = plt.subplots()
+    ax.pie(employment_counts.values, labels=employment_counts.index, autopct='%1.1f%%')
+    ax.set_title('Employment Status Distribution')
+    ax.axis('equal')
+    st.pyplot(fig)
+
+    # Databases Used
+    st.header("Top 10 Databases Used")
+    databases = eda_data['DatabaseWorkedWith'].str.split(';', expand=True).stack()
+    db_counts = databases.value_counts().head(10)
+    fig, ax = plt.subplots()
+    db_counts.plot(kind='barh', ax=ax)
+    ax.set_xlabel('Number of Users')
+    ax.set_ylabel('Database')
+    st.pyplot(fig)
+
+    # Job Satisfaction by Gender
+    st.header("Job Satisfaction by Gender")
+    job_sat_gender = pd.crosstab(eda_data['JobSatisfaction'], eda_data['Gender'])
+    fig, ax = plt.subplots()
+    job_sat_gender.plot(kind='bar', ax=ax)
+    ax.set_title('Job Satisfaction by Gender')
+    ax.set_xlabel('Job Satisfaction Level')
+    st.pyplot(fig)
+
+    # Correlation Heatmap
+    st.header("Correlation Heatmap of Numeric Variables")
+    numeric_columns = eda_data.select_dtypes(include=['int64', 'float64']).columns
+    fig, ax = plt.subplots()
+    sns.heatmap(eda_data[numeric_columns].corr(), annot=True, cmap='coolwarm', ax=ax)
+    ax.set_title('Correlation Heatmap of Numeric Variables')
+    st.pyplot(fig)
+
+    # Cumulative Distribution
+    st.header(f"Cumulative Distribution of {numeric_columns[0]}")
+    fig, ax = plt.subplots()
+    sns.ecdfplot(data=eda_data, x=numeric_columns[0], ax=ax)
+    ax.set_title(f'Cumulative Distribution of {numeric_columns[0]}')
+    ax.set_xlabel(numeric_columns[0])
+    ax.set_ylabel('Cumulative Proportion')
+    st.pyplot(fig)
+
 except Exception as e:
     st.error(f"An error occurred while loading data: {e}")
