Welcome to the new midterm project, put together just for you! You'll find a data folder with quite a few JSON files, each containing data on housing sales in the US. To wrangle the data, try iterating over the files, loading each one, and storing the results in memory.
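The loading loop described above might look something like the sketch below. The layout of the JSON files is not specified here, so this assumes each file holds either a single record (a dict) or a list of records; `load_json_folder` and the `source_file` column are hypothetical names used for illustration only.

```python
import json
from pathlib import Path

import pandas as pd


def load_json_folder(folder):
    """Load every .json file in `folder` into one DataFrame.

    Assumption: each file holds a single record (dict) or a list of records.
    Nested keys are flattened into dotted column names by json_normalize.
    """
    frames = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        frame = pd.json_normalize(payload)  # accepts a dict or a list of dicts
        frame["source_file"] = path.name  # remember which file each row came from
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

If the real files wrap their records in an outer key, the `record_path` argument of `pd.json_normalize` is one way to reach them. A call such as `load_json_folder("data")` would then return one row per record.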
Your tasks are as follows:
- Load the data from the provided files (in the data/ directory) into a Pandas dataframe
- Explore, clean and preprocess your data to make it ready for ML Modelling - hints and guidance can be found in the 1 - EDA.ipynb notebook
- (Stretch) Explore some outside data sources - is there any other information you could join to your data that might be helpful to predict housing prices?
- Perform EDA on your data to help understand the distributions and relationships between your variables
- Save your finalized dataframes (X_train, y_train, X_test and y_test) as .csv files in your data/ directory. You may want to make a processed/ subfolder.
Complete the 1 - EDA.ipynb notebook to demonstrate how you executed the tasks above.
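For the final saving step in the list above, a minimal sketch is shown here. It assumes the cleaned data sits in a single dataframe with one target column; the function name `split_and_save`, the 80/20 split, and the `data/processed` default are illustrative choices, not requirements of the project.

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save(df, target, out_dir="data/processed", test_size=0.2, seed=42):
    """Split `df` into train/test sets and write each part as a CSV file."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the subfolder if needed
    parts = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}
    for name, part in parts.items():
        part.to_csv(out / f"{name}.csv", index=False)
    return X_train, X_test, y_train, y_test
```

Splitting before fitting any scalers or encoders keeps statistics from the test rows out of the training data.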
- Try a variety of supervised learning models on your preprocessed data
- Decide on your criteria for model selection - what metrics are most important in this context? Describe your reasoning
- (Stretch) Even after preprocessing, you may have a lot of features, but they may not all be needed to make an accurate prediction. Explore feature selection. How does this change model performance? Remember that a simpler model is generally preferred to a complex one if performance is similar
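
One possible way to compare several models on the same metrics is sketched below. The three models, the choice of RMSE and R² as metrics, and the synthetic data from `make_regression` are all assumptions for illustration; in the project you would pass in your own preprocessed training data and whichever metrics you decide matter most.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate


def compare_models(models, X, y, cv=5):
    """Return {name: {"rmse": ..., "r2": ...}}, averaged over the CV folds."""
    results = {}
    for name, model in models.items():
        scores = cross_validate(
            model, X, y, cv=cv,
            scoring=("neg_root_mean_squared_error", "r2"),
        )
        results[name] = {
            # scikit-learn reports errors negated, so flip the sign back
            "rmse": -scores["test_neg_root_mean_squared_error"].mean(),
            "r2": scores["test_r2"].mean(),
        }
    return results


# Synthetic stand-in for the preprocessed training data (assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = compare_models(models, X, y)
for name, metrics in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name}: RMSE={metrics['rmse']:.2f}, R2={metrics['r2']:.3f}")
```

RMSE is in the same units as the target, which can make it easier to explain than R² when the target is a sale price.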
Complete the 2 - model_selection.ipynb notebook to demonstrate how you executed the tasks above.
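For the feature selection stretch task, one option among several is univariate selection with `SelectKBest`. The sketch below compares cross-validated R² with all features against the top 5 by F-score. The data is synthetic, and `k=5` and the helper `mean_cv_r2` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 30 features, only 5 of which carry signal (assumption)
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)


def mean_cv_r2(model, X, y, cv=5):
    """Average R^2 across cross-validation folds."""
    return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()


all_features = LinearRegression()
# Selection sits inside the pipeline, so it is refit on each training fold
top_k = make_pipeline(SelectKBest(score_func=f_regression, k=5), LinearRegression())

r2_all = mean_cv_r2(all_features, X, y)
r2_top = mean_cv_r2(top_k, X, y)
print(f"all 30 features: R2={r2_all:.3f}")
print(f"top 5 features:  R2={r2_top:.3f}")
```

If the two scores are close, the smaller feature set is usually the better choice.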
- Perform hyperparameter tuning on the best performing models from Part 2. But be careful! Depending on how you preprocessed your data, you may not be able to use the default Scikit-Learn functions without leaking information. You'll find some helpful starter docstrings in the 3 - tuning_pipeline.ipynb notebook.
- Save your tuned model - you may want to create a new models/ directory in your repo
- Build a pipeline that performs your preprocessing steps and makes predictions using your tuned model for new data - assume the new data is in the same JSON format as your original data.
- Save your final pipeline
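
One common way to limit the leakage mentioned in the first task is to place the preprocessing inside a scikit-learn `Pipeline`, so that it is refit on the training portion of every fold during the search. The sketch below assumes the preprocessing can be expressed as a transformer (here a plain `StandardScaler`). If your preprocessing uses statistics computed by hand from the training set, you may need a custom transformer or a hand-written cross-validation loop instead. The data, the `Ridge` model, and the grid values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),  # fitted only on each fold's training split
    ("model", Ridge()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
```

With the default `refit=True`, `search.best_estimator_` is the whole pipeline refit on all of the data passed to `fit`.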
Complete the 3 - tuning_pipeline.ipynb notebook to demonstrate how you executed the tasks above.
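Saving and reloading the fitted pipeline could be done with `joblib`, as in the sketch below. The column names (`sqft`, `beds`, `price`), the tiny training frame, and the file name `final_pipeline.joblib` are made up for illustration. The example writes to a temporary folder, whereas in the project you would likely point `model_dir` at your models/ directory.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set; the column names are hypothetical
train = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000, 2400],
    "beds": [1, 2, 3, 3, 4],
    "price": [150_000, 210_000, 260_000, 330_000, 400_000],
})
features = ["sqft", "beds"]

pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipeline.fit(train[features], train["price"])

# Temporary folder for the demo; e.g. Path("models") in the project repo
model_dir = Path(tempfile.mkdtemp()) / "models"
model_dir.mkdir(parents=True, exist_ok=True)
model_path = model_dir / "final_pipeline.joblib"
joblib.dump(pipeline, model_path)

# Later: reload the pipeline and predict on records parsed from new JSON
restored = joblib.load(model_path)
new_records = [{"sqft": 1500, "beds": 3}]
new_X = pd.json_normalize(new_records)[features]
predictions = restored.predict(new_X)
print(predictions)
```

A file saved this way should be loaded with the same scikit-learn version that created it, so it is worth recording the versions alongside the model.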
Congratulations, your project is complete!