Ingest | Prepare | Train | Deployment | Tests
.
├── Script for downloading and saving the dataset.
├── Script for processing the dataset for training.
├── Script for fine-tuning the model.
├── config.ini Configuration file for setting model hyperparameters and paths
├── FastAPI application for serving the model.
├── Dockerfile Docker configuration for containerizing the FastAPI application.
├── docker-compose.yml Docker Compose configuration for container application.
├── service.yml Kubernetes Service configuration.
├── deployment.yml Kubernetes deployment configuration.
├── Unit tests for the FastAPI application.
├── requirements.txt Python package dependencies.
├── .gitignore Git ignore file to exclude unnecessary files from version control.
└── Project documentation.
Install dependencies
pip3 install -r requirements.txt
usage: [-h] [--path PATH]
Download and save the ImperialCollegeLondon/health_fact dataset.
optional arguments:
-h, --help show this help message and exit
--path PATH Path to save the dataset (default: ./dataset)
Run python3
or python3 --path=your_path
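For readers unfamiliar with how a command-line interface like the one above is declared, here is a minimal sketch using `argparse`. It only reproduces the `--path` option and its default from the usage text; the function name `build_parser` is hypothetical, and the actual download logic of the script is not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage text above (hypothetical helper name)."""
    parser = argparse.ArgumentParser(
        description="Download and save the ImperialCollegeLondon/health_fact dataset."
    )
    # Same option and default as shown in the help output.
    parser.add_argument(
        "--path",
        default="./dataset",
        help="Path to save the dataset (default: ./dataset)",
    )
    return parser
```

Calling `build_parser().parse_args(["--path=your_path"])` yields a namespace whose `path` attribute is `your_path`; with no arguments it falls back to `./dataset`.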
usage: [-h] [--dataset-path DATASET_PATH] [--preprocess-path PREPROCESS_PATH]
Preprocess and tokenize datasets.
optional arguments:
-h, --help show this help message and exit
--dataset-path DATASET_PATH
Path to the dataset directory(default: ./dataset)
--preprocess-path PREPROCESS_PATH
Path to save preprocessed tokens(default: ./preprocess)
Run python3
or python3 --dataset-path=your_path --preprocess-path=save_path
usage: [-h] [--preprocess-path PREPROCESS_PATH] [--store-model STORE_MODEL]
Train DistilBert Model
optional arguments:
-h, --help show this help message and exit
--preprocess-path PREPROCESS_PATH
Path to retrieve preprocessed tokens(default: ./preprocess)
--store-model STORE_MODEL
Path to save trained weight & tokenizer(default: ./models)
Run python3
or python3 --preprocess-path=your_path --store-model=save_path
docker-compose build # Build image
kubectl apply -f service.yml # Configure service for external traffic
kubectl apply -f deployment.yml # Roll out the deployment in the cluster
Run pytest
or python3 -m pytest
Q: How would you go about deciding which model to use? Select a model and we will use it to be deployed.
Ans: Since the token length in the train, test, and validation sets never reaches 512, I decided to go ahead with DistilBert. It is also faster to train and experiment with (considering the limited time I had). If we track the sequence lengths seen by the deployed model and find token lengths above 512, we can train a Longformer (or use chunking).
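The chunking fallback mentioned in this answer could look roughly like the sketch below. It is an illustration under assumptions, not code from this repository: the function name `chunk_token_ids`, the 128-token overlap, and the omission of special tokens such as `[CLS]` and `[SEP]` are all choices made for the example.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int], max_len: int = 512, stride: int = 128
) -> List[List[int]]:
    """Split token ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that context at the
    boundaries is not lost. Special tokens are not handled here.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("require max_len > 0 and 0 <= stride < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - stride
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window already reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
    return chunks
```

Each chunk would then be classified separately, and the per-chunk predictions combined (for example by averaging probabilities); the combination rule is a design decision that this sketch leaves open.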
Q: How would you evaluate this new fine-tuned model? How would you optimise this model further?
Ans: Accuracy, precision (when false positives are costly), recall (when false negatives are costly), and F1-score (for a balance between false positives and false negatives).
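To make these metrics concrete, here is a small dependency-free sketch that computes them from true and predicted labels, treating one class as the positive class. The function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
from typing import Dict, Hashable, Sequence


def classification_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], positive: Hashable = 1
) -> Dict[str, float]:
    """Accuracy plus precision, recall and F1 for the `positive` class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Undefined ratios (zero denominator) are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For a multi-class model, the function can be called once per class and the per-class F1 values averaged to obtain a macro F1.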
For further optimization:
- We can do hyperparameter tuning (with
)
- Perform output probability calibration
- Use a different loss function (class-weighted CE)
- Improve the training pipeline, e.g. with early stopping and weight initialization for the classification layer (after pooler_output).
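As an illustration of the class-weighted cross-entropy idea from the list above, here is a plain-Python sketch. The function name `weighted_cross_entropy` is hypothetical, and normalizing by the sum of the sample weights is an assumption of this example (it is one common convention for a weighted mean); a real training run would use the loss provided by the deep learning framework.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Class-weighted cross-entropy, averaged over the total sample weight."""
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-softmax via the log-sum-exp trick.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(z - row_max) for z in row))
        log_prob = row[target] - log_sum_exp

        weight = class_weights[target]
        total_loss += -weight * log_prob
        total_weight += weight
    return total_loss / total_weight
```

Giving a rare class a larger weight makes mistakes on that class contribute more to the loss, which pushes the model to pay more attention to it.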
Q: Explain in a few sentences how you would monitor this model and decide when to make updates (e.g. retrain).
Ans: Track model performance metrics, i.e. accuracy, precision, recall, and F1-score, using Prometheus and Grafana, and trigger retraining if the metrics fall significantly below expectations. Also capture data and concept drift, and retrain the model if significant drift is detected.
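A retraining trigger along these lines might be sketched as follows. Everything here is an assumption for illustration: the Population Stability Index (PSI) is swapped in as one possible data drift measure (for example over input token lengths), the function names are hypothetical, and the thresholds (an F1 drop of 0.05, a PSI of 0.2) are rule-of-thumb placeholders that would need tuning.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bin_edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a live sample over shared bins."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges that are <= v.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Clamp empty bins to eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def should_retrain(
    baseline_f1: float,
    current_f1: float,
    psi: float,
    max_f1_drop: float = 0.05,
    psi_threshold: float = 0.2,
) -> bool:
    """Flag retraining on a large metric drop or on significant drift."""
    return (baseline_f1 - current_f1) > max_f1_drop or psi > psi_threshold
```

In practice the metric values would come from the monitoring stack and the drift check would run on a schedule; this sketch only shows the decision rule.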