Stages.json
10 lines (10 loc) · 12.8 KB
{"_id":3,"Stage":"Data cleaning","Description":"In this stage, the noisy, duplicated or corrupted data is removed from the data set, also the data that is incorrectly formatted is fixed.","Tasks":[{"Task":"Profiling data","Description_1":" Examining, analyzing, and creating useful summaries of data. This proce aids in the discovery of data quality issues, risks, and overall trends."},{"Task":"Remove duplicate or irrelevant observations","Description_1":"Remove observations that are duplicated or that do not fit into the specific problem you are trying to analyze."},{"Task":"Data transformation.","Description_1":"Converting data from one format or structure into another format for warehousing and analyzing."},{"Task":"Handle noise e.g., unwanted outliers","Description_1":"Find if there are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. If so, decide if there is a reason to delete it or if it is given information about the problem."},{"Task":"Handle missing data","Description_1":"Decide what to do with missing values. They can be deleted, predicted, or left as null values. Each option has possible negative consequences and must be decided based on the problem."},{"Task":"Validate data coherence (Benchmark)","Description_1":"Does the data make sense?Does the data follow the appropriate rules for its field?Does it prove or disprove your working theory, or bring any insight to light?Can you find trends in the data to help you form your next theory?If not, is that because of a data quality issue?"},{"Task":"Divide the data set for next steps","Description_1":"Divide the data set for the training and the testing steps."},{"Task":"Adjust the class distribution of a data set e.g., oversampling, undersampling","Description_1":"First try training on the true distribution. If the model works well and generalizes, you're done! If not, try the following downsampling and upweighting technique."}]}
{"_id":7,"Stage":"Model evaluation","Description":"The output is evaluated with tested datasets and pre-defined metrics.","Tasks":[{"Task":"Run the algorithm with the testing part of the data set","Description_1":"Run the algorithm with data that is hasn't being used in any of the steps before"},{"Task":"Calculate true positives, true negatives, false positives, false negatives","Description_1":"Use the outcomes to calculate the FP, FN, TP, and TN. This can be displayed on a confusion matrix to be analyzed."},{"Task":"Calculate the metrics that were previously defined on test set.","Description_1":"Apply the metrics that were defined before. This will show how is the model performance."},{"Task":"Plot learning curves","Description_1":"Learning curves are plots that show changes in learning performance over time in terms of experience. Learning curves of model performance on the train and validation datasets can be used to diagnose an underfit, overfit, or well-fit model."},{"Task":"Analyze the results","Description_1":"Analyze the metrics outcomes and the learning curves outcomes in order to decide if there is a problem with the model or if it works well for the objective of the project."},{"Task":"Compare and select the model and hyperparameters that fits better to the problem.","Description_1":"Decide which model is appropriate to solve the problem taking into account the results."},{"Task":"Troubleshooting model issues (If needed)","Description_1":"Model evaluation metrics should be good, but not perfect. Poor model performance and perfect model performance are both indications that something went wrong with the training process."}]}
{"_id":4,"Stage":"Data labeling","Description":"In this stage, meaningful labels are added and context is given for the model to learn.","Tasks":[{"Task":"Define the methodology of labeling","Description_1":"Automatic or manually?"},{"Task":"Define label set","Description_1":"List all the labels you would like to use and describe the meaning of each label."},{"Task":"Labeling","Description_1":"Identify and call out features that are present in the data. It’s critical to choose informative, discriminating, and independent features to label. To avoid bias cross labeling is done, so more than 1 person is labeling the same data."},{"Task":"Merging labels","Description_1":"As there is more than 1 person labeling the same data, is important to get a consensus on the label."},{"Task":"Audit the labels","Description_1":"Verify the accuracy of labels and update them as necessary."}]}
{"_id":8,"Stage":"Model deployment","Description":"Once the algorithm passes the evaluation metrics, it is deployed on the targered device","Tasks":[{"Task":"Define deployment environment (Storage, size,data retrieval, scalability, feedback)","Description_1":"How is the data going to be stored? How large is the data? How is the data going to be retrieved? How fast is going to grow the model? Which infrastructure is needed?"},{"Task":"Deploy the selected model","Description_1":" The model is moved into its deployed environment, where it has access to the hardware resources it needs as well as the data source."},{"Task":"Integrate the model into the process.","Description_1":"This includes making it accessible from an end user’s laptop using an API or integrating it into software currently being used by the end user. "},{"Task":"Test the deployment","Description_1":"This includes that que la info está llegando bien, que el usuario está pudiendo acceder bien y que el usuario final entiende cómo utilizarlo"}]}
{"_id":9,"Stage":"Model monitoring","Description":"The deployed model is monitored constantly in order to avoid execution errors in real-time.","Tasks":[{"Task":"Collect data while in production and compare changes in model inferences.","Description_1":"Check if the performance is less than the baseline defined."},{"Task":"Collect data distribution and compare it from the baseline data distribution","Description_1":"Check if the data distribution is different from the baseline defined."},{"Task":"Save new training data for retrain.","Description_1":"Save new data occasionally in order to have a new training set if needed."},{"Task":"Incorporate customers feedback.","Description_1":"Users can find errors or occurrences that during the developing phase were not taken into account so it is important to have constant feedback on the model."},{"Task":"Update/Retrain the model","Description_1":"In the case of the performance or the data, distribution changes have impacted the model, it must be retrained with this new data or be updated."}]}
{"_id":2,"Stage":"Data collection","Description":"In this stage, the team looks for the data, it could be already available datasets, or they collect it them selfs. Once they have the data needed, they integrated.","Tasks":[{"Task":"Determine what information the team wants to collect.","Description_1":"What type of information is needed? "},{"Task":"Identify sources according to the needs.","Description_1":"Select reliable sources to have good quality data"},{"Task":"Determine data collection method. ","Description_1":"Already available data sets? Or is the team going to gather data and create a new one?"},{"Task":"Define the timeframe for the collection.","Description_1":"For the data that is continuously collected is important to set a timeframe for the collection."},{"Task":"Collect the data.","Description_1":"Apply the methodology selected before to collect the data."},{"Task":"Integrate the data sets gathered.","Description_1":"Consolidating data from disparate sources into a single dataset."},{"Task":"Enhance and augment data","Description_1":"Artificially increasing the amount of data by generating new data points from existing data."},{"Task":"Anonymizing data (if needed)","Description_1":"Preserving private or confidential information by deleting or encoding identifiers that link individuals to the stored data."}]}
{"_id":1,"Stage":"Model requirements","Description":"In this stage, the problem is analyzed in order to decide if it is appropriate to solve it with ML or not. Also, it is decided wich types of models will be useful in this case.","Tasks":[{"Task":"Define objective","Description_1":"What is the project's goal? What does the team want to achieve by doing ML?"},{"Task":"Define input and output.","Description_1":"Depending on the goal, what are the expected inputs and outputs? What input is available? What output is more useful for the user?"},{"Task":"Define the 'success criteria'","Description_1":"How are the benefits going to be measured? It is important to select achievable 'success'."},{"Task":"Define ethical considerations","Description_1":"Does the project is going to have sensitive data? How it is going to be managed?"},{"Task":"Define the type of problem to solve.","Description_1":"Define if the problem is a classification, regression, clustering, etc."},{"Task":"Define which models are going to be use","Description_1":"Taking into account the definition of the problem, decide which models to use."},{"Task":"Define the metrics to use","Description_1":"Define which are going to be the metrics to measure the performance of the model depending on the objective and the model."}]}
{"_id":10,"Stage":"Support tasks","Description":"Tasks that are done at the same time as the ml pipeline but aren't part of it.","Tasks":[{"Task":"Implement the model.","Description_1":"Build the model taking into account the design decisions previously decided."},{"Task":"Experiment design.","Description_1":"Design the experiment methodologies and procedure."},{"Task":"Test the pipeline, techniques, and models implementation.","Description_1":"Test the code and structural components of the model."},{"Task":"Present the system to the user","Description_1":"Explain and make the user get familiar with the model. Make it clear the uses and limitations."},{"Task":"Support possible errors","Description_1":"Backup plan if something goes wrong, for example that the user can get control of the process if the system fails."}]}
{"_id":5,"Stage":"Feature engineering","Description":"The extraction and selection of informative features for ML. New variables can be created in order to speed up the data transformation.","Tasks":[{"Task":"Define the methodology of extracting and selecting features.","Description_1":"Select the methodology to use taking into account the techniques available."},{"Task":"Transform features","Description_1":"Manipulating the predictor variables to improve model performance; e.g. ensuring the model is flexible in the variety of data it can ingest; ensuring variables are on the same scale, making the model easier to understand; improving accuracy, and avoiding computational errors by ensuring all features are within an acceptable range for the model. "},{"Task":"Extract features","Description_1":"Extract features from a data set to identify useful information. Without distorting the original relationships or significant information, this compresses the amount of data into manageable quantities for algorithms to process."},{"Task":"Create features","Description_1":"Manual creation of variables by extracting them from existing variables. "},{"Task":"Select features","Description_1":"Feature selection algorithms essentially analyze, judge, and rank various features to determine which features are irrelevant and should be removed, which features are redundant and should be removed, and which features are most useful for the model and should be prioritized."},{"Task":"Dimensionality reduction","Description_1":" process of mapping an n-dimensional point, into a lower k-dimensional space. "}]}
{"_id":6,"Stage":"Model training","Description":"During this stage, the chosen model train the relationship between data features and the label that was previously given or between features it selfs.","Tasks":[{"Task":"Select hyper-parameters that should be consider/optimized in the process","Description_1":"Define the optimal configuration for the model in terms of parameters that are essential for making predictions, and hyperparameters that are essential for optimizing the model."},{"Task":"Train the model","Description_1":"Apply the algorithm."},{"Task":"Check for overfitting or under fitting (root-mean-square error)","Description_1":"Reduce overfitting or underfitting if needed. "},{"Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets).","Description_1":"The model is validated with new data. In order to locate an ideal model with best execution."},{"Task":"Optimize hyperparameters","Description_1":"Choosing a set of optimal hyperparameters for the learning algorithm in order to lead to the lowest error."},{"Task":"Establish a baselines for monitoring in production","Description_1":"Define which is going to be the comparison point when monitoring the model as soon as it is deployed. This should include Baseline Model Performance and Baseline Data Distribution."}]}