Run `test_diarisation.py` from the project's root folder. Speaker IDs will show up in a text file under `data/processed/`.

The name of that file, as well as a number of other input parameters, is stored in a config file at `models/config.ini`. Please read it carefully, as it contains information critical for understanding the workflow.
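For reference, here is a minimal sketch of reading that config from Python with the standard `configparser` module; the section and key names below are hypothetical placeholders, so check the actual `models/config.ini` for the real parameters:

```python
from configparser import ConfigParser

config = ConfigParser()
config.read("models/config.ini")

# "diarisation" and "output_file" are placeholder names used only for illustration;
# the real section/key names live in models/config.ini itself.
output_file = config.get("diarisation", "output_file")
print(f"Speaker IDs will be written to data/processed/{output_file}")
```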
For the Flask app to use Bootstrap 4, install `bootstrap-flask`, not `flask-bootstrap`.
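Both packages are imported as `flask_bootstrap`, which is why having the wrong one installed breaks the app. A minimal registration sketch, assuming bootstrap-flask 1.x (where the extension class is still called `Bootstrap`):

```python
from flask import Flask
from flask_bootstrap import Bootstrap  # provided by bootstrap-flask, not flask-bootstrap

app = Flask(__name__)
Bootstrap(app)  # makes the Bootstrap 4 templates and helpers available to Jinja
```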
- `pyaudio` - for opening `.wav` files (see the playback sketch after this list)
- `keras`, `tensorflow` - for running the Celebrity Recognition model
- `bokeh` - for interactive visualization
- `flask`, `bootstrap-flask`, `flask-nav` - in case you'd like to run the demo in your browser
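As an illustration of the `pyaudio` dependency, a small sketch that plays back a `.wav` file (the file path is a placeholder):

```python
import wave
import pyaudio

wav_path = "data/raw/example.wav"  # placeholder path

wf = wave.open(wav_path, "rb")
pa = pyaudio.PyAudio()
stream = pa.open(
    format=pa.get_format_from_width(wf.getsampwidth()),
    channels=wf.getnchannels(),
    rate=wf.getframerate(),
    output=True,
)

# Stream the file to the default output device in small chunks
chunk = 1024
data = wf.readframes(chunk)
while data:
    stream.write(data)
    data = wf.readframes(chunk)

stream.stop_stream()
stream.close()
pa.terminate()
wf.close()
```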
- Figuring out if we can avoid downloading data and just use a pre-trained model from somewhere that produces embeddings with the similarity property -- this did not work, so we share our weights for the model trained on VoxCeleb1
- Initialize arguments and params as seen in `split_test.ipynb`
- Create the network using `network = src.vggvoxvlad.split.make_network(weight_path, args, input_dim=(257, None, 1), num_class=1251)`, where `weight_path` is a `.h5` file
- Create a list of dataframes using `result_list = src.vggvoxvlad.split.voxceleb1_split(path, network, split_seconds=3, n=3, win_length=400, sr=16000, hop_length=160, n_fft=512, spec_len=250, n_classes=1251)`, where `path` is a `.wav` file
- Each dataframe contains the headers `Time (s)`, `Speaker`, `Probability`, `Country` and `Gender`
- Each dataframe will display the three speakers with the highest predicted probability (to change this, change the `n` parameter)
- `voxceleb1_split()` will predict a speaker every three seconds by default (to change this, change the `split_seconds` parameter)
- To show a bar graph of the results, use `plot_split(result_list, num_speakers=2)`, where `num_speakers` is the actual number of speakers in the `.wav` file (a combined sketch of these steps follows this list)
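Putting the steps above together, a minimal end-to-end sketch (the weight and audio paths are placeholders, `args` stands for whatever argument object `split_test.ipynb` initializes, and `plot_split` is assumed to live in the same module):

```python
import src.vggvoxvlad.split as split

weight_path = "models/voxceleb1_vggvoxvlad.h5"  # placeholder: path to the shared .h5 weights
wav_path = "data/raw/interview.wav"             # placeholder: your audio file
args = ...                                      # initialize as shown in split_test.ipynb

# Build the VGGVox-VLAD network with the VoxCeleb1 classification head (1251 speakers)
network = split.make_network(weight_path, args, input_dim=(257, None, 1), num_class=1251)

# Predict a speaker every 3 seconds, keeping the top 3 candidates per window
result_list = split.voxceleb1_split(
    wav_path, network,
    split_seconds=3, n=3,
    win_length=400, sr=16000, hop_length=160,
    n_fft=512, spec_len=250, n_classes=1251,
)

# Bar graph of the predictions; num_speakers is the true number of speakers in the file
split.plot_split(result_list, num_speakers=2)
```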
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── flask_app      <- Scripts to create a Flask demo app
│   │   └── visualize.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── vggvoxvlad     <- Scripts to build and run the VGGVox-VLAD speaker model
│       └── split.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
- Tie a microphone recording module to the visualization via Bokeh server app
- Start working on a Flask application and think about how to deploy it
- Create a CLI wrapper for the set of scripts that takes a path to an audio file as an input argument, or the name of a preset `.wav` file, and launches a `bokeh` application locally on the user's machine. Rely on the `click` package for setting up arguments (see the sketch after this list)
- Create a quick `gif` animation of running the script in the terminal - use https://github.com/faressoft/terminalizer to help you create the `gif`. The gif would show the following:
  - Clone the repo
  - Run the CLI wrapper with the `--help` parameter, which would show some info about the script usage, as well as the possible values for the preset examples
  - `ls` in the terminal for a custom `wav`
  - Run the CLI on some custom `wav` file
- Clean this README, provide a nice overview of what the project does and how it achieves the goal - mention steps like:
  - VAD for voice activity detection
  - Speech activity
  - Diarization clustering
  - VoxCeleb v1 dataset, metadata example, how the validation data was prepared (point to the right folder in the repo that has the code and the weights) -- this step is quite important actually, as people might be interested in downloading our weights
  - How the embeddings and the final layer were trained
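For the CLI wrapper idea above, a possible `click`-based sketch (the preset names, file paths, and Bokeh app location are all hypothetical):

```python
import subprocess
import click

# Hypothetical mapping of preset names to bundled example files
PRESETS = {"zuck_cruz": "data/raw/zuck_cruz.wav"}

@click.command()
@click.option("--audio", "audio_path", type=click.Path(exists=True),
              help="Path to a custom .wav file.")
@click.option("--preset", type=click.Choice(sorted(PRESETS)),
              help="Name of a preset example .wav file.")
def main(audio_path, preset):
    """Run speaker diarisation on a .wav file and open the Bokeh demo locally."""
    if not (audio_path or preset):
        raise click.UsageError("Provide either --audio or --preset.")
    wav = audio_path or PRESETS[preset]
    # Launch the interactive Bokeh app; the app path below is a placeholder
    subprocess.run(
        ["bokeh", "serve", "--show", "src/flask_app/visualize.py", "--args", wav],
        check=True,
    )

if __name__ == "__main__":
    main()
```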
- Visualization - tie `wav` and `rttm` together, create a series of pngs -- done by Anton
- Tie the `png`s together into a video, concatenate it with the original video of Zuck fighting with Cruz (done by Anton), OR
- Get the `VoxCeleb`/`VoxCeleb2` data, calculate all the embeddings, store them with labels
- Using all above, select an appropriate model to be used as a feature extractor, and get embeddings of a given `.wav` in a sliding window fashion -- done by Dan
- Figure out how to tie metadata, and pull out Male/Female ids -- done by Dan
- Set up KNN in the embedding space with an appropriate metric / pick `top_n` from the prediction vector -- done by Dan & Anton (a toy sketch follows this list)
- Develop an interactive web-friendly visualization -- done by Anton
- Apply Speaker Classification to `.wav` in a rolling window, get top N predictions -- done by Dan
- Build a classifier of masculinity/femininity (could be simply an average of top 10 / N predictions' gender)
- Wrap everything into a nicer set of functions
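For context on the KNN step, a toy sketch over stored embeddings (all arrays below are synthetic placeholders, not project data; the real embeddings would come from the feature extractor above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))       # placeholder speaker embeddings
speaker_ids = rng.integers(0, 1251, size=1000)  # placeholder VoxCeleb1-style labels

# Cosine distance is one reasonable metric for comparing speaker embeddings
knn = KNeighborsClassifier(n_neighbors=10, metric="cosine")
knn.fit(embeddings, speaker_ids)

# Embedding of one sliding-window chunk of the input .wav (placeholder)
query = rng.normal(size=(1, 512))

# Pick the top_n most probable speakers from the prediction vector
probs = knn.predict_proba(query)[0]
top_n = 3
top_speakers = knn.classes_[np.argsort(probs)[::-1][:top_n]]
print(top_speakers)
```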
Project based on the cookiecutter data science project template. #cookiecutterdatascience