cleaned dataset #4

Open
ha1990-12 opened this issue Nov 14, 2019 · 2 comments

Comments

@ha1990-12

Could you please share the cleaned dataset?

@ruochunjin
Owner

The cleaned dataset is "clean_list.7z" with clean image IDs in it.
You may want to read the "How to use C-MS-Celeb" section for details.

@AGenchev

AGenchev commented Dec 31, 2020

The "How to use" section could be extended with some practical info:
First, download the academic torrent by the Hyper.AI Datasets Team, because the original dataset was removed by MS. It contains 2 data files:

  1. FaceImageCroppedWithAlignment.tsv
  2. FaceImageCroppedWithOutAlignment.tsv

TSV means tab-separated values. A line of these files looks like this (the fields are separated by tabs):
m.0107_f 0 http://getbeatmadrid.files.wordpress.com/2013/01/magic-alex.jpg http://getbeatmadrid.wordpress.com/2013/01/28/magic-alex/ FaceId-0 KsQsP3Pumj2B6UE/Vj4/Pg== base64_jpegdata
The columns are as follows:
m_id, image_search_rank, image_url, page_url, face_id, face_rectangle, face_data
For many rows the image_url and page_url will be useless, since many of the pages have been removed or have died.
I found a description of the columns here: https://frchallenge.github.io/download/aligned cached below for reference:

File format: text files, each line is an image record containing 7 columns, delimited by TAB.
Column1: Freebase MID
Column2: ImageSearchRank
Column3: ImageURL
Column4: PageURL
Column5: FaceID
Column6: FaceRectangle_Base64Encoded (four floats, relative coordinates of UpperLeft and BottomRight corner)
Column7: FaceData_Base64Encoded
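
A minimal sketch of reading one record with that layout. The names `FaceRecord` and `parse_record` are made up for illustration, and the byte layout of the rectangle (four little-endian float32 values) is an assumption that happens to fit the sample value shown earlier; it is not taken from the official description.

```python
import base64
import struct
from typing import NamedTuple, Tuple


class FaceRecord(NamedTuple):
    m_id: str
    image_search_rank: int
    image_url: str
    page_url: str
    face_id: str
    face_rectangle: Tuple[float, float, float, float]  # relative coordinates, in stored order
    face_data: bytes  # decoded image bytes


def parse_record(line: str) -> FaceRecord:
    """Split one TSV line into its 7 columns and decode the two base64 fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 7:
        raise ValueError(f"expected 7 tab-separated columns, got {len(cols)}")
    m_id, rank, image_url, page_url, face_id, rect_b64, data_b64 = cols
    # Assumption: the rectangle is 16 bytes holding four little-endian float32 values.
    rect = struct.unpack("<4f", base64.b64decode(rect_b64))
    return FaceRecord(
        m_id, int(rank), image_url, page_url, face_id, rect, base64.b64decode(data_b64)
    )
```

With the sample rectangle `KsQsP3Pumj2B6UE/Vj4/Pg==`, all four decoded values fall between 0 and 1, which is consistent with relative coordinates.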

The initial dataset is very noisy, and I don't recommend training person recognition on it: if you look at person "m.0107_f", you'll see images of males and females, most of them not belonging to the same person.
The face images are not high quality (I checked FaceImageCroppedWithAlignment.tsv).
You need to extract the data to perform the filtering.
Extraction script: https://www.programmersought.com/article/53293636195/
The extracted data has a folder for each person ID, named with the index value ("m.0107_f", for example).
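
In case the linked script disappears, here is a rough sketch of the extraction step. It is not that script: `extract_faces` is a hypothetical helper, and the file naming `<image_search_rank>-<face_id>.jpg` is my guess from clean-list entries such as `m.0107_f/100-FaceId-0.jpg`.

```python
import base64
import os


def extract_faces(tsv_path, out_dir, limit=None):
    """Write each face image to out_dir/<m_id>/<rank>-<face_id>.jpg and return the count."""
    written = 0
    with open(tsv_path, "r", encoding="utf-8", errors="replace") as tsv:
        for line in tsv:
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 7:
                continue  # skip malformed rows
            m_id, rank, _image_url, _page_url, face_id, _rect_b64, data_b64 = cols
            person_dir = os.path.join(out_dir, m_id)
            os.makedirs(person_dir, exist_ok=True)
            # Assumed naming scheme, inferred from the clean list paths.
            target = os.path.join(person_dir, f"{rank}-{face_id}.jpg")
            with open(target, "wb") as img:
                img.write(base64.b64decode(data_b64))
            written += 1
            if limit is not None and written >= limit:
                break
    return written
```

The `limit` argument is only there so you can try it on the first few rows before committing to the full file.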
The clean list (from stage 2) has 4,924,737 rows. The relabel list contains 1,539,279 rows.
Combined, the lists have 6,464,016 rows, which cover the C-MS-Celeb dataset.
The 2 lists can be combined by concatenation, because they have the same columns.
The clean list has 2 columns, space-separated:

m.0107_f m.0107_f/100-FaceId-0.jpg
m.0107_f m.0107_f/102-FaceId-0.jpg

I guess the first column is the selected person_id and the second is a photo of this person.
Looking at the listed images, the sex is consistent and it is likely the same person. Hence, the noise is reduced.
There are also omission errors: for example, m.0107_f/116-Faceid-0.jpg is missing.
But at least the false images are mostly removed, so this is our new person index to use. Next, we want to merge the relabel list:
the columns have the same meaning, only the relabel list is a cross-folder index (as I understand it, an image can be assigned to a person ID other than the folder it sits in).
We merge and sort the index, and we are ready.
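
The merge step could look roughly like this. `merge_lists` is a hypothetical helper, and the file names in the usage note are placeholders for wherever you unpacked the two lists.

```python
def merge_lists(list_paths, merged_path):
    """Concatenate two-column, space-separated index files, sort them, and write the result.

    Returns the number of rows written.
    """
    entries = []
    for path in list_paths:
        with open(path, "r", encoding="utf-8") as src:
            for line in src:
                parts = line.split()
                if len(parts) != 2:
                    continue  # skip blank or malformed lines
                person_id, image_path = parts
                entries.append((person_id, image_path))
    # Sort by person ID first, then by image path.
    entries.sort()
    with open(merged_path, "w", encoding="utf-8") as out:
        for person_id, image_path in entries:
            out.write(f"{person_id} {image_path}\n")
    return len(entries)
```

For example, `merge_lists(["clean_list.txt", "relabel_list.txt"], "merged_list.txt")` (placeholder file names). If both lists are complete, the returned count should match the combined row total quoted above.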
Next step: the dataset is still noisy, so you might want to run a (well-trained) gender detector to remove pictures whose gender does not match the rest of that person's images.
Next step: the dataset is not very diverse. There are repeated "same" images taken from one and the same photo, so it can be further reduced to contain only unique pictures of each person.
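
As a cheap first pass at the duplicate problem, you can group byte-identical files by a content hash. This is only a sketch with a hypothetical helper name, and it only catches exact copies: crops of the same photo that were resized or re-encoded would need a perceptual hash or an embedding comparison, which is not shown here.

```python
import hashlib
import os
from collections import defaultdict


def find_exact_duplicates(person_dir):
    """Return groups of file names in person_dir whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for name in sorted(os.listdir(person_dir)):
        path = os.path.join(person_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        by_digest[digest].append(name)
    # Keep only the digests shared by more than one file.
    return [names for names in by_digest.values() if len(names) > 1]
```

Within each returned group you would keep one file and drop the others from the index.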
