/get endpoint does not return the best match to the query #26

Open
snejus opened this issue Oct 22, 2024 · 9 comments

@snejus

snejus commented Oct 22, 2024

Hi @tranxuanthang, following your suggestion under beetbox/beets#5406 I now attempt to find matching lyrics using the /get endpoint, and only perform /search if they could not be found.

Thanks to the synced lyrics available in this database, the other day I added a lyrics display to my music widget, which depends on accurate timestamps, and I noticed that the lyrics are out of sync for some tracks.

One of them was this track:

Artist: Armin van Buuren Feat. Laura V
Title: Drowning (Avicii Remix)
Album: A State Of Trance Classics 14
Duration: 473.0

I checked and found its lyrics were fetched using the /get endpoint:

$ curl https://lrclib.net/api/get \
  --url-query artist_name="Armin van Buuren Feat. Laura V" \
  --url-query track_name="Drowning (Avicii Remix)" \
  --url-query album_name="A State Of Trance Classics 14" | 
  jq '{albumName, artistName, trackName, duration}'

{
  "albumName": "Mirage (The Remixes) [Bonus Tracks Edition]",
  "artistName": "Armin van Buuren feat. Laura V",
  "trackName": "Drowning - Avicii Remix",
  "duration": 472.0
}

Note that I receive the same data when I provide the duration field:

$ curl https://lrclib.net/api/get \
  --url-query artist_name="Armin van Buuren Feat. Laura V" \
  --url-query track_name="Drowning (Avicii Remix)" \
  --url-query album_name="A State Of Trance Classics 14" \
  --url-query duration=473 | 
  jq '{albumName, artistName, trackName, duration}'

{
  "albumName": "Mirage (The Remixes) [Bonus Tracks Edition]",
  "artistName": "Armin van Buuren feat. Laura V",
  "trackName": "Drowning - Avicii Remix",
  "duration": 472.0
}

When I perform the search for the artist and title

curl https://lrclib.net/api/search \
  --url-query artist_name="Armin van Buuren Feat. Laura V" \
  --url-query track_name="Drowning (Avicii Remix)" | 
  jq 'map({id, albumName, artistName, trackName, duration})' | 
  table

I see the following data

[image: table of search results with columns id, albumName, artistName, trackName, duration]

The lyrics I'm after are under id 12429604, and it seems like it should be the closest match to my query. I can provide more examples if required.

The results ranking algorithm I added in beetbox/beets#5406 picks up the correct lyrics.

@snejus
Author

snejus commented Oct 22, 2024

recording-2024-10-22_14.11.24.mp4

That's the widget I mentioned, you can see how it depends on correct timestamps.

@tranxuanthang
Owner

Your song album name is A State Of Trance Classics 14.

The track ID 12429604 album name is A State of Trance: Classics, Volume 14.

After normalizing, these become "a state of trance classics 14" and "a state of trance classics volume 14". Because of the extra word "volume", LRCLIB doesn't consider ID 12429604 a match. It then retries without the album name, and you end up with ID 1029622.

The best way to resolve this in my opinion is resubmitting the correct lyrics for your song's metadata, for example with LRCGET:

  1. Find your song Drowning (Avicii Remix) in the LRCGET song list, then use the search lyrics feature for this song
  2. Apply the matching lyrics (ID 12429604)
  3. Resubmit the lyrics by going to Lyrics Editor > Publish

@snejus
Author

snejus commented Oct 22, 2024

How come it matches album Mirage (The Remixes) [Bonus Tracks Edition] instead?

In addition to this, neither the track name nor the duration returned by the /get endpoint match the query. Meanwhile, there is a record in the database that matches them exactly.

I was wondering how the matching/comparison logic works internally: which fields are prioritised for the comparison?

@tranxuanthang
Owner

tranxuanthang commented Oct 22, 2024

How come it matches album Mirage (The Remixes) [Bonus Tracks Edition] instead?

It just retries one more time, ignoring the album name parameter. The ID 1029622 is probably the first record that matches the criteria. The duration 472 vs 473 seconds is considered good enough (±2 seconds).

// Retry fetching the track without the album name
if let Some(track) = fetch_track_without_album(&params, &state).await? {
    return Ok(Json(create_response(track)));
}
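
The two-step lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not LRCLIB's actual implementation: the `Track` and `Query` types, the in-memory slice, the `is_match` and `get_track` helpers, and the record ordering are all hypothetical, and the strings are assumed to be normalized already. Only the ±2 second tolerance and the "retry without album name" step come from the discussion.

```rust
/// Hypothetical in-memory record; field names are illustrative, not LRCLIB's schema.
#[derive(Debug, Clone, PartialEq)]
struct Track {
    id: u64,
    name: String,   // assumed to be normalized already
    artist: String, // assumed to be normalized already
    album: String,  // assumed to be normalized already
    duration: f64,  // seconds
}

/// Hypothetical query with already-normalized strings.
struct Query<'a> {
    name: &'a str,
    artist: &'a str,
    album: &'a str,
    duration: f64,
}

/// Durations within this many seconds are treated as matching (per the thread).
const DURATION_TOLERANCE: f64 = 2.0;

/// Compare a record against the query, optionally ignoring the album name.
fn is_match(track: &Track, q: &Query, check_album: bool) -> bool {
    track.name == q.name
        && track.artist == q.artist
        && (!check_album || track.album == q.album)
        && (track.duration - q.duration).abs() <= DURATION_TOLERANCE
}

/// First pass: all fields must match. Second pass: retry without the album name
/// and return the first record that satisfies the remaining criteria.
fn get_track<'t>(tracks: &'t [Track], q: &Query) -> Option<&'t Track> {
    tracks
        .iter()
        .find(|t| is_match(t, q, true))
        .or_else(|| tracks.iter().find(|t| is_match(t, q, false)))
}
```

With two records that share the normalized artist and title, a query whose album is "a state of trance classics 14" fails the first pass against "a state of trance classics volume 14", so the second pass returns whichever record happens to come first, regardless of which duration is closer.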

@snejus
Author

snejus commented Oct 22, 2024

I was aware of the duration comparison, but it's surprising to me that the difference in the trackName is ignored, since my query is Drowning (Avicii Remix) but it returns Drowning - Avicii Remix.

Do you reckon we could prioritize exact matches here?

@snejus
Author

snejus commented Oct 22, 2024

I would be more than happy to contribute!

@tranxuanthang
Owner

tranxuanthang commented Oct 22, 2024

Meanwhile, there is a record in the database that matches them exactly.

Unfortunately it is not really exact, because of the extra word "volume".

LRCLIB doesn't deduplicate the metadata. It is a very difficult problem that also requires contributions from the community, and someone else already does this better (MusicBrainz). Even if it could, there might still be minor syncing issues because of differences between CD rips and music downloaded from digital/streaming platforms.

I know it sucks, I hate the fact that there are usually multiple duplicated lyrics records for the same song in LRCLIB. But this issue is almost impossible to resolve.

I was aware of the duration comparison, but it's surprising to me that the difference in the trackName is ignored, since my query is Drowning (Avicii Remix) but it returns Drowning - Avicii Remix.

All of the strings are normalized (converted to lowercase, with special characters removed and accents stripped from accented characters). In your case:

  • Drowning (Avicii Remix) will become drowning avicii remix
  • Drowning - Avicii Remix will become drowning avicii remix

So they are considered an exact match.

The part of the code that does the normalization is here:

pub fn prepare_input(input: &str) -> String {
    let mut prepared_input = lower_lay_string(&input);
    let re = Regex::new(r#"[`~!@#$%^&*()_|+\-=?;:",.<>\{\}\[\]\\\/]"#).unwrap();
    prepared_input = re.replace_all(&prepared_input, " ").to_string();
    let re = Regex::new(r#"['’]"#).unwrap();
    prepared_input = re.replace_all(&prepared_input, "").to_string();
    prepared_input = prepared_input.to_lowercase();
    prepared_input = collapse(&prepared_input);
    prepared_input
}
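
To try the effect of this normalization without pulling in any crates, here is a rough standard-library approximation. It is a sketch, not the project's code: the `normalize` name is made up for illustration, and it deliberately skips the accent folding that `lower_lay_string` appears to perform, so it only covers the punctuation, apostrophe, lowercase and whitespace steps.

```rust
/// Approximation of the normalization shown above, using only the standard library.
/// Assumption: accent folding is omitted, so accented input is NOT handled here.
fn normalize(input: &str) -> String {
    // Same punctuation set as the first regex above; each one becomes a space.
    const TO_SPACE: &str = "`~!@#$%^&*()_|+-=?;:\",.<>{}[]\\/";

    let replaced: String = input
        .chars()
        // Apostrophes (straight and curly) are dropped entirely.
        .filter(|&c| c != '\'' && c != '’')
        .map(|c| if TO_SPACE.contains(c) { ' ' } else { c })
        .collect();

    // Lowercase, then collapse runs of whitespace and trim both ends.
    replaced
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}
```

Under this sketch, both "Drowning (Avicii Remix)" and "Drowning - Avicii Remix" reduce to the same string, while "A State Of Trance Classics 14" and "A State of Trance: Classics, Volume 14" stay different because of the word "volume".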

@tranxuanthang
Owner

I would be more than happy to contribute!

I'd love to have your contribution! But, we need to come to an agreement on the best way to address this first.

@snejus
Author

snejus commented Oct 22, 2024

This makes a lot of sense.

My last hope, then, is the duration: given that the normalised artist and track names are the same, could we prioritize results that match the duration exactly?
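
One possible shape for that proposal, as a sketch only: among the records that already match on normalized artist and track name, keep those within the tolerance and pick the one whose duration is closest to the query. The `Candidate` type and the `best_by_duration` helper are hypothetical and nothing here reflects an agreed design. The sample durations in the usage note are illustrative values taken from the discussion.

```rust
/// Hypothetical candidate that already matched on normalized artist and track name.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    id: u64,
    duration: f64, // seconds
}

/// Among candidates within `tolerance` seconds of `wanted`, return the one with the
/// smallest duration difference. On ties, the earliest candidate is kept.
fn best_by_duration(candidates: &[Candidate], wanted: f64, tolerance: f64) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| (c.duration - wanted).abs() <= tolerance)
        .min_by(|a, b| {
            let da = (a.duration - wanted).abs();
            let db = (b.duration - wanted).abs();
            da.total_cmp(&db)
        })
}
```

For candidates with durations 472.0 (ID 1029622) and 473.0 (ID 12429604) and a query duration of 473, this would select ID 12429604, while still falling back to the 472.0 record if no exact match existed.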
