Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

koraynilay
Copy link

Summary

  • This is a…
    • Bug fix
    • Feature addition
    • Refactoring
    • Minor / simple change (like a typo)
    • Other
  • Describe this change in 1-2 sentences: Try to match by catalog number, barcode and asin when using the Lookup function on clusters.

Problem

When using the Lookup (Ctrl+L) function on clusters, Picard won't try to match using available release identifiers (like catalogue number or barcode), this PR aims to fix this.
The problem still applies to unclustered files (I haven't checked to them yet).

I searched online but couldn't find anything similar to this (except for the mentioned JIRA tickets) so I apologize if there already is something similar.

Solution

Make a dictionary containing a Counter for each tag, iterating through it and using most_common() gives us the most common value in all files in the cluster for each tag specified in tags_to_update (Method shamelessly copied from FileCluster's artist and title). This way we can populate the cluster's metadata with additional useful info.
This PR also adds those identifiers to Metadata.compare() and Metadata.compare_to_release_parts() (using their respective weights defined in CLUSTER_COMPARISON_WEIGHTS)

Action

Additional actions required:

  • Update Picard documentation (please include a reference to this PR)
  • Other (please specify below)

This is the first commit for a possible solution for PICARD-2466
also applies to PICARD-2793
Thanks to @kokon191 for the catno comparison code
Cluster weights are sensible defaults which work for some
of my tested files, but they will probably need some tweaking.
The Metadata weights are sensible defaults guessed from
the other weights.
I haven't added them to File yet.
@zas zas requested review from phw and zas August 26, 2024 06:21
Copy link
Contributor

@Sophist-UK Sophist-UK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am not sure that this actually resolves anything that is part of PICARD-2793.

@@ -72,6 +72,9 @@
'releasecountry': 2,
'releasetype': 10,
'totalalbumtracks': 5,
'catalognumber': 60,
'asin': 65,
'barcode': 68,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unclear how these weights were chosen or why they are so high.

Copy link
Author

@koraynilay koraynilay Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose these in the same way as the ones in metadata.py but after trying to match ~80 albums (all files tagged with only catalognumber), with 'catalognumber': 60, it was able to correctly identify almost all of them (except for ~4), while with a lower value it would identify quite a few less.

But all these were guesses and as I am not familiar with the codebase yet they may very well be wrong (even tho they worked for my specific use case).

@@ -163,6 +163,9 @@ class Metadata(MutableMapping):
('totaltracks', 5),
('discnumber', 5),
('totaldiscs', 4),
('catalognumber', 30),
('barcode', 33),
('asin', 32),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unclear how these weights were chosen or why they are so high.

Copy link
Author

@koraynilay koraynilay Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose these values so they are a bit higher than the highest one (title), since they more uniquely identify releases/recordings/etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not had any experience with determining optimum search weightings in Picard, but I have designed complex weighted search algorithms in other search engines - and I know that getting the right weights is an art, and if you get them wrong then all sorts of funny results come up first.

I don't know whether @zas or @phw can help with determining whether these weights are good or bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route.

metadata_counters[tag] = Counter()
metadata_counters[tag][value] += 1
for tag in metadata_counters:
self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this additional code (which is only for these tags)?

And why has genre been added to these tags?

Copy link
Author

@koraynilay koraynilay Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this additional code (which is only for these tags)?

This is in order to add those tags to the cluster so they can be used in the lookup.

And why has genre been added to these tags?

Because I thought it would make sense for a cluster to also have the most common genre in its files, there may also be other tags which makes sense to include (like releasecountry or comment for release disambiguation maybe?) and all these can be quickly added/removed to the lookup (which could also lead to PICARD-2793).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional code - I haven't read the existing code in detail, but I surely code to include these tags in the cluster should be co-located with the code that puts the existing tags into the cluster.

Genres should not IMO be added because they are highly subjective and IMO the MB implementation is far too free-form i.e. lacking in hierarchy and unstructured.

Copy link
Author

@koraynilay koraynilay Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I had searched for that a bit after trying to just add catno=self.metadata['catalognumber'], to the find_releases call in Cluster.lookup_metadata() (which didn't work) but I couldn't find anything.

Now I tried (on master) printing self.metadata right before the find_releases call and the log just reads

store: {'album': ['Do the Dive [Limited Edition]'], 'albumartist': ['Call of Artemis'], 'totaltracks': ['4'], '~totalalbumtracks': ['4']}
deleted: set()
images: ["TagCoverArtImage from '/I/Raccolte/Musica/D4DJ Collection/Call of Artemis - Do the Dive [Limited Edition] (2022) [FLAC+CUE+LOG] {BRMM-10544}/01. Do the Dive.flac' of type front"]
length: 1066065

(all 4 files have catalognumber set)
So it doesn't seem to get populated with the catalog number.
Unless I didn't understand something and those tag shouldn't be in self.metadata

Now that I think about it tho, in the UI I can see the cluster files' metadata, I'm gonna search if I can find where it gets them from.
Update: seems like the metadata box just gets the tags by iterating through the files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way.

I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).

Copy link
Author

@koraynilay koraynilay Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once

I was also concerned about that, but from what I saw, self.update() gets called only once after all files have been added (after for file in files loop) in cluster.add_files(), so it should run only one time for each cluster (as you said it needs to be tested tho).

(i.e. usually when creating the cluster).

Presumably this code would therefore be better placed inside wherever the cluster is created.

Actually my first approach was to do this, by passing the metadata objects of the various files to FileCluster[1] and have a FileCluster.metadata property returning a metadata object with the most common tags, but that would require quite a bit of changes, also in the tests (as it would change the arguments of FileCluster.add() and Cluster's constructor).

I have pushed these changes to koraynilay:identifiers-lookup-filecluster, if you want to see them (they should work, but the tests didn't all pass and there are still some problems (see 742b9d2's commit message for an example)).

[1] like this

# in cluster.py, Cluster.cluster()
    @staticmethod
    def cluster(files):
        # ...
        for file in files:
            # ...
            token = tokenize(album)
            if token:
                # Basically pass `metadata` here
-               cluster_list[token].add(album, artist or various_artists, file)
+               cluster_list[token].add(metadata, file)

@koraynilay
Copy link
Author

koraynilay commented Aug 26, 2024

Also, I am not sure that this actually resolves anything that is part of PICARD-2793.

Hmm, yes, you're right, it doesn't relate directly, I included it because this change would also be needed for PICARD-2793's main use case (see from kokonut191's comments on JIRA).

I can remove it from the PR description tho, since it actually is a separate issue.

Also thanks for reviewing this.

Copy link
Member

@phw phw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this PR. I'm however not too sure about including this in the matching algorithm, especially not with such high weights. A mismatch in cat. no. or barcode would quickly lead to no match at all. IMHO if we use those identifiers in the matching they should have a low weight, so they will be used to push the result in favor of the releases with matching identifier, but don't spoil the search result completely.

Because questions about looking up by some identifier come up from time to time, but with different identifiers such as cat. no., barcode, disc ID, ISRC etc., I had been thinking about a different approach: We could add a context menu to clusters and files for looking up specifically by certain identifiers. This could be context aware and only offer lookups if the relevant tag is present.

The lookup would do an explicit search for the identifier in question on MB. It could either match the results automatically as it is done for "Lookup" and "Scan" or maybe show a search result window.

I think this would be the more flexible approach and would better fulfill the requirement of looking up by barcode, disc ID, cat. no., ISRC etc.


if 'barcode' in self and 'barcode' in weights:
score = 1.0 if self['barcode'] == release['barcode'] else 0.0
parts.append((score, weights['barcode']))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proper comparison of barcodes likely should be a bit more complex then exact string match. Because of UPC vs EAN-13 it is e.g. not uncommon that UPCs are written with a leading zero. Compare also e.g. https://github.com/kellnerd/harmony/blob/1744a1a87ee8a8f8a0a59cbf8208e8f01ee9bae0/utils/gtin.ts#L66

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I did see that when I was looking at the docs for the search API, but it didn't come to mind when writing this code.

@@ -163,6 +163,9 @@ class Metadata(MutableMapping):
('totaltracks', 5),
('discnumber', 5),
('totaldiscs', 4),
('catalognumber', 30),
('barcode', 33),
('asin', 32),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route.

metadata_counters[tag] = Counter()
metadata_counters[tag][value] += 1
for tag in metadata_counters:
self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way.

I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).

@Sophist-UK
Copy link
Contributor

Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).

Presumably this code would therefore be better placed inside wherever the cluster is created.

@koraynilay
Copy link
Author

Thanks for opening this PR. I'm however not too sure about including this in the matching algorithm, especially not with such high weights. A mismatch in cat. no. or barcode would quickly lead to no match at all. IMHO if we use those identifiers in the matching they should have a low weight, so they will be used to push the result in favor of the releases with matching identifier, but don't spoil the search result completely.

Thanks for reviewing it. Yeah that makes sense, unfortunately it didn't occur to me that there could be files tagged with a wrong identifier, as while trying to tag my files I was just thinking "if the catno matches, that means it's correct, whatever the other tags may be".

Because questions about looking up by some identifier come up from time to time, but with different identifiers such as cat. no., barcode, disc ID, ISRC etc., I had been thinking about a different approach: We could add a context menu to clusters and files for looking up specifically by certain identifiers.

I agree this would be the better solution, so even if the matching doesn't automatically find the right release, it can still be found explicitly.

The lookup would do an explicit search for the identifier in question on MB. It could either match the results automatically as it is done for "Lookup" and "Scan" or maybe show a search result window.

I'd prefer for it to match the results automatically, so that it can be done for a lot of clusters/files (as in my case, I could just select all ~80 clusters and press the "Lookup by catalogue number")

This could be context aware and only offer lookups if the relevant tag is present.

IMO it would be better to always offer those lookups, even if the tag isn't present (or show it if it's present in at least one file), because consider if the user has e.g.

  1. 10 album/clusters all tagged with catno
  2. 15 album/clusters without catno
  3. 20 album/clusters with some files having the correct catno and some files having no catno at all
  4. 5 unclustered files with catno
  5. 7 unclustered files without catno

selecting everything and pressing "lookup by catno" should match 1., 3. and 4., then pressing the normal lookup should match 2. and 5..
(catno here can be changed to barcode or another identifier)

@phw
Copy link
Member

phw commented Sep 3, 2024

I added PICARD-2961 for the feature of a "Lookup by..." menu. This could work well for many similar feature requests for lookups, I linked the relevant tickets there.

Let us know if this is something you would like to work on. If not I might actually look into this sometime in the future because I personally would like to have the lookup from disc IDs in tags handled by this. But currently this will take quite some time until I can get to this.

Yeah that makes sense, unfortunately it didn't occur to me that there could be files tagged with a wrong identifier, as while trying to tag my files I was just thinking "if the catno matches, that means it's correct, whatever the other tags may be".

That's indeed always a tricky part with different data quality. A lot of Picard's matching is build on the fact that existing data might be bad. This is only a main reason why a lot of identifiers are not directly used.

Unfortunately we don't really have a good solution for this. All of the matching is usually done with local testing. I have a my own collection and a folder of test stuff, but it is by no means comprehensive. What we would really need is all kinds of test collection data of varying quality, together with the intended result. Then those weights could be tuned to give best results. This is likely a project we should tackle in some way, but this is really a separate topic on its own.

I'd prefer for it to match the results automatically, so that it can be done for a lot of clusters/files (as in my case, I could just select all ~80 clusters and press the "Lookup by catalogue number")

Yes, I tend to agree.

IMO it would be better to always offer those lookups, even if the tag isn't present (or show it if it's present in at least one file), because consider if the user has e.g.

Also makes sense.

@koraynilay
Copy link
Author

Let us know if this is something you would like to work on. If not I might actually look into this sometime in the future because I personally would like to have the lookup from disc IDs in tags handled by this. But currently this will take quite some time until I can get to this.

Unfortunately I have basically never worked with GUIs and I don't have a lot of time rn to understand how everything works, but if I start working on it I will comment here to let you guys know.

That's indeed always a tricky part with different data quality. A lot of Picard's matching is build on the fact that existing data might be bad. This is only a main reason why a lot of identifiers are not directly used.

Unfortunately we don't really have a good solution for this. All of the matching is usually done with local testing. I have a my own collection and a folder of test stuff, but it is by no means comprehensive. What we would really need is all kinds of test collection data of varying quality, together with the intended result. Then those weights could be tuned to give best results. This is likely a project we should tackle in some way, but this is really a separate topic on its own.

Maybe we can add a prompt when using "Lookup"/"Lookup by..." (or global option in the settings) like "trust these files are tagged correctly" and then use exact[1] tag (also usable for title/artist/album) matching on them.

[1] would be better if it could also work with romanized tags (e.g. Russian or Japanese) and match the correct (prioritize romanized, but match also the non-romanized version), but I'm not sure how that would work (something like the artist translation field?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants