
PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538

Draft: wants to merge 3 commits into base: master
17 changes: 17 additions & 0 deletions picard/cluster.py
@@ -72,6 +72,9 @@
     'releasecountry': 2,
     'releasetype': 10,
     'totalalbumtracks': 5,
+    'catalognumber': 60,
+    'asin': 65,
+    'barcode': 68,
Contributor:

I am unclear how these weights were chosen or why they are so high.

Author @koraynilay (Aug 26, 2024):

I chose these the same way as the ones in metadata.py, but after trying to match ~80 albums (all files tagged with only catalognumber): with 'catalognumber': 60 it correctly identified almost all of them (all but ~4), while with a lower value it identified noticeably fewer.

But all of these were guesses, and since I am not yet familiar with the codebase they may very well be wrong (even though they worked for my specific use case).

}


@@ -166,6 +169,17 @@ def remove_file(self, file, new_album=True):

 def update(self, signal=True):
     self.metadata['~totalalbumtracks'] = self.metadata['totaltracks'] = len(self.files)
+    metadata_counters = {}
+    tags_to_update = ['catalognumber', 'barcode', 'asin', 'genre']
+    for file in self.files:
+        for tag in tags_to_update:
+            if tag in file.metadata:
+                value = file.metadata[tag]
+                if tag not in metadata_counters:
+                    metadata_counters[tag] = Counter()
+                metadata_counters[tag][value] += 1
+    for tag in metadata_counters:
+        self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0]
Contributor:

What is the purpose of this additional code (which is only for these tags)?

And why has genre been added to these tags?

Author @koraynilay (Aug 26, 2024):

> What is the purpose of this additional code (which is only for these tags)?

This adds those tags to the cluster so they can be used in the lookup.

> And why has genre been added to these tags?

Because I thought it would make sense for a cluster to also carry the most common genre among its files. There may be other tags that make sense to include (like releasecountry, or comment for release disambiguation), and any of these can be quickly added to or removed from the lookup (which could also lead to PICARD-2793).
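The per-tag majority vote the diff performs can be sketched standalone (plain dicts stand in for Picard's file metadata objects; the helper name is made up for illustration):

```python
from collections import Counter

def merge_common_tags(files_metadata, tags):
    """For each tag, keep the value that occurs most often across the files."""
    counters = {}
    for metadata in files_metadata:
        for tag in tags:
            if tag in metadata:
                counters.setdefault(tag, Counter())[metadata[tag]] += 1
    # most_common(1) yields [(value, count)]; keep just the value
    return {tag: counter.most_common(1)[0][0] for tag, counter in counters.items()}

files = [
    {'catalognumber': 'BRMM-10544', 'genre': 'J-Pop'},
    {'catalognumber': 'BRMM-10544', 'genre': 'Rock'},
    {'catalognumber': 'BRMM-10544', 'genre': 'J-Pop'},
]
print(merge_common_tags(files, ['catalognumber', 'genre']))
# {'catalognumber': 'BRMM-10544', 'genre': 'J-Pop'}
```

Tags missing from every file simply never get a counter, so they stay out of the merged result.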

Contributor:

Additional code: I haven't read the existing code in detail, but surely the code that adds these tags to the cluster should be co-located with the code that puts the existing tags into the cluster.

Genres should not, in my opinion, be added: they are highly subjective, and the MB implementation is far too free-form, i.e. unstructured and lacking in hierarchy.

Author @koraynilay (Aug 26, 2024):

Hmm, I had searched for that a bit after trying to just add catno=self.metadata['catalognumber'] to the find_releases call in Cluster.lookup_metadata() (which didn't work), but I couldn't find anything.

Now I tried (on master) printing self.metadata right before the find_releases call, and the log just reads

store: {'album': ['Do the Dive [Limited Edition]'], 'albumartist': ['Call of Artemis'], 'totaltracks': ['4'], '~totalalbumtracks': ['4']}
deleted: set()
images: ["TagCoverArtImage from '/I/Raccolte/Musica/D4DJ Collection/Call of Artemis - Do the Dive [Limited Edition] (2022) [FLAC+CUE+LOG] {BRMM-10544}/01. Do the Dive.flac' of type front"]
length: 1066065

(all 4 files have catalognumber set)
So it doesn't seem to get populated with the catalog number, unless I misunderstood something and those tags shouldn't be in self.metadata.

Now that I think about it, though, the UI does show the cluster files' metadata, so I'm going to search for where it gets them from.
Update: it seems the metadata box just gets the tags by iterating through the files.

Member:

I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way.

I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).

Author @koraynilay (Aug 28, 2024):

> I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once

I was also concerned about that, but from what I saw, self.update() gets called only once after all files have been added (after the for file in files loop) in cluster.add_files(), so it should run only once per cluster (though, as you said, this needs to be measured).

> (i.e. usually when creating the cluster).

Presumably this code would therefore be better placed inside wherever the cluster is created.

Actually, my first approach was to do this: pass the metadata objects of the various files to FileCluster[1] and have a FileCluster.metadata property return a metadata object with the most common tags. But that would require quite a few changes, also in the tests (as it would change the arguments of FileCluster.add() and Cluster's constructor).

I have pushed these changes to koraynilay:identifiers-lookup-filecluster if you want to see them (they should work, but not all tests pass and there are still some problems; see 742b9d2's commit message for an example).

[1] like this:

# in cluster.py, Cluster.cluster()
    @staticmethod
    def cluster(files):
        # ...
        for file in files:
            # ...
            token = tokenize(album)
            if token:
                # Basically pass `metadata` here
-               cluster_list[token].add(album, artist or various_artists, file)
+               cluster_list[token].add(metadata, file)

 if signal and self.ui_item:
     self.ui_item.update()

@@ -286,6 +300,9 @@ def lookup_metadata(self):
     artist=self.metadata['albumartist'],
     release=self.metadata['album'],
     tracks=str(len(self.files)),
+    catno=self.metadata['catalognumber'],
+    barcode=self.metadata['barcode'],
+    asin=self.metadata['asin'],
     limit=config.setting['query_limit'])

def clear_lookup_task(self):
25 changes: 25 additions & 0 deletions picard/metadata.py
@@ -163,6 +163,9 @@ class Metadata(MutableMapping):
     ('totaltracks', 5),
     ('discnumber', 5),
     ('totaldiscs', 4),
+    ('catalognumber', 30),
+    ('barcode', 33),
+    ('asin', 32),
Contributor:

I am unclear how these weights were chosen or why they are so high.

Author @koraynilay (Aug 26, 2024):

I chose these values so they are a bit higher than the highest existing one (title), since they more uniquely identify releases/recordings/etc.

Contributor:

I have not had any experience with determining optimum search weightings in Picard, but I have designed complex weighted search algorithms in other search engines, and I know that getting the weights right is an art; if you get them wrong, all sorts of funny results come up first.

I don't know whether @zas or @phw can help with determining whether these weights are good or bad.

Member:

I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route.
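To illustrate the concern with hypothetical numbers (the weights below are assumptions for the sake of the example, not Picard's actual values): compare() accumulates (score, weight) pairs and reduces them to a weighted average, so three heavily weighted identifier matches can outvote a clean title/artist match:

```python
def weighted_score(parts):
    """Reduce (score, weight) pairs to a weighted average."""
    total = sum(weight for _, weight in parts)
    return sum(score * weight for score, weight in parts) / total

# Hypothetical weights: title 22, artist 6, plus the proposed
# catalognumber 30, barcode 33, asin 32.
# A wrong release that happens to match all three identifiers
# but barely matches title and artist...
wrong_release = [(0.2, 22), (0.1, 6), (1.0, 30), (1.0, 33), (1.0, 32)]
# ...versus the right release, matching title and artist exactly but
# scored 0 on identifiers (assuming absent data counted as a mismatch).
right_release = [(1.0, 22), (1.0, 6), (0.0, 30), (0.0, 33), (0.0, 32)]

print(weighted_score(wrong_release))  # ~0.81
print(weighted_score(right_release))  # ~0.23
```

With these numbers the wrong release wins comfortably, which is the "overrule better matches" failure mode described above.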

]

__date_match_factors = {
@@ -232,6 +235,8 @@ def compare(self, other, ignored=None):
         ia = a
         ib = b
         score = 1.0 - (int(ia != ib))
+    elif name in {'catalognumber', 'barcode', 'asin'}:
+        score = 1.0 if a == b else 0.0
     else:
         score = similarity2(a, b)
     parts.append((score, weight))
@@ -289,6 +294,26 @@ def compare_to_release_parts(self, release, weights):
     except (ValueError, KeyError):
         pass

+    if 'catalognumber' in self and 'catalognumber' in weights:
+        try:
+            found = 0.0
+            for label_info in release['label-info']:
+                if self['catalognumber'] == label_info['catalog-number']:
+                    found = 1.0
+                    break
+            parts.append((found, weights['catalognumber']))
+        except (KeyError, IndexError):
+            pass
+
+    if 'barcode' in self and 'barcode' in weights:
+        score = 1.0 if self['barcode'] == release['barcode'] else 0.0
+        parts.append((score, weights['barcode']))
Member:

Proper comparison of barcodes likely should be a bit more complex than an exact string match. Because of UPC vs EAN-13 it is, e.g., not uncommon for UPCs to be written with a leading zero. Compare also e.g. https://github.com/kellnerd/harmony/blob/1744a1a87ee8a8f8a0a59cbf8208e8f01ee9bae0/utils/gtin.ts#L66

Author:

Right, I did see that when I was looking at the docs for the search API, but it didn't come to mind when writing this code.
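A sketch of what such a normalization could look like (hypothetical helper names, neither Picard nor Harmony code; check-digit validation is omitted): left-pad the numeric code to the 14-digit GTIN-14 form, so a 12-digit UPC-A and its 13-digit EAN-13 spelling with a leading zero compare equal.

```python
def normalize_gtin(code):
    """Left-pad a numeric GTIN (UPC-A, EAN-13, ...) to 14 digits."""
    digits = code.strip()
    if not digits.isdigit() or len(digits) > 14:
        return None  # not a plausible GTIN
    return digits.zfill(14)

def barcodes_match(a, b):
    na, nb = normalize_gtin(a), normalize_gtin(b)
    return na is not None and na == nb

print(barcodes_match('036172089628', '0036172089628'))  # UPC-A vs EAN-13 -> True
print(barcodes_match('036172089628', '4988002914949'))  # -> False
```

In compare_to_release_parts this would replace the plain self['barcode'] == release['barcode'] equality check.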


+    if 'asin' in self and 'asin' in weights:
+        score = 1.0 if self['asin'] == release['asin'] else 0.0
+        parts.append((score, weights['asin']))
+
     # Date Logic
     date_match_factor = 0.0
     if 'date' in weights: