Commit

search config
evanwill committed Sep 4, 2024
1 parent 46586d7 commit 67f6ae1
Showing 4 changed files with 95 additions and 9 deletions.
2 changes: 1 addition & 1 deletion content/maintainers/merging.md
@@ -1,6 +1,6 @@
---
section: Maintainers
-nav_order: 2
+nav_order: 4
title: Merging Main into Branch
---

2 changes: 1 addition & 1 deletion content/maintainers/nwdh.md
@@ -1,6 +1,6 @@
---
section: Maintainers
-nav_order: 2
+nav_order: 5
title: NWDH Harvest
---

87 changes: 87 additions & 0 deletions content/maintainers/search.md
@@ -0,0 +1,87 @@
---
section: Maintainers
nav_order: 3
title: Config Search
---

Digital collection data is added to the search index using a standardized JSON format prepared for ingest.
The JSON file is built and deployed with each CollectionBuilder digital collection using a template in "assets/data/search.json".
The mapping between the collection metadata and the "search.json" output is configured in "_data/config-search-index.csv".

Ensuring the mappings in "config-search-index" are correct is essential to having good data for search and aggregation in DPLA.

## Concept

The contents for each item record in "search.json" are modeled on the standard [DCMI core terms](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) to provide standardized, interoperable data for the search index.
This enables usable search facets and reuse in other platforms.

The "search.json" file is built with each digital collection and exposed on the web.
The search index application harvests the JSON to ingest new items or update existing records.
The list of collection JSON files is used to manage the "Sources" for the search index.

The list of deployed example "search.json" files is at:
https://www.lib.uidaho.edu/digital/data/search-links.txt

## search.json

The "search.json" file has two main keys, "collection" and "items".

- "collection" - contains summary information about the digital collection.
- "items" - contains individual records for each item in the digital collection.

Custom metadata in each digital collection is mapped to these core fields:
title, date, creator, description, subject, coverage, identifier, source, type, format, language, rights, relation, publisher, genre.

Additionally, in each item record "search.json" provides:

- thumb - full URL to thumbnail image if available.
- transcript - plain text transcript if available.
- file - full URL to item download if available.
- text - `true` if the item file is a PDF whose full text should be extracted by the index.
- collectionid - unique id for the individual digital collection, based on the collection's url slug (baseurl).
- objectid - unique id for the item within the collection.
- url - direct URL to the item's page.
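
As a rough illustration of the structure described above, the following Python sketch assembles a one-item "search.json" payload and checks that the item carries every expected key. All values (ids, URLs, titles) are invented placeholders, the key inside "collection" is an assumption since only "summary information" is specified, and representing unavailable fields as empty strings is likewise an assumption rather than confirmed template behavior.

```python
import json

# The 15 core fields listed above, in the order given.
CORE_FIELDS = [
    "title", "date", "creator", "description", "subject", "coverage",
    "identifier", "source", "type", "format", "language", "rights",
    "relation", "publisher", "genre",
]
# Additional per-item keys that "search.json" provides.
EXTRA_FIELDS = ["thumb", "transcript", "file", "text", "collectionid", "objectid", "url"]


def missing_keys(item):
    """Return the expected keys that are absent from an item record."""
    return [key for key in CORE_FIELDS + EXTRA_FIELDS if key not in item]


# Hypothetical item: every value below is a placeholder, not real collection data.
example_item = {field: "" for field in CORE_FIELDS}
example_item.update({
    "title": "Example photograph",
    "date": "1955-12-08",
    "format": "image/jpeg",
    "genre": "image",
    "thumb": "https://example.org/demo/thumbs/demo001_th.jpg",
    "transcript": "",
    "file": "https://example.org/demo/objects/demo001.jpg",
    "text": False,
    "collectionid": "demo",
    "objectid": "demo001",
    "url": "https://example.org/demo/items/demo001.html",
})

# The "title" key inside "collection" is assumed for illustration only.
search_json = {"collection": {"title": "Demo Collection"}, "items": [example_item]}

print(json.dumps(search_json, indent=2))
```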

### Configure

The contents of the JSON are configured using the file "_data/config-search-index.csv".
**Do not edit the first column** of the config-search-index; it must contain the same set of Dublin Core fields for every collection.
The second column, "field", should be customized if necessary to match the most relevant column names in the collection metadata, so that the metadata maps properly to the search index.

In cases where the collection metadata is customized beyond the standard template and contains unique fields, be thoughtful about which fields should be mapped to provide valuable search to users.
If necessary, do metadata work such as combining or cleaning columns to create the most relevant and useful data.

#### genre

One special field for search is "genre", because it is displayed at the top of the search listing.
By default "genre" is mapped to "display_template" which should work for the majority of collections.
If the collection has odd customized item layouts that won't make sense as a label in that context, do some metadata work to create a new "genre" column that makes sense, then change the mapping in the config.

#### Full Text Search

By default, "search.json" is set to indicate that all items with "format" value of "application/pdf" should be tagged to have their text extracted and entered in full text search.
If this *should not* be the default for a particular collection, comment out `pdf-full-text: true` in the front matter of the "search.json" file.
Individual items can be tagged for full-text by adding a column named "full-text" to the metadata, and giving it value `true`.
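
The tagging rule above could be expressed roughly as follows. This is a Python sketch of the decision only, not the actual template logic; the function name `wants_full_text` is hypothetical, and letting the per-item "full-text" column apply even when the collection-wide default is off is an assumption.

```python
def wants_full_text(item, pdf_full_text=True):
    """Decide whether an item should be flagged `text: true` for extraction.

    `item` is one metadata row as a dict; `pdf_full_text` mirrors the
    `pdf-full-text` front matter setting of "search.json".
    """
    # Per-item opt-in: a "full-text" metadata column with the value true.
    if str(item.get("full-text", "")).strip().lower() == "true":
        return True
    # Collection-wide default: tag every PDF when pdf-full-text is enabled.
    return bool(pdf_full_text) and item.get("format", "") == "application/pdf"


# Hypothetical rows for illustration.
pdf_item = {"objectid": "demo001", "format": "application/pdf"}
tagged_image = {"objectid": "demo002", "format": "image/jpeg", "full-text": "true"}

print(wants_full_text(pdf_item))                           # default on
print(wants_full_text(pdf_item, pdf_full_text=False))      # default commented out
print(wants_full_text(tagged_image, pdf_full_text=False))  # per-item tag
```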

#### Default Config

The first column of the config should not be modified.

```
dcterm,field
title,title
date,date
creator,creator
description,description
subject,subject
coverage,location
identifier,identifier
source,source
type,type
format,format
language,language
rights,rightsstatement
relation,findingaid
publisher,publisher
genre,display_template
```
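
To make the mapping concrete, here is a minimal Python sketch of how the default config above could be applied to one metadata row. CollectionBuilder itself does this in a template, so this only illustrates the idea; the helper names `load_mapping` and `map_record`, the sample row, and the choice to emit empty strings for missing columns are all assumptions.

```python
import csv
import io

# Default mapping copied from the config above.
DEFAULT_CONFIG = """dcterm,field
title,title
date,date
creator,creator
description,description
subject,subject
coverage,location
identifier,identifier
source,source
type,type
format,format
language,language
rights,rightsstatement
relation,findingaid
publisher,publisher
genre,display_template
"""


def load_mapping(config_text):
    """Parse the config CSV into {dcterm: metadata column name}."""
    reader = csv.DictReader(io.StringIO(config_text))
    return {row["dcterm"]: row["field"] for row in reader}


def map_record(metadata_row, mapping):
    """Build the core-field part of an item record from one metadata row."""
    # Missing or empty metadata columns become empty strings (an assumption).
    return {dcterm: metadata_row.get(field, "") or "" for dcterm, field in mapping.items()}


# Hypothetical metadata row using the default column names.
row = {
    "title": "Example photograph",
    "location": "Example town",
    "rightsstatement": "http://rightsstatements.org/vocab/InC/1.0/",
    "display_template": "image",
}

mapping = load_mapping(DEFAULT_CONFIG)
record = map_record(row, mapping)
print(record["coverage"], record["rights"], record["genre"])
```

If a collection stores place names in a column with a different name, only the second column of the "coverage" row would change; the first column stays the same.
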
13 changes: 6 additions & 7 deletions content/metadata/02-metadata.md
@@ -44,7 +44,7 @@ General notes:
- If there is no year included with the item, you can estimate the date to the nearest decade if you know enough information about the content of the image or item.
- For example, if you think a photo was probably taken in the 1950s based on a car or clothing pictured, you can put '1950' or '1955' in the data cell. Be sure to fill in the date range in YYYY-YYYY format (in this case, 1950-1960) in the 'archival_date' cell in this situation, but do not put the range in the "date" column.
- If you do not know the date and cannot estimate it, leave this field blank.
-- The "date" field is intended to be sortable and machine readable in order that items can be put on a timeline or sorted--so having it strictly in ISO format is important!
+- The "date" field is intended to be sortable and machine readable in order that items can be put on a timeline or sorted--so having it strictly in ISO format is important! No slashes or ranges!
- Example values: `1955-12-08`; `1955-12`; `1955`
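
A quick way to check "date" values against the three accepted shapes (`YYYY-MM-DD`, `YYYY-MM`, `YYYY`) is sketched below in Python. The function name `is_valid_date` is hypothetical, and this is only one possible check, not part of the metadata workflow described here; blank cells are allowed by the guidelines and would need to be skipped before calling it.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD; slashes and ranges do not match.
ISO_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def is_valid_date(value):
    """Return True if value is a real calendar date in one of the accepted shapes."""
    if not ISO_DATE.fullmatch(value):
        return False
    parts = [int(part) for part in value.split("-")]
    # Pad a missing month or day with 1 so partial dates can be checked too.
    year, month, day = (parts + [1, 1])[:3]
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


for candidate in ["1955-12-08", "1955-12", "1955", "12/08/1955", "1950-1960", "1955-13"]:
    print(candidate, is_valid_date(candidate))
```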

### archival_date
@@ -73,8 +73,8 @@ General notes:
- A detailed, 1-3 sentence accounting of the item, communicating what it is and its contents.
- This includes small details such as "mountains can be seen in the background", and should include names when known.
- All descriptions should be in complete sentences, with a single space between sentences. Descriptions should be no more than 1-3 sentences.
-- If the item is an image that includes text, ensure the description includes a transcription of the text. e.g. `Text on the image reads: ...`
-- Please note: description is used as the default "alt" text for images, so should convey the content of the image. If a more targeted alt text would function better, please use the "image_alt_text" field.
+- If the item is an image that includes text, ensure the description includes a transcription of the text in context. e.g. `Text on the image reads: ...`
+- Please note: description is used as the default "alt" text for images, so should convey the content of the image. If a more targeted alt text would function better for that purpose, please use the "image_alt_text" field.
- Example value: `Students on lawn in front of old Gault Hall, which was torn down in 2003 to make room for the current Living Learning Center.`

### subject
@@ -137,7 +137,6 @@ General notes:
- `InteractiveResource`: Webpage, VR environment
- `Software`: Computer program
- You may only choose one value for the type field. If you encounter an item with multiple types of content, choose the option that describes the item best.
-- At minimum, the input should contain a value chosen from the [DCMI Type Vocabulary](https://www.dublincore.org/specifications/dublin-core/dcmi-type-vocabulary/2003-02-12/).
- Example values: `Image;StillImage`; `Image;MovingImage`; `Text`; `Sound`

### format_original
@@ -209,7 +208,6 @@ General notes:

### filename

-- *required*
- The value must exactly match the actual filename, including capitalization and extension. This value is case-sensitive!
- Generally, the filenames will be based on the identifier PLUS extension (.jpg, .tif, .pdf, .wav, etc.)
- Our digital content management system uses this field to correctly link the digitized item to the corresponding metadata entry.
@@ -223,6 +221,7 @@ General notes:
## Technical Fields

These fields are used for CollectionBuilder.
+They can generally be filled out by CDIL team after metadata creation.

### display_template

@@ -239,8 +238,8 @@ These fields are used for CollectionBuilder.
### date_is_approximate

- **legacy only, don't use for new collections**
-- This field lets anyone looking at the collection know that we are certain of our estimation, not that our estimation is the accurate date.
-- Only fill out 'yes' if the Year, Year-Month, or an actual estimation is provided. If date is accurate, leave blank.
+- This field lets anyone looking at the collection know that the value in "date" is an estimation only.
+- Only fill out 'yes' if the value in "date" field is an estimation. If date is accurate, leave blank.
- Example value: `yes`

### relation