docs/GleanerConfig.md (+14 -1)
@@ -76,7 +76,7 @@ The miller and summon sections are true and we will leave them that way. It mea
Now look at the "miller:" section, which lets us pick what milling to do. Currently only graph is set to true. Let's leave it that way for now. This means Gleaner will only attempt to build the graph and will not also run validation or generate prov reports for the process.
The final section we need to look at is the "sources:" section.
-Here is where the fun is. While there are two types, sitegraphand sitemaps we will normally use sitemap type.
+Here is where the fun is. While there are multiple types (sitegraph, sitemap, googledrive, and api), we will normally use the sitemap type.
A standard [sitemap](./SourceSitemap.md) is below:
Sometimes, instead of crawling webpages listed in a sitemap, we can query an API that lets us ingest JSON-LD directly. To do so, specify `sourcetype: api` in the Gleaner config YAML, and Gleaner will iterate through a paged API, using the given `url` as a template. For example, say you want to use the API endpoint at `http://test-api.com`, and you can page through it with a URL like `http://test-api.com/page/4`. You would put this in your config:
```yaml
url: http://test-api.com/page/%d
```
Notice the `%d` where the page number goes. Gleaner will then increment that number (starting from 0) until it gets an error back from the API.
Optionally, you can set a limit on the number of pages to iterate through, using `apipagelimit`. This means that Gleaner will page through the API until it gets an error back *or* until it reaches the limit you set. That looks like the example below:
repologger.WithFields(log.Fields{"url": urlloc, "sha": sha, "issue": "Uploaded to object store"}).Trace(err)
-log.WithFields(log.Fields{"url": urlloc, "sha": sha, "issue": "Uploaded to object store"}).Info("Successfully put ", sha, " in summoned bucket for ", urlloc)
-repoStats.Inc(common.Stored)
-}
-// TODO Is here where to add an entry to the KV store