Commit aefb014

Merge pull request #48 from gleanerio/dv_master-fix_47
#47. run one source, fix init
2 parents 6ec4bcc + 1185b10 commit aefb014

File tree

6 files changed

+259
-14
lines changed

configs/template/GleanerConfig.md

+195
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,195 @@
1+
# Gleaner Configuration file
2+
3+
This assumes that you have a container stack running
4+
5+
```
6+
s3 store
7+
triple store
8+
headless
9+
```
10+
## Gleaner Configuration generation
11+
The suggested method of creating a configuration file is to use the glcon command, which can initialize a configuration directory and allows for the generation of
12+
configuration files for gleaner and nabu. Download a glcon release from GitHub.
13+
The pattern is to initialize a configuration directory, edit the files, and generate new configurations.
14+
### Initialize a configuration directory
15+
```
16+
glcon config init -cfgName test
17+
```
18+
This initializes a configuration in configs with the name 'test'.
19+
Inside you will find
20+
```
21+
test % ls
22+
gleaner_base.yaml readme.txt sources.csv
23+
nabu_base.yaml servers.yaml
24+
```
25+
26+
### Edit the files
27+
Usually, you will only need to edit servers.yaml and sources.csv.
28+
The servers.yaml
29+
30+
#### Servers.yaml
31+
```yaml
32+
---
33+
minio:
34+
address: 0.0.0.0 # can be overridden with MINIO_ADDRESS
35+
port: 9000 # can be overridden with MINIO_PORT
36+
accessKey: worldsbestaccesskey # can be overridden with MINIO_ACCESS_KEY
37+
secretKey: worldsbestsecretkey # can be overridden with MINIO_SECRET_KEY
38+
ssl: false # can be overridden with MINIO_SSL
39+
bucket: gleaner # can be overridden with MINIO_BUCKET
40+
sparql:
41+
endpoint: http://localhost/blazegraph/namespace/earthcube/sparql
42+
s3:
43+
bucket: gleaner # sync with above... can be overridden with MINIO_BUCKET... gets zapped if it's not here.
44+
domain: us-east-1
45+
46+
#headless field in gleaner.summoner
47+
headless: http://127.0.0.1:9222
48+
```
49+
First, in the "minio:" section make sure the accessKey and secretKey here match the access keys for your minio.
50+
These can be overridden with the environment variables:
51+
* "MINIO_ACCESS_KEY"
52+
* "MINIO_SECRET_KEY"
53+
54+
#### sources.csv
55+
This is designed to be edited in a spreadsheet, or dumped as CSV from a Google spreadsheet.
56+
57+
```csv
58+
hack,SourceType,Active,Name,ProperName,URL,Headless,Domain,PID,Logo
59+
1,sitegraph,FALSE,aquadocs,AquaDocs,https://oih.aquadocs.org/aquadocs.json ,FALSE,https://aquadocs.org,http://hdl.handle.net/1834/41372,
60+
3,sitemap,TRUE,opentopography,OpenTopography,https://opentopography.org/sitemap.xml,FALSE,http://www.opentopography.org/,https://www.re3data.org/repository/r3d100010655,https://opentopography.org/sites/opentopography.org/files/ot_transp_logo_2.png
61+
,sitemap,TRUE,iris,IRIS,http://ds.iris.edu/files/sitemap.xml,FALSE,http://iris.edu,https://www.re3data.org/repository/r3d100010268,http://ds.iris.edu/static/img/layout/logos/iris_logo_shadow.png
62+
```
63+
64+
Fields:
65+
1. hack: a hack to make sure the fields are read properly.
66+
2. SourceType : [sitemap, sitegraph] type of source
67+
3. Active: [TRUE,FALSE] is source active.
68+
4. Name: short name of source. It should be one word (no space) and be lower case.
69+
5. ProperName: Long name of source that will be added to organization record for provenance
70+
6. URL: URL of sitemap or sitegraph.
71+
7. Headless: [FALSE,TRUE] should be set to false unless you know this site uses JavaScript to place the JSON-LD into the page. This is true of some sites and it is supported but not currently auto-detected, so you will need to know this and set it. For most places, this will be false.
72+
If the JSON-LD is generated in a page dynamically, then use TRUE.
73+
8. Domain:
74+
9. PID: a unique identifier for the source. It is preferred that this is a research ID.
75+
10. Logo: logo of the source; no longer used.
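
To make the column layout concrete, here is a minimal Go sketch that reads rows in this format. It uses the standard library `encoding/csv` rather than whatever reader glcon itself uses, and the `Source` struct, `isTrue`, and `parseSources` are hypothetical names introduced only for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Source is a hypothetical struct mirroring the sources.csv columns.
type Source struct {
	SourceType string
	Active     bool
	Name       string
	ProperName string
	URL        string
	Headless   bool
	Domain     string
	PID        string
	Logo       string
}

// isTrue interprets the spreadsheet-style TRUE/FALSE cells.
func isTrue(cell string) bool {
	return strings.EqualFold(strings.TrimSpace(cell), "TRUE")
}

// parseSources reads CSV text whose first row is the header and returns
// one Source per data row. The leading "hack" column is ignored.
func parseSources(text string) ([]Source, error) {
	rows, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	var out []Source
	for i, row := range rows {
		if i == 0 {
			continue // header row
		}
		if len(row) < 10 {
			return nil, fmt.Errorf("row %d: expected 10 columns, got %d", i+1, len(row))
		}
		out = append(out, Source{
			SourceType: strings.TrimSpace(row[1]),
			Active:     isTrue(row[2]),
			Name:       strings.TrimSpace(row[3]),
			ProperName: strings.TrimSpace(row[4]),
			URL:        strings.TrimSpace(row[5]), // exported cells may carry stray spaces
			Headless:   isTrue(row[6]),
			Domain:     strings.TrimSpace(row[7]),
			PID:        strings.TrimSpace(row[8]),
			Logo:       strings.TrimSpace(row[9]),
		})
	}
	return out, nil
}

func main() {
	const sample = "hack,SourceType,Active,Name,ProperName,URL,Headless,Domain,PID,Logo\n" +
		"1,sitegraph,FALSE,aquadocs,AquaDocs,https://oih.aquadocs.org/aquadocs.json ,FALSE,https://aquadocs.org,http://hdl.handle.net/1834/41372,\n"
	sources, err := parseSources(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range sources {
		fmt.Printf("%s (%s) active=%v headless=%v\n", s.Name, s.SourceType, s.Active, s.Headless)
	}
}
```

Trimming each cell is a design choice for this sketch: spreadsheet exports can leave trailing spaces, as in the aquadocs URL above.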
76+
77+
### Generate the configuration files
78+
```
79+
glcon generate -cfgName test
80+
```
81+
This will generate the files 'gleaner' and 'nabu' and make copies of the existing configuration files.
82+
83+
The full details are discussed below
84+
85+
## Gleaner Configuration
86+
87+
So now we are ready to review the Gleaner configuration file named gleaner. There is actually quite a bit in this file, but for this starting demo there are only a few things we need to worry about. The default file will look like:
88+
89+
```yaml
90+
---
91+
minio:
92+
address: 0.0.0.0
93+
port: 9000
94+
accessKey: worldsbestaccesskey
95+
secretKey: worldsbestsecretkey
96+
ssl: false
97+
bucket: gleaner
98+
gleaner:
99+
runid: runX # this will be the bucket the output is placed in...
100+
summon: true # do we want to visit the web sites and pull down the files
101+
mill: true
102+
context:
103+
cache: true
104+
contextmaps:
105+
- prefix: "https://schema.org/"
106+
file: "./configs/schemaorg-current-https.jsonld"
107+
- prefix: "http://schema.org/"
108+
file: "./configs/schemaorg-current-https.jsonld"
109+
summoner:
110+
after: "" # "21 May 20 10:00 UTC"
111+
mode: full # full || diff: If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
112+
threads: 5
113+
delay: # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1)
114+
headless: http://127.0.0.1:9222 # URL for headless see docs/headless
115+
millers:
116+
graph: true
117+
# will be built from sources.csv
118+
sources:
119+
- sourcetype: sitegraph
120+
name: aquadocs
121+
logo: ""
122+
url: https://oih.aquadocs.org/aquadocs.json
123+
headless: false
124+
pid: http://hdl.handle.net/1834/41372
125+
propername: AquaDocs
126+
domain: https://aquadocs.org
127+
active: false
128+
- sourcetype: sitemap
129+
name: opentopography
130+
logo: https://opentopography.org/sites/opentopography.org/files/ot_transp_logo_2.png
131+
url: https://opentopography.org/sitemap.xml
132+
headless: false
133+
pid: https://www.re3data.org/repository/r3d100010655
134+
propername: OpenTopography
135+
domain: http://www.opentopography.org/
136+
active: false
137+
```
138+
139+
A few things we need to look at.
140+
141+
First, in the "minio:" section make sure the accessKey and secretKey here match the ones you have and set via your demo.env file.
142+
143+
Next, let's look at the "gleaner:" section. We can set the runid to something. This is the ID for a run and it allows you to later make different runs and keep the resulting graphs organized. It can be set to any lower case string with no spaces.
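
The runid rule (lower case, no spaces) can be checked with a few lines of Go. This is a sketch under assumptions: `validRunID` is a hypothetical helper, not part of Gleaner, and it reads "no spaces" as "no whitespace of any kind".

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// validRunID reports whether id is non-empty, lower case, and free of
// whitespace. Hypothetical helper, not part of Gleaner.
func validRunID(id string) bool {
	if id == "" {
		return false
	}
	if strings.IndexFunc(id, unicode.IsSpace) >= 0 {
		return false
	}
	return id == strings.ToLower(id)
}

func main() {
	for _, id := range []string{"runx", "runX", "run x"} {
		fmt.Printf("%q valid: %v\n", id, validRunID(id))
	}
}
```

Under this strict reading, the template's placeholder value `runX` would not pass, so it may be worth changing it before a real run.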
144+
145+
The mill and summon settings are true and we will leave them that way. It means we want Gleaner to both fetch the resources and process (mill) them.
146+
147+
Now look at the "millers:" section, which lets us pick what milling to do. Currently it is set with only graph set to true. Let's leave it that way for now. This means Gleaner will only attempt to make a graph and will not also run validation or generate prov reports for the process.
148+
149+
The final section we need to look at is the "sources:" section.
150+
Here is where the fun is. While there are two types, sitegraph and sitemap, we will normally use the sitemap type.
151+
152+
A standard sitemap is below:
153+
```yaml
154+
sources:
155+
- sourcetype: sitemap
156+
name: opentopography
157+
logo: https://opentopography.org/sites/opentopography.org/files/ot_transp_logo_2.png
158+
url: https://opentopography.org/sitemap.xml
159+
headless: false
160+
pid: https://www.re3data.org/repository/r3d100010655
161+
propername: OpenTopography
162+
domain: http://www.opentopography.org/
163+
active: true
164+
```
165+
166+
A sitegraph is below:
167+
```yaml
168+
sources:
169+
- sourcetype: sitegraph
170+
name: aquadocs
171+
logo: ""
172+
url: https://oih.aquadocs.org/aquadocs.json
173+
headless: false
174+
pid: http://hdl.handle.net/1834/41372
175+
propername: AquaDocs
176+
domain: https://aquadocs.org
177+
active: false
178+
```
179+
These are the sources we wish to pull and process.
180+
Each source has a type and 8 other entries, though at this time we no longer use the "logo" value.
181+
It was used in the past to provide a page showing all the sources and
182+
a logo for them. However, that's really just out of scope for what we want to do.
183+
You can leave it blank or set it to any value; it won't make a difference.
184+
185+
The name is what you want to call this source. It should be one word (no space) and be lower case.
186+
187+
The url value needs to point to the URL for the site map XML file. This will be created and served by the data provider.
188+
189+
The headless value should be set to false unless you know this site uses JavaScript to place the JSON-LD into the page. This is true of some sites and it is supported but not currently auto-detected, so you will need to know this and set it. For most places, this will be false.
190+
191+
You can have as many sources as you wish. For an example, look at the configuration file for the CDF Semantic Network at: https://github.com/gleanerio/CDFSemanticNetwork/blob/master/configs/cdf.yaml
192+
193+
194+
195+

internal/config/sources.go

+34-1
@@ -1,6 +1,7 @@
11
package config
22

33
import (
4+
"errors"
45
"fmt"
56
"github.com/gocarina/gocsv"
67
"github.com/spf13/viper"
@@ -147,7 +148,6 @@ func GetActiveSourceByType(sources []Sources, key string) []Sources {
147148
return sourcesSlice
148149
}
149150

150-
151151
func SourceToNabuPrefix(sources []Sources, includeProv bool) []string {
152152

153153
var prefixes []string
@@ -171,3 +171,36 @@ func SourceToNabuPrefix(sources []Sources, includeProv bool) []string {
171171
}
172172
return prefixes
173173
}
174+
175+
func PruneSources(v1 *viper.Viper, useSources []string) (*viper.Viper, error) {
176+
var finalSources []Sources
177+
allSources, err := GetSources(v1)
178+
if err != nil {
179+
log.Fatalf("error retrieving sources: %s", err)
180+
}
181+
for _, s := range allSources {
182+
if contains(useSources, s.Name) {
183+
s.Active = true // we assume you want to run this, even if disabled, normally
184+
finalSources = append(finalSources, s)
185+
}
186+
}
187+
if len(finalSources) > 0 {
188+
v1.Set("sources", finalSources)
189+
return v1, err
190+
} else {
191+
192+
return v1, errors.New("cannot find a source with the name ")
193+
}
194+
195+
}
196+
197+
// contains checks if a string is present in a slice
198+
func contains(s []string, str string) bool {
199+
for _, v := range s {
200+
if v == str {
201+
return true
202+
}
203+
}
204+
205+
return false
206+
}
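
The selection logic added in `PruneSources` can be sketched without viper. This is not the committed code. It works on a plain slice, swaps the linear `contains` helper for a map lookup, and reports the requested names in the error. The `Source` struct and `pruneSources` are hypothetical names for this illustration.

```go
package main

import "fmt"

// Source is a reduced, hypothetical stand-in for the config Sources struct.
type Source struct {
	Name   string
	Active bool
}

// pruneSources keeps only the sources whose Name appears in use and marks
// each kept source active, even if it was disabled in the config file.
// It returns an error when none of the requested names match.
func pruneSources(all []Source, use []string) ([]Source, error) {
	wanted := make(map[string]bool, len(use))
	for _, name := range use {
		wanted[name] = true
	}
	var kept []Source
	for _, s := range all { // s is a copy, so the input slice is not modified
		if wanted[s.Name] {
			s.Active = true
			kept = append(kept, s)
		}
	}
	if len(kept) == 0 {
		return nil, fmt.Errorf("cannot find a source named any of %v", use)
	}
	return kept, nil
}

func main() {
	all := []Source{{Name: "aquadocs"}, {Name: "opentopography", Active: true}}
	kept, err := pruneSources(all, []string{"aquadocs"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("running %d source(s): %+v\n", len(kept), kept)
}
```

Including the requested names in the error message is a choice made for this sketch, so that a mistyped source name is easy to spot.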

pkg/cli/batch.go

+23-5
@@ -21,32 +21,42 @@ import (
2121
configTypes "github.com/gleanerio/gleaner/internal/config"
2222
"github.com/gleanerio/gleaner/pkg"
2323
bolt "go.etcd.io/bbolt"
24+
"os"
2425

2526
"log"
2627
"path"
2728

2829
"github.com/spf13/cobra"
2930
)
3031

32+
var sourceVal string
33+
3134
// batchCmd represents the batch command
3235
var batchCmd = &cobra.Command{
33-
Use: "batch",
34-
Short: "Execute gleaner process",
36+
Use: "batch",
37+
TraverseChildren: true,
38+
Short: "Execute gleaner process",
3539
Long: `run gleaner process to extract JSON-LD from pages using sitemaps, convert to triples
3640
and store to a S3 server:
3741
--cfgName
3842
--mode`,
43+
3944
Run: func(cmd *cobra.Command, args []string) {
4045
fmt.Println("batch called")
41-
Batch(glrVal, cfgPath, cfgName, modeVal)
46+
var runSources []string
47+
if sourceVal != "" {
48+
runSources = append(runSources, sourceVal)
49+
}
50+
Batch(glrVal, cfgPath, cfgName, modeVal, runSources)
4251
},
4352
}
4453

4554
func init() {
4655
gleanerCmd.AddCommand(batchCmd)
4756

4857
// Here you will define your flags and configuration settings.
49-
58+
batchCmd.Flags().StringVar(&sourceVal, "source", "", "Override config file source(s) to specify an index target")
59+
batchCmd.Flags().StringVar(&modeVal, "mode", "mode", "Set the mode")
5060
// Cobra supports Persistent Flags which will work for this command
5161
// and all subcommands, e.g.:
5262
// batchCmd.PersistentFlags().String("foo", "", "A help for foo")
@@ -56,7 +66,7 @@ func init() {
5666
// batchCmd.Flags().BoolP("toggle", "t", false, "Help message for toggle")
5767
}
5868

59-
func Batch(filename string, cfgPath string, cfgName string, mode string) {
69+
func Batch(filename string, cfgPath string, cfgName string, mode string, runSources []string) {
6070

6171
v1, err := configTypes.ReadGleanerConfig(filename, path.Join(cfgPath, cfgName))
6272
if err != nil {
@@ -70,5 +80,13 @@ func Batch(filename string, cfgPath string, cfgName string, mode string) {
7080
}
7181
defer db.Close()
7282

83+
if len(runSources) > 0 {
84+
85+
v1, err = configTypes.PruneSources(v1, runSources)
86+
if err != nil {
87+
log.Fatal(err)
88+
os.Exit(1)
89+
}
90+
}
7391
pkg.Cli(mc, v1, db)
7492
}

pkg/cli/config.go

+1-1
@@ -24,7 +24,7 @@ nabu uploads and manages data processed by gleaner to a sparql triplestore
2424
var glrVal, nabuVal, sourcesVal, templateGleaner, templateNabu string
2525

2626
var configBaseFiles = map[string]string{"gleaner": "gleaner_base.yaml", "sources": "sources.csv", "sources_min": "sources_min.csv",
27-
"nabu": "nabu_base.yaml", "servers": "servers.yaml", "readme": "readme.txt"}
27+
"nabu": "nabu_base.yaml", "servers": "servers.yaml", "readme": "readme.txt", "configdoc": "GleanerConfig.md"}
2828

2929
var gleanerFileNameBase = "gleaner"
3030
var nabuFilenameBase = "nabu"

pkg/cli/gleaner.go

+4-5
@@ -7,8 +7,9 @@ import (
77

88
// gleanerCmd represents the run command
99
var gleanerCmd = &cobra.Command{
10-
Use: "gleaner",
11-
Short: "command to execute gleaner processes",
10+
Use: "gleaner",
11+
TraverseChildren: true,
12+
Short: "command to execute gleaner processes",
1213
Long: `run gleaner process to extract JSON-LD from pages using sitemaps, convert to triples
1314
and store to a S3 server:
1415
--cfgName
@@ -18,13 +19,11 @@ and store to a S3 server:
1819
fmt.Println("gleaner called")
1920
},
2021
}
21-
var sourceVal, modeVal string
22+
var modeVal string
2223

2324
func init() {
2425
rootCmd.AddCommand(gleanerCmd)
25-
2626
// Here you will define your flags and configuration settings.
27-
gleanerCmd.Flags().StringVar(&modeVal, "mode", "mode", "Set the mode")
2827

2928
// Cobra supports Persistent Flags which will work for this command
3029
// and all subcommands, e.g.:

pkg/cli/init.go

+2-2
@@ -59,12 +59,12 @@ func initCfg(cfgpath string, cfgName string, configBaseFiles map[string]string)
5959
// do not overwrite the source.csv or servers.yaml
6060
_, err := os.Stat(path.Join(cfgpath, cfgName, configBaseFiles["sources"]))
6161
if err == nil {
62-
copy(path.Join(cfgpath, cfgName, configBaseFiles["sources"]), path.Join(cfgpath, cfgName, configBaseFiles["sources"]+"_latest"))
62+
copy(path.Join(cfgpath, "template", configBaseFiles["sources"]), path.Join(cfgpath, cfgName, configBaseFiles["sources"]+"_latest"))
6363
delete(configBaseFiles, "sources")
6464
}
6565
_, err = os.Stat(path.Join(cfgpath, cfgName, configBaseFiles["servers"]))
6666
if err == nil {
67-
copy(path.Join(cfgpath, cfgName, configBaseFiles["servers"]), path.Join(cfgpath, cfgName, configBaseFiles["servers"]+"_latest"))
67+
copy(path.Join(cfgpath, "template", configBaseFiles["servers"]), path.Join(cfgpath, cfgName, configBaseFiles["servers"]+"_latest"))
6868
delete(configBaseFiles, "servers")
6969
}
7070
// copy files listed in config.go: configBaseFiles
