Commit 7d99125

equals215, NGTmeaty, and CorentinB authored Jan 26, 2025
Fixing diverse v2 crashes (#179)
* postprocess: corrected some smells
* postprocess: renamed some variables and corrected forloop variables
* postprocess: postprocessItem args
* postprocess: never set the state of the parent before adding a child, this is done via AddChild() method
* item: reinforced CheckConsistency method
* global: enforcing stricter state and consistency check for items throughout stages in the pipeline
* item: corrected CheckConsistency() and made more unit tests
* item&finisher: make use of CompleteAndCheck() method on an item to parse the tree before handling further
* item: CompleteAndCheck() overlooked return conditions
* pre/postprocess: trying to fix the flow of childs
* dumper: add a Dump() function to properly dump an Item for further debugging
* preprocessor: correct exclusion logic
* item.Dedupe: corrected an edge case where a completed child has the same URL as the seed and dedupe was trying to remove the seed
* postprocess: correct failed outlink extraction behaviour
* Add more detailed pyroscope information
* postprocess: add more debug logging to troubleshoot an unknown bug
* preprocess: add itemId in panic
* postprocess: always postprocess an item EVEN IF ASSETS CAPTURE IS DISABLED
* archiver: close spooledBuffer if error happened during body processing
* postprocess: close all bodies of an item tree before continuing in the pipeline
* archiver: try to write bodies only on disk
* add: small memory optimization for URLToString & encodeQuery
* chore: upgrade Go version & dependencies
* chore: bump warc lib to v.0.8.62
* fix: usage of spooledtempfile lib
* chore: bump warc lib to v.0.8.63
* postprocess: defer a closeBodies call on every item that goes through
* log: disable log queue full error message when TUI is used
* cmd: add no-stderr-log flag
* hq.consumer: replace previousBatch check with a reactor duplicate check
* pyroscope: bump upload rate from 15s to 5s
* fix: add panic for errors in startPipeline, retry indefinitely on HQ start error
* fix: not returning when hq.Start fails to init HQ client
* fix: typo
* fix: HQ Start failure marking init as already done
* fix: panic when HQ init fails
* add: truthsocial.com preprocessing & post-processing
* chore: bump warc lib to v.0.8.64
* add: more truthsocial.com special handling
* add: more truthsocial.com special handling
* add: more truthsocial.com special handling
* fix: variable scope for truthsocial special handling
* fix: domains crawl
* fix: set assets hops to their seed hop
* fix: extraction of outlinks on assets

---------

Co-authored-by: Jake L <[email protected]>
Co-authored-by: Corentin Barreau <[email protected]>

35 files changed, +1107 -295 lines changed

‎cmd/cmd.go

+1
@@ -43,6 +43,7 @@ func Run() error {
 	rootCmd.PersistentFlags().String("log-level", "info", "stdout log level (debug, info, warn, error)")
 	rootCmd.PersistentFlags().String("config-file", "", "config file (default is $HOME/zeno-config.yaml)")
 	rootCmd.PersistentFlags().Bool("no-stdout-log", false, "disable stdout logging.")
+	rootCmd.PersistentFlags().Bool("no-stderr-log", false, "disable stderr logging.")
 	rootCmd.PersistentFlags().Bool("consul-config", false, "Use this flag to enable consul config support")
 	rootCmd.PersistentFlags().String("consul-address", "", "The consul address used to retreive config")
 	rootCmd.PersistentFlags().String("consul-path", "", "The full Consul K/V path where the config is stored")
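
The hunk above only registers the new `no-stderr-log` flag; how Zeno consumes it is not shown in this part of the diff. As a rough illustration of a pair of independent boolean switches like these, here is a minimal sketch using the standard library `flag` package rather than cobra/pflag. The helper name `parseLogFlags` and the flag set name are hypothetical and not taken from the Zeno code base.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// parseLogFlags is a hypothetical helper: it registers two boolean flags
// with the same names and help strings as in the diff and reports their
// values. It uses the stdlib flag package, not cobra/pflag as Zeno does.
func parseLogFlags(args []string) (noStdout, noStderr bool, err error) {
	fs := flag.NewFlagSet("zeno-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text quiet in this sketch

	stdoutOff := fs.Bool("no-stdout-log", false, "disable stdout logging.")
	stderrOff := fs.Bool("no-stderr-log", false, "disable stderr logging.")

	if err := fs.Parse(args); err != nil {
		return false, false, err
	}
	return *stdoutOff, *stderrOff, nil
}

func main() {
	noStdout, noStderr, err := parseLogFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "flag error:", err)
		os.Exit(2)
	}
	fmt.Printf("stdout logging disabled: %t\n", noStdout)
	fmt.Printf("stderr logging disabled: %t\n", noStderr)
}
```

Both flags default to `false`, so existing invocations keep logging to both streams unless a flag is passed explicitly.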

‎cmd/get_hq.go

+7-2
@@ -4,12 +4,14 @@ import (
 	"fmt"
 	"os"
 	"runtime"
+	"time"
 
 	"github.com/google/uuid"
 	"github.com/grafana/pyroscope-go"
 	"github.com/internetarchive/Zeno/internal/pkg/config"
 	"github.com/internetarchive/Zeno/internal/pkg/controler"
 	"github.com/internetarchive/Zeno/internal/pkg/ui"
+	"github.com/internetarchive/Zeno/internal/pkg/utils"
 	"github.com/spf13/cobra"
 )
 
@@ -38,11 +40,14 @@ var getHQCmd = &cobra.Command{
 			return fmt.Errorf("error getting hostname for Pyroscope: %w", err)
 		}
 
+		Version := utils.GetVersion()
+
 		_, err = pyroscope.Start(pyroscope.Config{
-			ApplicationName: fmt.Sprintf("zeno-%s-%s-%s", hostname, cfg.Job, uuid.New().String()[:5]),
+			ApplicationName: fmt.Sprintf("zeno"),
 			ServerAddress:   cfg.PyroscopeAddress,
 			Logger:          nil,
-			Tags:            map[string]string{"hostname": hostname},
+			Tags:            map[string]string{"hostname": hostname, "job": cfg.Job, "version": Version.Version, "goVersion": Version.GoVersion, "uuid": uuid.New().String()[:5]},
+			UploadRate:      5 * time.Second,
 			ProfileTypes: []pyroscope.ProfileType{
 				pyroscope.ProfileCPU,
 				pyroscope.ProfileAllocObjects,
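
The hunk above replaces a per-process application name (`zeno-<hostname>-<job>-<uuid>`) with a fixed `zeno` name and moves the identifying details into tags. The sketch below isolates that naming scheme using only the standard library, so it does not call Pyroscope at all. The `versionInfo` type and `profilerIdentity` function are hypothetical stand-ins; in particular, the real return type of `utils.GetVersion()` is not shown in this diff, and the run identifier is passed in as a plain string instead of being derived from a UUID.

```go
package main

import "fmt"

// versionInfo is a hypothetical stand-in for the value returned by
// utils.GetVersion() in the diff; only the two fields used there are modeled.
type versionInfo struct {
	Version   string
	GoVersion string
}

// profilerIdentity is a hypothetical helper mirroring the new scheme:
// one constant application name, with everything that distinguishes a
// process carried as tags.
func profilerIdentity(hostname, job, runID string, v versionInfo) (string, map[string]string) {
	tags := map[string]string{
		"hostname":  hostname,
		"job":       job,
		"version":   v.Version,
		"goVersion": v.GoVersion,
		"uuid":      runID,
	}
	return "zeno", tags
}

func main() {
	name, tags := profilerIdentity(
		"crawler-01", "example-job", "a1b2c",
		versionInfo{Version: "dev", GoVersion: "go1.23.5"},
	)
	fmt.Println("application:", name)
	// Print the tags in a fixed order for readability.
	for _, k := range []string{"hostname", "job", "version", "goVersion", "uuid"} {
		fmt.Printf("  %s=%s\n", k, tags[k])
	}
}
```

One likely motivation, not stated in the commit itself, is that a single application name keeps all crawler processes under one entry in the profiler while the tags remain available for filtering by host, job, or version.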

‎go.mod

+26-26
@@ -1,47 +1,47 @@
 module github.com/internetarchive/Zeno
 
-go 1.23.3
+go 1.23.5
 
 require (
-	github.com/CorentinB/warc v0.8.60
-	github.com/PuerkitoBio/goquery v1.10.0
-	github.com/ada-url/goada v0.0.0-20240402045241-5e45a5777313
+	github.com/CorentinB/warc v0.8.64
+	github.com/PuerkitoBio/goquery v1.10.1
+	github.com/ada-url/goada v0.0.0-20250104020233-00cbf4dc9da1
 	github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc
 	github.com/elastic/go-elasticsearch v0.0.0
 	github.com/elastic/go-elasticsearch/v7 v7.17.10
 	github.com/gabriel-vasile/mimetype v1.4.8
-	github.com/gdamore/tcell/v2 v2.7.1
+	github.com/gdamore/tcell/v2 v2.8.1
 	github.com/google/uuid v1.6.0
 	github.com/grafana/pyroscope-go v1.2.0
-	github.com/grafov/m3u8 v0.12.0
+	github.com/grafov/m3u8 v0.12.1
 	github.com/internetarchive/gocrawlhq v1.2.26
 	github.com/philippgille/gokv/leveldb v0.7.0
-	github.com/rivo/tview v0.0.0-20241103174730-c76f7879f592
+	github.com/rivo/tview v0.0.0-20241227133733-17b7edb88c57
 	github.com/spf13/cobra v1.8.1
 	github.com/spf13/pflag v1.0.5
 	github.com/spf13/viper v1.19.0
 	go.uber.org/goleak v1.3.0
-	golang.org/x/net v0.33.0
-	mvdan.cc/xurls/v2 v2.5.0
+	golang.org/x/net v0.34.0
+	mvdan.cc/xurls/v2 v2.6.0
 )
 
 require (
-	github.com/andybalholm/brotli v1.1.0 // indirect
-	github.com/andybalholm/cascadia v1.3.2 // indirect
-	github.com/cloudflare/circl v1.4.0 // indirect
-	github.com/fsnotify/fsnotify v1.7.0 // indirect
-	github.com/gdamore/encoding v1.0.0 // indirect
+	github.com/andybalholm/brotli v1.1.1 // indirect
+	github.com/andybalholm/cascadia v1.3.3 // indirect
+	github.com/cloudflare/circl v1.5.0 // indirect
+	github.com/fsnotify/fsnotify v1.8.0 // indirect
+	github.com/gdamore/encoding v1.0.1 // indirect
 	github.com/gobwas/httphead v0.1.0 // indirect
 	github.com/gobwas/pool v0.2.1 // indirect
 	github.com/gobwas/ws v1.4.0 // indirect
 	github.com/golang/snappy v0.0.4 // indirect
 	github.com/grafana/pyroscope-go/godeltaprof v0.1.8 // indirect
 	github.com/hashicorp/hcl v1.0.0 // indirect
 	github.com/inconshreveable/mousetrap v1.1.0 // indirect
-	github.com/klauspost/compress v1.17.10 // indirect
+	github.com/klauspost/compress v1.17.11 // indirect
 	github.com/lucasb-eyer/go-colorful v1.2.0 // indirect
-	github.com/magiconair/properties v1.8.7 // indirect
-	github.com/mattn/go-runewidth v0.0.15 // indirect
+	github.com/magiconair/properties v1.8.9 // indirect
+	github.com/mattn/go-runewidth v0.0.16 // indirect
 	github.com/miekg/dns v1.1.62 // indirect
 	github.com/mitchellh/mapstructure v1.5.0 // indirect
 	github.com/onsi/gomega v1.34.2 // indirect
@@ -51,23 +51,23 @@ require (
 	github.com/philippgille/gokv/util v0.7.0 // indirect
 	github.com/refraction-networking/utls v1.6.7 // indirect
 	github.com/rivo/uniseg v0.4.7 // indirect
-	github.com/sagikazarmark/locafero v0.4.0 // indirect
+	github.com/sagikazarmark/locafero v0.7.0 // indirect
 	github.com/sagikazarmark/slog-shim v0.1.0 // indirect
 	github.com/sourcegraph/conc v0.3.0 // indirect
-	github.com/spf13/afero v1.11.0 // indirect
-	github.com/spf13/cast v1.6.0 // indirect
+	github.com/spf13/afero v1.12.0 // indirect
+	github.com/spf13/cast v1.7.1 // indirect
 	github.com/subosito/gotenv v1.6.0 // indirect
 	github.com/syndtr/goleveldb v1.0.0 // indirect
 	github.com/ulikunitz/xz v0.5.12 // indirect
 	go.uber.org/multierr v1.11.0 // indirect
-	golang.org/x/crypto v0.31.0 // indirect
-	golang.org/x/exp v0.0.0-20240909161429-701f63a606c0 // indirect
-	golang.org/x/mod v0.21.0 // indirect
+	golang.org/x/crypto v0.32.0 // indirect
+	golang.org/x/exp v0.0.0-20250106191152-7588d65b2ba8 // indirect
+	golang.org/x/mod v0.22.0 // indirect
 	golang.org/x/sync v0.10.0 // indirect
-	golang.org/x/sys v0.28.0 // indirect
-	golang.org/x/term v0.27.0 // indirect
+	golang.org/x/sys v0.29.0 // indirect
+	golang.org/x/term v0.28.0 // indirect
 	golang.org/x/text v0.21.0 // indirect
-	golang.org/x/tools v0.25.0 // indirect
+	golang.org/x/tools v0.29.0 // indirect
 	gopkg.in/ini.v1 v1.67.0 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
 )
