scrapyd-k8s is configured with the file `scrapyd_k8s.conf`. The file format is meant to stick to scrapyd's configuration where possible.
* `http_port` - defaults to `6800`
* `bind_address` - defaults to `127.0.0.1`
* `max_proc` - (implementation pending), if unset or `0` it will use the number of nodes in the cluster, defaults to `0`
* `repository` - Python class for accessing the image repository, defaults to `scrapyd_k8s.repository.Remote`
* `launcher` - Python class for managing jobs on the cluster, defaults to `scrapyd_k8s.launcher.K8s`
* `username` - set this and `password` to enable basic authentication
* `password` - set this and `username` to enable basic authentication
* `log_level` - log level, defaults to `INFO`
The Docker and Kubernetes launchers have their own additional options.
Each project you want to be able to run gets its own section, prefixed with `project.`. For example, an `example` spider would be defined in a `[project.example]` section.
* `repository` - container repository for the project, e.g. `ghcr.io/q-m/scrapyd-k8s-spider-example`
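A minimal sketch of such a section, using the example spider image mentioned above (adapt the section name and repository to your own project):

```ini
[project.example]
repository = ghcr.io/q-m/scrapyd-k8s-spider-example
```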
This section describes Docker-specific options. See `scrapyd_k8s.sample-docker.conf` for an example.
* `[scrapyd]` `launcher` - set this to `scrapyd_k8s.launcher.Docker`
* `[scrapyd]` `repository` - choose between `scrapyd_k8s.repository.Local` and `scrapyd_k8s.repository.Remote`

TODO: explain `Local` and `Remote` repository, and how to use them
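Putting these options together, a Docker setup could start from the following sketch (choosing the `Local` repository here is just one of the two options above; `scrapyd_k8s.sample-docker.conf` remains the authoritative example):

```ini
[scrapyd]
launcher   = scrapyd_k8s.launcher.Docker
repository = scrapyd_k8s.repository.Local
```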
This section describes Kubernetes-specific options. See `scrapyd_k8s.sample-k8s.conf` for an example.
* `[scrapyd]` `launcher` - set this to `scrapyd_k8s.launcher.K8s`
* `[scrapyd]` `repository` - set this to `scrapyd_k8s.repository.Remote`
For Kubernetes, it is important to set resource limits.
TODO: explain how to set limits, with default, project and spider specificity.
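Combining the two settings above, a Kubernetes setup would start from this sketch (see `scrapyd_k8s.sample-k8s.conf` for the full example):

```ini
[scrapyd]
launcher   = scrapyd_k8s.launcher.K8s
repository = scrapyd_k8s.repository.Remote
```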
* `logs_dir` - a directory to store log files collected on the Kubernetes cluster (implemented only for Kubernetes). When configuring, keep in mind that in the Dockerfile the `USER` is set to `nobody`, so not all directories are writable; if you create a child directory under `/tmp`, you won't encounter permission problems.
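For instance, in the appropriate section of `scrapyd_k8s.conf` (the directory name `/tmp/joblogs` is a hypothetical illustration of a writable child directory under `/tmp`):

```ini
logs_dir = /tmp/joblogs
```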
The Kubernetes event watcher is used in the code as part of the joblogs feature and is also utilized for limiting the number of jobs running in parallel on the cluster. Neither feature is enabled by default; both can be activated if you choose to use them.
The event watcher establishes a connection to the Kubernetes API and receives a stream of events from it. However, the
nature of this long-lived connection is unstable; it can be interrupted by network issues, proxies configured to terminate
long-lived connections, and other factors. For this reason, a mechanism was implemented to re-establish the long-lived
connection to the Kubernetes API. To achieve this, three parameters were introduced: `reconnection_attempts`, `backoff_time` and `backoff_coefficient`.
* `reconnection_attempts` - defines how many consecutive attempts will be made to reconnect if the connection fails
* `backoff_time`, `backoff_coefficient` - used to gradually slow down each subsequent attempt to establish a connection with the Kubernetes API, preventing the API from becoming overloaded with requests. The `backoff_time` increases exponentially and is calculated as `backoff_time *= self.backoff_coefficient`.
Default values for these parameters are provided in the code and are tuned to an "average" cluster setting. If your network requirements or other conditions are unusual, you may need to adjust these values to better suit your specific setup.
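The exponential backoff described above can be sketched in Python. The function name and the example numbers below are illustrative, not the actual scrapyd-k8s defaults; only the multiplicative update `backoff_time *= backoff_coefficient` is taken from the text:

```python
def backoff_schedule(backoff_time, backoff_coefficient, reconnection_attempts):
    """Return the wait time before each reconnection attempt.

    After every failed attempt the delay grows multiplicatively:
    backoff_time *= backoff_coefficient.
    """
    delays = []
    for _ in range(reconnection_attempts):
        delays.append(backoff_time)
        backoff_time *= backoff_coefficient
    return delays

# With a starting backoff of 5 seconds and a coefficient of 2,
# five attempts would wait 5, 10, 20, 40 and 80 seconds.
print(backoff_schedule(5, 2, 5))  # [5, 10, 20, 40, 80]
```

A coefficient greater than 1 is what makes the growth exponential; a coefficient of exactly 1 would retry at a fixed interval.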