Add GitHub Action to Generate a WARC of Hosted Site #66

Open
wants to merge 6 commits into
base: master

ikreymer
Member

Inspired by @anjackson's tweet, here's a github action that will generate a WARC of the github pages site after every commit to master.
I figured having a WARC for every commit of the WARC specification might be a good test case for this idea!

This PR adds an action that builds the site via Jekyll and then generates a WARC using warcit and uploads it as an artifact to github, like this:
https://github.com/ikreymer/warc-specifications/actions/runs/170823529
(Note that due to a limitation of GitHub, artifacts are always zipped, so the WARC file ends up inside a zip file. This can't be changed for now.)
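
As a rough illustration of the two steps described above (build with Jekyll, then package the output with warcit), here is a minimal Python sketch that only assembles the command lines. The function name, the `_site` output directory, and the example URL are hypothetical. It also assumes warcit's `--name` option sets the base name of the output WARC; check this against the installed warcit version.

```python
import shlex
from typing import List


def build_and_warc_commands(
    url_prefix: str,
    warc_name: str,
    site_dir: str = "_site",
) -> List[List[str]]:
    """Return the Jekyll build command followed by the warcit command."""
    # Build the site with the gems pinned in Gemfile.lock.
    jekyll = ["bundle", "exec", "jekyll", "build", "--destination", site_dir]
    # Assumption: warcit takes a URL prefix and a directory, and --name
    # sets the base name of the WARC it writes.
    warcit = ["warcit", "--name", warc_name, url_prefix, site_dir]
    return [jekyll, warcit]


if __name__ == "__main__":
    # Hypothetical values, printed rather than executed.
    for cmd in build_and_warc_commands("https://example.github.io/site/", "site"):
        print(shlex.join(cmd))
```

The commands are returned as lists so they could be passed to `subprocess.run` without shell quoting issues.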

This PR also adds:

  • add Gemfile and Gemfile.lock necessary for local Jekyll build
  • add local jquery instead of loading from cdn so that the WARC is more self-contained

(The GitHub API call that lists active issues, and of course the active issues themselves, are not included. That might be a nice future extension...)

- generate only on master branch
- add Gemfile and Gemfile.lock necessary for local Jekyll build
- add local jquery instead of loading from cdn so that the WARC is more self-contained
- use 'iipc-warc-specification.warc.gz' as the name (note github still zips the file)
@ibnesayeed
Contributor

It is worth noting that these artifacts will not be preserved forever: they expire automatically after 90 days. There might also be a disk quota associated with them. It would be great if, as the next step after the artifacts are built, these WARCs could be pushed to external storage. It would also be better to add a timestamp to the filename.

@ikreymer
Member Author

Yeah, timestamp is a good idea.. Maybe there should be a separate workflow for turning the artifacts into releases, which would be permanent.. perhaps on a version change?

@ibnesayeed
Contributor

> Maybe there should be a separate workflow for turning the artifacts into releases, which would be permanent.. perhaps on a version change?

I don't think this repository is tagged/versioned, but if we plan to do that every now and then after major changes, uploading workflow artifacts as release artifacts would be a good idea.

On the other hand, one can always recreate a WARC file of a prior state by checking the code out at a specific repo state, building the site, and running warcit on it.
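
One way to check out a prior state without disturbing the current working tree is a temporary `git worktree`. The sketch below only assembles the commands. The function name and directory are hypothetical, and the site would still need to be built and passed to warcit inside that worktree afterwards.

```python
from typing import List, Tuple


def worktree_commands(ref: str, worktree_dir: str) -> Tuple[List[str], List[str]]:
    """Return (add, remove) git commands for a detached checkout of `ref`."""
    # Check out the requested commit, tag, or branch into its own directory.
    add = ["git", "worktree", "add", "--detach", worktree_dir, ref]
    # Clean up once the WARC has been written; --force also drops build output.
    remove = ["git", "worktree", "remove", "--force", worktree_dir]
    return add, remove
```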

@ikreymer
Member Author

Latest update uses [user]-[repo]-[timestamp].warc.gz as the filename:
https://github.com/ikreymer/warc-specifications/actions/runs/170913379
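
For illustration, the naming pattern could be produced along these lines. This is a sketch, not the workflow's actual code: the function name is hypothetical, and the 14-digit UTC timestamp format is an assumption, since the thread does not say which format is used.

```python
from datetime import datetime, timezone
from typing import Optional


def warc_filename(user: str, repo: str, when: Optional[datetime] = None) -> str:
    """Build a [user]-[repo]-[timestamp].warc.gz filename."""
    when = when or datetime.now(timezone.utc)
    # Assumption: 14-digit UTC timestamp (YYYYMMDDhhmmss).
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{user}-{repo}-{timestamp}.warc.gz"


if __name__ == "__main__":
    print(warc_filename("ikreymer", "warc-specifications"))
```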

Contributor

@ibnesayeed ibnesayeed left a comment

These changes are intended to add the ability to manually trigger an event that creates a WARC file on demand at a supplied Git reference.

This, unfortunately, will not work on prior commit states, as the Gemfile will be missing. However, if we do not mandate inclusion of a Gemfile in the repo and instead install all the Ruby dependencies inline in this workflow file, then it should work on historical versions as well.

.github/workflows/warcit.yaml (resolved)
.github/workflows/warcit.yaml (outdated, resolved)
@ikreymer
Member Author

ikreymer commented Jul 16, 2020

> This, unfortunately, will not work on prior commit states, as the Gemfile will be missing. However, if we do not mandate inclusion of a Gemfile in the repo and instead install all the Ruby dependencies inline in this workflow file, then it should work on historical versions as well.

I suppose you can check if a Gemfile exists and, if not, create it on the fly..
I was thinking of this as a prototype for a more generic workflow that could be added to any repo, including non-Jekyll static sites.
So it should probably check:

  1. if no _config.yml, then not a Jekyll site, just warcit the root repo
  2. if _config.yml but no Gemfile, try adding a default one and building it before running Jekyll
  3. if Gemfile exists, just run Jekyll.

There may be more variations too, like if the gh pages root is in the docs directory.
Or maybe that should be a future PR/improvement.
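
The three checks above could be sketched as a small helper. The function name and the returned labels are hypothetical, and the `docs` directory variation is not handled here.

```python
from pathlib import Path
from typing import Union


def choose_strategy(repo_root: Union[str, Path]) -> str:
    """Pick a build strategy from the files present in the repo root."""
    root = Path(repo_root)
    if not (root / "_config.yml").is_file():
        # 1. No _config.yml: not a Jekyll site, run warcit on the repo root.
        return "warcit-root"
    if not (root / "Gemfile").is_file():
        # 2. Jekyll site without a Gemfile: add a default one, then build.
        return "default-gemfile-then-jekyll"
    # 3. Gemfile exists: just run Jekyll.
    return "jekyll"
```

Returning a label rather than running commands directly keeps the detection step easy to test on its own.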

@ibnesayeed
Contributor

> I suppose you can check if Gemfile exists and, if not, create it on the fly..

If we were to do that, then it would be simpler not to rely on a Gemfile at all and instead install all the packages necessary to replicate the default GH Pages builder.

> I was thinking of this as a prototype for a more generic workflow that could be added to any repo, including non-Jekyll static sites.

In that case you should be able to ask users to provide input variables to identify which category their site falls under while having a more sensible and common default. There are a handful of reusable actions to host static sites on GH Pages, built from many different static site generators.

2 participants