This repo gives public access to the scripts I use to automate my monthly arXiv paper skimming series on YouTube. I don't think anybody will find the majority of these files useful, but who knows.
If you're looking for the backlog of papers I've looked into each week in my videos, check out the following two files:
- papers_downloaded.csv: includes every single paper that has shown up on the weekly paper videos since 2024/06/21. From reading the abstracts of these papers I selected which papers would make their way into the following file
- papers_kept.csv: includes every single paper that has shown up on the monthly substack newsletter/podcast since 2024/06/21. These are the papers that I actually bother starting to read, some percentage of which get deleted part of the way through, some get read but never discussed again, and some get read & talked about on the channel in one of my paper breakdown videos
arxiv-link-downloader.py
- this script takes any number of arXiv links as input, downloads the papers, and adds them to links.txt, papers_seen.csv, and papers_downloaded.csv
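As a rough illustration of the link handling described above (not the actual contents of arxiv-link-downloader.py), here is a minimal sketch that turns an arXiv link into a direct PDF URL and appends links to links.txt. The helper names `arxiv_pdf_url` and `append_links` are hypothetical, and the sketch assumes new-style arXiv identifiers such as 2406.12345.

```python
import re

# Matches new-style arXiv identifiers such as 2406.12345 or 2406.12345v2.
ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def arxiv_pdf_url(link: str) -> str:
    """Turn an arXiv abs/pdf link into a direct PDF URL (hypothetical helper)."""
    match = ARXIV_ID.search(link)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {link!r}")
    return "https://arxiv.org/pdf/" + match.group(0)


def append_links(links, path="links.txt"):
    """Append each link on its own line (assumed layout of links.txt)."""
    with open(path, "a", encoding="utf-8") as handle:
        for link in links:
            handle.write(link + "\n")
```

Old-style identifiers (for example cs/0101001) are not handled here and would raise the ValueError.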
arxiv-search.py
- this script opens up an app window with a list of paper titles and lets you download these papers into pdfs/ with the click of a button. It selects them according to the search criteria specified in config.py; by default the search terms are ones that I prefer and it shows you the most recent papers you've not yet seen, capped at 2000 total (I don't recommend sifting through that many in one sitting, it's mind-numbing). Whenever this is run to completion, every single paper in the list gets added to papers_seen.csv. Whenever you download a file, the script writes the arXiv link into links.txt for use later and a bunch of info into papers_downloaded.csv, in the hopes that I'll one day be able to train a model to select papers for me using these two csv files.
cleanup.py
- this will take any pdf files in pdfs-to-summarize/ and send them along with corresponding .md files to your Obsidian vault if you have that enabled in config.py. You need to specify the location of your Obsidian vault in config.py in order for it to work. When sending files to Obsidian, it also records the fact that you decided to keep them by adding lines to papers_kept.csv; if you want to use that csv but don't want to use Obsidian then hop into config.py to change that setting. Finally, it deletes all of the files that are generated by all the other scripts.
config.py
- where you can change a couple of settings if you'd like.
newsletter-podcast.py
- this will consume all PDFs in the pdfs-to-summarize/ folder and use OpenAI's API to generate summaries which will go into newsletter.txt. It then turns this newsletter into an mp3 file for a podcast using OpenAI's TTS. You need to create a file key_openai.txt and paste in your individual (not organization) OpenAI API key in order for this to work
recording.py
- this file handles everything that happens during the actual video recordings.
  - Running it begins the hotkey listener
  - Hitting the primary hotkey ("[" by default) the first time begins a timer, opens the link from the first line of links.txt in your default browser, and writes the first timestamp to timestamps.txt
  - Hitting the hotkey any following time both records the next timestamp and opens the next link
    - note: In timestamps.txt common phrases (such as "Neural Network") are shortened to acronyms (e.g. "NN"). Add/remove phrases in config.py. Delete all phrases from the config to remove this functionality
  - Hitting the delete_hotkey ("]" by default) deletes the previously written timestamp. This is very useful in conjunction with the automatic video silence remover from my auto-video-editing-suite, as it lets me leave a given paper out of the video if I get to it and find it boring
  - Hit Esc to end the timer and script
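To make the timestamp and acronym behavior above a bit more concrete, here is a small sketch of how one line of timestamps.txt could be built. It is not the code in recording.py: the function names, the "M:SS Title" line layout, and passing the phrases in as a dict are all assumptions for illustration.

```python
import re


def shorten(title, acronyms):
    """Replace each phrase with its acronym, ignoring case."""
    for phrase, short in acronyms.items():
        # A callable replacement avoids backslash-escape surprises in `short`.
        title = re.sub(re.escape(phrase), lambda _m, s=short: s, title,
                       flags=re.IGNORECASE)
    return title


def timestamp_line(elapsed_seconds, title, acronyms):
    """Format one line as M:SS (or H:MM:SS) followed by the shortened title."""
    total = int(elapsed_seconds)
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    if hours:
        stamp = f"{hours}:{minutes:02d}:{seconds:02d}"
    else:
        stamp = f"{minutes}:{seconds:02d}"
    return f"{stamp} {shorten(title, acronyms)}"


print(timestamp_line(75, "Graph Neural Network Pruning", {"Neural Network": "NN"}))
```

Passing an empty dict leaves titles untouched, which mirrors the "delete all phrases from the config" behavior described above.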
timestamp_trimmer.py
- this will trim lines from timestamps.txt until they get below a specified character count (4,500 by default; YouTube's max description length is 5,000 characters), trimming the lines with the shortest time length first
papers_seen.csv
- includes every single paper that had its title pass in front of my eyes BEFORE the weekly abstract reading video; this is basically every single paper that gets published to arXiv under the AI category and every tangentially related category. From reading these titles I use arxiv-search.py to select which papers will go in the following file
papers_downloaded.csv
- includes every single paper that has shown up on the monthly paper videos since 2024/06/21. From reading the abstracts of these papers I selected which papers would make their way into the following file by moving them into pdfs-to-summarize/ during the video
papers_kept.csv
- includes every single paper that has shown up on the monthly substack newsletter/podcast since 2024/06/21. These are the papers that I actually bother starting to read, some percentage of which get deleted part of the way through, some get read but never discussed again, and some get read & talked about on the channel in one of my paper breakdown videos
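The trimming rule described for timestamp_trimmer.py can be sketched roughly as follows. This is only one reading of "shortest time length first": it assumes whole lines are dropped, that each entry's length is the gap until the next timestamp, and that entries arrive as (start_seconds, line_text) pairs. The function name and input shape are hypothetical.

```python
def trim_timestamps(entries, max_chars=4500):
    """Drop the shortest-lasting entries until the joined text fits max_chars.

    entries: list of (start_seconds, line_text) tuples in video order.
    The last entry has no known end time, so this sketch never drops it.
    """
    kept = list(entries)

    def total_length(items):
        # Characters of every line plus one newline between adjacent lines.
        return sum(len(text) for _, text in items) + max(len(items) - 1, 0)

    while len(kept) > 1 and total_length(kept) > max_chars:
        # Duration of entry i is the gap to the next remaining entry.
        durations = [kept[i + 1][0] - kept[i][0] for i in range(len(kept) - 1)]
        shortest = durations.index(min(durations))
        del kept[shortest]
    return [text for _, text in kept]
```

Durations are recomputed after every removal, so dropping a line effectively folds its time into the entry before it.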
- Clone the repository to your local machine.
- Create a virtual environment and install the required Python packages by running pip install -r requirements.txt
- Obtain an API key from OpenAI and save it in a file named key_openai.txt in the root directory of the repository.
- Run cleanup.py to get rid of all the pdf and text files that I may or may not have left in here by accident on the most recent push. I'd also recommend deleting papers_seen.csv, papers_downloaded.csv, and papers_kept.csv, assuming you plan to use them to train your own custom recommendation model (like I plan to one day) and don't want my data confounding yours.
- If you don't use Obsidian then open config.py and set send-to-obsidian = False. If you plan to send files to an Obsidian vault then open config.py and define directories for your/obsidian/vault/location/here and your/obsidian/vault/location/here/attachments-folder. If your Obsidian vault location and the attachments location are the same then just set them both to the same path. Also inside config.py you can edit frontmatter_lines to fit your own tagging system. By default the first line after the tags will be a link to the file, and you'd have to jump into cleanup.py to change that
- Maybe peruse config.py to check settings and try to gain a better understanding of this monstrosity I've created
- Write out your search terms in search_terms_include.txt and search_terms_exclude.txt to fit your use-case. Each search term should be on its own line. If you just want all of today's newest papers then leave both blank. Personally, I exclude papers that I know I'm not going to be interested in, for example anything related to the medical field. Also, by default it will only download papers published after the date in most_recent_day_searched.txt, but if you'd like to disable that then open up config.py and set restrict_to_most_recent = False.
- Run arxiv-search.py, wait for it to finish printing out every title and link to the console, and then it should create a little app window. Drag to expand this window and you'll see a bunch of buttons with names of papers. Click on a paper and it'll be downloaded to pdfs/
  - If no papers show up and you get a blank window, that's either because
    - the arXiv API wrapper is bugging out. Just run it a couple times until it works, preferably waiting at least 15 minutes if not an hour between attempts
    - the conditions of your search are such that no results were returned. Try removing lines from search_terms_exclude.txt, adding terms to search_terms_include.txt, changing most_recent_day_searched.txt to an earlier date, or adjusting the restrict_to_most_recent, max_results, or categories variables in config.py.
- Run recording.py and the link to the first paper in links.txt will open up in your default browser
  - 3b. Once you're ready to record, hit record and the primary hotkey ("[" by default but configurable in config.py) at the same time, which will start the timer, record the first timestamp, and open the first link
  - 3c. Every successive hit of the primary hotkey will both open the corresponding link and record the timestamp
  - 3d. If you hit the delete_hotkey ("]" by default) then the most recently recorded timestamp will be deleted
  - 3e. Hit Esc to end the script
- If timestamps.txt is too long to fit into YouTube's description, run timestamp_trimmer.py
- Run the newsletter-podcast.py script to generate a newsletter.txt and podcast.mp3 based on all the pdf files in pdfs-to-summarize/. Basically it has ChatGPT summarize each paper and then has OpenAI's TTS model read the summaries to create the podcast; look in config.py to adjust the prompt. I assumed that during step 3 (recording) you were dropping every pdf file you wanted to read into that folder
- Once you're finished, run cleanup.py. This will send the pdfs in pdfs-to-summarize/ to your Obsidian vault and create corresponding markdown notes for them, and then delete all files created by the previous scripts to clean up the repo
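For the include/exclude search term files mentioned in the steps above, here is a minimal sketch of how such a filter could work on paper titles. How arxiv-search.py actually applies the terms is not stated here, so the semantics below are an assumption: a title is dropped if it contains any exclude term, and otherwise kept if the include list is empty or any include term appears. Both function names are hypothetical.

```python
from pathlib import Path


def load_terms(path):
    """Read one search term per line, lowercased; a missing file means no terms."""
    file = Path(path)
    if not file.exists():
        return []
    lines = file.read_text(encoding="utf-8").splitlines()
    return [line.strip().lower() for line in lines if line.strip()]


def title_matches(title, include, exclude):
    """Assumed rule: no exclude term present, and any include term (if given)."""
    lowered = title.lower()  # terms are expected to be lowercase already
    if any(term in lowered for term in exclude):
        return False
    return not include or any(term in lowered for term in include)


include = load_terms("search_terms_include.txt")
exclude = load_terms("search_terms_exclude.txt")
print(title_matches("Sparse Attention for Long Context", include, exclude))
```

Under this reading, leaving both files blank lets every title through, which matches the "leave both blank" advice.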
- over-engineer arxiv-search.py to make it easier for me to go long periods of time without checking papers
- train a model (BERT based?) off of papers_seen.csv, papers_downloaded.csv, and papers_kept.csv to automatically grab the papers that I find interesting in a given week, rather than having to read through the boring list myself
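As a sketch of how the three csv files might eventually become training labels, the snippet below maps each seen paper to how far it made it through the funnel. Nothing like this exists in the repo yet; the function name and the idea of identifying papers by some shared id (for example the arXiv link) are assumptions, and reading the ids out of the csv files is left out because their column layout isn't described here.

```python
def funnel_labels(seen_ids, downloaded_ids, kept_ids):
    """Map each seen paper id to 0 (seen only), 1 (downloaded), or 2 (kept)."""
    downloaded, kept = set(downloaded_ids), set(kept_ids)
    labels = {}
    for paper_id in seen_ids:
        if paper_id in kept:
            labels[paper_id] = 2
        elif paper_id in downloaded:
            labels[paper_id] = 1
        else:
            labels[paper_id] = 0
    return labels


print(funnel_labels(["a", "b", "c"], ["b", "c"], ["c"]))
```

An ordinal label like this could feed a classifier on titles and abstracts, though a simple seen-vs-downloaded binary split might be an easier place to start.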