This project automates gathering information from a variety of key Bitcoin-related sources. It uses GitHub Actions to schedule nightly cron jobs, ensuring that the most up-to-date content is captured from each source at its defined frequency. The scraped data is then stored in an Elasticsearch index.
Below is a detailed breakdown of the sources scraped and the schedule for each:
Daily at 00:00 UTC
Weekly
Additionally, for on-demand scraping tasks, we utilize a Scrapybot, details of which can be found in the Scrapybot section below.
1. You need an env file to indicate where you are pushing the data:

   ```sh
   cp .env.sample .env
   ```

   Then fill in `ES_LOCAL_URL=` in `.env` with your local Elasticsearch URL.

2. Install the dependencies:

   ```sh
   cd common && yarn install && cd ../mailing-list && yarn install && cd ..
   ```

3. Run a scraper, for example:

   ```sh
   node mailing-list/main.js
   ```

   with additional env vars such as `URL='https://lists.linuxfoundation.org/pipermail/bitcoin-dev/'` and `NAME='bitcoin'`.

   3a. Or you can do something like:

   ```sh
   cd bitcointranscripts && pip install -r requirements.txt && cd .. && python3 bitcointranscripts/main.py
   ```

You should be calling the scrapers from the root dir because they use the `common` dir.
We have implemented a variety of crawlers (spiders), each designed for a specific website of interest. You can find all the spiders in the `scrapybot/scrapybot/spiders` directory.

This section explains how to run the scrapers in the `scrapybot` folder.
To run a crawler using scrapybot, for example `rusty`, which scrapes the site https://rusty.ozlabs.org, switch to the root directory (where this README is) and run these commands from your terminal:

```sh
pip install -r requirements.txt && cd scrapybot
scrapy crawl rusty -O rusty.json
```

The above commands install the scrapy dependencies, then run the `rusty` spider (one of the crawlers) and store the collected documents in a `rusty.json` file in the `scrapybot` project directory. The same procedure can be applied to any of the crawlers in the `scrapybot/spiders` directory.
There is also a script in the `scrapybot` directory called `scraper.sh` which can run all the spiders at once.
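As a rough illustration of what such a run-everything helper does, the sketch below builds one `scrapy crawl` command per spider, mirroring the manual invocation shown above. The spider list here is an assumption for the example; the actual `scraper.sh` may discover and run spiders differently.

```python
# Sketch: generate one "scrapy crawl <name> -O <name>.json" command per spider.
# The spider names are placeholders; the real scraper.sh may differ.

SPIDERS = ["rusty"]  # e.g. the spider names found in scrapybot/scrapybot/spiders

def crawl_commands(spiders):
    # Mirrors the manual invocation above: scrapy crawl <name> -O <name>.json
    return [f"scrapy crawl {name} -O {name}.json" for name in spiders]

for command in crawl_commands(SPIDERS):
    print(command)
```
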
To supply your Elasticsearch credentials, create an `example.ini` file inside the `scrapybot` directory with the following contents:
```ini
[ELASTIC]
cloud_id = your_cloud_id
user = your_elasticsearch_username
password = your_elasticsearch_password
```
The `pipelines.py` file in the `scrapybot` directory reads the above file to load your Elasticsearch credentials with the line below:

```python
config.read("/path/to/your/example.ini")
```

Replace what's in quotes with the path to your actual `ini` file.
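For reference, a minimal standalone sketch of this credential loading with Python's standard `configparser` (the temporary file here just stands in for your real `ini` path; the section and key names follow the sample above):

```python
# Sketch: load Elasticsearch credentials the way pipelines.py could,
# using a temp file in place of /path/to/your/example.ini.
import configparser
import os
import tempfile

sample = """\
[ELASTIC]
cloud_id = your_cloud_id
user = your_elasticsearch_username
password = your_elasticsearch_password
"""

with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as f:
    f.write(sample)
    ini_path = f.name

config = configparser.ConfigParser()
config.read(ini_path)  # in pipelines.py: config.read("/path/to/your/example.ini")

cloud_id = config["ELASTIC"]["cloud_id"]
user = config["ELASTIC"]["user"]
password = config["ELASTIC"]["password"]

os.remove(ini_path)
```
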
The bitcointalk forum scraper takes many hours to run from scratch, so to start from the beginning you'll need to run it on a server rather than in GitHub Actions, which has a 6-hour timeout. A timeout on GitHub Actions is not a big deal, because the scraper is written to index the last 100 posts and work in reverse chronological order.

When testing on your local machine, it won't re-pull posts it already has, but on GitHub Actions it will. This could be optimized.