Shenanigan hoarding. Scraping random content from image boards and stuff
- python
- beautifulsoup4
- requests
- Pillow
- dotenv
- click
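For context, the actual scraping boils down to the usual requests + BeautifulSoup + Pillow loop. The sketch below is purely illustrative: the URL, CSS selector, and filenames are placeholders, not the scraper's real logic.

```python
# Illustration only — URL, selector, and filenames are placeholders.
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PIL import Image

page = requests.get("https://example-booru.invalid/posts?page=1", timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

# Grab every preview image on the page (hypothetical markup).
for n, img in enumerate(soup.select("img.post-preview")):
    src = img.get("src")
    if not src:
        continue
    raw = requests.get(src, timeout=30).content
    # Pillow opens the download (raising if it isn't a real image) and saves it.
    Image.open(BytesIO(raw)).convert("RGB").save(f"post_{n}.jpg")
```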
i. Download the ZIP file of the repo and extract it
ii. Create a virtual environment inside the project (optional, but recommended; see the example commands after these steps)
iii. Run the following command in a terminal inside the project directory
pip install --editable .
iv. Done
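For reference, steps ii and iii usually look like this on Windows (the venv folder name is arbitrary):
py -m venv .venv
.venv\Scripts\activate
pip install --editable .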
py scraper {command} {args}
- danbooru
- gelbooru
- safebooru
- nhentai
- zerochan
- yandere
| Options | Values | Description |
|---|---|---|
| --url | None | The website section/page to scrape. For nhentai, the URL of the specific doujinshi |
| --pages | 1~ | The number of pages to scrape. Limited by how many pages the website can load |
| --score | 0~ | The minimum score a post must have to be scraped |
| --rating | q e s g | q = Questionable, e = Explicit, s = Sensitive/Safe, g = General |
| --scope | img, vid | img = images only, vid = include videos |
| --clearance | None | Cookie from nhentai that validates the request; required on first use |
| --token | None | Cookie from nhentai that validates the request; required on first use |
| --save | flag | Save the supplied cookies for future use |
| --clean | flag | Remove the doujinshi subdirectory |
| --zhash | None | Cookie from zerochan that grants registered users access to the Subscribed section; required on first use |
| --zid | None | Cookie from zerochan that grants registered users access to the Subscribed section; required on first use |
url
- The URL of the website to scrape. Browse to your preferred page/section first and pass its URL.
pages
- The number of pages to scrape. By default, scraping starts from the page given by the URL, which is page 1. To start at a specific page, navigate to that page and use its URL.
score
- The minimum upvote/score/favorite count a post must have to be included in the scraping process; posts below it are skipped.
rating
- Categories or types a post must have to be scraped. q, e, s, and g are all available for danbooru and gelbooru; e, s, and g are available for safebooru and yandere, where s means Safe.
scope
- Content types to include in the process: images only, or images and videos.
zhash and zid
- Cookies that let the scraper access the Subscribed section (registered-user feature)
clearance and token
- Cookies that validate requests to the website (see the sketch after this list)
save
- Saves the just-entered cookies to environment variables so they can be reused in future runs.
clean
- Removes the subdirectory where doujinshi pages were downloaded once the session ends.
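To make the cookie and filter options above less abstract, here is a minimal sketch of how they could be wired into requests and BeautifulSoup. The cookie names, endpoint, and data-* attributes are assumptions for illustration, not the scraper's confirmed internals.

```python
# Illustration only — cookie names, URL, and markup are assumptions.
import requests
from bs4 import BeautifulSoup

MIN_SCORE = 10                       # --score 10
ALLOWED_RATINGS = {"q", "e", "s"}    # --rating q e s

session = requests.Session()
# --clearance / --token (or --zhash / --zid) end up as request cookies:
session.cookies.set("clearance", "cookie1")
session.cookies.set("token", "cookie2")

page = session.get("https://example-booru.invalid/posts?page=1", timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

for post in soup.select("article.post"):        # hypothetical markup
    score = int(post.get("data-score", 0))
    rating = post.get("data-rating", "g")
    if score < MIN_SCORE or rating not in ALLOWED_RATINGS:
        continue                                # skipped by --score / --rating
    print(post.get("data-file-url"))
```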
py scraper.py danbooru --url https://mybooruurl --pages 3 --score 10 --rating q e s --scope vid
py scraper.py gelbooru --url https://mybooruurl --pages 3 --score 10 --rating q e s --scope vid
py scraper.py safebooru --url https://mybooruurl --pages 3 --score 10 --rating e s
py scraper.py yandere --url https://mybooruurl --pages 3 --score 10 --rating q e s --scope vid
py scraper.py nhentai https://nhentai.net/g/123456 --clearance cookie1 --token cookie2 --save --clean
py scraper.py nhentai https://nhentai.net/g/123456 --clearance cookie1 --token cookie2 --save
i. Open nhentai.net and authenticate yourself (captcha)
ii. Open the developer tools / inspect ( Ctrl + Shift + I / F12 )
iii. Go to the Storage section and copy the values of the two cookies
For Chromium browsers, the Storage section may be accessed through the >> icon at the far right of the devtools
Important
Cookies expire after a year, I think. Keep that in mind in case you're still using this junk and it still works by then.
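Since dotenv is in the dependency list and --save keeps cookies around for future runs, persisting them presumably looks something like this. The .env path and variable names are guesses, not the scraper's actual keys.

```python
# Illustration only — the variable names are guesses, not the real keys.
import os
from pathlib import Path

from dotenv import load_dotenv, set_key

ENV_FILE = ".env"
Path(ENV_FILE).touch(exist_ok=True)  # make sure the file exists

# What --save would do: write the freshly pasted cookies to the .env file...
set_key(ENV_FILE, "NHENTAI_CLEARANCE", "cookie1")
set_key(ENV_FILE, "NHENTAI_TOKEN", "cookie2")

# ...so later runs can reload them instead of asking again.
load_dotenv(ENV_FILE)
clearance = os.getenv("NHENTAI_CLEARANCE")
token = os.getenv("NHENTAI_TOKEN")
```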
Although this can scrape the Subscribed section, it does not allow scraping registered-user-only content (will find a way soon)
py scraper.py zerochan --url https://zerochanurl --zhash cookie1 --zid cookie2 --pages 10 --score 5 --keep
i. Open zerochan.net and authenticate yourself (captcha)
ii. Open the developer tools / inspect ( Ctrl + Shift + I / F12 )
iii. Go to the Storage section and copy the values of the two cookies
For Chromium browsers, the Storage section may be accessed through the >> icon at the far right of the devtools
Important
The cookies expire after a month. Make sure to replace them when the time comes, if you're still using this junk and it still works.