Dataset-Curation-Tool

A tool for downloading from public image boards (those that allow scraping), previewing your images and tags, and editing your tags. Additional tabs let you download other code repositories, as well as state-of-the-art diffusion and CLIP models. Custom datasets can be added!

WIKI-Page / Tutorial for this Repository HERE

General Config

Installation Requirements

Make sure you have git installed!

Download the Windows, macOS, or Linux run file (the repo will be installed for you):

Windows Download

Linux Download

MacOS Download

macOS and Linux users should make the run file executable with the following terminal command:

chmod +x linux_run.sh

OR

chmod +x mac_run.sh

Other System Install Options

  • Unzip, for extracting the (optionally) downloaded zip files

(Linux)

sudo apt-get install unzip

How to Run Program

Do NOT run the file with admin/sudo permissions!

Do NOT put the manually downloaded run file (from the installation step above) in the Data-Curation-Tool folder!

Do NOT use the run files in the Data-Curation-Tool folder! Use the manually downloaded run file from the installation step above to install and/or update the repo.

Do NOT move the generated "dataset_curation_path.txt" file out of the Data-Curation-Tool folder!

The duplicate run files (run.bat, mac_run.sh, linux_run.sh) in the Data-Curation-Tool folder are intentionally deleted when the program is run.

Double-click the run file to run with default settings.

To update the dependencies listed in the YAML file, run the following (make sure to use the most recent YAML file in the repo: https://raw.githubusercontent.com/x-CK-x/Dataset-Curation-Tool/main/environment.yml):

./RUN_FILE --update

Below are several additional run options to choose from

Run with sharing turned on : Provides a live link that anyone can use

./RUN_FILE --share

Run password protected : Requires user to type in a username & password to access the webUI

./RUN_FILE --server_port 7860 --username NAME --password PASS

Run on a specified port : Serves the webUI on the specified port

./RUN_FILE --server_port 7860

Or choose any combination of the options above.
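
As a rough illustration of how these flags can combine, here is a minimal Python sketch. It is not the tool's actual launcher code: the function names `build_parser` and `launch_kwargs` are hypothetical, and the mapping onto Gradio's `launch()` keyword arguments (`share`, `server_port`, `auth`) is an assumption about how a Gradio-based webUI would typically consume them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the flags documented above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Run-option sketch")
    parser.add_argument("--update", action="store_true", help="update dependencies")
    parser.add_argument("--share", action="store_true", help="create a live link")
    parser.add_argument("--server_port", type=int, default=None, help="webUI port")
    parser.add_argument("--username", default=None, help="login name")
    parser.add_argument("--password", default=None, help="login password")
    return parser


def launch_kwargs(args: argparse.Namespace) -> dict:
    """Translate parsed flags into Gradio-style launch keyword arguments."""
    kwargs = {"share": args.share}
    if args.server_port is not None:
        kwargs["server_port"] = args.server_port
    # Only enable authentication when both credentials are supplied.
    if args.username and args.password:
        kwargs["auth"] = (args.username, args.password)
    return kwargs


if __name__ == "__main__":
    demo = ["--share", "--server_port", "7860", "--username", "NAME", "--password", "PASS"]
    print(launch_kwargs(build_parser().parse_args(demo)))
```

Because every flag is optional and independent in this sketch, any subset can be passed together, which mirrors the "any combination" behaviour described above.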

Important Information

  • The most current stable build is v4.4.5.
  • New users are strongly encouraged to use releases instead of pulling from the main branch.
  • Avoid the alpha builds in the releases.
  • An alpha build, if present, is labeled as a pre-release, and the main branch of the repo is likely to contain the same changes; please use the most recent stable build noted above.
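
The "latest stable, never a pre-release" rule above can be sketched in Python as follows. This is an illustrative helper, not part of the tool: `latest_stable` and `version_key` are hypothetical names, the `tag_name`, `prerelease`, and `draft` fields follow the shape of release objects returned by the GitHub REST API, and the sample list is invented for the example apart from the v4.4.5 tag named above.

```python
import re


def version_key(tag: str) -> tuple:
    """Turn a tag such as 'v4.4.5' into a sortable tuple of integers."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def latest_stable(releases: list):
    """Return the highest-versioned tag that is neither a pre-release nor a draft."""
    stable = [
        release["tag_name"]
        for release in releases
        if not release.get("prerelease") and not release.get("draft")
    ]
    return max(stable, key=version_key) if stable else None


# Invented sample data: two stable builds and one alpha pre-release.
sample = [
    {"tag_name": "v4.4.4", "prerelease": False},
    {"tag_name": "v4.4.5", "prerelease": False},
    {"tag_name": "v4.5.0-alpha", "prerelease": True},
]
print(latest_stable(sample))
```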

Bug Reporting & Troubleshooting

Create a Support Ticket or Bug Report here: https://github.com/x-CK-x/Dataset-Curation-Tool/issues

New Feature Requests

Feel free to suggest new feature/s here: https://github.com/x-CK-x/Dataset-Curation-Tool/discussions/categories/ideas

Future Objectives/Features

  • Update the existing Version_3 WebUI WIKI Page for the Version_4 WebUI
  • Finish Code Refactor
  • Conda setup instructions
  • CSV load time optimization with the pandas framework
  • .sh & .bat installer scripts for conda
  • Image Board manager class object
  • PNG Info & tag combination options

New features have been paused as of 09/05/2023, unless contributors are willing to develop any of the other features.

Support for new image-board-specific tagging/captioning models will be added as they are released. (There is currently no ETA on those models, which are being developed by others.)

Contributors are welcome to open a Pull Request for their work, and I will promptly review it for inclusion.

  • Add aliases for tag suggestions in the textboxes
  • Add Support for brand new tag & captioning models & tag combining options
    • deepdanbooru
    • huggingface IDEFICS (api call)
    • gpt-4 (api call)
  • Add an auto-caption feature that uses various heuristics to determine which tags from each auto-tag/caption model are best
  • Include support for a variety of public image boards
  • Add De-Noise & Upscale Models, e.g. StableSR
  • Add Segmentation & Detection Models, e.g. SegmentAnything-HQ
  • Add Cross Attention Visualization DAAM
  • Add Grad-CAM
  • Add UMAP
  • Color code tag categories for tag suggestions in the dropdown menus (blocked : gradio-app/gradio#4988)

Additional Information

Default folder directory tree
base_folder/
├─ batch_folder/
│  ├─ downloaded_posts_folder/
│  │  ├─ png_folder/
│  │  ├─ jpg_folder/
│  │  ├─ gif_folder/
│  │  ├─ webm_folder/
│  │  ├─ swf_folder/
│  ├─ resized_img_folder/
│  ├─ tag_count_list_folder/
│  │  ├─ tags.csv
│  │  ├─ tag_category.csv
│  ├─ save_searched_list_path.txt

Any file path parameter that is left empty will use the default path.
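
A minimal Python sketch of this fallback is shown below. It is an assumption about the behaviour, not the tool's own code: `resolve_path` and `DEFAULTS` are hypothetical names, and the default locations are simply read off the directory tree above.

```python
from pathlib import Path

# Default locations relative to base_folder/batch_folder, taken from the tree above.
DEFAULTS = {
    "downloaded_posts_folder": "downloaded_posts_folder",
    "resized_img_folder": "resized_img_folder",
    "tag_count_list_folder": "tag_count_list_folder",
    "save_searched_list_path": "save_searched_list_path.txt",
}


def resolve_path(name: str, value: str, batch_dir: Path) -> Path:
    """Return the user-supplied path, or the default when the value is empty."""
    if value and value.strip():
        return Path(value)
    return batch_dir / DEFAULTS[name]


batch_dir = Path("base_folder") / "batch_folder"
print(resolve_path("resized_img_folder", "", batch_dir))
print(resolve_path("resized_img_folder", "custom/out", batch_dir))
```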

Files/folders that use the same path are merged, not overwritten. For example, using the same path for save_searched_list_path at every batch will result in a combined searched list of every batch in one .txt file.
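
The merging behaviour described for the searched list can be pictured as appending rather than rewriting. The following Python sketch assumes one search per line; `append_searched` is a hypothetical helper, not a function from the tool.

```python
from pathlib import Path


def append_searched(list_path: Path, searches: list) -> None:
    """Append this batch's searches to the shared list instead of overwriting it."""
    list_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" keeps the entries written by earlier batches.
    with list_path.open("a", encoding="utf-8") as handle:
        for search in searches:
            handle.write(search + "\n")
```

Calling `append_searched` once per batch with the same `list_path` leaves a single .txt file containing the searches of every batch, in the order the batches ran.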

Notes

  • When downloading, a file that already exists is skipped. If the file exists but was modified, it is downloaded again and the modified file is renamed. Hence, I recommend not setting delete_original to true if you plan to redownload using the same destination folder.
  • When resizing, if resized_img_folder differs from the source folder and the file already exists in the destination folder, it is skipped. The existing file is not checked against the specified min_short_side.
  • When running a new batch using the same save directories for tag files, tag count folders, and save_searched_list, the tag files, tag count CSVs, and save_searched_lists will be overwritten.

For more information or help with the downloading script, please see the original image board downloader script: https://github.com/pikaflufftuft/pikaft-e621-posts-downloader
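
The download rule in the first note (skip unchanged files, rename modified ones before redownloading) can be sketched as below. This is only one plausible reading: detecting a modification by comparing MD5 checksums and the `_modified` suffix are both assumptions, and `prepare_download` and `md5_of` are hypothetical names rather than functions from the downloader script.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def prepare_download(dest: Path, expected_md5: str) -> bool:
    """Return True if the file should be downloaded, False if it can be skipped."""
    if not dest.exists():
        return True  # nothing on disk yet
    if md5_of(dest) == expected_md5:
        return False  # already downloaded and unchanged: skip
    # The file was modified locally: keep it under a new (assumed) name,
    # which frees the original name for a fresh download.
    backup = dest.with_name(dest.stem + "_modified" + dest.suffix)
    dest.replace(backup)
    return True
```

Under this reading, deleting the original after processing would make every rerun look like a missing file, which is consistent with the recommendation against setting delete_original to true when reusing a destination folder.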

License

MIT

Usage conditions

By using this downloader, the user agrees that the author is not liable for any misuse of this downloader. This downloader is open-source and free to use.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
