chats-crawler

English | 简体中文

Crawls chat data from Discourse-based websites and parses it on the fly into a format ready for LLM instruction finetuning. The data includes texts, images (crucial for multimodal finetuning), and links. Support for more than Discourse-based websites is coming soon.

Table of Contents

• Quick Start
• Examples
• Notice
• Future Work
• License
• Citation
• Acknowledgement

Quick Start

Run

git clone https://github.com/jackfsuia/chats-crawler.git && cd chats-crawler

Then install the dependencies, run

npm i

Before crawling, please read the Notice. Configure the target website in config.ts: edit the url and rex properties to match your needs, i.e., replace the two https://discuss.pytorch.org occurrences there with your target Discourse-based website. Discourse-based websites basically all look like this: [screenshot omitted]

To start crawling, run

npm start

That's all! The Discourse chat data is saved at storage/datasets/default as .json files, and the images at storage/datasets/imgs.
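
For post-processing, each dataset file can be read back as plain JSON. A minimal sketch in TypeScript, assuming each file is an object with a "conversations" string field as shown in the Examples section below; everything else here (directory walking, preview length) is just illustration:

import { readdirSync, readFileSync } from "fs";
import { join } from "path";

// Print a short preview of every crawled conversation.
const dir = "storage/datasets/default";
for (const name of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const record = JSON.parse(readFileSync(join(dir, name), "utf8"));
  console.log(name, record.conversations?.slice(0, 80)); // first 80 characters
}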

Examples

Let's say we are crawling https://discuss.pytorch.org. We should edit config.ts as:

...
url: "https://discuss.pytorch.org/",
...
rex: "https://discuss.pytorch.org/t/[^/]+/[0-9]+$",
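
The rex pattern matches Discourse topic pages, whose URLs end in /t/<topic-slug>/<numeric-id>. A quick sanity check (the two sample URLs below are made up for illustration):

// Topic pages match; other pages (categories, user profiles, ...) do not.
const rex = new RegExp("https://discuss.pytorch.org/t/[^/]+/[0-9]+$");
console.log(rex.test("https://discuss.pytorch.org/t/some-topic-slug/12345")); // true
console.log(rex.test("https://discuss.pytorch.org/c/vision/15"));             // false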

For one of the crawled chat pages, the "conversations" property in the corresponding .json file under storage/datasets/default will look like this:

<# ztf-ucasTengfei Zhang #>:
How to delete a Tensor in GPU to free up memory?
I can get a Tensor in GPU by Tensor.cuda(), but it just returns a copy in GPU. I wonder how can I delete this Tensor in GPU? I try to delete it with “del Tnesor” but it doesn’t work.


              Quote:"
                Could you show a minimum example? The following code works for me for PyTorch 1.1.0:
import torch
a = torch.zero(300000000, dtype=torch.int8, device='cuda')
b = torch.zero(300000000, dtype=torch.int8, device='cuda')
# Check GPU memory using nvidia-smi
del a
torch.cuda.empty_cache()
# Check GPU memo…
              "

<# smth #>:
del Tensor will delete it from GPU memory. Why do you think it doesn’t work?
<# ztf-ucasTengfei Zhang #>:
Thank you very much!
I loaded an OrderedDict of pre-trained weights to gpu by torch.load(), then used a for loop to delete its elements, but there was no change in gpu memory.
Besides, it is strange that there was no change in gpu memory even I deleted the OrderedDict of pre-trained weights.
Pytorch version is 0.4.0.2
...

<# ztf-ucasTengfei Zhang #> and <# smth #> are the two posters' names, formatted this way so you can easily template the data for instruction-finetuning LLMs (e.g., replace <# smth #> with <assistant>, and <# ztf-ucasTengfei Zhang #> with <user>, etc.). If there are images interspersed in the texts, they are not only downloaded and saved in storage/datasets/imgs under a new FILENAME, but also replaced in place with "[img FILENAME]" in the texts. If there are links interspersed in the texts, they are replaced in place with "[link LINK]" in the texts. All other elements are deleted.
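
One possible way to do that templating, as a minimal sketch: it assumes the "conversations" string uses the <# name #>: markers shown above; the role assignment (first poster = user, everyone else = assistant) and the function name are our own assumptions, not part of the crawler:

// Split a "conversations" string into role-tagged turns.
function toChatTurns(conversations: string): { role: string; text: string }[] {
  // The capturing group keeps poster names in the split result:
  // [preamble, name1, text1, name2, text2, ...]
  const parts = conversations.split(/<# (.+?) #>:/g);
  const firstPoster = parts[1]; // assumption: the first poster is the "user"
  const turns: { role: string; text: string }[] = [];
  for (let i = 1; i < parts.length; i += 2) {
    const role = parts[i] === firstPoster ? "user" : "assistant";
    turns.push({ role, text: parts[i + 1].trim() });
  }
  return turns;
}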

Do you like this repo? Give us a ⭐

Notice

Make sure yourself that the crawling is legal; check the website's robots.txt if you're not sure. We are not responsible for any legal risks or issues.

Future Work

  • Support automatic OCR of image data into text, inserted among the original text data. This makes the data complete in text form, and also saves some space if the OCR happens during crawling rather than after it (see the sketch below).
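
For illustration, the OCR step could look like this sketch using the tesseract.js library (our pick for the sketch; the project does not currently ship this):

import Tesseract from "tesseract.js";

// Hypothetical helper: OCR a downloaded image so its text can be spliced
// into the conversation in place of the "[img FILENAME]" placeholder.
async function ocrImage(path: string): Promise<string> {
  const { data } = await Tesseract.recognize(path, "eng");
  return data.text.trim();
}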

License

chats-crawler is licensed under the MIT License found in the LICENSE file in the root directory of this repository.

Citation

If this work is helpful, please kindly cite as:

@article{chats-crawler,
  title={chats-crawler: discourse chat data crawling and parsing for LLM instruction finetuning.}, 
  author={Yannan Luo},
  year={2024},
  url={https://github.com/jackfsuia/chats-crawler}
}

Acknowledgement

Learned a lot from gpt-crawler and crawlee. Thanks for their wonderful work.
