Generating headlines for the VICE Youtube channel using BLOOM

marcderbauer/bloom

Generate Vice Headlines with Bloom

Try it here

❗ Requirements

Your Python installation needs to be version 3.8 or higher.

🏃 Quickstart

If you can't be bothered to read all of this, you can just run

chmod +x run.sh     # Make run.sh executable
./run.sh            # Run the program

This will:

  1. Install all the required libraries
  2. Run three epochs of training
  3. Generate an inference

You can then generate more inferences as described below.

❄️ Context

This project originally started out as an RNN I wanted to implement in PyTorch. I had difficulties getting the model to produce coherent output. As I lacked reference values for training, I decided to fine-tune an existing model -- BLOOM. I hoped to learn more about the text-generation process from a top-down perspective, and to gather reference values for training in a "best-case" scenario.

🤖 Setup

1. Install the Required Dependencies

pip install -r requirements.txt

2. Set Up the YouTube API

❗ This step is only necessary if you want to source the data yourself❗
The dataset used to train the model is included under /data/. It was collected on 23.09.2022.

The data for this project is gathered through the YouTube Data API v3. Setting up this API can roughly be divided into the following steps:

  1. Create a Google Developer Account
  2. Create a new project
  3. Enable the YouTube Data API v3
  4. Create credentials
  5. Make the credentials accessible to your environment

For in-depth guidance, please refer to this excellent HubSpot Article.

📊 Data

❗If you decided to use the data included in the repository, you can skip this section.❗

1. Collecting the Data

Assuming you set up the YouTube API correctly, all you need to do is run youtube/query_api.py. It requires the name of your client_secrets_file. You need to supply the requested channel's playlistId as an argument when launching the program. You can supply multiple playlistIds at once by separating them with spaces.

In order to find a channel's playlistId you need to

  1. Go to the channel
  2. Find a playlist with all the channel's videos included (often the first playlist)
  3. Click PLAY ALL
  4. Copy everything after list= from the link
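If you would rather not copy the ID by hand, the last step can be sketched in a few lines of Python. This is only an illustration and not part of the repository: the function name `extract_playlist_id` and the example URL are assumptions, and the sketch reads the `list` query parameter with the standard library.

```python
from urllib.parse import urlparse, parse_qs


def extract_playlist_id(url: str) -> str:
    """Return the value of the `list` query parameter of a YouTube URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    if not values:
        raise ValueError(f"No 'list' parameter found in {url!r}")
    return values[0]


if __name__ == "__main__":
    # Hypothetical link; VIDEO_ID is a placeholder, not a real video.
    example = "https://www.youtube.com/watch?v=VIDEO_ID&list=UUn8zNIfYAQNdrFRrr8oibKw"
    print(extract_playlist_id(example))
```

Parsing the query string instead of splitting on `list=` also drops any parameters that follow the ID, such as `&index=1`.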

Thus, the command to download all the titles for VICE and VICE News is:

python3 youtube/query_api.py UUn8zNIfYAQNdrFRrr8oibKw PLw613M86o5o7q1cjb26MfCgdxJtshvRZ-
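For readers curious what such a query roughly involves, here is a minimal sketch of paging through a playlist with the YouTube Data API v3. It is not the code in youtube/query_api.py. The helper name `iter_playlist_titles` is hypothetical, and the `youtube` argument is assumed to be a service object created with `googleapiclient.discovery.build("youtube", "v3", credentials=...)`.

```python
from typing import Any, Iterator


def iter_playlist_titles(youtube: Any, playlist_id: str) -> Iterator[str]:
    """Yield the title of every video in a playlist, one page at a time."""
    page_token = None
    while True:
        params = {
            "part": "snippet",
            "playlistId": playlist_id,
            "maxResults": 50,  # largest page size the endpoint accepts
        }
        if page_token is not None:
            params["pageToken"] = page_token
        response = youtube.playlistItems().list(**params).execute()

        for item in response.get("items", []):
            yield item["snippet"]["title"]

        # The API omits nextPageToken on the last page.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
```

Calling `list(iter_playlist_titles(youtube, "UUn8zNIfYAQNdrFRrr8oibKw"))` would then collect all titles of one playlist, and looping over several IDs covers the multi-playlist case.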

2. Cleaning the Data

To clean the data, you just need to run preprocess.py. Assuming the file to process is called vice.txt, the command is:

python3 preprocess.py vice.txt

By default, this removes non-English sentences, duplicates and entries consisting of fewer than three words. The resulting file is automatically split into 80% train and 20% test sets in /data/.
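The filtering and splitting described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of preprocess.py: the function names, the shuffling and the fixed seed are hypothetical, and the language filter is left out because it would need a language-detection library.

```python
import random
from typing import List, Tuple


def clean_titles(titles: List[str], min_words: int = 3) -> List[str]:
    """Drop duplicates and titles with fewer than `min_words` words."""
    seen = set()
    cleaned = []
    for title in titles:
        title = title.strip()
        if len(title.split()) < min_words:
            continue  # too short to be a useful headline
        if title in seen:
            continue  # exact duplicate of an earlier title
        seen.add(title)
        cleaned.append(title)
    return cleaned


def split_train_test(
    titles: List[str], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the titles and split it into train and test parts."""
    shuffled = titles[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A fixed seed keeps the split reproducible between runs, which makes it easier to compare training results.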


📉 Training

Training can be run by executing main.py.
If you have Weights & Biases set up, you can add a flag to activate it:

python3 main.py --wandb

🗿 Inference

Inference can be run by executing inference.py with the prompt as an argument. You can also pass inference parameters as arguments, e.g.:

python3 inference.py North Korea --temp 0.42 --top_k 32 --rp 1.3

Output:
temp=0.42; k=32, p=0.92, rep=1.3:
----------------------------------------------------------------------------------------------------
North Korea's 'Most Humane' Hospital

Hugging Face made a great tutorial on different generation strategies, where each inference parameter is explained in depth.
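To give a feel for what two of these parameters do, here is a small pure-Python sketch of temperature scaling and top-k filtering over a list of logits. It is a simplified illustration, not how inference.py or the Hugging Face library implements sampling. The function names are hypothetical, and top-p and the repetition penalty are left out.

```python
import math
import random
from typing import List, Optional


def top_k_probs(logits: List[float], temperature: float = 1.0, top_k: int = 0) -> List[float]:
    """Turn logits into probabilities after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]
    if 0 < top_k < len(scaled):
        # Keep the k highest scores (ties at the cut-off are kept as well).
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= threshold else float("-inf") for x in scaled]
    # Numerically stable softmax; filtered entries end up with probability 0.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: int = 0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token index from the filtered distribution."""
    rng = rng or random.Random()
    probs = top_k_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a low temperature such as 0.42 most of the probability mass moves to the highest-scoring tokens, while top_k simply removes everything outside the k best candidates before sampling.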


♻️ Conclusion

This project gave me a much better understanding of text generation from a top-down perspective. While implementing this project as a PyTorch RNN, I mostly scrambled around without much of an understanding of what I was doing.
By fine-tuning BLOOM, I learned how to fine-tune an existing model, how to source data, how to pre-process it correctly and how to host the resulting model on Hugging Face Hub with Gradio.
