
Any NLP task requires basic preprocessing to clean and label the dataset before further NLP processing can be done.


bellamkondaprakash/Classification_Text_Cancer_Data


Text Processing and Labelling the Dataset

To label the semi-structured data (XML files), the data must first be preprocessed so that it is clean and labelled before any further NLP processing is done.

Load the Data and Remove XML and HTML Tags & Alphanumeric Characters

  1. Load the whole XML dataset, using the tqdm library to track progress, and clean all the files by stripping markup from the XML documents with the BeautifulSoup and regex libraries.
  2. Replace numeric characters with spaces.
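The two steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `clean_document` and `clean_files`, the choice of the `html.parser` backend, and the whitespace-collapsing step are all assumptions. The sketch falls back to a plain regex and a no-op progress wrapper if BeautifulSoup or tqdm is not installed.

```python
import re

try:
    from bs4 import BeautifulSoup
except ImportError:  # assumption: fall back to regex-only tag stripping
    BeautifulSoup = None

try:
    from tqdm import tqdm
except ImportError:  # assumption: no progress bar if tqdm is missing
    def tqdm(iterable, **kwargs):
        return iterable

TAG_RE = re.compile(r"<[^>]+>")   # crude match for XML/HTML tags
DIGIT_RE = re.compile(r"\d+")     # runs of numeric characters
SPACE_RE = re.compile(r"\s+")     # runs of whitespace


def clean_document(raw):
    """Strip markup, replace numbers with spaces, collapse whitespace."""
    if BeautifulSoup is not None:
        text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    else:
        text = TAG_RE.sub(" ", raw)
    text = DIGIT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def clean_files(paths):
    """Read each file and return a {path: cleaned text} mapping."""
    cleaned = {}
    for path in tqdm(paths, desc="Cleaning XML"):
        with open(path, encoding="utf-8", errors="replace") as fh:
            cleaned[str(path)] = clean_document(fh.read())
    return cleaned
```

For example, `clean_document("<article><abstract>Stage 2 tumour, 35 mm</abstract></article>")` returns the text with the tags and digits removed. If the XML files carry namespaces or CDATA sections, a dedicated XML parser backend may be a better fit than `html.parser`.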

Labelling the Dataset

  1. Using pandas, append the summarized documents together with their labels.
  2. Load a DataFrame containing the documents and their labels.
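A minimal sketch of the labelling step is shown below, assuming the cleaned text is held in a dictionary keyed by document name and that a matching dictionary of labels is available. The function name `build_labelled_frame`, the column names, and the example labels are hypothetical. Rows are collected in a list and the DataFrame is built once, because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd


def build_labelled_frame(cleaned_docs, labels):
    """Pair each cleaned document with its label in a single DataFrame.

    Documents that have no entry in `labels` are skipped.
    """
    rows = [
        {"doc_id": doc_id, "text": text, "label": labels[doc_id]}
        for doc_id, text in cleaned_docs.items()
        if doc_id in labels
    ]
    return pd.DataFrame(rows, columns=["doc_id", "text", "label"])


# Hypothetical example data, for illustration only.
docs = {"a.xml": "stage tumour mm", "b.xml": "no malignancy found"}
labels = {"a.xml": "cancer", "b.xml": "non-cancer"}
df = build_labelled_frame(docs, labels)
```

The resulting `df` has one row per labelled document and can be passed directly to later NLP steps such as tokenisation or vectorisation.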
