Addition of ground truth labels on Amazon movie reviews dataset

What is it?

This is a side project for my thesis "Classification/Clustering Techniques for Large Web Data Collections" (supervised by Prof. Ioannis Anagnostopoulos).

My main goal was to provide the data science community with a new, enriched dataset that carries ground truth labels. All labels were collected by crawling/scraping Amazon.com over a period of several months. By labels I mean the categories under which the products are classified (see the green underlined labels in the screenshot below).

Please feel free to contribute anything you think will improve it.

It is also available on Kaggle.

The original dataset

The Amazon Movies Reviews dataset consists of 7,911,684 reviews of 253,059 products, left by Amazon users between Aug 1997 and Oct 2012.

Data format:

product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the 1st ones. Remember once these performers are gone, we'll never get to see them again. Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE this DVD!!

where:

product/productId: asin, e.g. amazon.com/dp/B00006HAXW
review/userId: id of the user, e.g. A1RSDE90N6RSZF
review/profileName: name of the user
review/helpfulness: fraction of users who found the review helpful
review/score: rating of the product
review/time: time of the review (unix time)
review/summary: review summary
review/text: text of the review
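
As a small illustration of the fields above, the sketch below converts review/helpfulness and review/time into more convenient values. The helper names parse_helpfulness and parse_time are hypothetical and not part of this repository, and reading "9/9" as "helpful votes / total votes" is an assumption based on the field description.

```python
from datetime import datetime, timezone


def parse_helpfulness(value):
    """Split 'helpful/total' (assumed layout) into a ratio; None when there are no votes."""
    helpful, total = (int(part) for part in value.split("/"))
    return helpful / total if total else None


def parse_time(value):
    """Convert the unix timestamp string into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


print(parse_helpfulness("9/9"))         # 1.0
print(parse_time("1042502400").date())  # 2003-01-14
```

Returning None for reviews without votes keeps "no votes" distinct from "0% helpful".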

The new labeled dataset

All the collected data (one row for every ASIN of the SNAP dataset: ~253k products covering ~8M reviews) are stored in a CSV file, labels.csv, in the following format:

ASIN: unique identifier for the product
Categories: [label₀, label₁, label₂,..., label_n]
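
A minimal sketch of how rows in this format could be read and summarized. The header names and the quoting of the list are assumptions (enrich.py only shows that the file has a header row and that the second column is a Python-style list); the sample row reuses the example record shown in this README.

```python
import ast
import csv
import io
from collections import Counter

# In-memory stand-in for labels.csv; the header names are assumed for illustration.
SAMPLE = '''ASIN,Categories
B00006HAXW,"['CDs & Vinyl', 'Pop', 'Oldies', 'Doo Wop']"
'''


def read_labels(handle):
    """Return {ASIN: [labels]} from an open CSV handle, skipping the header row."""
    reader = csv.reader(handle)
    next(reader)
    return {row[0]: ast.literal_eval(row[1]) for row in reader}


labels = read_labels(io.StringIO(SAMPLE))
# Count how often each top-level label (the first list element) occurs.
top_level = Counter(cats[0] for cats in labels.values() if cats)
print(top_level.most_common(3))
```

To run this against the real file, the StringIO object would be replaced with something like open('labels.csv', newline='').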

After enrichment, each review has the following format:

product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the 1st ones. Remember once these performers are gone, we'll never get to see them again. Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE this DVD!!
product/categories: ['CDs & Vinyl', 'Pop', 'Oldies', 'Doo Wop']

Hierarchical format

Two JSON files are also included that contain all the labels in a hierarchical (tree) format: one without ASINs (hierarcy_of_labels.json) and one with ASINs included (hierarcy_of_labels_ASINs.json). Both are generated by hierarchy.py.
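
To make the tree structure concrete, here is a rough sketch of how flat label lists could be folded into a nested hierarchy. It is not the code in hierarchy.py, and the layout it prints (nested objects, with an "ASINs" key in the variant that includes ASINs) is an assumption made for illustration only. The second product in the sample is made up.

```python
import json


def build_tree(labels_by_asin, with_asins=False):
    """Fold {ASIN: [label0, label1, ...]} into nested dicts, one level per label."""
    tree = {}
    for asin, labels in labels_by_asin.items():
        node = tree
        for label in labels:
            # Descend into the child for this label, creating it if needed.
            node = node.setdefault(label, {})
        if with_asins and labels:
            # "ASINs" is a hypothetical key name chosen for this sketch.
            node.setdefault("ASINs", []).append(asin)
    return tree


sample = {
    "B00006HAXW": ["CDs & Vinyl", "Pop", "Oldies", "Doo Wop"],
    "EXAMPLE0001": ["CDs & Vinyl", "Pop"],  # made-up ASIN, for illustration only
}
print(json.dumps(build_tree(sample, with_asins=True), indent=2))
```

Because each product is attached to the deepest label in its list, products with shorter label paths end up on inner nodes of the tree.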

Instructions

To get the enriched dataset, follow these steps:

Download the original dataset from the SNAP website (~3.3 GB compressed) and put it in the root folder of the repository, next to the labels.csv file. enrich.py expects it to be named movies.txt.gz.
Run the Python script enrich.py to export the new, enriched, multi-labeled dataset. The new file is named output.txt.gz.

Note: Please be patient, as the Python script takes a while to parse all the reviews.

The Python script generates a new compressed file that is the same as the original one, but with an extra field (product/categories).

Specifically, the script matches the ASIN values in the two files and adds the product's labels to every review of that product as an extra field.

Here is the code:

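import gzip
import csv
import ast

def look_up(asin, diction):
    # Return the labels of an ASIN; report unknown ASINs and fall back to no labels.
    try:
        return diction[asin]
    except KeyError:
        print(asin)
        return []

def load_labels():
    # Build {ASIN: [labels]} from labels.csv, skipping the header row.
    labels_dictionary = {}
    with open('labels.csv', mode='r', newline='') as infile:
        csvreader = csv.reader(infile)
        next(csvreader)
        for rows in csvreader:
            # The second column holds the labels as a Python list literal.
            labels_dictionary[rows[0]] = ast.literal_eval(rows[1])
    return labels_dictionary

def parse(filename):
    labels_dict = load_labels()
    # Python 3 reads the archive as text. latin-1 is an assumption: it decodes
    # any byte, so review text round-trips unchanged when written back as latin-1.
    with gzip.open(filename, 'rt', encoding='latin-1') as f:
        entry = {}
        for l in f:
            l = l.strip()
            colonPos = l.find(':')
            if colonPos == -1:
                # A line without a colon (the blank separator) ends a review.
                yield entry
                entry = {}
                continue
            eName = l[:colonPos]
            rest = l[colonPos+2:]
            entry[eName] = rest
            if eName == 'product/productId':
                entry['product/categories'] = look_up(rest, labels_dict)
        yield entry

if __name__ == "__main__":
    try:
        print("Parsing dataset...\nPlease be patient, this will take a while...")
        # backslashreplace escapes label characters that latin-1 cannot encode
        # instead of aborting the export.
        with gzip.open('output.txt.gz', 'wt', encoding='latin-1',
                       errors='backslashreplace') as fo:
            for e in parse("movies.txt.gz"):
                for i in e:
                    fo.write('%s: %s\n' % (i, e[i]))
                fo.write("\n")
        print("New enriched dataset has been exported successfully!\nFile name: output.txt.gz")
    except Exception as inst:
        print(type(inst))
        print(inst.args)
        print(inst)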
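
Since the script writes product/categories as the text form of a Python list, the value has to be converted back into a list when the enriched file is read. The following is a hedged sketch of one way to do that. read_enriched is a hypothetical helper, not part of the repository; latin-1 is assumed only because it can decode any byte; and the demo runs on a tiny temporary file rather than the real output.txt.gz.

```python
import ast
import gzip
import os
import tempfile


def read_enriched(path):
    """Yield one dict per review, with product/categories parsed back into a list."""
    entry = {}
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # A blank line separates reviews.
                if entry:
                    yield entry
                entry = {}
                continue
            name, _, value = line.partition(":")
            value = value.strip()
            if name == "product/categories":
                value = ast.literal_eval(value)
            entry[name] = value
    if entry:
        yield entry


# Demo on a tiny temporary file that mimics the documented record layout.
record = (
    "product/productId: B00006HAXW\n"
    "review/score: 5.0\n"
    "product/categories: ['CDs & Vinyl', 'Pop', 'Oldies', 'Doo Wop']\n"
    "\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt.gz")
    with gzip.open(demo_path, "wt", encoding="latin-1") as out:
        out.write(record)
    reviews = list(read_enriched(demo_path))

print(reviews[0]["product/categories"][-1])  # Doo Wop
```

Because read_enriched is a generator, the full enriched file can be processed one review at a time without loading it into memory.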

Acknowledgements

If you publish articles based on this dataset, please cite the following papers:

Bazakos Konstantinos and Ioannis Anagnostopoulos. Classification/Clustering Techniques for Large Web Data Collections. Dissertation, Hellenic Open University, 2017.
J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.

BibTeX is also available:

@ptychionthesis{bzks:2017,
  author = {Bazakos Konstantinos and Anagnostopoulos Ioannis},
  title = {Classification/Clustering Techniques for Large Web Data Collections},
  school = {Hellenic Open University},
  year = {2017},
  month = {Jul}
}

@inproceedings{McAuley:2013:ACM:2488388.2488466,
 author = {McAuley, Julian John and Leskovec, Jure},
 title = {From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise Through Online Reviews},
 booktitle = {Proceedings of the 22Nd International Conference on World Wide Web},
 series = {WWW '13},
 year = {2013},
 isbn = {978-1-4503-2035-1},
 location = {Rio de Janeiro, Brazil},
 pages = {897--908},
 numpages = {12},
 url = {http://doi.acm.org/10.1145/2488388.2488466},
 doi = {10.1145/2488388.2488466},
 acmid = {2488466},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {expertise, recommender systems, user modeling},
}
