New Pipeline: Instagram post scraper #60

Open
iam-benyamin opened this issue Oct 26, 2021 · 4 comments · May be fixed by #76
Labels: new data source, pipeline, pipeline/instagram-posts

Comments

@iam-benyamin

iam-benyamin commented Oct 26, 2021

This will be a new pipeline that scrapes public Instagram posts together with metadata such as location and hashtags.

mattigrthr added the new data source, pipeline, and pipeline/instagram-posts labels on Oct 26, 2021
mattigrthr added this to To do in Kuwala via automation on Oct 26, 2021
@mattigrthr
Contributor

It should be possible to scrape public Instagram posts using hashtags, locations, and (public) users. This article provides some insights and ideas: https://blog.apify.com/scrape-instagram-posts-comments-and-more-21d05506aeb3/

Since @bmahmoudyan can't continue working on this issue, it's up for grabs again. :)

@mattigrthr
Contributor

The requirements for a new pipeline are the following:

  • Saving the raw data (in this case the scraping results) in a file as is
  • Transforming at least lat/lng to H3 and moving all nested properties to a column/table format using meaningful variable names (since we are switching to Postgres and dbt for transformations)
  • Ideally saving the results in Parquet format since it’s much more storage efficient and optimized for parallel processing (just one command with PySpark)
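
The first two requirements above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the post fields (`id`, `hashtags`, `location.lat`, `location.lng`), the function names, and the default resolution are all assumptions, and the H3 conversion is passed in as a callable so the sketch itself needs nothing beyond the standard library.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


def save_raw(posts: List[Dict[str, Any]], path: Path) -> None:
    """Write the scraping results unchanged, one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for post in posts:
            f.write(json.dumps(post, ensure_ascii=False) + "\n")


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Move nested properties to top-level columns (location.lat -> location_lat)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            # Scalars and lists (e.g. hashtags) are kept as they are.
            flat[name] = value
    return flat


def transform(
    post: Dict[str, Any],
    to_h3: Optional[Callable[[float, float, int], str]] = None,
    resolution: int = 11,  # arbitrary default chosen for this sketch
) -> Dict[str, Any]:
    """Flatten one post and, if a converter is given, add an H3 index column."""
    row = flatten(post)
    lat, lng = row.get("location_lat"), row.get("location_lng")
    if to_h3 is not None and lat is not None and lng is not None:
        row["h3_index"] = to_h3(lat, lng, resolution)
    return row
```

With h3-py 4.x installed, `h3.latlng_to_cell` should fit the `to_h3` parameter (3.x calls the same function `h3.geo_to_h3`). Writing the flattened rows to Parquet is left out here; with PySpark it would be a single `df.write.parquet(...)` call on a DataFrame built from the rows.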

@arifluthfi16
Contributor

I am interested in taking this issue. I will be splitting the pipeline into two parts:

  • The scraper API
  • The actual pipeline

This follows the same structure as the Google POI pipeline.

I am currently still working on the Instagram scraper and looking at what's possible.

@arifluthfi16
Contributor

PR for this issue: #74

mattigrthr removed this from To do in Kuwala on Feb 6, 2022
mattigrthr linked a pull request on Mar 1, 2022 that will close this issue
Projects
Status: In Review

3 participants