Instagram index problem creating index error #1046

Open
Jb2817 opened this issue Apr 28, 2024 · 0 comments

Labels
bug Something isn't working


Jb2817 commented Apr 28, 2024

Describe the bug

An IndexError is raised when trying to access Instagram posts.

How to reproduce

Any attempt to access Instagram posts should produce the error, for example when iterating over the scraper's items:

# Loop over each post for the current year
for post in tqdm(snsinstagram.InstagramHashtagScraper(query).get_items()):

Expected behaviour

The program should save the Instagram information as a pandas DataFrame. Instead, accessing posts raises an IndexError. There's a comment in snscrape/modules/instagram.py (line 89) noting that this may happen if Instagram changes something, which appears to be the case here.
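Until the scraper itself is fixed, the calling script can at least stop cleanly and keep whatever it has collected. The following is only a sketch of that idea, not a fix for the underlying problem: `collect_items` and `broken_get_items` are hypothetical helper names, and the stand-in generator merely imitates the failure seen in the traceback so the example runs without network access.

```python
def collect_items(items, limit=None):
    """Consume an iterator of posts and return (collected, error).

    error is None on success, or the IndexError that interrupted iteration.
    """
    collected = []
    try:
        for item in items:
            collected.append(item)
            if limit is not None and len(collected) >= limit:
                break
    except IndexError as exc:
        # The scraper failed while parsing a page; keep what we have so far.
        return collected, exc
    return collected, None


def broken_get_items():
    """Hypothetical stand-in for get_items() that fails before yielding."""
    raise IndexError('list index out of range')
    yield  # unreachable; its presence makes this function a generator


posts, error = collect_items(broken_get_items())
print(len(posts), repr(error))
```

In the real script, `items` would be `snsinstagram.InstagramHashtagScraper(query).get_items()`, and `posts` could then be passed to `pd.DataFrame` as before.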

Screenshots and recordings

No response

Operating system

macOS Ventura 13.4

Python version: output of python3 --version

python 3.11.5

snscrape version: output of snscrape --version

snscrape 0.7.0.20230622

Scraper

snscrape.modules.instagram

How are you using snscrape?

Module (import snscrape.modules.something in Python code)

Backtrace

No response

Log output

0it [00:00, ?it/s]INFO:snscrape.modules.instagram:Retrieving initial data
INFO:snscrape.base:Retrieving https://www.instagram.com/explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/
DEBUG:snscrape.base:... with headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
DEBUG:snscrape.base:... with environmentSettings: {'proxies': OrderedDict(), 'stream': False, 'verify': True, 'cert': None}
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.instagram.com:443
DEBUG:snscrape.base:Connected to: ('157.240.241.174', 443)
DEBUG:snscrape.base:Connection cipher: ('TLS_CHACHA20_POLY1305_SHA256', 'TLSv1.3', 256)
DEBUG:urllib3.connectionpool:https://www.instagram.com:443 "GET /explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/ HTTP/1.1" 200 None
INFO:snscrape.base:Retrieved https://www.instagram.com/explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/: 200
DEBUG:snscrape.base:... with response headers: {'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Set-Cookie': 'csrftoken=YBE20kuKsQmjab47aSnkSn; expires=Sun, 27-Apr-2025 18:27:16 GMT; Max-Age=31449600; path=/; domain=.instagram.com; secure', 'accept-ch-lifetime': '4838400', 'accept-ch': 'viewport-width,dpr,Sec-CH-Prefers-Color-Scheme,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Platform-Version,Sec-CH-UA-Model', 'Link': 'https://www.instagram.com/explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/top/; rel="canonical"', 'reporting-endpoints': 'coop_report="https://www.facebook.com/browser_reporting/coop/?minimize=0", coep_report="https://www.facebook.com/browser_reporting/coep/?minimize=0", default="https://www.instagram.com/error/ig_web_error_reports/?device_level=unknown", permissions_policy="https://www.instagram.com/error/ig_web_error_reports/"', 'report-to': '{"max_age":2592000,"endpoints":[{"url":"https:\/\/www.facebook.com\/browser_reporting\/coop\/?minimize=0"}],"group":"coop_report","include_subdomains":true}, {"max_age":86400,"endpoints":[{"url":"https:\/\/www.facebook.com\/browser_reporting\/coep\/?minimize=0"}],"group":"coep_report"}, {"max_age":259200,"endpoints":[{"url":"https:\/\/www.instagram.com\/error\/ig_web_error_reports\/?device_level=unknown"}]}, {"max_age":21600,"endpoints":[{"url":"https:\/\/www.instagram.com\/error\/ig_web_error_reports\/"}],"group":"permissions_policy"}', 'content-security-policy-report-only': "default-src *.facebook.com *.fbcdn.net *.instagram.com data: blob:;script-src *.teststagram.com *.instagram.com static.cdninstagram.com *.google-analytics.com https://translate.google.com/ https://apis.google.com/ https://accounts.google.com/ *.facebook.com *.fbcdn.net *.facebook.net 'unsafe-inline' 'unsafe-eval' blob: data: 'self';style-src *.teststagram.com *.instagram.com static.cdninstagram.com data: blob: 'unsafe-inline' *.fbcdn.net *.facebook.com;connect-src *.teststagram.com .instagram.com wss://edge-chat.instagram.com/ 
connect.facebook.net .facebook.com facebook.com .fbcdn.net .facebook.net wss://.facebook.com: ws://localhost: blob: .cdninstagram.com wss://.instagram.com: 'self' https://meta.privacy-gateway.cloudflare.com/relay;font-src *.teststagram.com *.instagram.com static.cdninstagram.com data: *.fbcdn.net *.intern.facebook.com *.facebook.com fonts.gstatic.com;img-src *.teststagram.com *.instagram.com *.facebook.com *.fbcdn.net data: *.igsonar.com *.cdninstagram.com *.google-analytics.com blob: *.fbsbx.com android-webview-video-poster: *.giphy.com;media-src *.facebook.com *.fbcdn.net *.instagram.com *.cdninstagram.com cdn.fbsbx.com data: blob: https://*.giphy.com;frame-src *.instagram.com *.facebook.com *.fbsbx.com fbsbx.com data:;worker-src *.instagram.com/static_resources/webworker_v1/init_script/ *.instagram.com/static_resources/webworker/init_script/ *.instagram.com/static_resources/sharedworker/init_script/ *.instagram.com/www-service-worker.js;block-all-mixed-content;report-uri https://www.facebook.com/csp/reporting/?minimize=0;", 'content-security-policy': "default-src *.facebook.com *.fbcdn.net *.instagram.com data: blob:;script-src *.teststagram.com *.instagram.com static.cdninstagram.com *.google-analytics.com https://translate.google.com/ https://apis.google.com/ https://accounts.google.com/ *.facebook.com *.fbcdn.net .facebook.net 127.0.0.1: 'unsafe-inline' 'unsafe-eval' blob: data: 'self';style-src *.teststagram.com *.instagram.com static.cdninstagram.com data: blob: 'unsafe-inline' *.fbcdn.net *.facebook.com;connect-src *.teststagram.com .instagram.com wss://edge-chat.instagram.com/ connect.facebook.net .facebook.com facebook.com .fbcdn.net .facebook.net wss://.facebook.com: ws://localhost: blob: .cdninstagram.com wss://.instagram.com: 'self' https://meta.privacy-gateway.cloudflare.com/relay;font-src *.teststagram.com *.instagram.com static.cdninstagram.com data: *.fbcdn.net *.intern.facebook.com *.facebook.com fonts.gstatic.com;img-src *.teststagram.com 
*.instagram.com *.facebook.com *.fbcdn.net data: *.igsonar.com *.cdninstagram.com *.google-analytics.com *.whatsapp.net blob: www.gstatic.com *.fbsbx.com android-webview-video-poster: *.oculuscdn.com www.googleadservices.com *.doubleclick.net *.google.com *.google.co.uk *.giphy.com;media-src *.facebook.com *.fbcdn.net *.instagram.com *.cdninstagram.com cdn.fbsbx.com data: blob: https://*.giphy.com;frame-src *.instagram.com *.facebook.com *.fbsbx.com fbsbx.com data: www.googleadservices.com *.doubleclick.net *.google.com *.google.co.uk;block-all-mixed-content;upgrade-insecure-requests;", 'document-policy': 'force-load-at-top', 'permissions-policy': 'accelerometer=(self), attribution-reporting=(), autoplay=(), bluetooth=(), camera=(self), ch-device-memory=(), ch-downlink=(), ch-dpr=(), ch-ect=(), ch-rtt=(), ch-save-data=(), ch-ua-arch=(), ch-ua-bitness=(), ch-viewport-height=(), ch-viewport-width=(), ch-width=(), clipboard-read=(), clipboard-write=(self), display-capture=(), encrypted-media=(), fullscreen=(self), gamepad=(), geolocation=(self), gyroscope=(self), hid=(), idle-detection=(), keyboard-map=(), local-fonts=(), magnetometer=(), microphone=(self), midi=(), otp-credentials=(), payment=(), picture-in-picture=(self), publickey-credentials-get=(), screen-wake-lock=(), serial=(), usb=(), window-management=(), xr-spatial-tracking=();report-to="permissions_policy"', 'cross-origin-resource-policy': 'same-origin', 'cross-origin-embedder-policy-report-only': 'require-corp;report-to="coep_report"', 'cross-origin-opener-policy': 'same-origin-allow-popups;report-to="coop_report"', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Expires': 'Sat, 01 Jan 2000 00:00:00 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'X-Frame-Options': 'DENY', 'Strict-Transport-Security': 'max-age=31536000; preload; includeSubDomains', 'x-stack': 'www', 'Content-Type': 'text/html; charset="utf-8"', 'X-FB-Debug': 
'mQypGwsXYcnBYHVu2sXcPrKTeI1apWlyzIcvhTq92feWnrc4DROmKwi+swdfI2uNPh7w5XTrxe0YvuGaeWLHTQ==', 'Date': 'Sun, 28 Apr 2024 18:27:16 GMT', 'Alt-Svc': 'h3=":443"; ma=86400', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive'}
0it [00:00, ?it/s]

IndexError Traceback (most recent call last)
Cell In[19], line 19
16 year_end_date = pd.Timestamp('{}-12-31'.format(year))
18 # Loop over each post for the current year
---> 19 for post in tqdm(snsinstagram.InstagramHashtagScraper(query).get_items()):
20 if post.date >= year_start_date and post.date <= year_end_date:
21 if len(posts) >= limit*(year-start_date.year+1):

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/std.py:1182, in tqdm.__iter__(self)
1179 time = self._time
1181 try:
-> 1182 for obj in iterable:
1183 yield obj
1184 # Update and possibly print the progressbar.
1185 # Note: does not call self.update(1) for speed optimisation.

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/modules/instagram.py:110, in _InstagramCommonScraper.get_items(self)
109 def get_items(self):
--> 110 r = self._initial_page()
111 if r.status_code == 404:
112 _logger.warning('Page does not exist')

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/modules/instagram.py:78, in _InstagramCommonScraper._initial_page(self)
76 if self._initialPage is None:
77 _logger.info('Retrieving initial data')
---> 78 r = self._get(self._initialUrl, headers = self._headers, responseOkCallback = self._check_initial_page_callback)
79 if r.status_code not in (200, 404):
80 raise snscrape.base.ScraperException(f'Got status code {r.status_code}')

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/base.py:275, in Scraper._get(self, *args, **kwargs)
274 def _get(self, *args, **kwargs):
--> 275 return self._request('GET', *args, **kwargs)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/base.py:246, in Scraper._request(self, method, url, params, data, headers, timeout, responseOkCallback, allowRedirects, proxies)
244 _logger.debug(f'... ... with response headers: {redirect.headers!r}')
245 if responseOkCallback is not None:
--> 246 success, msg = responseOkCallback(r)
247 errors.append(msg)
248 else:

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/modules/instagram.py:89, in _InstagramCommonScraper._check_initial_page_callback(self, r)
87 if r.status_code != 200:
88 return True, None
---> 89 jsonData = r.text.split('<script type="text/javascript">window._sharedData = ')[1].split(';</script>')[0] # May throw an IndexError if Instagram changes something again; we just let that bubble.
90 try:
91 obj = json.loads(jsonData)

IndexError: list index out of range
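The traceback suggests that the page came back with status 200 but without the `window._sharedData` script tag, so `split(...)` returned a single-element list and `[1]` failed. The sketch below reproduces that behaviour offline. The two HTML strings are made-up samples, not real Instagram responses, and `extract_shared_data` is a hypothetical helper, not part of snscrape.

```python
MARKER = '<script type="text/javascript">window._sharedData = '


def extract_shared_data(html):
    """Return the JSON text following MARKER, or None if MARKER is absent."""
    parts = html.split(MARKER)
    if len(parts) < 2:
        # Marker not found: str.split returns the whole string as one element.
        return None
    return parts[1].split(';</script>')[0]


# Made-up sample pages (assumptions for illustration only).
page_with_marker = '<html>' + MARKER + '{"entry_data": {}};</script></html>'
page_without_marker = '<html><body>no shared data here</body></html>'

print(extract_shared_data(page_with_marker))     # {"entry_data": {}}
print(extract_shared_data(page_without_marker))  # None

# The unguarded expression from instagram.py line 89 fails on the second page:
try:
    page_without_marker.split(MARKER)[1]
except IndexError as exc:
    print('IndexError:', exc)  # IndexError: list index out of range
```

A length check like this would only turn the crash into a clearer error; actually retrieving posts would still require the scraper to parse whatever the page contains now.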

Dump of locals

No response

Additional context

No response

@Jb2817 Jb2817 added the bug Something isn't working label Apr 28, 2024