Scrape images, video, and post forwarding information for Telegram #413

loganwilliams · 2022-02-24T14:31:46Z

A small enhancement that adds some additional information from Telegram channel posts.

JustAnotherArchivist

Hi, thank you for this – and happy to see snscrape used by Bellingcat!

I don't have time at the moment to review and test it in detail, but a few general thoughts:

- For forwarded posts, I'd like to see a URL to the original post as well as a reference to the channel behind it. I guess there's zero info in the HTML besides the username, so that might require some changes on the Channel class (making title, verified, and photo optional).
- Posts can have more than one video. I believe the current code only catches the last video.
- For videos and audio, I'd like to extract everything Telegram provides. Definitely the duration and thumbnail, perhaps even the audio amplitude bars, although that's probably overkill and of little value. This would require separate dataclasses to carry this extra data, similar to how the Twitter module handles media.
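A minimal sketch of what such media dataclasses could look like. The class and field names here (`Photo`, `Video`, `VoiceMessage`, `thumbnailUrl`, `bars`) are illustrative assumptions, not the definitions that ended up in the PR, and `count_videos` is a hypothetical helper.

```python
import dataclasses
import typing


@dataclasses.dataclass
class Photo:
    url: str


@dataclasses.dataclass
class Video:
    thumbnailUrl: str
    duration: float  # seconds
    url: typing.Optional[str] = None  # some video players expose no direct URL


@dataclasses.dataclass
class VoiceMessage:
    url: str
    duration: float  # seconds
    bars: typing.List[float]  # heights of the amplitude bars


Medium = typing.Union[Photo, Video, VoiceMessage]


def count_videos(media: typing.Optional[typing.List[Medium]]) -> int:
    """Count the Video entries in a post's media list (None means no media)."""
    return sum(isinstance(m, Video) for m in media or [])
```

With one class per media type, a consumer can tell whether a post has one or more videos with a plain `isinstance` check instead of inspecting URLs.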

loganwilliams · 2022-03-09T07:05:33Z

Makes sense to me. I don't have a timeline for when we'd be able to make those changes -- there's a few high priority things happening right now -- but we've been using our fork for a while and I wanted to open a PR to remember to merge it upstream at some point.

trislee · 2022-05-25T06:22:18Z

I implemented the requested changes:

- Made attachment handling similar to Twitter's: dataclasses for Image, Video, and Gif.
- Added capability to scrape multiple Videos from a single message
- Added attribute for the full forwarded URL and made the forwarded attribute have type Channel
- Added capability to scrape number of views for messages

Additional changes:

- Telegram seems to have changed its web interface so that the tme_messages_more element with its data-before attribute is missing on some pages. To deal with this, I added a default that decrements the before query parameter by 20. This requires a few additional changes to handle edge cases:
  - If the query string doesn't contain the before parameter, get the canonical URL tag in the page
  - Added a termination condition: if the first tgme_widget_message_date has an href to the first post (t.me/CHANNEL/1), terminate the scraping loop
- Moved attachment extraction out of the if (message := post.find('div', class_ = 'tgme_widget_message_text')): clause, since some attachments are in messages without text, so they weren't being added to the media list
- I also added a responseOkCallback function to retry the request if we get a 5xx response.
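A rough sketch of a callback that flags 5xx responses so the request gets retried. It assumes the callback receives a response object with a `status_code` attribute and returns an `(ok, message)` pair; that return shape is an assumption about the scraper's internal convention, and `FakeResponse` and `telegram_response_ok` are hypothetical names used only for illustration.

```python
import typing


class FakeResponse:
    """Stand-in for an HTTP response; only status_code is used here."""

    def __init__(self, status_code: int):
        self.status_code = status_code


def telegram_response_ok(r) -> typing.Tuple[bool, typing.Optional[str]]:
    """Return (False, reason) for server errors so the caller can retry."""
    if 500 <= r.status_code < 600:
        return False, f'server error {r.status_code}'
    return True, None
```

For example, `telegram_response_ok(FakeResponse(502))` reports a failure, while a 200 response passes through unchanged.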

TheTechRobo · 2022-05-25T12:18:53Z

Hm, should this be rebased? 25 commits is a lot, but I'm not sure on @JustAnotherArchivist's policy on that.

TheTechRobo · 2022-05-25T12:20:27Z

Pasting something from the PR to the fork that I think is relevant:

I got frustrated with the slowness of the scraping so I changed the forwarding Channel method by modifying the Channel definition so that it only requires the username, rather than retrieving the full forwarded channel information for every forwarded message.

JustAnotherArchivist · 2022-05-29T07:59:50Z

The changes sound good so far, though I haven't reviewed the code thoroughly yet. Some quick comments on things I noticed at a glance:

- I don't mind the number of commits. The merges make the history slightly messy, but that's alright.
- The 'thin' Channel change is fine; the Twitter module does that as well, only including data that is already available e.g. for replied-to users.
- The functions at the bottom should be prefixed with an underscore to mark them as private API.
- views attribute: parse_num returns an IntWithGranularity, not an int.
- outlinks, mentions, etc. should be None if there aren't any, not an empty list. Related to that: typing.Optional is missing on a couple in the class definition.
- The changes to the VK module should be a separate PR.
- Do you have an example of a channel page that often lacks the before= link? I haven't noticed this before.
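The None-instead-of-empty-list convention from the comments above can be sketched as follows. `PostSketch` and `none_if_empty` are hypothetical names for illustration; the real post class has more fields.

```python
import dataclasses
import typing


@dataclasses.dataclass
class PostSketch:
    url: str
    # Optional because "nothing found" is represented as None, not []
    outlinks: typing.Optional[typing.List[str]] = None
    mentions: typing.Optional[typing.List[str]] = None


def none_if_empty(items: typing.List[str]) -> typing.Optional[typing.List[str]]:
    """Collapse an empty list to None, leaving non-empty lists untouched."""
    return items if items else None
```

A scraper would collect links into a local list while parsing and pass `none_if_empty(links)` when constructing the item.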

trislee · 2022-06-23T20:35:43Z

This is an example of a channel page with no tme_messages_more data-before attribute: https://t.me/s/proudboysusa?before=8033
I only started noticing such pages after I had started working on this fork, so maybe Telegram changed something in their web interface in the last few months.

trislee · 2022-06-23T20:48:35Z

Incorporated your changes, let me know if there are other issues you'd like me to address

trislee · 2022-12-02T13:45:08Z

@JustAnotherArchivist Any additional changes you want us to make? We've been using this quite a bit and would love to see it get merged.

JustAnotherArchivist

Sorry for the delay, and thanks for the fixes! I'll have some style nits, but let's get the functionality sorted out first.

JustAnotherArchivist · 2022-12-19T23:57:03Z

snscrape/modules/telegram.py

+					# Generic filter of links to the post itself, catches videos, photos, and the date link
+					if style != '':
+						imageUrls = _STYLE_MEDIA_URL_PATTERN.findall(style)
+						if len(imageUrls) == 1:


Are there any examples with more than one match (here or a few lines below)?
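To illustrate the question, here is how a `findall` over a style attribute behaves with one and with several `url(...)` entries. The pattern below is an assumption about what `_STYLE_MEDIA_URL_PATTERN` looks like, and the style strings are made up; whether Telegram ever emits more than one URL per style is exactly what is being asked.

```python
import re

# Assumed shape of the pattern: captures whatever sits inside url('...')
STYLE_MEDIA_URL_PATTERN = re.compile(r"url\('(.*?)'\)")


def media_urls_from_style(style: str) -> list:
    """Return every url('...') target in a CSS style string, in order."""
    return STYLE_MEDIA_URL_PATTERN.findall(style)
```

A style with a single background image yields a one-element list, so the `len(imageUrls) == 1` check passes; a style with two `url(...)` entries yields two elements and would be skipped by that check.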

JustAnotherArchivist · 2022-12-20T00:13:21Z

snscrape/modules/telegram.py

+
+				media.append(VoiceMessage(url = audioUrl, duration = duration, bars = barHeights))
+
+			for videoPlayer in post.find_all('a', {'class': 'tgme_widget_message_video_player'}):


Because the extraction of images and videos is done separately, the order is not preserved. For example, https://t.me/s/nexta_live/43102 has video 1 (without URL), image, video 2 (with URL), but the image gets listed first. I think that can be fixed by simply merging this loop (and also the one for the voice player extraction) into the general link loop above, since they're all a tags in the post div.
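The single-pass idea can be sketched with the standard library's `html.parser` rather than the BeautifulSoup calls the module uses, just to keep the example self-contained. Only `tgme_widget_message_video_player` appears in the diff above; the photo and voice class names are assumptions, and `MediaOrderParser` and `media_order` are hypothetical names.

```python
from html.parser import HTMLParser


class MediaOrderParser(HTMLParser):
    """Collects media kinds from <a> tags in document order."""

    # CSS class -> media kind (photo and voice class names are assumed)
    KINDS = {
        'tgme_widget_message_photo_wrap': 'photo',
        'tgme_widget_message_video_player': 'video',
        'tgme_widget_message_voice_player': 'voice',
    }

    def __init__(self):
        super().__init__()
        self.media = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        classes = (dict(attrs).get('class') or '').split()
        for cls in classes:
            if cls in self.KINDS:
                self.media.append(self.KINDS[cls])
                break


def media_order(html: str) -> list:
    """Return the media kinds found in html, in the order they appear."""
    parser = MediaOrderParser()
    parser.feed(html)
    return parser.media
```

Because every link is visited once and dispatched by class, a video, image, video sequence comes out in that order instead of grouped by type.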

JustAnotherArchivist · 2022-12-20T00:20:33Z

snscrape/modules/telegram.py

+				}
+				timeTag = videoPlayer.find('time')
+				if timeTag is None:
+					cls = Gif


Do you have some examples? I don't remember seeing fake-GIFs on Telegram before. (Also for the future test suite.)

john-osullivan · 2023-12-31

Telegram doesn't have a policy on whether or not they're allowed, right? I don't think a real-GIF would ever inaccurately go down this path, so isn't it just making the logic more robust against change?

JustAnotherArchivist · 2023-12-31

I prefer erroring out on things the code doesn't actually understand and implement. It might be 'more robust' in some sense, but it can easily result in misparsing the data as well.

But if 'videos' without a time tag already exist similar to how it is on Twitter, this is totally fine. Hence why I'm asking for examples. :-)

john-osullivan · 2024-01-16

I'm not an active Telegram user, so I don't think I'll be able to quickly come up with an example myself. @loganwilliams, do you remember running into a problem which required adding this line back when you implemented this?

On the other hand, what data misparsing are you imagining from this, @JustAnotherArchivist? Especially if Twitter already has examples which require this behavior, what's the error mode that we'd want to call out by throwing here?

I'm hoping that merging this will get everyone off the fork, but am concerned that if we introduce new exceptions, it'll require more significant updates to existing workflows.

Edit: As a compromise, I'm adding a warning log to this in my PR. It won't stop execution, but will let the user know in case there's something actually wrong.

loganwilliams · 2024-01-16

Hi @john-osullivan. Thanks for your work pushing this forward. You can see an example of a GIF here: https://t.me/thisisatestchannel19451923/3

It sits in the same .tgme_widget_message_video_player element and lacks a duration.
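The warning-log compromise discussed in this thread might look roughly like the sketch below. `classify_video_player` is a hypothetical helper, and returning class names as strings is a simplification for illustration; the module itself picks between dataclasses.

```python
import logging

_logger = logging.getLogger(__name__)


def classify_video_player(has_time_tag: bool, post_url: str) -> str:
    """Treat a video player without a duration as a GIF, but say so in the log."""
    if has_time_tag:
        return 'Video'
    _logger.warning('Video player without a time tag in %s, treating it as a GIF', post_url)
    return 'Gif'
```

This keeps existing workflows running while still surfacing the case in the logs, which is the trade-off against raising an error on markup the code doesn't fully understand.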

JustAnotherArchivist · 2022-12-20T00:22:24Z

snscrape/modules/telegram.py

+				if kwargs['href'] in outlinks:
+					outlinks.remove(kwargs['href'])


I'd prefer leaving the link preview href in outlinks as well, similar to how the Twitter scraper will have outlinks from link cards in outlinks.

JustAnotherArchivist · 2022-12-20T00:26:52Z

snscrape/modules/telegram.py

+			try:
+				if soup.find('a', attrs = {'class': 'tgme_widget_message_date'}, href = True)['href'].split('/')[-1] == '1':
+					# if message 1 is the first message in the page, terminate scraping
+					break
+			except:
+				pass


Bare except is awful and hides various exceptions that shouldn't be caught, such as ^C interrupts. This test should really be done without a try-except.
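One way to write the same test without try-except is to handle the missing-link case explicitly. In this sketch, `is_first_post` is a hypothetical helper that takes the href of the page's first date link, or None when no such link was found.

```python
import typing


def is_first_post(href: typing.Optional[str]) -> bool:
    """True if href points at post 1 of a channel, e.g. https://t.me/CHANNEL/1."""
    if not href:
        return False
    return href.split('/')[-1] == '1'
```

The caller would look up the date link once, pass its `href` (or None if the lookup found nothing), and `break` when this returns True. Nothing is swallowed, so a keyboard interrupt or a genuine bug still propagates.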

JustAnotherArchivist · 2022-12-20T00:32:49Z

snscrape/modules/telegram.py

+				# some pages are missing a "tme_messages_more" tag, causing early termination
+				if '=' not in nextPageUrl:
+					nextPageUrl =  soup.find('link', attrs = {'rel': 'canonical'}, href = True)['href']
+				nextPostIndex = int(nextPageUrl.split('=')[-1]) - 20
+				if nextPostIndex > 20:
+					pageLink = {'href': nextPageUrl.split('=')[0] + f'={nextPostIndex}'}
+				else:
+					break


Wouldn't this approach lead to duplicates in some cases? When a post includes multiple media, those get their own ID, so there'd be a gap in post IDs.
I feel like it'd be better to get the ID of the first post on the page and then use that as the before parameter.
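A sketch of that alternative: derive the next `before` value from the oldest post ID on the current page rather than subtracting a fixed 20. It assumes `before=N` returns posts with IDs below N and that post links end in a numeric ID; `next_before` is a hypothetical helper.

```python
import typing


def next_before(post_hrefs: typing.List[str]) -> typing.Optional[int]:
    """Return the before value for the next page, or None when there is nothing older."""
    ids = []
    for href in post_hrefs:
        last = href.split('?')[0].rstrip('/').split('/')[-1]
        if last.isdigit():
            ids.append(int(last))
    if not ids:
        return None
    oldest = min(ids)
    # Post 1 is the first post of a channel, so there is nothing before it
    return oldest if oldest > 1 else None
```

Because the value comes from the posts actually seen, gaps in the ID sequence (such as those caused by multi-media posts) cannot make consecutive pages overlap.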

JustAnotherArchivist · 2022-12-20T00:59:29Z

Also, while testing this and looking around for odd cases, I discovered that Telegram supports 'round videos'. Example: https://t.me/s/memes/9641
Support for those doesn't need to be part of this PR, but I thought it'd be appropriate to mention it here in case you do want to handle it. Else I'll add it after this is merged.

turicas · 2023-02-06T01:38:19Z

Hello. I'd love to get this PR merged. Is there anything I can do to help?

Demmenie · 2023-07-10T21:32:16Z

Please finish and merge this; it's quite a useful feature. If there's anything I can do to help, I'd be more than happy to.

turicas · 2023-07-10T22:13:16Z

Just so you know: I've created a Python library called tchan that scrapes Telegram public channels and does not have the problems snscrape currently has regarding this PR (it is still missing some features, such as scraping polls).

Demmenie · 2023-07-11T18:21:52Z

What I need (and what I'm pretty sure this PR provides) is an easy way to check if a post contains one or more videos.

john-osullivan · 2023-12-31T00:34:56Z

Hey, I recently did a Bellingcat workshop which used this fork -- I'd love to close the gap and get it merged. I'll try to cut something soon, @JustAnotherArchivist, and will let you know if I have any questions on how you'd like it!

john-osullivan · 2024-02-22T06:25:09Z

I wrapped up my changes to all your comments, @JustAnotherArchivist !

Asked @loganwilliams for a review over there, but if you'd like to preemptively call out any issues with my changes, I'd love to get em fixed ahead of time and only do one merge process 👍

Scrape images, video, and post forwarding information for Telegram ch…

72b26f2

…annel posts

loganwilliams closed this Feb 24, 2022

loganwilliams deleted the more-tg-info branch February 24, 2022 14:31

loganwilliams restored the more-tg-info branch February 24, 2022 14:33

Fix KeyError caused by retweets without URLs in TwitterProfileScraper

de4ebed

JustAnotherArchivist added enhancement New feature or request module:telegram labels Feb 26, 2022

loganwilliams reopened this Mar 8, 2022

Clean up unnecessary imports

b8efce2

JustAnotherArchivist requested changes Mar 9, 2022

trislee added 5 commits March 29, 2022 01:12

added capability to extract the number of channel members when the th…

ed82916

…e string in membersDiv has the word 'subscribers' rather than 'members'.

handled case where channel has no profile image

fb8d73a

added capability to scrape multiple videos from a single post

d32c9ad

implemented Media dataclasses for Telegram, and added variable for ex…

a7eb54d

…tracting a post's view count

added a forwardedUrl attribute to TelegramPost and made forwarded att…

4e59638

…ribute type Channel.

trislee mentioned this pull request Mar 31, 2022

Implemented JustAnotherArchivist's requested changes to Telegram scraper from PR bellingcat/snscrape#2

Merged

trislee and others added 13 commits April 3, 2022 01:45

fixed edge case for videos that have data-link-attr but no href attri…

2ce014a

…bute

Merge branch 'JustAnotherArchivist:master' into master

f978954

made Telegram scraper not return full channel info for forwarded_from…

babcddd

… attribute; fixed video edge cases.

fixed issue where Telegram scraper terminated early because some page…

1e4e0c2

…s didn't have a next page link (added reasonable default)

fixed issue where some videos and photos weren't being scraped (becau…

b276c3c

…se they weren't in a post containing a 'tgme_widget_message_text' div

added additional termination criteria to Telegram scraper

97d38e5

added additional attributes for hashtags and user mentions, removed r…

9b3faec

…edundant outlinks

moved forward finding out of tgme_widget_message_text clause, since i…

21f7b62

…t wasn't correctly getting the forwarding information in forwarded posts that contained attachments but no text

improved consistency of code formatting and added _STYLE_MEDIA_URL_PA…

5648e95

…TTERN as variable

Merge branch 'master' into telegram-media

c18ca0f

Merge pull request #2 from bellingcat/telegram-media

0a4bd39

Implemented JustAnotherArchivist's requested changes to Telegram scraper from PR

fixed merge conflicts

f385135

Merge branch 'JustAnotherArchivist-master'

b13e62e

trislee and others added 3 commits May 9, 2022 09:37

forgot to save modified twitter.py module

e2d9223

Merge pull request #4 from JustAnotherArchivist/master

0822a9c

upstream merge

merged master into more-tg-info to update upstream PR

07a5f6f

fixed merge

65723f1

fixed typo

56e4232

incorporated requested changes from maintainer, removed modifications…

056cd62

… to VK module

fixed edge case where channel with no members fails _get_entity

73f10a4

JustAnotherArchivist linked an issue Nov 6, 2022 that may be closed by this pull request

Telegram image, video, audio message or file from post #583

Open

fixed edge case where members information wasnt included

cbdfeed

JustAnotherArchivist requested changes Dec 20, 2022

merged upstram changes

cacd783

TheTechRobo mentioned this pull request Aug 29, 2023

Add Photos, video thumbs and views to posts #1026

Open

JustAnotherArchivist mentioned this pull request Nov 24, 2023

add: viewCount property to TelegramPost data class #1035

Open

john-osullivan mentioned this pull request Jan 18, 2024

Solved outstanding issues mentioned in snscrape#413 bellingcat/snscrape#8

Open
