Scrape images, video, and post forwarding information for Telegram #413

Open
wants to merge 30 commits into
base: master
from

@loganwilliams loganwilliams commented Feb 24, 2022

A small enhancement that adds some additional information from Telegram channel posts.

@loganwilliams loganwilliams deleted the more-tg-info branch February 24, 2022 14:31
@loganwilliams loganwilliams restored the more-tg-info branch February 24, 2022 14:33
Owner

@JustAnotherArchivist JustAnotherArchivist left a comment

Hi, thank you for this – and happy to see snscrape used by Bellingcat!

I don't have time at the moment to review and test it in detail, but a few general thoughts:

  • For forwarded posts, I'd like to see a URL to the original post as well as a reference to the channel behind it. I guess there's zero info in the HTML besides the username, so that might require some changes on the Channel class (making title, verified, and photo optional).
  • Posts can have more than one video. I believe the current code only catches the last video.
  • For videos and audio, I'd like to extract everything Telegram provides. Definitely the duration and thumbnail, perhaps even the audio amplitude bars although that's probably overkill and of little value. This would require separate dataclasses to carry this extra data, similar to how the Twitter module handles media.
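A minimal sketch of what these suggestions could look like, assuming plain dataclasses. The field names `thumbnail_url` and `duration` are hypothetical and chosen only for illustration; they are not taken from the snscrape code base.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Channel:
    # For a forwarded post only the username is known up front,
    # so every other field is optional.
    username: str
    title: Optional[str] = None
    verified: Optional[bool] = None
    photo: Optional[str] = None


@dataclass
class Video:
    # Hypothetical fields: thumbnail and duration as requested in the review.
    thumbnail_url: str
    duration: Optional[float] = None  # seconds
    url: Optional[str] = None  # not every video exposes a direct URL


@dataclass
class VoiceMessage:
    url: str
    duration: float  # seconds
    bars: Optional[List[float]] = None  # amplitude bars, if extracted at all


# A 'thin' channel reference for a forwarded post.
forwarded = Channel(username='example_channel')
```

Keeping each media type in its own class leaves room for type-specific extras without widening a shared class.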

@loganwilliams
Author

Makes sense to me. I don't have a timeline for when we'd be able to make those changes (there are a few high-priority things happening right now), but we've been using our fork for a while and I wanted to open a PR as a reminder to merge it upstream at some point.

@trislee

trislee commented May 25, 2022

I implemented the requested changes:

  • Made attachment handling similar to Twitter's: dataclasses for Image, Video, and Gif.
  • Added capability to scrape multiple Videos from a single message
  • Added attribute for the full forwarded URL and made the forwarded attribute have type Channel
  • Added capability to scrape number of views for messages

Additional changes:

  • Telegram seems to have changed its interface so that the tme_messages_more link with its data-before attribute is missing on some pages. To deal with this, I added a fallback that decrements the before query parameter by 20. This requires a few additional changes to handle edge cases:
    • If the querystring doesn't contain the before parameter, get the canonical url tag in the page
    • Added a termination condition: if the first tgme_widget_message_date has an href to the first post (t.me/CHANNEL/1), terminate the scraping loop
  • Moved attachment extraction out of if (message := post.find('div', class_ = 'tgme_widget_message_text')): clause, since some attachments are in messages without text, so they weren't being added to the media list
  • I also added a responseOkCallback function to retry the request if we get a 5xx response.
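The retry-on-5xx idea from the last bullet can be sketched roughly as follows. This is a standalone illustration, not snscrape's actual callback interface: `response_ok`, `fetch_with_retries`, and the `fetch` callable returning a `(status_code, body)` pair are all hypothetical names.

```python
from typing import Callable, Tuple


def response_ok(status_code: int) -> Tuple[bool, str]:
    """Return (ok, reason); only 5xx responses are treated as retryable."""
    if 500 <= status_code <= 599:
        return False, f'server error {status_code}'
    return True, 'ok'


def fetch_with_retries(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 3,
) -> str:
    """Call fetch(url) until response_ok accepts the status or attempts run out."""
    reason = 'no attempts made'
    for _ in range(max_attempts):
        status_code, body = fetch(url)
        ok, reason = response_ok(status_code)
        if ok:
            return body
    raise RuntimeError(f'giving up on {url}: {reason}')
```

Separating the check from the retry loop keeps the policy (which statuses count as failures) easy to change without touching the request code.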

@TheTechRobo
Copy link
Contributor

Hm, should this be rebased? 25 commits is a lot, but I'm not sure about @JustAnotherArchivist's policy on that.

@TheTechRobo
Contributor

Pasting something from the PR to the fork that I think is relevant:

I got frustrated with the slowness of the scraping so I changed the forwarding Channel method by modifying the Channel definition so that it only requires the username, rather than retrieving the full forwarded channel information for every forwarded message.

@JustAnotherArchivist
Owner

The changes sound good so far, though I haven't reviewed the code thoroughly yet. Some quick comments on things I noticed at a glance:

  • I don't mind the number of commits. The merges make the history slightly messy, but that's alright.
  • The 'thin' Channel change is fine; the Twitter module does that as well, only including data that is already available e.g. for replied-to users.
  • The functions at the bottom should be prefixed with an underscore to mark them as private API.
  • views attribute: parse_num returns an IntWithGranularity, not an int.
  • outlinks, mentions, etc. should be None if there aren't any, not an empty list. Related to that: typing.Optional is missing on a couple in the class definition.
  • The changes to the VK module should be a separate PR.
  • Do you have an example of a channel page that often lacks the before= link? I haven't noticed this before.

@trislee

trislee commented Jun 23, 2022

This is an example of a channel page with no tme_messages_more data-before attribute: https://t.me/s/proudboysusa?before=8033
I only started noticing such pages after I had started working on this fork, so maybe Telegram changed something in their web interface in the last few months.

@trislee

trislee commented Jun 23, 2022

Incorporated your changes, let me know if there are other issues you'd like me to address

@JustAnotherArchivist JustAnotherArchivist linked an issue Nov 6, 2022 that may be closed by this pull request
@trislee

trislee commented Dec 2, 2022

@JustAnotherArchivist Any additional changes you want us to make? We've been using this quite a bit and would love to see it get merged.

Owner

@JustAnotherArchivist JustAnotherArchivist left a comment

Sorry for the delay, and thanks for the fixes! I'll have some style nits, but let's get the functionality sorted out first.

# Generic filter of links to the post itself, catches videos, photos, and the date link
if style != '':
    imageUrls = _STYLE_MEDIA_URL_PATTERN.findall(style)
    if len(imageUrls) == 1:

Are there any examples with more than one match (here or a few lines below)?


media.append(VoiceMessage(url = audioUrl, duration = duration, bars = barHeights))

for videoPlayer in post.find_all('a', {'class': 'tgme_widget_message_video_player'}):

Because the extraction of images and videos is done separately, the order is not preserved. For example, https://t.me/s/nexta_live/43102 has video 1 (without URL), image, video 2 (with URL), but the image gets listed first. I think that can be fixed by simply merging this loop (and also the one for the voice player extraction) into the general link loop above, since they're all a tags in the post div.
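One way to picture the single-pass idea: walk the a tags once, in document order, and classify each by its class. The sketch below uses the standard library's html.parser rather than the scraper's own parsing code; apart from tgme_widget_message_video_player, the class names in the mapping are assumptions, as are the names `MediaOrderParser` and `media_in_order`.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Only the video player class appears in this thread; the other two are assumed.
_KIND_BY_CLASS = {
    'tgme_widget_message_video_player': 'video',
    'tgme_widget_message_photo_wrap': 'photo',
    'tgme_widget_message_voice_player': 'voice',
}


class MediaOrderParser(HTMLParser):
    """Collect (kind, href) pairs for media links in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.media: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        for class_name in (attributes.get('class') or '').split():
            kind = _KIND_BY_CLASS.get(class_name)
            if kind is not None:
                self.media.append((kind, attributes.get('href') or ''))
                break


def media_in_order(post_html: str) -> List[Tuple[str, str]]:
    parser = MediaOrderParser()
    parser.feed(post_html)
    parser.close()
    return parser.media
```

Because every media link is handled in the same loop, the resulting list follows the order in the post rather than grouping by media type.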

}
timeTag = videoPlayer.find('time')
if timeTag is None:
    cls = Gif

Do you have some examples? I don't remember seeing fake-GIFs on Telegram before. (Also for the future test suite.)

Telegram doesn't have a policy on whether or not they're allowed, right? I don't think a real-GIF would ever inaccurately go down this path, so isn't it just making the logic more robust against change?

I prefer erroring out on things the code doesn't actually understand and implement. It might be 'more robust' in some sense, but it can easily result in misparsing the data as well.

But if 'videos' without a time tag already exist similar to how it is on Twitter, this is totally fine. Hence why I'm asking for examples. :-)

I'm not an active Telegram user, so I don't think I'll be able to quickly come up with an example myself. @loganwilliams, do you remember running into a problem which required adding this line back when you implemented this?

On the other hand, what data misparsing are you imagining from this, @JustAnotherArchivist? Especially if Twitter already has examples which require this behavior, what's the error mode that we'd want to call out by throwing here?

I'm hoping that merging this will get everyone off the fork, but am concerned that if we introduce new exceptions, it'll require more significant updates to existing workflows.

Edit: As a compromise, I'm adding a warning log to this in my PR. It won't stop execution, but will let the user know in case there's something actually wrong.

Author

Hi @john-osullivan. Thanks for your work pushing this forward. You can see an example of a GIF here: https://t.me/thisisatestchannel19451923/3

It sits in the same .tgme_widget_message_video_player element and lacks a duration.
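The compromise discussed in this thread (treat a player without a time tag as a GIF, but log a warning) could be sketched as below. The dataclasses, the `classify_player` function, and the assumption that the duration text is colon-separated (for example m:ss) are all hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Optional, Union

_logger = logging.getLogger(__name__)


@dataclass
class Video:
    thumbnail_url: str
    duration: int  # seconds


@dataclass
class Gif:
    thumbnail_url: str


def classify_player(thumbnail_url: str, duration_text: Optional[str]) -> Union[Video, Gif]:
    """Return a Gif when no duration is present, otherwise a Video."""
    if duration_text is None:
        # Not an error, but worth surfacing in case the markup changed.
        _logger.warning('Video player without a time tag, treating as GIF: %s', thumbnail_url)
        return Gif(thumbnail_url)
    seconds = 0
    for part in duration_text.strip().split(':'):
        # Assumed format: colon-separated fields, most significant first.
        seconds = seconds * 60 + int(part)
    return Video(thumbnail_url, seconds)
```

A malformed duration still raises a ValueError here, so input the code does not understand is not silently misparsed.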

Comment on lines +222 to +223
if kwargs['href'] in outlinks:
    outlinks.remove(kwargs['href'])

I'd prefer leaving the link preview href in outlinks as well, similar to how the Twitter scraper will have outlinks from link cards in outlinks.

Comment on lines +243 to +248
try:
    if soup.find('a', attrs = {'class': 'tgme_widget_message_date'}, href = True)['href'].split('/')[-1] == '1':
        # if message 1 is the first message in the page, terminate scraping
        break
except:
    pass

Bare except is awful and hides various exceptions that shouldn't be caught, such as ^C interrupts. This test should really be done without a try-except.
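A sketch of the same check written without try-except. It assumes the caller passes the href of the first tgme_widget_message_date link, or None when no such link was found; `is_first_post` is a hypothetical helper name.

```python
from typing import Optional


def is_first_post(date_href: Optional[str]) -> bool:
    """Return True when the date link points at post 1 of the channel."""
    if date_href is None:
        # No date link on the page: nothing to compare, so do not terminate.
        return False
    return date_href.rstrip('/').split('/')[-1] == '1'
```

With an explicit None check, a missing tag is handled deliberately and unrelated exceptions such as KeyboardInterrupt are no longer swallowed.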

Comment on lines +251 to +258
# some pages are missing a "tme_messages_more" tag, causing early termination
if '=' not in nextPageUrl:
    nextPageUrl = soup.find('link', attrs = {'rel': 'canonical'}, href = True)['href']
nextPostIndex = int(nextPageUrl.split('=')[-1]) - 20
if nextPostIndex > 20:
    pageLink = {'href': nextPageUrl.split('=')[0] + f'={nextPostIndex}'}
else:
    break

Wouldn't this approach lead to duplicates in some cases? When a post includes multiple media, those get their own ID, so there'd be a gap in post IDs.
I feel like it'd be better to get the ID of the first post on the page and then use that as the before parameter.
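The alternative suggested here, deriving the before parameter from the oldest post actually present on the page, might be sketched like this. `next_page_url` is a hypothetical helper, and the URL shape follows the t.me/s/CHANNEL?before=N form seen earlier in this thread.

```python
from typing import List, Optional


def next_page_url(channel: str, post_hrefs: List[str]) -> Optional[str]:
    """Build the URL of the next (older) page from the post links on this page.

    Returns None when there is nothing older to fetch.
    """
    ids = []
    for href in post_hrefs:
        last = href.rstrip('/').split('/')[-1]
        if last.isdecimal():
            ids.append(int(last))
    if not ids:
        return None
    oldest = min(ids)  # independent of the order in which links were collected
    if oldest <= 1:
        return None  # post 1 is on this page, so the channel start was reached
    return f'https://t.me/s/{channel}?before={oldest}'
```

Since the value comes from IDs that exist on the page, gaps in the numbering do not cause pages to overlap the way a fixed decrement of 20 can.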

@JustAnotherArchivist
Owner

Also, while testing this and looking around for odd cases, I discovered that Telegram supports 'round videos'. Example: https://t.me/s/memes/9641
Support for those doesn't need to be part of this PR, but I thought it'd be appropriate to mention it here in case you do want to handle it. Else I'll add it after this is merged.

@turicas

turicas commented Feb 6, 2023

Hello. I'd love to get this PR merged. Is there anything I can do to help?

@Demmenie

Please finish and merge this; it's quite a useful feature. If there's anything I can do to help, I'd be more than happy to.

@turicas

turicas commented Jul 10, 2023

Just so you know: I've created a Python library called tchan that scrapes Telegram public channels and does not have the problems snscrape currently has regarding this PR (it's still missing some features, like scraping polls).

@Demmenie

What I need (and what I'm pretty sure this PR provides) is an easy way to check if a post contains one or more videos.

@john-osullivan

Hey, I recently did a Bellingcat workshop which used this fork -- I'd love to close the gap and get it merged. I'll try to cut something soon, @JustAnotherArchivist, and will let you know if I have any questions on how you'd like it!

@john-osullivan

I wrapped up my changes to all your comments, @JustAnotherArchivist!

Asked @loganwilliams for a review over there, but if you'd like to preemptively call out any issues with my changes, I'd love to get em fixed ahead of time and only do one merge process 👍

Labels
enhancement New feature or request module:telegram
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Telegram image, video, audio message or file from post
7 participants