Is there a way to always get absolute URLs? #576

Open
gonssal opened this issue Nov 26, 2020 · 4 comments
Labels
status/review-needed, type/enhancement (New feature or request), type/question (Further information is requested)

gonssal commented Nov 26, 2020

I wanted to know if there's a way to make Ferret always return absolute URLs when they are relative in the source code, like web browsers do.

I'm crawling a site by getting a bunch of href attribute values from different anchors into an array and then iterating that array to load and return the content I need from each of the URLs.

The problem is that some of the URLs are absolute (https://example.com/whatever) and others are relative (/whichever), so when I try to get a DOCUMENT from one of the relative URLs, I get the following error:

Failed to execute the query
failed to retrieve a document /whichever: Get /whichever: unsupported protocol scheme "": DOCUMENT(url) at 11:16: FORurlinurlsLETpropDoc=DOCUMENT(url)RETURN{...} at 10:1

I'd ideally want to run the entire process in a single FQL script, but I couldn't find a way to convert the relative URLs or make them work, so it seems my only option is to first return them to a Go program to be fixed and then run an additional data-gathering query on each of them.
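
For reference, the Go-side fallback described above could look roughly like the following. This is only a sketch using the standard `net/url` package: the helper name `absolutize` is hypothetical, and it assumes the page URL and the collected `href` values have already been returned from the FQL query as plain strings.

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize (hypothetical helper) resolves each href against base,
// following the same RFC 3986 rules a browser applies.
// Absolute hrefs come back unchanged; relative ones inherit scheme and host from base.
func absolutize(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, fmt.Errorf("parse base %q: %w", base, err)
	}

	out := make([]string, 0, len(hrefs))
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			return nil, fmt.Errorf("parse href %q: %w", href, err)
		}
		out = append(out, baseURL.ResolveReference(ref).String())
	}
	return out, nil
}

func main() {
	urls, err := absolutize("https://example.com/some-url", []string{
		"https://example.com/whatever", // already absolute
		"/whichever",                   // relative to the site root
	})
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

The resolved list can then be fed to a second query that calls DOCUMENT(url) on each entry.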

ziflex added the type/question label Nov 26, 2020
Member

ziflex commented Nov 26, 2020

If it's relative, why don't you just concat it with a base url?

doc.url + link.attributes.href

Author

gonssal commented Nov 26, 2020

> If it's relative, why don't you just concat it with a base url?
>
> doc.url + link.attributes.href

Because, as I explain in the issue (in the third paragraph specifically), there are both relative and absolute URLs.

Member

ziflex commented Nov 27, 2020

You can do something like this:

LET href = link.attributes.href
LET url = CONTAINS(href, "http") ? href : doc.url + link.attributes.href

I might add helper functions for URL manipulation in a future release.
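
One caveat with a substring test like the one above, assuming CONTAINS is a plain substring match: a relative path that merely contains "http" would be treated as absolute. The Go sketch below illustrates the difference between a substring test and checking for an actual scheme. The helper name `hasScheme` and the path `/docs/http-basics` are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hasScheme (hypothetical helper) reports whether href parses as a URL
// that carries an explicit scheme such as "https".
func hasScheme(href string) bool {
	u, err := url.Parse(href)
	return err == nil && u.IsAbs()
}

func main() {
	hrefs := []string{
		"https://example.com/whatever", // absolute
		"/whichever",                   // relative
		"/docs/http-basics",            // relative, but contains "http"
	}
	for _, href := range hrefs {
		// The substring test and the scheme test disagree on the last entry.
		fmt.Printf("%s substring=%v scheme=%v\n",
			href, strings.Contains(href, "http"), hasScheme(href))
	}
}
```

Note that `IsAbs` is also true for non-HTTP schemes such as `mailto:`, so a crawler may still want to filter on the scheme afterwards.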

Author

gonssal commented Nov 27, 2020

I ended up using FIND_FIRST instead, thank you.

I think it would be really nice to automatically convert all relative paths in href, src, etc. to absolute ones, the same way web browsers do: if you hover over a link, it always shows the absolute URL it points to. Considering this is a crawling tool, I don't think relative URLs make a lot of sense.

This is especially true for URI fragments. For example, if I'm on https://example.com/some-url and there's an anchor with href="#marker", with your proposed solution I'd get https://example.com/#marker instead of the correct https://example.com/some-url#marker.
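
To make the fragment case concrete, here is a small Go sketch of browser-style resolution against the full page URL rather than the site root. The function name `resolve` is hypothetical; the behaviour comes from `net/url`'s `ResolveReference`, which implements RFC 3986 reference resolution.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve (hypothetical helper) resolves href against the URL of the page
// it was found on, keeping the page's path when href is only a fragment.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/some-url"
	for _, href := range []string{"#marker", "/whichever", "other"} {
		abs, err := resolve(page, href)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", href, abs)
	}
	// #marker    -> https://example.com/some-url#marker
	// /whichever -> https://example.com/whichever
	// other      -> https://example.com/other
}
```

Plain string concatenation cannot cover all three cases at once, which is why a resolution helper (built into the tool or applied outside it) is the more robust choice.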

stale bot added the status/stale label Dec 28, 2020
ziflex added the status/review-needed and type/enhancement labels and removed the status/stale label Dec 29, 2020
MontFerret deleted a comment from stale bot Dec 30, 2020
ziflex added this to the Backlog milestone Mar 6, 2021
2 participants