Spanish web content not displayed correctly: '?' is shown instead of the correct character #189

Open
ElliotFer2000 opened this issue Jul 18, 2023 · 1 comment

ElliotFer2000 commented Jul 18, 2023

Spanish words with accents are not displayed properly: accented characters are replaced with a "?" character.

Why is this happening? How can I tell the scraper that I'm dealing with Spanish-language content?

code:

$web = new \Spekulatius\PHPScraper\PHPScraper;

$web->go("https://www.marca.com");

return $web->outlineWithParagraphs;

I return the outline to the client in JSON format. The result I'm getting looks like this:

[
    {
        "tag": "h2",
        "content": "Joao F?lix: \"El Bar?a siempre ha sido mi primera opci?n\""
    }
]

I have already tried to solve the problem by putting this at the beginning of the script: setlocale(LC_ALL, 'es_AR')

F?lix and opci?n are not displayed properly in the response; they should be Félix and opción. A ? is shown instead of é and ó.

When I return the result of this function, the characters display correctly:

utf8_encode(file_get_contents("https://www.marca.com"))

I have tried requesting the document with file_get_contents, encoding the result, and then passing it to the $web->setContent function. Done this way, I get the expected output:

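// Fetch the raw page, convert it from ISO-8859-1 to UTF-8, then hand it to the scraper.
$web = new \Spekulatius\PHPScraper\PHPScraper;
$rawPageContent = utf8_encode(file_get_contents("https://www.marca.com"));
$web->setContent("https://www.marca.com", $rawPageContent);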
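A note on the workaround above: utf8_encode always treats its input as ISO-8859-1, and it is deprecated as of PHP 8.2. A minimal sketch of the same idea using mb_convert_encoding follows. The helper name toUtf8 and its default source charset are assumptions made for illustration, not part of PHPScraper, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper (not part of PHPScraper): convert raw bytes to UTF-8
// from an explicitly named source charset.
function toUtf8(string $raw, string $sourceCharset = 'ISO-8859-1'): string
{
    // If the bytes already form valid UTF-8, return them unchanged
    // so that the content is not encoded twice.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Otherwise reinterpret the bytes as $sourceCharset and re-encode as UTF-8.
    return mb_convert_encoding($raw, 'UTF-8', $sourceCharset);
}

// "Félix" as ISO-8859-1 bytes: é is the single byte 0xE9.
$latin1 = "F\xE9lix";

// After conversion, é becomes the two-byte UTF-8 sequence 0xC3 0xA9.
echo bin2hex(toUtf8($latin1)), PHP_EOL; // 46c3a96c6978
```

The converted string could then be passed to $web->setContent in the same way as in the snippet above. The validity check is only a heuristic: a short ISO-8859-1 string can occasionally also be valid UTF-8, so knowing the page's real charset is more reliable than guessing.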
spekulatius (Owner) commented

Hello @ElliotFer2000

It looks like the page isn't being fetched with the correct encoding. I was able to reproduce the issue. Have you looked into how this could be resolved?

Peter
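One possible direction, sketched under assumptions: if the server declares a charset in its Content-Type response header, that value could be used to decide how to convert the body before parsing. The helper below is hypothetical and not part of PHPScraper, and whether marca.com actually sends such a header is not confirmed in this thread.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value, or return null when none is declared.
function charsetFromContentType(string $contentType): ?string
{
    // Matches charset=xyz and charset="xyz", case-insensitively.
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    return null;
}

// Example header values (made up for illustration).
echo charsetFromContentType('text/html; charset=iso-8859-15') ?? 'none', PHP_EOL; // ISO-8859-15
echo charsetFromContentType('text/html') ?? 'none', PHP_EOL;                      // none
```

The returned name could serve as the source charset for mb_convert_encoding. When no charset is declared in the header, a fallback such as inspecting the page's meta tags would be needed, which this sketch does not cover.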
