Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to disable injectableTypes filter ? #299

Open
ayqy opened this issue Jan 9, 2019 · 5 comments
Open

How to disable injectableTypes filter ? #299

ayqy opened this issue Jan 9, 2019 · 5 comments

Comments

@ayqy
Copy link

ayqy commented Jan 9, 2019

The same situation with #243 ,there is no ContentType specified in my target response header, so, JQuery injection was broken, although it's valid HTML, and JQuery Injected successfully if I remove injectableTypes filter.

How about provide an option to disable this restriction and log a warning message only, or just catch injection failure cases ?

@ayqy
Copy link
Author

ayqy commented Jan 9, 2019

HTML/XML like legacy documents are one of common cases.

@ayqy
Copy link
Author

ayqy commented Jan 9, 2019

Here is a workaround:

var cheerio = require('cheerio');

new Crawler({
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            // Workaround goes here
            loadJQuery(res);
            var $ = res.$;
        }
    }
});

function loadJQuery(response) {
    if (response.$) return response;
    var cheerioOptions = {
        normalizeWhitespace: false,
        xmlMode: false,
        decodeEntities: true
    };
    $ = cheerio.load(response.body, cheerioOptions);
    response.$ = $;
}

@mike442144
Copy link
Collaborator

Actually, these years we've found that many websites responded json text or html document without content-type field. In this case, we have to use the same code you pasted when we got an error or warning message as you suggested above, but I prefer to keep correctness as a library. It is developers who are in charge of the rest work unexpected. Hope it helpful :)

@ayqy
Copy link
Author

ayqy commented Feb 21, 2019

Honestly, the consideration mentioned above is reasonable, just lift these ambiguous cases up to developers, and do not put any assumption on unspecific cases.

But the workaround here is extracted from source, and not stable. Therefore, a more beautiful solution is an option like this:

new Crawler({

    responseHeaderPatch: {
        // TRUST ME, I know it's a real HTML document
        'Content-Type': 'text/html'
    },

    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
        }
    }
});

And respectively warning:

response body is not HTML, skip injecting. You can use responseHeaderPatch option and see [DocURL]...

@mike442144
Copy link
Collaborator

It's really a good idea, but we have no response stage for plugin. We are planning to refactor crawler to middleware based architecture, then I think this is the best choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants