XPath results contain namespace in the keys #20

Open
aemreunal opened this issue Oct 16, 2016 · 5 comments

@aemreunal
Collaborator

Hello,

First of all, commendable job. Thank you for your work.

I'm working on a Jupyter notebook, which will be a tutorial on how to use Riko to access unstructured website data in a structured manner. When I finish it, I will send you a pull request with the notebook (or get it to you in an alternative way), as I think it could be a great beginner's guide for everyone who'd like to use Riko.

While preparing the notebook, I ran into an interesting situation: when I parse <li> elements using xpathfetchpage and those elements have other elements nested underneath them, the keys for those nested elements have a weird {http://www.w3.org/1999/xhtml} prefix. The following code snippet illustrates it:

url = 'http://www.sozcu.com.tr/kategori/yazarlar/yilmaz-ozdil/'
xpath = '/html/body/div[5]/div[6]/div[3]/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li/a'
xpath_conf = {'xpath': xpath, 'url': url}
flow_main = SyncPipe('xpathfetchpage', conf=xpath_conf)
print next(flow_main.output)

This prints:

{
    u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/', 
    u'{http://www.w3.org/1999/xhtml}p': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n', 
    u'{http://www.w3.org/1999/xhtml}span': {
        u'content': u'16 Ekim 2016', 
        u'class': u'date'
    }, 
    u'title': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n'
}

for the fetched structure:

<a href="http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/" title="GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın">
    <p>GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın</p> 
    <span class="date">16 Ekim 2016</span>
</a>

(This page is updated daily, so the exact output might differ when you run it, but the structure remains the same.)

I was unable to figure out why there's that '{http://www.w3.org/1999/xhtml}' prefix on the nested keys or how to get rid of it. I understand that it differentiates between the attributes of a tag and the nested elements, but maybe there is a flag (that I was unable to find) to retrieve the nested elements as a list under a key like 'child' in the top-level dictionary.

Thank you for your assistance.

@reubano
Member

reubano commented Nov 5, 2016

> Hello,
>
> First of all, commendable job. Thank you for your work.

Thank you! And sorry for the late reply. Hopefully I can be of some
assistance.

> I'm working on a Jupyter notebook, which will be a tutorial on how to use
> Riko to access unstructured website data in a structured manner. When I
> finish it, I will send you a pull request with the notebook (or get it to
> you in an alternative way), as I think it could be a great beginner's guide
> for everyone who'd like to use Riko.

That would be amazing! I'll go ahead and point you to a few resources that
may help you out:

> While preparing the notebook, I ran into an interesting situation: when
> I parse <li> elements using xpathfetchpage and those elements have other
> elements nested underneath them, the keys for those nested elements have
> a weird {http://www.w3.org/1999/xhtml} prefix. The following code
> snippet illustrates it:
>
> url = 'http://www.sozcu.com.tr/kategori/yazarlar/yilmaz-ozdil/'
> xpath = '/html/body/div[5]/div[6]/div[3]/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li/a'
> xpath_conf = {'xpath': xpath, 'url': url}
> flow_main = SyncPipe('xpathfetchpage', conf=xpath_conf)
> print next(flow_main.output)
>
> This prints:
>
> {
>     u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/',
>     u'{http://www.w3.org/1999/xhtml}p': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n',
>     u'{http://www.w3.org/1999/xhtml}span': {
>         u'content': u'16 Ekim 2016',
>         u'class': u'date'
>     },
>     u'title': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n'
> }
>
> for the fetched structure:
>
> <a href="http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/" title="GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın">
>     <p>GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın</p>
>     <span class="date">16 Ekim 2016</span>
> </a>
>
> (This page is updated daily, so the exact output might differ when you run
> it, but the structure remains the same.)
>
> I was unable to figure out why there's that {http://www.w3.org/1999/xhtml} prefix on the nested keys or how to get rid of it.

This is due to a patch required to properly handle namespaces without lxml. Try installing riko with the lxml parser and see whether that fixes the issue; lxml works a bit differently from Python's native XML parser (ElementTree).
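For background, here is a minimal standard-library sketch of where a prefix like this comes from. It does not use riko; the sample markup is made up, and the assumption is only that the parsed tree carries the XHTML namespace. ElementTree reports namespaced tags as `{namespace}tag`, while unprefixed attributes stay un-namespaced, which matches `href` and `title` being clean in the output above.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup (an assumption for illustration): an <a> element
# declared in the XHTML namespace, similar in shape to the fetched structure.
XHTML = (
    '<a xmlns="http://www.w3.org/1999/xhtml" href="#" title="t">'
    '<p>text</p><span class="date">16 Ekim 2016</span></a>'
)

root = ET.fromstring(XHTML)

# Child tags come back as "{namespace}localname".
tags = [child.tag for child in root]
print(tags)

# A default namespace does not apply to attributes, so these stay plain.
attrs = sorted(root.attrib)
print(attrs)
```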

> I understand that it differentiates between the attributes of a tag and
> the nested elements, but maybe there is a flag (that I was unable to find)
> to retrieve the nested elements as a list under a key like 'child' in the
> top-level dictionary.

Could you please provide an example of the desired output?

> Thank you for your assistance.

@aemreunal
Collaborator Author

Hello,

Thank you for your detailed response. Unfortunately, the deadline for the tutorial was a week ago and I submitted my tutorial. I believe it turned out pretty good and I hope it'll be useful to someone. The tutorial is available here. I'll still check out everything you linked to.

Thank you very much.

@reubano
Member

reubano commented Nov 5, 2016

Well, glad you were able to get along without me. I really do need to start replying to my emails/gh issues in a more timely manner :). It's soooo cool to have proof that someone else besides me is using this library! Please let me know if there was any [other] part of riko you found confusing. I'm going through your notebook now, very impressive!! I'll submit an issue with typo corrections later. Also, feel free to submit a PR to the readme linking to your notebook.

@aemreunal
Collaborator Author

Thank you very much, I'm glad that you liked it! I just sent a pull request... I think :D I might've messed it up, it's been some time since I opened a pull request. Please tell me if I messed up and I'll re-send it.

@reubano changed the title from "Dict Keys Have Weird Prefix" to "XPath results contain namespace in the keys" Nov 6, 2016
@reubano
Member

reubano commented Nov 6, 2016

I did a bit more investigating, and this issue is compounded in certain cases (search for 'title': 'Amok'). Here you can see some really weird keys:

{
    '{http://www': {
        'w3': {
            'org/1999/xhtml}span': {'class': 'date', 'content': '6 Kasım 2016'},
            'org/1999/xhtml}p': 'Amok'
        }
    }
}

This is due to the original key {http://www.w3.org/1999/xhtml}span being split at each dot (.), which in turn is an implementation detail of dotdict. dotdict is intended to make it easy to access (and set) values of nested dicts via dot notation. And since the namespace includes dots, you get this unintended consequence.
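The splitting can be reproduced with a small sketch. Note that `set_by_dotted_key` is a hypothetical helper written for illustration, not riko's actual dotdict code; it only assumes that a key is split at every dot and each piece becomes one level of nesting.

```python
def set_by_dotted_key(target, dotted_key, value):
    """Hypothetical helper: store value under a key path split at each dot."""
    *parents, leaf = dotted_key.split('.')
    node = target
    for part in parents:
        # Create (or reuse) one nested dict per path segment.
        node = node.setdefault(part, {})
    node[leaf] = value
    return target


result = {}
set_by_dotted_key(result, '{http://www.w3.org/1999/xhtml}p', 'Amok')
set_by_dotted_key(
    result,
    '{http://www.w3.org/1999/xhtml}span',
    {'class': 'date', 'content': '6 Kasım 2016'},
)

# The two dots in "www.w3.org" turn one key into three nested levels:
# '{http://www' -> 'w3' -> 'org/1999/xhtml}p' (and '...}span').
print(result)
```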

So, the upshot of all this is that I need to figure out how to keep the namespace out of the results that xpath returns.
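Until then, one possible workaround is to post-process each item and drop the `{namespace}` prefix from its keys. The sketch below is an assumption about how that could look, not riko's fix: `strip_namespaces` is a hypothetical helper, it expects string keys, and it has to run on the item before any dot-based key splitting takes place.

```python
import re

# Matches a leading "{...}" namespace prefix on a key.
NAMESPACE_PREFIX = re.compile(r'^\{[^}]*\}')


def strip_namespaces(value):
    """Hypothetical helper: recursively remove "{namespace}" key prefixes."""
    if isinstance(value, dict):
        return {
            NAMESPACE_PREFIX.sub('', key): strip_namespaces(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [strip_namespaces(item) for item in value]
    return value


item = {
    'href': 'http://www.sozcu.com.tr/',
    '{http://www.w3.org/1999/xhtml}p': 'Amok',
    '{http://www.w3.org/1999/xhtml}span': {'class': 'date', 'content': '6 Kasım 2016'},
    'title': 'Amok',
}

cleaned = strip_namespaces(item)
print(cleaned)
```

One caveat with this approach: if an element has an attribute with the same name as one of its child tags, the stripped keys collide and one value overwrites the other.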

@reubano reopened this Nov 6, 2016
@reubano added the bug label Nov 6, 2016
@reubano self-assigned this Nov 6, 2016