XPath results contain namespace in the keys #20

aemreunal · 2016-10-16T06:57:53Z

Hello,

First of all, commendable job. Thank you for your work.

I'm working on a Jupyter notebook, which will be a tutorial on how to use Riko to access unstructured website data in a structured manner. When I finish it, I will send you a pull request with the notebook (or get it to you in an alternative way), as I think it could be a great beginner's guide for everyone who'd like to use Riko.

As I am preparing the notebook, I ran into an interesting situation: when I parse <li> elements using xpathfetchpage and those elements have other elements nested underneath them, the keys for those nested elements have a weird {http://www.w3.org/1999/xhtml} prefix. The following code snippet illustrates it:

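# Import path assumed from recent riko releases; older ones may expose it elsewhere.
from riko.collections import SyncPipe

url = 'http://www.sozcu.com.tr/kategori/yazarlar/yilmaz-ozdil/'
xpath = '/html/body/div[5]/div[6]/div[3]/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li/a'
xpath_conf = {'xpath': xpath, 'url': url}
flow_main = SyncPipe('xpathfetchpage', conf=xpath_conf)
# Print the first item in the stream (this form runs on Python 2 and 3).
print(next(flow_main.output))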

This prints:

{
    u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/', 
    u'{http://www.w3.org/1999/xhtml}p': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n', 
    u'{http://www.w3.org/1999/xhtml}span': {
        u'content': u'16 Ekim 2016', 
        u'class': u'date'
    }, 
    u'title': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n'
}

for the fetched structure:

<a href="http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/" title="GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın">
    <p>GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın</p> 
    <span class="date">16 Ekim 2016</span>
</a>

(This page is updated daily, so the exact output might differ when you run it, but the structure remains the same.)

I was unable to figure out why there's that '{http://www.w3.org/1999/xhtml}' prefix on the nested keys or how to get rid of it. I understand that it differentiates between the attributes of a tag and the nested elements, but maybe there is a flag (that I was unable to find) to retrieve the nested elements as a list under a key like 'child' in the top-level dictionary.

Thank you for your assistance.

reubano · 2016-11-05T14:41:28Z

> Hello,
>
> First of all, commendable job. Thank you for your work.

Thank you! And sorry for the late reply. Hopefully I can be of some
assistance.

> I'm working on a Jupyter notebook, which will be a tutorial on how to use
> Riko to access unstructured website data in a structured manner. When I
> finish it, I will send you a pull request with the notebook (or get it to
> you in an alternative way), as I think it could be a great beginner's guide
> for everyone who'd like to use Riko.

That would be amazing! I'll go ahead and point you to a few resources that
may help you out:

riko usage: source, nbviewer
dev-craft stream processing workshop: slides, source, binder
PyConZA stream processing talk: video, slides
PyConZA data mining tutorial: video, slides, source, binder

> As I am preparing the notebook, I ran into an interesting situation: when
> I am parsing <li> elements using the xpathfetchpage and if those elements
> have other elements nested underneath it, the keys to those nested elements
> have a weird {http://www.w3.org/1999/xhtml} prefix. The following code
> snippet can illustrate it:
>
> url = 'http://www.sozcu.com.tr/kategori/yazarlar/yilmaz-ozdil/'
> xpath = '/html/body/div[5]/div[6]/div[3]/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li/a'
> xpath_conf = {'xpath': xpath, 'url': url}
> flow_main = SyncPipe('xpathfetchpage', conf=xpath_conf)
> print next(flow_main.output)
>
> This prints:
>
> {
>     u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/',
>     u'{http://www.w3.org/1999/xhtml}p': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n',
>     u'{http://www.w3.org/1999/xhtml}span': {
>         u'content': u'16 Ekim 2016',
>         u'class': u'date'
>     },
>     u'title': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n'
> }
>
> for the fetched structure:
>
> <a href="http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/" title="GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın">
>     <p>GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın</p>
>     <span class="date">16 Ekim 2016</span>
> </a>
>
> (This page is updated daily so the exact output might differ when you run
> it but the structure remains the same)
>
> I was unable to figure out why there's that {http://www.w3.org/1999/xhtml}
> prefix on the nested key values or how to get rid of them.

This is due to a patch required to properly handle namespaces without lxml. You can see if installing riko with the lxml parser fixes the issue since it works a bit differently than the native Python xml parser (ElementTree).
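For illustration, here is a minimal sketch of where keys in that shape come from. It is not riko's code, and the sample markup is a trimmed, made-up stand-in for the page above. Python's built-in ElementTree reports the tag of every element in a namespaced document as `{namespace-uri}localname`, while unprefixed attributes keep their plain names:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the fetched page (not the real markup).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="/post" title="Amok"><p>Amok</p><span class="date">16 Ekim 2016</span></a>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Elements in the default namespace must be looked up with the qualified name.
link = root.find('.//{http://www.w3.org/1999/xhtml}a')

# Child tags carry the namespace prefix...
child_tags = [child.tag for child in link]
print(child_tags)
# ['{http://www.w3.org/1999/xhtml}p', '{http://www.w3.org/1999/xhtml}span']

# ...while unprefixed attributes do not.
attribute_names = sorted(link.attrib)
print(attribute_names)
# ['href', 'title']
```

That matches the pattern in the output above: `href` and `title` are plain, while `p` and `span` carry the XHTML namespace.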

> I understand that it differentiates between the attributes of a tag and
> the nested elements but maybe there is a flag (that I was unable to find)
> to retrieve them as a list under a key like 'child' in top-level
> dictionary.

Could you please provide an example of the desired output?

> Thank you for your assistance.

aemreunal · 2016-11-05T16:06:33Z

Hello,

Thank you for your detailed response. Unfortunately, the deadline for the tutorial was a week ago and I have already submitted it. I believe it turned out pretty well and I hope it'll be useful to someone. The tutorial is available here. I'll still check out everything you linked to.

Thank you very much.

reubano · 2016-11-05T16:25:13Z

Well, glad you were able to get along without me. I really do need to start replying to my emails/gh issues in a more timely manner :). It's soooo cool to have proof that someone else besides me is using this library! Please let me know if there was any [other] part of riko you found confusing. I'm going through your notebook now, very impressive!! I'll submit an issue with typo corrections later. Also, feel free to submit a PR to the readme linking to your notebook.

aemreunal · 2016-11-05T16:43:28Z

Thank you very much, I'm glad that you liked it! I just sent a pull request... I think :D I might've messed it up, it's been some time since I opened a pull request. Please tell me if I messed up and I'll re-send it.

reubano · 2016-11-06T12:07:28Z

Did a bit more investigating, and this issue is compounded in certain cases (search for 'title': 'Amok'). You can see really weird keys here:

{
    '{http://www': {
        'w3': {
            'org/1999/xhtml}span': {'class': 'date', 'content': '6 Kasım 2016'},
            'org/1999/xhtml}p': 'Amok'}}}

This is due to the original key {http://www.w3.org/1999/xhtml}span being split at each dot (.), which in turn is an implementation detail of dotdict. dotdict is intended to make it easy to access (and set) values of nested dicts via dot notation. And since the namespace includes dots, you get this unintended consequence.
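The splitting can be reproduced with a small sketch. This is not dotdict's actual implementation; `set_dotted` is a hypothetical helper that only assumes the behavior described above, namely that a key is split on every dot and each part becomes one level of nesting:

```python
def set_dotted(target, dotted_key, value):
    """Store value in nested dicts, creating one level per dot-separated part."""
    *parents, leaf = dotted_key.split('.')
    node = target
    for part in parents:
        # Descend into (or create) the intermediate dict for this part.
        node = node.setdefault(part, {})
    node[leaf] = value


result = {}
set_dotted(result, '{http://www.w3.org/1999/xhtml}p', 'Amok')
set_dotted(
    result,
    '{http://www.w3.org/1999/xhtml}span',
    {'class': 'date', 'content': '6 Kasım 2016'},
)
print(result)
```

The namespace URI contains two dots, so each key is broken into three parts (`{http://www`, `w3`, and `org/1999/xhtml}p` or `org/1999/xhtml}span`), which gives the nesting shown above.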

So, the upshot of all this is that I need to figure out how to remove the namespace from being included in the result that xpath returns.
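In the meantime, one possible workaround is to strip the prefix from the keys in post-processing. The sketch below is only an illustration, not a fix inside riko: `strip_namespaces` is a hypothetical helper, and it assumes the keys still have the intact `{namespace-uri}name` form, so it will not repair keys that have already been split on dots.

```python
import re

# Matches a leading "{namespace-uri}" prefix on a key.
NAMESPACE_PREFIX = re.compile(r'^\{[^}]*\}')


def strip_namespaces(value):
    """Return a copy of value with namespace prefixes removed from dict keys."""
    if isinstance(value, dict):
        return {
            NAMESPACE_PREFIX.sub('', key): strip_namespaces(val)
            for key, val in value.items()
        }
    if isinstance(value, list):
        return [strip_namespaces(val) for val in value]
    return value


# Sample item shaped like the output earlier in this thread.
item = {
    'title': 'Amok',
    '{http://www.w3.org/1999/xhtml}p': 'Amok',
    '{http://www.w3.org/1999/xhtml}span': {'class': 'date', 'content': '6 Kasım 2016'},
}

cleaned = strip_namespaces(item)
print(sorted(cleaned))
# ['p', 'span', 'title']
```

One caveat: once the prefix is gone, a nested element and an attribute with the same name would collide, and one would silently overwrite the other.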

aemreunal closed this as completed Nov 5, 2016

reubano mentioned this issue Nov 6, 2016

notebook updates aemreunal/riko-tutorial#1

Open

reubano changed the title ~~Dict Keys Have Weird Prefix~~ XPath results contain namespace in the keys Nov 6, 2016

reubano reopened this Nov 6, 2016

reubano added the bug label Nov 6, 2016

reubano self-assigned this Nov 6, 2016
