html_table could expose .name_repair argument, to pass through to as_tibble() #340

Open
rhalbersma opened this issue Nov 6, 2021 · 1 comment
Labels
feature a feature request or enhancement table 🏓

@rhalbersma

Parsing HTML tables frequently produces non-unique column names, e.g. when the header spans multiple rows and a cell in the first row spans multiple columns.

It would be nice if html_table could expose .name_repair as an argument to pass through to as_tibble. As it stands, the current implementation uses a hard-coded .name_repair = "minimal" in its call to as_tibble. This currently requires users to add an extra as_tibble(.name_repair = "unique") in pipelines parsing more complicated HTML tables.
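For reference, the extra step described above might look like the following. This is only a minimal sketch: the inline table is made up for illustration, and it assumes a recent rvest (1.0 or later) and tibble.

```r
library(rvest)
library(tibble)

# Made-up table whose header row repeats the name "Height"
page <- minimal_html('
  <table>
    <tr><th>Building</th><th>Height</th><th>Height</th></tr>
    <tr><td>A</td><td>12.1</td><td>3.70</td></tr>
  </table>')

# html_table() keeps the duplicated names as they are
tbl <- html_table(html_element(page, "table"))
names(tbl)

# Workaround: repair the names in a second step
fixed <- as_tibble(tbl, .name_repair = "unique")
names(fixed)
```

The second `as_tibble()` call is the part that a pass-through `.name_repair` argument would make unnecessary.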

Such an extension would be in line with the recommendation from https://www.tidyverse.org/blog/2018/11/tibble-2.0.0-pre-announce/

Packages that are in the business of making tibbles may even want to expose the .name_repair argument and pass it through to tibble() or as_tibble(). For example, this is the approach planned for readxl, which reads rectangular data out of Excel workbooks.
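To make the request concrete, here is a rough sketch of what a pass-through could look like from the user's side. `html_table_repaired()` is a hypothetical wrapper, not an rvest function, and the table is made up; a real implementation would presumably forward the argument inside `html_table()` itself.

```r
library(rvest)
library(tibble)

# Hypothetical wrapper (not part of rvest): forwards .name_repair to as_tibble()
html_table_repaired <- function(x, ..., .name_repair = "unique") {
  as_tibble(html_table(x, ...), .name_repair = .name_repair)
}

# Made-up table with a duplicated header name
node <- html_element(
  minimal_html('
    <table>
      <tr><th>Building</th><th>Height</th><th>Height</th></tr>
      <tr><td>A</td><td>12.1</td><td>3.70</td></tr>
    </table>'),
  "table"
)

# Any repair strategy accepted by as_tibble() can be used, including a function
repaired <- html_table_repaired(node, .name_repair = make.unique)
names(repaired)
```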

@djvill

djvill commented Sep 11, 2022

I've run into this issue too, and I have a reprex based on a Wiki page:

library(rvest)
mich <- "https://en.wikipedia.org/w/index.php?title=List_of_tallest_buildings_by_county_in_Michigan&oldid=1089312494"
michTable <- read_html(mich) %>% 
  html_element(".wikitable")
html_elements(michTable, "th")
#> {xml_nodeset (11)}
#>  [1] <th rowspan="2">County</th>
#>  [2] <th rowspan="2">City</th>
#>  [3] <th rowspan="2" style="width: 22%;">Building</th>
#>  [4] <th rowspan="2">Image</th>
#>  [5] <th colspan="2">Height</th>
#>  [6] <th rowspan="2">Floors</th>
#>  [7] <th rowspan="2">Year<sup class="reference" id="ref_note02^"><a href="#en ...
#>  [8] <th rowspan="2">Primary purpose</th>
#>  [9] <th rowspan="2">Previous<br>names\n</th>
#> [10] <th>(ft)</th>
#> [11] <th>(m)\n</th>
michDF <- html_table(michTable)
head(michDF[,1:6])
#> # A tibble: 6 x 6
#>   County         City        Building                        Image Height Height
#>   <chr>          <chr>       <chr>                           <chr> <chr>  <chr> 
#> 1 County         City        Building                        "Ima~ "(ft)" "(m)" 
#> 2 Alcona County  Harrisville Alcona County Building[1]       ""    "12.1~ "3.70"
#> 3 Alger County   Munising    Alger County Courthouse[2]      ""    "24.2~ "7.40"
#> 4 Allegan County Allegan     Allegan County Building[3]      ""    "24.2~ "7.40"
#> 5 Alpena County  Alpena      Northland Area Federal Credit ~ ""    "48.5~ "14.8~
#> 6 Antrim County  Bellaire    Antrim County Courthouse[5]     ""    ""     ""
tryCatch(select(michDF, Height), 
         error = function(e) e)
#> <simpleError in select(michDF, Height): could not find function "select">

Created on 2022-09-11 by the reprex package (v2.0.1)

As @rhalbersma indicated, this is triggered by a header in which several th elements have rowspan > 1 and one has colspan > 1. Ideally, the name repair would default to treating the second-row th elements as suffixes to "Height", giving us the unique column names "Height_(ft)" and "Height_(m)".
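Until something like that exists, the suffixing can be approximated by hand. The sketch below uses a made-up three-column table rather than the Wikipedia page, and relies on the behaviour visible in the reprex: the second header row ends up as the first data row. The `_` separator and the rule "append the second-row label only where it differs from the first-row name" are my own assumptions, not anything rvest does.

```r
library(rvest)

# Made-up table with a two-row header: "Height" spans the (ft) and (m) columns
page <- minimal_html('
  <table>
    <tr><th rowspan="2">Building</th><th colspan="2">Height</th></tr>
    <tr><th>(ft)</th><th>(m)</th></tr>
    <tr><td>A</td><td>12.1</td><td>3.70</td></tr>
  </table>')

tbl <- html_table(html_element(page, "table"))

# The second header row is parsed as the first data row
first  <- names(tbl)
second <- vapply(tbl[1, ], as.character, character(1), USE.NAMES = FALSE)

# Append the second-row label only where it differs from the first-row name
new_names <- ifelse(second == first, first, paste(first, second, sep = "_"))

# Drop the header row that leaked into the data and apply the combined names
tbl <- tbl[-1, ]
names(tbl) <- new_names
names(tbl)
```

The columns stay character after this, because the leaked header row prevented type conversion; `utils::type.convert(tbl, as.is = TRUE)` is one way to convert them afterwards.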

@hadley added the "feature" (a feature request or enhancement) and "table 🏓" labels on Nov 15, 2022