techStandards

Project Status: Active – The project has reached a stable, usable state and is being actively developed. Lifecycle: stable

Download and parse technical standard documents

Introduction

This repository contains functions to download standard documents from the ETSI website and to parse them. For related functions (e.g., accessing ITU-T standard documents), see here.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("lorenzbr/techStandards")

What does the standard document parser do?

Technical standards are often described in very large documents of hundreds and sometimes thousands of pages. Texts of this size pose challenges for NLP and ML models, so it helps to split a standard into smaller parts and apply your model of choice to those. To select specific chapters, sections or paragraphs of a technical standard, the parser identifies the document's table of contents and then searches for the corresponding text using each section's title and page number as given in the table of contents. The output is a set of CSV files with the structured text data (the full text of each paragraph as outlined in the table of contents). Currently, the text data is also aggregated at chapter level and stored in a separate txt file. The algorithm is based on regular expressions and on exact as well as string-similarity matches. It works well for most standards, but for some the parsing may fail or be less accurate. A log file with further details and messages is also written.
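The idea described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper names `parse_toc()`, `find_heading()` and `extract_sections()`, the `max_dist` threshold, and the sample `pages` vector are all hypothetical, and `utils::adist()` (edit distance) stands in for whatever string-similarity measure the parser actually uses. The sketch assumes one character string per page and that the table of contents lists sections in document order.

```r
# Hypothetical input: one string per page (page 1 is the table of contents).
# The heading on page 4 contains a deliberate typo to exercise fuzzy matching.
pages <- c(
  "Contents\n1 Scope ........ 3\n2 References ........ 3\n2.1 Normative references ........ 4",
  "Foreword\nThis document was produced by a hypothetical committee.",
  "1 Scope\nThis document specifies an example.\n2 References\nThe following documents are referenced.",
  "2.1 Normative refrences\nNot applicable."
)

# Step 1: read "number title ..... page" entries from the table of contents.
parse_toc <- function(toc_page) {
  lines <- strsplit(toc_page, "\n", fixed = TRUE)[[1]]
  pattern <- "^([0-9]+(\\.[0-9]+)*)\\s+(.*?)\\s*\\.{2,}\\s*([0-9]+)$"
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  hits <- hits[lengths(hits) > 0]  # drop lines that are not TOC entries
  data.frame(
    number = vapply(hits, `[`, character(1), 2),
    title  = vapply(hits, `[`, character(1), 4),
    page   = as.integer(vapply(hits, `[`, character(1), 5)),
    stringsAsFactors = FALSE
  )
}

# Step 2: find the heading line on its page, exact match first,
# then the closest line by normalized edit distance.
find_heading <- function(number, title, page_text, max_dist = 0.2) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  target <- paste(number, title)
  exact <- which(lines == target)
  if (length(exact) > 0) return(exact[1])
  d <- as.vector(utils::adist(target, lines, ignore.case = TRUE)) / nchar(target)
  best <- which.min(d)
  if (d[best] <= max_dist) best else NA_integer_
}

# Step 3: cut the text between consecutive headings and label chapters.
extract_sections <- function(pages, toc_page_index = 1) {
  toc <- parse_toc(pages[toc_page_index])
  page_lines <- strsplit(pages, "\n", fixed = TRUE)
  offsets <- cumsum(c(0, lengths(page_lines)))  # lines before each page
  all_lines <- unlist(page_lines)
  start <- mapply(function(n, t, p) offsets[p] + find_heading(n, t, pages[p]),
                  toc$number, toc$title, toc$page, USE.NAMES = FALSE)
  found <- !is.na(start)  # skip headings that could not be located
  toc <- toc[found, ]
  start <- start[found]
  end <- c(start[-1] - 1, length(all_lines))
  toc$text <- mapply(function(s, e) {
    if (e > s) paste(all_lines[(s + 1):e], collapse = "\n") else ""
  }, start, end, USE.NAMES = FALSE)
  toc$chapter <- sub("\\..*$", "", toc$number)  # "2.1" belongs to chapter "2"
  toc
}

sections <- extract_sections(pages)

# Aggregate the section texts at chapter level.
chapters <- vapply(split(sections$text, sections$chapter),
                   paste, character(1), collapse = "\n")
```

Here `sections` holds one row per table-of-contents entry and `chapters` one string per chapter; either could be written to disk with `write.csv()` or `writeLines()`. Real documents need far more robust handling (multi-page tables of contents, page headers, headings split across lines), which is why a sketch like this can only hint at the approach.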

The following two pictures show an excerpt of a standard document. As an example, the red boxes highlight the kind of information the standard document parser extracts. In practice, all the information in a document is parsed.

[Figure: toc_example]

[Figure: fulltext_example]

Examples

library(techStandards)

# Download ETSI standard documents
data("etsi_standards_meta")
download_etsi_standards(etsi_standards_meta, path = "")

# Get file names
files <- list.files(system.file("extdata/etsi_examples", package = "techStandards"),
                    pattern = "\\.pdf$", full.names = TRUE)
file <- files[1]

# Set paths
input.path <- "inst/extdata/etsi_examples"
output.path <- input.path

# Parse a single standard document
parse_standard_doc(file, output.path, sso = "ETSI", overwrite = TRUE)

# Parse all standard documents
parse_standard_docs(input.path, output.path, sso = "ETSI", overwrite = TRUE)

Potential use cases

  • Standard essentiality/relevance assessments: fine-grained comparisons of patents with specific technical aspects of a standard
  • Track changes of standard documents over time: how does the text change relative to associated declared standard-essential patents?
  • Identify which sections of a technical standard have become void
  • Find technically similar implementations in other technical standards (e.g., from other standard-setting organizations)
  • Identify undisclosed standard-essential patents (e.g., patents filed through blanket declarations or potentially undeclared patents)

License

This R package is licensed under the MIT license.

See here for further information.