techStandards

Project Status: Active – The project has reached a stable, usable state and is being actively developed. Lifecycle: stable

Download and parse technical standard documents

Introduction

This repository contains functions to download standard documents from the ETSI website and parse standard documents. For related functions (e.g., accessing ITU-T standard documents), see here.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("lorenzbr/techStandards")

What does the standard document parser do?

Technical standards are often specified in very large documents of hundreds, sometimes thousands, of pages. Texts of this size are difficult for NLP and ML models to handle, so it helps to split a standard into smaller parts and apply your model of choice to those. To select specific chapters, sections or paragraphs, the parser identifies the table of contents of a standard document and locates the corresponding text using the section title and the page number given in the table of contents. The output is a set of csv files with the structured text data (the full text of each paragraph listed in the table of contents). Currently, the text is also aggregated at chapter level and stored in a separate txt file. The algorithm relies on regular expressions combined with exact and string-similarity matching. It works well for most standards, but for some documents parsing may fail or be less accurate. A log file with further details and messages is also written.
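To make the matching idea concrete, here is a minimal sketch in base R. It is not the package's implementation: the function names `parse_toc_lines` and `find_heading`, the assumed layout of a table-of-contents line (`<number> <title> ..... <page>`), and the distance threshold are all illustrative assumptions.

```r
# Hypothetical sketch, not the code used by techStandards.

# Parse table-of-contents lines assumed to look like
# "<section number> <title> ..... <page>"; other lines are dropped.
parse_toc_lines <- function(toc_lines) {
  pattern <- "^\\s*([0-9]+(?:\\.[0-9]+)*)\\s+(.+?)\\s*\\.{2,}\\s*([0-9]+)\\s*$"
  m <- regmatches(toc_lines, regexec(pattern, toc_lines, perl = TRUE))
  m <- m[lengths(m) == 4]  # keep only lines where all three groups matched
  data.frame(
    section = vapply(m, `[`, character(1), 2),
    title   = vapply(m, `[`, character(1), 3),
    page    = as.integer(vapply(m, `[`, character(1), 4)),
    stringsAsFactors = FALSE
  )
}

# Locate a heading among the text lines of the page named in the table of
# contents: exact match first, then the smallest normalized edit distance.
# Returns the line index, or NA if nothing is close enough.
find_heading <- function(heading, page_lines, max_dist = 0.2) {
  norm <- function(x) tolower(gsub("\\s+", " ", trimws(x)))
  target <- norm(heading)
  cand <- norm(page_lines)
  exact <- which(cand == target)
  if (length(exact) > 0) return(exact[1])
  d <- as.vector(utils::adist(target, cand)) / pmax(nchar(target), nchar(cand))
  best <- which.min(d)
  if (length(best) == 1 && d[best] <= max_dist) best else NA_integer_
}

# Made-up input for illustration only
toc <- parse_toc_lines(c("1 Scope ........ 5",
                         "4.1 Reference model ........ 12",
                         "not a toc line"))
page_12 <- c("Some running header",
             "4.1  Reference  Model",
             "The reference model consists of ...")
heading <- paste(toc$section[2], toc$title[2])
find_heading(heading, page_12)
```

Trying the exact match before the edit-distance fallback keeps clean documents fast and unambiguous, while the fallback tolerates small extraction errors in headings. The threshold of 0.2 is an arbitrary choice for this sketch.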

The following two pictures show excerpts of a standard document. As an example, the red boxes highlight the kind of information the standard document parser extracts. In practice, the whole document is parsed.

toc_example

fulltext_example

Examples

library(techStandards)

# Download ETSI standard documents
data("etsi_standards_meta")
download_etsi_standards(etsi_standards_meta, path = "")

# Get file names
files <- list.files(system.file("extdata/etsi_examples", package = "techStandards"), 
                    pattern = "pdf", full.names = TRUE)
file <- files[1]

# Set paths (relative; assumes the working directory is the package source root)
input.path <- "inst/extdata/etsi_examples"
output.path <- input.path

# Parse a single standard document
parse_standard_doc(file, output.path, sso = "ETSI", overwrite = TRUE)

# Parse all standard documents
parse_standard_docs(input.path, output.path, sso = "ETSI", overwrite = TRUE)
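After parsing, the csv output can be loaded back into R for inspection. The sketch below is an assumption-laden helper, not part of the package: `read_parsed_output` is a hypothetical name, and because the output file names and column layout are not documented here, it simply reads every comma-separated csv file found in the output directory.

```r
# Hypothetical helper, not part of techStandards: read every csv file in a
# directory into a named list of data frames (one element per file).
read_parsed_output <- function(path) {
  csv_files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  parsed <- lapply(csv_files, utils::read.csv, stringsAsFactors = FALSE)
  names(parsed) <- basename(csv_files)
  parsed
}

# Same relative output path as in the example above
parsed <- read_parsed_output("inst/extdata/etsi_examples")

# Show which files were read and the columns of each one
str(parsed, max.level = 1)
```

If the directory contains no csv files, the helper returns an empty list, so it is safe to call before any document has been parsed.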

Potential use cases

  • Standard essentiality/relevance assessments: fine-grained comparisons of patents with specific technical aspects of a standard
  • Track changes of standard documents over time: how does the text change relative to associated declared standard-essential patents?
  • Identify which sections of a technical standard have become void
  • Find technically similar implementations in other technical standards (e.g., from other standard-setting organizations)
  • Identify undisclosed standard-essential patents (e.g., patents filed through blanket declarations or potentially undeclared patents)

License

This R package is licensed under the MIT license.

See here for further information.
