Latest release of OXPath Project: 1.0.4
Documentation (OXPath Project): 1.0.4 javadocs
License: 3-Clause BSD License
OXPath is a web data extraction tool. The original version, OXPath 1.0, was provided by the Diadem Team and can be found at https://github.com/diadem/OXPath.
The current version supports Linux and OSX platforms.
Meltwater uses OXPath to extract millions of documents from hundreds of thousands of sources daily.
OXPath Project consists of the following modules:
- OXPath Core, implementing the core functionality of the OXPath language.
- WebAPI, implementing an interface to web browsers (only Firefox 47.0.1 is currently supported).
- Util contains functionality required for the project.
- Output Handlers are a set of modules for serialising the result tree of OXPath into different formats (e.g., XML, JSON, CSV, RDB).
- OXPath CLI is a command line interface for OXPath.
- Browser Installer installs a web browser required by OXPath.
The project requires Java 1.7 (or higher).
Linux users need to run Browser Installer, which installs the web browser into the .oxpath directory in their home directory.
Mac users need to install a web browser supported by OXPath (i.e., Firefox 47.0.1) and provide OXPath with a configuration file as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<diadem>
  <webapi>
    <platforms>
      <platform os-type="OSX">
        <home user-home-rel="true">.oxpath</home>
        <browser name="FIREFOX">
          <relpath>firefox_47.0.1</relpath>
          <run-file-path>/Applications/Firefox 47.0.1.app/Contents/MacOS/firefox</run-file-path>
          <display-size-file-relpath>display_size</display-size-file-relpath>
          <download-dir-relpath>download</download-dir-relpath>
        </browser>
      </platform>
    </platforms>
  </webapi>
</diadem>
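Because a wrong browser path is an easy mistake in this file, a quick sanity check before running OXPath can help. The following is a minimal sketch, not part of OXPath: the class name OxpathConfigCheck and its methods are hypothetical, it uses only the JDK's built-in XML and XPath APIs, and it assumes you pass the location of your configuration file as an argument, since that location is not specified here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical helper (not an OXPath class): reads values from the
// configuration shown above using only standard JDK APIs.
public class OxpathConfigCheck {

    // Evaluates an XPath expression against the configuration text and
    // returns its trimmed string value ("" if nothing matches).
    public static String read(String configXml, String xpathExpr) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(xpathExpr, doc).trim();
    }

    // Returns the Firefox executable path configured for the OSX platform.
    public static String firefoxRunFile(String configXml) throws Exception {
        return read(configXml,
                "/diadem/webapi/platforms/platform[@os-type='OSX']"
                + "/browser[@name='FIREFOX']/run-file-path");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java OxpathConfigCheck <path-to-config.xml>");
            return;
        }
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String runFile = firefoxRunFile(xml);
        if (runFile.isEmpty()) {
            System.out.println("No run-file-path found for OSX/FIREFOX");
        } else {
            // Reports whether the configured browser binary exists and is executable.
            System.out.println(runFile + " executable: " + Files.isExecutable(Paths.get(runFile)));
        }
    }
}
```

The check only confirms that the configured path points at an executable file; it does not verify the Firefox version.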
Installing OXPath requires Maven 3.
All OXPath Maven artifacts can be installed with either of the following commands: mvn install (with unit tests) or mvn install -Dmaven.test.skip=true (without unit tests).
These commands also create a binary file, oxpath-cli.jar, which you can find in the oxpath-cli/target directory.
The command line interface for OXPath is implemented in the oxpath-cli directory; building it produces the executable binary oxpath-cli.jar.
Details on running oxpath-cli.jar can be found in oxpath-cli/README.md.
OXPath can be integrated into other Maven projects with the following dependency declarations:
<dependency>
  <groupId>org.oxpath</groupId>
  <artifactId>oxpath-core</artifactId>
  <version>2.2.1</version>
</dependency>
<dependency>
  <groupId>org.oxpath</groupId>
  <artifactId>webapi</artifactId>
  <version>1.4.1</version>
</dependency>
To use an output handler, which converts the OXPath output tree into a target format, add the corresponding dependency declaration. All available output handlers can be found in the directory output-handlers.
An example for the OXPath XML Output Handler:
<dependency>
  <groupId>org.oxpath</groupId>
  <artifactId>oxpath-output-xml</artifactId>
  <version>1.0.1</version>
</dependency>
- The Javadoc API
- User manual: Fayzrakhmanov et al. "Introduction to OXPath" (2018)
- Paper: Furche et al. "OXPath: A language for scalable data extraction, automation, and crawling on the deep web" (2013)
OXPath syntax highlighting for the Atom editor is available through the language-oxpath package, implemented by Mandy Neumann.
- Andrew Sellers, the University of Oxford
- Giovanni Grasso, the University of Oxford & Meltwater
- Tim Furche, the University of Oxford & Meltwater
- Ruslan Fayzrakhmanov, the University of Oxford & QuantumBlack (a McKinsey company). The main contact person for the open source version (ruslan.fayzrakhmanov AT cs.ox.ac.uk)
- Giorgio Orsi, the University of Oxford & Meltwater
- Christian Schallhart, the University of Oxford
A complete list of authors and contributors is in CONTRIBUTORS.md.
- Georg Gottlob, the University of Oxford & TU Wien
- Tim Furche, the University of Oxford & Meltwater
Copyright (C) 2016-2019, OXPath Team.
This project is licensed under the 3-Clause BSD License. See the top-level files LICENSE.md and LICENSE-3RD-PARTY.md (for third-party software used) for details.