OXPath Project

Latest release of OXPath Project: 1.0.4

Documentation (OXPath Project): 1.0.4 javadocs
Continuous Integration: Build Status
License: 3-Clause BSD License


OXPath is a web data extraction tool. The original version of OXPath 2.0 was provided by the Diadem Team.

Its predecessor, OXPath 1.0, can be found at https://github.com/diadem/OXPath.

The current version supports Linux and OSX platforms.

Meltwater uses OXPath to extract millions of documents from hundreds of thousands of sources daily.

Project Structure

OXPath Project consists of the following modules:

  • OXPath Core, implementing the core functionality of the OXPath language.
  • WebAPI, implementing an interface to web browsers (only Firefox 47.0.1 is currently supported).
  • Util, containing utility functionality required by the project.
  • Output Handlers are a set of modules for serialising the result tree of OXPath into different formats (e.g., XML, JSON, CSV, RDB).
  • OXPath CLI is a command line interface for OXPath.
  • Browser Installer installs a web browser required by OXPath.

Installation

The project requires Java 1.7 (or higher).

Linux

Linux users need to run Browser Installer, which installs a web browser into the .oxpath directory in their home directory.

OSX

Mac users need to install a web browser supported by OXPath (i.e., Firefox 47.0.1) and provide OXPath with a configuration file such as the following:

<?xml version="1.0" encoding="UTF-8" ?>
<diadem>
	<webapi>
		<platforms>
			<platform os-type="OSX">
				<home user-home-rel="true">.oxpath</home>
				<browser name="FIREFOX">
					<relpath>firefox_47.0.1</relpath>
					<run-file-path>/Applications/Firefox 47.0.1.app/Contents/MacOS/firefox</run-file-path>
					<display-size-file-relpath>display_size</display-size-file-relpath>
					<download-dir-relpath>download</download-dir-relpath>
				</browser>
			</platform>
		</platforms>
	</webapi>
</diadem>

Installation Into Your Local Repository

Installing OXPath requires Maven 3.

All OXPath Maven artifacts can be installed with either of the following commands: mvn install (with unit tests) or mvn install -Dmaven.test.skip=true (without unit tests). Both commands also create the binary file oxpath-cli.jar in the oxpath-cli/target directory.

Binaries

The command line interface for OXPath is implemented in the directory oxpath-cli. Building it produces the executable binary oxpath-cli.jar.

Running

Details of running the binary oxpath-cli.jar can be found in oxpath-cli/README.md.

Integration

OXPath can be integrated into other Maven projects with the following dependency declarations:

<dependency>
	<groupId>org.oxpath</groupId>
	<artifactId>oxpath-core</artifactId>
	<version>2.2.1</version>
</dependency>
<dependency>
	<groupId>org.oxpath</groupId>
	<artifactId>webapi</artifactId>
	<version>1.4.1</version>
</dependency>

To use an output handler, which converts the OXPath output tree into a target format, add the corresponding dependency declaration. All available output handlers can be found in the directory output-handlers.

An example for the OXPath XML Output Handler:

<dependency>
	<groupId>org.oxpath</groupId>
	<artifactId>oxpath-output-xml</artifactId>
	<version>1.0.1</version>
</dependency>
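
For orientation, the three declarations above might sit together in a project's pom.xml as sketched below. The enclosing dependencies element is standard Maven. The version numbers are simply copied from the snippets above and are not independently verified, so check them against the release you actually build or install.

```xml
<!-- Sketch only: place inside the <project> element of your pom.xml. -->
<dependencies>
	<!-- Core OXPath language implementation -->
	<dependency>
		<groupId>org.oxpath</groupId>
		<artifactId>oxpath-core</artifactId>
		<version>2.2.1</version>
	</dependency>
	<!-- Interface to the web browser -->
	<dependency>
		<groupId>org.oxpath</groupId>
		<artifactId>webapi</artifactId>
		<version>1.4.1</version>
	</dependency>
	<!-- XML output handler; swap for another handler from output-handlers if needed -->
	<dependency>
		<groupId>org.oxpath</groupId>
		<artifactId>oxpath-output-xml</artifactId>
		<version>1.0.1</version>
	</dependency>
</dependencies>
```

If the artifacts were installed into the local repository with mvn install as described earlier, Maven should resolve these dependencies from there.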

Documentation and References

OXPath Syntax Highlighting

Syntax highlighting for OXPath is available for the Atom editor as the language-oxpath package, implemented by Mandy Neumann.

People

Core Contributors

  • Andrew Sellers, the University of Oxford
  • Giovanni Grasso, the University of Oxford & Meltwater
  • Tim Furche, the University of Oxford & Meltwater
  • Ruslan Fayzrakhmanov, the University of Oxford & QuantumBlack (a McKinsey company). The main contact person for the open source version (ruslan.fayzrakhmanov AT cs.ox.ac.uk)
  • Giorgio Orsi, the University of Oxford & Meltwater
  • Christian Schallhart, the University of Oxford

A complete list of authors and contributors is in CONTRIBUTORS.md.

Project Leaders

License

Copyright (C) 2016-2019, OXPath Team.

This project is licensed under the 3-Clause BSD License. See the top-level files LICENSE.md and LICENSE-3RD-PARTY.md (for third-party software) for details.