This repository has been archived by the owner on Sep 6, 2023. It is now read-only.

Tokenize Japanese text on BigQuery with Kuromoji in Apache Beam/Google Dataflow at scale

yu-iskw/kuromoji-for-bigquery

kuromoji-for-bigquery

kuromoji-for-bigquery tokenizes text in a BigQuery table with Kuromoji and Apache Beam, then stores the tokenized result in another BigQuery table.

It scales horizontally on a distributed system, since Apache Beam can run on Google Dataflow, Apache Spark, Apache Flink, and other runners.

Overview

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Version Info

  • Apache Beam: 2.42.0
  • Kuromoji: 0.7.7

How to Use

Command Line Options

Required Options

  • --project: Google Cloud Project
  • --inputDataset: Input BigQuery dataset ID
  • --inputTable: Input BigQuery table ID
  • --tokenizedColumn: Name of the column to tokenize in the input table
  • --outputDataset: Output BigQuery dataset ID
  • --outputTable: Output BigQuery table ID
  • --schema: BigQuery schema used to select columns from the input table. (Format: id:integer,name:string,value:float,ts:timestamp)
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.
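
The --schema format above is a comma-separated list of name:type pairs. As a rough illustration of how such a string breaks down into columns, here is a minimal sketch in plain Java. The class SchemaOptionSketch and the method parseSchema are hypothetical names chosen for this example and are not taken from the project's source; upper-casing the type names is likewise an assumption made for readability.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class SchemaOptionSketch {

    // Hypothetical helper (not from the project): splits "name:type,name:type"
    // into an ordered map of column name -> upper-cased type name.
    static Map<String, String> parseSchema(String schema) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String entry : schema.split(",")) {
            String[] parts = entry.trim().split(":");
            // Each entry must have exactly one non-empty name and one non-empty type.
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Expected name:type but got: " + entry);
            }
            columns.put(parts[0].trim(), parts[1].trim().toUpperCase(Locale.ROOT));
        }
        return columns;
    }

    public static void main(String[] args) {
        // The example value is the format string shown for --schema.
        Map<String, String> columns =
            parseSchema("id:integer,name:string,value:float,ts:timestamp");
        columns.forEach((name, type) -> System.out.println(name + " -> " + type));
    }
}
```

A LinkedHashMap is used so that the columns keep the order in which they were written on the command line.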

Optional Options

  • --outputColumn: Output column for tokenized result in output table. (Default: token)
  • --kuromojiMode: Kuromoji Mode. (NORMAL, SEARCH, or EXTENDED) (Default: NORMAL)
  • --createDisposition: Create Disposition option for BigQuery. (CREATE_NEVER or CREATE_IF_NEEDED)
  • --writeDisposition: Write Disposition option for BigQuery. (WRITE_TRUNCATE, WRITE_APPEND or WRITE_EMPTY)
  • --runner: Apache Beam runner.
    • If this option is not set, the pipeline runs on your local machine rather than on Google Dataflow.
    • e.g. DataflowRunner
  • --numWorkers: The number of workers when you run it on top of Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4
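
Several of the optional options accept only a fixed set of values, and some have a documented default. The sketch below shows one way a wrapper script might check those values before launching a job. Everything in it (the class OptionalOptionSketch, the method resolve, and the choice of exact, case-sensitive matching) is an assumption for illustration, not the tool's actual behavior. No default is documented for the two disposition options, so the sketch leaves them as null when unset.

```java
import java.util.Arrays;
import java.util.List;

public class OptionalOptionSketch {

    // Allowed values as listed in the option descriptions.
    static final List<String> KUROMOJI_MODES =
        Arrays.asList("NORMAL", "SEARCH", "EXTENDED");
    static final List<String> CREATE_DISPOSITIONS =
        Arrays.asList("CREATE_NEVER", "CREATE_IF_NEEDED");
    static final List<String> WRITE_DISPOSITIONS =
        Arrays.asList("WRITE_TRUNCATE", "WRITE_APPEND", "WRITE_EMPTY");

    // Hypothetical helper: returns the fallback when the option is unset,
    // otherwise the value itself after checking it against the allowed list.
    static String resolve(String option, String value, String fallback, List<String> allowed) {
        if (value == null) {
            return fallback;
        }
        if (!allowed.contains(value)) {
            throw new IllegalArgumentException(
                option + " must be one of " + allowed + " but was: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Unset mode falls back to the documented default, NORMAL.
        System.out.println(resolve("--kuromojiMode", null, "NORMAL", KUROMOJI_MODES));
        // An explicit, valid value is passed through unchanged.
        System.out.println(resolve("--writeDisposition", "WRITE_APPEND", null, WRITE_DISPOSITIONS));
    }
}
```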

Run the command

# compile
mvn clean package

# Run kuromoji-for-bigquery via the compiled JAR file
java -jar $(pwd)/target/kuromoji-for-bigquery-bundled-0.4.1.jar \
  --project=test-project-id \
  --schema=id:integer \
  --inputDataset=test_input_dataset \
  --inputTable=test_input_table \
  --outputDataset=test_output_dataset \
  --outputTable=test_output_table \
  --tokenizedColumn=text \
  --outputColumn=token \
  --kuromojiMode=NORMAL \
  --tempLocation=gs://test_yu/test-log/ \
  --gcpTempLocation=gs://test_yu/test-log/ \
  --maxNumWorkers=10 \
  --workerMachineType=n1-standard-2

Versions

| kuromoji-for-bigquery | Apache Beam | kuromoji |
|-----------------------|-------------|----------|
| 0.1.0                 | 2.1.0       | 0.7.7    |
| 0.2.x                 | 2.20.0      | 0.7.7    |
| 0.3.x                 | 2.34.0      | 0.7.7    |
| 0.4.x                 | 2.42.0      | 0.7.7    |

License

Copyright (c) 2017 Yu Ishikawa.