Authorship-Identification-with-NLP

This repository hosts ongoing research in authorship attribution; see the Google Slides for more details.

Learning Multi-Domain and Cross-Domain Authorship Representation

Motivation

Millions of terabytes of content are produced online by anonymous users every day
- This content inherently carries information that sketches a portrait of its author
- Malicious agents may use this information to reveal the authors' identities, which could place them in danger
- They may also generate fake content and attribute it to someone else, thereby manipulating public opinion

We are motivated to develop a system to recognize authorship and protect authors' privacy
- We limit the content type to text only
- This research area is called Authorship Attribution (or Authorship Identification)

Research Question

Given two pieces of text, are they written by the same person?
- We want to determine whether two documents come from the same author in a large-scale setting with hundreds of thousands of authors
- We want to know whether author-level representations learned in one domain transfer to another domain, and whether multi-domain representations can be learned with carefully designed contrastive training objectives
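
The research question above can be read as a verification task on author embeddings. The following is a minimal sketch of that reading, not the repository's implementation: it assumes each document has already been mapped to a fixed-length embedding, and the names `cosine_similarity`, `same_author`, the threshold of 0.5, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def same_author(emb_a, emb_b, threshold=0.5):
    """Hypothetical decision rule: predict 'same author' when the
    similarity of the two embeddings reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings (made up for illustration, not model outputs).
doc_1 = [0.9, 0.1, 0.3]
doc_2 = [0.8, 0.2, 0.4]
doc_3 = [-0.5, 0.9, -0.1]

print(same_author(doc_1, doc_2))
print(same_author(doc_1, doc_3))
```

In practice the threshold would be tuned on held-out pairs, and with hundreds of thousands of authors the comparison would typically run over precomputed embeddings rather than raw text.
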

HIATUS (Human Interpretable Attribution of Text using Underlying Structure)
- Our work serves as one of the performers in the HIATUS project
- Teams across universities and companies compete to generate higher-fidelity representations of individual authors' unique linguistic fingerprints

Abstract

Authorship identification and attribution aim to identify the author of a given text from a set of known authors. Previous approaches tried to learn author-level embeddings via contrastive learning that transfer across the multiple domains an author has written in, but they failed to give satisfactory results. We first scale the contrastive learning batch size beyond the GPU memory constraint, using more negatives in each training batch and a larger pre-trained model backbone. We then propose a data sampling and augmentation technique that greatly improves on previous state-of-the-art results on multiple large-scale datasets: it incorporates hard positive and negative examples during in-batch sampling, and further augments this data by fine-tuning a generation model that produces the missing hard text corpora. We find that this method enables the model to focus its attention less on topic-related tokens and more on the combination of punctuation and semantic properties, which is where its main performance improvement comes from.
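
To make the role of in-batch negatives concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss in plain Python. It is an illustration under stated assumptions, not the paper's training objective: the function name `info_nce_loss`, the temperature of 0.1, and the toy vectors are hypothetical, and the inputs are assumed to be L2-normalised embeddings where `anchors[i]` and `positives[i]` come from the same author.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic in-batch contrastive loss.

    anchors[i] is paired with positives[i]; every positives[j] with
    j != i acts as a negative for anchors[i]. A larger batch therefore
    supplies more negatives per anchor.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        # Scaled dot products between anchor i and every candidate.
        logits = [
            sum(a * b for a, b in zip(anchors[i], positives[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching positive as the target.
        total += log_denom - logits[i]
    return total / n


# Toy batch of two authors with unit-length embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # positives match their anchors
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives assigned to the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each anchor is closest to its own positive and high otherwise. Hard positives and hard negatives, as described in the abstract, would change which examples are placed together in a batch rather than the form of this loss.
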

Full Paper Link.

(Cross-Repo) Graph-based Authorship Identification and Portrait Sketching

Abstract

Our paper proposes a new method for authorship identification that incorporates graph structures and contrastive learning techniques. Authorship identification (AID) is the process of identifying the author of a given text using the structure of the text and the author's writing style. It is usually treated as a text classification problem, which shows limitations on real-world datasets with too many authors to classify. To overcome this issue, we use a method similar to contrastive learning, where the positive and negative pairs are an article paired with its correct and incorrect author features, respectively. We further improve the model's performance with graph machine learning, which can capture the inherent structure and relationships of authors and articles. In the end, we increase the model's AUC from 73% to 79% on a sampled subset of the DBLP citation network.
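
For readers unfamiliar with the AUC metric reported above, the following is a small sketch of how it can be computed from article-author pair scores. It is illustrative only: the function name `roc_auc`, the labels, and the scores are made up, and this is not the evaluation code used for the reported numbers.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive pair
    (article with its correct author) scores higher than a randomly
    chosen negative pair, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical similarity scores for six article-author pairs.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The quadratic loop is fine for a small illustration; a rank-based formulation would be preferable on large evaluation sets.
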

Full Paper Link.

yang-su2000/Authorship-Identification-with-NLP
