
OCaMOSS User's Guide

Plagiarism detection software, inspired by MOSS and implemented in OCaml. Runs using a command-line interface.

This is NOT an OCaml client for MOSS; it is a completely separate program. For more details about the system, or about how MOSS works in general, read the PDF report in this repository or this blog post I wrote.

Note: this was originally written as the final project for a course - it has since been updated by me, so some aspects of the PDF report may not be accurate. In particular, the latest version of OCaMOSS no longer uses a 2-3 tree.
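
As background for the results below: MOSS-style fingerprinting hashes every k-gram of a normalized document, then winnows the hashes by keeping the minimum in each sliding window. Here is a minimal sketch of that general technique (simplified for illustration; the names and parameters are hypothetical, and this is not OCaMOSS's exact implementation):

    (* Hash every k-gram (substring of length k) of a document. *)
    let kgram_hashes (k : int) (s : string) : int list =
      let n = String.length s - k + 1 in
      if n <= 0 then []
      else List.init n (fun i -> Hashtbl.hash (String.sub s i k))

    (* Winnow: keep the minimum hash of every window of w consecutive
       k-gram hashes, deduplicated; the result is the fingerprint. *)
    let winnow (w : int) (hashes : int list) : int list =
      let arr = Array.of_list hashes in
      let selected = ref [] in
      for i = 0 to Array.length arr - w do
        let m = Array.fold_left min max_int (Array.sub arr i w) in
        if not (List.mem m !selected) then selected := m :: !selected
      done;
      List.rev !selected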


  • To build - make
  • To build & run the REPL - make run
  • To build & run unit tests - make test

Required Dependencies:

  • Yojson
  • ANSITerminal
  • OUnit (for unit tests)

Commands:

(note - commands are case-sensitive)

  • run [threshold] - runs OCaMOSS on the working directory. The optional threshold argument sets the fraction of a file that must match another file for it to be flagged as plagiarised; it must be at least 0.4 and at most 1
  • dir - lists the working directory and the files that it contains
  • setdir [dir] - sets the relative directory to look for files and resets any results
  • results - lists the file names for which there are results
  • results [filename] - lists the detailed overlap results for that file (make sure to include the file extension)
  • resultpairs - lists all the pairs of files for which there are positive results
  • compare [fileA] [fileB] - prints out the specific overlaps between fileA and fileB (make sure to include the file extensions)
  • quit - exits the REPL
  • help - displays the available commands

Usage instructions/tutorial:

  1. setdir to the folder you want to test. Requirements: file names must contain no spaces, and all files must have the same extension (example: setdir tests/test1)

  2. run with the desired parameters (example: run 0.5 is the same as run, since 0.5 is the default threshold)

  3. results to view the list of results, results [filename] to view the results for a specific file, and compare [fileA] [fileB] to compare matching patterns between two files (example: results Camel.txt)

    Example for running test case 1 and inspecting the results:

    1. setdir tests/test1
    2. run
    3. results, results intset.ml, compare intset1.ml intset.ml, or resultpairs

Other information:

Similarity score:

  • used as a measure of how likely it is that file A plagiarized from file B
  • computed as the ratio of the number of matching hashes between A and B to the number of hashes in A's fingerprint (see the sketch after this list)
  • the overall similarity score for A is the average of all of A's pairwise similarity scores that are > 0.5
  • the threshold score for detecting possible plagiarism varies with the file type, but experimentally we determined it to be around 0.5
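
A minimal OCaml sketch of this computation, assuming fingerprints are represented as sets of integer hashes (the names and representation here are illustrative, not OCaMOSS's actual API):

    module IntSet = Set.Make (Int)

    (* similarity a b = (# hashes shared by A and B) / (# hashes in A's fingerprint).
       Note that the measure is asymmetric: in general, similarity a b <> similarity b a. *)
    let similarity (a : IntSet.t) (b : IntSet.t) : float =
      float_of_int (IntSet.cardinal (IntSet.inter a b))
      /. float_of_int (IntSet.cardinal a)

    (* Overall score for A: the average of A's pairwise scores that exceed 0.5. *)
    let overall_score (pairwise : float list) : float =
      match List.filter (fun s -> s > 0.5) pairwise with
      | [] -> 0.0
      | hits -> List.fold_left ( +. ) 0.0 hits /. float_of_int (List.length hits)

The asymmetry is why a file copied from a source and then heavily trimmed (test case 4 below) can flag against the source while the source does not flag against it.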

Supported languages/file formats:

  • OCaml - .ml
  • Java - .java
  • C - .c
  • Python - .py
  • English - .txt (note: English comparison does NOT account for semantics)

Self-generated test case descriptions (test case N is in directory tests/testN):

NOTE: to replicate results, run using threshold = 0.4

  1. exact duplicates - should return positive result
  2. variable names changed - should return positive result
  3. functions/comments reordered - should return positive result
  4. functions1.ml is a copy of functions.ml but with large sections deleted - should return positive result for functions1 but not functions
  5. different implementations of the same algorithm - should NOT return positive result
  6. completely different files - should NOT return positive result
  7. functions/comments reordered - should return positive result
  8. more than 2 files - files changed respectively as follows: function/variable names changed; random spaces/new lines added; rec declarations/match statement lines changed - should return positive result for all files except for lab034.ml, which is a dummy
  9. more than 2 files - files changed respectively as follows: same comments but different code; comments deleted and same code with variable/function names changed - should NOT return positive result for first and should return positive result for second
  10. large group of all different files - should NOT return positive result
  11. txt files check - files are changed respectively as follows: an exact Wikipedia article; an edited but very similar Wikipedia article; the original with sentences shifted around; an exact copy; a file that says “camel” five times; a hazier edit of the original - should return positive result for all except the last two: “Camels.txt” and “CamelMaybeCopy.txt”
  12. Java check - test for a Java file, where one file has all comments removed
  13. C check - test for a C file, where one file has all comments removed
  14. Python check - test for a Python file, where one file has comments removed and variable names changed
