Releases: unreadablewxy/fs-curator

0.4.0

30 Jan 10:08
ec68bdc

It seems a recent libmagic regression (detected on Gentoo and Arch) is causing webm files to be incorrectly identified. If you have them in your mono-collection, it might be a good time to ask for a patrolling read against your by-id index.

Have received some complaints that the *nix binaries are built against WAY too new a glibc, so they will now be built on the latest release of Debian instead of bleeding edge Gentoo.

Breaking Changes

  • Risk: moderate. Deprecated source_* parameters have been dropped
    • This affects qualifier expressions of all stages of the pipeline
    • This also affects transform argument generation
  • Risk: moderate. Store qualifiers and path generation no longer bind file_* attributes (except for file_extension)
    • Offering files to stores is a self-contained process. Hoppers can be configured to auto-invoke this process after certain files are ingested, but should not change said process. Conveying extra information when auto-invoked by hoppers runs contrary to this design
    • If we need per-file attributes, let's design it properly as opposed to hacking pieces of it onto two colocated features

New Features

  • Added inline named capture groups support for regex
    • Realized through the PCRE2 library
    • Yes, these are still applied at a lower precedence than named constants
    • Yes, this means we now support match-specific group attributes
  • Regex qualifiers now support minimum match length thresholds
    • The new value for the include config directive is PROPERTY /EXPRESSION/FLAGS THRESHOLD
    • eg: require the expression match at least 50% of the value: include = x /\d+/ 50%
    • eg: require the expression match at least 12 characters: include = x /\d+/ 12
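
As a rough illustration of the threshold semantics described above, the following Python sketch parses a directive value and checks a match against its threshold. It is not the project's C++/PCRE2 implementation: the function names are hypothetical, Python's `re` stands in for PCRE2, flags are parsed but ignored, and it assumes the threshold is compared against the length of the first match.

```python
import re

# Hypothetical parser for: PROPERTY /EXPRESSION/FLAGS THRESHOLD
DIRECTIVE = re.compile(r"^(\w+)\s+/(.*)/(\w*)(?:\s+(\d+)(%?))?$")

def parse_include(value):
    """Split an include value into (property, pattern, flags, threshold)."""
    m = DIRECTIVE.match(value.strip())
    if m is None:
        raise ValueError(f"malformed include value: {value!r}")
    prop, expr, flags, amount, percent = m.groups()
    # threshold is None, (N, False) for N characters, or (N, True) for N percent
    threshold = None if amount is None else (int(amount), percent == "%")
    return prop, re.compile(expr), flags, threshold

def qualifies(pattern, threshold, text):
    """ASSUMPTION: the threshold applies to the length of the first match."""
    m = pattern.search(text)
    if m is None:
        return False
    if threshold is None:
        return True
    amount, is_percent = threshold
    matched = m.end() - m.start()
    if is_percent:
        # integer arithmetic avoids floating point rounding at the boundary
        return matched * 100 >= amount * len(text)
    return matched >= amount
```

With `include = x /\d+/ 50%`, a value such as `abc12345` would qualify under this sketch (5 of 8 characters matched), while `abcdefg1` would not.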

Behavior Changes

  • Workflows resumed through WIP files now bypass hopper evaluation
    • WIP files now contain group attributes as well as workflow parameters, allowing manual touch ups
  • Store qualifiers and path generation now bind file_extension from the file identification process instead of copying it verbatim from the imported file's path
  • Order assignment now sorts all files by length then character codes
    • This ensures semantically correct order for variable length numbers in file names: 0, 1, 2, 3, 10, 11 (without length factoring the order would be 0, 1, 10, 11, 2, 3)
    • Another happy coincidence is this tends to cluster together similarly named files
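
A minimal Python sketch of a length-then-character-code ordering is shown here for illustration. The helper name is hypothetical, and whether the project compares whole file names, stems, bytes or code points is not stated, so treat those details as assumptions.

```python
def order_key(name):
    # Shorter names sort first; names of equal length are compared
    # character by character (Python compares strings by code point).
    return (len(name), name)

names = ["10", "2", "0", "11", "1", "3"]

plain = sorted(names)                      # character codes only
by_length = sorted(names, key=order_key)   # length first, then character codes

print(plain)      # ['0', '1', '10', '11', '2', '3']
print(by_length)  # ['0', '1', '2', '3', '10', '11']
```

The same key also orders names that share a prefix and extension, for example `page1.png`, `page2.png`, `page10.png`.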

Performance

  • Removed extraneous memory allocations from INI parsing
  • Removed unnecessary memory allocations for attribute matching at the cost of a bit of short lived heap fragmentation
  • Time complexity of matching files has been improved from O(m log n) to O(m + n)

Bug Fixes

  • Reduced FFMPEG warning spam when dealing with JPEG files
    • A side effect of this change is that phash has started producing slightly different results
    • So do not be alarmed if you see a lot of phash corrections while patrolling by-id

0.3.0

15 Feb 00:43
ec68bdc

Project now exceeds 8K lines of C++20 🎉

Breaking Changes

  • Risk: minimal. PHash querying command is incompatible with previous versions and will randomly fail if used with them
  • Risk: minimal. Thumbnail storage in the mono-collection has been redesigned and moved to cache/thumbnail. Existing deployments should delete and regenerate their thumbnails directory to reclaim otherwise wasted space
  • Risk: minimal. Hopper constants are now applied at a lower priority than named capture groups (NCGs), allowing them to serve as fallback default values

New Features

  • Added JPEG thumbnails support with configurable quality

Performance

  • Added PHash index caching, stored in cache/phash, to reduce cold start delays for those with 100K+ collections
    • This cache is invalidated based on directory modification times and will be ineffective for those who have disabled them in their filesystems (if you need to ask, you haven't)
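
The invalidation rule above can be pictured with a short Python sketch. It is only an illustration, not the project's C++ code: the function name and the idea of recording one timestamp per directory alongside the cache are assumptions.

```python
import os

def cache_is_fresh(directory, recorded_mtime_ns):
    """Return True if the directory's mtime still equals the value
    recorded when the cache was written (hypothetical helper)."""
    return os.stat(directory).st_mtime_ns == recorded_mtime_ns
```

On POSIX filesystems a directory's modification time changes when entries are added, removed or renamed, which is why a comparison like this can detect a changed collection without reading any files.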

Behavior Changes

  • Thumbnails are henceforth regarded as ephemeral data and will be overwritten automatically when offered to stores
    • This is done via delete-then-link, so there's no risk of corrupting the mono-collection. But this could still clobber files that are not linked by this service, so please make sure your workflow is not affected before upgrading
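
For readers unfamiliar with the delete-then-link pattern, here is a small Python sketch of the idea. The function name is hypothetical and this is not the service's actual code.

```python
import os

def replace_link(source, destination):
    """Make destination a hard link to source using delete-then-link."""
    try:
        # Removes only this directory entry. Data that is still reachable
        # through another hard link is left untouched.
        os.unlink(destination)
    except FileNotFoundError:
        pass
    os.link(source, destination)
```

Because the old name is removed rather than written through, a file that has other hard links keeps its contents. A file whose only name was `destination`, however, is gone after the unlink, which is the clobbering risk noted above.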
  • File importing will now disregard singular 0 byte files, an artifact sometimes produced by browsers
  • Successfully imported directories will now be auto deleted
  • Hoppers no longer need to specify a store, making it possible to create import-only hoppers

Bug Fixes

  • Fixed a bug where, if a file is projected into two directories with differing thumbnail requirements, only one of the requirements would be honored
  • Fixed an error contextualization bug that caused a lot of errors to be mistranslated as "unknown"
  • Fixed an IO bug that caused the thumbnailer to fail on some GIF files
  • Fixed cli side segfaults from not enforcing argument count requirements
  • Added workaround for ffmpeg bug #8747

Dependencies

For Linux users, please install: libmagic1.5+, libffmpeg4.3+ (LGPL), libopencv4.5+

0.2.0

25 May 11:58
350e9a0

New Features

  • Windows support 🎉
  • Management socket location is now configurable via the FS_MGMT_SOCKET environment variable as well as the config file
    • On Windows defaults to %APPDATA%\fs-curator\socket
    • On *nix defaults to /run/fs-curator/socket
    • Service mode will try to create the parent path of the management socket
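
The lookup described above might be pictured as follows. This Python sketch is illustrative only: the function name is hypothetical, and the precedence of the environment variable over the config file is an assumption, since the notes do not say which one wins.

```python
import ntpath
import os
import sys

def resolve_socket_path(config_value=None, environ=os.environ, platform=sys.platform):
    """Pick the management socket path (hypothetical helper)."""
    # ASSUMPTION: FS_MGMT_SOCKET takes precedence over the config file value.
    path = environ.get("FS_MGMT_SOCKET") or config_value
    if path:
        return path
    if platform.startswith("win"):
        # Default on Windows: %APPDATA%\fs-curator\socket
        return ntpath.join(environ["APPDATA"], "fs-curator", "socket")
    # Default on *nix
    return "/run/fs-curator/socket"
```

For example, `resolve_socket_path(environ={}, platform="linux")` returns `/run/fs-curator/socket`.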

Performance

  • Added codec caching in the thumbnailer
  • Switched to IO buffers sized as a multiple of both modern disk sectors & typical OS memory pages for better memory & IO efficiency
  • Removed some unnecessary memory allocations when reading attributes
  • PHash queries now retrieve top 3 instead of 5 most similar images unless otherwise specified
    • This is the performance sweet spot for 10K+ collections
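
To illustrate the buffer sizing idea mentioned in this list, the Python sketch below rounds a requested size up to a common multiple of the sector and page sizes. The function name, the 4096 byte defaults and the 64 KiB minimum are assumptions chosen for the example, not values taken from the project.

```python
import math

def io_buffer_size(sector_size=4096, page_size=4096, minimum=64 * 1024):
    """Smallest size >= minimum that is a multiple of both the disk
    sector size and the memory page size (hypothetical helper)."""
    unit = sector_size * page_size // math.gcd(sector_size, page_size)  # least common multiple
    return -(-minimum // unit) * unit  # ceiling division, then scale back up

print(io_buffer_size())                    # 65536
print(io_buffer_size(512, 4096, 1000))     # 4096
print(io_buffer_size(4096, 16384, 20000))  # 32768
```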

Behavior Changes

  • Temporary directories generated by transforms will now be destroyed
  • Empty directories, even those matching hopper qualifiers, will now be disregarded to avoid infinite loops

Bug Fixes

  • Fixed a rare crash that occurs when merging more than 2 groups
  • Fixed a crash that occurs when the thumbnailer fails to open a file
  • Program will no longer start if config file doesn't exist
  • Files that fail to be identified will now be assigned the ".bin" extension instead of causing crashes

Dependencies

For Linux users, please install: libmagic1.5+, libffmpeg4.1+ (LGPL), libopencv3.2+

0.1.2

08 May 08:34
a793e9d
Pre-release

New Features

  • WIP files will now indicate the group & index of the most similar file for PHash conflicts
  • Added file_name, file_stem, and file_extension as testable properties
  • Added hopper defined constants
    • All source_* formatting fields & testable properties are now deprecated. See the relevant wiki article for rationale
  • Added configuration for logging verbosity

Behavior Changes

  • Group merging will now be done by a link-then-drop instead of one rename operation to keep rollback robust and simple
  • Log indicating how many files are being ingested now correctly counts files that are being dropped, as they are technically "ingested" (into /dev/null)

Bug Fixes

  • In perceptual hashing
    • Size limit (32MiB) is now applied consistently and clear errors are added for when exceeded.
    • Collisions resolved by the combine action will no longer drop the new file
  • Data integrity issues encountered whilst scanning groups designated for merging will now correctly trigger rollbacks
  • Fixed a rare heap corruption that occurs when generating thumbnails for multiple formats
  • For ingested files, file_* properties at the hopper level will now correctly bind to their path instead of their parent directory

0.1.1

29 Apr 11:40
149d0ec
Pre-release

New Features

  • Added perceptual hash based image similarity deduplication
  • Added ignore conflict resolution action. Valid only for phash conflicts
  • Added a command to query for perceptual hash similarity
  • Added "crop to aspect" thumbnailing
  • Added regenerate thumbnail command

Upgrade Advisory

After first run of this release please run the following:

curator --patrol by-id to ensure any existing files are properly assigned their perceptual hashes.

0.1.0

29 Mar 12:56
987706c
Pre-release

Breaking Changes

Risk: Low, chance of losing ordering and grouping meta-data if migration fails.

  • The by-order index now uses directories to represent groups.
    • Existing collections should auto-migrate on first run.
    • Once migrated, do not run older versions. Possible crash risk.
  • Checksum collisions will no longer be reported via renaming

Upgrade Advisory

After first run of this release please run the following:

  • curator --patrol by-id to ensure newly required attributes are assigned to all files.
  • curator --patrol by-order to ensure your collection doesn't have any 0 indexed groups or files.

Security Advisory

  • In this release, the service will begin accepting commands over Unix domain socket connections. Processes residing on the same system may be able to issue commands to the daemon, so please ensure the permissions configured for the daemon's socket at /run/fs-curator/socket align with your security goals.
  • If incorrectly configured, transforms may become an attack vector for malicious insiders to perform elevation & arbitrary code execution attacks.
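
One way to audit the socket permissions mentioned above is to inspect the mode bits of the socket path. The Python sketch below is only an illustration: the helper name is hypothetical, and it checks a single condition, namely whether users outside the owner and group hold write permission, which on Linux is what connecting to a Unix domain socket requires.

```python
import os
import stat

def other_users_can_write(path):
    """Return True if the 'other' write bit is set on path (hypothetical helper)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & stat.S_IWOTH)
```

Permissions on the parent directory also gate access to the socket, so a fuller audit would check those as well.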

New Features

  • Added {file_extension} as a valid store path format field.
  • Added work-in-progress file based collision reporting
    • Append .continue to WIP file's name to continue with import
  • Added conflict resolving actions: combine and drop
  • Added Unix domain sockets support for issuing control commands
    • Not to be confused with IP networking.
    • Protocol not finalized, use at own risk.
  • Added a command to re-offer groups to stores.
    • curator -o | --offer GROUP [GROUP_RANGE_END] STORE_NAME
  • Added Xattrs saving support for groups
    • Configured in hopper scope, applied immediately after the files are imported
    • save = PROPERTY_NAME
    • Any attrs on the file can be used in store path expressions the same way as capture groups
    • Attrs that don't exist but are referenced anyway result in failure & rollback
    • These are reloaded when re-offering files to stores

Performance

  • Reduced IO during startup. The by-order index directory will only be scanned if the cached value for next group ID is missing.
  • Removed random patrolling read on startup
    • A full patrolling read must now be manually requested.
    • curator -p|--patrol by-id | by-order
  • File ingestion is now top priority, thumbnailing & request processing happens after file processing completes.
  • Thumbnailer will now link existing thumbnails instead of generating new ones whenever possible. It is recommended thumbnail paths do not include any file extensions. Doing so could make migrating to subsequent releases problematic.

Bug Fixes

  • Fixed a thumbnailer backoff bug that prevented it from running at full speed
  • Fixed a crash that manifests rarely for GIF files between 60 and 130 KiB

0.0.0

21 Feb 08:39
987706c
Pre-release
Update README.md