Releases: Unstructured-IO/unstructured

0.4.11

17 Feb 17:12
601f250

  • Adds partition_doc for partitioning Word documents in .doc format. Requires libreoffice.
  • Adds partition_ppt for partitioning PowerPoint documents in .ppt format. Requires libreoffice.
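Both functions depend on LibreOffice being installed. The sketch below shows one way to check for it before partitioning legacy files. It assumes the LibreOffice command-line binary is named `soffice` and that a headless conversion to the newer Office format is the relevant step; neither detail is stated in these notes, and the helper names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def libreoffice_available(binary: str = "soffice") -> bool:
    """Return True if the (assumed) LibreOffice binary is on PATH."""
    return shutil.which(binary) is not None


def build_convert_command(source: str, target_format: str, outdir: str) -> List[str]:
    """Build a headless LibreOffice conversion command. Nothing is executed here."""
    return [
        "soffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        outdir,
        str(Path(source)),
    ]


cmd = build_convert_command("report.doc", "docx", "converted")
print(" ".join(cmd))
```

Calling `libreoffice_available()` first lets a pipeline fail early with a clear message instead of partway through a batch.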

0.4.10

16 Feb 17:26
f5ff140

  • Fixes ElementMetadata so that it's JSON serializable when the filename is a Path object.
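The underlying problem is general: `json.dumps` cannot encode `pathlib.Path` values. The sketch below reproduces the failure and one common remedy using only the standard library. `MetadataSketch` is a hypothetical stand-in, not the library's `ElementMetadata`, and converting the path to `str` is an assumed fix rather than the one actually shipped.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for a metadata record; not the library's class."""

    filename: Optional[Union[str, Path]] = None

    def to_dict(self) -> dict:
        data = asdict(self)
        # json.dumps cannot encode Path objects, so normalize to str first.
        if isinstance(data["filename"], Path):
            data["filename"] = str(data["filename"])
        return data


meta = MetadataSketch(filename=Path("docs") / "example.pdf")

try:
    json.dumps(asdict(meta))  # raises TypeError because of the Path value
    raw_ok = True
except TypeError:
    raw_ok = False

serialized = json.dumps(meta.to_dict())  # succeeds once the path is a str
```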

0.4.9

15 Feb 18:27
74e6b84

  • Added ingest modules and an S3 connector
  • Default to url=None for partition_pdf and partition_image
  • Add ability to skip the English-specific check by setting the UNSTRUCTURED_LANGUAGE env var to "".
  • Document Element objects now track metadata
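For the UNSTRUCTURED_LANGUAGE setting, the variable has to be set before the partitioning code reads it. The sketch below shows the mechanics only. `english_check_enabled` is a hypothetical helper, and the `"en"` fallback for an unset variable is an assumption, not something these notes state.

```python
import os

# Set before any partitioning code runs; an empty string turns the check off.
os.environ["UNSTRUCTURED_LANGUAGE"] = ""


def english_check_enabled() -> bool:
    """Hypothetical helper showing how a caller might read the setting.

    Assumption: an unset variable or any non-empty value leaves the check on.
    """
    return os.environ.get("UNSTRUCTURED_LANGUAGE", "en") != ""


print(english_check_enabled())  # False
```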

0.4.8

13 Feb 19:32
a920e55

  • Modified XML and HTML parsers not to load comments.
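To illustrate what "not loading comments" means at the parser level, the sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in. The parser the library itself configures is not named in these notes, so this shows the concept only.

```python
import xml.etree.ElementTree as ET

SOURCE = "<doc><!-- internal note --><p>Visible text</p></doc>"

# Default parser: comments are dropped while the tree is built.
without_comments = ET.fromstring(SOURCE)

# Opt-in parser that keeps comments as nodes (Python 3.8+).
keeping_parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
with_comments = ET.fromstring(SOURCE, parser=keeping_parser)

print(len(without_comments))  # 1
print(len(with_comments))  # 2
```

When comments never enter the tree, downstream code that walks child nodes cannot mistake them for document content.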

0.4.7

10 Feb 16:40
962de78

  • Added the ability to pull an HTML document from a URL in partition_html.
  • Added the ability to get file summary info from lists of filenames and lists
    of file contents.
  • Added optional page break to partition for .pptx, .pdf, images, and .html files.
  • Added to_dict method to document elements.
  • Include more unicode quotes in replace_unicode_quotes.

0.4.6

03 Feb 22:15
014585e

  • Loosen the default cap threshold to 0.5.
  • Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling
    the cap ratio threshold.
  • Unknown text elements are identified as Text for HTML and plain text documents.
  • Body Text styles no longer default to NarrativeText for Word documents. The style information
    is insufficient to determine that the text is narrative.
  • Uppercase text is lowercased before checking for verbs. This helps avoid some missed verbs.
  • Adds an Address element for capturing elements that only contain an address.
  • Suppress the UserWarning when detectron is called.
  • Checks that titles and narrative text have at least one English word.
  • Checks that titles and narrative text are at least 50% alpha characters.
  • Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH
    environment variable for controlling the max number of words in a title.
  • Updated partition_pptx to order the elements on the page
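The capitalization and alpha-character checks above can be pictured with a small sketch. These are simplified, hypothetical versions written for illustration, not the library's implementation. Counting a word as capitalized when its first character is uppercase, and ignoring whitespace in the alpha ratio, are both assumptions. Only the 0.5 default and the environment variable name come from the notes.

```python
import os


def cap_threshold(default: float = 0.5) -> float:
    """Read the cap threshold from the environment, falling back to 0.5."""
    return float(os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", default))


def exceeds_cap_ratio_sketch(text: str, threshold: float) -> bool:
    """True if the share of capitalized words is above the threshold."""
    words = text.split()
    if not words:  # empty text cannot exceed the ratio
        return False
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words) > threshold


def under_alpha_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """True if the share of alphabetic non-space characters is below the threshold."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(c.isalpha() for c in chars) / len(chars) < threshold


print(exceeds_cap_ratio_sketch("Quarterly Revenue Summary Report", 0.5))  # True
print(exceeds_cap_ratio_sketch("The report covers revenue for the last quarter.", 0.5))  # False
print(under_alpha_ratio_sketch("--- 1234 ---"))  # True
```

Under this reading, text that trips either check is a poor candidate for narrative text, and raising the threshold through the environment variable makes the capitalization check more permissive.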

0.4.4

25 Jan 17:01
1ce8447

  • Updated partition_pdf and partition_image to return unstructured Element objects
  • Fixed the healthcheck URL path when partitioning images and PDFs via API
  • Adds an optional coordinates attribute to document objects
  • Adds FigureCaption and CheckBox document elements
  • Added ability to split lists detected in LayoutElement objects
  • Adds partition_pptx for partitioning PowerPoint documents
  • LayoutParser models now download from Hugging Face Hub instead of Dropbox
  • Fixed file type detection for XML and HTML files on Amazon Linux

0.4.3

18 Jan 17:31
59f972d

  • Adds requests as a base dependency
  • Fix in exceeds_cap_ratio so the function doesn't break with empty text
  • Fix bug in _parse_received_data.
  • Update detect_filetype to properly handle .doc, .xls, and .ppt.

0.4.2

17 Jan 16:36
9c3c14e

  • Added partition_image to process documents in an image format.
  • Fixed UTF-8 encoding error in partition_email with attachments for text/html

0.4.1

13 Jan 22:23
419c086

  • Added support for text files in the partition function
  • Pinned opencv-python for easier installation on Linux