Releases: Unstructured-IO/unstructured
0.4.0
- Added generic partition brick that detects the file type and routes a file to the appropriate partitioning brick.
- Added a file type detection module.
- Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
- Cleaning brick for removing ordered bullets: clean_ordered_bullets.
- Extract brick method for ordered bullets: extract_ordered_bullets.
- Test for clean_ordered_bullets.
- Test for extract_ordered_bullets.
- Added partition_docx for pre-processing Word Documents.
- Added new REGEX patterns to extract email header information
- Added new functions to extract header information: parse_received_data and partition_header
- Added new function to parse plain text files: partition_text
- Added new cleaners functions: extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
- Add new Image element and function to find embedded images: find_embedded_images
- Added get_directory_file_info for summarizing information about source documents
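The detect-and-route idea behind the generic partition brick can be sketched with the standard library alone. This is an illustration, not the library's code: every name below (ROUTES, detect_extension, route, split_html, split_text) is hypothetical, and detection by file extension is an assumption, since the release note does not say how the file type is determined.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Dict, List


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def split_html(raw: str) -> List[str]:
    """Hypothetical HTML handler: one entry per text node."""
    collector = _TextCollector()
    collector.feed(raw)
    collector.close()
    return collector.chunks


def split_text(raw: str) -> List[str]:
    """Hypothetical plain-text handler: one entry per blank-line-separated block."""
    return [block.strip() for block in raw.split("\n\n") if block.strip()]


# Hypothetical routing table: detected type -> handler.
ROUTES: Dict[str, Callable[[str], List[str]]] = {
    ".html": split_html,
    ".htm": split_html,
    ".txt": split_text,
}


def detect_extension(filename: str) -> str:
    """Assumption: the file type is inferred from the (case-insensitive) extension."""
    return Path(filename).suffix.lower()


def route(filename: str, raw: str) -> List[str]:
    """Detect the file type, then hand the content to the matching handler."""
    suffix = detect_extension(filename)
    handler = ROUTES.get(suffix)
    if handler is None:
        raise ValueError(f"Unsupported file type: {suffix or 'unknown'}")
    return handler(raw)
```

A table of handlers keeps the router itself unchanged when a new file type is added; only a new entry in ROUTES is needed.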
0.3.5
- Add support for local inference
- Add new pattern to recognize plain text dash bullets
- Add test for bullet patterns
- Fix for partition_html that allows for processing div tags that have both text and child elements
- Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
- Helper functions for identifying and extracting phone numbers
- Add new function extract_attachment_info that extracts and decodes the attachment of an email.
- Staging brick to convert a list of Elements to a pandas dataframe.
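As a rough illustration of what helpers for identifying and extracting phone numbers might look like, here is a minimal regex sketch. The names US_PHONE, contains_phone_number and extract_phone_number are hypothetical, and the pattern assumes US-style numbers only; the release note does not describe the formats the library recognizes.

```python
import re
from typing import Optional

# Assumption: US-style numbers such as 215-867-5309 or +1 (215) 867-5309.
US_PHONE = re.compile(
    r"(?<!\d)"                # do not start in the middle of a digit run
    r"(?:\+?1[\s.-]?)?"       # optional country code
    r"(?:\(\d{3}\)|\d{3})"    # area code, with or without parentheses
    r"[\s.-]?\d{3}"           # exchange
    r"[\s.-]?\d{4}"           # line number
    r"(?!\d)"                 # do not stop in the middle of a digit run
)


def contains_phone_number(text: str) -> bool:
    """Return True if the text contains something shaped like a US phone number."""
    return US_PHONE.search(text) is not None


def extract_phone_number(text: str) -> Optional[str]:
    """Return the first phone-number-shaped substring, or None if there is none."""
    match = US_PHONE.search(text)
    return match.group(0) if match else None
```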
0.3.4
- Python-3.7 compat
0.3.3
- Removes BasicConfig from logger configuration
- Adds the partition_email partitioning brick
- Adds the replace_mime_encodings cleaning bricks
- Small fix to HTML parsing related to processing list items with sub-tags
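To show the kind of cleanup a MIME-encoding cleaner performs, the sketch below decodes quoted-printable escapes with the standard library's quopri module. It assumes the encodings in question are quoted-printable sequences such as =E2=80=99, which the release note does not specify, and decode_quoted_printable is a hypothetical name rather than the library's function.

```python
import quopri


def decode_quoted_printable(text: str, encoding: str = "utf-8") -> str:
    """Replace quoted-printable escapes (e.g. =E2=80=99) with the characters they encode.

    Soft line breaks (an '=' at the end of a line) are removed as part of decoding.
    """
    raw = quopri.decodestring(text.encode(encoding))
    return raw.decode(encoding)
```

Email bodies often arrive with these escapes still in place, so a step like this would typically run before any other text cleaning.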
0.3.2
- Added translate_text brick for translating text between languages
- Add an apply method to make it easier to apply cleaners to elements
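An apply method of this kind could look roughly like the sketch below. TextElement is a hypothetical stand-in for the library's element classes, and the behavior shown (running the cleaners in order and updating the text in place) is an assumption about the design, not a description of the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextElement:
    """Hypothetical element type holding a piece of document text."""

    text: str

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner over the text in order and store the result."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
            if not isinstance(cleaned, str):
                raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned


# Usage: chain any str -> str callables.
element = TextElement("  [1] Hello  ")
element.apply(str.strip, lambda t: t.replace("[1] ", ""))
```

Accepting plain str -> str callables means any existing cleaning function can be passed in without adapting it.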
0.3.1
- Added __init__.py to partition
0.3.0
- Implement staging brick for Argilla. Converts lists of Text elements to argilla dataset classes.
- Removing the local PDF parsing code and any dependencies and tests.
- Reorganizes the staging bricks in the unstructured.partition module
- Allow entities to be passed into the Datasaur staging brick
- Added HTML escapes to the replace_unicode_quotes brick
- Fix bad responses in partition_pdf to raise ValueError
- Adds partition_html for partitioning HTML documents.
0.2.4
- Add an alternative way of importing Final to support google colab
0.2.3
- Add cleaning bricks for removing prefixes and postfixes
- Add cleaning bricks for extracting text before and after a pattern
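The four operations named in the 0.2.3 notes can be sketched with regular expressions. The function names below are hypothetical, and details such as trimming leftover whitespace or returning None when the pattern is absent are assumptions made for this illustration.

```python
import re
from typing import Optional


def strip_prefix(text: str, pattern: str) -> str:
    """Remove a leading match of `pattern`, then any whitespace left behind."""
    return re.sub(rf"^(?:{pattern})", "", text).lstrip()


def strip_postfix(text: str, pattern: str) -> str:
    """Remove a trailing match of `pattern`, then any whitespace left behind."""
    return re.sub(rf"(?:{pattern})$", "", text).rstrip()


def text_after(text: str, pattern: str) -> Optional[str]:
    """Return the text following the first match of `pattern`, or None if absent."""
    match = re.search(pattern, text)
    return text[match.end():].lstrip() if match else None


def text_before(text: str, pattern: str) -> Optional[str]:
    """Return the text preceding the first match of `pattern`, or None if absent."""
    match = re.search(pattern, text)
    return text[:match.start()].rstrip() if match else None
```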
0.2.2
- Add staging brick for Datasaur