fix: parse URL response Content-Type according to RFC 9110 (#2950)
Currently, `file_and_type_from_url()` does not handle the `Content-Type`
header correctly. It assumes that the header contains only the MIME type
(e.g. `text/html`). However, [RFC
9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows
the media type to be followed by parameters, most commonly `charset`.
As a result, loading a URL whose response carries a Content-Type header
such as `text/html; charset=UTF-8` raises a `ValueError`.

To reproduce the issue:

```python
from unstructured.partition.auto import partition

url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```

This results in the following exception:

```python
{
	"name": "ValueError",
	"message": "Invalid file. The FileType.UNK file type is not supported in partition.",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 4
      1 from unstructured.partition.auto import partition
      3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"
----> 4 partition(url=url)

File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)
    539 else:
    540     msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\"
--> 541     raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\")
    543 for element in elements:
    544     element.metadata.url = url

ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```

This PR fixes the issue by parsing the mime-type out of the
`Content-Type` header string.
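
As a minimal sketch of that parsing step, the MIME type is everything before the first `;`, trimmed and lower-cased. The helper name `media_type_from_header` is hypothetical and not part of `unstructured`; it only mirrors the split/strip/lower expression added in this commit:

```python
from typing import Optional


def media_type_from_header(header_value: Optional[str]) -> str:
    """Return the bare MIME type from a Content-Type header value.

    Hypothetical helper for illustration only.
    """
    if not header_value:
        return ""
    # Parameters such as charset follow the first ';' and are discarded.
    return header_value.split(";")[0].strip().lower()


print(media_type_from_header("text/html; charset=UTF-8"))  # text/html
print(media_type_from_header("Text/Plain"))                # text/plain
```

Lower-casing is safe here because RFC 9110 defines the type and subtype tokens as case-insensitive.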


Closes #2257
adieuadieu committed Apr 30, 2024
1 parent 7720e72 commit 0d80886
Showing 4 changed files with 30 additions and 2 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,13 @@
+## 0.13.6
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+- **ValueError: Invalid file (FileType.UNK) when parsing Content-Type header with charset directive** URL response Content-Type headers are now parsed according to RFC 9110.
+
 ## 0.13.5

 ### Enhancements
16 changes: 16 additions & 0 deletions test_unstructured/partition/test_auto.py
@@ -609,6 +609,22 @@ def test_auto_partition_from_url():
     assert elements[0].metadata.url == url


+def test_auto_partition_from_url_with_rfc9110_content_type():
+    url = "https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/LICENSE.md"
+    elements = partition(
+        url=url, content_type="text/plain; charset=utf-8", strategy=PartitionStrategy.HI_RES
+    )
+    assert elements[0] == Title("Apache License")
+    assert elements[0].metadata.url == url
+
+
+def test_auto_partition_from_url_without_providing_content_type():
+    url = "https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/LICENSE.md"
+    elements = partition(url=url, strategy=PartitionStrategy.HI_RES)
+    assert elements[0] == Title("Apache License")
+    assert elements[0].metadata.url == url
+
+
 def test_partition_md_works_with_embedded_html():
     url = "https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/README.md"
     elements = partition(url=url, content_type="text/markdown", strategy=PartitionStrategy.HI_RES)
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.13.5" # pragma: no cover
+__version__ = "0.13.6" # pragma: no cover
4 changes: 3 additions & 1 deletion unstructured/partition/auto.py
@@ -564,7 +564,9 @@ def file_and_type_from_url(
     response = requests.get(url, headers=headers, verify=ssl_verify, timeout=request_timeout)
     file = io.BytesIO(response.content)

-    content_type = content_type or response.headers.get("Content-Type")
+    content_type = (
+        content_type or response.headers.get("Content-Type", "").split(";")[0].strip().lower()
+    )
     encoding = response.headers.get("Content-Encoding", "utf-8")

     filetype = detect_filetype(file=file, content_type=content_type, encoding=encoding)
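
The `split(";")` expression in this hunk keeps only the MIME type and discards any parameters. If the `charset` value were also needed, one possible alternative is the standard library's `email.message.Message`, sketched below. This is an assumption for illustration, not what this commit does, and the helper name `parse_content_type` is hypothetical:

```python
from email.message import Message
from typing import Dict, Tuple


def parse_content_type(header_value: str) -> Tuple[str, Dict[str, str]]:
    """Split a Content-Type value into (media_type, parameters).

    Hypothetical helper for illustration; not part of `unstructured`.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lower-cased "type/subtype" and falls
    # back to "text/plain" when the value is missing or malformed.
    media_type = msg.get_content_type()
    # get_params() yields (name, value) pairs; the first pair is the media
    # type itself, so it is skipped here.
    params = dict(msg.get_params()[1:])
    return media_type, params


media_type, params = parse_content_type("text/html; charset=UTF-8")
print(media_type)             # text/html
print(params.get("charset"))  # UTF-8
```

Note the different fallback: this sketch yields `text/plain` for an empty or malformed value, whereas the expression in the commit yields an empty string.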