Metadata file #16

Open
kissgyorgy opened this issue Nov 22, 2021 · 11 comments

Comments

@kissgyorgy
Contributor

kissgyorgy commented Nov 22, 2021

Store the metadata in one JSON file in extract_root.

We don't want to pollute the extracted folder with lots of small files.
It's nice if this is easy to read, and JSON is easy to look at.

For example:


from typing import Optional

import attr


@attr.define
class Metadata:
    filename: Optional[str]
    size: Optional[int]
    perms: Optional[int]
    endianness: Optional[str]
    uid: Optional[int]
    username: Optional[str]
    gid: Optional[int]
    groupname: Optional[str]
    inode: Optional[int]
    vnode: Optional[int]


@attr.define
class Chunk:
    """Chunk of a Blob; has start and end offsets, but can still be invalid."""

    start_offset: int
    # This is the last byte included
    end_offset: int
    handler: "Handler" = attr.ib(init=False, eq=False)
    metadata: Optional[Metadata] = None
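
For the single JSON file itself, a minimal serialization sketch could look like this, reusing the Metadata class above (the metadata.json name and the write_metadata helper are illustrative assumptions, not something decided in this issue):

import json
from pathlib import Path
from typing import List

import attr


def write_metadata(extract_root: Path, entries: List[Metadata]) -> None:
    # One JSON document for the whole extract_root instead of many small files
    payload = [attr.asdict(entry) for entry in entries]
    # Hypothetical file name; the exact name/location is not decided here
    (extract_root / "metadata.json").write_text(json.dumps(payload, indent=2))
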
@martonilles
Contributor

martonilles commented Mar 18, 2022

We have a root File, and when processing a directory, a list of root Files. A rough sketch of both structures follows after the lists below.
FilesystemObject

  • parent (Chunk, null for root)
  • children (list of Chunk, could be zero)
  • path
  • type (File, Directory, Device, Symlink etc.)
  • permission, ownership, timestamp, acl etc. (coming from the handler which extracts metadata from the chunk, otherwise left as null)
  • magic/mime
  • NB: we want to record metadata on "files" that are not written as part of the extraction (eg: char devices from squashfs)

Chunk

  • parent (File)
  • children (list of Files, could be zero)
  • start/end offset
  • length
  • type (handler)
  • tags: encryption,
  • metadata key/values (FIXME)
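
A rough Python sketch of these two structures, assuming attrs like the rest of unblob (names and types here are illustrative, taken from the lists above, not an agreed design):

import enum
from pathlib import Path
from typing import List, Optional

import attr


class FsType(enum.Enum):
    FILE = "file"
    DIRECTORY = "directory"
    DEVICE = "device"
    SYMLINK = "symlink"


@attr.define
class FilesystemObject:
    path: Path
    fs_type: FsType
    parent: Optional["Chunk"] = None  # null for the root
    children: List["Chunk"] = attr.field(factory=list)
    # Coming from the handler that extracted metadata, otherwise left as None
    permission: Optional[int] = None
    owner: Optional[str] = None
    group: Optional[str] = None
    timestamp: Optional[int] = None
    magic: Optional[str] = None
    mime: Optional[str] = None


@attr.define
class Chunk:
    parent: FilesystemObject
    start_offset: int
    end_offset: int
    handler_name: str  # type (handler)
    children: List[FilesystemObject] = attr.field(factory=list)
    tags: List[str] = attr.field(factory=list)  # e.g. "encryption"
    metadata: dict = attr.field(factory=dict)  # key/values (the FIXME above)

    @property
    def length(self) -> int:
        # Assuming an exclusive end_offset
        return self.end_offset - self.start_offset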

Questions:

  • how can we get the metadata (can we get it from the extractors; are they smart enough)?
  • if metadata gathering is expensive, we should probably make those optional
  • do we want to store any errors (eg: extraction errors) related to files/chunks and if yes how?

@qkaiser
Contributor

qkaiser commented Apr 2, 2023

Almost all of the information described above is now part of the reporting feature of unblob.

The information that is missing right now:

  1. meta-data about files that were not created because we run without elevated privileges (block devices, character devices)
  2. exact permission, ownership, and timestamp information on every file

I don't think item 1 has a lot of added value right now. Regarding item 2, we already have the structure in place to collect that information. What remains is making sure the extraction phase preserves that information so that we can simply stat the file for details.

I would take care of item 2 in two steps:

  1. add permission, ownership, and timestamps to StatReports
  2. once it's there, spend time making sure we extract or use extractors in a way that preserves that information whenever possible. We already collected that information at https://unblob.org/formats/, and it's something our intern can do :)

On top of that, I would like to add a specific feature to our meta-data collection effort: saving header information. The idea is to have a metadata field as part of our ChunkReports, which is simply a dict where the handler developer can put relevant information, such as parsed headers.

I submitted a PR to dissect.cstruct going in that direction (see fox-it/dissect.cstruct#29).

The idea behind this is to expose metadata to further analysis steps through the unblob report (e.g. a binary analysis toolkit would read the load address and architecture from a uImage chunk to analyze the file extracted from that chunk with the right settings).
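
As an illustration of that consumer side, a downstream script could pull the metadata out of the JSON report along these lines (the report layout, the "uimage" handler name, and the metadata keys are assumptions made for the example):

import json
from pathlib import Path


def uimage_metadata(report_path: Path):
    # Assumes the report is a list of task results, each carrying a "reports" list
    for task_result in json.loads(report_path.read_text()):
        for report in task_result.get("reports", []):
            # Assumes ChunkReport entries expose handler_name and the new metadata dict
            if report.get("handler_name") == "uimage" and report.get("metadata"):
                yield report["metadata"]


for meta in uimage_metadata(Path("report.json")):
    # e.g. hand the load address / architecture to a binary analysis toolkit
    print(meta.get("load_address"), meta.get("architecture"))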

All of these changes are quite simple to implement since reporting is already there:

diff --git a/unblob/handlers/archive/sevenzip.py b/unblob/handlers/archive/sevenzip.py
index 040b409..de171c5 100644
--- a/unblob/handlers/archive/sevenzip.py
+++ b/unblob/handlers/archive/sevenzip.py
@@ -70,4 +70,8 @@ class SevenZipHandler(StructHandler):
         # We read the signature header here to get the offset to the header database
         first_db_header = start_offset + len(header) + header.next_header_offset
         end_offset = first_db_header + header.next_header_size
-        return ValidChunk(start_offset=start_offset, end_offset=end_offset)
+        return ValidChunk(
+            start_offset=start_offset,
+            end_offset=end_offset,
+            metadata=dict(header),
+        )
diff --git a/unblob/models.py b/unblob/models.py
index 2b8431f..d101a08 100644
--- a/unblob/models.py
+++ b/unblob/models.py
@@ -88,6 +88,7 @@ class ValidChunk(Chunk):
 
     handler: "Handler" = attr.ib(init=False, eq=False)
     is_encrypted: bool = attr.ib(default=False)
+    metadata: dict = attr.ib(factory=dict)
 
     def extract(self, inpath: Path, outdir: Path):
         if self.is_encrypted:
@@ -108,6 +109,7 @@ class ValidChunk(Chunk):
             size=self.size,
             handler_name=self.handler.NAME,
             is_encrypted=self.is_encrypted,
+            metadata=self.metadata,
             extraction_reports=extraction_reports,
         )
 
@@ -188,7 +190,7 @@ class _JSONEncoder(json.JSONEncoder):
 
         if isinstance(obj, bytes):
             try:
-                return obj.decode()
+                return obj.decode("utf-8", errors="surrogateescape")
             except UnicodeDecodeError:
                 return str(obj)
 
diff --git a/unblob/report.py b/unblob/report.py
index 1b5bed1..acdabaf 100644
--- a/unblob/report.py
+++ b/unblob/report.py
@@ -4,7 +4,7 @@ import stat
 import traceback
 from enum import Enum
 from pathlib import Path
-from typing import List, Optional, Union, final
+from typing import Dict, List, Optional, Union, final
 
 import attr
 
@@ -116,6 +116,12 @@ class MaliciousSymlinkRemoved(ErrorReport):
 class StatReport(Report):
     path: Path
     size: int
+    ctime: int
+    mtime: int
+    atime: int
+    uid: int
+    gid: int
+    mode: int
     is_dir: bool
     is_file: bool
     is_link: bool
@@ -133,6 +139,12 @@ class StatReport(Report):
         return cls(
             path=path,
             size=st.st_size,
+            ctime=st.st_ctime_ns,
+            mtime=st.st_mtime_ns,
+            atime=st.st_atime_ns,
+            uid=st.st_uid,
+            gid=st.st_gid,
+            mode=st.st_mode,
             is_dir=stat.S_ISDIR(mode),
             is_file=stat.S_ISREG(mode),
             is_link=stat.S_ISLNK(mode),
@@ -181,6 +193,7 @@ class ChunkReport(Report):
     end_offset: int
     size: int
     is_encrypted: bool
+    metadata: Dict
     extraction_reports: List[Report]

Please let me know what you think about this approach.

@martonilles
Contributor

An issue could be that in _extract_chunk, after the extraction is done, we call fix_extracted_directory, which calls fix_permission:

def fix_permission(path: Path):
    if path.is_file():
        path.chmod(0o644)
    elif path.is_dir():
        path.chmod(0o775)

So, by the time we run StatReport.from_path to check them, the permissions have already been changed.

Also, if the extraction is not running as root, the uid/gid will be inaccurate as well.

It could also be problematic when the ownership in the format is stored using names and those names are not present on the system.
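
Going back to the first point, a self-contained toy sketch of the ordering problem (extract_and_report is a made-up name, not unblob's actual flow):

from pathlib import Path


def fix_permission(path: Path):
    if path.is_file():
        path.chmod(0o644)
    elif path.is_dir():
        path.chmod(0o775)


def extract_and_report(extract_dir: Path):
    # ... extraction has already happened; files still carry their original modes ...
    for path in extract_dir.rglob("*"):
        fix_permission(path)  # modes are normalized here
    # Any stat taken afterwards records 0o644/0o775,
    # not the permissions stored in the archive.
    return {path: oct(path.stat().st_mode & 0o777) for path in extract_dir.rglob("*")}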

@martonilles
Contributor

The metadata part looks OK, though I am not sure we want to store the whole header; I would rather try to standardize the stored meta information. We could also store the raw header, though in some cases there are multiple headers, etc.

@vlaci
Contributor

vlaci commented Mar 11, 2024

Had some discussions with @orosam around unblob better preserving/logging/reporting file metadata. Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata, like ownership information, character and block device details and so on.

@qkaiser
Contributor

qkaiser commented Mar 11, 2024

Had some discussions with @orosam around unblob better preserving/logging/reporting file metadata. Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata, like ownership information, character and block device details and so on.

I like the approach, but can you be a bit more specific? Do you have examples or specific ideas in mind?

@kissgyorgy
Contributor Author

Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata, like ownership information, character and block device details and so on.

I don't understand why this would help. If the format can reproduce this metadata, it is contained in the format itself, which can be parsed and extracted without looking at the extracted files. What am I missing?

@vlaci
Contributor

vlaci commented Mar 11, 2024

I don't understand why this would help. If the format can reproduce this metadata, it is contained in the format itself, which can be parsed and extracted without looking at the extracted files. What am I missing?

Probing question: do we want to eventually replace all extractors with our hand-rolled ones? If so, then this totally makes sense. If we are going to outsource extraction to external implementations, I don't want us to get so intimately familiar with each format that we'd be able to parse out these details. Some extractors have listing commands, but those need to be parsed as well, and may not contain all the details we want to gather.

I like the approach, but can you be a bit more specific? Do you have examples or specific ideas in mind?

My idea is to have a very thin FUSE driver, executed either outside of unblob or inside it as a thread, that would forward¹ all operations to the underlying filesystem and record metadata from the interesting ones, like mknod, chown, etc. See the list of available operations here: https://libfuse.github.io/doxygen/structfuse__operations.html. According to my almost non-existent Mac knowledge, the FUSE API is supported there as well.
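
A minimal sketch of what such a driver could look like, assuming the third-party fusepy binding (any FUSE binding would do; only a handful of operations are shown, a real passthrough driver forwards all of them):

import os

from fuse import FUSE, FuseOSError, Operations  # fusepy


class MetadataRecorder(Operations):
    """Forward operations to a backing directory and record the interesting ones."""

    def __init__(self, backing_root):
        self.backing_root = backing_root
        self.recorded = {}  # path -> captured metadata

    def _real(self, path):
        return os.path.join(self.backing_root, path.lstrip("/"))

    # Forwarded operation (read/write/mkdir/... omitted for brevity)
    def getattr(self, path, fh=None):
        try:
            st = os.lstat(self._real(path))
        except OSError as e:
            raise FuseOSError(e.errno)
        return {
            key: getattr(st, key)
            for key in ("st_mode", "st_nlink", "st_size", "st_uid",
                        "st_gid", "st_atime", "st_mtime", "st_ctime")
        }

    # "Interesting" operations: record the intent, sanitize the effect
    def mknod(self, path, mode, dev):
        # Don't actually create the device node, just remember it was requested
        self.recorded[path] = {"type": "device", "mode": mode, "dev": dev}

    def chown(self, path, uid, gid):
        # Record ownership intent; files in the backing directory stay owned by us
        self.recorded.setdefault(path, {}).update({"uid": uid, "gid": gid})


# Usage sketch: mount over the extraction dir, run the extractor against the
# mountpoint, unmount, then merge `recorded` into the report:
# FUSE(MetadataRecorder("/path/to/backing"), "/path/to/extract_root", foreground=True)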

The complexity of this approach is that we are not using the details stored in the archive/fs image, but the intent of the extractor tools: e.g. if they are incomplete or just doing their own thing, diverging from the data format, we miss those details. OTOH, it would be trivial to wire up any format which has a well-behaving extractor.

Footnotes

  1. Actually, sanitization could take place at this level: e.g. device node creation can be skipped, symlinks validated, and so on.

@qkaiser
Contributor

qkaiser commented Mar 12, 2024

So if I understand correctly, the FUSE layer would allow any kind of operation, like fakeroot does. It would save the intent of the operation as metadata (uid, gid, timestamps, mode) and then proceed by doing what unblob is currently doing (setting ownership and permissions so that extraction can continue).

Correct?

@qkaiser
Contributor

qkaiser commented Mar 12, 2024

Would a FUSE layer interpose itself between the extraction directory and external tools launched as subprocesses, like 7z? Is it possible from an unprivileged perspective?

@vlaci
Contributor

vlaci commented Mar 12, 2024

Would a FUSE layer interpose itself between the extraction directory and external tools launched as subprocesses, like 7z? Is it possible from an unprivileged perspective?

That would be the idea. Unfortunately, it is a pain¹ to make it work inside Docker, because it requires access to a kernel facility on the host. Otherwise, it would work for normal users.

An alternative approach we have discussed in the past is to LD_PRELOAD/DYLD_INSERT_LIBRARIES a shim, or use some other introspection method to trace I/O calls. Unfortunately, that comes with its own can of worms, as it may not work for commands which are statically linked to libc or make syscalls directly (e.g. Go on Linux).

Footnotes

  1. It requires passing --cap-add SYS_ADMIN --device /dev/fuse.
