More output options #56

Open
Chaphasilor opened this issue Dec 10, 2020 · 23 comments

Comments

@Chaphasilor
Contributor

Is your feature request related to a problem? Please describe.

We are currently running into the problem that we have very large (3GB+) JSON files generated by ODD, but can't process them because we don't have enough RAM to parse the JSON.
I personally love JSON, but it seems like the format is not well-suited for the task (it's not streamable).

Now, you might ask: why don't you guys just use the .txt file? The problem is that it is only created after the scan is finished, including file size estimations. After scanning a large OD for ~6h yesterday, I had a couple million links, with over 10M links left in the queue for file size estimation. The actual URLs were already there, but the only way to save them was by hitting J to save as JSON.

Describe the solution you'd like

There are multiple features that would be useful for very large ODs:

  • add a key command to prematurely save the .txt-file
    this should be no problem at all and is simply a missing option/command at this point
  • adopt a new file format that supports streaming parsers
    think jsonlines, csv, whatever
    it might also be a good idea to restructure the meta info of the scan and files in order to remove duplicate info and make the output files smaller and easier to work with
  • while we're at it, an option for saving the reddit output as well as error logs to a separate file would also be appreciated! :D

@MCOfficer and I would be glad to discuss the new file structure further, if you're so inclined :)

@KoalaBear84
Owner

KoalaBear84 commented Dec 10, 2020

Hey @Chaphasilor and @MCOfficer

Yes, I of course ran into issues like this. My biggest JSON is 6.23GB, which still works on my local machine with 48GB RAM, but yes, it's bad 😂

JSON has some positive things, but the size and the RAM usage aren't among them. One thing that's still good is the parent/child structure, which you can't really do nicely in any line-based format like txt/csv etc.

I thought about this issue before and wanted to completely rewrite it all to use SQLite; then RAM also doesn't matter while still scanning. When I see the queue reach 100.000 I mostly stop scanning 🤣 The SQLite part is too much effort for NOW, but I will hopefully try just around new year, no promises.

The Reddit part is already logged in History.log. Errors are not in a separate log, but that might already be possible by adding your own nlog.config; I'm not sure about that, because I changed that recently for the single-file releases.

The TXT part I added (ugly) and will be released in some minutes :)

I'm already using CSV for other tools, so I could also easily add that. Will do when I have some time.

@KoalaBear84
Owner

See https://github.com/KoalaBear84/OpenDirectoryDownloader/releases/tag/v1.9.2.6 for intermediate TXT saving.

@Chaphasilor
Contributor Author

One thing that's still good is the parent/child structure, which you can't really do nicely in any line-based format like txt/csv etc.

Yeah, we are aware of that. I thought about either explicitly listing the 'path' for each file (maybe using IDs) or adding special lines that indicate the 'path' of each following file (URL). It could be pseudo-nested or explicit.

I thought about this issue before and wanted to completely rewrite it all to use SQLite; then RAM also doesn't matter while still scanning. The SQLite part is too much effort for NOW, but I will hopefully try just around new year, no promises.

I'm not familiar with SQLite; isn't it a database? If it's just a file format that is easy to parse, that would be nice, but I believe a database would make ODD more complicated and harder to use/wrap.

The Reddit part is already logged in History.log.

I'll check it out tomorrow. Does it only contain the reddit markdown stuff? The reason I'm asking is that parsing stdout to find and extract that table is more of a workaround than a solution ^^

The TXT part I added (ugly) and will be released in some minutes :)

You're awesome! <3

Already using CSV for other tools so I could also easily add that. Will do when I have some time.

Don't rush this. CSV was simply one format that came to mind for having easily-parsable files with items that contain meta-info. There might be better formats than this.
I'm not opposed to something completely custom, if that's more efficient :)

@MCOfficer
Contributor

Here's an idea making use of JSONLines. It's not pretty, but I don't believe one can actually represent nested structures in a streaming manner:

{ "type": "root", "url": "https://OD/", "subdirectories": ["public"] }
{ "type": "file", "url": "https://OD/test.txt", "name": "test.txt", "size": 351 }
{ "type": "directory", "url": "https://OD/public", "name": "public", "subdirectories": [] }

No matter how you put it, it will be pretty hard to rebuild a nested structure from this data format, but that's what JSON is for.
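
For what it's worth, a streaming consumer of such a file only ever holds one line in memory. Here's a minimal C# sketch, assuming a hypothetical Entry record whose property names simply mirror the example lines above (not ODD's actual model):

using System;
using System.IO;
using System.Text.Json;

// Hypothetical shape mirroring the example lines above; not ODD's real data model.
public record Entry(string Type, string Url, string? Name, long? Size, string[]? Subdirectories);

public static class JsonLinesReader
{
    public static void Main(string[] args)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

        // File.ReadLines streams lazily, so memory usage stays flat regardless of file size.
        foreach (string line in File.ReadLines(args[0]))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            Entry? entry = JsonSerializer.Deserialize<Entry>(line, options);
            Console.WriteLine($"{entry?.Type}: {entry?.Url}");
        }
    }
}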

@Chaphasilor
Contributor Author

Chaphasilor commented Dec 11, 2020

No matter how you put it, it will be pretty hard to rebuild a nested structure from this data format, but that's what JSON is for.

I believe IDs might help us out:

Rebuild only the structure of the OD with JSON, as compact as possible: include just the dir names along with a random ID, without any of the files and meta-info. Put it as the first line.
It might be long, but even with thousands of subdirectories it should still be parsable.

And then below that, for each ID, add the necessary meta info.
If the type is dir, the ID refers to the actual directory.
If the type is file, the ID refers to the parent directory.

If I'm not missing something obvious, this should make it possible to rebuild the nested structure?
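
For illustration, attaching everything by ID could then be a single pass over the meta lines, keeping only a dictionary from directory ID to directory. This is a rough sketch under the assumptions above ('dir' lines carry their own ID, 'file' lines carry their parent directory's ID, and a directory's line comes before its files'); the key names and record shapes are made up, and the nesting between the directories themselves would still come from the structure line:

using System.Collections.Generic;
using System.Text.Json;

// Made-up shapes for illustration; the real format is still being discussed.
public record MetaLine(string Type, string Id, string? Url, string? Name, long? Size);

public class ScanDirectory
{
    public string? Name;
    public List<MetaLine> Files = new();
}

public static class Rebuilder
{
    // One pass: directories register themselves under their ID,
    // files attach to the directory whose ID they carry.
    public static Dictionary<string, ScanDirectory> Rebuild(IEnumerable<string> metaLines)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        var dirsById = new Dictionary<string, ScanDirectory>();

        foreach (string raw in metaLines)
        {
            var line = JsonSerializer.Deserialize<MetaLine>(raw, options)!;
            if (line.Type == "dir")
                dirsById[line.Id] = new ScanDirectory { Name = line.Name };
            else if (line.Type == "file")
                dirsById[line.Id].Files.Add(line); // for files, the ID is the parent directory's ID
        }

        return dirsById;
    }
}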

@Chaphasilor
Contributor Author

Chaphasilor commented Dec 11, 2020

I just put a 'proof of concept' together here:

Edit naughty-sinoussi-m4kjt

I'm using a jsonlines-based format, where the first line contains the general info about the scan, the second line contains the directory structure and the following lines contain meta-info about the directories and files.
What's important is that a parent directory has to come before any of its child directories and/or files.

The tool can take the regular JSON output and parse it into the new file format, for testing purposes. It only works with small files (<50 kB), due to the limitations discussed above.

It can also take the new file format and parse it back into the old format, proving that the new format perfectly preserves all info and the OD structure, without many of the previous drawbacks.
If applied correctly, the parser could be optimized in a way that allows processing arbitrarily large files (depending on the use case).

Would love to hear your thoughts on this @KoalaBear84 @MCOfficer :D

Edit: The file format is just an example. We could use that one, but if there are even better ways to do it, I'm all for it!

@MCOfficer
Contributor

The file format would work - I lack the experience to design something better, tbh.

One thing I found counterintuitive is that each file has an ID, which is actually the ID of its parent directory. It should either be named accordingly, or be moved to a per-file basis.

@Chaphasilor
Contributor Author

Chaphasilor commented Jan 13, 2021

One thing I found counterintuitive is that each file has an ID, which is actually the ID of its parent directory. It should either be named accordingly, or be moved to a per-file basis.

Yeah, those keys could (should) be renamed. I'll think about a better naming scheme tonight!

@Chaphasilor
Contributor Author

Okay, I've renamed the parameter to directoryId for type==file and kept it as id for type==dir. This should be clear enough now.

I also fixed a bug that caused the directory names to get lost in translation; now the only differences between the original file and the reconstructed file are the IDs and some slight reordering.

From where I'm standing, the only thing left to do is to implement this in C#/.NET.
I'm just not sure how the JSON output is constructed right now and whether it can easily be changed to the new format...

If @KoalaBear84 could point me in the general direction, I'd be willing to contribute a PR to offer the new format alongside the currently available ones.


On a different note:
Once the new format is implemented and tested, we could still keep the normal JSON available, because that's just easier to work with in most cases.
And even though putting out huge JSON files obviously isn't a problem for ODD itself, we could add a warning/confirmation "dialog" if the user tries to save a very large scan in JSON format, and offer to save it in the new format instead.
And then for applications like our automated scanners, we could specify a flag so that it's always output in the new format and we don't need to worry about two formats :)

@KoalaBear84
Owner

True. I'll take a look at it another time. Too much going on right now with the homeschooling part as an extra added bonus 😂

Also want to rewrite to a SQLite version. Then it doesn't matter at all how big a directory is. Right now the whole directory structure is built up in memory exactly like it is in the JSON, which is not particularly great for processing, as we all have experienced. Who would have imagined ODs with 10K directories and 10M files 🤣

And because SQLite has an implementation in nearly every language, it is portable.

@Chaphasilor
Contributor Author

And because SQLite has an implementation in nearly every language, it is portable.

Does that mean SQLite can dump its internal DB so we can import it somewhere else? Or how does SQLite help with exporting and importing scan results?

@KoalaBear84
Owner

SQLite is just a database; you can use a client in every programming language to read it and import it wherever you want.
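
To make the import/export point concrete: any language with a SQLite client can open the scan database directly. A minimal C# sketch with Microsoft.Data.Sqlite (the Files table and its columns are invented here, not ODD's actual schema):

using System;
using Microsoft.Data.Sqlite;

// Hypothetical schema: a Files table with Url and Size columns.
using var connection = new SqliteConnection("Data Source=Scan.db");
connection.Open();

using var command = connection.CreateCommand();
command.CommandText = "SELECT Url, Size FROM Files ORDER BY Url;";

// The reader streams rows one at a time, so even huge scans never have to fit in RAM.
using var reader = command.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetString(0)}\t{reader.GetInt64(1)}");

The same few lines exist in pretty much every other language's SQLite binding, which is what makes it usable as an exchange format.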

@Chaphasilor
Contributor Author

Ah okay, I took a quick look at it but didn't think it through xD

Makes sense 👌🏻

@MCOfficer
Contributor

It feels wrong to use a database as a data exchange format, but I can't seem to find any arguments against it. Weird.

@KoalaBear84
Owner

No, this isn't a promise, as I see it's even more work than I thought. I have to rewrite a lot of code and handle all the parent/subdirectory things with database keys/IDs.

But I've made a start. What I expected was right: it does get slower, because for every URL it wants to process, it checks whether it's already in the database.

Besides that, it needs to insert every directory and URL on disk, which also takes time.

For now it looks like an open directory with 950 directories and 15.228 files goes from 9 seconds to 16 seconds scanning/processing time. But... that is still with all queueing in memory, and all of that has to be rewritten to use SQLite too.

So.. I started it as a test, but 95% is yet to be done, and this already took 4 hours to check 😮

@MCOfficer
Contributor

MCOfficer commented Jan 17, 2021

But I've made a start. What I expected was right: it does get slower, because for every URL it wants to process, it checks whether it's already in the database.

I assume you are using one SQLite DB for every scanned OD. In that case, you could maintain a HashSet (or whatever C#'s equivalent is) of all visited URLs, which is significantly faster to check against.

Besides that, it needs to insert every directory and url on disk, which also takes time.

This idea may be more complicated and I lack the experience to judge it, but:
SQLite DBs can be in-memory. You can just write to your in-memory database, and then persist it to the disk once the scan is complete.

This may also make the HashSet unnecessary, as in-memory DBs are typically blazingly fast. I'm not sure how much faster they are when reading though, because reading can be cached quite efficiently.
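
Both suggestions are cheap to sketch in C#. This is only a rough illustration under assumptions (Microsoft.Data.Sqlite as the client library, a made-up Files table), not what ODD actually does:

using System.Collections.Generic;
using Microsoft.Data.Sqlite;

// A HashSet gives O(1) "have we seen this URL already?" checks without touching the database.
var visited = new HashSet<string>();
bool firstVisit = visited.Add("https://OD/public/"); // Add returns false if the URL was already present

// An in-memory database during the scan...
using var memory = new SqliteConnection("Data Source=:memory:");
memory.Open();

using (var create = memory.CreateCommand())
{
    create.CommandText = "CREATE TABLE Files (Url TEXT PRIMARY KEY, Size INTEGER);";
    create.ExecuteNonQuery();
}

// ...and a single backup to disk once the scan is complete.
using var disk = new SqliteConnection("Data Source=Scan.db");
disk.Open();
memory.BackupDatabase(disk);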

@KoalaBear84
Owner

Yes, HashSet is the same in C#; I have used it before. Funnily enough, I didn't use it in the current implementation. I also want to make a 'resume' function, so that you can pause the current scan (because you need to restart the server/computer) and continue the previous scan later. HashSet is probably a good choice for this problem.

Indeed, I have used "Data Source=:memory:" as well; that was my first test, and it inserted 10.000 urls in 450ms. Then I changed to using disk, which takes 120 SECONDS for 10.000 urls, but that was before performance optimizations 😇

I think it will be fast enough, especially when the OD becomes very big and we have no more memory issues. Also, writing the URLs file will not depend on memory anymore and will be a lot faster when the OD has a lot of URLs. We can just query the database, and everything will stream from database to file.

Refactored some more now, rewrote it all to the native SQLite thing. Hopefully more news 'soon'.

Ahh, looks like the 5 SQLite library DLLs needed are only 260 kB; I expected MBs 😃

@Chaphasilor
Contributor Author

SQLite DBs can be in-memory. You can just write to your in-memory database, and then persist it to the disk once the scan is complete.

I believe our goal was to reduce memory usage, yes? 😉

Maybe a combination of HashSet and SQLite really is the way to go, combining speed with efficiency...

But I guess @KoalaBear84 knows best 😇

@KoalaBear84
Owner

Hmm. For the performance optimization I use a "write-ahead log"; this works great, but it 'pauses' every 500 or 1000 records/inserts.

I was thinking I might want some sort of queue for inserting the files: process directories on the fly and do the files in a separate thread. This way we can maybe have both.

Also a note for myself 😃
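
For reference, the two levers mentioned here (WAL mode and batching inserts into one transaction per chunk) might look roughly like this with Microsoft.Data.Sqlite; this is a sketch only, and the Files table and the library choice are assumptions, not ODD's actual code:

using System.Collections.Generic;
using Microsoft.Data.Sqlite;

public static class SqliteWriteSketch
{
    public static void Main()
    {
        using var connection = new SqliteConnection("Data Source=Scan.db");
        connection.Open();

        using (var pragma = connection.CreateCommand())
        {
            // Write-ahead log: writes are appended to a log and checkpointed in chunks.
            pragma.CommandText = "PRAGMA journal_mode=WAL;";
            pragma.ExecuteNonQuery();
        }

        using (var create = connection.CreateCommand())
        {
            create.CommandText = "CREATE TABLE IF NOT EXISTS Files (Url TEXT PRIMARY KEY, Size INTEGER);";
            create.ExecuteNonQuery();
        }

        InsertBatch(connection, new[] { ("https://OD/test.txt", 351L) });
    }

    // One transaction per batch: SQLite syncs to disk once per batch instead of once per row,
    // which is where most of the per-insert cost goes.
    public static void InsertBatch(SqliteConnection connection, IReadOnlyList<(string Url, long Size)> batch)
    {
        using var transaction = connection.BeginTransaction();
        using var insert = connection.CreateCommand();
        insert.Transaction = transaction;
        insert.CommandText = "INSERT OR IGNORE INTO Files (Url, Size) VALUES ($url, $size);";
        var url = insert.Parameters.Add("$url", SqliteType.Text);
        var size = insert.Parameters.Add("$size", SqliteType.Integer);

        foreach (var (u, s) in batch)
        {
            url.Value = u;
            size.Value = s;
            insert.ExecuteNonQuery();
        }

        transaction.Commit();
    }
}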

@KoalaBear84
Owner

Also linking this issue to #20 😜

@Chaphasilor
Contributor Author

Coming back to this issue: did I understand correctly that when using disk-based SQLite, the memory usage would be near zero, no matter how large the OD is?
This would indeed be a very compelling argument, but it should probably be optional unless it's really needed. Using up a few MBs or even GBs to scan twice as fast might be useful in some cases...

@maaaaz

maaaaz commented Sep 5, 2022

I also want to make a 'resume' function, so that you can pause the current scan (because you need to restart the server/computer) and continue the previous scan later.

Yes, a resume feature would be awesome!

@KoalaBear84
Owner

Well.. I sort of gave up on resume. Currently the whole processing of URLs depends on all data being in memory, because it looks 'up' 4 levels of directories to see if we are in a recursive/endless loop..

It's very hard to rewrite everything, and that costs a lot of time which I don't have / don't want to spend. 🙃
