Horizontal scaling across multiple nodes #1916

Open
lausycampari opened this issue Dec 20, 2018 · 9 comments

Comments

@lausycampari

Is it possible to scale the crawler module and/or the search module across multiple computers, all operating concurrently on the same data set (similar to Elasticsearch, for example)? If not, a workaround would be to mount a networked file system and set it as the data path. Would that cause any problems with the software that you're aware of, besides the obvious increase in read/write latency?

@ROBERT-MCDOWELL

I'm also interested in this question.

@jelutz77

I'm pretty sure this would be compatible with the idea of a federated search, such as Elasticsearch. The biggest challenge with this approach is developing a protocol to share the results of crawling without essentially having to do the work of crawling again. There are a couple of protocols out there that fail to do this effectively, or that fail to assign weights to different aspects of a page and so lose much of the information in the HTML.
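To make the "share crawl results with weights" idea concrete, here is a minimal sketch of a record that one node could serialize and another could score without re-crawling. Everything in it is an assumption for illustration: the `CrawlRecord` type, the field names, and the weight values are hypothetical and are not part of this project or of any existing protocol.

```python
import json
from dataclasses import dataclass, asdict, field

# Hypothetical per-field weights; the numbers are illustrative only.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}


@dataclass
class CrawlRecord:
    """One crawled page, keeping the HTML structure as named text fields."""
    url: str
    fetched_at: str  # ISO 8601 timestamp of the fetch
    fields: dict = field(default_factory=dict)  # field name -> extracted text

    def to_json(self) -> str:
        # Serialize for transfer to another node.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(payload: str) -> "CrawlRecord":
        # Rebuild the record on the receiving node.
        return CrawlRecord(**json.loads(payload))


def term_score(record: CrawlRecord, term: str) -> float:
    """Weighted count of whitespace-separated tokens equal to `term`."""
    term = term.lower()
    score = 0.0
    for name, text in record.fields.items():
        count = text.lower().split().count(term)
        score += FIELD_WEIGHTS.get(name, 1.0) * count
    return score
```

Because the fields travel separately, the receiving node can still apply its own weights to titles versus body text instead of getting one flattened blob.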

@jelutz77

Another approach to this issue would be to separate servers based on their functionality. The part of the system that absolutely must be kept together is the website metadata, so a separate database server would be the first part of this solution. Another server, or multiple servers, could do the crawling and feed the database over the network. Yet another server (possibly a shared one) could handle web functions, such as supplying a web interface for users or accessing the database via an API.
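A rough sketch of that split, assuming a single shared store that crawlers write to and the web front end reads from. SQLite (in memory) stands in for the dedicated database server only so the example runs on its own; the table layout and the function names `open_store`, `crawler_submit`, and `web_search` are hypothetical, not this project's API.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the shared metadata store (stand-in for a database server)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def crawler_submit(conn: sqlite3.Connection, url: str, title: str, body: str) -> None:
    """Crawler role: write or refresh one page in the shared store."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()


def web_search(conn: sqlite3.Connection, term: str) -> list:
    """Web role: read-only substring lookup against the shared store."""
    like = f"%{term}%"
    rows = conn.execute(
        "SELECT url FROM pages WHERE title LIKE ? OR body LIKE ? ORDER BY url",
        (like, like),
    )
    return [row[0] for row in rows]
```

In a real deployment each role would run on its own machine and connect to the database over the network; only the store needs to hold the full data set.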

@ROBERT-MCDOWELL

  • The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again
    Why not SharedObjects?

@ROBERT-MCDOWELL

  • The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again
    Why not SharedObjects?
  • And another server could perform web functions, such as supply a web interface for users (possibly a shared server), or access the database via API.
    Maybe the concept of a cluster would be more efficient: use a UDP-based protocol (like a DNS server) to share everything new or modified instantly. The SharedObjects would work out which part changed, so only the new or modified bytes are passed to the stream.
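The "send only the new or modified bytes" part can be sketched as a simple delta between two versions of a document. This is an illustration using Python's standard `difflib`, not a description of how SharedObjects actually works; the `make_delta` and `apply_delta` names and the op format are assumptions, and the transport (UDP or otherwise) is left out.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # The receiver already has these bytes; send only offsets.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only changed or added bytes travel as payload.
            ops.append(("data", new[j1:j2]))
        # "delete" needs no op: the bytes are simply not copied.
    return ops


def apply_delta(old: bytes, ops: list) -> bytes:
    """Rebuild the new version on the receiving node."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

A node that already holds the old version needs only the ops list to catch up, which is small when little has changed.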

@jelutz77

SharedObjects? This is the first I've heard of that. It's generally better to use simpler, more efficient, or more mainstream software than a more novel idea, unless some feature of the newer idea adds measurable value. I'm not familiar with this structure, so I don't have a reason to use SharedObjects.

@jelutz77

jelutz77 commented Jul 21, 2019

A cluster? Databases can be clustered and can copy data between nodes synchronously or asynchronously, which is roughly how I understand SharedObjects to work. However, the amount of data involved would make keeping a copy on each search or web server impractical. Besides, a single dedicated database server could easily handle the transaction load for a sizable cluster of web servers by itself. Existing clustering configurations for a dedicated database cluster can further expand scalability to dozens or hundreds of web servers.

@ROBERT-MCDOWELL

Well, nothing is impossible in a digital world. How do you think FB and others manage their DBs among hundreds of thousands of servers?
Shared Objects is a 10-year-old feature. It is more recent in JS, but it exists in Java, ActionScript, etc.

@ROBERT-MCDOWELL

I also forgot the torrent protocol, which could also be interesting to explore.


3 participants