Thursday, June 5, 2014

Christopher Laux's motivation to work on his own search engine for blogs comes from dissatisfaction with the quality of existing blog search engines: the results he wanted from searching within blogs could not be obtained through services such as Technorati or the blog searches from Google and Yahoo. At Barcamp Berlin, he discussed his experience with the first steps of building his own blog search engine.
The system works with a "Spider" that locates blogs and a "Visitor" whose task is to fetch the latest entries. These two components are already implemented. Still to be built are the search algorithms and a web interface that lets users perform the actual search.
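To make the division of labor concrete, here is a minimal sketch of such a Spider/Visitor pair in Python. The names, the link-extraction regex, and the crude placeholder URL rule are assumptions for illustration; this is not Laux's actual code.

    import re
    import urllib.request

    seen_urls = set()   # URLs the Spider has already queued
    blog_queue = []     # blogs waiting for the Visitor

    def looks_like_blog(url):
        # Placeholder for the rule-based detection discussed below.
        return "blog" in url

    def spider(page_url):
        """Discover new blog URLs on a page and queue them for the Visitor."""
        html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
        for url in re.findall(r'href="(https?://[^"]+)"', html):
            if url not in seen_urls and looks_like_blog(url):
                seen_urls.add(url)
                blog_queue.append(url)

    def visitor(blog_url):
        """Re-fetch a known blog and return the links found on it,
        from which the latest entries would be extracted."""
        html = urllib.request.urlopen(blog_url).read().decode("utf-8", "replace")
        return re.findall(r'href="(https?://[^"]+)"', html)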
In the Spider concept, link trees are rated, for example as a "spam branch" (which is cut off, thereby also "cutting off" many other sources of spam) or, say, a "search branch in Italian". If the blog search engine spiders monolingually (i.e., for example, only in English), foreign-language blogs must be filtered out. Blogs in the desired language can be identified quite simply by a high proportion of words from an appropriate lexicon.
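The lexicon heuristic fits in a few lines of Python; the mini-lexicon of common English words and the 30% threshold here are illustrative assumptions, not values from the talk.

    import re

    # Small stand-in for a real lexicon of common English words.
    ENGLISH_LEXICON = {"the", "and", "of", "to", "in", "is", "that",
                       "for", "it", "with", "on", "as", "this", "are"}

    def is_english(text, threshold=0.3):
        """Classify text as English if enough of its words come from the lexicon."""
        words = re.findall(r"[a-z]+", text.lower())
        if not words:
            return False
        hits = sum(1 for w in words if w in ENGLISH_LEXICON)
        return hits / len(words) >= threshold

    print(is_english("the cat sat on the mat and it was warm"))  # True
    print(is_english("der Hund lief schnell durch den Park"))    # False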
It must also be prevented that the Spider "runs in circles", because detecting such cycles in an index of millions of blogs is problematic in terms of computing capacity. In general, the system also tries to determine which parts of a blog contain which information (archive links, internal links, blogroll), and in principle semantic conclusions can be drawn from these structures as well. Blogs themselves are recognized by their typical URL structure: so far, 80% of blogs are detected on the basis of 30 rules. Taking RSS as an indicator has not proven useful.
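The post does not list the 30 rules, but a rule set of this kind might look like the following sketch; these particular patterns are assumptions, not the original rules.

    import re

    # Illustrative URL rules in the spirit of the ~30 mentioned above.
    BLOG_URL_RULES = [
        re.compile(r"https?://[^/]+\.blogspot\.com/"),
        re.compile(r"https?://[^/]+\.wordpress\.com/"),
        re.compile(r"https?://[^/]+/blog(/|$)"),
        re.compile(r"https?://[^/]+/\d{4}/\d{2}/"),  # date-based permalinks
    ]

    def is_blog_url(url):
        return any(rule.search(url) for rule in BLOG_URL_RULES)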
In operation, the Spider draws 1 GB/hour, so a server with unlimited traffic is recommended. A Bloom filter makes it unnecessary to keep all previously detected URLs in memory, and the data files are structured linearly.
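A Bloom filter answers "have I seen this URL?" from a fixed-size bit array, at the cost of a small false-positive rate, so the full URL set can stay on disk. A toy version, with arbitrary sizing chosen for illustration:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=8_000_000, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, item):
            # Derive num_hashes independent bit positions from the item.
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    seen = BloomFilter()
    seen.add("http://example.com/blog/")
    print("http://example.com/blog/" in seen)   # True
    print("http://example.org/other/" in seen)  # almost certainly False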
The Visitor is structured similarly to the Spider. Here, however, around 30 GB/hour accrue, probably because the designated machine is faster. Instead of 30 connections like the Spider, 50 parallel connections are open here.
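Capping the number of open connections is straightforward with a thread pool; this sketch merely assumes the figures above (30 workers for the Spider, 50 for the Visitor) and is not the original implementation.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()

    def fetch_all(urls, max_workers):
        # max_workers bounds the number of parallel connections:
        # 30 for the Spider, 50 for the Visitor.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(fetch, urls))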
One problem was that, at half a million blogs, the system eventually ran out of blogs: the database covers only half a million blogs and assumes that these are the "best" half million. From the "who links to whom" statistics one can conclude that the long-tail relationship is in fact inversely exponential: there are fewer and fewer blogs with more and more links pointing to them. For the statisticians among the readers: in a double-logarithmic plot, this yields an approximately straight line.
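A straight line in a double-logarithmic plot is the signature of a power law: writing N(k) for the number of blogs with k inbound links, it means

    N(k) \approx C \, k^{-\alpha}
    \quad\Longleftrightarrow\quad
    \log N(k) \approx \log C - \alpha \log k,

which is linear in log k with slope -alpha.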
Timo Derstappen has built a similar system as an alert system for PR departments, based on Technorati data; its ultimate goal was to raise a PR alarm when certain terms "peak". Terms were semantically linked by hand in Topic Maps and can be used to evaluate text content. The hardware requirements are relatively high and difficult to meet in the context of a hobby project. Invalid HTML or RSS, as well as spam blogs, were also a problem for this system, which is currently not deployed anywhere.
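The post does not describe how "peaking" was detected; one simple assumed approach is to compare each day's mention count against a moving-average baseline, as in this sketch (the window and factor are made-up parameters):

    def peaks(counts, window=7, factor=3.0):
        """Flag positions whose count exceeds `factor` times the
        average of the preceding `window` counts."""
        alarms = []
        for i in range(window, len(counts)):
            baseline = sum(counts[i - window:i]) / window
            if baseline > 0 and counts[i] > factor * baseline:
                alarms.append(i)
        return alarms

    # Daily mention counts of one term; the jump on day 9 raises the alarm.
    print(peaks([2, 3, 2, 4, 3, 2, 3, 2, 3, 30]))  # [9]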