A Web crawler is one of the system components required to operate a Web search service. It collects a large number of Web pages from the internet by accessing individual Web servers. 'goo' uses a Web crawler called 'ichiro', which is operated by NTT Resonant Inc. Web pages collected by ichiro are made searchable through the following services.
(a) Multi-Media Search: http://bsearch.goo.ne.jp
(b) Multi-Media Search for Mobile: http://mobile.mmm.nttr.co.jp/
(c) Mobile Search: http://mobile.goo.ne.jp/
- (a) and (b) have a multimedia index, so ichiro collects image, video, and
audio files as well as HTML pages. In addition, some of the collected pages
are used for R&D purposes.
ichiro's crawling policy
To avoid placing excessive load on Web servers, ichiro follows the rules below.
1) Obey 'robots.txt' on Web servers
'robots.txt' is a text file that can be placed at 'http://.../robots.txt' to limit crawlers' access to the server. ichiro reads 'robots.txt' and follows the "User-agent:", "Disallow:", and "Allow:" lines in the file.
How to make 'robots.txt'? -> http://www.robotstxt.org/wc/exclusion.html#robotstxt
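As a rough illustration of how a crawler can honor these three directives, the sketch below uses Python's standard urllib.robotparser module. The rules, the example.com URLs, and the use of "ichiro" as the user-agent token are assumptions made for the example; this is not ichiro's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are made up for this example.
ROBOTS_TXT = """\
User-agent: ichiro
Allow: /private/public.html
Disallow: /private/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ichiro" as the user-agent token is an assumption for this sketch.
for path in ("/index.html", "/private/secret.html", "/private/public.html"):
    url = "http://example.com" + path
    print(path, parser.can_fetch("ichiro", url))
```

Python's parser applies the first rule that matches a path, which is why the "Allow:" line is listed before the broader "Disallow:" line here. Other crawlers may resolve overlapping rules differently.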
2) Obey META tags in Web pages
META tags can be written in a Web page to limit collection on a per-page basis. ichiro reads and follows these tags: "NOFOLLOW", "NOARCHIVE", "NOIMAGEINDEX", "NONE", "FOLLOW", "INDEX", "ALL", "NOSERVE", "SERVE", "ARCHIVE", and "NOIMAGECLICK".
How to write META tags? -> http://www.robotstxt.org/wc/exclusion.html#meta
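For readers who want to see what reading such tags might look like, here is a minimal sketch using Python's standard html.parser module. The class and function names (RobotsMetaParser, robots_directives, may_follow_links) and the sample page are hypothetical, and how ichiro itself interprets each tag is not specified here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().upper())


def robots_directives(html):
    """Return the set of upper-cased robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def may_follow_links(directives):
    # NONE is conventionally shorthand for NOINDEX plus NOFOLLOW.
    return not ({"NOFOLLOW", "NONE"} & directives)


# Hypothetical page used only for illustration.
page = '<html><head><meta name="robots" content="noarchive, nofollow"></head></html>'
found = robots_directives(page)
print(sorted(found))
print(may_follow_links(found))
```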
3) Fetch one page at a time
From each Web server, ichiro collects only one page at a time. This also applies to virtual domains, where a single Web server has multiple domain names (for example, A and B): ichiro does not access A and B at the same time.
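One way to picture this rule is a gate that tracks which servers currently have a fetch in progress, keyed by the server rather than by the domain name. The sketch below is an assumption about how such a gate could be built; ServerGate, the resolve callback, and the hosts and addresses in FAKE_DNS are all hypothetical.

```python
import threading


class ServerGate:
    """Allow at most one in-flight fetch per physical server.

    Hosts are grouped by the key that `resolve` returns (for example an IP
    address), so two virtual domains on one server share a single slot.
    """

    def __init__(self, resolve):
        self._resolve = resolve      # caller-supplied: host name -> server key
        self._busy = set()           # server keys with a fetch in progress
        self._lock = threading.Lock()

    def try_acquire(self, host):
        """Return True and take the slot if the server is idle, else False."""
        key = self._resolve(host)
        with self._lock:
            if key in self._busy:
                return False
            self._busy.add(key)
            return True

    def release(self, host):
        """Free the slot once the fetch from this host has finished."""
        key = self._resolve(host)
        with self._lock:
            self._busy.discard(key)


# Made-up hosts and documentation-range addresses standing in for DNS.
FAKE_DNS = {
    "a.example": "192.0.2.10",
    "b.example": "192.0.2.10",   # virtual domain on the same server as A
    "c.example": "192.0.2.20",
}
gate = ServerGate(FAKE_DNS.__getitem__)

print(gate.try_acquire("a.example"))  # takes the slot for 192.0.2.10
print(gate.try_acquire("b.example"))  # same server as A, so it must wait
print(gate.try_acquire("c.example"))  # different server
gate.release("a.example")
print(gate.try_acquire("b.example"))  # A has finished, so B may proceed
```

In a real crawler the resolve callback could be something like socket.gethostbyname; a lookup table is used here so the example runs without network access.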
4) Proper interval time between accesses
ichiro leaves a sufficiently long interval between accesses to any single server. In particular, after collecting a large file such as an AVI, it waits considerably longer, though the actual interval may be adjusted as needed.
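The idea of a size-dependent interval can be sketched as follows. The actual intervals ichiro uses are not stated, so the constants below (BASE_DELAY_S, EXTRA_S_PER_MB, MAX_DELAY_S) are invented purely for illustration, as are the names delay_after and AccessSchedule.

```python
import time

# All three values are assumptions for this sketch, not ichiro's settings.
BASE_DELAY_S = 30.0      # pause after any fetch
EXTRA_S_PER_MB = 10.0    # additional pause per megabyte downloaded
MAX_DELAY_S = 600.0      # upper bound on the pause


def delay_after(bytes_fetched):
    """Return the pause in seconds before the same server is accessed again."""
    extra = EXTRA_S_PER_MB * (bytes_fetched / 1_000_000)
    return min(BASE_DELAY_S + extra, MAX_DELAY_S)


class AccessSchedule:
    """Remember, per server, the earliest time the next access is allowed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next_allowed = {}

    def record_fetch(self, server, bytes_fetched):
        # A larger download pushes the next allowed access further out.
        self._next_allowed[server] = self._clock() + delay_after(bytes_fetched)

    def seconds_until_allowed(self, server):
        remaining = self._next_allowed.get(server, 0.0) - self._clock()
        return max(0.0, remaining)


schedule = AccessSchedule()
schedule.record_fetch("192.0.2.10", 20_000)        # small HTML page
print(round(schedule.seconds_until_allowed("192.0.2.10")))
schedule.record_fetch("192.0.2.10", 50_000_000)    # large video file
print(round(schedule.seconds_until_allowed("192.0.2.10")))
```

The clock is passed in as a parameter so the schedule can be exercised with a fake clock instead of waiting in real time.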
If you have any other inquiries about ichiro, please contact our
help desk.
We'd appreciate it if you could put '[ichiro]' in the e-mail subject line so
that we can reply quickly.