crawler Archives - Tsung's Blog

What Crawl-delay does in robots.txt

robots.txt has a Crawl-delay directive. What is it for?

Read more: What Crawl-delay does in robots.txt
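As a rough illustration of how a crawler might read this setting, the sketch below parses a robots.txt body with Python's standard-library `urllib.robotparser` and looks up the delay for a user agent. The robots.txt content, the `ExampleBot` name, and the `read_crawl_delay` helper are assumptions made up for this example; they do not come from the post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def read_crawl_delay(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = read_crawl_delay(SAMPLE_ROBOTS_TXT, "ExampleBot")
print(delay)
```

A polite crawler could sleep for that many seconds between requests. Whether a given crawler honors the directive at all is up to that crawler.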

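How Google and Googlebot treat HTTP status codes

Google / Googlebot does not handle every HTTP status code. The table in the following document lists which status codes it handles and how each one is treated: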

How HTTP status codes, and network and DNS errors affect Google Search
- We cover the top 20 status codes that Googlebot encountered on the web, and the most prominent network and DNS errors.
- Rarer status codes, such as 418 (I'm a teapot), are not covered.
  - The document specifically calls out 418 as unsupported. What is 418? See this post: HTTP Status Code 418: teapot

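How to reduce Googlebot's crawl rate (frequency)

Googlebot is crawling too aggressively. How do you ask it to slow down?

Read more: How to reduce Googlebot's crawl rate (frequency)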

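Google Podcast-specific tags and RSS

Search engines crawl text and image content. What can be done for podcasts?

To get a podcast listed on Google, you need to produce an RSS feed. Which tags are required in that feed?

Read more: Google Podcast-specific tags and RSS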

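Sending noindex to crawler bots in an HTTP header with PHP

When you do not want search engine (Google, Bing) crawler bots to crawl certain pages, there are several options:

- An HTML meta tag
- robots.txt
- An X-Robots-Tag HTTP header

This post mainly documents the HTTP header approach.

Read more: Sending noindex to crawler bots in an HTTP header with PHP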
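The post itself works in PHP. As a language-neutral sketch of the same idea, the minimal Python WSGI app below attaches `X-Robots-Tag: noindex` to its response. The `noindex_app` and `call_app` names and the page body are hypothetical, chosen only for this example.

```python
def noindex_app(environ, start_response):
    """Minimal WSGI app whose response carries an X-Robots-Tag: noindex header."""
    body = b"<html><body>Not meant for search results</body></html>"
    response_headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # The header this entry is about: ask crawlers not to index the page.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", response_headers)
    return [body]


def call_app(app):
    """Invoke a WSGI app in-process and return (status, headers dict, body)."""
    captured = {}

    def start_response(status, response_headers):
        captured["status"] = status
        captured["headers"] = dict(response_headers)

    body = b"".join(app({}, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call_app(noindex_app)
print(status, headers["X-Robots-Tag"])
```

To try it over real HTTP, the app can be passed to `wsgiref.simple_server.make_server` from the standard library. The header approach is handy for non-HTML responses such as PDFs or images, where a meta tag cannot be embedded.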

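Google open-sources its robots.txt parser and pushes to make REP an official standard

The robots.txt text file lets you specify what may and may not be crawled, and most search engine crawlers follow this convention.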
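To make the allow/disallow idea concrete, here is a small sketch that checks URLs against a robots.txt body. It uses Python's standard-library `urllib.robotparser`, not the parser Google open-sourced, and the two may resolve conflicting rules differently. The rules, the `ExampleBot` name, the `example.com` URLs, and the `is_allowed` helper are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_home = is_allowed(SAMPLE_RULES, "ExampleBot", "https://example.com/")
allowed_admin = is_allowed(SAMPLE_RULES, "ExampleBot", "https://example.com/admin/users")
print(allowed_home, allowed_admin)
```

A crawler that respects the file would run a check like this before each request and skip any URL for which it returns `False`.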

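Origin of robots.txt: Martijn Koster created the initial REP standard in 1994. With additions from other webmasters, REP became an industry standard, but it has not yet become an official internet standard.

robots.txt RFC: A Method for Web Robots Control

Read more: Google open-sources its robots.txt parser and pushes to make REP an official standard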