Blocking YouTube Crawling

Annoying ads that suddenly pop up while you are watching YouTube.

Even if Han Ye-seul stars in it, it can't be helped. Ads are annoying.

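You have to wait until the 'Skip Ad' button appears and then click it before you can get back to the video you were watching. Let's write a Python program that clicks the 'Skip Ad' button automatically.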

import datetime
import time

import pyautogui

size = pyautogui.size()
print('Screen Size: {0}'.format(size))

while True:
    try:
        nowTime = datetime.datetime.now()
        # region = (left, top, width, height)
        # You need to have OpenCV installed for the confidence keyword to work.
        try:
            location = pyautogui.locateCenterOnScreen('adskip.png', region=(1200, 750, 300, 100), confidence=0.7)
        except pyautogui.ImageNotFoundException:
            # Newer PyAutoGUI versions raise this instead of returning None.
            location = None
        if location is None:
            print("[{0}] Ad not found. (Press 'Ctrl + C' to quit)".format(nowTime.strftime('%H:%M:%S')))
            time.sleep(2.0)
            continue
        print('[{0}] Ad found at {1}'.format(nowTime.strftime('%H:%M:%S'), location))
        # Move to the button over 1 second, click it, then wait before searching again.
        pyautogui.moveTo(location[0], location[1], 1)
        pyautogui.click(button='left')
        time.sleep(5.0)
    except KeyboardInterrupt:
        print("Thank You.")
        break

Python Source Code
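The region (1200, 750, 300, 100) in the script is hard-coded for one particular screen. As a rough sketch, the search box could instead be derived from the screen size. The helper name skip_button_region and the default percentages below are hypothetical and are not part of the original script. They assume the 'Skip Ad' button sits near the bottom-right corner of the screen, so they will need tuning for your own layout.

```python
def skip_button_region(screen_width, screen_height, width_pct=25, height_pct=15):
    """Return (left, top, width, height) for a box in the bottom-right corner.

    width_pct and height_pct are illustrative guesses, not values from the
    original script. Integer arithmetic avoids float rounding surprises.
    """
    width = screen_width * width_pct // 100
    height = screen_height * height_pct // 100
    left = screen_width - width
    top = screen_height - height
    return (left, top, width, height)


# Example for a 1920x1080 screen.
print(skip_button_region(1920, 1080))  # (1440, 918, 480, 162)
```

The result can be passed as the region argument of pyautogui.locateCenterOnScreen, for example with the width and height returned by pyautogui.size().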

To find the 'Skip Ad' button, save this image as adskip.png.

Execution log

Let's confirm it in the video.


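You can prevent a page or other resource from appearing in Google Search by including a noindex meta tag or header in the HTTP response. The next time Googlebot crawls the page and sees the noindex tag or header, it drops the page entirely from Google Search results, regardless of whether other sites link to it.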

Because noindex lets you control access to your site page by page, it is useful when you do not have root access to your server.

Implementing noindex

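There are two ways to implement noindex: as a meta tag and as an HTTP response header. Both have the same effect, so choose whichever is more convenient for your site and appropriate for the content type.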

<meta> tag

To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag in the <head> section of the page.

<meta name="robots" content="noindex">

To prevent only Google web crawlers from indexing a page, add the following.

<meta name="googlebot" content="noindex">

Some search engine web crawlers may interpret the noindex directive differently. As a result, your page may still appear in the results of other search engines.
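To check that a page really carries one of the tags above, a small script can scan the HTML. This is a minimal sketch using only the standard library. The names RobotsMetaParser and blocks_indexing are made up for illustration, and the check only looks for the literal noindex value, so it does not reproduce how any real crawler interprets a page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots/googlebot <meta> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lowercased meta name -> set of directive values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        values = {v.strip().lower() for v in content.split(",") if v.strip()}
        if name and values:
            self.directives.setdefault(name, set()).update(values)


def blocks_indexing(html, crawler="googlebot"):
    """Return True if a generic or crawler-specific meta tag says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives.get("robots", set()) | parser.directives.get(crawler, set())
    return "noindex" in found


print(blocks_indexing('<head><meta name="robots" content="noindex"></head>'))  # True
```

Passing a different crawler name, such as the hypothetical "bingbot", shows that a googlebot-only tag does not apply to other crawlers.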

Learn more about the noindex meta tag

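Instead of a meta tag, you can return an X-Robots-Tag header with a value of noindex or none in the response. A response header can be used for non-HTML resources such as PDFs, video files, and image files. Here is an example of an HTTP response with an X-Robots-Tag that tells crawlers not to index a page.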

HTTP/1.1 200 OK
(…)
X-Robots-Tag: noindex
(…)
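How the header gets added depends on your server. As one possible sketch, a plain WSGI application in Python could attach it to a non-HTML response. The function name serve_pdf_noindex and the placeholder body are hypothetical. A real site would more likely set the header in its web server or framework configuration.

```python
def serve_pdf_noindex(environ, start_response):
    """WSGI app that returns a PDF-like body marked as noindex."""
    body = b"%PDF-1.4 placeholder"  # hypothetical stand-in for real file contents
    headers = [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(len(body))),
        # Tells crawlers that honor the header not to index this resource.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]
```

For local experiments, the function can be passed to make_server from the standard wsgiref.simple_server module.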

Learn more about the noindex response header

Letting Google know about your meta tags

Google has to crawl your page in order to see meta tags and HTTP headers. If a page still appears in search results, it may be because Google has not crawled it since you added the tag. Use the URL Inspection tool to ask Google to recrawl the page. The page also keeps appearing if your robots.txt file blocks Google web crawlers from this URL, because they then cannot see the tag. To unblock the page from Google, you must edit your robots.txt file. You can edit and test your robots.txt with the robots.txt Tester tool.
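To see whether a robots.txt rule is keeping a crawler away from a noindex page, the rules can be checked locally. This is a rough sketch using Python's standard urllib.robotparser. The helper name crawler_can_fetch, the sample rules, and the example.com URLs are all made up for illustration. The standard-library parser does not necessarily match Google's own rule matching in every case, so treat the result as a hint only.

```python
from urllib.robotparser import RobotFileParser


def crawler_can_fetch(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks Googlebot from /private/.
SAMPLE_ROBOTS = """\
User-agent: Googlebot
Disallow: /private/
"""

# Blocked: a noindex tag on this page could never be seen by the crawler.
print(crawler_can_fetch(SAMPLE_ROBOTS, "https://example.com/private/page.html"))  # False
# Allowed: the crawler can fetch the page and read its noindex tag.
print(crawler_can_fetch(SAMPLE_ROBOTS, "https://example.com/public.html"))  # True
```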

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2022-04-21 UTC.
