
bom's happy life

Python - How to use web scraping (crawling), bs4


bompeach 2022. 6. 3. 11:26

Crawling means collecting and classifying data.

It usually refers to collecting web pages from the internet, then classifying and storing them;

indexing where each piece of data is located is the main purpose of crawling. [Google '크롤링 뜻' ("meaning of crawling")]

 

 

To do "crawling", you need to do two things.

 

1. Send a request to the URL (as if typing it into the address bar) and fetch the HTML.

(This is what we already did earlier with requests.)

2. Use the beautifulsoup library to pull out the information you want.

 

Installing beautifulsoup (same as when installing requests)

'File'  →   'Settings'  →  'Project: prac_python'  →  on the Python Interpreter screen, click the + button
→   search for 'bs4' and install it!
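If you are not using the settings screen, the same package can usually be installed from a terminal with pip install beautifulsoup4 (the PyPI distribution is named beautifulsoup4, while the module you import is bs4). As a small sketch, the check below reports whether a module can be found by the current interpreter. The helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# Check the two libraries used in this post.
for name in ("requests", "bs4"):
    status = "installed" if is_installed(name) else "missing"
    print(name, "->", status)
```

If either line says missing, the install probably went into a different interpreter than the one the project is using.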

 


 

# Now, how to use beautifulsoup!!

 

[Code snippet] Basic crawling setup

import requests
from bs4 import BeautifulSoup

# Request the target URL and receive its HTML.
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=pnt&date=20210829',headers=headers)

# Use the BeautifulSoup library to turn the HTML into a form that is easy to search.
# The variable soup now holds the "easy-to-parse html".
# From here, write code to extract the parts you need.
soup = BeautifulSoup(data.text, 'html.parser')

#############################
# (write your own code here)
#############################

 


 

If you run print(soup), you can see the HTML arrive in the run window at the bottom. (Paste in your url and always check that it comes in properly.)
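Printing the whole soup can flood the run window. As a rough sketch of a lighter check, the snippet below parses a tiny made-up HTML string (SAMPLE_HTML stands in for data.text, so no request is sent) and prints the page title plus the first few lines of the prettified document. With a real response, looking at data.status_code (200 means success) before parsing is another quick sanity check.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for data.text (assumption for this sketch).
SAMPLE_HTML = "<html><head><title>Movie ranking</title></head><body><p>hello</p></body></html>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The <title> text is a quick hint that the right page came back.
page_title = soup.title.text
print(page_title)

# Show only the first five lines instead of the whole document.
preview_lines = soup.prettify().splitlines()[:5]
print("\n".join(preview_lines))
```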

 


 

 

Let's learn how to use select / select_one.

 

Example)

Let's fetch the movie titles!

To print the text inside a tag → tag.text
To print an attribute of a tag → tag['attribute']
import requests
from bs4 import BeautifulSoup

# Request the URL and receive its HTML.
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=pnt&date=20210829',headers=headers)

# Use the BeautifulSoup library to turn the HTML into a form that is easy to search.
soup = BeautifulSoup(data.text, 'html.parser')

# Use select to get all the tr rows.
movies = soup.select('#old_content > table > tbody > tr')

# Loop over movies (the tr rows).
for movie in movies:
    # If this row contains a title link (a tag),
    a_tag = movie.select_one('td.title > div > a')
    if a_tag is not None:
        # print the text of that a tag.
        print(a_tag.text)

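What does if a_tag is not None: mean?  (Here, if a_tag != None gives the same result; is not None is the idiomatic way to write it.)

Some of the tr rows are just the horizontal divider lines you see on the page! They contain no title link, so select_one returns None for them, and this check skips those rows so that only real values are printed.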
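To see the None case directly, here is a minimal sketch on a made-up table (SAMPLE_HTML and the class name line are invented for illustration, not copied from the real page). The middle row is only a divider, so select_one finds no link in it and returns None. Without the check, a_tag.text on that row would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up table: the middle row is only a divider and has no title link.
SAMPLE_HTML = """
<table>
  <tr><td class="title"><div><a href="/m/1">Movie A</a></div></td></tr>
  <tr><td colspan="8" class="line"></td></tr>
  <tr><td class="title"><div><a href="/m/2">Movie B</a></div></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

titles = []
for row in soup.select("table > tr"):
    a_tag = row.select_one("td.title > div > a")
    print(repr(a_tag))  # the divider row prints None
    if a_tag is not None:
        titles.append(a_tag.text)

print(titles)
```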

 

 

 

 

 

Let's look at the other selector forms that beautifulsoup's select supports.

# Using selectors (copy selector)
soup.select('tag_name')
soup.select('.class_name')
soup.select('#id_name')

soup.select('parent_tag > child_tag > child_tag')
soup.select('parent_tag.class_name > child_tag.class_name')

# Finding by tag and attribute value
soup.select('tag_name[attribute="value"]')

# When you want only one result
soup.select_one('same as above')
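The patterns above can be tried out on a small made-up document. This is only a sketch: SAMPLE_HTML and every tag, class, and id in it are invented for illustration. Note that select always returns a list (empty when nothing matches), while select_one returns a single tag or None.

```python
from bs4 import BeautifulSoup

# Made-up HTML for trying each selector form (assumption for this sketch).
SAMPLE_HTML = """
<div id="content">
  <ul class="movies">
    <li class="item"><a href="/m/1" title="first">Movie A</a></li>
    <li class="item"><a href="/m/2" title="second">Movie B</a></li>
  </ul>
  <p class="note">updated daily</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_tag = soup.select("li")                         # tag name
by_class = soup.select(".note")                    # class name
by_id = soup.select("#content")                    # id
by_path = soup.select("ul.movies > li.item > a")   # parent > child, with classes
by_attr = soup.select('a[title="second"]')         # tag with an attribute value
first_only = soup.select_one("li.item > a")        # first match only, not a list

print(len(by_tag), len(by_class), len(by_id), len(by_path))
print(by_attr[0].text, first_only.text)
```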

 

 

 

You can also use Chrome DevTools for reference.

 

1. Right-click the part you want → Inspect

2. Right-click the tag you want

3. Copy → Copy selector copies the selector
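A copied selector can be pasted straight into select_one. The sketch below uses a made-up page (SAMPLE_HTML and the selector string are illustrative, not taken from the real site). One thing worth knowing: the browser inserts a tbody element into a table when the page source omits it, and html.parser does not, so a copied selector containing tbody may match nothing if the raw HTML lacks it. In this sample the tbody is written out in the HTML, so the copied selector works as is.

```python
from bs4 import BeautifulSoup

# Made-up page whose table has an explicit <tbody> (assumption for this sketch).
SAMPLE_HTML = """
<div id="old_content">
  <table>
    <tbody>
      <tr><td class="title"><div><a href="/m/1">Movie A</a></div></td></tr>
      <tr><td class="title"><div><a href="/m/2">Movie B</a></div></td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Pretend this string came from Copy -> Copy selector in the browser.
copied = "#old_content > table > tbody > tr:nth-child(2) > td.title > div > a"

tag = soup.select_one(copied)
second_title = tag.text if tag is not None else None
print(second_title)
```

If a copied selector returns None on a real page, removing tbody from the selector is one thing to try.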

 

 

 


 

# "Crawling" practice

 

url: https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=pnt&date=20210829

Let's try crawling (web scraping) the Naver movie ranking page!

 

 

Let's print the rank, title, and star rating!

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=pnt&date=20210829',headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

movies = soup.select('#old_content > table > tbody > tr')

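# Selectors copied from Chrome for two titles; only the tr:nth-child(n) part differs:
#old_content > table > tbody > tr:nth-child(3) > td.title > div > a
#old_content > table > tbody > tr:nth-child(4) > td.title > div > a


for movie in movies:
    a = movie.select_one('td.title > div > a')
    # Divider rows have no title link, so skip them.
    if a is not None:
        title = a.text
        # The rank is stored in the alt attribute of the img in the first cell.
        rank = movie.select_one('td:nth-child(1) > img')['alt']
        star = movie.select_one('td.point').text

        print(rank,title,star)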

 

The for statement means we will loop: the indented code runs once for each tr in movies. (I don't know exactly what "looping" does under the hood, but I get the idea ;)
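To make "looping" a bit more concrete, here is a tiny sketch in plain Python. The list of strings is a stand-in for the tr tags (an assumption for illustration); on each pass the loop variable holds the next item, and the indented lines run once per item.

```python
# A plain list standing in for the tr rows (assumption for this sketch).
movies = ["row 1", "row 2", "row 3"]

visited = []
for movie in movies:
    # On each pass, movie holds the next item of the list.
    visited.append(movie)
    print("now handling:", movie)

# The body ran once per item, in order.
print(len(visited))
```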

 

if a is not None: (this means skipping the horizontal lines you see on the page.)

 

About title:

It means taking only the text from the part we assigned to a.

 

 

About rank:

We want to get the rank! First type movie.select_one and open the parenthesis,

then right-click → Inspect in Chrome  →  Copy  →  Copy selector and paste, which gives

#old_content > table > tbody > tr:nth-child(2) > td:nth-child(1) > img

which is this long.

 

Above, we already specified

movies = soup.select('#old_content > table > tbody > tr')

so from the copied rank selector, delete everything from the start through the tr part

and put in only td:nth-child(1) > img! Then add ['alt'] and only the rank comes out cleanly.

(If you don't add ['alt'], the whole img tag is printed, which is long and messy.)
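The difference between the tag and its alt attribute is easy to see on a made-up table. In this sketch SAMPLE_HTML, the image paths, and the titles are invented for illustration. The selector passed to movie.select_one is relative to that tr, which is why the part up to tr can be dropped.

```python
from bs4 import BeautifulSoup

# Made-up ranking rows; the rank lives in the img alt attribute (assumption).
SAMPLE_HTML = """
<table><tbody>
  <tr><td><img src="/rank/01.gif" alt="01" width="14"></td><td class="title"><div><a>Movie A</a></div></td></tr>
  <tr><td><img src="/rank/02.gif" alt="02" width="14"></td><td class="title"><div><a>Movie B</a></div></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movies = soup.select("table > tbody > tr")

ranks = []
for movie in movies:
    img = movie.select_one("td:nth-child(1) > img")  # searched inside this tr only
    print(img)         # the whole <img ...> tag
    print(img["alt"])  # only the alt value
    ranks.append(img["alt"])
```

If an attribute might be missing, img.get("alt") returns None instead of raising a KeyError.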

 

 

About star:

Same as rank!

 

#old_content > table > tbody > tr:nth-child(2) > td.point

From the copied star rating selector, delete everything through tr

and keep only td.point at the end. With just this, the whole td tag is printed, which is long and messy,

but adding .text prints only the rating number cleanly!
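Here is the same idea as a short sketch on a one-cell table (SAMPLE_HTML is made up for illustration). Printing the tag shows the markup, and .text gives only the contents. The value is still a string, so converting it with float is one option if you later want to sort or compare ratings.

```python
from bs4 import BeautifulSoup

# Made-up row with a single rating cell (assumption for this sketch).
SAMPLE_HTML = '<table><tr><td class="point">9.64</td></tr></table>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row = soup.select_one("tr")

cell = row.select_one("td.point")
print(cell)       # <td class="point">9.64</td>
print(cell.text)  # 9.64

# .text is a string; convert it if you need a number.
star = float(cell.text)
print(star > 9.5)
```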

 

 

Write the values in the order you want them to appear and print them:

print(rank, title, star)

 

and the output looks like this~!

 

01 밥정 9.64
02 그린 북 9.59
03 가버나움 9.59