First, if necessary, format and reinitialize Hadoop.

stop-dfs.sh   # stop the distributed file system (HDFS)
stop-yarn.sh  # stop the resource manager (YARN)
hdfs namenode -format
start-dfs.sh
# YARN is one of Hadoop's core components; it handles cluster resource
# management and job scheduling, so if you plan to run MapReduce jobs,
# start it with the command below.
start-yarn.sh

# In my case, I also had to clear out the data directory:
rm -rf /hadoop/data/*

# Health check
hadoop fsck /

A sample of my web server logs looks like this:

sevity@sevityubuntu:~/workspace/hadoop_sevity.com/session_analysis$ cat ../iis_logs/* | head
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2015-09-06 08:32:43
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2015-09-06 08:32:43 127.0.0.1 GET / - 80 - 127.0.0.1 Mozilla/5.0+(Windows+NT+10.0;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.85+Safari/537.36 - 200 0 0 211
2015-09-06 08:32:43 127.0.0.1 GET /iisstart.png - 80 - 127.0.0.1 Mozilla/5.0+(Windows+NT+10.0;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.85+Safari/537.36 http://127.0.0.1/ 200 0 0 2
2015-09-06 08:32:43 127.0.0.1 GET /favicon.ico - 80 - 127.0.0.1 Mozilla/5.0+(Windows+NT+10.0;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.85+Safari/537.36 http://127.0.0.1/ 404 0 2 2

From these logs, let's pull out the following columns and run MapReduce on them:

Date and time (2015-09-06 08:32:43)
Client IP (c-ip)
Requested URL (cs-uri-stem)
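
For reference, here is a minimal sketch (my own illustration; the sample line is shortened) of how these columns map to whitespace-split indexes, following the #Fields header above:

#!/usr/bin/env python3
# Index map for one W3C log line, per the "#Fields:" header
# (15 whitespace-separated fields; the User-Agent is abbreviated here).
line = "2015-09-06 08:32:43 127.0.0.1 GET / - 80 - 127.0.0.1 UA - 200 0 0 211"
parts = line.split()
date_time = parts[0] + " " + parts[1]  # date + time
url = parts[4]                         # cs-uri-stem
client_ip = parts[8]                   # c-ip (parts[2] is s-ip, the server)
print(date_time, client_ip, url)       # -> 2015-09-06 08:32:43 127.0.0.1 /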

First, the map step groups each client IP's requests into sessions with a 30-minute timeout and emits the URL list per session, like this:

119.207.72.161  {"start_time": "2020-10-03 00:14:47", "end_time": "2020-10-03 00:14:48", "duration": 1.0, "url_list": ["/favicon.ico", "/board/dhtml_logo.js", "/menu_wonilnet.asp", "/", "/css/common.css"]}
66.249.65.54    {"start_time": "2020-10-03 00:27:31", "end_time": "2020-10-03 00:27:31", "duration": 0.0, "url_list": ["/wiki/doku.php"]}
80.82.70.187    {"start_time": "2020-10-03 01:59:09", "end_time": "2020-10-03 01:59:09", "duration": 0.0, "url_list": ["/cache/global/img/gs.gif"]}
139.59.95.139   {"start_time": "2020-10-03 02:11:31", "end_time": "2020-10-03 02:11:31", "duration": 0.0, "url_list": ["/"]}
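
As a quick illustration of the 30-minute rule (a sketch only; the real logic is in mapper.py below), two hits from the same IP land in different sessions once the gap exceeds 1800 seconds:

from datetime import datetime, timedelta

session_timeout = timedelta(seconds=1800)  # 30 minutes
t1 = datetime.strptime("2020-10-03 00:14:47", "%Y-%m-%d %H:%M:%S")
t2 = datetime.strptime("2020-10-03 00:50:00", "%Y-%m-%d %H:%M:%S")
print((t2 - t1) > session_timeout)  # True -> t2 starts a new session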

Next, the reduce step aggregates, per IP, the total dwell time and the merged url_list:

66.249.64.128   {"total_duration": 1694.0, "url_list": ["/menu_wonilnet.asp", "/menu_friend.asp", "/robots.txt", "/", "/menu_wonilnet_newImage.asp", "/favicon.ico"], "start_time": "2017-09-03 04:19:21", "end_time": "2018-07-29 20:43:11"}
66.249.70.30    {"total_duration": 1694.0, "url_list": ["/menu_wonilnet.asp", "/menu_friend.asp", "/robots.txt", "/wiki/doku.php", "/wiki/lib/tpl/dokuwiki/images/favicon.ico", "/", "/wiki/feed.php", "/wiki/lib/exe/js.php", "/wiki/lib/exe/css.php", "/css/common.css", "/wiki/lib/exe/detail.php", "/favicon.ico", "/wiki/lib/exe/fetch.php"], "start_time": "2017-07-26 01:31:14", "end_time": "2022-02-01 09:49:49"}
2.37.132.159    {"total_duration": 1689.0, "url_list": ["/"], "start_time": "2018-03-23 22:13:24", "end_time": "2018-03-23 22:43:34"}

The folder structure is as follows; this project's working directory is session_analysis.

sevity@sevityubuntu:~/workspace/hadoop_sevity.com$ tree -L 1
.
├── iis_logs
├── ip_cnt
└── session_analysis

Write mapper.py as follows.

sevity@sevityubuntu:~/workspace/hadoop_sevity.com/session_analysis$ cat mapper.py
#!/usr/bin/env python3

import sys
import json
import codecs
from datetime import datetime, timedelta
import logging

logging.basicConfig(stream=sys.stderr, level=logging.ERROR)

previous_ip = None
url_list = []
start_time = None
end_time = None
session_timeout = timedelta(seconds=1800)  # 1800 seconds = 30 minutes

def reset_variables():
    global url_list, start_time, end_time
    url_list = []
    start_time = None
    end_time = None

# Using codecs to get stdin with a fallback in case of a UTF-8 decoding error
stdin = codecs.getreader('utf-8')(sys.stdin.buffer, errors='ignore')

for line in stdin:
    try:
        parts = line.strip().split()
        if len(parts) != 15:
            continue

        date_time = parts[0] + " " + parts[1]
        date_time = datetime.strptime(date_time, '%Y-%m-%d %H:%M:%S')

        ip = parts[8]  # changed from parts[2] to parts[8] to use the client IP
        url = parts[4]

        # If there is a previous ip and the current ip is different or a session timeout has occurred
        if previous_ip and (previous_ip != ip or (date_time - end_time) > session_timeout):
            print('%s\t%s' % (previous_ip, json.dumps({"start_time": str(start_time), "end_time": str(end_time), "duration": (end_time - start_time).total_seconds(), "url_list": list(set(url_list))})))
            reset_variables()

        if not start_time or date_time < start_time:
            start_time = date_time
        if not end_time or date_time > end_time:
            end_time = date_time

        url_list.append(url)
        previous_ip = ip

    except Exception as e:
        logging.error(f"Error processing line: {line.strip()}, Error: {e}")

# Print the last session
if previous_ip:
    print('%s\t%s' % (previous_ip, json.dumps({"start_time": str(start_time), "end_time": str(end_time), "duration": (end_time - start_time).total_seconds(), "url_list": list(set(url_list))})))

Write reducer.py as follows.

sevity@sevityubuntu:~/workspace/hadoop_sevity.com/session_analysis$ cat reducer.py
#!/usr/bin/env python3

import sys
import json

previous_ip = None
total_duration = 0.0
url_list = []
start_time = None
end_time = None

# Hadoop Streaming sorts mapper output by key, so every session for a
# given IP arrives on consecutive lines.
for line in sys.stdin:
    ip, session_info = line.strip().split('\t')
    session_info = json.loads(session_info)

    # The key changed: flush the aggregate for the previous IP and reset.
    if previous_ip and previous_ip != ip:
        print('%s\t%s' % (previous_ip, json.dumps({"total_duration": total_duration, "url_list": list(set(url_list)), "start_time": start_time, "end_time": end_time})))

        total_duration = 0.0
        url_list = []
        start_time = None
        end_time = None

    # Plain string comparison works here because zero-padded
    # "%Y-%m-%d %H:%M:%S" timestamps sort lexicographically in
    # chronological order.
    if not start_time or session_info["start_time"] < start_time:
        start_time = session_info["start_time"]
    if not end_time or session_info["end_time"] > end_time:
        end_time = session_info["end_time"]

    total_duration += float(session_info["duration"])
    url_list.extend(session_info["url_list"])

    previous_ip = ip

if previous_ip:
    print('%s\t%s' % (previous_ip, json.dumps({"total_duration": total_duration, "url_list": list(set(url_list)), "start_time": start_time, "end_time": end_time})))
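
Before submitting the job, it can help to smoke-test both scripts locally. The sketch below (my own addition; it assumes mapper.py and reducer.py sit in the current directory) emulates Hadoop's shuffle with a plain in-memory sort:

#!/usr/bin/env python3
# Local smoke test: mapper -> sorted "shuffle" -> reducer, no Hadoop.
import subprocess

sample = (
    "2015-09-06 08:32:43 127.0.0.1 GET / - 80 - 127.0.0.1 UA - 200 0 0 211\n"
    "2015-09-06 08:32:45 127.0.0.1 GET /favicon.ico - 80 - 127.0.0.1 UA - 404 0 2 2\n"
)
mapped = subprocess.run(["python3", "mapper.py"], input=sample,
                        capture_output=True, text=True).stdout
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
reduced = subprocess.run(["python3", "reducer.py"], input=shuffled,
                         capture_output=True, text=True).stdout
print(reduced, end="")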

Run the map-reduce job in parallel with the following command:

hdfs dfs -rm -r /output && \
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -file ~/workspace/hadoop_sevity.com/session_analysis/mapper.py \
    -mapper 'python3 mapper.py' \
    -file ~/workspace/hadoop_sevity.com/session_analysis/reducer.py \
    -reducer 'python3 reducer.py' \
    -input /logs/* -output /output \
    2>&1 | tee r.txt

Fetch the results from HDFS to the local filesystem.

rm -rf output && hadoop fs -get /output .

Sort the results in descending order of duration using the following Python script (sort.py):

import json
import sys

def get_duration(json_str):
    data = json.loads(json_str)
    return data.get('total_duration', 0)

lines = sys.stdin.readlines()

# Sort the lines by their total_duration value, descending
lines.sort(key=lambda line: get_duration(line.split('\t', 1)[1]), reverse=True)

for line in lines:
    print(line, end='')

Run it like this:

python3 sort.py < output/part-00000 > result.txt

result.txt then looks like this:

45.61.187.81    {"total_duration": 186353.0, "url_list": ["/"], "start_time": "2022-07-23 09:13:06", "end_time": "2022-08-22 20:57:09"}
127.0.0.1       {"total_duration": 161933.0, "url_list": ["/w/lib/tpl/dokuwiki/images/favicon.ico", "/wonilnet_pub/2010-06-09_192759_resize.png", "/w/lib/exe/css.php", "/image/top_menu_icon.gif", "/upgsvr/service.exe", "/favicon.ico", "/index.asp", "/w/lib/tpl/dokuwiki/images/pagetools-sprite.png", "/scrape", "/w/lib/tpl/dokuwiki/images/search.png", "/wiki/doku.php", "/w/lib/tpl/dokuwiki/images/usertools.png", "/", "/images/next5.gif", "/wonilnet_pub/2010-06-09_192656_resize.png", "/w/lib/tpl/dokuwiki/images/button-php.gif", "/image/rightconer_shadow.gif", "/menu_wonilnet_dbwork.asp", "/iisstart.png", "/w/lib/images/external-link.png", "/tc.js", "/w", "/css/common.css", "/a.php", "/images/num1_on_a.gif", "/w/lib/tpl/dokuwiki/images/button-donate.gif", "/w/lib/tpl/dokuwiki/images/button-html5.png", "/menu_friend.asp", "/w/lib/tpl/dokuwiki/images/button-dw.png", "/announce.php", "/w/doku.php", "/w/lib/tpl/dokuwiki/images/logo.png", "/announce", "/images/nonext.gif", "/images/num5_off.gif", "/image/garo_shadow.gif", "/scrape.php", "/image/sero_shadow.gif", "/w/lib/tpl/dokuwiki/images/button-css.png", "/images/num2_off.gif", "/menu_wonilnet.asp", "/favicon.png", "/image/leftconer_shadow.gif", "/images/num4_off.gif", "/images/num3_off.gif", "/image/line02.gif", "/board/dhtml_logo.js", "/w/lib/tpl/dokuwiki/images/page-gradient.png", "/w/lib/images/license/button/cc-by-sa.png", "/w/lib/exe/js.php", "/pub/02111401.jpg", "/w/lib/exe/indexer.php"], "start_time": "2015-09-06 08:32:43", "end_time": "2021-10-13 12:56:47"}
45.61.188.237   {"total_duration": 121316.0, "url_list": ["/"], "start_time": "2022-10-03 20:31:14", "end_time": "2022-10-26 20:10:54"}
64.62.252.174   {"total_duration": 102775.0, "url_list": ["/stock/test1/main.cpp", "/menu_wonilnet.asp", "/menu_friend.asp", "/stock/test1/id.cpp", "/robots.txt", "/wiki/lib/exe/css.php", "/wiki/lib/exe/opensearch.php", "/wiki/feed.php", "/wiki/doku.php", "/wiki/lib/exe/", "/wiki/lib/exe/indexer.php", "/wiki/lib/exe/js.php", "/wiki/lib/exe/mediamanager.php", "/wiki/", "/wiki/lib/exe/detail.php", "/wiki/lib/exe/fetch.php"], "start_time": "2022-08-11 04:54:40", "end_time": "2023-07-06 16:23:29"}

Through this process, I was able to spot accesses that were attempting to hack the server, such as:

124.173.69.66   {"total_duration": 1685.0, "url_list": ["/xw1.php", "/9510.php", "/link.php", "/sbkc.php", "/index.php", "/phpmyadmin/scripts/setup.php", "/xmlrpc.php", "/jsc.php.php", "/phpmyadmin3333/index.php", "/phpNyAdmin/index.php", "/hm.php", "/411.php", "/hhhhhh.php", "/key.php", "/92.php", "/pma/scripts/db___.init.php", "/aojiao.php", "/1/index.php", "/3.php", "/slider.php", "/plus/tou.php", "/g.php", "/ssaa.php", "/sss.php", "/lala-dpr.php", "/605.php", "/admin/index.php", "/plus/yunjitan.php", "/d.php", "/boots.php", "/win1.php", "/hell.php", "/uploader.php", "/images/vuln.php", "/error.php", "/que.php", "/099.php", "/aa.php", "/12345.php", "/666.php", "/api.php", "/plus/90sec.php", "/xiao.php", "/fusheng.php", "/xiaoxi.php", "/plus/ma.php", "/paylog.php", "/321/index.php", "/qwqw.php", "/qiangkezhi.php", "/liangchen.php", "/mysql.php", "/ganshiqiang.php", "/images/stories/cmd.php", "/xiong.php", "/datas.php", "/xixi.php", "/aaa.php", "/awstatstotals/awstatstotals.php", "/conflg.php", "/php2MyAdmin/index.php", "/think.php",
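
As a hypothetical follow-up (my own sketch, not part of the original job), such scanners can be flagged automatically by counting distinct .php paths per IP in result.txt; the threshold of 20 is arbitrary:

#!/usr/bin/env python3
# Flag IPs whose aggregated url_list contains many distinct .php
# paths -- a typical signature of a vulnerability scanner.
import json

with open("result.txt") as f:
    for line in f:
        ip, info = line.rstrip("\n").split("\t", 1)
        urls = json.loads(info)["url_list"]
        php_paths = {u for u in urls if u.endswith(".php")}
        if len(php_paths) >= 20:  # arbitrary threshold
            print(ip, len(php_paths))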