새벽코딩
[Big Data Analysis Engineer (빅데이터분석기사)] Practical Exam Past Question: Task Type 2
https://www.datamanim.com/dataset/practice/ex4.html
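Problem
The target variable is Segmentation. For test.csv, predict the Segmentation class for each ID, save the predictions, and submit them. The submission must contain exactly two columns, ID and Segmentation. The evaluation metric is the macro F1 score.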
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv')
display(train.head(2))
test.head(2)
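The problem names macro F1 as the metric, so it helps to see how it differs from micro F1 when classes are imbalanced. The block below is a minimal, library-free sketch. The helper names `macro_f1` and `micro_f1` and the toy labels are assumptions made for illustration and are not part of the original problem or dataset.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

def micro_f1(y_true, y_pred):
    """For single-label multiclass data, micro F1 equals plain accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels (made up): class 'A' dominates and 'B' is never predicted
y_true = ['A', 'A', 'A', 'A', 'B', 'C']
y_pred = ['A', 'A', 'A', 'A', 'A', 'C']

print("macro F1:", round(macro_f1(y_true, y_pred), 4))
print("micro F1:", round(micro_f1(y_true, y_pred), 4))
```

Because macro averaging weights every class equally, missing a rare class entirely pulls the score down much more than it does under micro averaging. That is why validation should use the same averaging mode as the grader.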
Solution
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv')
# Separate the target, drop the non-feature columns, and keep the test IDs for the submission
y = train['Segmentation']
train_drop = train.drop(columns=['ID', 'Segmentation'])
test_drop = test.drop(columns=['ID'])
ID = test['ID']
# Fill missing values with fixed constants, using the same value for train and test
train_drop['Ever_Married'] = train_drop['Ever_Married'].fillna('Yes')
train_drop['Graduated'] = train_drop['Graduated'].fillna('Yes')
train_drop['Profession'] = train_drop['Profession'].fillna('Artist')
train_drop['Work_Experience'] = train_drop['Work_Experience'].fillna(0.0)
train_drop['Family_Size'] = train_drop['Family_Size'].fillna(2.0)
train_drop['Var_1'] = train_drop['Var_1'].fillna('Cat_6')
test_drop['Ever_Married'] = test_drop['Ever_Married'].fillna('Yes')
test_drop['Graduated'] = test_drop['Graduated'].fillna('Yes')
test_drop['Profession'] = test_drop['Profession'].fillna('Artist')
test_drop['Work_Experience'] = test_drop['Work_Experience'].fillna(0.0)
test_drop['Family_Size'] = test_drop['Family_Size'].fillna(2.0)
test_drop['Var_1'] = test_drop['Var_1'].fillna('Cat_6')
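# Label encoding: fit each encoder on train only, then reuse it for test
from sklearn.preprocessing import LabelEncoder
labels = ['Profession', 'Var_1']
for col in labels:
    en = LabelEncoder()
    train_drop[col] = en.fit_transform(train_drop[col])
    # Assumes every test category also appears in train; transform raises ValueError otherwise
    test_drop[col] = en.transform(test_drop[col])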
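# One-hot encoding (dummy variables)
category = ['Gender', 'Ever_Married', 'Graduated', 'Spending_Score']
for i in category:
    train_drop[i] = train_drop[i].astype('category')
    test_drop[i] = test_drop[i].astype('category')
train_drop = pd.get_dummies(train_drop)
test_drop = pd.get_dummies(test_drop)
# Align test columns to train so both have the same features in the same order
test_drop = test_drop.reindex(columns=train_drop.columns, fill_value=0)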
# Standardize the numeric columns; the scaler is fit on train only and reused for test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
target = ['Age', 'Work_Experience', 'Family_Size']
scaler.fit(train_drop[target])
train_drop[target] = scaler.transform(train_drop[target])
test_drop[target] = scaler.transform(test_drop[target])
# Train/validation split
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train_drop, y, test_size=0.2, stratify=y, random_state=1)
# Model training: a default random forest and a shallower one with more trees
from sklearn.ensemble import RandomForestClassifier
model1 = RandomForestClassifier(random_state=23).fit(X_train, y_train)
prd1 = model1.predict(X_valid)
model2 = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=23).fit(X_train, y_train)
prd2 = model2.predict(X_valid)
# Validation with macro F1, the metric the problem specifies
from sklearn.metrics import f1_score
#help('sklearn.metrics.f1_score')
print("macro F1 (model1): ", f1_score(y_valid, prd1, average='macro'))
print("macro F1 (model2): ", f1_score(y_valid, prd2, average='macro'))
# Predict on the test set with model2 and save a submission with only ID and Segmentation
result = model2.predict(test_drop)
pd.DataFrame({'ID':ID, 'Segmentation':result}).to_csv('answer.csv', index=False)
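Since the problem requires the submission to contain exactly the ID and Segmentation columns, a quick check of the saved file can catch a stray index column or empty predictions before submitting. The following is a minimal sketch using only the standard library. The helper name `check_submission` is hypothetical, and the sample rows are made up rather than taken from the real test set.

```python
import csv
import io

def check_submission(fileobj, expected_columns=("ID", "Segmentation")):
    """Return the number of data rows if the header matches exactly, else raise ValueError."""
    reader = csv.reader(fileobj)
    header = next(reader, None)
    if header is None or tuple(header) != tuple(expected_columns):
        raise ValueError(f"unexpected columns: {header}")
    n_rows = 0
    for row in reader:
        # Every row needs one value per expected column, and none may be empty
        if len(row) != len(expected_columns) or any(cell == "" for cell in row):
            raise ValueError(f"bad row {n_rows + 1}: {row}")
        n_rows += 1
    return n_rows

# In-memory stand-in for answer.csv (the IDs and classes here are made up)
sample = io.StringIO("ID,Segmentation\n1001,A\n1002,D\n")
n = check_submission(sample)
print("rows:", n)
```

To run the check against the real file, open it with `open('answer.csv', newline='')` and pass the file object to `check_submission`. The returned row count can then be compared with `len(test)`.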

Thank you.