새벽코딩

[Big Data Analysis Engineer] Practical Exam Past Question, Task Type 2

J 코딩 2024. 6. 21. 19:43

https://www.datamanim.com/dataset/practice/ex4.html

Problem

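The variable to predict is Segmentation. For each ID in test.csv, predict the Segmentation class, save the predictions, and submit them. The submission file must contain exactly two columns, ID and Segmentation. The evaluation metric is the macro F1 score.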
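Because the metric is macro F1, every Segmentation class counts equally, no matter how many rows it has. The sketch below computes macro F1 by hand on a tiny made-up label set; the helper name `macro_f1` and the labels are assumptions for illustration, not exam data. It is intended to mirror what `sklearn.metrics.f1_score(..., average='macro')` reports.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); a class that is never hit scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: class 'B' is never predicted correctly
y_true = ['A', 'A', 'A', 'A', 'B', 'C']
y_pred = ['A', 'A', 'A', 'A', 'C', 'C']

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(score, 4), round(accuracy, 4))
```

In a single-label multiclass problem the micro-averaged F1 equals plain accuracy, so it can look much better than macro F1 when a small class is missed, as in this toy case.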

import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv')

display(train.head(2))
test.head(2)

 

Solution

import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv')
# Target variable
y = train['Segmentation']

# Drop the identifier (and the target) from the features; keep test IDs for the submission
train_drop = train.drop(columns=['ID', 'Segmentation'])
test_drop = test.drop(columns=['ID'])
ID = test['ID']
# Fill missing values with the same fixed value per column in train and test
fill_values = {'Ever_Married': 'Yes', 'Graduated': 'Yes', 'Profession': 'Artist',
               'Work_Experience': 0.0, 'Family_Size': 2.0, 'Var_1': 'Cat_6'}
train_drop = train_drop.fillna(fill_values)
test_drop = test_drop.fillna(fill_values)

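# Label encoding: fit on train, then apply the same mapping to test
from sklearn.preprocessing import LabelEncoder
labels = ['Profession', 'Var_1']
for col in labels:
  en = LabelEncoder()
  train_drop[col] = en.fit_transform(train_drop[col])
  # transform raises an error if test contains a category unseen in train
  test_drop[col] = en.transform(test_drop[col])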

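# One-hot encoding (dummy variables)
category = ['Gender', 'Ever_Married', 'Graduated', 'Spending_Score']
for i in category:
  train_drop[i] = train_drop[i].astype('category')
  test_drop[i] = test_drop[i].astype('category')

train_drop = pd.get_dummies(train_drop)
test_drop = pd.get_dummies(test_drop)
# Align test columns to train so both have the same features in the same order
test_drop = test_drop.reindex(columns=train_drop.columns, fill_value=0)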

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
target = ['Age', 'Work_Experience', 'Family_Size']
scaler.fit(train_drop[target])
train_drop[target] = scaler.transform(train_drop[target])
test_drop[target] = scaler.transform(test_drop[target])

# Train/validation split (stratified on the target)
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train_drop, y, test_size=0.2, stratify=y, random_state=1)

# Model training
from sklearn.ensemble import RandomForestClassifier
# model1: default hyperparameters
model1 = RandomForestClassifier(random_state=23).fit(X_train, y_train)
prd1 = model1.predict(X_valid)
# model2: more trees with a limited depth
model2 = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=23).fit(X_train, y_train)
prd2 = model2.predict(X_valid)

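# Validation (the task is scored with macro F1)
from sklearn.metrics import f1_score
#help('sklearn.metrics.f1_score')
print("macro f1 (model1, default) : ", f1_score(y_valid, prd1, average='macro'))
print("macro f1 (model2, n_estimators=200, max_depth=6) : ", f1_score(y_valid, prd2, average='macro'))
# Predict on the test set with model2
result = model2.predict(test_drop)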

# Save results
pd.DataFrame({'ID':ID, 'Segmentation':result}).to_csv('answer.csv', index=False)
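Since the submission may contain only the ID and Segmentation columns, it can be worth reading the saved file back before submitting. Below is a minimal sketch using only the standard library. The helper name `check_submission` and the sample rows are assumptions for illustration, and the demo writes its own small file to a temporary directory instead of touching `answer.csv`.

```python
import csv
import os
import tempfile

def check_submission(path, expected_columns=('ID', 'Segmentation')):
    """Return the number of data rows if the header matches exactly (hypothetical helper)."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader)
        if tuple(header) != tuple(expected_columns):
            raise ValueError(f'unexpected columns: {header}')
        rows = list(reader)
    # Every row needs a non-empty value in both columns
    if any(len(r) != len(expected_columns) or '' in r for r in rows):
        raise ValueError('found a row with a missing value')
    return len(rows)

# Demo on a made-up file in a temporary directory
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample_answer.csv')
with open(sample_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['ID', 'Segmentation'])
    writer.writerows([[1, 'A'], [2, 'D'], [3, 'B']])

n_rows = check_submission(sample_path)
print(n_rows)
```

For the solution's output this would be called as `check_submission('answer.csv')`. A stray index column, for example from omitting `index=False` in `to_csv`, would show up as an unexpected header.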

 

 

Thank you.
