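[pandas] The duplicated() pitfall and collecting all duplicate rows

1. The duplicated() pitfall
2. Viewing all duplicate rows together

When handling duplicated data in pandas, there are times when you want to inspect every row that belongs to a duplicate group, not just the extra copies. This note records a way to do that.

1. The duplicated() pitfall

Suppose we have the following data.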

print(data.to_markdown())
>>>> 
|    | class   |   age | good   |
|---:|:--------|------:|:-------|
|  0 | a       |    11 | True   |
|  1 | a       |    10 | True   |
|  2 | b       |     9 | True   |
|  3 | c       |     9 | False  |
|  4 | c       |     7 | False  |
|  5 | c       |     7 | False  |
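
The post does not show how `data` was created, so the following is only a reconstruction from the printed table. The column dtypes are an assumption; the values are copied from the output above.

```python
import pandas as pd

# Hypothetical reconstruction of `data` from the table printed above.
# How the original DataFrame was built is not shown in the post.
data = pd.DataFrame(
    {
        "class": ["a", "a", "b", "c", "c", "c"],
        "age": [11, 10, 9, 9, 7, 7],
        "good": [True, True, True, False, False, False],
    }
)

print(data.shape)
```

Note that `to_markdown()` needs the optional `tabulate` package, so this sketch prints only the shape.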

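duplicated() flags only the duplicates other than the one row that is to be kept. Filtering with it therefore never shows a complete duplicate group: with keep="first" the first row of each group is left out, and with keep="last" the last row is left out.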

print(data[data.duplicated(subset="class", keep="first")].to_markdown())
>>>>
|    | class   |   age | good   |
|---:|:--------|------:|:-------|
|  1 | a       |    10 | True   |
|  4 | c       |     7 | False  |
|  5 | c       |     7 | False  |


print(data[data.duplicated(subset="class", keep="last")].to_markdown())
>>>>
|    | class   |   age | good   |
|---:|:--------|------:|:-------|
|  0 | a       |    11 | True   |
|  3 | c       |     9 | False  |
|  4 | c       |     7 | False  |
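
To see why each result drops one row per group, it can help to look at the boolean masks themselves. This is a small sketch that rebuilds the example table (the construction of `data` is assumed, not taken from the post) and prints both masks.

```python
import pandas as pd

# Assumed reconstruction of the example table from the post.
data = pd.DataFrame(
    {
        "class": ["a", "a", "b", "c", "c", "c"],
        "age": [11, 10, 9, 9, 7, 7],
        "good": [True, True, True, False, False, False],
    }
)

# keep="first": the first row of each "class" group is marked False (kept).
first_mask = data.duplicated(subset="class", keep="first")
# keep="last": the last row of each "class" group is marked False (kept).
last_mask = data.duplicated(subset="class", keep="last")

print(first_mask.tolist())  # [False, True, False, False, True, True]
print(last_mask.tolist())   # [True, False, False, True, True, False]
```

Row 0 is absent from the keep="first" result and rows 1 and 5 are absent from the keep="last" result, even though all of them belong to duplicated classes.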

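2. Viewing all duplicate rows together

Following a Stack Overflow answer, every row that belongs to a duplicate group can be collected and viewed together like this: group by the key column and concatenate only the groups that contain more than one row.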

print(pd.concat(g for _, g in data.groupby("class") if len(g) > 1).to_markdown())
>>>>
|    | class   |   age | good   |
|---:|:--------|------:|:-------|
|  0 | a       |    11 | True   |
|  1 | a       |    10 | True   |
|  3 | c       |     9 | False  |
|  4 | c       |     7 | False  |
|  5 | c       |     7 | False  |
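
As a possible alternative that the post does not cover, `duplicated()` also accepts `keep=False`, which marks every member of a duplicate group as True. The sketch below, using an assumed reconstruction of `data`, compares it with the groupby approach above.

```python
import pandas as pd

# Assumed reconstruction of the example table from the post.
data = pd.DataFrame(
    {
        "class": ["a", "a", "b", "c", "c", "c"],
        "age": [11, 10, 9, 9, 7, 7],
        "good": [True, True, True, False, False, False],
    }
)

# The groupby approach from the post: keep groups with more than one row.
grouped = pd.concat(g for _, g in data.groupby("class") if len(g) > 1)

# Alternative: keep=False marks all rows of every duplicate group as True.
all_dupes = data[data.duplicated(subset="class", keep=False)]

print(all_dupes.index.tolist())  # [0, 1, 3, 4, 5]
print(grouped.index.tolist())    # [0, 1, 3, 4, 5]
```

The two results coincide here because the rows are already ordered by `class`. In general, the groupby version returns rows ordered by group key while the mask version preserves the original row order. The groupby version also raises a `ValueError` when no group has duplicates, because `pd.concat` receives nothing to concatenate, whereas the mask version returns an empty DataFrame.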
