Elasticsearch: fetching all search results with elasticsearch-py and elasticsearch-dsl (bonus: pandasticsearch)

To pull Elasticsearch search results into Python, I tried the elasticsearch-py and elasticsearch-dsl packages. The search itself worked fine, but only 10 results came back.

See the two code samples below, which use elasticsearch-py and elasticsearch-dsl.

>>> client = Elasticsearch(['http://nightly.apinf.io:14002'])
>>> search = Search(using=client)
>>> results = search.execute()
>>> results.hits.total
9611
>>> len(results.hits.hits)
10

## src : https://github.com/elastic/elasticsearch-dsl-py/issues/737#issue-258049352

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd

s = Search(using=Elasticsearch("https://11.11.11.11:9200"), index="data_index")
s = s.query("match", userId="theo")

resp = s.execute()

pd.DataFrame([ hit.to_dict() for hit in resp ])
#>> dataframe
#>> 10 rows × 18 columns

In both cases only 10 documents (rows) were returned.

After some googling, I found the fix in github.com/elastic/elasticsearch-dsl-py/issues/737, which is also the source of the first code sample.

The solution is to:

get the number of documents matching the query with total = s.count(),
decide how many results to fetch with the slicing syntax s = s[0:total], and
run s.execute().

Applied to the second sample, it looks like this:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd

s = Search(using=Elasticsearch("https://11.11.11.11:9200"), index="data_index")
s = s.query("match", userId="theo")

# added: count the matching documents, then slice to request all of them
total = s.count()
s = s[0:total]


resp = s.execute()

pd.DataFrame([ hit.to_dict() for hit in resp ])
#--------------------------------
#>> dataframe
#>> 1027 rows × 18 columns
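
The same count-then-size idea can also be written against elasticsearch-py alone, without elasticsearch-dsl. The following is only a minimal sketch: the helper name search_all and its max_window parameter are hypothetical, and it assumes a client that accepts the 7.x-style body= argument for count and search (newer client versions may prefer separate keyword arguments).

```python
def search_all(client, index, query, max_window=10_000):
    """Fetch every document matching `query` in a single search request.

    `client` is expected to behave like elasticsearch.Elasticsearch.
    `max_window` is a hypothetical safety cap; 10,000 is the default value
    of the index.max_result_window setting.
    """
    # Step 1: ask how many documents match the query.
    total = client.count(index=index, body={"query": query})["count"]

    # Step 2: request that many hits instead of the default 10,
    # but never more than the cap.
    size = min(total, max_window)
    resp = client.search(index=index, body={"query": query, "size": size})

    # Step 3: keep only the document bodies.
    return [hit["_source"] for hit in resp["hits"]["hits"]]


# Usage (assumes a reachable cluster; not executed here):
#   from elasticsearch import Elasticsearch
#   import pandas as pd
#   client = Elasticsearch("https://11.11.11.11:9200")
#   rows = search_all(client, "data_index", {"match": {"userId": "theo"}})
#   df = pd.DataFrame(rows)
```

Because the helper only calls count and search on whatever object it is given, it can be tried out with a stub client before pointing it at a real cluster.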

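That is basically all it takes. One caveat: if total is too large, the request fails with an error, because Elasticsearch rejects searches where from + size exceeds the index.max_result_window setting (10,000 by default). In that case, narrow the query with additional conditions, or slice only as many results as you actually need.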
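
For result sets too large to slice in one go, elasticsearch-dsl also offers Search.scan(), which iterates over all matching documents through the scroll API instead of a single from/size request. The sketch below is one possible way to combine the two approaches. The helper name fetch_all_hits and the max_window threshold are hypothetical, and note that scan() does not return hits in sorted order by default.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def fetch_all_hits(search, max_window=MAX_RESULT_WINDOW):
    """Return every hit of an elasticsearch-dsl Search as a list of dicts.

    `search` is expected to behave like elasticsearch_dsl.Search.
    """
    total = search.count()
    if total == 0:
        return []

    if total <= max_window:
        # Small result set: one request, sized by slicing as in the post.
        return [hit.to_dict() for hit in search[0:total].execute()]

    # Large result set: stream the hits with scan() instead of slicing.
    return [hit.to_dict() for hit in search.scan()]


# Usage (assumes a reachable cluster; not executed here):
#   from elasticsearch import Elasticsearch
#   from elasticsearch_dsl import Search
#   import pandas as pd
#   s = Search(using=Elasticsearch("https://11.11.11.11:9200"), index="data_index")
#   s = s.query("match", userId="theo")
#   df = pd.DataFrame(fetch_all_hits(s))
```

Keeping the slicing path for small result sets preserves any sort order set on the search, while the scan path trades ordering for the ability to go past the window limit.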

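There is also a package called pandasticsearch (pandas + elasticsearch). It provides a RestClient for sending queries, and it has the same problem. In this case, adding a size entry to the params argument of the post function sets the maximum number of results to return.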

from pandasticsearch import DataFrame, RestClient, Select

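client = RestClient(url)  # url: address of the Elasticsearch server
# the query body needs a query type such as "match", as in the elasticsearch-dsl sample above
r = client.post('data_index/_search', data={"query": {"match": {"userId": "theo"}}}, params={"size": 10000})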

pandas_df = Select.from_dict(r).to_pandas()
#--------------------------------
#>> pandas dataframe
#>> 1027 rows x 18 columns


Attribution-ShareAlike
