Implementing an HTTP client that handles pagination with Python's aiohttp module and generators

2023-07-08

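Introduction

When issuing GET requests from Python, you sometimes need to handle pagination.
Libraries such as the AWS SDK ship with ready-made implementations like the Paginator class, but when you call a REST API through a plain HTTP client you have to implement pagination yourself.

In this post I write the pagination handling with aiohttp.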

Versions

	Version
Python	3.11.4

Environment setup

$ python3 -m venv .venv
$ source ./.venv/bin/activate
$ pip install aiohttp

GitHub REST API

The examples in this post send requests to the GitHub REST API.
The repository is grafana/grafana, and the API endpoint is the "List repository workflows" URL.

This endpoint accepts a per_page query parameter. It defaults to 30 and can be raised to at most 100. If you do not pass the parameter, the workflows array in a response contains at most 30 items.

{
  "total_count": 1,
  "workflows": [
    {
      ....
    }
  ]
}
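
As a small illustration of the per_page parameter described above, the following sketch adds it to a URL using only the standard library. The helper name with_per_page is hypothetical and not part of the original code; the 1 to 100 range check simply mirrors the maximum mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_per_page(url: str, per_page: int = 100) -> str:
    """Return url with the per_page query parameter set (hypothetical helper)."""
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    parts = urlsplit(url)
    # Keep existing query parameters, dropping any previous per_page value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "per_page"]
    query.append(("per_page", str(per_page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_per_page("https://api.github.com/repos/grafana/grafana/actions/workflows"))
```

Requesting 100 items per page reduces the number of round trips, but a Link header can still appear whenever the total exceeds the page size, so pagination handling is needed either way.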

Understanding pagination

To paginate with the GitHub REST API, you need to look at the Link response header. If all of the data you want fits in a single response, there is no Link header.

When a response is paginated, the response headers will include a link header. The link header will be omitted if the endpoint does not support pagination or if all results fit on a single page. The link header contains URLs that you can use to fetch additional pages of results.

Source: Using pagination in the REST API

Sending a request to check

First, confirm that a GET request works.

import aiohttp
import asyncio


async def main():
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://api.github.com/repos/grafana/grafana/actions/workflows",
            headers=headers,
        ) as resp:
            print(await resp.text())


if __name__ == "__main__":
    asyncio.run(main())

$ python src/list_repository_workflows.py | jq
{
  "total_count": 53,
  "workflows": [
    {
      "id": 3035099,
      "node_id": "MDg6V29ya2Zsb3czMDM1MDk5",
      "name": "Backport PR Creator",
      "path": ".github/workflows/backport.yml",
      "state": "active",

Next, check the Link header.

import aiohttp
import asyncio


async def main():
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://api.github.com/repos/grafana/grafana/actions/workflows",
            headers=headers,
        ) as resp:
            print(await resp.text())
            link = resp.headers.get("Link")
            print(link)


if __name__ == "__main__":
    asyncio.run(main())

$ python src/list_repository_workflows.py
<https://api.github.com/repositories/15111821/actions/workflows?page=2>; rel="next", <https://api.github.com/repositories/15111821/actions/workflows?page=2>; rel="last"

As the documentation says, the header has the following format.

link: <https://xxx>; rel="prev", <https://xxx>; rel="next", <https://xxx>; rel="last", <https://xxx>; rel="first"

Extracting the rel="next" URL

Let's use the re module to extract the rel="next" URL.

import re

link = '<https://api.github.com/repositories/15111821/actions/workflows?page=2>; rel="next", <https://api.github.com/repositories/15111821/actions/workflows?page=2>; rel="last"'


def extract_next_url(link: str) -> str | None:
    m = re.findall(r'<(https?://[\w/:%#\$&\?\(\)~\.=\+\-]+)>; rel="next"', link)
    return m[0] if len(m) >= 1 else None


print(extract_next_url(link))

The next page's URL was extracted from <https://api.github.com/repositories/15111821/actions/workflows?page=2>; rel="next".

$ python src/extract_next_url.py
https://api.github.com/repositories/15111821/actions/workflows?page=2

Reference: https://www.megasoft.co.jp/mifes/seiki/s310.html

Paginated requests

Algorithm

The processing order is as follows. To obtain the URL of the next page, a first GET request always has to be made.

flowchart TD
    Start([Start]) --> GetRequest[GET request]
    GetRequest --> GetNextURL[Get next_url]
    GetNextURL --> ReturnYield[Return the result with yield]
    ReturnYield --> LoopStart[/Pagination loop\nwhile next_url exists\]
    LoopStart --> GetRequest2[GET request]
    GetRequest2 --> GetNextURL2[Get next_url]
    GetNextURL2 --> ReturnYield2[Return the result with yield]
    ReturnYield2 --> LoopEnd[\Pagination loop/]
    LoopEnd --> End([End])

Implementation in Python

from typing import AsyncGenerator
import aiohttp
import asyncio
import re


def extract_next_url(link: str | None) -> str | None:
    # The Link header is absent when everything fits on one page.
    if link is None:
        return None
    m = re.findall(r'<(https?://[\w/:%#\$&\?\(\)~\.=\+\-]+)>; rel="next"', link)
    return m[0] if len(m) >= 1 else None


async def get(url: str) -> AsyncGenerator[str, None]:
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    async with aiohttp.ClientSession() as session:
        next_url: str | None = None
        async with session.get(url, headers=headers) as resp:
            next_url = extract_next_url(resp.headers.get("Link"))
            yield await resp.text()

        while next_url is not None:
            async with session.get(next_url, headers=headers) as resp:
                next_url = extract_next_url(resp.headers.get("Link"))
                yield await resp.text()


async def main():
    url = "https://api.github.com/repos/grafana/grafana/actions/workflows"

    async for resp in get(url):
        print(resp)


if __name__ == "__main__":
    asyncio.run(main())

It ran, so it looks fine.

$ python src/main.py | jq
{
  "total_count": 53,
  "workflows": [
    {
      "id": 3035099,
      "node_id": "MDg6V29ya2Zsb3czMDM1MDk5",
      "name": "Backport PR Creator",
      "path": ".github/workflows/backpo

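The earlier response contained total_count, with a value of 53. In other words, the workflows lists across all pages should contain 53 items in total.

Let's check the workflow ids to confirm that there are no duplicates.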

$ python src/main.py | jq '.workflows[] | .id' | sort | uniq | wc -l
      53

That looks fine.
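
The same check can also be done in Python instead of jq. The sketch below takes the page bodies as JSON strings, counts the unique workflow ids, and returns that count next to the total_count reported by the API. The function name check_workflow_pages and the two sample pages are hypothetical.

```python
import json
from typing import Iterable


def check_workflow_pages(pages: Iterable[str]) -> tuple[int, int]:
    """Return (unique workflow ids seen, total_count reported by the API)."""
    ids: set[int] = set()
    total_count = 0
    for page in pages:
        data = json.loads(page)
        total_count = data["total_count"]
        # A set drops duplicates, like `sort | uniq` in the shell pipeline.
        ids.update(w["id"] for w in data["workflows"])
    return len(ids), total_count


sample_pages = [
    '{"total_count": 3, "workflows": [{"id": 1}, {"id": 2}]}',
    '{"total_count": 3, "workflows": [{"id": 3}]}',
]
print(check_workflow_pages(sample_pages))
```

If the two numbers differ, either a page was skipped or the same page was fetched twice.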

Verifying with a unit test

Let's check that the code I wrote is correct.

$ pip install pytest pytest-aiohttp

Testing - docs.aiohttp.org

Create the tests directory and a test file.

 $ tree -L 2 -I __pycache__
  .
  ├── README.md
  ├── requirements.txt
  ├── src
  │   ├── extract_next_url.py
  │   ├── list_repository_workflows.py
  │   └── main.py
  └── tests
      └── test_get.py

Create a test server that behaves as if its responses were paginated.
The logic lives in the handler function.

import json

import pytest
from typing import Any
from aiohttp import web
from aiohttp.test_utils import TestServer

from src.main import get


async def handler(request: web.Request) -> web.Response:
    pageNum: str | None = request.query.get("page")
    host = request.url.host
    port = request.url.port

    if pageNum == "2":
        return web.json_response(
            {
                "total_count": 50,
                "workflows": [
                    {
                        "id": 2,
                    }
                ],
            },
            headers={"Link": f'<http://{host}:{port}/?page=3>; rel="next"'},
        )
    if pageNum == "3":
        return web.json_response(
            {
                "total_count": 50,
                "workflows": [
                    {
                        "id": 3,
                    }
                ],
            },
            headers={"Link": f'<http://{host}:{port}/?page=2>; rel="last"'},
        )

    return web.json_response(
        {
            "total_count": 50,
            "workflows": [
                {
                    "id": 1,
                }
            ],
        },
        headers={"Link": f'<http://{host}:{port}/?page=2>; rel="next"'},
    )


@pytest.mark.asyncio
async def test_get(aiohttp_server: Any) -> None:
    app = web.Application()
    app.add_routes([web.get("/", handler)])
    server: TestServer = await aiohttp_server(app)

    gen = get(server.make_url("/"))

    resp = await gen.__anext__()
    assert json.loads(resp)["workflows"][0]["id"] == 1

    resp = await gen.__anext__()
    assert json.loads(resp)["workflows"][0]["id"] == 2

    resp = await gen.__anext__()
    assert json.loads(resp)["workflows"][0]["id"] == 3

Reference: Testing client with fake server

The aiohttp_server argument of the test is implemented as part of the pytest-aiohttp plugin and is decorated with pytest.fixture, which is why aiohttp_server can be specified as a test argument.

async def test_get(aiohttp_server: Any) -> None:

Source: https://github.com/aio-libs/aiohttp/blob/v3.8.4/aiohttp/pytest_plugin.py#L267-L287

The test passed, so the implementation looks fine.

$ python -m pytest tests -s
================================== test session starts ===================================
platform darwin -- Python 3.11.4, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/hogehoge/blog-code/2023/07/github-actions-workflow-log
plugins: asyncio-0.21.0, aiohttp-1.0.4
asyncio: mode=Mode.STRICT
collected 1 item

tests/test_get.py .

=================================== 1 passed in 0.01s ====================================

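Summary

With aiohttp and a generator, I was able to build an HTTP client that supports pagination.
It does not consider error handling for StopIteration or StopAsyncIteration, or backoff, so it is incomplete as an implementation, but it should be usable as a bare-minimum HTTP client.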
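
For the backoff mentioned above, one possible direction is a small retry wrapper with exponential delays. This is only a sketch under assumptions: retry_with_backoff and its parameters are hypothetical, and it retries on any Exception for brevity, whereas a real client would probably limit retries to network errors (for example aiohttp.ClientError) and selected status codes.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await call(), retrying on exceptions with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            await asyncio.sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

Each page request inside the generator could be wrapped in such a helper, so that a temporary failure on one page does not discard the pages already yielded.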

The code used in this post is available here: https://github.com/kntks/blog-code/tree/main/2023/07/python-aiohttp-pagination-client

I hope it helps with your own implementation.