> For the complete documentation index, see [llms.txt](https://docs.koinju.io/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.koinju.io/sql-cookbook/ohlcv-streaming.md). # OHLCV streaming download Most pages in this section show a SQL query that *replaces* a chunk of pandas. This one is the opposite: a small Python wrapper around a trivially simple SQL query, designed to **download large volumes of data** straight to a Parquet file. The pandas idiom this replaces is `client.query_df(sql)` followed by `df.to_parquet(path)` — fine for thousands of rows, increasingly painful past a few hundred thousand, an outright OOM at multi-million scale. ## Query The SQL itself is a one-liner — a multi-exchange BTC-USDT 1-minute OHLCV pull, no aggregation, no analytics: ```sql SELECT start, exchange, market, toFloat64(open) AS open, toFloat64(high) AS high, toFloat64(low) AS low, toFloat64(close) AS close, toFloat64(volume) AS volume FROM api.ohlcv(candle_duration_in_minutes = 1) WHERE market = 'BTC-USDT' AND exchange IN ('binance', 'okx', 'kucoin', 'gateio') AND start >= toDateTime('2024-01-01') AND start < toDateTime('2024-12-31') + INTERVAL 1 DAY ORDER BY exchange, start ``` Functions used: [`toFloat64`](https://clickhouse.com/docs/sql-reference/functions/type-conversion-functions#tofloat64), [`toDateTime`](https://clickhouse.com/docs/sql-reference/functions/type-conversion-functions#todatetime) For the year of 2024 across the four exchanges, that's about 525,600 minutes × 4 exchanges ≈ **2.1 million rows**. `client.query_df(sql)` would peak around 500 MB – 1 GB of resident memory before writing anything. ## Python streaming wrapper `clickhouse-connect` exposes `Client.raw_stream(query, fmt='...')` — the server formats the result row-by-row in a streaming pipeline and ships it via HTTP chunked transfer. Wrap it with a byte-loop into a file and you get bounded RAM regardless of result size: ```python from pathlib import Path from clickhouse_connect.driver import Client def download_to_parquet(client: Client, query: str, out_path: Path) -> int: """Stream `query` directly to `out_path` as Parquet, return bytes written.""" out_path.parent.mkdir(parents=True, exist_ok=True) bytes_written = 0 with ( client.raw_stream(query, fmt="Parquet") as stream, out_path.open("wb") as f, ): for chunk in stream: f.write(chunk) bytes_written += len(chunk) return bytes_written client = clickhouse_connect.get_client( host='', port=8443, secure=True, username='', password='', database='api') sql = """ SELECT * FROM api.ohlcv(candle_duration_in_minutes = 1) WHERE market = 'BTC-USDT' AND exchange IN ('binance', 'okx', 'kucoin', 'gateio') AND start >= toDateTime('2024-01-01') AND start < toDateTime('2024-12-31') + INTERVAL 1 DAY ORDER BY exchange, start """ download_to_parquet(client, sql, Path('btc_ohlcv_2024.parquet')) ``` Three things to notice: * **`fmt='Parquet'`** picks the [ClickHouse format](https://clickhouse.com/docs/en/interfaces/formats#data-format-parquet) — typed columns, columnar layout, very compact on disk. Other useful alternatives: `CSVWithNames`, `JSONEachRow`, `Arrow`, `TSVWithNames`. * **No pandas, no NumPy, no per-row Python.** The Parquet bytes leave ClickHouse and land in the file with one Python-level allocation per HTTP chunk (\~64 KB by default). * **Order of magnitude on the koinju cluster.** The December-2024 slice of the query above (4 exchanges × 31 days × 1-minute candles ≈ 178 K rows) downloads as **5.4 MB Parquet in \~5.3 seconds**. Extrapolated linearly the full 2024 pull is \~65 MB / \~65 seconds — bounded by ClickHouse's Parquet writer and your network throughput, not by client RAM. (For comparison, the same data as `CSVWithNames` is \~17 MB / \~7 seconds — Parquet is \~3× smaller and \~30 % faster end-to-end.) ## Reading the Parquet back ```python import pyarrow.parquet as pq table = pq.read_table('btc_ohlcv_2024.parquet') # table.num_rows, table.column_names, table.to_pandas(), … ``` A note on schema: ClickHouse's Parquet writer maps `DateTime` columns to Parquet `uint32` (Unix-epoch seconds) and `LowCardinality(String)` to Parquet `binary`. Both are conventional and most readers (pyarrow, DuckDB, Spark) decode them transparently when converted to typed results — but if you're parsing the Parquet at the byte level, expect those physical types rather than `timestamp[us]` / `utf8`. ## Companion code This page's repo companion at [`api/python_vs_sql/ohlcv_streaming/`](https://gitlab.com/koinju/connector/exporter/-/tree/master/api/python_vs_sql/ohlcv_streaming) packages the wrapper as a runnable `download.py`, plus a smoke test that exercises the pipeline on a tiny window (1 hour × 1 exchange) — verifying that the script actually writes a non-empty Parquet file with the right schema and row count, without burning quota on a multi-million-row pull every CI run. ## Extending it * **Different query** — the wrapper is data-agnostic. Pass any SQL string (a multi-asset trade pull, a candle fan-out, a custom aggregate over a long window) and it streams the result. The only cost of a complex query is server-side compute time, not client memory. * **Different format** — change `fmt='Parquet'` to `'CSVWithNames'`, `'JSONEachRow'`, `'Arrow'`, etc. CSV is friendlier to text-only tooling; Arrow IPC is the fastest in-process round-trip to pyarrow. * **Resumable downloads** — the simple version doesn't checkpoint. For genuinely huge pulls, partition the SQL by month or by exchange and run one wrapper call per partition; failures retry one partition, not the whole pull. * **Direct upload to S3 / object storage** — replace `out_path.open('wb')` with a streaming upload buffer (`boto3.upload_fileobj` and friends accept iterators). The bytes never touch local disk.