Download a zip file, decompress it, and convert CSV to Parquet format in Python

Source

yaml

id: zip-to-parquet
namespace: company.team

variables:
  file_id: "{{ execution.startDate | dateAdd(-3, 'MONTHS') | date('yyyyMM') }}"

tasks:
  - id: get_zipfile
    type: io.kestra.plugin.core.http.Download
    uri: "https://divvy-tripdata.s3.amazonaws.com/{{ render(vars.file_id) }}-divvy-tripdata.zip"

  - id: unzip
    type: io.kestra.plugin.compress.ArchiveDecompress
    algorithm: ZIP
    from: "{{ outputs.get_zipfile.uri }}"

  - id: parquet_output
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    env:
      FILE_ID: "{{ render(vars.file_id) }}"
    inputFiles: "{{ outputs.unzip.files }}"
    script: |
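      import os
      import pandas as pd

      # FILE_ID is passed in through the task's env block
      file_id = os.environ["FILE_ID"]
      # The script expects the unzipped CSV under this name, supplied via inputFiles
      file = f"{file_id}-divvy-tripdata.csv"

      df = pd.read_csv(file)
      # The file name matches the "*.parquet" pattern in outputFiles
      df.to_parquet(f"{file_id}.parquet")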
    outputFiles:
      - "*.parquet"

About this blueprint

Tags: Python, Kestra

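This flow downloads a monthly Divvy trip data archive as a zip file from an HTTP endpoint, unzips it, and converts the extracted CSV file to Parquet in Python using pandas. The month is taken from the `file_id` variable, which resolves to the year and month three months before the execution start date in `yyyyMM` form.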
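The flow's `file_id` variable is built from `execution.startDate` with `dateAdd(-3, 'MONTHS')` and `date('yyyyMM')`. To check which monthly file a given run would request, the same arithmetic can be sketched in plain Python. The helper name `file_id_for` is hypothetical and not part of Kestra; it only looks at the year and month and ignores time zones.

```python
from datetime import date


def file_id_for(start: date, months_back: int = 3) -> str:
    """Return the yyyyMM string for the month `months_back` months before `start`."""
    # Work in "months since year 0" so the subtraction rolls over year boundaries.
    total_months = start.year * 12 + (start.month - 1) - months_back
    year, month_index = divmod(total_months, 12)
    return f"{year:04d}{month_index + 1:02d}"


print(file_id_for(date(2024, 2, 15)))  # 202311
```

With this value, the download task would request the file named `202311-divvy-tripdata.zip`.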

Plugins used: Download, Archive Decompress, Script, Docker
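To try the decompress and convert steps outside Kestra, for example against a zip file that has already been downloaded, a minimal local sketch could look like the following. The function names `extract_csvs` and `csv_to_parquet` are hypothetical helpers, not Kestra APIs, and the conversion assumes pandas plus a Parquet engine such as pyarrow are installed.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path: Path, dest_dir: Path) -> list[Path]:
    """Extract every .csv member of the archive and return the extracted paths."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # extract() returns the path of the file it wrote
                extracted.append(Path(archive.extract(name, dest_dir)))
    return extracted


def csv_to_parquet(csv_path: Path) -> Path:
    """Write a Parquet file next to the CSV, similar to the flow's script task."""
    # Imported here so extract_csvs stays usable without pandas installed.
    import pandas as pd

    parquet_path = csv_path.with_suffix(".parquet")
    pd.read_csv(csv_path).to_parquet(parquet_path)
    return parquet_path
```

Calling `csv_to_parquet` on each path returned by `extract_csvs` roughly mirrors the `unzip` and `parquet_output` tasks, except that the output is named after the CSV file rather than after `file_id` alone.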
