
readParquet for partitioned datasets #124

@collinarnett

Description


Is your feature request related to a problem? Please describe.
I would like to load a partitioned Parquet dataset. The current readParquet function only accepts a single file path and does not support directories.

import qualified DataFrame as D
import qualified DataFrame.Functions as F

main :: IO ()
main = do
  df <- D.readParquet "./dataset/"
  print . D.take 10 $ df

Error encountered

Main: ./dataset : withBinaryFile: does not exist (No such file or directory)
HasCallStack backtrace:
  collectBacktraces, called at libraries/ghc-internal/src/GHC/Internal/Exception.hs:169:13 in ghc-internal:GHC.Internal.Exception
  toExceptionWithBacktrace, called at libraries/ghc-internal/src/GHC/Internal/IO.hs:260:11 in ghc-internal:GHC.Internal.IO
  throwIO, called at libraries/ghc-internal/src/GHC/Internal/IO/Exception.hs:315:19 in ghc-internal:GHC.Internal.IO.Exception
  ioException, called at libraries/ghc-internal/src/GHC/Internal/IO/Exception.hs:319:20 in ghc-internal:GHC.Internal.IO.Exception

Describe the solution you'd like
Either readParquet should support reading from directories, or a new function such as readParquetPartitioned should handle partitioned directories specifically.
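A rough sketch of what such a function might look like, built on top of the existing single-file readParquet. This is only an illustration of the idea: readParquetPartitioned is the proposed (hypothetical) name, and the D.concat combinator for stacking frames vertically is an assumption, not necessarily the library's actual API. Only listDirectory, (</>), and takeExtension are real functions from the directory and filepath packages.

import qualified DataFrame as D
import System.Directory (listDirectory)
import System.FilePath ((</>), takeExtension)
import Data.List (sort)

-- Hypothetical sketch: read every .parquet file directly under a
-- directory and vertically concatenate the results.
readParquetPartitioned :: FilePath -> IO D.DataFrame
readParquetPartitioned dir = do
  entries <- sort <$> listDirectory dir
  let files = [dir </> f | f <- entries, takeExtension f == ".parquet"]
  dfs <- mapM D.readParquet files
  pure (D.concat dfs)  -- assumed vertical-concat helper

A full implementation would presumably also recurse into Hive-style key=value subdirectories and materialize the partition keys as columns, the way pyarrow does in the link below.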

Describe alternatives you've considered
N/A

Additional context
https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets
