Read all files in a nested folder in Spark

If directory structure is regular, lets say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use * wildcard for each level of nesting as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()

[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']

Leave a Comment

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)