Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don't have to deal with Scala features that cannot be easily expressed in plain Java and, as a result, don't play well with the Py4J gateway.
Assuming your class is in the package com.example and you have a Python DataFrame called df
df = ... # Python DataFrame
you’ll have to:
- Build a jar using your favorite build tool.
- Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code, you may have to pass it using --jars as well.
- Extract the JVM instance from a Python SparkContext instance: jvm = sc._jvm
- Extract the Scala SQLContext from a Python SQLContext instance: ssqlContext = sqlContext._ssql_ctx
- Extract the Java DataFrame from df: jdf = df._jdf
- Create a new instance of SimpleClass: simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")
- Call the exe method and wrap the result in a Python DataFrame, passing the Python sqlContext (the Python DataFrame constructor expects the Python-side context, not the Scala one):
    from pyspark.sql import DataFrame
    DataFrame(simpleObject.exe(), sqlContext)
The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
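As a rough illustration, the steps above could be combined into one helper along these lines. This is a minimal sketch, not a definitive implementation: the function name call_scala_class is hypothetical, com.example.SimpleClass and its exe method are taken from the example above, and the code relies on the private attributes _jvm, _ssql_ctx and _jdf, which are PySpark internals and may differ between Spark versions.

```python
def call_scala_class(sc, sqlContext, df, column):
    """Run the example com.example.SimpleClass on df (driver side only).

    Hypothetical helper: assumes SimpleClass takes
    (Scala SQLContext, Java DataFrame, String) and that exe()
    returns a Java DataFrame, as in the steps above.
    """
    # Imported lazily so the helper can be defined without PySpark loaded.
    from pyspark.sql import DataFrame

    jvm = sc._jvm                       # Py4J view of the driver JVM
    ssqlContext = sqlContext._ssql_ctx  # Scala SQLContext behind the Python one
    jdf = df._jdf                       # Java DataFrame behind the Python one

    # Instantiate the Scala class through the Py4J gateway.
    simple_object = jvm.com.example.SimpleClass(ssqlContext, jdf, column)

    # Wrap the returned Java DataFrame; the Python DataFrame constructor
    # is given the Python-side SQLContext here.
    return DataFrame(simple_object.exe(), sqlContext)
```

With the jar on the driver classpath, a call such as call_scala_class(sc, sqlContext, df, "v") should then return a regular PySpark DataFrame.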
Important: This approach works only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See "How to use Java/Scala function from an action or a transformation?" for details.