Here is how to solve the ImportError: No module named py4j.java_gateway error.
First, some background on the py4j module. Spark was originally written in Scala; later, due to industry adoption, the PySpark API was provided for Python on top of Py4J.
Py4J is a required module for running PySpark applications, and Spark ships it at $SPARK_HOME/python/lib/py4j-*-src.zip.
After installing Spark, you must add this Py4J zip to the PYTHONPATH environment variable before running a PySpark application. If it is missing from the environment, ImportError: No module named py4j.java_gateway is raised.
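If you prefer to fix this from inside Python rather than via environment variables, the same path logic can be sketched as below. This is a minimal sketch, not part of PySpark itself; add_py4j_to_path is a hypothetical helper name, and it locates the zip with the same wildcard pattern used in the export commands further down:

```python
import glob
import os
import sys


def add_py4j_to_path(spark_home):
    """Prepend SPARK_HOME/python and the bundled py4j source zip to
    sys.path so that `import py4j.java_gateway` can resolve."""
    python_dir = os.path.join(spark_home, "python")
    # Spark bundles py4j as a source zip; the version in the file name
    # varies by Spark release, so match it with a wildcard.
    matches = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    if not matches:
        raise FileNotFoundError("No py4j-*-src.zip found under %s" % python_dir)
    for entry in (matches[-1], python_dir):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return matches[-1]
```

This mirrors what setting PYTHONPATH does, but only for the current interpreter session.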
So set the following and try running your application again:
export SPARK_HOME=/Users/your_name/apps/spark-3.0.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Put these lines in your .bashrc file and reload it with source ~/.bashrc.
The py4j module version changes depending on the PySpark version you are using; to pick up the correct version automatically, use the glob pattern below. To find where pyspark is installed, run pip show pyspark.
export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
If you are using Windows, then try:
set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
then set the path using:
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
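On either OS, a quick way to confirm the interpreter can now see the module is a check like the following (a small sketch; py4j_available is just an illustrative name):

```python
import importlib.util


def py4j_available():
    """Return True if `import py4j` would succeed with the current
    sys.path, i.e. the PYTHONPATH entries above were picked up."""
    return importlib.util.find_spec("py4j") is not None


if __name__ == "__main__":
    print("py4j importable:", py4j_available())
```

If this prints False after a new shell session, the PYTHONPATH change was not applied (for example, the .bashrc was not reloaded).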
Hope this solves your issue.
Have a look at this similar issue:
ModuleNotFoundError: No module named ‘py4j’