I am part of a few data science communities on LinkedIn and elsewhere, and one thing I see from time to time is people asking about PySpark.
Let’s face it: Data Science is too vast a field for anyone to know everything. So, when I join a course or community about statistics, for example, people sometimes ask what PySpark is, how to calculate certain stats in PySpark, and many other kinds of questions.
Usually, those who already work with Pandas are especially interested in Spark, and I believe that happens for a few reasons:
- Pandas is certainly famous and widely used by data scientists, but it is just as certainly not the fastest package. Because it processes data in memory on a single machine, performance degrades as the dataset grows, and very large datasets may not fit at all.
- It is a natural path for those who have already mastered Pandas to want to learn a new option for wrangling data. As data becomes more available and grows in volume, knowing Spark is a great way to deal with big data.
- Databricks is very popular, and PySpark is possibly the most used language on the platform, along with SQL.