In a previous article, we explored creating many-to-one relationships between columns in a synthetic PySpark DataFrame. That DataFrame consisted only of foreign-key information; we didn't produce any textual data that might be useful in a demo dataset.
If you are populating an artificial dataset, you will likely want to produce descriptive data such as product information, location details, and customer demographics.
In this post, we'll dig into a few sources that can be used to create synthetic text data with little effort or cost, and use these techniques to pull together a DataFrame containing customer details.
Synthetic datasets are a great way to demonstrate a data product, such as a website or analytics platform, anonymously. They allow users and stakeholders to interact with example data and see meaningful analysis without raising any privacy concerns around sensitive data.
Synthetic data is also useful for exploring Machine Learning algorithms, allowing Data Scientists to train models when real data is limited.
Performance testing Data Engineering pipelines is another strong use case for synthetic data: teams can ramp up the volume of data pushed through an infrastructure to identify weaknesses in the design and benchmark runtimes.
In my case, I’m currently creating an example dataset to performance-test some Power BI capabilities at high volumes, which I’ll be writing about in due course.
The dataset will contain sales data, including transaction amounts and other descriptive features such as store location, employee name and customer email address.
Starting off simple, we can use some built-in functionality to generate random text data. Importing the random and string Python modules, we can use the following simple function to create a random string of the desired length.
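A minimal sketch of such a function might look like the following (the function name `random_string` and the choice of lowercase ASCII letters are assumptions; you could equally draw from `string.ascii_letters` or include digits):

```python
import random
import string


def random_string(length: int) -> str:
    """Return a random string of lowercase ASCII letters of the given length."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))


# Example usage: generate an 8-character random string.
print(random_string(8))
```

Because `random.choice` picks each character independently, the output is uniformly distributed over all lowercase strings of that length, which is fine for filler text but won't resemble real words.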