Create fake data with Faker in Python
Fake data is useful in many scenarios, such as testing applications or libraries, or demonstrating functionality without privacy or security concerns. While it is possible to create fake data manually, this is time-consuming, and there is a better way. Faker is a Python library that makes it easy to generate fake data such as names, phone numbers, and email addresses.
In this article, I'll detail the basic use of Faker to create fake data, as well as its use to populate a Pandas DataFrame and generate a chart. Fake data has many uses, including:
- Unit testing
- Populating databases
- Creating sample reports
- Simulating financial transactions
- Load testing for APIs
- Prototyping and design
- Testing search functionality
- Educational environments
- Anonymizing data
Faker installation and use
Faker can be installed with pip:
pip install faker
Simply import Faker and initialize a generator, which can be used to create fake data by calling methods on the fake instance. The following generates a fake name by calling the name method on the instance of Faker.
from faker import Faker

fake = Faker()

# fake name
fake.name()

# 'Stefanie Brown'
Show all the Faker methods
The following code lists all the public attributes on the instance of Faker, most of which are methods for creating data. This shows 297 names covering all kinds of data; the exact number depends on the installed version of Faker.
methods = [x for x in dir(fake) if not x.startswith('_')]
print(f"number of methods = {len(methods)}")
# number of methods = 297

# print the names in rows of four left-aligned columns
for index in range(0, len(methods), 4):
    print("".join(name.ljust(35) for name in methods[index:index + 4]))
Methods on Faker:
Methods | methods (contd.) | methods (contd.) | methods (contd.) |
---|---|---|---|
aba | add_provider | address | administrative_unit |
am_pm | android_platform_token | ascii_company_email | ascii_email |
ascii_free_email | ascii_safe_email | bank_country | basic_phone_number |
bban | binary | boolean | bothify |
bs | building_number | cache_pattern | catch_phrase |
century | chrome | city | city_prefix |
city_suffix | color | color_hsl | color_hsv |
color_name | color_rgb | color_rgb_float | company |
company_email | company_suffix | coordinate | country |
country_calling_code | country_code | credit_card_expire | credit_card_full |
credit_card_number | credit_card_provider | credit_card_security_code | cryptocurrency |
cryptocurrency_code | cryptocurrency_name | csv | currency |
currency_code | currency_name | currency_symbol | current_country |
current_country_code | date | date_between | date_between_dates |
date_object | date_of_birth | date_this_century | date_this_decade |
date_this_month | date_this_year | date_time | date_time_ad |
date_time_between | date_time_between_dates | date_time_this_century | date_time_this_decade |
date_time_this_month | date_time_this_year | day_of_month | day_of_week |
del_arguments | dga | domain_name | domain_word |
dsv | ean | ean13 | ean8 |
ein | email | emoji | enum |
factories | file_extension | file_name | file_path |
firefox | first_name | first_name_female | first_name_male |
first_name_nonbinary | fixed_width | format | free_email |
free_email_domain | future_date | future_datetime | generator_attrs |
get_arguments | get_formatter | get_providers | hex_color |
hexify | hostname | http_method | iana_id |
iban | image | image_url | internet_explorer |
invalid_ssn | ios_platform_token | ipv4 | ipv4_network_class |
ipv4_private | ipv4_public | ipv6 | isbn10 |
isbn13 | iso8601 | items | itin |
job | json | json_bytes | language_code |
language_name | last_name | last_name_female | last_name_male |
last_name_nonbinary | latitude | latlng | lexify |
license_plate | linux_platform_token | linux_processor | local_latlng |
locale | locales | localized_ean | localized_ean13 |
localized_ean8 | location_on_land | longitude | mac_address |
mac_platform_token | mac_processor | md5 | military_apo |
military_dpo | military_ship | military_state | mime_type |
month | month_name | msisdn | name |
name_female | name_male | name_nonbinary | nic_handle |
nic_handles | null_boolean | numerify | opera |
optional | paragraph | paragraphs | parse |
passport_dates | passport_dob | passport_full | passport_gender |
passport_number | passport_owner | password | past_date |
past_datetime | phone_number | port_number | postalcode |
postalcode_in_state | postalcode_plus4 | postcode | postcode_in_state |
prefix | prefix_female | prefix_male | prefix_nonbinary |
pricetag | profile | provider | providers |
psv | pybool | pydecimal | pydict |
pyfloat | pyint | pyiterable | pylist |
pyobject | pyset | pystr | pystr_format |
pystruct | pytimezone | pytuple | random |
random_choices | random_digit | random_digit_above_two | random_digit_not_null |
random_digit_not_null_or_empty | random_digit_or_empty | random_element | random_elements |
random_int | random_letter | random_letters | random_lowercase_letter |
random_number | random_sample | random_uppercase_letter | randomize_nb_elements |
rgb_color | rgb_css_color | ripe_id | safari |
safe_color_name | safe_domain_name | safe_email | safe_hex_color |
sbn9 | secondary_address | seed | seed_instance |
seed_locale | sentence | sentences | set_arguments |
set_formatter | sha1 | sha256 | simple_profile |
slug | ssn | state | state_abbr |
street_address | street_name | street_suffix | suffix |
suffix_female | suffix_male | suffix_nonbinary | swift |
swift11 | swift8 | tar | text |
texts | time | time_delta | time_object |
time_series | timezone | tld | tsv |
unique | unix_device | unix_partition | unix_time |
upc_a | upc_e | uri | uri_extension |
uri_page | uri_path | url | user_agent |
user_name | uuid4 | vin | weights |
windows_platform_token | word | words | xml |
year | zip | zipcode | zipcode_in_state |
zipcode_plus4 |
Fake methods for Name
It can be seen from the list of all the methods on Faker that there are a number of methods for generating names. We can use dir to list all the methods that have name in the method name. This lists 24 methods. Not all of these are names for people.
[x for x in dir(fake) if not x.startswith('_') and 'name' in x]
Methods in Faker with name in the method name
Fake Names | Fake Names (contd.) | Fake Names (contd.) |
---|---|---|
color_name | cryptocurrency_name | currency_name |
domain_name | file_name | first_name |
first_name_female | first_name_male | first_name_nonbinary |
hostname | language_name | last_name |
last_name_female | last_name_male | last_name_nonbinary |
month_name | name | name_female |
name_male | name_nonbinary | safe_color_name |
safe_domain_name | street_name | user_name |
Methods for people names in Faker
Fake Names | Fake Names (contd.) | Fake Names (contd.) |
---|---|---|
first_name | first_name_female | first_name_male |
first_name_nonbinary | last_name | last_name_female |
last_name_male | last_name_nonbinary | name |
name_female | name_male | name_nonbinary |
user_name | | |
Generate sample fake names
It can be helpful to see a sample of data from these methods to identify the best method for the desired fake data. This code calls each method that has name in its method name and prints one sample value per method, which makes it easy to compare the output of the different methods.
for name_method in [x for x in dir(fake) if not x.startswith('_') and 'name' in x]:
    # look the method up by name and call it; getattr avoids the need for eval
    print(f"{name_method.ljust(23)} : {getattr(fake, name_method)()}")
Fake data generated for methods with name
Method Name | Fake value generated |
---|---|
color_name | Navy |
cryptocurrency_name | Zcash |
currency_name | Surinamese dollar |
domain_name | miller-barber.com |
file_name | ok.webm |
first_name | Aaron |
first_name_female | Patricia |
first_name_male | Jackson |
first_name_nonbinary | Ricky |
hostname | web-01.hampton-jenkins.com |
language_name | Bihari languages |
last_name | Ray |
last_name_female | Barker |
last_name_male | Vega |
last_name_nonbinary | Morales |
month_name | August |
name | Charles Hoover |
name_female | Kathryn Levy |
name_male | Timothy Bruce |
name_nonbinary | Samantha Carr |
safe_color_name | gray |
safe_domain_name | example.com |
street_name | Kevin Harbor |
user_name | haneydaniel |
Regenerating the same fake data - Seed
It is sometimes necessary to generate the same fake data rather than generating random data every time, for example when fake data is used in unit tests. This is easily achieved by setting a seed, similar to seeding a random number generator. With the same seed, the same sequence of method calls on the same version of Faker produces the same results.
Faker.seed(42)
for _ in range(5):
    print(fake.name())
# Allison Hill
# Noah Rhodes
# Angie Henderson
# Daniel Wagner
# Cristian Santos


# seeding again with the same value restarts the same sequence
Faker.seed(42)
for _ in range(5):
    print(fake.name())
# Allison Hill
# Noah Rhodes
# Angie Henderson
# Daniel Wagner
# Cristian Santos
Localized fake data
A locale value can be passed into the constructor for Faker to create fake data in that locale.
# Italian
italian_fake = Faker('it_IT')
for name_method in [x for x in dir(italian_fake) if x.startswith('name')]:
    print(f"{name_method.ljust(23)} = {getattr(italian_fake, name_method)()}")

# name = Lando Opizzi
# name_female = Elvira Bonanno-Beccheria
# name_male = Dott. Jacopo Bottaro
# name_nonbinary = Ciro Parmitano

# Japanese
japanese_fake = Faker('ja_JP')
for name_method in [x for x in dir(japanese_fake) if x.startswith('name')]:
    print(f"{name_method.ljust(30)} = {getattr(japanese_fake, name_method)()}")

# name = 山田 春香
# name_female = 小林 知実
# name_male = 橋本 稔
# name_nonbinary = 斎藤 健一

# Chinese
chinese_fake = Faker('zh_CN')
for name_method in [x for x in dir(chinese_fake) if x.startswith('name')]:
    print(f"{name_method.ljust(30)} = {getattr(chinese_fake, name_method)()}")

# name = 张春梅
# name_female = 权建平
# name_male = 贺华
# name_nonbinary = 阎玉梅
Create a chart with fake data
Any of the methods on the instance of Faker can be used in a list comprehension to generate lists of fake values. These lists can be combined into a dictionary of lists, which can then be loaded into a Pandas DataFrame. This can be a great way to quickly generate test data without risking exposure of sensitive customer data. Once the data is loaded into a DataFrame, it can be manipulated or charted while developing data pipelines.
import pandas as pd

Faker.seed(42)
DATA_SIZE = 500
names = [fake.first_name() for _ in range(DATA_SIZE)]
ages = [fake.random_int(min=21, max=80) for _ in range(DATA_SIZE)]
states = [fake.state() for _ in range(DATA_SIZE)]
jobs = [fake.job() for _ in range(DATA_SIZE)]

# one column per key, each backed by a list of DATA_SIZE values
data = {'name': names, 'age': ages, 'state': states, 'job': jobs}
df = pd.DataFrame.from_dict(data)
First 10 records from the DataFrame containing fake data
Index | Name | Age | State | Job |
---|---|---|---|---|
0 | Danielle | 60 | Delaware | Surveyor, planning and development |
1 | Angel | 55 | Alaska | Surveyor, planning and development |
2 | Joshua | 22 | South Dakota | Landscape architect |
3 | Jeffrey | 79 | Rhode Island | Community education officer |
4 | Jill | 46 | Kansas | Paediatric nurse |
5 | Erica | 58 | Virginia | Manufacturing systems engineer |
6 | Patricia | 57 | Hawaii | Pharmacologist |
7 | Christopher | 63 | Maryland | Arts development officer |
8 | Robert | 22 | Oklahoma | Holiday representative |
9 | Anthony | 26 | Alabama | Tourism officer |
import matplotlib.pyplot as plt
import numpy as np

# mean age per state, keeping the 10 states with the highest mean
mean_age_df = df[['state', 'age']].groupby('state').mean()
mean_age_df.rename_axis(None, axis=0, inplace=True)
top_df = mean_age_df.sort_values(by=['age'], ascending=False).head(10)

fig, ax = plt.subplots(figsize=(10, 7), facecolor=plt.cm.Blues(.2))
ax.set_facecolor(plt.cm.Blues(.2))

fig.suptitle('Top 10 States with highest average age',
             fontsize=18,
             fontweight='bold')

states = list(top_df.index)
y_pos = np.arange(len(states))
ages = list(top_df['age'])

ax.barh(y_pos, ages, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(states, fontsize=14)
ax.set_xlabel('Average Age', fontsize=14)

# Display highest on top
ax.invert_yaxis()

# Hide the right and top spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

plt.show()
Conclusion
Fake data can easily be created with Faker in Python. I can't help asking - do we really need more fake data? I think we do, as it can be useful in a number of situations, such as unit testing or developing data pipelines without the need for real data. Faker can generate a wide range of data types, and it also supports custom data providers.