Advanced Usage#

This tutorial showcases advanced functionalities and applications of GETTSIM’s interface. For an introduction, see the basic usage tutorial, which demonstrates GETTSIM’s two main functions using a minimal working example:

  1. set_up_policy_environment which loads a policy environment for a specified date.

  2. compute_taxes_and_transfers which allows you to compute taxes and transfers given a specified policy environment for household or individual observations.

This tutorial dives deeper into the GETTSIM interface to acquaint you with further useful functionalities. Specifically, it shows how to navigate the numerous input and target variables that the package supports and how GETTSIM processes them internally, using the example of child benefits in the German taxes and transfers system.

[1]:
import warnings

import numpy as np
import pandas as pd
import plotly.express as px
from gettsim import (
    FunctionsAndColumnsOverlapWarning,
    compute_taxes_and_transfers,
    create_synthetic_data,
    plot_dag,
    set_up_policy_environment,
)

warnings.filterwarnings("ignore", category=FunctionsAndColumnsOverlapWarning)

Example: Kindergeld (Child Benefits)#

For this tutorial, we will focus on Kindergeld, which is a child benefit that can be claimed by parents in Germany. Kindergeld can be claimed in different ways and eligibility for families to receive it depends on various variables. For instance, Kindergeld can be claimed as a monthly payment but also as a tax credit (Kinderfreibetrag) which is more advantageous for higher income groups. Additionally, eligibility depends on factors like the age and work status of children. These factors make it a more complex feature of the German taxes and transfers system than one might initially believe.

In the following, we will inspect in detail how the German Kindergeld is implemented in GETTSIM to showcase further functionalities of the package. To start off, we load a policy environment to work with.

[2]:
policy_params, policy_functions = set_up_policy_environment("2020")

The corresponding policy parameters are stored under the key kindergeld.

[3]:
policy_params["kindergeld"]
[3]:
{'altersgrenze': {'mit_bedingungen': 25, 'ohne_bedingungen': 18},
 'kindergeld': {1: 204, 2: 204, 3: 210, 4: 235},
 'einkommensgrenze': 8004,
 'stundengrenze': 20,
 'kinderbonus': 300,
 'datum': numpy.datetime64('2020-01-01')}
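
To see how these parameters translate into amounts, here is a minimal sketch in plain Python. It is not GETTSIM's implementation. It assumes that the keys of kindergeld denote the position of a child among the eligible children and that the rate for the highest key (4) also applies to every further child. The helper name monthly_kindergeld is hypothetical.

```python
# Monthly rates per child, copied from the 2020 parameters shown above.
KINDERGELD_RATES = {1: 204, 2: 204, 3: 210, 4: 235}


def monthly_kindergeld(n_children, rates=KINDERGELD_RATES):
    """Sum the monthly child benefit over all eligible children.

    Assumption: children beyond the highest key receive that key's rate.
    """
    highest = max(rates)
    return sum(
        rates[min(position, highest)] for position in range(1, n_children + 1)
    )


print(monthly_kindergeld(2))  # 408
```

For two eligible children this gives 204 + 204 = 408 per month.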

DAG Plots for Visualization of the Taxes and Transfers System#

To get a better picture of how Kindergeld is implemented in GETTSIM and, at the same time, of the structure of the German taxes and transfers system, we can use GETTSIM’s visualization capabilities, which are concentrated in the function plot_dag. This function creates a directed acyclic graph (DAG) for the taxes and transfers system and offers many different visualization options. The guide on visualizing the taxes and transfers system gives an in-depth explanation of the function.

To figure out which variables are relevant for the child benefit, we plot the corresponding slice of the entire taxes and transfers system implemented in GETTSIM using plot_dag. The function was already imported together with all other relevant packages at the beginning of this tutorial. To select the relevant plot, we define selectors that we pass as arguments to the function. The list of possible output variables in the GETTSIM documentation helps to find the relevant variable name for our application.

[4]:
selectors = {"type": "ancestors", "node": "kindergeld_m"}

Since we are interested in the child benefits, we select the node kindergeld_m and plot its ancestors, which are all the nodes that kindergeld_m directly or indirectly depends on. As the plot below shows, the variable depends on many other nodes and generates a very large DAG. Clicking on a node links to the corresponding function or variable.

[5]:
plot_dag(functions=policy_functions, selectors=selectors).show()

An alternative way to inspect the variable is by looking at its neighbors in the DAG. This depiction shows the related variables and functions up to two nodes away from kindergeld_m. It reveals descendants of kindergeld_m: kindergeld_m_bg and kindergeld_m_eg. These variables contain the child benefits on Bedarfsgemeinschaften level and Einstandsgemeinschaften level.

[6]:
selectors = {"type": "neighbors", "node": "kindergeld_m", "order": 2}
plot_dag(functions=policy_functions, selectors=selectors).show()
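
The two selector types can be illustrated on a toy graph in plain Python. This is a sketch of the graph concepts only, not of how plot_dag works internally. The edges and the intermediate node kindergeld_anspruch are made up for illustration, and the helper names ancestors and neighbors are hypothetical.

```python
from collections import deque

# Toy dependency graph: each node maps to the nodes it depends on (its parents).
# The edges and the node "kindergeld_anspruch" are made up for illustration.
PARENTS = {
    "kindergeld_anspruch": ["alter", "in_ausbildung", "arbeitsstunden_w"],
    "kindergeld_m": ["kindergeld_anspruch", "p_id_kindergeld_empf"],
    "kindergeld_m_bg": ["kindergeld_m"],
    "kindergeld_m_eg": ["kindergeld_m"],
}


def ancestors(node, parents=PARENTS):
    """Return all nodes that `node` depends on, directly or indirectly."""
    found = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def neighbors(node, order, parents=PARENTS):
    """Return all nodes at most `order` edges away, ignoring edge direction."""
    adjacent = {}
    for child, child_parents in parents.items():
        for parent in child_parents:
            adjacent.setdefault(child, set()).add(parent)
            adjacent.setdefault(parent, set()).add(child)
    # Breadth-first search that stops expanding once `order` is reached.
    distance = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if distance[current] == order:
            continue
        for other in adjacent.get(current, ()):
            if other not in distance:
                distance[other] = distance[current] + 1
                queue.append(other)
    return set(distance) - {node}


print(sorted(ancestors("kindergeld_m")))
print(sorted(neighbors("kindergeld_m", order=1)))
```

In this toy graph, the ancestors contain only upstream nodes, whereas the neighbors also include the downstream nodes kindergeld_m_bg and kindergeld_m_eg.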

Computing Variables of Interest#

Having inspected the DAG, we have an impression of the various input variables and functions that influence our variable of interest. As a next step, we load a set of simulated household data, compute the Kindergeld using compute_taxes_and_transfers, and use the function’s features and error messages to aid us in this process.

Simulated Data#

We simulate a dataset using create_synthetic_data. We can easily specify a few variables while all other necessary input variables are filled with defaults.

The specification chosen here creates a set of households with two adults and two children. The households vary in the variable bruttolohn_m and are otherwise identical.

[7]:
data = create_synthetic_data(
    n_adults=2,
    n_children=2,
    specs_heterogeneous={
        "bruttolohn_m": [[i, 0, 0, 0] for i in np.linspace(1000, 8000, 701)]
    },
)
[8]:
data[["hh_id", "hh_typ", "alter", "kind", "bruttolohn_m"]]
[8]:
      hh_id             hh_typ  alter   kind  bruttolohn_m
0         0  couple_2_children     35  False        1000.0
1         0  couple_2_children     35  False           0.0
2         0  couple_2_children      8   True           0.0
3         0  couple_2_children      5   True           0.0
4         1  couple_2_children     35  False        1010.0
...     ...                ...    ...    ...           ...
2799    699  couple_2_children      5   True           0.0
2800    700  couple_2_children     35  False        8000.0
2801    700  couple_2_children     35  False           0.0
2802    700  couple_2_children      8   True           0.0
2803    700  couple_2_children      5   True           0.0

2804 rows × 5 columns

The first adult’s monthly gross earnings range between €1,000 and €8,000; all other household members have zero earnings. Earnings are captured in the variable bruttolohn_m. We can use the pandas method pandas.Series.describe to assess the variable in detail.

[9]:
data["bruttolohn_m"].describe()
[9]:
count    2804.000000
mean     1125.000000
std      2195.983791
min         0.000000
25%         0.000000
50%         0.000000
75%       250.000000
max      8000.000000
Name: bruttolohn_m, dtype: float64
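
The summary can look surprising at first: the median is 0 although wages start at €1,000. The reason is that three of the four members of every household earn nothing. The following plain-Python sketch rebuilds the wage column from the specification above and reproduces the count, the mean, and the 75% quantile. It assumes linear interpolation between ranks for the quantile, which is the pandas default.

```python
# Wage grid of the first adult: 701 equally spaced values from 1,000 to 8,000.
wages_first_adult = [1000 + 10 * i for i in range(701)]

# Each household has four members; only the first adult earns a wage.
bruttolohn_m = sorted(
    wage for first_wage in wages_first_adult for wage in (first_wage, 0, 0, 0)
)

n = len(bruttolohn_m)
mean = sum(bruttolohn_m) / n

# 75% quantile with linear interpolation between the two closest ranks.
position = 0.75 * (n - 1)
lower = int(position)
fraction = position - lower
q75 = bruttolohn_m[lower] + fraction * (
    bruttolohn_m[lower + 1] - bruttolohn_m[lower]
)

print(n, mean, q75)  # 2804 1125.0 250.0
```

The 75% quantile of 250 arises because it falls a quarter of the way between the last zero and the lowest wage of 1,000.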

The columns contain all the input variables needed to compute kindergeld_m.

[10]:
data.columns
[10]:
Index(['p_id', 'hh_id', 'hh_typ', 'hat_kinder', 'alleinerz',
       'anz_eig_kind_bis_24', 'weiblich', 'alter', 'kind', 'in_ausbildung',
       'bruttolohn_m', 'p_id_elternteil_1', 'p_id_elternteil_2',
       'p_id_kindergeld_empf', 'p_id_erziehgeld_empf', 'p_id_einstandspartner',
       'p_id_ehepartner', 'bürgerg_bezug_vorj', 'vermögen_bedürft',
       'eigenbedarf_gedeckt', 'gemeinsam_veranlagt', 'selbstständig',
       'wohnort_ost', 'eink_selbst_m', 'in_priv_krankenv',
       'priv_rentenv_beitr_m', 'bruttolohn_vorj_m', 'arbeitsstunden_w',
       'geburtsjahr', 'geburtstag', 'geburtsmonat', 'mietstufe',
       'entgeltp_ost', 'entgeltp_west', 'rentner', 'betreuungskost_m',
       'p_id_betreuungsk_träger', 'kapitaleink_brutto_m', 'eink_vermietung_m',
       'bruttokaltmiete_m_hh', 'heizkosten_m_hh', 'jahr_renteneintr',
       'monat_renteneintr', 'behinderungsgrad', 'wohnfläche_hh',
       'm_elterngeld', 'm_elterngeld_vat_hh', 'm_elterngeld_mut_hh',
       'bewohnt_eigentum_hh', 'immobilie_baujahr_hh', 'sonstig_eink_m',
       'grundr_entgeltp', 'grundr_zeiten', 'grundr_bew_zeiten', 'priv_rente_m',
       'schwerbeh_g', 'm_pflichtbeitrag', 'm_freiw_beitrag', 'm_mutterschutz',
       'm_arbeitsunfähig', 'm_krank_ab_16_bis_24', 'm_arbeitsl',
       'm_ausbild_suche', 'm_schul_ausbild', 'm_geringf_beschäft',
       'm_alg1_übergang', 'm_ersatzzeit', 'm_kind_berücks_zeit',
       'm_pfleg_berücks_zeit', 'y_pflichtbeitr_ab_40', 'pflichtbeitr_8_in_10',
       'arbeitsl_1y_past_585', 'vertra_arbeitsl_1997', 'vertra_arbeitsl_2006',
       'anwartschaftszeit', 'arbeitssuchend', 'm_durchg_alg1_bezug',
       'sozialv_pflicht_5j', 'kind_unterh_anspr_m', 'kind_unterh_erhalt_m',
       'steuerklasse', 'budgetsatz_erzieh', 'voll_erwerbsgemind',
       'teilw_erwerbsgemind'],
      dtype='object')

Using Errors and Warnings#

As the DAG and column list above show, a large number of inputs is required to compute child benefits for a family. While the DAG is very useful to understand the structure within GETTSIM behind a variable or function, it might be difficult to infer which inputs exactly are needed in the data to compute a desired output. The function compute_taxes_and_transfers thus directly provides multiple mechanisms that help you identify the required input variables to compute certain taxes and transfers.

As shown in the basic usage tutorial, the function requires data, one or multiple targets, and policy_params as well as policy_functions to compute taxes and transfers for a given policy environment.

Since our data set includes all required input columns already, the function does so without problems.

[11]:
result = compute_taxes_and_transfers(
    data=data, params=policy_params, targets="kindergeld_m", functions=policy_functions
)
result.head(3)
[11]:
kindergeld_m
0 408
1 0
2 0

Error Messages: Missing Inputs#

However, if we have failed to add a required column, the function throws an error with a message that specifies which columns are missing. For example, the variable arbeitsstunden_w holds information on weekly working hours and is required to compute child benefits. Dropping it from the data triggers the error.

[12]:
incomplete_data = data.drop("arbeitsstunden_w", axis=1)
result = compute_taxes_and_transfers(
    data=incomplete_data,
    params=policy_params,
    targets="kindergeld_m",
    functions=policy_functions,
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 2
      1 incomplete_data = data.drop("arbeitsstunden_w", axis=1)
----> 2 result = compute_taxes_and_transfers(
      3     data=incomplete_data,
      4     params=policy_params,
      5     targets="kindergeld_m",
      6     functions=policy_functions,
      7 )

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:134, in compute_taxes_and_transfers(data, params, functions, aggregate_by_group_specs, aggregate_by_p_id_specs, targets, check_minimal_specification, rounding, debug)
    129 processed_functions = _round_and_partial_parameters_to_functions(
    130     necessary_functions, params, rounding
    131 )
    133 # Create input data.
--> 134 input_data = _create_input_data(
    135     data=data,
    136     processed_functions=processed_functions,
    137     targets=targets,
    138     columns_overriding_functions=columns_overriding_functions,
    139     check_minimal_specification=check_minimal_specification,
    140 )
    142 # Calculate results.
    143 tax_transfer_function = dags.concatenate_functions(
    144     processed_functions,
    145     targets,
   (...)
    148     enforce_signature=True,
    149 )

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:371, in _create_input_data(data, processed_functions, targets, columns_overriding_functions, check_minimal_specification)
    364 dag = set_up_dag(
    365     all_functions=processed_functions,
    366     targets=targets,
    367     columns_overriding_functions=columns_overriding_functions,
    368     check_minimal_specification=check_minimal_specification,
    369 )
    370 root_nodes = {n for n in dag.nodes if list(dag.predecessors(n)) == []}
--> 371 _fail_if_root_nodes_are_missing(root_nodes, data, processed_functions)
    372 data = _reduce_to_necessary_data(root_nodes, data, check_minimal_specification)
    374 # Convert series to numpy arrays

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:539, in _fail_if_root_nodes_are_missing(root_nodes, data, functions)
    537 if missing_nodes:
    538     formatted = format_list_linewise(missing_nodes)
--> 539     raise ValueError(f"The following data columns are missing.\n{formatted}")

ValueError: The following data columns are missing.

[
    "arbeitsstunden_w",
]

Similarly, we can pass a pandas.DataFrame that contains nothing but an empty p_id column to the function to get a list of all the input columns necessary to compute the desired target(s).

[13]:
result = compute_taxes_and_transfers(
    data=pd.DataFrame({"p_id": []}),
    params=policy_params,
    targets="kindergeld_m",
    functions=policy_functions,
)
/home/docs/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:108: UserWarning:

The data types of the following input variables have been converted:

 - p_id from float64 to int

Note that the automatic conversion of data types is unsafe and that its correctness cannot be guaranteed. The best solution is to convert all columns to the expected data types yourself.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 result = compute_taxes_and_transfers(
      2     data=pd.DataFrame({"p_id": []}),
      3     params=policy_params,
      4     targets="kindergeld_m",
      5     functions=policy_functions,
      6 )

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:134, in compute_taxes_and_transfers(data, params, functions, aggregate_by_group_specs, aggregate_by_p_id_specs, targets, check_minimal_specification, rounding, debug)
    129 processed_functions = _round_and_partial_parameters_to_functions(
    130     necessary_functions, params, rounding
    131 )
    133 # Create input data.
--> 134 input_data = _create_input_data(
    135     data=data,
    136     processed_functions=processed_functions,
    137     targets=targets,
    138     columns_overriding_functions=columns_overriding_functions,
    139     check_minimal_specification=check_minimal_specification,
    140 )
    142 # Calculate results.
    143 tax_transfer_function = dags.concatenate_functions(
    144     processed_functions,
    145     targets,
   (...)
    148     enforce_signature=True,
    149 )

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:371, in _create_input_data(data, processed_functions, targets, columns_overriding_functions, check_minimal_specification)
    364 dag = set_up_dag(
    365     all_functions=processed_functions,
    366     targets=targets,
    367     columns_overriding_functions=columns_overriding_functions,
    368     check_minimal_specification=check_minimal_specification,
    369 )
    370 root_nodes = {n for n in dag.nodes if list(dag.predecessors(n)) == []}
--> 371 _fail_if_root_nodes_are_missing(root_nodes, data, processed_functions)
    372 data = _reduce_to_necessary_data(root_nodes, data, check_minimal_specification)
    374 # Convert series to numpy arrays

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:539, in _fail_if_root_nodes_are_missing(root_nodes, data, functions)
    537 if missing_nodes:
    538     formatted = format_list_linewise(missing_nodes)
--> 539     raise ValueError(f"The following data columns are missing.\n{formatted}")

ValueError: The following data columns are missing.

[
    "alter",
    "p_id_kindergeld_empf",
    "in_ausbildung",
    "arbeitsstunden_w",
]
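
The traceback hints at how this check works: the root nodes of the DAG, i.e. the nodes without predecessors, are compared with the columns in the data. The following is a simplified sketch of that idea in plain Python, not GETTSIM's code. The dependency graph is a toy, the intermediate node kindergeld_anspruch is made up, and the helper names are hypothetical.

```python
# Toy dependency graph: each function node maps to the nodes it needs.
# The intermediate node "kindergeld_anspruch" is made up for illustration.
DEPENDENCIES = {
    "kindergeld_anspruch": ["alter", "in_ausbildung", "arbeitsstunden_w"],
    "kindergeld_m": ["kindergeld_anspruch", "p_id_kindergeld_empf"],
}


def root_nodes(target, dependencies=DEPENDENCIES):
    """Collect the nodes without dependencies that `target` relies on."""
    roots = set()
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        needed = dependencies.get(node, [])
        if not needed:
            roots.add(node)
        stack.extend(needed)
    return roots


def fail_if_root_nodes_are_missing(target, columns):
    """Raise an error that lists every root node absent from `columns`."""
    missing = sorted(root_nodes(target) - set(columns))
    if missing:
        raise ValueError(f"The following data columns are missing.\n{missing}")


try:
    fail_if_root_nodes_are_missing("kindergeld_m", columns=["p_id", "alter"])
except ValueError as error:
    print(error)
```

Because only root nodes are checked, intermediate results such as kindergeld_anspruch never need to be supplied by the user.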

Error Messages and Warnings: Unused Inputs#

The function compute_taxes_and_transfers can also check for unused inputs in your data. This functionality is controlled through the argument check_minimal_specification. By default, it is set to ignore, meaning no check is conducted. It can also be set to warn to trigger a warning, or to raise to raise an error whose message lists the unused inputs.

[14]:
result = compute_taxes_and_transfers(
    data=data,
    params=policy_params,
    targets="kindergeld_m",
    functions=policy_functions,
    check_minimal_specification="raise",
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 1
----> 1 result = compute_taxes_and_transfers(
      2     data=data,
      3     params=policy_params,
      4     targets="kindergeld_m",
      5     functions=policy_functions,
      6     check_minimal_specification="raise",
      7 )

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:134, in compute_taxes_and_transfers(data, params, functions, aggregate_by_group_specs, aggregate_by_p_id_specs, targets, check_minimal_specification, rounding, debug)
    129 processed_functions = _round_and_partial_parameters_to_functions(
    130     necessary_functions, params, rounding
    131 )
    133 # Create input data.
--> 134 input_data = _create_input_data(
    135     data=data,
    136     processed_functions=processed_functions,
    137     targets=targets,
    138     columns_overriding_functions=columns_overriding_functions,
    139     check_minimal_specification=check_minimal_specification,
    140 )
    142 # Calculate results.
    143 tax_transfer_function = dags.concatenate_functions(
    144     processed_functions,
    145     targets,
   (...)
    148     enforce_signature=True,
    149 )

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:372, in _create_input_data(data, processed_functions, targets, columns_overriding_functions, check_minimal_specification)
    370 root_nodes = {n for n in dag.nodes if list(dag.predecessors(n)) == []}
    371 _fail_if_root_nodes_are_missing(root_nodes, data, processed_functions)
--> 372 data = _reduce_to_necessary_data(root_nodes, data, check_minimal_specification)
    374 # Convert series to numpy arrays
    375 data = {key: series.values for key, series in data.items()}

File ~/checkouts/readthedocs.org/user_builds/gettsim/checkouts/latest/src/_gettsim/interface.py:550, in _reduce_to_necessary_data(root_nodes, data, check_minimal_specification)
    548     warnings.warn(message, stacklevel=2)
    549 elif unnecessary_data and check_minimal_specification == "raise":
--> 550     raise ValueError(message)
    552 return {k: v for k, v in data.items() if k not in unnecessary_data}

ValueError: The following columns in 'data' are unused.


[
    "p_id_erziehgeld_empf",
    "hh_id",
    "bürgerg_bezug_vorj",
    "arbeitssuchend",
    "hat_kinder",
    "bruttokaltmiete_m_hh",
    "hh_typ",
    "geburtsjahr",
    "m_ersatzzeit",
    "m_elterngeld",
    "voll_erwerbsgemind",
    "m_arbeitsunfähig",
    "kind_unterh_anspr_m",
    "gemeinsam_veranlagt",
    "immobilie_baujahr_hh",
    "m_arbeitsl",
    "budgetsatz_erzieh",
    "p_id_einstandspartner",
    "bewohnt_eigentum_hh",
    "m_mutterschutz",
    "entgeltp_west",
    "sonstig_eink_m",
    "selbstständig",
    "p_id_elternteil_1",
    "m_schul_ausbild",
    "teilw_erwerbsgemind",
    "bruttolohn_vorj_m",
    "priv_rente_m",
    "kind_unterh_erhalt_m",
    "grundr_zeiten",
    "m_freiw_beitrag",
    "mietstufe",
    "m_pfleg_berücks_zeit",
    "arbeitsl_1y_past_585",
    "grundr_entgeltp",
    "kapitaleink_brutto_m",
    "kind",
    "m_geringf_beschäft",
    "vermögen_bedürft",
    "priv_rentenv_beitr_m",
    "monat_renteneintr",
    "eigenbedarf_gedeckt",
    "entgeltp_ost",
    "grundr_bew_zeiten",
    "vertra_arbeitsl_1997",
    "anwartschaftszeit",
    "steuerklasse",
    "y_pflichtbeitr_ab_40",
    "geburtstag",
    "m_elterngeld_vat_hh",
    "m_elterngeld_mut_hh",
    "wohnort_ost",
    "m_krank_ab_16_bis_24",
    "anz_eig_kind_bis_24",
    "eink_selbst_m",
    "behinderungsgrad",
    "in_priv_krankenv",
    "p_id_betreuungsk_träger",
    "schwerbeh_g",
    "wohnfläche_hh",
    "m_kind_berücks_zeit",
    "heizkosten_m_hh",
    "vertra_arbeitsl_2006",
    "m_durchg_alg1_bezug",
    "m_alg1_übergang",
    "betreuungskost_m",
    "rentner",
    "geburtsmonat",
    "alleinerz",
    "eink_vermietung_m",
    "p_id_elternteil_2",
    "jahr_renteneintr",
    "pflichtbeitr_8_in_10",
    "weiblich",
    "sozialv_pflicht_5j",
    "bruttolohn_m",
    "m_pflichtbeitrag",
    "m_ausbild_suche",
    "p_id_ehepartner",
]
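
The three settings can be sketched as follows. The function is modelled on the lines visible in the traceback above but is a simplified stand-in, not GETTSIM's code: it works on a plain dictionary instead of a DataFrame, and the example columns and values are chosen for illustration.

```python
import warnings


def reduce_to_necessary_data(root_nodes, data, check_minimal_specification="ignore"):
    """Drop columns that are not needed and optionally report them."""
    unnecessary = sorted(set(data) - set(root_nodes))
    message = f"The following columns in 'data' are unused.\n{unnecessary}"
    if unnecessary and check_minimal_specification == "warn":
        warnings.warn(message, stacklevel=2)
    elif unnecessary and check_minimal_specification == "raise":
        raise ValueError(message)
    # With "ignore", unused columns are dropped silently.
    return {k: v for k, v in data.items() if k not in unnecessary}


toy_data = {"alter": [35, 8], "hh_id": [0, 0]}
reduced = reduce_to_necessary_data(root_nodes={"alter"}, data=toy_data)
print(list(reduced))  # ['alter']
```

In all three modes the computation itself only sees the necessary columns; the setting only decides how loudly the unused ones are reported.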

Debug Mode#

In addition to errors and warnings, compute_taxes_and_transfers can also be run in debug mode by setting the argument debug=True. In this mode, the function returns all inputs and outputs that can be computed while issuing error messages for the parts where the code fails. It is thus a very useful tool to help you set up your code correctly and detect the sources of problems that might arise in the process. Check out the troubleshooting tutorial for more information.

Computing Child Benefits and Taxes#

In this section we will compute lump-sum child benefits (Kindergeld) for example households. Since households can also claim a tax credit (Kinderfreibetrag) instead of the child benefit, we will also compute the income taxes for each household. By default, GETTSIM chooses the financially more favorable option for each case. The results will thus let us inspect how the policy affects different income levels in our data.

Income Taxes#

The income tax of a household depends on the child benefit since the tax credit is only claimed if it is more beneficial than the child benefit. To compare the two, we additionally compute the income tax eink_st_y_sn for our data set. We also compute the variable bruttolohn_y_hh, which gives the yearly gross income per household (in our case, this is the combined income of the two adults in the household).

[15]:
df = compute_taxes_and_transfers(
    data=data,
    params=policy_params,
    targets=["eink_st_y_sn", "bruttolohn_y_hh", "kindergeld_y_hh"],
    functions=policy_functions,
)

Next, we aggregate eink_st_y_sn to the household level and drop unused variables as well as duplicates from our DataFrame. The final DataFrame contains the yearly gross income, income tax, and child benefit of each household.

[16]:
# Aggregate eink_st_y_sn to the household level to obtain eink_st_y_hh.
df = df.join(data["hh_id"])
df["eink_st_y_hh"] = df.groupby("hh_id")["eink_st_y_sn"].transform("sum")
# Select variables of interest for further steps.
df = df[["bruttolohn_y_hh", "eink_st_y_hh", "kindergeld_y_hh"]].drop_duplicates()
df.head().round(2)
[16]:
    bruttolohn_y_hh  eink_st_y_hh  kindergeld_y_hh
0           12000.0           0.0             4896
4           12120.0           0.0             4896
8           12240.0           0.0             4896
12          12360.0           0.0             4896
16          12480.0           0.0             4896

At a certain income level (around €80,000-€90,000) the tax credit becomes more favorable and GETTSIM assigns the tax break. The next cells plot the resulting income tax and child benefits.

[17]:
def plot_kindergeld(df):
    """Plot child benefit and income tax against gross household income."""

    return px.line(
        data_frame=df,
        x="bruttolohn_y_hh",
        y=["eink_st_y_hh", "kindergeld_y_hh"],
    )
[18]:
plot_kindergeld(df).show()
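
The choice between the two options can be sketched in a few lines. This is a deliberately simplified illustration, not GETTSIM's implementation: the function name and its arguments are hypothetical, and the tax amounts in the usage example are made up. Only the yearly child benefit of 4,896 is taken from the output above.

```python
def choose_more_favorable(kindergeld_y, tax_without_credit_y, tax_with_credit_y):
    """Pick the child benefit or the tax credit, whichever is worth more.

    All three arguments are yearly amounts for one household.
    """
    tax_saving_y = tax_without_credit_y - tax_with_credit_y
    if tax_saving_y > kindergeld_y:
        # The tax credit lowers the tax by more than the benefit would pay.
        return {"kindergeld_y": 0, "eink_st_y": tax_with_credit_y}
    return {"kindergeld_y": kindergeld_y, "eink_st_y": tax_without_credit_y}


# Made-up tax amounts for a lower-income and a higher-income household.
low_income = choose_more_favorable(4896, 3000, 1500)
high_income = choose_more_favorable(4896, 30000, 24000)
print(low_income)
print(high_income)
```

In the first case the tax saving of 1,500 is smaller than the benefit, so the household keeps the child benefit. In the second case the saving of 6,000 exceeds it, so the tax credit is applied and the benefit drops to zero.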

Columns Overriding Functions#

Lastly, it is also possible to substitute internally computed variables using input columns in the data.

For instance, for this application we could override the internal function kindergeld_m and set the child benefit to 0.

[19]:
new_data = data.copy()
new_data["kindergeld_m"] = 0.0

Again, we compute the child benefit and income tax by household.

[20]:
outputs = compute_taxes_and_transfers(
    data=new_data,
    params=policy_params,
    targets=["kindergeld_y_hh", "eink_st_y_sn", "bruttolohn_y_hh"],
    functions=policy_functions,
)
[21]:
# Aggregate eink_st_y_sn to the household level to obtain eink_st_y_hh.
outputs = outputs.join(new_data["hh_id"])
outputs["eink_st_y_hh"] = outputs.groupby("hh_id")["eink_st_y_sn"].transform("sum")

df_new = outputs.set_index(new_data.hh_id)
df_new = df_new[
    ["bruttolohn_y_hh", "eink_st_y_hh", "kindergeld_y_hh"]
].drop_duplicates()

Since the child benefits are set to zero, GETTSIM computes the tax credit for all households instead.

[22]:
plot_kindergeld(df_new).show()

Aside from overriding internal function outputs using data columns, it is also possible to substitute the functions entirely. Please refer to the policy functions tutorial for more information.

Use Case for Columns Overriding Functions: Retirement Earnings#

GETTSIM can calculate retirement earnings (ges_rente_m), which requires several input variables such as entgeltp and grundr_zeiten.

However, in most data sets (e.g. the SOEP), retirement earnings are observed and those input variables are not. For some applications, it is hence more straightforward to specify ges_rente_m directly as an input variable. The pension-specific input variables like entgeltp or grundr_zeiten are then no longer needed.
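
Why the pension-specific inputs become unnecessary can be shown with a toy dependency graph. The sketch below is not GETTSIM's code. It assumes that a node present in the data is taken from there, so the traversal stops at that node. The node renteneinkommen_m, the edges, and the helper name required_inputs are hypothetical.

```python
# Toy dependency graph; the edges and the node "renteneinkommen_m" are
# made up for illustration.
DEPENDENCIES = {
    "ges_rente_m": ["entgeltp", "grundr_zeiten"],
    "renteneinkommen_m": ["ges_rente_m", "priv_rente_m"],
}


def required_inputs(target, columns, dependencies=DEPENDENCIES):
    """Return the data columns needed to compute `target`.

    A node that is present in `columns` is read from the data, so its own
    dependencies are not traversed and therefore not required.
    """
    required = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in columns or node not in dependencies:
            required.add(node)
        else:
            stack.extend(dependencies[node])
    return required


print(sorted(required_inputs("renteneinkommen_m", columns=set())))
print(sorted(required_inputs("renteneinkommen_m", columns={"ges_rente_m"})))
```

Once ges_rente_m is supplied as a column, entgeltp and grundr_zeiten disappear from the list of required inputs, while priv_rente_m is still needed.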