A list of must pre-project questions


Rumbling through Twitter I found a Jupyter Notebook of Paige Bailey written at the rOpensci unconf about Ethical Machine Learning which you can read here. It was very interesting to look at her workflow but even more interesting was the set of questions she asked herself before and during the analysis. I paste them here just to keep them as a reference.

As you design the goal and the purpose of your machine learning product, you must first ask: Who is your audience?

  • Is your product or analysis meant to include all people?
  • And, if not: is it targeted to an exclusive audience?
  • Is there a person on your team tasked specifically with identifying and resolving bias and discrimination issues?

Once the concept and scope have been defined, it is time to focus on the acquisition, evaluation, and cleaning of data. We have received a single .csv file filled with information on customers from the bank’s manager. Some questions to consider:

  • Did the data come from a system prone to human error?
  • Is the data current?
  • What technology facilitated the collection of the data?
  • Was participation of the data subjects voluntary?
  • Does the context of the collection match the context of your use?
  • Was your data collected by people or a system that was operating with quotas or a particular incentive structure?

Now that your data has been collected, it would be a great idea to evaluate and describe it:

  • Who is represented in the data?
  • Who is under-represented or absent from your data?
  • Can you find additional data, or use statistical methods, to make your data more inclusive?
  • Was the data collected in an environment where data subjects had meaningful choices?
  • How does the data reflect the perspective of the institution that collected it?
  • Were fields within the data inferred or appended beyond what was clear to the data subject? Would this use of the data surprise the data subjects?

The next step would be cleaning the data.

  • Are there any fields that should be eliminated from your data?
  • Can you use anonymization or pseudonymization techniques to avoid needless evaluation or processing of individual data?

Establishing logic for variables

  • Can you describe the logic that connects the variables to the output of your equation?
  • Do your variables have a causal relationship to the results they predict?
  • How did you determine what weight to give each variable?

Identifying assumptions

  • Will your variables apply equally across race, gender, age, disability, ethnicity, socioeconomic status, education, etc.?
  • What are you assuming about the kinds of people in your data set?
  • Would you be comfortable explaining your assumptions to the public?
  • What assumptions are you relying on to determine the relevant variables and their weights?

Defining success - What amount and type of error do you expect? - How will you ensure your system is behaving the way you intend? How reliable is it?

How will you choose your analytical method? For example, predictive analytics, machine learning (supervised, unsupervised), neural networks or deep learning, etc.

  • How much transparency does this method allow your end users and yourself?
  • Are non-deterministic outcomes acceptable given your legal or ethical obligations around transparency and explainability?
  • Does your choice of analytical method allow you to sufficiently explain your results?
  • What particular tasks are associated with the type of analytical method you are using?


  • How could results that look successful still contain bias?
  • Is there a trustworthy or audited source for the tools you need?
  • Have the tools you are using been associated with biased products?
  • Or, if you build from scratch: can you or a third-party test your tools for any features that can result in biased or unfair outcomes?
comments powered by Disqus