Reproducible and FAIR data science

Week 6 - ANSCI 4940 - Spring 2026

Miel Hostens

What are the FAIR principles

FAIR data are data that meet the principles of findability, accessibility, interoperability, and reusability (FAIR).^[1][2] The acronym and principles were defined in a March 2016 paper in the journal Scientific Data by a consortium of scientists and organizations.^[1]

Get to know the principles (1)

Get familiar with FAIR by watching this video.
Read the initial publication, The FAIR Guiding Principles for scientific data management and stewardship.
Explore Cornell University's vision of the FAIR principles and its cheatsheets.

Get to know the principles (2)

Check the tutorials on the bovi-analytics website for advice on reproducibility.
Discuss with your entire team and with Dr. Miel Hostens (find him at his desk) how these principles will apply to your team project.

Findable

The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.

Findable

F1. (Meta)data are assigned a globally unique and persistent identifier
F2. Data are described with rich metadata (defined by R1 below)
F3. Metadata clearly and explicitly include the identifier of the data they describe
F4. (Meta)data are registered or indexed in a searchable resource
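
As a minimal sketch of what F1 to F4 can look like in practice, the Python snippet below builds a machine-readable metadata record and checks that no findability field is missing. The field names, the DOI-style identifier, the repository name, and the `missing_findable_fields` helper are all hypothetical and chosen for illustration only; they are not part of the FAIR specification.

```python
import json

# Hypothetical metadata record for a dairy dataset; every value is a placeholder.
metadata = {
    "identifier": "doi:10.0000/example-dataset",  # F1: unique, persistent ID (made up)
    "title": "Daily milk yield, example herd",    # F2: rich descriptive metadata
    "keywords": ["dairy", "milk yield", "sensor"],
    "describes": "doi:10.0000/example-dataset",   # F3: metadata names the data it describes
    "registry": "example-repository",             # F4: where the record is indexed
}

REQUIRED = ("identifier", "title", "keywords", "describes", "registry")


def missing_findable_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED if not record.get(field)]


print(missing_findable_fields(metadata))  # an empty list means nothing is missing
print(json.dumps(metadata, indent=2))     # machine-readable form for indexing
```

A check like this could run before a dataset is deposited, so that incomplete records are caught early.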

Accessible

Once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation.

Accessible

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
A1.1. The protocol is open, free, and universally implementable
A1.2. The protocol allows for an authentication and authorisation procedure, where necessary
A2. Metadata are accessible, even when the data are no longer available
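
To make A1 concrete, here is a hedged sketch that prepares (but does not send) an HTTPS request for the metadata behind an identifier. HTTPS stands in for the "open, free, and universally implementable" protocol of A1.1, and the optional token illustrates A1.2. The DOI is a placeholder, `build_metadata_request` is a hypothetical helper, and the `Accept` media type is an assumption about what the resolver supports.

```python
from urllib.request import Request


def build_metadata_request(doi, token=None):
    """Build, without sending, an HTTPS request for the metadata behind a DOI."""
    # Assumption: the resolver honours this media type and returns JSON metadata.
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    if token:
        # A1.2: authentication is added only where access is restricted.
        headers["Authorization"] = f"Bearer {token}"
    return Request(url=f"https://doi.org/{doi}", headers=headers, method="GET")


request = build_metadata_request("10.0000/example-dataset")  # placeholder DOI
print(request.full_url)
print(request.get_header("Accept"))
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, because the placeholder identifier does not resolve.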

Interoperable

The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.

Interoperable

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (Meta)data use vocabularies that follow FAIR principles
I3. (Meta)data include qualified references to other (meta)data
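
One way to read I3 is that a link between two resources should say how they relate, not merely that they do. The sketch below flags links whose relation type is too vague. The relation list, the dataset identifiers, and the `unqualified_links` helper are hypothetical; a real project would adopt relation types from a community standard.

```python
# Hypothetical controlled list of relation types, for illustration only.
ALLOWED_RELATIONS = {"isDerivedFrom", "isPartOf", "describes"}

# Each link states how two resources relate to each other.
links = [
    {"subject": "dataset:milk_yield_clean", "relation": "isDerivedFrom",
     "object": "dataset:parlor_export_raw"},
    {"subject": "dataset:milk_yield_clean", "relation": "relatedTo",
     "object": "dataset:weather_station"},
]


def unqualified_links(candidates):
    """Return the links whose relation is not in the controlled list."""
    return [link for link in candidates if link["relation"] not in ALLOWED_RELATIONS]


for link in unqualified_links(links):
    print(f"Vague relation {link['relation']!r}: {link['subject']} -> {link['object']}")
```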

Reusable

The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

Reusable

R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (Meta)data are released with a clear and accessible data usage license
R1.2. (Meta)data are associated with detailed provenance
R1.3. (Meta)data meet domain-relevant community standards
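
The reusability sub-principles can be turned into a simple pre-release check. The sketch below reports which of R1.1 to R1.3 a record leaves empty. The record, its field names, and the `reuse_gaps` helper are hypothetical; only the mapping to the principle codes comes from the list above.

```python
# Hypothetical record; field names and values are placeholders.
record = {
    "license": "CC-BY-4.0",          # R1.1: explicit usage license
    "provenance": {                  # R1.2: where the data came from
        "source": "milking parlor export",
        "processing": ["removed duplicate milkings", "converted lb to kg"],
    },
    "community_standard": "",        # R1.3: left empty on purpose
}

# Principle code -> field that is assumed to document it.
CHECKS = {
    "R1.1": "license",
    "R1.2": "provenance",
    "R1.3": "community_standard",
}


def reuse_gaps(rec):
    """Return the codes of the reusability principles the record leaves empty."""
    return [code for code, field in CHECKS.items() if not rec.get(field)]


print(reuse_gaps(record))  # ['R1.3']
```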

Naming convention

Why do we need this

Dairy data science is multimodal, longitudinal, and collaborative
Naming is not cosmetic — it is infrastructure
Most downstream issues are caused by:
- Ambiguity
- Inconsistency
- Silent semantic drift

40 Minutes, 4 Takeaways

Naming conventions are scientific decisions
Bad names silently break analysis and models
Conventions must scale across projects & people
Governance matters more than documentation

The Dairy Data Reality

Multiple farms
Multiple vendors
Multiple countries
Multiple disciplines:
- Nutrition
- Health
- Behavior
- Production

→ Names are the only shared language

What Is a Naming Convention?

A set of agreed‑upon rules for creating unique, informative, and stable names for data objects.

Applies to:

Variables
Tables
Files
Visuals
Ontology terms

Naming ≠ Style Preference

Bad naming causes:

Wrong joins
Broken pipelines
Invalid within-farm and between-farm comparisons

These errors do not throw exceptions.
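
A small sketch of such a silent failure: two hypothetical tables both use the ambiguous column name `id`, once for the cow and once for the farm. The join runs without any error, yet the result is meaningless. The tables and the `inner_join` helper are invented for illustration.

```python
# Hypothetical tables: both use the ambiguous column name "id".
milk_records = [{"id": 1, "milk_kg": 31.5}, {"id": 2, "milk_kg": 28.0}]     # id = cow
farms = [{"id": 1, "farm_name": "North"}, {"id": 2, "farm_name": "South"}]  # id = farm


def inner_join(left, right, key):
    """Naive inner join of two lists of dicts on a shared column name."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


joined = inner_join(milk_records, farms, "id")
# No exception is raised, but cow 1 is now assigned to farm "North" only
# because the two unrelated numbers happen to be equal.
print(joined)
```

Naming the columns `cow_id` and `farm_id` would make this join fail with a `KeyError`, so the mistake surfaces immediately instead of spreading downstream.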

Common Dairy Data Failures

DIM, days_in_milk, DaysInMilk
milk_yield, MilkYieldKg, MY
other, Other, OTHER
lactation_stage vs lactation_stages

Each creates a new semantic entity
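
One possible remedy is an explicit alias table that maps every known spelling to a single canonical name and rejects anything unknown. The canonical names and the `canonical_name` helper below are hypothetical. Note the loud assumption that `milk_yield` and `MY` are recorded in kg like `MilkYieldKg`; that must be verified per source before merging.

```python
# Hypothetical alias table: known spellings (lower-cased) -> canonical name.
# Assumption: all three milk yield variants are in kg; verify before use.
ALIASES = {
    "dim": "days_in_milk",
    "days_in_milk": "days_in_milk",
    "daysinmilk": "days_in_milk",
    "my": "milk_yield_kg",
    "milk_yield": "milk_yield_kg",
    "milkyieldkg": "milk_yield_kg",
}


def canonical_name(raw):
    """Map a raw variable name to its canonical form, or fail loudly."""
    key = raw.strip().lower()
    if key not in ALIASES:
        raise KeyError(f"Unknown variable name {raw!r}; add it to the alias table")
    return ALIASES[key]


for raw in ["DIM", "days_in_milk", "DaysInMilk", "MY", "milk_yield", "MilkYieldKg"]:
    print(raw, "->", canonical_name(raw))
```

Failing loudly on unknown names is a deliberate choice here: a new spelling should trigger a decision, not quietly become a new variable.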

Names Encode Meaning

A good name answers:

What?
For whom?
In what unit?
At what level?
Over what time window?
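
The five questions can be turned into a name template. The sketch below is one hypothetical pattern, not a recommended standard: the component order, the underscore separator, and the example values are assumptions to be replaced by whatever the team agrees on.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class VariableName:
    what: str     # What? The measured quantity
    subject: str  # For whom?
    unit: str     # In what unit?
    level: str    # At what level?
    window: str   # Over what time window?

    def __str__(self):
        parts = astuple(self)
        if not all(parts):
            raise ValueError("every component of the name must be filled in")
        return "_".join(parts)


name = VariableName(what="milk_yield", subject="cow", unit="kg",
                    level="herd_mean", window="7d")
print(name)  # milk_yield_cow_kg_herd_mean_7d
```

Building names from explicit components makes it impossible to leave out the unit or the time window by accident.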

Core Design Principles

Consistency over cleverness
Explicit over implicit
Stable over convenient
Readable by humans and machines

Scope of a Naming Convention

You must define rules for:

✅ Variables
✅ Tables
✅ Files
✅ Features
✅ Categories
✅ Ontology terms

Partial conventions always fail.

Case Sensitivity Matters

Choose one and enforce it.

Recommended:
- CamelCase for data schemas
- lower_snake_case for code variables

Never mix within the same layer.
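
Enforcement can be automated with a pair of regular expressions, one per layer. This is a minimal sketch: the patterns are strict assumptions (for example, an all-caps acronym such as `DIM` is rejected as CamelCase), and the example names and the `violations` helper are hypothetical.

```python
import re

# Assumed definitions of the two styles; adjust them to the team's agreement.
CAMEL_CASE = re.compile(r"[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)*")
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def violations(names, pattern):
    """Return the names that do not fully match the given style pattern."""
    return [name for name in names if not pattern.fullmatch(name)]


schema_fields = ["DaysInMilk", "MilkYieldKg", "milk_yield", "DIM"]
code_variables = ["days_in_milk", "milk_yield_kg", "MilkYieldKg", "MY"]

print(violations(schema_fields, CAMEL_CASE))   # ['milk_yield', 'DIM']
print(violations(code_variables, SNAKE_CASE))  # ['MilkYieldKg', 'MY']
```

A check like this can run in a test suite or a pre-commit hook, so the rule is enforced by tooling and not by memory.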

Variable Naming Pattern

Now is the perfect time to decide on the naming convention your project will use.