Rationale 1: "Preferred form of modification" is not satisfied.
---------------------------------------------------------------

Without the original training data or training software, the kinds of possible
modifications are very limited. Take LLMs as an example: fine-tuning a
pre-trained LLM through LoRA typically does not require the original training
data or training software, but fine-tuning is not the only way to modify a
model. For example, when one needs to change the tokenizer (e.g., to add
support for a new language), the context window size, or the position
encoding, or to improve the model architecture, the AI model file alone is not
enough.
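
As a minimal sketch of this limitation (using the Hugging Face transformers
API and a hypothetical model identifier): extending the tokenizer for a new
language is possible with the weights alone, but the newly added embedding
rows start out untrained, and making them useful requires further training,
which in turn requires data and training software.

    # Hypothetical model identifier; any open-weight causal LM would do.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-org/some-open-weight-llm"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Add tokens needed for a new language or script (placeholder tokens).
    tokenizer.add_tokens(["<new_lang_token_a>", "<new_lang_token_b>"])

    # The embedding matrix must grow to match the new vocabulary.  The new
    # rows are randomly initialized: they carry no knowledge of the new
    # language until the model is trained further, which needs data and
    # training code that the weights alone do not provide.
    model.resize_token_embeddings(len(tokenizer))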

Taking "fine-tuning" (or other types of secondary development) as the only
"preferred form of modification", is relentlessly excluding the minority,
namely power users who are really able to understand, modify, maintain, and
improve, and iterate the AI model at a deeper or even fundamental level.

Thus, the "preferred form of modification" is not satisfied with just the AI
model file itself (without the original training data or training software).

This point also connects to the "freedom to change and improve" the AI model.
Without the original training data or training software, the ways to change
and improve the AI model are very limited.


Rationale 2: Training data and program are the "Source code" (DFSG #2).
-----------------------------------------------------------------------

If we treat emacs.c as the input, gcc as the processing software, and the
emacs ELF binary executable as the output, then emacs.c is the source code,
and it is the "preferred form of modification" of the emacs ELF binary.

If we treat the training data as the input, the training software as the
processing software, and the trained AI model as the output, then the training
data is the "source code" of the AI model, and the training data plus the
training software is the "preferred form of modification" of the AI model.

Plus, if a user would like to study and edit the "source code" of an AI model
the way the original author does, that "source code" is the training data and
training software, not the AI model itself (a pile of matrices and vectors).


Rationale 3: Reproducibility is not satisfied.
----------------------------------------------

It is impossible to reproduce the original author's work (the pre-trained AI
model) without the original training data or training software. Here,
"reproduce" means producing an AI model whose performance and behavior are
very similar or identical to those of the original author's released AI model.
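
A minimal sketch of such a check (hypothetical model identifiers, and assuming
both models share the same tokenizer) is to compare the two models' outputs on
the same inputs. A true reproduction would produce identical or nearly
identical results; without the original data and training software, no third
party can build a candidate that passes this kind of test.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    released_id = "some-org/released-llm"        # hypothetical
    candidate_id = "someone-else/retrained-llm"  # hypothetical

    tok = AutoTokenizer.from_pretrained(released_id)
    released = AutoModelForCausalLM.from_pretrained(released_id)
    candidate = AutoModelForCausalLM.from_pretrained(candidate_id)

    for prompt in ["The quick brown fox", "def fibonacci(n):"]:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            a = released(**ids).logits
            b = candidate(**ids).logits
        # Near-zero differences would indicate a real reproduction; large
        # differences indicate a new, merely similar work.
        print(prompt, torch.max(torch.abs(a - b)).item())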

The definition of "reproducibility" may be ambiguous sometimes.  Collecting
alternative training data and writing new training software based on the
information provided by the author of the pre-trained AI model is sometimes
called "reproducing a work" in some contexts, but it is in fact a mimic of the
original work that creates new work, instead of "reproducing the original
work".  


Rationale 4: Safety, Security, Bias, and Ethics Issues.
-------------------------------------------------------

Without the original training data or training software, the security patching
mechanism is limited to a binary diff on the AI model file, or to simply
replacing the old AI model with a brand new one. Nobody except the original
author can understand such a security update.
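
A minimal sketch of what such a "patch" looks like in practice (hypothetical
file names, safetensors weight files): an element-wise diff over millions of
parameters, which tells a reviewer nothing about what the update actually
changes in the model's behavior, unlike a source-level diff.

    from safetensors.torch import load_file

    old_weights = load_file("model-v1.0.safetensors")  # hypothetical path
    new_weights = load_file("model-v1.1.safetensors")  # hypothetical path

    for name, old in old_weights.items():
        new = new_weights[name]
        changed = (old != new).sum().item()
        if changed:
            print(f"{name}: {changed} of {old.numel()} values changed")
    # The output is a wall of numbers -- unreviewable by anyone downstream.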

If we encounter a safety/bias/ethics issue where the AI model produces content
that is harmful to society, such as discrimination against a certain group of
people or a certain type of endeavor, patching will be needed -- but fixing
that at the fundamental level can only be done by the original author, not by
downstream distributors.

For security issues (e.g., when AI takes a role in making decisions that can
lead to real-world impact and hence security risks), there is not yet a CVE
(Common Vulnerabilities and Exposures) system for AI models. When we face such
issues, security patching of this type of AI model at the fundamental level
can only be done by the original author, not by downstream distributors.


Rationale 5: The freedom to study is broken.
--------------------------------------------

Take LLMs as an example: without the original training data, it is impossible
to study whether the AI model leverages GPL-licensed data, or even to verify
whether the model was trained on legally obtained data. It is likewise
impossible to study how the AI model's outputs are affected by the
GPL-licensed data, such as whether the model will explicitly copy the
GPL-licensed data in its outputs without citing the source or providing the
license information.
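
A minimal sketch of the kind of study that requires the training data
(hypothetical model identifier and corpus path): check whether the model
reproduces GPL-licensed training text verbatim in its output. Without access
to the corpus, there is nothing to compare the output against.

    from transformers import pipeline

    generator = pipeline("text-generation",
                         model="some-org/some-open-weight-llm")  # hypothetical

    # GPL-licensed snippets from the (hypothetically available) training set.
    with open("gpl_training_subset.txt") as f:
        gpl_snippets = [line.strip() for line in f if len(line.strip()) > 80]

    out = generator("Write a linked list in C.",
                    max_new_tokens=256)[0]["generated_text"]

    for snippet in gpl_snippets:
        if snippet in out:
            print("Verbatim GPL-licensed training text in output:",
                  snippet[:60])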

If such kind of "study" particularly involving GPL-licensed data is too harsh,
we may need to revisit the definition of "study". That said, as the MSFT/NYT
case is not yet settled, we should put the "fair use" issue aside for now.  At
least, "the freedom to verify the license of training data" does not rely on
the "fair use" issue.

When copyrighted data unfortunately ends up in the training data by accident,
directly removing that portion of the training data is effective for avoiding
legal risks, but it is challenging to remove the influence of that data
directly and cleanly from the AI model (a pile of vectors and matrices). This
again goes back to the "preferred form of modification" issue.
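
A minimal sketch of the only clean fix (hypothetical file names and training
command): filter the offending records out of the corpus and re-run the
training software -- both of which must therefore be available in the first
place.

    # Remove offending records from the corpus (placeholder criterion).
    with open("training_corpus.jsonl") as src, \
         open("training_corpus.clean.jsonl", "w") as dst:
        for line in src:
            if "OFFENDING_MARKER" not in line:
                dst.write(line)

    # Then re-train, e.g.:  python train.py --data training_corpus.clean.jsonl
    # (hypothetical training command; the point is that it must exist and be
    # available to people other than the original author.)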