* Specify Python 3.10.12 for nightly runs

Both ml-agents and ml-agents-envs only allow Python versions <=3.10.12, so make sure the nightly workflow uses a valid version.

(We might want to consider allowing any 3.10.x version so that we can pick up the latest security bugfixes, such as 3.10.13: https://www.python.org/downloads/release/python-31013/ )
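
For illustration, here is how an exact pin compares to a compatible-release specifier, using the `packaging` library (this snippet is only a sketch; the nightly workflow itself just sets `python-version` in the matrix):

```python
from packaging.specifiers import SpecifierSet

# Exact pin, matching what the nightly matrix now uses.
exact_pin = SpecifierSet("==3.10.12")
# Hypothetical compatible-release specifier that would accept any 3.10.x bugfix.
any_310 = SpecifierSet("~=3.10.0")

print("3.10.12" in exact_pin)  # True
print("3.10.13" in exact_pin)  # False -- newer security bugfixes are excluded
print("3.10.13" in any_310)    # True
print("3.11.0" in any_310)     # False
```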

Sample failure of nightly full-pytest before: https://github.com/alex-mccarthy-unity/ml-agents/actions/runs/8152427823/job/22281884176
Sample passing run afterwards: https://github.com/alex-mccarthy-unity/ml-agents/actions/runs/8153333182/job/22284499278

* Fix dead links in documentation

Together with #6065, fix the `markdown-link-check-full` job in the nightly runs.

Sample failing run before: https://github.com/alex-mccarthy-unity/ml-agents/actions/runs/8152427823/job/22281884377
Sample passing run after: https://github.com/alex-mccarthy-unity/ml-agents/actions/runs/8154489456/job/22288022888
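
For reference, the job name suggests it is backed by the markdown-link-check tooling; the sketch below is only a rough, hypothetical illustration of what such a check does (the `docs/` path is an assumption), not the tool the workflow actually runs:

```python
"""Minimal sketch of a Markdown dead-link scan (illustrative only)."""
import re
import sys
from pathlib import Path
from urllib.request import Request, urlopen

# Matches [text](http...) style Markdown links.
LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

def dead_links(path: Path) -> list[str]:
    dead = []
    for url in LINK_RE.findall(path.read_text(encoding="utf-8")):
        try:
            # HEAD keeps the request light; urlopen raises on 4xx/5xx responses.
            req = Request(url, method="HEAD", headers={"User-Agent": "link-check"})
            urlopen(req, timeout=10)
        except Exception:
            dead.append(url)
    return dead

if __name__ == "__main__":
    broken = {str(p): links for p in Path("docs").rglob("*.md") if (links := dead_links(p))}
    print(broken or "no dead links found")
    sys.exit(1 if broken else 0)
```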
alex-mccarthy-unity 2024-03-06 14:57:04 +01:00, committed by GitHub
Parent 3fc8d8ec22
Commit e322b6160e
No key matching this signature was found
GPG key ID: B5690EEEBB952194
7 changed files: 11 additions and 11 deletions

.github/workflows/nightly.yml (4 changes)

@@ -37,9 +37,9 @@ jobs:
# If one test in the matrix fails we still want to run the others.
fail-fast: false
matrix:
python-version: [3.10.x]
python-version: [3.10.12]
include:
- python-version: 3.10.x
- python-version: 3.10.12
pip_constraints: test_constraints_version.txt
steps:
- uses: actions/checkout@v2


@@ -385,7 +385,7 @@ your agent's behavior:
ML-Agents provides an implementation of two reinforcement learning algorithms:
- [Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/)
- [Proximal Policy Optimization (PPO)](https://openai.com/research/openai-baselines-ppo)
- [Soft Actor-Critic (SAC)](https://bair.berkeley.edu/blog/2018/12/14/sac/)
The default algorithm is PPO. This is a method that has been shown to be more
@@ -563,7 +563,7 @@ in training behaviors for specific types of environments.
ML-Agents provides the functionality to train both symmetric and asymmetric
adversarial games with
[Self-Play](https://openai.com/blog/competitive-self-play/). A symmetric game is
[Self-Play](https://openai.com/research/competitive-self-play). A symmetric game is
one in which opposing agents are equal in form, function and objective. Examples
of symmetric games are our Tennis and Soccer example environments. In
reinforcement learning, this means both agents have the same observation and


@@ -195,7 +195,7 @@ each Behavior:
| `save_steps` | (default = `20000`) Number of _trainer steps_ between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. <br><br>A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. <br><br> Typical range: `10000` - `100000` |
| `team_change` | (default = `5 * save_steps`) Number of _trainer_steps_ between switching the learning team. This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents per team switch. <br><br>A larger value of `team-change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents, the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies and so the agent may fail against the next batch of opponents. <br><br> The value of `team-change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we recommend setting this value as a function of the `save_steps` parameter discussed previously. <br><br> Typical range: 4x-10x where x=`save_steps` |
| `swap_steps` | (default = `10000`) Number of _ghost steps_ (not trainer steps) between swapping the opponent's policy with a different snapshot. A 'ghost step' refers to a step taken by an agent _that is following a fixed policy and not learning_. The reason for this distinction is that in asymmetric games, we may have teams with an unequal number of agents, e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` agents during `team-change` total steps is: `(num_agents / num_opponent_agents) * (team_change / x)` (a worked example follows this table). <br><br> Typical range: `10000` - `100000` |
| `play_against_latest_model_ratio` | (default = `0.5`) Probability an agent will play against the latest opponent policy. With probability 1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its opponent from a past iteration. <br><br> A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curricula](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations which may lead to a stronger final policy. <br><br> Typical range: `0.0` - `1.0` |
| `play_against_latest_model_ratio` | (default = `0.5`) Probability an agent will play against the latest opponent policy. With probability 1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its opponent from a past iteration. <br><br> A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curricula](https://openai.com/research/emergent-tool-use) of increasingly challenging situations which may lead to a stronger final policy. <br><br> Typical range: `0.0` - `1.0` |
| `window` | (default = `10`) Size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. Like in the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. <br><br> Typical range: `5` - `30` |
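
As a quick illustration of the `swap_steps` formula above, here is the arithmetic for a hypothetical 2v1 setup (the team sizes, `team_change` value, and desired number of swaps are illustrative values, not recommendations):

```python
# swap_steps for an asymmetric 2v1 game (e.g. two strikers vs. one goalie),
# aiming for x = 4 opponent swaps over a team_change window of 200000 trainer steps.
num_agents = 2            # agents on the learning team
num_opponent_agents = 1   # agents on the opposing team
team_change = 200000      # trainer steps between learning-team switches
x = 4                     # desired opponent swaps per team-change window

swap_steps = (num_agents / num_opponent_agents) * (team_change / x)
print(swap_steps)  # 100000.0 ghost steps between opponent swaps
```
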
### Note on Reward Signals


@@ -5,7 +5,7 @@ correctly. We've decided to keep it up just in case it is helpful to you.
This page contains instructions for setting up training on Microsoft Azure
through either
[Azure Container Instances](https://azure.microsoft.com/services/container-instances/)
[Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances/)
or Virtual Machines. Non "headless" training has not yet been tested to verify
support.
@@ -13,7 +13,7 @@ support.
A pre-configured virtual machine image is available in the Azure Marketplace and
is nearly completely ready for training. You can start by deploying the
[Data Science Virtual Machine for Linux (Ubuntu)](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-dsvm.ubuntu-1804)
[Data Science Virtual Machine for Linux (Ubuntu)](https://learn.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro?view=azureml-api-2)
into your Azure subscription.
Note that, if you choose to deploy the image to an
@@ -112,7 +112,7 @@ Once you have started training, you can
## Running on Azure Container Instances
[Azure Container Instances](https://azure.microsoft.com/services/container-instances/)
[Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances/)
allow you to spin up a container, on demand, that will run your training and
then be shut down. This ensures you aren't leaving a billable VM running when it
isn't needed. Using ACI enables you to offload training of your models without


@@ -1,6 +1,6 @@
# Training with Proximal Policy Optimization
ML-Agents uses a reinforcement learning technique called [Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
ML-Agents uses a reinforcement learning technique called [Proximal Policy Optimization (PPO)](https://openai.com/research/openai-baselines-ppo).
PPO is a technique that uses an artificial neural network to approximate the ideal function that lets an agent choose the best action in a given state based on its observations. The ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate Python process (communicating with the running Unity application over a socket).
To train an agent, the user must configure one or more reward signals that the agent will try to maximize. For the available reward signals and their associated hyperparameters, please refer to the [Reward Signals](Reward-Signals.md) documentation.


@@ -246,7 +246,7 @@ a reinforcement learning algorithm.
Compared with many other RL algorithms, this method has proven to be safe,
effective, and more general, which is why we chose it as the example algorithm
to use with ML-Agents. For more information about PPO,
see OpenAI's recent [blog post](https://blog.openai.com/openai-baselines-ppo/)
see OpenAI's recent [blog post](https://openai.com/research/openai-baselines-ppo)
explaining it.


@@ -2,7 +2,7 @@
Reinforcement learning is an artificial intelligence technique in which an _agent_ is trained to perform a task by rewarding desirable behavior. During reinforcement learning, the agent explores its environment, observes the state of things, and takes actions based on those observations. If an action leads to a better state, the agent receives a positive reward. If it leads to a less desirable state, the agent receives no reward or a negative reward (a penalty). As the agent learns during training, it optimizes its decision-making so that it earns the highest reward over time.
ML-Agents uses a reinforcement learning technique called [Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/). PPO uses a neural network to approximate the ideal function that maps an agent's observations to the best action the agent can take in a given state. The ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate Python process (communicating with the running Unity application over a socket).
ML-Agents uses a reinforcement learning technique called [Proximal Policy Optimization (PPO)](https://openai.com/research/openai-baselines-ppo). PPO uses a neural network to approximate the ideal function that maps an agent's observations to the best action the agent can take in a given state. The ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate Python process (communicating with the running Unity application over a socket).
**Note:** If you are not specifically interested in studying machine learning and reinforcement learning and simply want to train an agent to complete a task, you can treat PPO training as a _black box_. There are some training-related parameters to tune, both inside Unity and on the Python training side, but you do not need a deep understanding of the algorithm itself to successfully create and train agents. [Training ML-Agents](/docs/Training-ML-Agents.md) provides step-by-step procedures for running the training process.