These coding guidelines serve as a base for working on MLOps projects.
They indicate where within the ML Lifecycle the coding guidelines should be applied.
Coding Guidelines
Here are some coding guidelines that we have adopted in this repo.
- Test Driven Development
- Do not Repeat Yourself
- Single Responsibility
- Python and Docstrings Style
- The Zen of Python
- Evidence-Based Software Design
- You are not going to need it
- Minimum Viable Product
- Publish Often Publish Early
- If our code is going to fail, let it fail fast
Test Driven Development
TODO
- Show code examples
- Fix the links to the examples
- Show a concrete workflow (e.g., a pipeline triggered by a commit)
We use Test Driven Development (TDD) in our development. All contributions to the repository should have unit tests; we use pytest for Python files and papermill for notebooks.
Apart from unit tests, we also have nightly builds with smoke and integration tests. For more information about the differences, see a quick introduction to unit, smoke and integration tests.
You can find a guide on how to manually execute all the tests in the tests folder.
Click here to see some examples
- Basic asserts with fixtures comparing structures like lists, dictionaries, numpy arrays and pandas DataFrames.
- Basic use of common fixtures defined in a conftest file.
- Python unit tests for our evaluation metrics.
- Notebook unit tests for our PySpark notebooks.
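As an illustration of the first kind of example (the fixture and data below are hypothetical, not taken from the repo's test suite), a basic pytest fixture comparing plain structures, numpy arrays and pandas DataFrames might look like:

```python
import numpy as np
import pandas as pd
import pytest


def make_expected():
    # Hypothetical reference data; in the repo this would typically
    # live in a conftest.py file so every test module can reuse it.
    return {
        "scores": np.array([1.0, 2.0, 3.0]),
        "frame": pd.DataFrame({"user": [1, 2], "rating": [4.0, 5.0]}),
    }


@pytest.fixture
def expected():
    return make_expected()


def test_structures(expected):
    # Plain structures compare with ==; numpy arrays and pandas
    # DataFrames need their dedicated comparison helpers.
    assert {"a": [1, 2]} == {"a": [1, 2]}
    np.testing.assert_array_equal(np.array([1.0, 2.0, 3.0]), expected["scores"])
    pd.testing.assert_frame_equal(
        pd.DataFrame({"user": [1, 2], "rating": [4.0, 5.0]}), expected["frame"]
    )
```

Note that `==` on a numpy array or DataFrame returns an element-wise result rather than a single boolean, which is why the dedicated `assert_*` helpers are used.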
Do not Repeat Yourself
Do not Repeat Yourself (DNRY) by refactoring common code.
Click here to see some examples
- See how we are using DNRY when testing our notebooks.
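A hypothetical sketch of the idea (all names are illustrative, not from the repo): instead of repeating the same data-cleaning code in every test, factor it into a single helper that each test reuses:

```python
# Without DNRY: each test duplicates the same cleaning logic.
def test_mean_duplicated():
    data = [v for v in [1.0, None, 3.0] if v is not None]  # duplicated
    assert sum(data) / len(data) == 2.0


# With DNRY: the shared logic lives in one place.
def clean(values):
    # Shared helper: drop missing values once, reuse everywhere.
    return [v for v in values if v is not None]


def test_mean():
    data = clean([1.0, None, 3.0])
    assert sum(data) / len(data) == 2.0


def test_max():
    data = clean([None, 5.0, 2.0])
    assert max(data) == 5.0
```

If the cleaning rule ever changes, only `clean` needs to be updated, not every test.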
Single Responsibility
Single responsibility is one of the SOLID principles. It states that each module or function should have responsibility over a single part of the functionality.
Click here to see some examples
Without single responsibility:

```python
def train_and_test(train_set, test_set):
    # code for training on train_set
    # code for testing on test_set
    ...
```

With single responsibility:

```python
def train(train_set):
    # code for training on train_set
    ...

def test(test_set):
    # code for testing on test_set
    ...
```
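To make the split concrete, here is a minimal runnable sketch (the "model" and metric are deliberately trivial stand-ins, not the repo's actual code): each function does exactly one thing, and the caller composes them.

```python
def train(train_set):
    # Fit the simplest possible "model": the mean of the training values.
    return sum(train_set) / len(train_set)


def evaluate(model, test_set):
    # Evaluate with mean absolute error; kept separate from training,
    # so each function has a single responsibility.
    return sum(abs(x - model) for x in test_set) / len(test_set)


model = train([1.0, 2.0, 3.0])
error = evaluate(model, [2.0, 4.0])
```

Because training and evaluation are separate, each can be tested, reused, and modified independently.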
Python and Docstrings Style
About the code style for Python and docstrings.
TODO
- Write concrete examples directly in this document
We use the automatic style formatter Black. See the installation guide for VSCode and PyCharm. Black's style is a strict superset of the well-known style guide PEP 8, defined by Guido van Rossum and collaborators. PEP 8 covers everything from naming conventions and indentation guidelines to block vs. inline comments, how to use trailing commas, and so on.
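As a hedged illustration of the kind of rewriting Black performs (run `black` itself for the authoritative output on any given input):

```python
# Before Black (shown commented out): inconsistent spacing, single quotes
# def build( name,values ):
#     return { 'name':name,'values':values }

# After Black: normalized spacing and double quotes
def build(name, values):
    return {"name": name, "values": values}
```

Because Black is deterministic, running it in CI or as a pre-commit hook removes style debates from code review entirely.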
We use Google style for formatting the docstrings. See also the reStructuredText documentation for the syntax of docstrings.
Click here to see some examples
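A hedged illustration of a Google-style docstring (the function itself is hypothetical, not from the repo): the `Args`, `Returns` and `Raises` sections are the core of the format.

```python
def precision_at_k(relevant, recommended, k=10):
    """Compute precision at k.

    Args:
        relevant (set): Items that are actually relevant.
        recommended (list): Ranked list of recommended items.
        k (int): Cutoff rank. Defaults to 10.

    Returns:
        float: Fraction of the top-k recommendations that are relevant.

    Raises:
        ValueError: If k is not positive.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    return len([item for item in top_k if item in relevant]) / k
```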
The Zen of Python
We follow the Zen of Python when developing general Python code. For PySpark code, please see note (1) at the end of this section.
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Click here to see some examples
Implementation of explicit is better than implicit with a read function:

```python
# Implicit
def read(filename):
    # code for reading a csv or json
    # depending on the file extension
    ...

# Explicit
def read_csv(filename):
    # code for reading a csv
    ...

def read_json(filename):
    # code for reading a json
    ...
```
(1) Note regarding PySpark development: PySpark software design is highly influenced by Java. Therefore, in order to follow industry standards and adapt our code to our users' preferences, when developing in PySpark we don't strictly follow the Zen of Python.
Evidence-Based Software Design
When using Evidence-Based Design (EBD), software is developed based on customer inputs, standard libraries in the industry or credible research. For a detailed explanation, see this post about EBD.
Click here to see some examples
When designing the interfaces of the evaluation metrics in Python, we decided to use functions instead of classes, following industry standards like scikit-learn and TensorFlow. See our implementation of Python metrics.
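A hypothetical sketch of what a function-style metric interface looks like (the name and formula are illustrative; see the repo for the actual implementation): a plain function that takes true and predicted values, in the spirit of `sklearn.metrics`, rather than a class the caller must instantiate.

```python
def rmse(y_true, y_pred):
    # Function-style interface: stateless, directly callable,
    # mirroring sklearn.metrics-style conventions.
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5
```

The function form keeps the API flat and composable: callers can pass metrics around as first-class values without managing object state.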
You are not going to need it
The You Aren't Going to Need It (YAGNI) principle states that we should only implement functionality when we need it, not when we foresee that we might need it.
Click here to see some examples
- Question: should we start developing computer vision capabilities for the Recommenders project now?
- Answer: No, we will wait until we see demand for these capabilities.
Minimum Viable Product
We work through Minimum Viable Products (MVP), which are our milestones. An MVP is that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort. More information about MVPs can be found in the Lean Startup methodology.
Click here to see some examples
- Initial MVP of our repo with basic functionality.
- Second MVP to give early access to selected users and customers.
Publish Often Publish Early
Even before we have an MVP, get the code base working and doing something, even if it is something trivial that everyone can "run" easily.
Click here to see some examples
We make sure that in between MVPs all the code that goes to the branches staging or master passes the tests.
If it is going to fail, let it fail fast
Make sure that our code has sanity checks on all input parameters, validating that the data lies within the functional bounds, so that if the code is going to fail, it fails as soon as possible.
Click here to see some examples
Function with no checks:

```python
def division(a, b, c):
    d = some_function(a, b)
    e = some_other_function(a, d)
    return e / c  # this will fail and raise a ZeroDivisionError if c == 0
```

Function with checks:

```python
def division(a, b, c):
    if c == 0:
        # fail fast: raise early so we don't waste time computing
        # the intermediate results below
        raise ValueError("c can't be 0")
    d = some_function(a, b)
    e = some_other_function(a, d)
    return e / c
```