Added CONTRIBUTING.md to explain how to contribute code to SparkCLR

2015-11-11 17:18:04 -08:00 · 2015-11-11 17:18:04 -08:00 · dfd677359c
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,40 @@
+Contributing to SparkCLR
+========================
+This page outlines how to contribute to SparkCLR.
+  
+Ways to contribute
+------------------
+If you would like to become involved in the development of SparkCLR, there are many different ways in which you can contribute. We strongly value your feedback, questions, bug reports, and feature requests. 
+
+Consider these options: 
+* Use SparkCLR bits
+* Submit a GitHub issue (see [Issue Guide](docs/project-docs/issue-guide.md)). 
+* Verify fixes for bugs.
+* Submit a code fix for a bug.
+* Submit a new feature request (as a GitHub issue; see [Issue Guide](docs/project-docs/issue-guide.md)).
+* Help answer questions on SparkCLR mailing lists (*[sparkclr-user](https://groups.google.com/d/forum/sparkclr-user)* or *[sparkclr-dev](https://groups.google.com/d/forum/sparkclr-dev)*).
+* Submit a unit test.
+* Code review pending pull requests and bug fixes.
+* Tell others about SparkCLR.
+
+Contributing Code Changes
+-------------------------
+
+Before opening a *pull request*, review the [Contributing Code Changes](docs/project-docs/CONTRIBUTING.md). 
+It lists steps that are required before creating a PR. In particular, consider:
+
+- Is the change important and ready enough to ask the community to spend time reviewing?
+- Have you searched for existing, related issues and pull requests?
+- Is the change being proposed clearly explained and motivated?
+
+Contribution License Agreement
+------------------------------
+We appreciate community contributions to code repositories open sourced by Microsoft. When you sign a 
+[contribution license agreement](https://cla.microsoft.com/cladoc/microsoft-contribution-license-agreement.pdf) (CLA), we can ensure that the community is free to use your contributions. 
+
+When you open a pull request, the Microsoft Pull Request BOT (MSBOT) checks whether the change requires a CLA; trivial typo fixes, for example, usually don’t.
+
+* If no CLA is required, the pull request is labeled *cla-not-required* and you are done.
+* If a CLA is required and you have already signed one, the pull request is labeled *cla-signed* and you are done.
+* If a CLA is required and you have not signed one yet, MSBOT labels the pull request *cla-required* and posts a comment pointing you to the website where you can sign the CLA (fully electronic, no faxing involved). Once you have signed, the pull request is labeled *cla-signed* and you are done.
+
+We accept only pull requests that are labeled *cla-not-required*, *cla-signed*, or *cla-already-signed*.
+
+=========================
+**Credit** to [Microsoft Azure](http://azure.github.io/guidelines.html) and [Apache Spark](https://github.com/apache/spark/blob/master/CONTRIBUTING.md). We are borrowing liberally from their processes.
--- a/docs/project-docs/CONTRIBUTING.md
+++ b/docs/project-docs/CONTRIBUTING.md
@@ -0,0 +1,61 @@
+Contributing Code Changes
+=========================
+This page documents the steps required to contribute code changes to SparkCLR. 
+
+### Overview
+Generally, SparkCLR uses:
+* [GitHub issues](https://github.com/Microsoft/SparkCLR/issues) to track logical issues, including bugs and improvements.
+* [GitHub pull requests](https://github.com/Microsoft/SparkCLR/pulls) to manage the *code review* and merge of *code changes*.
+
+### GitHub Issues
+[Issue Guide](issue-guide.md) explains the GitHub labels used for managing SparkCLR *issues*. Even though GitHub allows only committers to apply labels, the **Prefix titles with labels** section of that guide explains how to suggest labels for an *issue*. The steps below help you assess whether and how to file a *GitHub issue* before opening a *pull request*:
+  
+1. Find the existing *Github issue* that the *code change* pertains to.
+  1. Do not create a new *Github issue* for creating a *code change* to address an existing *issue*; reuse the existing *issue* instead.
+  2. To avoid conflicts, assign the *GitHub issue* to yourself if you plan to work on it, add the *Ownership* label `grabbed by assignee`, and add the corresponding *Project Management* label, either `Backlog`, `Up Next`, or `In Progress` (see [Issue Guide](issue-guide.md) for more details).
+  3. Look for existing *pull requests* that are linked from the *issue*, to understand if someone is already working on it.
+2. If the change is new, then it usually needs a new *GitHub issue*. However, *trivial changes*, where *"what should change"* is virtually the same as *"how it should change"*, do not require a *GitHub issue*. Example: *"Fix typos in Foo scaladoc"*.
+3. If required, create a new *Github issue*:
+  1. Provide a descriptive Title. *"Update Checkpoint API"* or *"Problem in DataFrame API"* is not sufficient because neither says what is wrong. *"Checkpoint API fails to save data to HDFS"* is good.
+  2. Write a detailed Description. For bug reports, this should ideally include a short reproduction of the problem. For new features, it may include a *design document*, especially if it's a *major change*.
+  3. Set proper issue *Type* and *Area* labels, per [Issue Guide](issue-guide.md). 
+  4. To avoid conflicts, assign the *GitHub issue* to yourself if you plan to work on it, add the *Ownership* label `grabbed by assignee`, and add the corresponding *Project Management* label, either `Backlog`, `Up Next`, or `In Progress`. Leave it unassigned otherwise.
+  5. Do not include a patch file; pull requests are used to propose the actual change.
+  6. If the change is a major one, consider inviting discussion of the issue on the *[sparkclr-dev](https://groups.google.com/d/forum/sparkclr-dev)* mailing list before you start implementing the change. Note that a design doc helps the discussion and the review of *major* changes.
+
+### Pull Request
+1. Fork the Github repository at http://github.com/Microsoft/SparkCLR if you haven't already.
+2. Clone your fork, create a new dev branch, push commits to the dev branch.
+3. Consider whether documentation or tests need to be added or updated as part of the change, and add them as needed (doc changes should be submitted along with code change in the same PR).
+4. Run all tests and samples as described in the project's [README](../../README.md).
+5. Open a *pull request* against the master branch of Microsoft/SparkCLR. (Only in special cases would the PR be opened against other branches.)
+  1. Always associate the PR with the corresponding *GitHub issues*, except for trivial changes for which no *GitHub issue* is created.
+  2. For trivial cases where a *GitHub issue* is not required, **MINOR:** or **HOTFIX:** can be used as the PR title prefix.
+  3. If the *pull request* is still a work in progress, not ready to be merged, but needs to be pushed to Github to facilitate review, then prefix the PR title with **[WIP]**.
+  4. Consider identifying committers or other contributors who have worked on the code being changed. Find the file(s) in Github and click *"Blame"* to see a line-by-line annotation of who changed the code last. You can add `@username` in the PR description to ping them immediately.
+6. Investigate and fix failures caused by the pull request:
+  1. Fixes can simply be pushed to the same branch from which you opened your pull request.
+  2. If the failure is unrelated to your pull request and you have been able to run the tests locally successfully, please mention it in the pull request.
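+
+Steps 1 and 2 above (clone your fork, create a dev branch, push commits) can be sketched as two small shell helpers. This is a minimal sketch, not a project-provided script: the function names `start_dev_branch` and `publish_dev_branch` and their arguments are assumptions made for illustration, and the fork URL is whatever GitHub shows for your own fork.
+
+```shell
+# Clone your fork and start a dev branch for the change.
+# Usage: start_dev_branch <fork-url> <branch-name>
+# (hypothetical helper; both arguments are placeholders you supply)
+start_dev_branch() {
+    fork_url="$1"
+    branch="$2"
+    git clone "$fork_url" SparkCLR || return 1
+    cd SparkCLR || return 1
+    # Create the dev branch and switch to it in one step.
+    git checkout -b "$branch"
+}
+
+# Publish the dev branch to your fork; "origin" is the remote that
+# git clone sets up for the URL you cloned from.
+# Usage: publish_dev_branch <branch-name>
+publish_dev_branch() {
+    git push -u origin "$1"
+}
+```
+
+After `start_dev_branch`, commit your work as usual, then call `publish_dev_branch` with the same branch name. The `-u` flag records the fork's branch as the upstream of your local branch, so later pushes to the same pull request need only `git push`.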
+
+### The Review Process
+1. Other reviewers, including committers, may comment on the changes and suggest modifications. Changes can be added by simply pushing more commits to the same branch.
+2. Lively, polite, rapid technical debate is encouraged from everyone in the community. The outcome may be a rejection of the entire change.
+3. Reviewers can indicate that a change looks suitable for merging with a comment such as: *"I think this patch looks good"*. SparkCLR uses the **LGTM** convention for indicating the strongest level of technical sign-off on a patch: simply comment with the word **LGTM**. It specifically means: *"I've looked at this thoroughly and take as much ownership as if I wrote the patch myself"*. If you comment **LGTM** you will be expected to help with bugs or follow-up issues on the patch. Consistent, judicious use of **LGTM** is a great way to gain credibility as a reviewer with the broader community.
+4. The *Github issue* should be labelled as `In Progress` if the pull request needs more work.
+5. Sometimes, other changes will be merged which conflict with your pull request's changes. The PR can't be merged until the conflict is resolved. This can be resolved with `git fetch origin` followed by `git rebase origin/master` and resolving the conflicts by hand, then pushing the result to your branch.
+6. Try to be responsive to the discussion rather than let days pass between replies.
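+
+The conflict-resolution commands in step 5 can be wrapped in a small helper. This is a minimal sketch: the function name `rebase_onto_master` and its optional remote argument are assumptions, not part of the project. It defaults to `origin`, as in step 5; if `origin` in your clone points at your own fork, you would pass the name of a remote that points at Microsoft/SparkCLR instead.
+
+```shell
+# Rebase the current branch onto the latest master of a remote.
+# Usage: rebase_onto_master [remote]   (remote defaults to "origin")
+# (hypothetical helper name, chosen for this sketch)
+rebase_onto_master() {
+    remote="${1:-origin}"
+    git fetch "$remote" || return 1
+    # The rebase stops at the first conflicting commit. Fix the files,
+    # `git add` them, then run `git rebase --continue`
+    # (or `git rebase --abort` to return to the pre-rebase state).
+    git rebase "$remote/master"
+}
+```
+
+Because a rebase rewrites the commits on your branch, a branch that was already pushed will not accept a plain `git push` afterwards. `git push --force-with-lease` updates it while refusing to overwrite commits you have not seen.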
+
+### Closing Your Pull Request / Github Issue
+1. If a change is accepted, it will be merged and the *pull request* will automatically be closed, along with the associated *Github issues* if any.
+  1. Note that in the rare case where you are asked to open a pull request against a branch other than *master*, you will have to close the pull request manually.
+  2. The *Github issue* will be Assigned to the primary contributor to the change as a way of giving credit. If the *issue* isn't closed and/or Assigned promptly, comment on the *issue*.
+2. It is the **PR submitter's responsibility** to resolve any **merge conflicts**. When two PRs change the same code fragment, the second merge will fail and its submitter should be *politely* asked to fix the conflicts. This is an accepted fact of life in distributed development on GitHub and implies no malice: *"First merge wins"*.
+3. If your pull request is ultimately rejected, please close it promptly, because committers can't close PRs directly.
+4. If a *pull request* has gotten little or no attention, consider improving the description or the change itself and ping likely reviewers again after a few days. Consider proposing a change that's easier to include, like a smaller and/or less invasive change.
+5. If a pull request is closed because it is deemed not the right approach to resolve a *GitHub issue*, then leave the *issue* open. However, if the review makes it clear that the problem identified in the *GitHub issue* is not going to be resolved by any pull request (not a problem, *won't fix*), then also resolve the *GitHub issue*.
+
+==============================
+**Credit** to [Apache Kafka](https://cwiki.apache.org/confluence/display/KAFKA/Contributing+Code+Changes). We are borrowing liberally from their process.
+
+There may be bugs or possible improvements to this page, so help us improve it.
--- a/docs/project-docs/issue-guide.md
+++ b/docs/project-docs/issue-guide.md
@@ -0,0 +1,67 @@
+Issue Guide
+===========
+
+This page outlines how SparkCLR issues are handled. 
+  
+*Issues on GitHub* represent actionable work that should be done at some future point. It may be as simple as a small product or test bug or as large as the work tracking the design of a new feature. However, it should be work that falls under the charter of SparkCLR, which is to enable C# binding for Apache Spark. We will keep issues open even if we have no plans to address them in an upcoming milestone, as long as we consider the issue to fall under the charter.
+
+### When we close issues
+As noted above, we don't close issues just because we don't plan to address them in an upcoming milestone. So why do we close issues? There are a few major reasons:
+
+1. Issues unrelated to SparkCLR.
+2. Nebulous and large open issues. Large open issues are sometimes better suited to discussion on the mailing lists *[sparkclr-user](https://groups.google.com/d/forum/sparkclr-user)* or *[sparkclr-dev](https://groups.google.com/d/forum/sparkclr-dev)*.
+
+Sometimes after debate, we'll decide an issue isn't a good fit for SparkCLR.  In that case, we'll also close it.  Because of this, we ask that you don't start working on an issue until it's tagged with *"up for grabs"* or 
+*"feature approved"*.  Both you and the team will be unhappy if you spend time and effort working on a change we'll ultimately be unable to take. We try to avoid that.
+
+### Labels
+We use GitHub labels to manage workflow on *issues*.  We have the following categories per issue:
+* **Area**: These labels call out the feature areas to which the issue applies. 
+ * [RDD](https://github.com/Microsoft/SparkCLR/labels/RDD): Issues relating to RDD.
+ * [DataFrame/SQL](https://github.com/Microsoft/SparkCLR/labels/DataFrame%2FSQL): Issues relating to DataFrame/SQL.
+ * [DataFrame UDF](https://github.com/Microsoft/SparkCLR/labels/DataFrame%20UDF): Issues relating to DataFrame UDF.
+ * [Streaming](https://github.com/Microsoft/SparkCLR/labels/Streaming): Issues relating to Streaming.
+ * [Job Submission](https://github.com/Microsoft/SparkCLR/labels/Job%20Submission): Issues relating to Job Submission.
+ * [Packaging](https://github.com/Microsoft/SparkCLR/labels/Packaging): Issues relating to packaging.
+ * [Deployment](https://github.com/Microsoft/SparkCLR/labels/Deployment): Issues relating to deployment.
+ * [Spark Compatibility](https://github.com/Microsoft/SparkCLR/labels/Spark%20Compatibility): Issues relating to supporting different/newer Apache Spark releases.
+* **Type**: These labels classify the type of issue.  We use the following types:
+ * [documentation](https://github.com/Microsoft/SparkCLR/labels/documentation): Issues relating to documentation (e.g. incorrect documentation, enhancement requests)
+ * [debuggability/supportability](https://github.com/Microsoft/SparkCLR/labels/debuggability%2Fsupportability): Issues relating to making debugging and support easy. For instance, throwing meaningful errors when things fail.
+ * [user experience](https://github.com/Microsoft/SparkCLR/labels/user%20experience): Issues relating to making SparkCLR more user-friendly. For instance, improving the first time user experience, helping run SparkCLR on a single node or cluster mode etc.
+ * [bug](https://github.com/Microsoft/SparkCLR/labels/bug).
+ * [enhancement](https://github.com/Microsoft/SparkCLR/labels/enhancement): Issues related to improving existing implementations.
+ * [test bug](https://github.com/Microsoft/SparkCLR/labels/test%20bug): Issues related to invalid or missing tests/unit tests.
+ * [design change request](https://github.com/Microsoft/SparkCLR/labels/design%20change%20request): Alternative design change suggestions.
+ * [suggestion](https://github.com/Microsoft/SparkCLR/labels/suggestion): Feature or API suggestions.
+* **Ownership**: These labels specify who owns a specific issue. Issues without an ownership tag are still considered "up for discussion" and haven't been approved yet. We have the following types of ownership:
+ * [up for grabs](https://github.com/Microsoft/SparkCLR/labels/up%20for%20grabs): Small sections of work which we believe are well scoped. These sorts of issues are a good place to start if you are new.  Anyone is free to work on these issues.
+ * [feature approved](https://github.com/Microsoft/SparkCLR/labels/feature%20approved): Larger-scale issues that have priority and an approved design. Anyone is free to work on these issues, but they may be trickier or require more work.
+ * [grabbed by assignee](https://github.com/Microsoft/SparkCLR/labels/grabbed%20by%20assignee): The person the issue is assigned to is making a fix.
+* **Project Management**: These labels are used to communicate task status. Issues/tasks without a Project Management tag are still considered "pending/under triage".
+ * [0 - Backlog](https://github.com/Microsoft/SparkCLR/labels/0%20-%20Backlog): Tasks that are not yet ready for development or are not yet prioritized for the current milestone.
+ * [1 - Up Next](https://github.com/Microsoft/SparkCLR/labels/1%20-%20Up%20Next): Tasks that are ready for development and prioritized above the rest of the backlog.
+ * [2 - In Progress](https://github.com/Microsoft/SparkCLR/labels/2%20-%20In%20Progress): Tasks that are under active development.
+ * [3 - Done](https://github.com/Microsoft/SparkCLR/labels/3%20-%20Done): Tasks that are finished.  There should be no open issues in the Done stage.
+* **Review Status**: These labels are used to indicate that the issue/bug cannot be worked on after review. Issues without Review Status, Project Management or Ownership tags are ones pending reviews.
+ * [duplicate](https://github.com/Microsoft/SparkCLR/labels/duplicate): Issues/bugs are duplicates of ones submitted already.
+ * [invalid](https://github.com/Microsoft/SparkCLR/labels/invalid): Issues/bugs are unrelated to SparkCLR.
+ * [wontfix](https://github.com/Microsoft/SparkCLR/labels/wontfix): Issues/bugs are considered as limitations that will not be fixed.
+ * [needs more info](https://github.com/Microsoft/SparkCLR/labels/needs%20more%20info): Issues/bugs need more information. Usually this indicates we can't reproduce a reported bug.  We'll close these issues after a little while if we haven't gotten actionable information, but we welcome folks who have acquired more information to reopen the issue.
+
+In addition to the above, we may introduce new labels to help classify our issues. Some of these tags may cover cross-cutting concerns (e.g. *performance*, *serialization impact*), whereas others are used to help us track additional work needed before closing an issue (e.g. *needs design review*). 
+
+### Prefix titles with labels
+Unfortunately, GitHub allows only committers to label issues and pull requests. When creating or updating a GitHub issue, you can help by prefixing the issue title with the proper *Area*, *Type* and other labels in square brackets (**[ ]**). After reviewing the issue, we will apply the labels and remove the prefixes from the title. 
+
+Two examples below:
+* **[**DataFrame/SQL**]****[**bug**]** DF.showString() throws exception in Spark 1.5.1 cluster
+* **[**Deployment**]****[**bug**]** "sparkclr-submit.cmd --package" throws exception
+
+### Assignee
+We will assign each issue to a project member. In most cases, the assignee will not be the one who ultimately fixes the issue (that only happens when the issue is tagged *"grabbed by assignee"*). The purpose of the assignee is to act as a point of contact between the SparkCLR project and the community for the issue and to make sure it's driven to resolution. If you're working on an issue and get stuck, please reach out to the assignee (just @mention them) and they will work to help you out.
+
+======================
+**Credit** to the [.Net CoreFx project](https://github.com/dotnet/corefx). We are borrowing liberally from their process.
+  
+There may be bugs or possible improvements to this page, so help us improve it.