# Event Hubs Connector Documentation
Hello! This connector supports both Structured Streaming and Spark Streaming. To learn how to use it, please read our integration guides:

- [Structured Streaming + Event Hubs Integration Guide](structured-streaming-eventhubs-integration.md)
- [Spark Streaming + Event Hubs Integration Guide](spark-streaming-eventhubs-integration.md)

You can also read more about other features of this connector in the documents below:
- [Receive Key-Value Pairs from Events Sent using Event Hubs Kafka Endpoint](receive-events-sent-using-kafka-protocol.md)
- [Spark Structured Streaming Adjustment for Slow Partitions](slow-partition-adjustment-feature.md)
- [Use AAD Authentication to Connect Event Hubs](use-aad-authentication-to-connect-eventhubs.md)

Additionally, here are some links to documentation on Event Hubs, Spark, and Databricks:
- [Azure Event Hubs on Databricks](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/eventhubs-connector.html)
- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
- [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
- [Event Hubs Documentation](https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs)