
CASSANDRA-19480: Additional task-execution-specific instrumentation of job stats #51

Open: wants to merge 6 commits into base: trunk

Conversation

arjunashok (Contributor):
… job stats

Changes

  • Uses Spark listeners to publish job/task stats on job completion for both the reader and the writer.
  • Unifies the implementation of job-end stats published for the reader and writer.
  • Publishes task runtime stats.
  • Adds a Spark jobGroup-based UUID to the reader, similar to the existing implementation in the writer. This UUID is used as the jobId to uniquely identify, and potentially merge, multiple stats published from the same job.
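
The listener-based flow above can be sketched without any Spark dependency as follows. All names here (JobStatsSketch, JobEvent, internalJobId, runDemo) are illustrative, not the PR's actual API; the sketch only shows why a consumer must filter events by jobId before publishing, since a context-wide listener fires for every job.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Spark-free sketch of the listener pattern described above.
// A context-wide listener receives end-of-job events for every job,
// so each registered consumer filters on its own jobId before publishing.
public class JobStatsSketch
{
    // Stand-in for the JobEventDetail the real listener would carry
    static final class JobEvent
    {
        final String internalJobId;
        final Map<String, String> stats;

        JobEvent(String internalJobId, Map<String, String> stats)
        {
            this.internalJobId = internalJobId;
            this.stats = stats;
        }
    }

    private final Consumer<JobEvent> consumer;

    JobStatsSketch(Consumer<JobEvent> consumer)
    {
        this.consumer = consumer;
    }

    // In Spark this would be SparkListener.onJobEnd; here we invoke it directly
    void onJobEnd(JobEvent event)
    {
        consumer.accept(event);
    }

    public static Map<String, String> runDemo()
    {
        Map<String, String> published = new HashMap<>();
        String myJobId = "job-123";
        JobStatsSketch listener = new JobStatsSketch(event -> {
            // Consumers are called for all jobs; publish only for our own job
            if (myJobId.equals(event.internalJobId))
            {
                published.putAll(event.stats);
            }
        });

        Map<String, String> otherStats = new HashMap<>();
        otherStats.put("jobStatus", "Failed");
        listener.onJobEnd(new JobEvent("job-999", otherStats)); // ignored

        Map<String, String> myStats = new HashMap<>();
        myStats.put("jobStatus", "Succeeded");
        listener.onJobEnd(new JobEvent(myJobId, myStats));      // published
        return published;
    }

    public static void main(String[] args)
    {
        System.out.println(runDemo());
    }
}
```

The jobId filter is the same condition discussed in the review comments below on `writerContext.job().getId()`.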

Testing

  • Validated stats logged from unit tests and in-jvm-dtest integration tests for the reader and writer.

@yifan-c (Contributor) left a comment:

It looks good in general. A rebase is required, though.

@arjunashok closed this on Apr 25, 2024.
@frankgh (Contributor) left a comment:

Some comments


import java.util.Map;

public class JobEventDetail
Contributor:

Can we add a javadoc here?

@@ -61,6 +61,14 @@ public CassandraBulkSourceRelation(BulkWriterContext writerContext, SQLContext s
this.sqlContext = sqlContext;
this.sparkContext = JavaSparkContext.fromSparkContext(sqlContext.sparkContext());
this.broadcastContext = sparkContext.<BulkWriterContext>broadcast(writerContext);
this.jobStatsListener = new JobStatsListener((jobEventDetail) -> {
if (writerContext.job().getId().toString().equals(jobEventDetail.internalJobID()))
Contributor:

Can we add a comment in the code mentioning why we need this condition here?

rowCount,
totalBytesWritten,
hasClusterTopologyChanged);
publishSuccessfulJobStats(rowCount, totalBytesWritten, hasClusterTopologyChanged);
}
catch (Throwable throwable)
{
publishFailureJobStats(throwable.getMessage());
Contributor:

Not sure what happens here; do we no longer publish failure stats?

arjunashok (Author):

Yes, we no longer explicitly publish failure stats at the point of failure. Instead, we rely on the job failure event, and the listener now publishes these stats.
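
A minimal sketch of that approach, with failure stats published from a single job-end handler instead of at the throw site. The class and method names here are assumptions, not the PR's actual API; the stat keys mirror ones visible elsewhere in this diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one job-end handler publishes either success or
// failure stats, so the persist() path no longer needs an explicit
// publishFailureJobStats() call in its catch block.
public class JobEndPublisher
{
    final Map<String, String> jobStats = new HashMap<>();

    // Invoked by the listener when the job-end event arrives
    public Map<String, String> onJobEnd(boolean succeeded, String failureReason, long elapsedTimeMillis)
    {
        jobStats.put("jobStatus", succeeded ? "Succeeded" : "Failed");
        if (!succeeded && failureReason != null)
        {
            jobStats.put("failureReason", failureReason);
        }
        jobStats.put("jobElapsedTimeMillis", String.valueOf(elapsedTimeMillis));
        return jobStats;
    }

    public static void main(String[] args)
    {
        System.out.println(new JobEndPublisher().onJobEnd(false, "cluster resize detected", 1200L));
    }
}
```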

put("sparkVersion", sparkVersion);
put("keyspace", jobInfo.getId().toString());
put("table", jobInfo.getId().toString());
put("keyspace", jobInfo.getId());
Contributor:

Is jobInfo.getId() the keyspace? Shouldn't we use qualifiedTableName().keyspace() here instead?

arjunashok (Author):

Yes, this probably got mixed up during the rebase. Corrected.

put("keyspace", jobInfo.getId().toString());
put("table", jobInfo.getId().toString());
put("keyspace", jobInfo.getId());
put("table", jobInfo.qualifiedTableName().toString());
Contributor:

Should this be qualifiedTableName().table() instead?

arjunashok (Author):

Yes, this probably got mixed up during the rebase. Corrected.

*/
void publish(Map<String, String> stats);

Map<String, String> stats();
Contributor:

javadocs?

arjunashok (Author):

This is no longer being used. Removed.

@@ -129,6 +129,7 @@ protected static CassandraDataLayer createAndInitCassandraDataLayer(

dataLayer.startupValidate();


Contributor:

NIT: unnecessary extra line?

@bbotella (Contributor) left a comment:

There is an opportunity here to move all the hardcoded stat names to a Consts file. Maybe worth a separate ticket?
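
Such a constants holder might look like the following sketch. The class name is an assumption; the keys mirror stat names visible in this PR's diff hunks.

```java
// Hypothetical constants holder for the hardcoded stat names suggested above.
// Centralizing the keys keeps reader and writer publishing consistent and
// catches typos at compile time instead of at stats-query time.
public final class JobStatsKeys
{
    public static final String SPARK_VERSION = "sparkVersion";
    public static final String KEYSPACE = "keyspace";
    public static final String TABLE = "table";
    public static final String CLUSTER_RESIZE_DETECTED = "clusterResizeDetected";
    public static final String JOB_ELAPSED_TIME_MILLIS = "jobElapsedTimeMillis";
    public static final String FAILURE_REASON = "failureReason";

    private JobStatsKeys()
    {
        // no instances; constants only
    }
}
```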

put("clusterResizeDetected", String.valueOf(hasClusterTopologyChanged));
put("jobElapsedTimeMillis", Long.toString(elapsedTimeMillis()));
Contributor:

Why are we removing the jobElapsedTimeMillis stat?

@@ -258,28 +269,17 @@ private void persist(@NotNull JavaPairRDD<DecoratedKey, Object[]> sortedRDD, Str

private void publishSuccessfulJobStats(long rowCount, long totalBytesWritten, boolean hasClusterTopologyChanged)
Contributor:

Does it make sense to keep the Successful name on the method if we are ignoring failure stats?

@@ -79,6 +82,15 @@ public CassandraBulkSourceRelation(BulkWriterContext writerContext, SQLContext s
ReplicaAwareFailureHandler<RingInstance> failureHandler = new ReplicaAwareFailureHandler<>(writerContext.cluster().getPartitioner());
this.writeValidator = new BulkWriteValidator(writerContext, failureHandler);
onCloudStorageTransport(ignored -> this.heartbeatReporter = new HeartbeatReporter());
this.jobStatsListener = new JobStatsListener((jobEventDetail) -> {
// Note: Consumers are called for all jobs and tasks. We only publish for the existing job
if (writerContext.job().getId().equals(jobEventDetail.internalJobID()))
Contributor:

Should we also check for !internalJobId.isEmpty()?

jobStats.put("failureReason", reason);
jobStats.put("jobElapsedTimeMillis", String.valueOf(elapsedTimeMillis));

LOGGER.debug("Job END for jobId:{} status:{} Reason:{} ElapsedTime: {}",
Contributor:

There is an extra space after ElapsedTime.

@@ -81,12 +82,14 @@ public class MockBulkWriterContext implements BulkWriterContext, ClusterInfo, Jo
new CqlField.CqlType[]{mockCqlType(INT), mockCqlType(DATE), mockCqlType(VARCHAR), mockCqlType(INT)});
private ConsistencyLevel.CL consistencyLevel;
private int sstableDataSizeInMB = 128;
private int sstableWriteBatchSize = 2;
Contributor:

What is this new variable doing? Where is it used?

4 participants