Writing custom Java objects to parquet
I have some custom Java objects (internally composed of other custom objects) that I want to write to HDFS in Parquet format.
Even after a lot of searching, most suggestions seem to involve using the Avro format and the AvroConverter that ships with Parquet to store the objects.
From what I have seen here and here, it looks like I would have to write a custom WriteSupport to accomplish this.
Is there a better way? Which is better, writing custom objects directly or using intermediate schema definitions like Avro?
Solution
You can use Avro reflection to get the schema. The call looks like ReflectData.AllowNull.get().getSchema(CustomClass.class). I have a sample Parquet demo code snippet.
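For illustration, here is a minimal sketch of deriving and inspecting such a schema, assuming a hypothetical Team class like the one used in the writer code below:

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

// Derive the Avro schema from the Java class via reflection.
// AllowNull makes every field nullable (a union with null).
Schema schema = ReflectData.AllowNull.get().getSchema(Team.class);
System.out.println(schema.toString(true)); // pretty-printed JSON schema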
In essence, the custom Java object writer is like this:
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import static org.apache.parquet.hadoop.ParquetFileWriter.Mode.OVERWRITE;
import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;

Path dataFile = new Path("/tmp/demo.snappy.parquet");

// Write as Parquet file; 'teams' is the collection of Team objects to write.
try (ParquetWriter<Team> writer = AvroParquetWriter.<Team>builder(dataFile)
        .withSchema(ReflectData.AllowNull.get().getSchema(Team.class))
        .withDataModel(ReflectData.get())
        .withConf(new Configuration())
        .withCompressionCodec(SNAPPY)
        .withWriteMode(OVERWRITE)
        .build()) {
    for (Team team : teams) {
        writer.write(team);
    }
}
You can replace Team.java with your own custom Java class. Note that the Team class contains a list of Person objects, which matches your requirement, and Avro can derive the schema for it without any problem.
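In case it helps, here is a minimal sketch of what such classes might look like; the field names are illustrative and not taken from the linked example:

import java.util.List;

// Plain Java classes (each in its own file). Avro reflection reads the
// private fields directly, so getters/setters are optional for writing;
// a no-arg constructor is useful if you later read the file back.
public class Team {
    private String name;
    private List<Person> members;

    public Team() {}

    public Team(String name, List<Person> members) {
        this.name = name;
        this.members = members;
    }
}

public class Person {
    private String firstName;
    private String lastName;
    private int age;

    public Person() {}

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }
}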
If you want to write to HDFS, you may need to replace the path with an HDFS URI, but I did not try that myself.
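A minimal sketch of what that change might look like, assuming a hypothetical namenode address (untested, as noted above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Point the Path at an HDFS URI instead of the local filesystem.
// fs.defaultFS is normally picked up from core-site.xml on the classpath;
// setting it explicitly here is just for illustration.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address
Path dataFile = new Path("hdfs://namenode:8020/tmp/demo.snappy.parquet");
// then pass 'conf' to .withConf(conf) on the AvroParquetWriter builder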
By the way, my code was inspired by this parquet-example code.