Apache Avro for all!

I am on a tech steroid and keep writing about technology topics. I assure you my next topic will be non-technical.

If I break my promise, just let it go, since I am spoiled beyond repair!

If you have read my blogs, you will have noticed that all my technology topics stay in the middle ground: not at the 10,000-foot level and not at the 10-foot level, but somewhere in between. Why? Many of my friends and readers asked me to write blogs that they can understand inside and out, but not at the coding level. They also do not want to stay so high-level that they get exposed when they have to explain things to their clients. Fair requests, to me!

What happens when you stay at the 10,000-foot level as a technologist? You know the peeling-the-onion story. I have seen so many people who claimed to be technologists but could go only one or at most two layers into the onion; after that they were not only exposed but also took the project down the drain.

Back to the topic now. What is this Apache Avro, and why is it a topic for discussion? Let me not stir up a war with other big data evangelists. Neither they nor I have time to debate; we would rather focus our energy on using everyone’s time wisely. Hey, did you notice that I have dodged the critics there?!

Let me make it very simple. I am biased, and if you don’t want to join me, wait until you have read the blog completely. I am sure I will have an army behind me by the time you finish reading. Am I overconfident?

At the end, if you still feel that I am biased, read my second paragraph again! This may be more of an article than a blog. I call it a lengthy blog that violates the rules of blogging. Someone told me “rules are meant to be broken”. Just making fun. Don’t break the rules. My intention is not to violate the blog rules but to avoid losing your thought process and mine by splitting this into two blogs.

In a big data platform there are many file storage formats available, and each has its own strengths and weaknesses. Like your cameras! You have so many choices, and we can’t say that one camera is better than the others or that one camera beats every other camera in EVERY aspect, because no such “ONE” exists. But can we pick one or two cameras as top of the line, offering the most benefits and functionality? Of course, yes. I am sure you are with me! Why did I pick cameras as an example? I love photography, and I have all the equipment: a professional-grade camera, lenses, flash and what not. My camera expenses are growing steadily, even exponentially, year over year, but the quality of my pictures has not improved! Don’t tell my wife, since she will either put a limit on my spending or, if she is in a bad mood, curb it completely. I am writing a blog for you and hope you will not kill my hobby. We have a gentlemen’s agreement!

Just as I have a choice of cameras, I chose one or two file storage formats which I believe offer greater benefits than the others in most common implementations. So which camera would you choose among top brands like Canon, Nikon and Sony? If you are not into cameras, which car would you choose among Mercedes, BMW and Porsche? Don’t get offended if those brands are not the ones you like.

My pick is the Apache Avro storage format over the others (compared to Text, Parquet, ORC, RC, etc.) when I am WRITING (ingesting). Note that this is for “WRITE” and not for “READ”. Pay attention to the upper-case letters.

Am I saying that for faster reading you need to use another format? Absolutely, but that is a discussion for another topic and not for now. If you are going to challenge why I would use one format for writing and another for reading, that is yet another lengthy topic. To keep this blog shorter, I assume you all know the reasons, and I am not diving into the format for reading now. I will write another blog on the file format for reading and my choice!

Avro

Let us talk about the benefits.

1. On top of all the other benefits it offers, this format delivers excellent, if not the best, write (ingestion) performance. It is a row-oriented format, so records can be appended as they arrive.

2. It works very well with Snappy compression. I love Snappy compression. Remember, you need a compression technology that consumes fewer CPU cycles.

3. It stores the data in a binary format but also “embeds the schema in it!” There is no need for an external schema to understand the data file. Of course, a few other formats also provide this capability, but Avro has it too!

4. But the biggest bang for the buck comes from its capability to accommodate “schema evolution”. Some people call it “Evaluation” by mistake, but it is “Evolution”. As of today, no other format supports schema evolution to the extent Avro does. This alone made me side with Avro. Unless you have been put in a situation like mine, getting a big data implementation done at an enterprise level with petabytes of data and 25K to 30K entities ingested per day, you may not understand the rationale behind my vote for Avro. To know how difficult it is to play quarterback, you have to play; only then will you know the reality. Unfortunately or fortunately, I was put in that situation. If you want to try other formats, by all means go and play against all those players, such as tackles (300+ pounds), linebackers (245+ pounds) and cornerbacks (200+ pounds) (I mean your clients), hungrily waiting to crush you. Or use others’ experience and stay on the sideline as a coach (like Bill Belichick), but make the right call for the team!
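For those who want to peel one more layer of the onion on point 3: an Avro schema is plain JSON, and an Avro data file carries that JSON in its header next to the name of the compression codec. Here is a minimal sketch of what such a schema could look like. The "Customer" entity, its namespace and its field names are my own made-up assumptions, and the header dictionary is only an illustration of the two metadata keys defined by the Avro specification; a real Avro library writes the header for you.

```python
import json

# Hypothetical "Customer" entity; record, namespace and field names are
# invented purely for illustration.
customer_schema = {
    "type": "record",
    "name": "Customer",
    "namespace": "example.ingest",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # A nullable field with a default value; defaults are what later
        # allow a reader to cope with data that lacks this field.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# Avro schemas are plain JSON, so the dict serializes directly.
schema_json = json.dumps(customer_schema, indent=2)
print(schema_json)

# Illustration only: the metadata keys an Avro data file keeps in its header.
# Values are shown as text here for readability.
file_header_metadata = {
    "avro.schema": json.dumps(customer_schema),
    "avro.codec": "snappy",
}
```

Because the schema travels inside the file, whoever opens the file later can decode it without asking the producer for a separate layout document.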

Believe it or not, many enterprise systems call for schema evolution support for both data ingestion and consumption. Trust me, the source system you consume data from can decide to change the structure of any file/table (aka entity) at any time, based on a change in requirements. In a nutshell, schema changes from the source system can happen multiple times in its life cycle.

What is the challenge?

You have an application written against an interface agreement with your source system. As per the interface agreement, the source system will deliver files with “X” number of attributes/fields, separated by commas or another delimiter (in the case of files); in the case of tables, you have written a select statement with the agreed-upon fields. (Don’t ask me why we can’t use select * for tables. If you are asking this question, you have to go back to the fundamentals of data extraction. For now, I assume you understand why we don’t use select *.)

What happens when the source system decides to remove (delete) a field which it no longer wants to carry forward? (It happens.) For a file-based source, your application fails, since you are expecting “n” fields but suddenly get “n-1” or “n-m” fields. For tables it also generally fails, since your select statement names a column that no longer exists; some variations can be accommodated, but let us not solve the exceptions now.

What happens when a field is added? For a file-based source, your application fails, since you are expecting “n” fields but suddenly get “n+1” or “n+m” fields. For tables your application may not fail, but it will not bring in the new field (since your select statement has only the known fields and not the newly added one)!

What happens when a field is renamed? For a file-based source, your application may or may not fail, but it will still use the OLD name. For tables it also generally fails; some variations can be accommodated, but let us not solve the exceptions now.
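If you want to see the file-based failure with your own eyes, here is a minimal sketch of a loader that reads comma-delimited lines by position. The three field names and the sample rows are hypothetical; the point is only that a positional parser has nothing to hold on to once the number of fields changes.

```python
import csv
import io

# Hypothetical interface agreement: exactly these three fields, in this order.
EXPECTED_FIELDS = ["id", "name", "email"]


def parse_positional(line):
    """Parse one comma-delimited line by position, as a hand-written loader often does."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(EXPECTED_FIELDS):
        raise ValueError(
            f"expected {len(EXPECTED_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(EXPECTED_FIELDS, values))


def try_parse(line):
    """Return the parsed row, or an error message if the layout no longer matches."""
    try:
        return parse_positional(line)
    except ValueError as err:
        return f"FAILED: {err}"


agreed = try_parse("1,Alice,alice@example.com")         # n fields: works
removed = try_parse("1,Alice")                           # n-1 fields: fails
added = try_parse("1,Alice,alice@example.com,Boston")    # n+1 fields: fails

print(agreed)
print(removed)
print(added)
```

A rename is quieter: the positional parser keeps working, but it keeps calling the field by its old name.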

What is schema evolution?

Schema evolution refers to the problem of evolving a database schema to adapt it to a change in the modeled reality. The problem is not limited to the modification of the schema. It, in fact, affects the data stored under the given schema and the queries (and thus the applications) posed on that schema. (Source: https://en.wikipedia.org/wiki/Schema_evolution)

In other words, in simplified form, schema evolution is the term used for how the data store and/or application behaves when the schema is changed after data has been written to the data store using an older version of that schema.

  • Avro is a file storage format that helps you protect all your consuming applications from any of these source system changes, even when a new file arrives with additions, deletions and renames of attributes, or any combination of them.
  • It also allows newly developed applications to consume OLD and NEW data without any impact, as long as the schema they read with declares defaults for added fields and aliases for renamed ones.
  • NO impact on your existing applications running in production, even if the source system changes its structure.
  • The enterprise will love you, since you have designed a system that allows changes in the file structure based on business needs, but NONE of the consuming applications will have to change!
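How can one application read both OLD and NEW data? Avro resolves the schema the data was written with against the schema the reader expects: fields are matched by name, fields the reader does not know are ignored, fields missing from the data are filled from the reader’s defaults, and renamed fields are matched through aliases. Below is a very simplified sketch of those rules in plain Python. It is not the Avro library and not its API; the resolve function, the schema and the sample records are all hypothetical, and the keys of each record stand in for the writer’s schema.

```python
def resolve(record, reader_schema):
    """Project a written record onto the reader's schema.

    Simplified stand-in for Avro's record resolution: match by name or
    alias, drop writer-only fields, fill reader-only fields from defaults.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        # Candidate names: the reader's own name first, then its aliases.
        candidates = [field["name"]] + field.get("aliases", [])
        match = next((name for name in candidates if name in record), None)
        if match is not None:
            resolved[field["name"]] = record[match]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return resolved


# Hypothetical schema the consuming application reads with.
READER_SCHEMA = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        # The field was renamed from "name" to "full_name" at the source.
        {"name": "full_name", "type": "string", "aliases": ["name"]},
        # The field was added later, so old data does not have it.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# OLD layout: still uses "name", has no "email", carries a "fax" nobody reads.
old_record = {"id": 1, "name": "Alice", "fax": "555-0100"}
# NEW layout: field renamed, "email" added, "fax" removed.
new_record = {"id": 2, "full_name": "Bob", "email": "bob@example.com"}

old_view = resolve(old_record, READER_SCHEMA)
new_view = resolve(new_record, READER_SCHEMA)
print(old_view)
print(new_view)
```

Both records come out in the same shape, so the consuming code never has to know which layout a file was written with. The one case that still fails is a field the reader requires but that has neither a value in the data nor a default, which is why defaults matter so much when you design the schema.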

Another question: what if I am getting the same data from two different sources (one from Equifax, and if you have not already, check now whether your data was compromised, and another from Experian)? You used to have a common format, but now Equifax has made a change (due to the identity theft incident they had!!): they added some fields, removed some and renamed some, while Experian continues to deliver the old format. You have ONE consuming application (you wrote it 5 years back on the assumption that the format is common between Equifax and Experian, and you did not have the foresight we have today to predict that there would be a data breach and the format would change!). Now your application will break, and you don’t have time to fix it, but somehow it should be able to handle both OLD (Experian) and NEW (Equifax) files. You came to the right place. Apache Avro can help you and save you. Go back and read the “What is the challenge?” section again. Then send me a blank check, with sufficient money in your bank account! No, you don’t have to, but I need you to return the favor: start contributing back to our community.

Can you build schema evolution without using Avro’s built-in capabilities? Of course, but why would you do that? Do you want to prove your technical mastery? We need people like you who are very smart, but we would like your time to go into something big by effectively using the tools that are already provided by the industry and backed by so many great people like you. So many Apache committers engineered and support Apache Avro, and we don’t want you to develop yet another workaround. Your potential can be used for something big, and I encourage you to use the one that comes free from Apache. Did I say “free”? Yes, it is free, and tell me, who does not love free goodies? So far I have put it in a nice way. The not-so-nice way is: “Who will support your engineering marvel when you are sick or on vacation, when it needs enhancing, when you leave, or when it has to be made compatible with new formats or new versions of the source?”

If you are going to engineer your own engine, I am not going to stop you, and good luck. How many people will buy the “car you designed in your garage” over the well-tested cars engineered by the manufacturers? There are so many reasons why you buy a car from a manufacturer. I don’t need to tell you why.

If you are still adamant: I am not talking about car enthusiasts who design for a specific goal and purpose. Exceptions are not counted or considered for broader use. Hope you are all with me! Having taken the pain to write this blog, I assume you are all on the side that talks about reality. Without asking you, I am assuming you are with me so far!

I see the next question: can I have full working samples that explain schema evolution (the life cycle)? I could provide them, but you will find tons of details on the internet, since so many selfless contributors have written them up with examples. You have better choices than me!

The purpose of this blog is a) for you to be able to talk to anyone about Avro’s benefits, b) to make recommendations based on your client’s needs when they fall into the cases I have explained (the majority will and a minority will not), c) to generate interest so that you do further research and look at the examples (pushing you to learn), and d) to differentiate yourself from others (you are a doer and NOT a talker).

Am I saying ALWAYS use the Avro format? Hey, I don’t get any kickback! Certainly not. That would be like recommending that you use your Mercedes-Benz as a garbage truck. I did not mean garbage is bad. What I meant is that a garbage truck is designed to collect garbage more effectively and efficiently than your car, so your car is NOT the right choice of technology for garbage collection. Likewise, your situation may call for another type of file format; never get hung up on Avro if it does not fit. ONE SIZE DOES NOT FIT ALL.

We are at the end of the blog. Did anyone think about which camera I support? If you have not, you were very focused on the “subject” and ignored my interests. I am disappointed, guys! No, I am not. I am extremely happy that you focused on what is needed and filtered out the noise. I win.

If you thought about it and also remembered not just the cameras and cars but also the technology details, you are the best! Coming back to cameras, I support “Canon”. Cars? I am neutral! Why do I like Canon? The major factors are the picture quality, handling, support and brand, but also, after investing quite a bit in it, I can’t move to another brand. I know I would have to live in my garage or somewhere else if I made another investment in another brand. I don’t want to ruin my quiet life. You know how it works if you are married? You can buy more gold and diamonds of the same kind, but nothing of your own interest. You know who I am talking about! If you like other brands, have fun; I am not asking you to change. I understand you will have the same problem as mine!

Before I conclude, are you with me? Do you see the value of Avro? If not, either re-read the blog or do some research. If even after that you are not a fan of Avro, fine: we need competition. There is no fun without competition.

Finally, just for fun, here is one of the photos from my visit to Boston. You can see the photographer! Who else but me could make beautiful Boston look so ugly in a picture? Now you know why my photography skills have not improved even with high-end cameras!

Boston - Winter.jpg

https://www.linkedin.com/in/manikandasamy/

BTW,

  1. While every caution has been taken to provide my readers with the most accurate information and honest analysis, please use your discretion before making any decisions based on the information in this blog. The author will not compensate you in any way whatsoever if you ever happen to suffer a loss, inconvenience or damage because of, or while making use of, information in this blog.
  2. If you like it or dislike it, post your comments. The former motivates me to share more with our community, and the latter helps me to learn from you.
  3. Pardon me for the grammar mistakes
