Road to Faster Insights and Decisions

Databricks Autoloader

Dear readers, I will slow down on Databricks blogs after this one, since I strongly believe Technology DEI (Diversity, Equity, and Inclusion) is important for all of us.

I also ran a LinkedIn poll to get your thoughts on the kinds of blogs you all would like to see.

  • Tech, Use Cases & Humor (94% of votes) – This blog is tailored for you
  • Tech & Use Cases (3% of votes) – Ignore the “Real life example” sections
  • Tech only (3% of votes) – Ignore the Use Cases and “Real life example” sections; check databricks.com for deep dives

Thanks for responding to the poll. Let us keep having fun sharing our thoughts.

Thank you very much, Asjad Husain, for your great editing job on this blog.

Why did I choose Databricks’ Autoloader as the next topic for our discussion? After reading the Delta Li(o)ve Tables blog, some of you have asked me:

  • How is the data received by the Delta Live Tables (DLT)?
  • What do I mean by Autoloader?
  • Why can’t we build the Autoloader using Hyperscalers (AWS, Azure, GCP etc.) provided technologies (Do It Yourself – DIY)?

Before you start reading this blog, a disclaimer:

  • This is not a comparison of Databricks Autoloader with other technologies like Snowflake Snowpipe, GCP Big Query streaming inserts, AWS Kinesis set of tools, Azure Event Hub / Event Grid with Azure Stream Analytics, Confluent Kafka Streaming with KSQL, Apache Flink, Apache Storm, or any other competing technology available on the cloud including the hyperscalers.
  • I practice and advise our clients on the above-mentioned technologies and many others based on their enterprise strategy without taking a bias on one technology over another.
  • Every vendor brings new capabilities. I admire, appreciate, and challenge them to make the world better with their innovative creations.
  • I generally write blogs for my friends, who are non-technologists, but are very interested in learning and talking about new technologies.
  • If you are looking for hands-on exercises or sample code, my recommendation is to use Google to find the many great ready-to-run examples. I highly recommend checking databricks.com.
  • Pardon me for the longer blog. The goal was to keep the thought process alive as opposed to having multiple parts.

What is an Autoloader?

Auto Loader is another innovative thought process from Databricks that leverages existing hyperscaler (AWS, GCP, and Azure) technologies to incrementally and efficiently process new data files as they arrive in cloud storage (S3, GCS, ADLS Gen 2, etc.) without any additional setup. Easy, right? Indeed, it is easy when you leverage Autoloader. If not, the process can be quite a bit lengthier.
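To make this concrete, here is a minimal sketch of what an Autoloader stream looks like in a Databricks Python notebook. The bucket path, schema and checkpoint locations, and table name are placeholders I made up for illustration; adjust them to your environment.

```python
# Minimal Auto Loader sketch (illustrative paths and names only).
# Runs on a Databricks cluster, where `spark` is already available.
df = (
    spark.readStream
    .format("cloudFiles")                                         # Auto Loader source
    .option("cloudFiles.format", "json")                          # format of the incoming files
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")   # where the inferred schema is tracked
    .load("s3://my-bucket/landing/orders/")                       # cloud storage landing folder
)

(
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")      # tracks which files were already processed
    .toTable("bronze_orders")                                     # continuously appends to a Delta table
)
```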

Courtesy: depositphotos.com (Educational purpose only) / Databricks.com

Real life Example: You can go and build your own car. Most of us want a car from an auto manufacturer that has an engine fine-tuned for performance, fuel efficiency, comfort, price, and so on. Some of us may not like the factory-installed sound system or the interior lighting that dances to our music, or we just want to add aftermarket performance boosters. The bottom line is that the car from the manufacturer should be flexible enough to accommodate aftermarket enhancements.

Databricks’ Autoloader gives you many out-of-the-box (OOB) capabilities while allowing you to customize/enhance to meet your needs.

Courtesy: Unknown Internet source; Educational Use Only

If you want to build your own car and strongly believe you can build one better than the manufacturer, you may not get much benefit from this blog.

Everyone must ask two questions about their business and its true value. Do you need near real-time processing? The answer is yes. Do you need real-time processing all the time? The answer is no.

Some examples of near real-time and/or batch processing: receiving raw data from electric meters (advanced metering infrastructure, or AMI), fitness watches, electric vehicles, advanced motor vehicles / mining / farming machinery equipped with thousands of sensors, weather feeds, health monitors, aircraft, credit cards, and so on into cloud storage, and processing it as soon as it arrives without manually establishing and managing complex technologies.

If you believe that it is too late to process the data after it lands and want to process the data while in flight (aka data in transit), there are several options available, such as Spark Streaming (check the Databricks website and Google), Kafka Streams (powered by KSQL; if you are interested, follow Kai Waehner, who has a number of articles on Confluent Kafka streaming use cases), Azure Stream Analytics (Microsoft site), Kinesis Analytics (AWS site), Snowflake Streams, etc.

Smart readers like you can make use of both options (data in transit processing and store & process in near real-time) to solve your special use cases.

This blog is about what happens after the files land into the object storage (The Data Sources/Topics box in the above diagram) and how quickly it can be processed. Readers are encouraged to visualize the “ingestion” with this concept in mind.

Value Proposition of Autoloader

Incremental Ingestions

No need to ingest files in a batch or scheduled mode. Prior to Autoloader, incremental data was generally processed in one of the following ways.

Batch/Mini/Micro Batch: Applications are designed to read the ingested files from object storage (S3, GCS, ADLS, etc.) at a pre-scheduled interval, say every 5/15/30 minutes, or a few times a day. This has two issues: increased latency and scheduling overhead. Some technologists innovatively developed file-watcher code that polls the directory for new file arrivals and triggers the next level of processing; this is an intuitive and amazing thought process. But they also had to own the code and its support, track which files were processed and the order in which they were processed, handle recoverability on failure, restart from the point of failure, deal with scalability, and so on. The great news is that great architects have designed complex systems to solve these issues; the sky is not even the limit.

Notification (Trigger based): With cloud vendors’ event-driven architecture, queues, notification technologies, lambda functions, etc., near real-time reading of files became relatively better. But there is still a good amount of work involved to set up the items I have mentioned. You must track which files are processed and the order in which they are processed, handle recoverability on failure, restart from the point of failure, achieve high-performance reads, and so on.

Some technologists (e.g., Databricks) thought that they could make near real-time processing much simpler so data engineers can focus more on higher-order-value services than on optimizing near real-time issues. They gathered every drop of water, and that added up to a cup, if not a whole bucket. The good news is that, at enterprise scale, saving a cup of water can be like saving a tank / container full of water when you must process large numbers of jobs that involve several thousand entities and several hundred thousand files.

Use Case(s)

  • Process files as soon as they are ingested
  • Never miss any files unintentionally
  • Read only newly arrived files and never process old files by mistake
  • Identify new files efficiently (never use DIY comparisons and checkpoints)
  • Handle schema inference and schema evolution
  • Run as streaming or scheduled / batch
  • Make it repeatable (parameterize)

Process as soon as the files are ingested

It is important to know that this requirement is not only about processing the files as soon as they arrive but also about doing so without setting up any additional or complex services.

How would we do this normally? We would have metadata/ABC tables with the last file processed and continuously poll (aka heartbeat check) every 60 seconds or so for newly arrived files. Is there a way to simplify this process without us polling, setting up, and managing metadata tables? Yes, by leveraging Databricks’ Autoloader capabilities.

Autoloader triggers automatically while maintaining the order in which the files arrive.

Real life Example: You can go for a manual car wash, pre-wash your car, clean your tires, wash with soap, rinse, towel dry, and then apply wax. OR use a drive-thru wash and skip the hassle of manually making all that effort to wash your car. You have saved both time and money (a manual wash costs more when you take longer to clean).

Image Courtesy: Unknown Internet Source. Educational Use Only. Not for Commercial use

Never miss any files unintentionally

When we manually set up the event-driven architecture, we must make sure we do not miss any files as they arrive. We have the responsibility to make sure the files processed match the files received. Traditionally, we store the metadata of all the files being processed and then compare that against the directory listing to certify that none of the files have been missed. This becomes trickier when applications must restart after system-related failures.

What if someone (or something) takes care of all of these steps for us? (Reading the directory faster, skipping the need to explain or prove to the audit team that all the files were processed, and guaranteeing that they’re ready for business use with proper data governance.)

Autoloader takes care of this auditability without the need for us to maintain explicit ABC/metadata management, while improving the speed of directory scanning.

What is this “unintentional”? If, for some reason, a particular file is bad, we need to stop the process, since processing subsequent files may have negative consequences when order matters. What if we want to skip the bad file? This is where it gets tricky. You may need to move the bad file to a bad-files location before you restart the process.

Real life Example: Let’s say you forgot to take the wax with you for the manual car wash; you will not be able to wax your car after the wash. You did not take the wheel cleaner, ditto! If you go to an auto car wash instead, they help you avoid these minor inconveniences by having all these items in stock and ready to go. All you need to do is sit in your car under the cool lights, enjoy your music, and drive away!

Efficiently identify new files

Let us think about how we generally design the system to identify new files.

We will design a metadata framework to store the last-processed-file marker (called a checkpoint). Then we list all the files in the directory to identify the files that are newer than our checkpoint and start the process. One of the slowest steps is directory listing. As a directory gets bigger and bigger, the time it takes to scan the files increases, thereby increasing data processing latency. The next slowest step is our own maintenance and management of the metadata/ABC table.

What if someone else takes care of this, without you working so hard to set up this functionality?

Autoloader processes only the new files, without slow directory listing, and never reprocesses the old files.
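As a rough sketch of how this looks in practice, Autoloader’s file-notification mode subscribes to cloud storage events instead of repeatedly listing the directory. The storage path below is an illustrative placeholder; the notification plumbing (queues, event subscriptions) is created for you.

```python
# Sketch of Auto Loader in file-notification mode (placeholder path).
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")    # subscribe to storage events instead of listing the directory
    .option("cloudFiles.schemaLocation", "/tmp/schemas/meters")
    .load("abfss://landing@myaccount.dfs.core.windows.net/meters/")
)
```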

Real life Example: This is simple. The car wash gate will open if you are an existing pre-paid customer with a tag. If you are a new customer, you have to pay manually or get a tag as a new customer in order for the gate to open! Do you think the gate opens if you keep coming back in circles with a pre-paid tag? Interesting question for a healthy discussion. The answer is yes, for the most part, since every one of your entries is a new entry. That begs the question: is there a way to prevent people from circling back into the car wash? Is there a way we can check whether the same file is coming back again and again?

Unfortunately, every arriving file is a new entry in the object store (you cannot have the same file stored more than once; when you overwrite a file, its timestamp changes, and since it may be a valid overwrite, Autoloader picks it up).

How many car wash companies will instantaneously stop pre-paid unlimited-wash customers from coming back in circles in real time? None, as far as I know. Is it worth it for them to add that intelligence to their system? Most likely not, since it is very rare. Do they have after-the-fact analytics to find how many times we have washed our car in a month? If we exceed “x” visits, they will increase our monthly premium.

Of course, you can have after-the-fact validation to catch these errors as part of DLT (now you know how DLT is integrated with Autoloader). It gets complex and trickier if you want to prevent it while ingesting!

Read ONLY newly arrived files

We can’t afford to have double counting even by mistake. It will have serious consequences such as audit issues, market impact and even penalties.

So, architects designed the system by adding bells and whistles to the application, like making an entry in the Audit, Balance and Control (ABC) tables as “Being Processed” so that resubmitting the same job does not pick up a file that is already being processed, and, after the file completes or fails, updating the ABC table with the status, and so on.

Is there a better way to handle this use case scenario?

Autoloader processes only the new files and never reprocesses the old files.

Real life Example: You drop your partner at a grocery store next to the car wash. You wash the car but get a phone call, so you step away to take the call under some nearby shade. When your partner comes back, they see the car parked and start washing it, since they could not find you and assumed the car had yet to be washed. Now you come back. We all know you did not say a word, but you spent twice the money! That extra wash of the car is just more money out of your pocket!

Schema Inference and Evolution

Courtesy: iStock

Schema Inference: One of the bigger pain points of DIY metadata handling is storing metadata to read the files in a specific format. It works, but is there a way to eliminate that process if the application has the intelligence to infer the schema from the data files? This is not a new concept; we all know that when we import CSV files, Excel automatically determines the data types, or, if we have a header, it automatically adds the header. Years ago, we had to write Python code to do this work. The difference is that this has become so easy that all we need to do is specify a parameter to infer the schema from a file or point to a location where the schema is stored. To go into even more detail: in addition to schema inference, you can also pass a hint if you know a specific column’s type (for example, double instead of integer).
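Here is a hedged sketch of what that parameter-driven approach looks like; the paths and the `price` column hint are purely illustrative.

```python
# Sketch of schema inference plus a schema hint (illustrative paths and columns).
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/sales")   # inferred schema is persisted and tracked here
    .option("cloudFiles.schemaHints", "price DOUBLE")            # override the inferred type for one column
    .load("s3://my-bucket/landing/sales/")
)
```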

Courtesy: depositphotos.com

Schema Evolution: We all know that source systems can add or remove columns but may not notify us.

Use Case(s):

  • Fail when new columns are added: Your data governance team does not want the new column to automatically appear for end-user consumption before they approve it. What if the new column contains Personally Identifiable Information (PII) such as an SSN, DOB, etc.?
  • Add new columns: There are many objects that do not contain any PII data. Time to market will be impacted if IT takes weeks/months to add new columns. We need the columns/attributes to appear as soon as the source system sends a new file. This will update the schema metastore.
  • Data type changes: Sometimes source systems make data type changes, such as changing an integer to a double. We should be able to ingest without failing our flows.
  • Ignore new columns: Do not bring in new columns. Just ignore them, but do not fail the process.
  • Rescue new columns: This is interesting. I do not want these added as regular columns (no update to the schema metastore) but instead captured as a JSON string in a rescued-data column. Why JSON? It is easy to store as many columns as we want as a single string that can easily be parsed.

Is there a better way to handle these use cases without developing a custom framework of code?

Autoloader has the capability to handle schema inference and schema evolution.
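A rough sketch of how those behaviours map to a single option; the mode names below are the ones Autoloader exposes, while the paths are placeholders.

```python
# Sketch of the schema evolution behaviours described above (illustrative paths).
# Pick ONE value for cloudFiles.schemaEvolutionMode:
#   "failOnNewColumns" - fail when a new column appears (let governance approve it first)
#   "addNewColumns"    - add new columns and update the tracked schema automatically
#   "none"             - ignore new columns and keep the existing schema
#   "rescue"           - do not evolve the schema; unexpected data lands in the _rescued_data column as JSON
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/devices")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("s3://my-bucket/landing/devices/")
)
```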

Real life Example: Let’s take a break from the car wash example for a second. You go to a restaurant and customize your food to add or remove ingredients. As each order comes in, the chef can add or remove ingredients per the customer’s taste. I do this all the time at an Italian restaurant, asking them to change the sauce, add jalapeños (I highly recommend you all do the same), remove mushrooms, replace angel hair pasta with penne, add extra chicken, etc. My boss told me I ordered something from the menu but nothing on the menu! But the chef can handle this; it’s just simple Schema Evolution!

Most times when I walk into a restaurant with my friends, I get water without ice while others get water with ice. They have inferred from experience that I take water without ice. This is a prime example of Schema Inference!

However, when they gave the same water without ice to our son, they failed to infer correctly, since he was raised in a different part of the world than me. So when we go to the restaurant, our son will add a hint that he needs water with ice. Schema Hints!

Run as Streaming and as a Batch

Use Case(s):

I need to load and process certain objects in

  • near-real-time (always on)
  • batch or scheduled mode (Cost control – As needed)

I want all of Autoloader’s capabilities, including the ability, in batch processing, to add more business rules such as merging new data with old data leveraging the Delta foundation!

You can have Autoloader continuously watching and processing, or schedule it at pre-defined intervals, without losing any of these functionalities. New files uploaded after an Autoloader query starts will be processed in the next scheduled run. By default, Autoloader processes a maximum of 1,000 files in every micro-batch, but there are ways to adjust this rate limit too.
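As a sketch, the same Autoloader source can be run always-on or as a scheduled batch just by changing the trigger; paths, table names, and the rate-limit value are illustrative, and `availableNow` assumes a recent Databricks runtime (older runtimes used `trigger(once=True)`).

```python
# Sketch: one Auto Loader source, two ways to run it (illustrative names and paths).
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")
    .option("cloudFiles.maxFilesPerTrigger", 500)   # tune the per-micro-batch rate limit (default is 1000)
    .load("s3://my-bucket/landing/events/")
)

# Continuous / always-on streaming:
(
    stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .toTable("bronze_events")
)

# Scheduled / batch mode: process whatever is new, then stop (run it from a job instead):
# (
#     stream.writeStream
#     .trigger(availableNow=True)
#     .option("checkpointLocation", "/tmp/checkpoints/events")
#     .toTable("bronze_events")
# )
```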

Real life Example: Let us take the fast-food restaurant example. You can get your food through the drive-thru, or you can pre-order and pick it up later at a scheduled time. Sure, you could say that we still have to wait at drive-thrus, but this is the closest example I can think of for fast processing. In Autoloader, there is no wait time if you want to load immediately!

Make it repeatable

One of the greatest strengths of an architect is to design all the above functionalities as a parameterized repeatable process. We call them frameworks, assets, and accelerators. Why is this important?

  • Eliminates redundant processes
  • Consistent implementation across thousands of objects
  • Developer bias and quality issues are eliminated
  • Huge cost reductions since manual work is eliminated
  • Speed to market, so objects can be delivered in minutes rather than days and weeks
  • High quality outcomes: once tested, it consistently delivers guaranteed quality, as opposed to one-off manual builds

Can we achieve all of the above benefits without architects and developers investing their time to develop all these capabilities? Indeed, yes.

Courtesy: Unknown Source (Partially)

Autoloader processes are configured to be repeatable.
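A minimal sketch of what such a parameterized framework could look like; the function name, defaults, and paths are my own illustration, not a Databricks artifact.

```python
# Sketch of a reusable, parameterized Auto Loader wrapper (names and defaults are made up).
def ingest_with_autoloader(source_path, target_table, file_format="json"):
    """Start an Auto Loader stream from source_path into target_table."""
    schema_location = f"/tmp/schemas/{target_table}"
    checkpoint_location = f"/tmp/checkpoints/{target_table}"

    df = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_location)
        .toTable(target_table)
    )

# The same function then onboards any number of objects from a simple configuration list:
# ingest_with_autoloader("s3://my-bucket/landing/meters/", "bronze_meters", "csv")
# ingest_with_autoloader("s3://my-bucket/landing/watches/", "bronze_watches")
```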

Real life Example: Automated car washes are either touch or touch-free. Cars are washed in the order in which they come in, the same amount of solution is applied, all selected options are applied (tire wash, undercarriage wash, wax, and so on), the process repeats for each car, more cars get washed in a short period of time, and so on. Also, for the record, I don’t own a car wash!!

Jump on the Wagon?

Like any other technology, one size does not fit all. You also need to understand when it is not going to be efficient. Just as you can’t take a sports car off-roading, you can’t expect every technology to work in every instance.

  • Evaluate whether Autoloader fits your use cases.
  • Gather all requirements. Sometimes, minor requirements may prevent leveraging Autoloader or vice versa.
  • Event-based orchestration is useful but not in all use cases.
  • Complexity increases when you have a dependency among data sources (In any streaming).
  • Evaluate file-handling failures. While Autoloader, with the help of notebook development, provides a certain level of failure handling, it might increase the complexity.
  • Understand the limitations of hyperscalers’ infrastructure quotas and limits. For example, in Azure, Autoloader uses Event Grid topics. You need to understand resource limitations such as event subscriptions per topic, Event Grid publish rate (events/sec or size/sec), custom topic limits, etc.
  • Understand streaming optimization techniques to tune the parallelism and the size of streaming micro-batches.
  • Manage how long a message lives in the queue (Time To Live – TTL) and what to do with messages that can’t be delivered (Dead Letter Queue).

Recap

So what exactly does Autoloader do?

  • It stores the metadata of what has already been read.
  • It leverages Structured Streaming for near real-time processing.
  • It builds the infrastructure by leveraging hyperscaler-specific components without us needing to worry about anything.

The power of Autoloader is not only in meeting the above use cases and enabling the capabilities discussed, but also in providing managed services while simplifying the multi-step, complex notification process for easy adoption. The goal is to let data engineers focus on leveraging the platform for higher-order business benefits rather than spending their time on backend infrastructure configuration and management.

Please read the Delta Li(o)ve Tables blog below to understand how Autoloader and Delta Live Tables are natively integrated to provide “faster time to market” for the business.

Like any other product in the market, feedback from the field (where the rubber meets the road) to Databricks is encouraged, to enhance Autoloader functionality in future releases. This will help us reap the benefits of Databricks’ engineering investments.

Disclaimer

  1. While every caution has been taken to provide readers with most accurate information and honest analysis, please use your discretion before taking any decisions based on the information in this blog. Author will not compensate you in any way whatsoever if you ever happen to suffer a loss, inconvenience, or damage because of information in this blog.
  2. The views, thoughts, and opinions expressed in the blog belong solely to the author, and not necessarily to the author’s employer, organization, committee, or other group/individual.
  3. While the Information contained in this blog has been presented with all due care, author does not warrant or represent that the Information is free from errors or omission.
  4. Reference to any particular technology does not imply any endorsement, non-endorsement, support or commercial gain by the author. Author is not compensated by any vendor in any shape or form.
  5. Author has no liability for the accuracy of the information and cannot be held liable for any third-party claims or losses of any damage.
  6. Please like, dislike, or post your comments. This motivates us to share more to our community and later helps me to learn from you!
  7. Pardon for any grammar errors and spelling mistakes!

https://www.linkedin.com/in/manikandasamy/

Delta Li(O)ve tables

I am quite thrilled to connect with you all after a very long time. For new readers, I generally write blogs for my friends, who are non-technologists but are very interested in learning and talking about new technologies. I used to write blogs during my travel for work, but like everyone else, the pandemic impacted my travel and blog posts. Slowly but surely, though, we are returning to normal!

Thanks to Asjad Husain for the proofreading as the work looks a lot better now.

Before you start reading this blog, a few disclaimers.

  • This is not a comparison of Databricks with other technologies like Snowflake, GCP Big Query, AWS Redshift, Azure Synapse, or any other competing technology.
  • I practice and advise our clients on the above-mentioned technologies and many others based on their enterprise strategy without taking a bias on one technology over another.
  • Every vendor brings new capabilities. I admire, appreciate, and challenge them to make the world better with their innovative creations.
  • Pardon me for the longer blog. The goal was to keep the thought process alive as opposed to having multiple parts.

What is a Delta Live Table (DLT)?

Delta Live Tables (DLT) is an innovative ETL framework invented by Databricks that uses a simple declarative approach to building reliable data pipelines and automatically manages the infrastructure at scale so data analysts and engineers can spend less time on tooling and focus more on getting value from data.

Diagram Courtesy: Databricks

Where does DLT fit into the above *Medallion Architecture? Everywhere from the Bronze to the Gold zone. Hard to believe, isn’t it? Wait till you reach the end of the blog to agree with me.

*Medallion Architecture https://www.databricks.com/glossary/medallion-architecture

Still complicated? Let us watch a short YouTube video that explains the above DLT concepts.

Courtesy: YouTube for educational purpose. Not for commercial use

Assembly lines are optimized for speed and efficiency, tasks are limited, and most lines can turn out products much faster than traditional methods of manufacturing.

Think of a similar situation: moving raw data from multiple sources (audio, video, sensor, RFID, financial, medical, insurance, and so on) within an enterprise into a data platform (the lakehouse medallion architecture, aka the assembly line) leveraging a “Delta Live Tables pipeline” that is optimized for speed and efficiency, where tasks are limited, and most data pipelines can turn out insights much faster than traditional methods.

Value Proposition of Delta Live Tables

Delta Live Tables (DLT) offers the following benefits, while Databricks continuously introduces new features on its roadmap. These benefits not only provide higher ROI but also accelerate development while enabling easy maintenance and support.

Automatic Lineage & Visualization DAGs (Directed Acyclic Graph)

Developers can develop the code in any order, within a single notebook or across notebooks. DLT automatically determines the dependencies between tables across the data pipeline(s), creates a visualization DAG, and checks for table dependency and syntax errors.

Use Case(s)

  • Identify syntax errors in the pipeline
  • Identify table dependencies across notebooks
  • Provide visual representation of lineage
DLT DAG

DLT automatically creates the above DAG visualization, provisions the infrastructure, executes the pipelines, and makes all operational statistics available to the user.
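To give a feel for it, here is a hedged sketch of two DLT tables defined in Python; reading one table from the other is all it takes for DLT to draw the bronze-to-silver edge in the DAG. Paths, table names, and the filter column are illustrative.

```python
# Sketch of a two-table DLT pipeline; the dependency creates the DAG automatically.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders landed from cloud storage")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/landing/orders/")
    )

@dlt.table(comment="Cleaned orders")
def silver_orders():
    # Reading from another DLT table is what creates the edge in the lineage DAG.
    return dlt.read_stream("bronze_orders").where(col("order_id").isNotNull())
```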

Real life Example: Filmmakers take shots in different locations in no particular order. However, once it goes through the complete edit processing, you will not only get to see the sequence, but a visual piece that flows well and makes sense. Just as the editor of the film takes the shots and organizes them into a comprehensive piece, in Data platform development, developers can develop in any order. When it is all put together, the DLT pipeline intelligently sequences and makes it a simple visual flow to understand the handshake between multiple developer components!

Data Quality and Error Handling

DLT provides built-in quality controls, testing, monitoring, and enforcement, so that data quality checks can tag, allow, or fail records based on the severity of the data quality rules, even while the data is streaming.

Use Case(s):

  • Flag bad records without dropping but continue the process
  • Drop invalid records if certain conditions are not met
  • Fail the job if there are data quality issues
  • Store the rules and reuse them across multiple objects
  • Capture run-time statistics of rows processed and failed, and data quality expectation metrics, including history

DLT simplifies implementing multiple data quality checks of your choice without needing to develop complex code, on either streaming or batch data. It helps prevent bad data from flowing into tables, lets you measure data quality, and provides tools to troubleshoot bad data. You can also store the DQ rules and apply them against multiple objects (reusable / portable).

Note: You can also define multiple expectations aka quality checks within a single constraint
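Here is a rough sketch of the three behaviours as DLT expectations in Python; the rule names, conditions, and table names are examples only.

```python
# Sketch of DLT expectations: tag, drop, or fail (illustrative rules and tables).
import dlt

@dlt.table
@dlt.expect("has_timestamp", "event_ts IS NOT NULL")             # tag violations but keep the rows
@dlt.expect_or_drop("valid_amount", "amount >= 0")               # drop rows that fail the rule
@dlt.expect_or_fail("valid_account", "account_id IS NOT NULL")   # fail the update on any violation
def silver_transactions():
    return dlt.read_stream("bronze_transactions")

# Reusable rules: keep them in a dictionary and apply them in one shot.
# rules = {"has_timestamp": "event_ts IS NOT NULL", "valid_amount": "amount >= 0"}
# @dlt.expect_all(rules) / @dlt.expect_all_or_drop(rules) / @dlt.expect_all_or_fail(rules)
```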

Real life Example: Just as editors cut, copy, move, and align movie clips for the best quality of the movie while tracking the wastage, the time it took to assemble, the resources used, the money spent, etc., DLT provides the capability to produce good quality data for business user consumption while also tracking many of the metrics explained above.

Slowly Changing Dimension

There is no need to write complex SCD Type 1 & 2 logic. Simplified functions take care of the complexity with a simple code block.

Use Case(s)

  • Implement SCD Type 1 (update records if they exist; insert otherwise) and SCD Type 2 (expire the old record and create a new record for updates, to maintain auditability / versions; insert new records)
  • Handle SCD by applying the correct order even when data arrives out of order
  • Fill in missing column values from previous data when the source sends a partial update (only a few columns)

SCD implementation used to be a multi-step data flow with either ETL/ELT tools or SQL; with Databricks, all the complex implementation is hidden behind simple specifications, including handling of out-of-order events using a user-specified sequence column.
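A minimal sketch of that specification using DLT’s APPLY CHANGES API in Python; the table, key, and sequence column names are assumptions for illustration (older runtimes used `dlt.create_target_table` instead of `dlt.create_streaming_table`).

```python
# Sketch of SCD handling with DLT's APPLY CHANGES API (illustrative names).
import dlt

dlt.create_streaming_table("customers")      # target table managed by DLT

dlt.apply_changes(
    target="customers",
    source="customer_updates",               # streaming source of change records
    keys=["customer_id"],                    # business key
    sequence_by="updated_at",                # lets DLT apply out-of-order events correctly
    stored_as_scd_type=2,                    # 1 = update in place, 2 = keep full history
)
```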

Real life Example: Sometimes retaining a flashback in a movie is important, when historical events matter to the story along with the time frame in which they happened. Slowly Changing Dimension Type 2 is about maintaining the history of events that happened to the data.

Another example:

  • After you place an order, the invoice will only display final quantities. If you make a change to the order (before it gets shipped) and add a new item, your final invoice will show only the latest change(s), as old information is lost by design. This is called Type 1 SCD, where old is updated and new is inserted!
  • For example, you go to Amazon, add items to your cart, update quantities, remove some items you don’t think you need, move some to the wish list, move some items to save for later, and then finally submit an order. Amazon keeps track of every action (data) you take over a period of time (historically) and later displays personalized ads related to your actions (very high business value). This is Type 2 SCD.
  • Disclaimer: for those who are technical, please do not say “Order” is a fact and not a dimension. This is just an example for my non-technology friends to understand.

Operational Metrics / Observability

Provides details of past run statistics at each table level, historical runs of pipelines over their life cycle, job success / failure status, individual and total run times, and an easy link to the Spark UI / Ganglia charts, logs, and operational metrics for performance tuning.

Use Case(s)

  • Monitor the state of the data pipelines, run time statistics such as records processed, rejected, table dependencies, runtime etc.
  • Capture run-time statistics of rows processed and failed, and data quality expectation metrics, including history

Databricks stores all the event logs for you to track, understand, and monitor the state of your data pipelines and makes it available within its user interface or even better through API calls.

DLT tracks table dependencies and provides a lineage diagram while storing all the data quality metrics in a log to understand data quality issues across your pipelines. DLT logs are exposed as a Delta table with all data expectation metrics, so you can generate reports to monitor data quality using Databricks SQL or the BI tool of your choice.

DLT provides monitoring UIs with run-time statistics such as the number of rows processed and failed, as well as metrics on a per-expectation basis. Users can look at historical performance (previous runs), allowing you to track pipeline performance and data quality over time.
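For readers who want to build their own reports, here is a hedged sketch of reading a pipeline’s event log as a Delta table; the storage path is whatever you configured for the pipeline, so the path below is a placeholder.

```python
# Sketch: query the DLT event log stored under the pipeline's storage location (placeholder path).
event_log = spark.read.format("delta").load("/pipelines/my_pipeline_storage/system/events")
event_log.createOrReplaceTempView("dlt_events")

# flow_progress events carry the row counts and data quality expectation metrics.
display(spark.sql("""
    SELECT timestamp, event_type, details
    FROM dlt_events
    WHERE event_type = 'flow_progress'
    ORDER BY timestamp DESC
"""))
```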

Real life Example: It is like capturing the details of how many times the shot was taken before it is perfected, how much of the film was wasted or used effectively, how long it took to get the shot, how good the set background was, how good the lighting was and so on. These learnings are important for the movie actors / actress to improve or filmmakers to control costs and identify where the issues are. If you are one of the producers, you would agree with me. If you are the director of the movie, guess what?

Automatic Infrastructure Management

Remove overhead by automating complex and time-consuming activities like task orchestration, error handling and recovery, auto-scaling, and performance optimization.

Use Case(s)

  • Scale up/down, out/in the infrastructure automatically based on the data volume without over or under provisioning while achieving an optimal performance
  • Intelligent cluster scaling that decides when to scale up or down based on the amount of data being pulled out of event-based systems such as Kafka, Event Hubs, Kinesis streams, etc.
  • Retry the pipeline automatically to better handle unexpected infrastructure failures

DLT automates operations: it eliminates challenging server sizing, handles failures with pipeline retries during infrastructure operational issues, and passes data volume and event queue details to the autoscaler to determine optimal cluster scaling (work in progress as per Databricks).

Real life Example: This is your production manager. All you need to tell him/her is how many actors/actresses are needed, how many cameras are needed, how many sets are needed, how many meal orders are needed, and so on. He/she will make it happen without you having to worry about these tasks. He/she has the capability to scale up or scale down all of the necessary resources through his/her contracting agencies behind the scenes.

An alternative example: when you sign up with system integrators like us (please read the disclaimers at the bottom), we not only take care of everything discussed in this blog, but we can also ramp up / ramp down the resources very quickly!

Workflow Integration with DLT

Complex logic can be written in a separate notebook and integrated with DLT through a job flow (Workflow).

Use Case(s)

  • Integrate existing notebooks of high complexity jobs with DLT pipeline
  • What you will need is
    • data flow visibility, table dependencies and aggregated data quality metrics across all data pipelines
    • row level logging for operational, quality and status of the data pipeline

With job flows, it is easy to define a workflow with dependencies to integrate traditional notebook logic with DLT, without needing a separate orchestration / scheduling tool.

Real life Example: Let’s divert from the filmmaking example for a minute. Think of workflow integration with DLT as having a manual assembly line that was built over the years and is now running smoothly. Recently, you built a new, highly automated expansion for a portion of your assembly line. Your final product depends on both the manual and automated processes working seamlessly in order to yield the product in a time- and cost-effective manner.

Streaming and Batch Simplified

An architecture that allows you to combine streaming and batch capabilities without needing to develop separate pipelines.

Use Case(s)

  • Process the data as soon as it becomes available for improved speed to market and decision making; invoke machine learning models to predict outcomes while streaming
  • Avoid setting up, managing, and supporting complex streaming technologies
  • Enable business users to develop streaming applications! (Streaming at the business users’ fingertips)

DLT integrates easily with Autoloader, efficiently streaming data files as they arrive, in near real time, without setting up complex cloud services. DLT will not reprocess duplicate files or process files out of order; it implements all the data quality rules and business logic, calls predictive models, and stores the data across all layers, making the data available as soon as it arrives. No more batch processing, scheduling, or waiting. Autoloader itself qualifies for a separate blog post, so I am not going to take a deep dive into it here.

Real life Example: In movies, the streaming equivalent is called single-shot cinema. A one-shot cinema, one-take film, single-take film, continuous-shot feature film, or “oner” is a full-length movie filmed in one long take by a single camera, or manufactured to give the impression that it was shot by a single camera. (Courtesy Wikipedia)

An alternative example: think of a newer machine capable of sending all its data in near real time (streaming), and older machines that store all their data, where a separate process acquires the data once a day from internal storage (batch). It is important that the technology provides the capability to process and integrate both streaming and batch data without developing two different pipelines.
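A rough sketch of that idea inside one DLT pipeline: one streaming (Autoloader) source, one once-a-day batch source, and a silver table that enriches the stream with the batch data. All paths, table names, and the join key are illustrative.

```python
# Sketch: streaming and batch sources living in the same declarative pipeline (illustrative names).
import dlt

@dlt.table
def bronze_sensor_stream():
    # Newer machines: files stream in continuously via Auto Loader.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/landing/sensors/")
    )

@dlt.table
def bronze_legacy_daily():
    # Older machines: a once-a-day extract read as a complete (non-streaming) table.
    return spark.read.format("parquet").load("s3://my-bucket/landing/legacy_daily/")

@dlt.table
def silver_readings():
    # Enrich the streaming readings with the daily reference data (stream-static join).
    return dlt.read_stream("bronze_sensor_stream").join(
        dlt.read("bronze_legacy_daily"), on="device_id", how="left"
    )
```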

Multiple language support

The capability to choose a commonly available language for ease of development.

Use Case(s)

  • Analyst: I would like to use common language such as SQL
  • Data Science and Data Engineers: I want the flexibility to use Python to pull in more complex libraries and take advantage of the power that Python offers, in addition to SQL

Users / Developers can leverage SQL or Python to build declarative pipelines – easily defining ‘what’ to do, not ‘how’ to do it. 

Real life Example: This is a fairly easy example. A large majority of movies in the present day provide multiple language support (not just subtitles or dubbed movies). Nowadays, you can select what language you want to watch your movie in and 9 times out of 10 it will be available.

Automatic handling of Data Structure Changes

When a source produces change records that contain only a subset of the fields in the target table, unmodified columns are represented as NULLs. Delta Live Tables can combine (coalesce) these partial updates into a single, complete, up-to-date row.

Use Case(s)

  • Accommodate source system table structure changes without impacting existing applications (they are protected from those changes), while still accommodating the changes.

DLT handles these changes, aka Schema Evolution. There is no need to develop complicated DDL to evolve the schema of the target table. Instead, columns are automatically added to the target table, including complex / nested schemas.

Real life Example: Movies are shot with different types of cameras without compromising the added or reduced functionalities and quality of the camera that was available on the date of filming.

Integration with Orchestration tools

Run DLT pipelines with your enterprise orchestration tool of choice.

Use Case(s)

  • Run DLT pipelines with your enterprise orchestration tool of choice
  • Integrate custom-developed Python UDFs within DLT
  • Schedule data pipelines either to run on a scheduled interval or continuously

DLT integrates very well with Apache Airflow and ADF (Azure Data Factory), while providing capabilities to run at a scheduled interval or continuously, so the data gets processed from the Bronze -> Silver -> Gold layers as soon as it arrives.

Real life Example: A Team of people on a set working together seamlessly (the director, actors, lighting crew, sound engineers, photographers, choreographers, etc.) to create a film.

DLT Exposed for Reporting and Dashboards

DLT pipelines integrate with business intelligence tools for reporting and dashboards.

Use Case(s)

  • Create near real-time / batch dashboards and reports against DLT tables with continuous data refresh
  • Spot anomalous behavior and display it visually as soon as it happens to avoid opportunity loss, equipment failures, security breaches, fraudulent transactions, etc.

DLT tables exposed through SQL endpoints (Databricks runtimes / SQL warehouses) can be queried either ad hoc or on a schedule, and the results exposed as reports/dashboards.

Courtesy Databricks – Sample Visualization within Databricks SQL!

The above visualization is a sample. Users can create visualizations leveraging Databricks SQL without having to use any special visualization tools. Evaluate your need for other visualization tools when complex visualization capabilities are needed that cannot be fulfilled by Databricks SQL visualizations at the moment.

Real life Example: I would like to see visual representation of daily / weekly / monthly movie revenues based on location, comparison of my movie revenue against other movies and so on.

Recap

The power of DLT is not in meeting just one or a few use cases, but all of the above use cases under one umbrella. Databricks’ investment in user-friendly Delta Live Tables notebook development, and new features that are expected to become GA, will continue to keep the development and business community excited to reap more benefits from the modern lakehouse platform.

Disclaimer

  1. While every caution has been taken to provide readers with most accurate information and honest analysis, please use your discretion before taking any decisions based on the information in this blog. Author will not compensate you in any way whatsoever if you ever happen to suffer a loss, inconvenience, or damage because of information in this blog.
  2. The views, thoughts, and opinions expressed in the blog belong solely to the author, and not necessarily to the author’s employer, organization, committee, or other group/individual.
  3. While the Information contained in this blog has been presented with all due care, author does not warrant or represent that the Information is free from errors or omission.
  4. Reference to any particular technology does not imply any endorsement, non-endorsement, support or commercial gain by the author. Author is not compensated by any vendor in any shape or form.
  5. Author has no liability for the accuracy of the information and cannot be held liable for any third-party claims or losses of any damage.
  6. Please like, dislike, or post your comments. This motivates us to share more to our community and later helps us to learn from you!
  7. Pardon for any grammar errors and spelling mistakes!

https://www.linkedin.com/in/manikandasamy/

Cloud Databases – I love it! Part IV

I have really enjoyed writing these four blogs. It has been an incredible journey. For the first time ever, I had to split a blog into multiple parts to provide a comprehensive view of each technology and its evolution. Thanks to my friends who pre-reviewed the individual blogs.

Last but not least, thank YOU all for traveling with me so far, for your encouragement through messages, posts, in person, and over the phone, and for all your likes on LinkedIn and WordPress. My blogs usually get about 750 to 1,000 views, but this topic doubled that and keeps growing. I did not think this topic would generate so much interest. It was possible only because of all your motivation and encouragement. THANK YOU.


In this final cloud databases blog, we will try to address the challenges and concerns that were raised in the previous blogs regarding legacy and big data processing architectures.

Recap from Part III

  1. You want to scale storage independently of compute, virtually without limit (scalability), and share your storage with any number of compute nodes (CPUs) without physically attaching CPU to storage.
  2. You want to scale your compute independently of storage, virtually without limit (scalability), and decide and change the compute power allocated to each person or group dynamically. If you add 10 CPUs, you should get a 10-times-faster response without any overhead.
  3. You want your compute not only to scale independently but to scale up and down based on your usage, and you do not want to pay for the full compute power when you are not using it. You want elasticity!
  4. You want the services that manage the platform to scale independently of your compute and storage.
  5. You don’t want any unwanted services running on your compute nodes, which should be dedicated only to processing application needs.
  6. You want to store data once but share it with many groups/environments. You don’t want to duplicate the data or be in the business of sending data and managing all the overhead. You also want the consumer to pay for their own compute power, which they can dynamically scale, and you don’t want to be in the business of sharing your compute power or managing service level agreements.
  7. You make mistakes, but you should instantaneously be able to go back “n” days and get the data back as of that day.
  8. You want your backups done automatically, except in special cases.
  9. You want all your data to be encrypted, with complete access management like any other database you have, without any security issues. Highly secure!
  10. Simply put, you want someone else to do all this for you, and you want to pay per use. I think you want a Swiss bank!
  11. You want a highly available and fault-tolerant system.

Before you read further, I just want to remind you that this is a blog, so it is almost impossible to write down everything about cloud databases here. We could go on and on and it would never stop, since there is so much to talk about with this technology, and it is amazing! When I chose the topic and added “I love it” to the title, I meant it: I truly love it. If you want more conversation, catch me at drink time so we don’t need to worry about time, who is around us, and what not. But I will accept the invite only on a day when it “snows”. At the end, you make the call on who should pick up the tab!

I ran out of farming examples to expand further on cloud databases. I hope you all can travel with me technically in this blog after going through the journey of the last three blogs.

There is also another surprise for you all. One of my long-term colleagues expressed his interest in co-authoring this part of the blog with me, so I am taking him with me on this journey. He is happy, but don’t tell him that I am taking him along just to put the blame on him if anyone leaves negative comments! Just kidding. Happiness multiplies when we share! The pleasure is mine.

So what do cloud databases offer? Everything on your wish list, plus more (read the recap).

Yes and Yes.

In general, we love cloud databases, including Google BigQuery and what it offers. But we love Snowflake a little more (similar to liking two brands of beer, but one gets a “little” more love from you than the other! Sarang, the co-author, disagrees on the beer; he thinks it has to do with Santa Claus). Let us take a deep dive now.

Enterprise data does not live only on-premises anymore, with trends in big data and with Salesforce and many other applications now hosted on the cloud. Do we all agree on this one? Also, enterprises have little or no control over the 3 Vs (Volume, Variety, and Velocity) of the data that is being generated. In later blogs we will address the other 2 Vs (Value and Veracity). Let us share an interesting fact to help you visualize the Vs.

“The average transatlantic flight generates about 1,000 gigabytes of data”.

One way / per flight. Do you see all 3 Vs here? The volume of 1,000 GB, the velocity at which it arrives, and the variety of data from the different sensors.

To all the Big Data experts (I know you have a lot to say here), wait till you finish reading the blog in its entirety. Also read Part III of this blog, where I have explained how big a difference Big Data technology has made, and is making, for us. I am also one of you, so there is no intention of underestimating other technologies here.

To meet the need of Vs, data platform MUST scale.

It can’t be constrained by disk, CPU or memory.

This is where the cloud’s elastic compute and scalability make their mark in the enterprise landscape. While there are a few enterprise-class cloud databases that leverage the cloud to its maximum, as we have mentioned, for simplicity and understanding we will pivot our discussion around Snowflake, the data warehouse built for the cloud.

We want to reiterate that our goal is to talk about cloud databases, not necessarily to favor one cloud database over another. Our Snowflake experience and in-depth knowledge helped us develop the content in a way that readers can understand.

We will take you through our journey and also explain why we all fell for it. We also understand that we can’t do justice to all of Snowflake’s functionality within this blog alone; we made a best effort only.

The features that top our list are below. We absolutely love its architecture. We hope you will feel the same way when you finish reading this blog.

VIRTUAL WAREHOUSE

A “T-shirt-size” cluster of compute resources in the cloud. A warehouse provides the required resources, such as CPU, memory, and temporary storage, to perform various operations.

How does a virtual warehouse help enterprises? Let us take a use case. We all know enterprises own one or more SMP, MPP, or big data platforms sized for some imaginary / anticipated / projected workload. Let us know if you disagree, or agree to disagree. You also know that we all have many different types of workloads in our enterprises: ETL aka data integration jobs, business intelligence (BI) tools accessing data for dashboards and reports, data mining teams running processes to discover patterns in large data sets, SQL analysts running large queries, and data scientists running predictive models against large volumes of data. Oh, I forgot to mention the data wrangling tools used by business analysts, pulling huge volumes of data and running queries with unknown patterns! Then the auditors don’t keep quiet either; they make sure we all have special / on-demand audit or restatement data processing requirements to meet their standards. The list and types of workloads go on!

All the above processes share the same resources, constrained by a box you purchased a few years back and continuously upgraded to its maximum limit! Experience has taught us that one single bad query can bring down the entire box or slow down EVERYONE ELSE. Let us not argue that you have workload management in place; we all know what it means and how complex it is. We are all used to hearing “Oh, that one rogue query brought the entire platform down and we need to reboot or restart,” or “we missed the Service Level Agreements (SLAs).” Missing SLAs not only causes bitterness from the consumers but also leads to heavy penalties for the delayed delivery.

How often have we dreamed of scaling resources up or down dynamically, based on compute, space, and memory needs, without impacting anything else going on in the system? I am sure you all must have imagined the “day” would come.

Let us take this even further. Would it be possible to have finer control, virtually unlimited expansion, and support for each user group to manage its own clusters based on its individual needs? Yes! This is exactly what Snowflake’s virtual warehouse does for you. Snowflake found the “magic key”.
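Here is a hedged sketch of what “a warehouse per workload” looks like, using the snowflake-connector-python package from Python; the account, credentials, warehouse names, and sizes are placeholders we made up.

```python
# Sketch: one T-shirt-sized virtual warehouse per workload (placeholder account and names).
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Each workload gets its own isolated compute; all of them read the same shared storage.
cur.execute("CREATE WAREHOUSE IF NOT EXISTS etl_wh WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE")
cur.execute("CREATE WAREHOUSE IF NOT EXISTS bi_wh WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE")
cur.execute("CREATE WAREHOUSE IF NOT EXISTS ds_wh WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE")
```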

The below diagram explains all we have talked about so far.


Figure – Day in the life of Enterprise Processing

How does this magic happen? Let us look at the architecture of Snowflake. Snowflake’s architecture is a hybrid of traditional shared-disk database architectures and shared-nothing database architectures.


Figure – Snowflake Architecture – © Snowflake Courtesy

Make a note: there are 3 layers, Storage, Compute (Virtual Warehouse), and Services, and all these layers SCALE INDEPENDENTLY.

Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the data warehouse. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture©.

You can have a separate cluster (compute) for separate workloads, all sharing the same data! Have you noticed that “the storage is independent of compute”? Any compute resource can access the storage since it is shared. For example, each department can choose the size of its own compute resources based on its needs while accessing the same data.

Look at “Figure – Day in the life of Enterprise Processing” again. Each of these workloads can get its OWN dynamically scalable compute resources and/or shared storage without impacting the others. This is the biggest differentiation and a huge leap!

There are NO more issues of one rogue query bringing entire enterprise operations down, or of missing any SLAs!

This is a great segue to introducing the next top feature, which is auto scaling.

AUTO SCALING

When compute resources are not in use, you can shut down the cluster or scale it down. You pay nothing when you are not using the system; you pay only for what you use! There is no sunk cost. It sounds too good to be true, but it is TRUE.

With traditional on-premises solutions, you must size and purchase hardware for the maximum load, based on the anticipated / projected future load. This is where Snowflake’s ability to scale is impressive. Let us say you have “M”-size compute resources for your regular workload. One fine day, due to some government regulation, you are asked to process 7 years of historical data to meet special auditing requirements. Simply select the next bigger compute cluster size you need, just like selecting your “T-shirt” size. That’s all that is needed. Since you are storing all data in Snowflake / S3, there is no need to bring historical data back from tapes. If you have not kept all the data within Snowflake / S3, then you need a better plan to get the data first.

Snowflake makes it simple: the next T-shirt size simply doubles the processing power. Instantaneous scale-up! Wow! Once your historical processing is done, you can scale down immediately too. Scaling up and down happens as soon as you make the request. We are sure many of you are already jumping up and down with joy! You can even change the warehouse size while queries are running. When the warehouse size is changed, the change does not impact any statements, including queries, that are currently executing. Once those statements complete, the new size is used for all subsequent statements.

The advantages are limitless. Departments can have their own warehouses and can be billed separately. Remember, the storage is shared, so departments won’t create data swamps. Let me take that back: they can create more data and swamps if you don’t have good data governance! What we mean is that they have the choice to request more compute power without a huge investment and pay only for what they use. Practitioners can now focus on data analysis and mining rather than worrying about sizing and CPU requirements.
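A quick sketch of the “next T-shirt size” idea, reusing the connection from the earlier sketch; the warehouse names and sizes are illustrative, and the multi-cluster settings assume an edition that supports them.

```python
# Sketch: resize for a one-off heavy job, then scale back down (illustrative names and sizes).
cur = conn.cursor()

cur.execute("ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XXLARGE'")   # jump up for the audit reprocessing
# ... run the 7-year historical reprocessing here ...
cur.execute("ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE'")     # back to the everyday size

# Or let a multi-cluster warehouse scale out and in on its own between 1 and 4 clusters.
cur.execute("ALTER WAREHOUSE bi_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4")
```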

Let us recap snowflake’s cloud database architecture.


Snowflake distinguished itself from other technology providers by offering capabilities that never existed before, or never in such an all-inclusive way.

Let us take another real-life scenario. How many weeks and/or months does it take enterprises to set up separate environments for DEV, QA, and PROD? You all know! Even after the initial setup is done, any time a lower environment needs to be refreshed with production data, you have to wait another week or so, in spite of so many people working behind the scenes to make the environment ready. One example from our own experience: “some” enterprises took months to set up a performance test environment. In many places you can’t even copy production data to a development environment without going through a bridge process (for security reasons). We know you have a long list from your experiences too!

You also needed separate infrastructure for these environments and separate people to manage them. Have you heard about prod and non-prod admins, DBAs, etc.? This is the biggest pain in the software / application development cycle. You don’t need to worry about this now. Snowflake allows you to stand up the environments in minutes (if you want to add additional security and other items, add a few more days for implementing data management processes) and WITHOUT DUPLICATING OR COPYING THE DATA.

It is called “Zero Copy Cloning”. You have to see it to believe it!.

ZERO COPY CLONING

You can create "n" different environments without copying a single bit! Once you have your clone, you can change, add or remove anything in your environment. It is visible ONLY to you, everyone else sees their own data, and yet there is ONLY ONE COPY of the original data that was cloned. The same concept applies to data security: you can apply common security on cloned objects and/or add more on top in your environment for your own needs. Is it not amazing?

So what is a clone, and how is it "zero copy"? A clone does NOT duplicate the data. It is like a pointer to the original data as of the time the clone is created. Each consumer can access their copy through the clone, add, modify or update data, or even combine it with other data they already have.

With the cloned copy, the sky is the limit. You can select your own "T-shirt"-size clusters instantaneously to perform whatever operations you want (analytics, data mining, data wrangling, etc.) based on your individual needs.
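Here is a minimal sketch of zero-copy cloning through the Snowflake Python connector; the database and table names are made up for illustration:

```python
import snowflake.connector  # assumes the snowflake-connector-python package is installed

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***")
cur = conn.cursor()

# Stand up a full DEV environment from PROD in one statement.
# No data is copied; the clone initially points at the same underlying storage.
cur.execute("CREATE DATABASE dev_db CLONE prod_db")

# The clone is writable: changes made here are visible only in the clone,
# while the original PROD data stays untouched.
cur.execute("UPDATE dev_db.public.orders SET status = 'TEST' WHERE order_id = 1")

cur.close()
conn.close()
```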

Snowflake did not even stop there. They have added more gifts for the enterprise community.

How many times do enterprises have to share data within the organization without handing out copies of the underlying data? How many hundreds or thousands of times do we FTP/SFTP files EVERY DAY for downstream and other consumers? We all know we have dedicated systems and resources allocated just to perform these thousands of duplicate file transfers every day in every enterprise.

Not anymore. We know you don't believe it. Time to change! "Zero Copy Cloning" will solve all your FTP / SFTP / file-sharing problems. *Remember, your end users must have access to the clone; if they are not users of your Snowflake ecosystem, you will still need the other file-sharing mechanisms in your current enterprise system.
So what is this called? It is called "Data Sharing".

DATA SHARING

How do we do that? Simple. You just create a clone and give the consumers access!

This is an important concept because it means that shared data does not take up any storage in a consumer account and, therefore, does not contribute to the consumer’s monthly data storage charges. The only charges to consumers are for the compute resources (i.e. virtual warehouses) used to query the shared data.
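For the curious, a hedged sketch of what setting up a share can look like; the database, table and consumer account names here are hypothetical:

```python
import snowflake.connector  # assumes the snowflake-connector-python package is installed

conn = snowflake.connector.connect(account="provider_acct", user="my_user", password="***")
cur = conn.cursor()

# Create a share and expose one table through it -- no data is copied anywhere.
cur.execute("CREATE SHARE sales_share")
cur.execute("GRANT USAGE ON DATABASE sales_db TO SHARE sales_share")
cur.execute("GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share")
cur.execute("GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share")

# Attach a consumer account to the share; they query it with their own
# warehouse and pay only for their own compute.
cur.execute("ALTER SHARE sales_share ADD ACCOUNTS = consumer_acct")

cur.close()
conn.close()
```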

We have been saying "data has gravity and process does not." This is why the revolutionary concept of data sharing is important: this new paradigm fundamentally disrupts the notion of data gravity.

We’re [Snowflake] now revolutionizing this [Data Sharing] and allowing people to share data between organizations without having to move the data. Nobody’s done that before. The future of data warehousing is data sharing.- Bob Muglia, chief executive, Snowflake.

PART III RECAP vs SNOWFLAKE

All your requirements and wish lists are addressed!

[Image: wish list recap]

Time Travel

Pay a little more attention to "Time Travel". It offers benefits that were not easy, or not even possible, to achieve with other technologies. We are very excited about it and hope you will feel the same way once you understand it. Let us take another real-life scenario we all have to deal with.
What happens when your production processes update (insert, delete, update) a set of tables with data that is later identified as incorrect? The only way to recover is to get the table data from yesterday's backup (prior to the updates), reload it, and then reprocess with good data. You know how long that takes and how many people are involved. What happens if you have to do this for several tables? The problem multiplies. You become the "VIP" of the day with your senior leadership, for all the wrong reasons.

With "Time Travel", you can access point-in-time data instantaneously without going through the painful backup-and-restore cycle. The data retention period of each object specifies the number of days for which this historical data is preserved, so Time Travel operations (SELECT, CREATE … CLONE, UNDROP) can be performed on it. You can set the retention anywhere from 0 to 90 days and go back to ANY point in time within that window instantaneously. A big one!
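A small sketch of what that looks like in practice (table names and the one-hour offset are arbitrary examples):

```python
import snowflake.connector  # assumes the snowflake-connector-python package is installed

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***")
cur = conn.cursor()

# Keep 90 days of Time Travel history on a hypothetical table.
cur.execute("ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90")

# Query the table exactly as it looked one hour ago (offset is in seconds).
cur.execute("SELECT COUNT(*) FROM orders AT(OFFSET => -3600)")
print(cur.fetchone())

# Restore a pre-mistake copy without touching any backups...
cur.execute("CREATE TABLE orders_restored CLONE orders AT(OFFSET => -3600)")
# ...or bring back a table dropped within the retention window.
cur.execute("UNDROP TABLE orders_dropped_by_mistake")

cur.close()
conn.close()
```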

MORE

We have added more items than you asked for, since we knew you would be asking or challenging us soon.

SEMI STRUCTURED DATA

What about handling semi-structured data formats such as JSON, Avro, ORC, Parquet, and XML?
Snowflake provides native support for semi-structured data (see the short sketch after this list), including:

  • Flexible-schema data types for loading semi-structured data without transformation
  • Automatic conversion of data to an optimized internal storage format
  • Database optimization for fast and efficient SQL querying
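As a rough illustration, a single VARIANT column can hold raw JSON and be queried with path notation; the table and fields below are invented for the example:

```python
import snowflake.connector  # assumes the snowflake-connector-python package is installed

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***")
cur = conn.cursor()

# One VARIANT column stores the JSON documents as-is (schema-on-read).
cur.execute("CREATE TABLE raw_events (v VARIANT)")
cur.execute(
    "INSERT INTO raw_events "
    "SELECT PARSE_JSON('{\"user\": {\"id\": 42}, \"event\": \"click\"}')"
)

# Query nested fields directly with path notation and cast as needed.
cur.execute("SELECT v:user.id::NUMBER AS user_id, v:event::STRING AS event FROM raw_events")
print(cur.fetchall())

cur.close()
conn.close()
```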

CONTINUOUS LOADING / STREAMING

Snowpipe is Snowflake's continuous data ingestion service. It handles continuous loading of streaming data and uses serverless compute to load the data into Snowflake.

You pay only for the per-second compute utilized to load data instead of running a warehouse continuously or by the hour.

How about the number of clusters needed? Does Snowflake use a bigger cluster than necessary, or do I get charged more for this special service? Not at all. Snowpipe automatically provisions the correct capacity for the data being loaded. No servers or management to worry about.
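A hedged sketch of defining such a pipe; the stage, bucket and table names are placeholders, and the cloud-side event notifications still have to be wired up separately:

```python
import snowflake.connector  # assumes the snowflake-connector-python package is installed

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***")
cur = conn.cursor()

# Hypothetical external stage pointing at the bucket where new files land
# (in practice you would also configure credentials or a storage integration).
cur.execute("CREATE STAGE IF NOT EXISTS events_stage URL = 's3://my-bucket/events/'")

# The pipe continuously copies newly arriving files from the stage into the table;
# AUTO_INGEST relies on cloud event notifications (e.g. S3 -> SQS) being set up.
cur.execute("""
    CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_events FROM @events_stage FILE_FORMAT = (TYPE = 'JSON')
""")

cur.close()
conn.close()
```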

How about streaming (sub-second processing and storing)? You can still achieve this to an extent using Snowpipe, but it depends on what you are looking for. Some edge or special cases may need a different technology. Think about your use case and whether Snowpipe can help before you make a call on another technology, because you get all the other benefits that come with Snowflake. This is a big topic for another time.

So what do cloud databases offer? Everything on your wish list, plus more (read the recap).

Yes and Yes.

FINAL THOUGHTS

Don't assume one technology will solve all your issues. Maybe it will; maybe it won't. Take unstructured data, for example: you have to use a different technology to address it. You also can't have every competing technology at once, since it is too painful to maintain and you will spend more time debating which one to use. Just like you can't have every kind of car in your garage.

ALWAYS look at the big picture, your end-to-end benefits (trust us, and don't ever do any comparison in isolation), and where others are heading.

We have witnessed organizations and market leaders with every resource at their fingertips lose big opportunities in new technology areas, while others came from nowhere and captured the market. We have all witnessed, and are still witnessing, the downfall of big players simply because they did not think proactively.

These new technologies will be very helpful; it is worth taking the risk, and it is a calculated risk but a giant leap. Don't wait for others to jump in or wait to learn from their mistakes. By the time you learn from others, you are either chasing them or the opportunity is lost. Get in there first. Sure, you will have some challenges at times, but it is worth it. Of course life is full of challenges (including collaborating with the co-author!). But as technologists, these are the challenges that define us, groom us and humble us. Remember "fail fast".

“If there is no risk; there is no reward”

So don’t end up in a situation of chasing your competitors.

Summing up: having worked with cutting-edge technologies and built very large-scale data warehouses, big data platforms, cloud implementations and analytics systems from the ground up, we would like to conclude with our own view and recommendation. Cloud databases like Snowflake provide modern, cutting-edge transformation technology that was not available before. Adopting cloud database technology will bring significant business benefits and a positive impact on the data-driven enterprise and the overall data management landscape.

So what are you waiting for?

Embrace the new technology; you will lose nothing!

Thank you, Sarang, for joining and for your contribution as co-author, especially for accommodating all my pushes and revisions!
My next blog? I am still debating between technology and leadership!
Until then… Cheers.

  1. While every caution has been taken to provide readers with most accurate information and honest analysis, please use your discretion before taking any decisions based on the information in this blog. Author(s) will not compensate you in any way whatsoever if you ever happen to suffer a loss / inconvenience / damage because of / while making use of information in this blog
  2. The views, thoughts, and opinions expressed in the blog belong solely to the author(s), and not necessarily to the author’s employer, organization, committee or other group or individual
  3. While the information contained in this blog has been presented with all due care, the author(s) do not warrant or represent that the information is free from errors or omissions
  4. Reference to any particular technology does not imply any endorsement, non-endorsement, support or commercial gain by the author(s). Author(s) are not compensated by any vendor in any shape or form
  5. Author(s) have no liability for the accuracy of the information and cannot be held liable for any third-party claims or losses of any damage
  6. Whether you like it or dislike it, post your comments. The former motivates us to share more with our community and the latter helps us learn from you, but neither is going to stop us!
  7. Pardon us for any grammar errors and spelling mistakes which we know there are many!

https://www.linkedin.com/in/manikandasamy/
https://www.linkedin.com/in/sarang-kiran-patil-pmp-pmi-acp-183b535/

Cloud Databases – I love it! – Part III


If you have not read the previous two blogs, "Cloud Databases – I love it! – Parts 1 & 2", it may be time to read them so you have a better understanding of this one, but it is not mandatory if you can keep up.

Farming at scale became a big issue for my dad. But my dad's friend had another idea, and now we know why he was rich and we were not! Let us call him Mr. Big.
Mr. Big came up with a model of having many small farmers like my dad (the one who invented the dedicated-resource model) under his umbrella. Whenever Mr. Big had more demand, he assigned the jobs to a group of farmers, or to all the farmers in his system, based on what was needed to meet the demand. Since Mr. Big had many small farmers, he could decide which demand was fulfilled by whom, who was nearer, how many farms were needed to fulfill the request, and so on. As demand grew, he hired a dedicated resource manager to run his distribution system effectively. He was able to meet the needs of the market while using the farmers only when needed. Very simply, he could scale up and down easily. Since he was using small farmers in his network, it was easy for him to offer low-cost incentives to keep them. While the farmers used tractors, Mr. Big brought in high-speed large trucks so the traffic between farms could be handled effectively.

Whenever Mr. Big had more demand than anticipated, he added more small farmers to his group. Whenever a new farmer came in, he made sure they were more or less the same size in terms of delivering goods, so he did not need to manage them differently. And whenever his system expanded with new farmers, he made sure every farmer received the newer set of operating instructions. Net net: he could scale up or down based on need while making sure each farm had its own operating instructions. Mr. Big was really big. We were all impressed. With small farmers under his umbrella and comparatively small investments, Mr. Big opened the gate to experiment with opportunities that were not possible before. He encouraged farmers to try inter-crop farming, such as banana, cocoa or pepper inside the coconut plantation (multiple open source software for research and analytics); he had different kinds of farmers, such as sugar cane, peanut, paddy and corn farmers (multi-tenancy), under one umbrella. He even tried new styles of farming and new types of crops. Prior to this model he had never been able to experiment with unknown potential. He was able to meet the different needs of different demands (workloads) at scale.

Since he had so many small farmers, when demand came in real time he could fulfill the request by sending it directly to the farm where the needed item was available.

Don't you think Mr. Big invented multi-tenancy, real-time processing and open source?

With the success he had in coconut farming, he ventured into other kinds of farming, not limiting himself to anything, but with the same model of "distributed and dedicated" farmers. He did not care about what he was getting (no schema on write?). But he made sure there were many groups of farmers and each group had the items he needed (each has a portion of the data, distributed). Some shops needed a combination of coconuts and sugar cane; some needed a combination of coconuts, vegetables and fruits (schema on read?). Since he knew which group of farmers had what (sounds like a Namenode?), he could fulfill requests faster. If there was a road repair at one farm (node failure?), some equipment was down (one cluster down?), or the resources were slow to respond (node slow?), he knew which other farms had similar items and diverted the trucks there (sounds like node selection with data replication?).

I think Mr. Big changed the paradigm in this area. A big one. Do you all agree?

Let us get a little more technical now.

Business Model Paradigm Shift

I assume we all know the 3 Vs of data (3 became 4 and then 5 Vs). I will stick to 3 Vs for ease of discussion: tremendous (V)olume of data growth, (V)elocity, the speed at which the data arrives, and (V)ariety of data. I don't think I need to explain these further, since everybody talks about the 3 Vs better than me, and about how, where and in what forms the data is coming. Trust me, I could send in some of my friends who are actually non-technology people to talk about this topic; they are so impressive and knowledgeable that I would bet on them. Remember, I chose them only to talk.

With the Vs, the way business operates started to change. A big-time change. Business users wanted to access all kinds of data (video, audio, text, PDF, RFID, you name it), analyze it based on new patterns they had not thought about or had never been able to explore with previously available technology, and not be bound by how the data is distributed or stored or by known patterns (boring, you know!). Business had accessed and analyzed data based on known patterns for decades, because that was the only choice. BUT NOT ANYMORE. They want the capability to ask questions they were never able to ask, and to get the results back before the opportunity is lost. Time to market became more important than ever before.

Who says technology is making life better? Of course it makes life better for the people who are using it (like my wife, kids, my parents, etc.). But to make them happy, people like you and me have to sweat a lot! Trust me, I lost all my hair, and so did many fellow technologists, thanks to the continuous, never-ending needs!

The technologists faced a difficult situation again with all these business needs and demands. As we all know, when one door closes, another opens. There are so many smart people in this world who care about solving the most challenging problems, more than we care or more than we could solve. They don't take no for an answer.

These extremely smart people came up with the idea of something like a shared-nothing architecture, but built from a virtually unlimited number of LOW-COST minicomputers (horizontal scalability), without having to spend a lot as with SMP or MPP systems.

So whenever you hit the vertical scalability wall, just add more commodity-type minicomputers, as many as you like, and your problem is solved. We are not talking about 1 or 10 or 100; several thousand commodity-type computers can be added and all work together.

Rocket science, isn't it? Yes, it is a huge leap and a paradigm change.

You must be asking me: how much more money now? What would you say if I told you the cost is about 10 times less than a shared-nothing computer? Am I going crazy? Not at all. Do you want to know more? The software is not a Coca-Cola formula anymore. It is not proprietary. It is called open source. What about the software to develop business applications? That became open source too, and everyone in the world can learn it freely on their laptop.

Yes. I am talking about what you are thinking. Big Data & Hadoop.

Big Data Architecture

Enterprises started embracing much-needed big data platforms and related technologies to turn impossible analytics into possible ones, while spending less money and dramatically improving time to market. They were able to find insights they had never been able to find. These low-cost, highly scalable technologies could do what supercomputers used to do. Well, almost! High-volume real-time analysis and IoT (Internet of Things) became reality. Most importantly, these technologies helped store and analyze unstructured data such as audio and video, which was never possible before. They made high availability and fault tolerance a no-brainer without added complexity or high cost. You know very well how complex your solutions were with the technologies that came before Big Data. Do you all agree this is the game changer of the decade? Say yes, since I spent so much time writing this blog! Since I wrote a very detailed blog on Big Data in the past, I am not going any deeper here.

Let us all give a big shout now: Go Big Data Go! I can't say enough how much I appreciate this technology.

So all issues are resolved and there are no more worries? Of course not. As humans we don't settle down and live happily with what we have. We need more and more and more, but conveniently we will say "I am not after ….." Politically we say, "I am driven and I love what I am doing." I hear you and I believe you!

While these Big Data technologies helped a lot and resolved many issues, each computer still owned a portion of the data. So if I want a faster response, I can't just add another set of computers; I also have to re-gather all the data and redistribute it across all computers, including the new ones. The software used to develop applications needed specialized skills that are not easy to find. Quite a bit of workload management is needed to make sure low-priority requests are not taking up resources, and maintaining the infrastructure needs expert skills. Response times were not quick in general unless special software and tools were used. Hey, I love these technologies, I understand I exist by their courtesy, and without any doubt Big Data made the impossible possible. Still, it has gaps to be filled! I would ask you to read the link shared here to understand more; it was shared by one of my friends, and thanks to him. And please do NOT take it to the extreme, since these Big Data technologies changed the era and there is a lot of good in them.
https://venturebeat.com/2018/10/06/cloudera-and-hortonworks-merger-means-hadoops-influence-is-declining/amp/
I am just waiting to see how the merger of two big rivals pans out for the benefit of customers and the technical community. I truly believe the best comes out when there is competition. Now they have become one. Time will tell. Until then, we will sit and watch!

One size does not fit all

I want you to understand that each of these technologies has its own strengths and uses. The goal of this blog is not to say anything is bad, but to bring out the challenges we have had and how the new technologies are helping us remove those roadblocks.

Just as you need different kinds of players in basketball, such as center, power forward, small forward and guard, to make a team, each of these technologies plays a role, and it is your need that determines what should be used when.

What next?

You have heard it all, and I have also heard all your pains; time to close Part 3. Now take a break and let me know what you really want. I am waiting here…


You guys came back very quickly, but the list is too long! We used to give a Christmas gift to our son every year without asking for his wish list. One Christmas we told him to write his wish and hang it on the Christmas tree so that Santa could get him what he really wanted. We were stupid. That night the lights were on in his room for a long time, and we thought he was struggling to write down the one thing he wanted. We went to sleep, and late at night went to look at the Christmas tree so we could go to the store to buy it. We could not see the tree at all. It was covered in his wish letters!

I made the same mistake again in asking you guys! What do you call people who don't learn from their mistakes?

Let me summarize what you all needed. Thanks for all your questions and the greedy wish list.

Recap

If I have it right, all of you wanted a technology that solves not only all the items mentioned above but more! Let me know if I have not captured your whole long list of wishes.

  1. You want to scale storage independently of compute, virtually without limit (scalability), and share your storage with any number of computers, aka CPUs, without physically attaching CPU to storage.
  2. You want to scale your compute independently of storage, virtually without limit (scalability), and decide and change the compute power for each person or group dynamically. Something like selecting a "T-shirt" size of S, M, L or XL. If you add 10 CPUs, you should get a roughly 10-times-faster response without extra overhead.
  3. You want your compute not only to scale independently but to scale up and down based on usage, and you do not want to pay for the full compute power when you are not using it. You want elasticity!
  4. You want the services that manage the platform to scale independently of your compute and storage.
  5. You don't want any unwanted services running on your compute nodes, which should be dedicated only to processing application needs.
  6. You want to store data once but share it with many groups/environments. You don't want to duplicate data or be in the business of sending data and managing all the overhead. You also want each consumer to pay for their own compute power, which they can scale dynamically, and you don't want to be in the business of sharing your compute power or managing service level agreements.
  7. You make mistakes, but you should be able to instantaneously go back "n" days and get the data back as of that day.
  8. You want your backups done automatically, except in special cases.
  9. You want all your data encrypted, with complete access and authorization management like any other database you have, without security issues. Highly secure!
  10. Simply put, you want someone else to do all of this for you, and you want to pay per use.
  11. You want a highly available and fault-tolerant system.

I will get back to you within a week. If you are in project management, this is a tentative date, and meeting it depends on many other factors! Beware, dates can slip!

Since we are approaching winter, expect a lot of snow and the joy of snowflakes in my next and final blog, Part 4. If you live somewhere warm, it is time to roll up your sleeves, pack up and board the train to get immersed in the beauty of snow and snowflakes in my next blog.

Of course, there is no snow without the cloud!



https://www.linkedin.com/in/manikandasamy/

Cloud Databases – I love it! – Part II


Have you read "Cloud Databases – I love it! – Part I"? This blog is written so that you can understand it even if you haven't, but reading Part 1 first will help. Your call!

I am going to continue with farming as an example, for continuity and for easy understanding.

My dad is my hero!

I know what you are thinking next: what about Mom? She is the heroine! What else? A few people who want to take a shot at me are thinking of asking: what about my wife? She is above all! (You can't catch me!)

Didn't I tell you (I did not, but I am just checking on you) that we had farms in three different places? 1. the one my dad inherited, 2. the one my dad built himself, and 3. the one courtesy of my mom's parents. (My dad got two jackpots, my mom and also some land. I got none. When I asked, my wife told me I am lucky to have her and that I already won Mega Millions and Powerball together!)

In a nutshell, you know my dad hit roadblocks similar to the ones SMP systems had (Part 1).

I am going to tell you the story of how he overcame those obstacles, and then explaining MPP will be much easier! You will also get a glimpse of why my dad is my hero!

He was a born leader who never hesitated to take a risk for a good cause, and he always thought differently from many others. He looked at the different kinds of farming we did and decided to pick the one that was most critical and most beneficial for us, instead of trying to chew on everything at the same time. We had more coconut trees than anything else, with more income potential than the other types of farming, so he chose to concentrate on coconut farming. He hired X small-but-fast people who could climb quickly and dedicated them to each farm (dedicated CPUs), hired Y strong people who could carry large numbers of coconuts in bigger baskets (dedicated DISKS), and bought large tractors (COMMON NETWORK) to move coconuts between farms based on demand, mixing and matching coconut sizes (resource optimization), and so on. Wow! The jobs got done faster, generated more income, and we could sell the coconuts sooner, even during the peak harvest season, since we had dedicated workers in each location (we did NOT share them at all).

But one issue still persisted, and a couple more started adding up.

  1. When there was not enough rain, or during the slow season of the harvest, we still had to maintain all the resources (humans, animals and equipment) even though they were barely used. This did not go away even with the new approach.
  2. Dedicated resources are more expensive too, but with more outcome, my dad was fine with that.
  3. We started getting more demand for coconuts, but we were limited by the amount of land (the size of the box) we had, even after converting our farm to 100% coconut. I wished we could grab our neighbor's land! Even when demand grew, we could not sell more since we had maxed out on land (maxed out the MPP box).
  4. Since land was limited (the MPP box), my dad thought about buying more land (another box), but dropped the idea due to the steep cost-versus-benefit curve. Not doing it was a wise decision.

Go back and read ONLY the highlighted text in the context of computers. You will see MPP.

Now let us get a little more technical, still aimed at semi- and non-technologists but not limited to them. So what is MPP?

Massively Parallel Processing (MPP)

Some smart folks came up with an idea called Massively Parallel Processing (MPP), aka shared-nothing architecture, to overcome the limitations of SMP. Think of MPP as many mini computers inside a big computer, where each minicomputer has its own CPU, its own disk and its own memory. They don't share anything at all. Like our kids!

How did they achieve this shared-nothing architecture? Let us take a simple example. Not exactly the real technical greatness, but easy to understand.

When you store a file, it is split into multiple small files and a portion is stored on each computer. When you want to read that file, it is read roughly "n" times faster because all the mini computers work in parallel, each using its OWN memory and its own portion of the file. For non-techno people: instead of one person sorting all your goods, you have, say, 10 people each sorting a portion of your goods, achieving roughly 10 times the productivity. Wow! Yes, wow!
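A toy sketch of the same idea in plain Python: the "file" is split into chunks and a pool of workers each processes only its own chunk in parallel, much like each MPP node works only on its own slice of the data (the data, the chunk count and the query are all made up):

```python
from multiprocessing import Pool

def count_even_rows(chunk):
    # Each "mini computer" scans only its own portion of the data.
    return sum(1 for row in chunk if row % 2 == 0)

if __name__ == "__main__":
    data = list(range(1_000_000))        # stand-in for one big file
    n_nodes = 10                         # pretend we have 10 mini computers
    chunk_size = len(data) // n_nodes
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=n_nodes) as pool:
        partial_counts = pool.map(count_even_rows, chunks)  # all chunks in parallel

    print(sum(partial_counts))           # combine the partial results
```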

[Image: MPP (shared-nothing) architecture]

So what is MPP?

Massively Parallel Processing (MPP) is the coordinated processing of a single task by multiple processors, each processor using its own OS and memory and communicating with each other using some form of messaging interface.

Now all your queries ran much faster and met all your business demands. Bingo, no more sleepless nights? Correct? Yes, to an extent, but not fully. It also came with a price: you have to know the best way to split the file based on the most-used access pattern (a long story about data distribution and zone maps). You also need additional workload management so low-priority requests don't take the resources when high-priority requests come in. These computers became very expensive too, because the vendors had found a way to solve your pain! And they hit the wall of how many mini computers they could fit inside a box (horizontal scalability).

Lastly, think about a scenario where you bought these expensive boxes with all the associated overheads, but your business does not need all the processing power 80% of the time. Meaning you have a sunk cost and a maintenance cost even though you want only as much power as needed, not the entire infrastructure, during the non-peak hours of your business. For the non-technologist: you built a 10-bedroom house with a huge event hall for a once-a-year get-together and cannot use it for the rest of the year. On top of the huge one-time sunk cost, you still have to manage and maintain the house, and the county is not going to lower your property tax just because you don't use the house all the time. Merciless counties! You get it.

The technology owners were very happy until the explosion and massive growth of data that came with the digitization of this decade. They could not add an unlimited number of mini computers inside a physical box, faced difficulties connecting them all efficiently through high-bandwidth network interfaces, and struggled to provide fast response times as more data was added. So these systems eventually became slow and unable to scale either horizontally (unable to add unlimited mini computers) or vertically (unable to add more processing power and memory to each mini computer). And since these computers need to know in advance how the data should be distributed based on the most common access pattern, they could not give good throughput when people asked a question that did not follow the pattern used to divide the data.

SMP vs MPP for technologists

[Image: SMP vs MPP comparison]

Before I move into the next section, I want you to know that I have nothing but great regard for these technologies and I can't thank them enough.

So how do we solve the issues in front of us? We needed a technology better than MPP. Do you all agree? Yes, we all did, and the help came in. A "BIG" help.

Time to turn the page to Part 3 of the blog!


https://www.linkedin.com/in/manikandasamy/

Cloud Databases – I love it! – Part I

Thanks to the weather and Delta Airlines for delaying my flight five times and helping me start writing this blog. For new readers: my intention in writing technical blogs is not only for technologists, but also for semi-techno people to understand and appreciate the advancements happening in this field. I hope you understand the reason for the length too.

Many of my readers expressed their desire to read funny technical blogs, to laugh out loud while the details unfold. Since laughing is the best medicine in the world, I have continued to punish them with these kinds of blogs. I can't guarantee whether you are going to laugh or cry at the end!

I am not competing with great technical authors and bloggers, unless I am in front of my clients, who want to get to the point without beating around the bush. If you are looking for a deep-in-the-water view of what I am talking about here, Google can help you find much better articles written by excellent technical contributors.

On the other hand, if you are ready for a story, a real story that helped me bring food to the table, read on. I warned you, so don't blame me at the end!

This topic has 4 parts; this blog is Part 1. If you really want to get into cloud databases directly, you can jump to Part 4 when I am done publishing it. By the time you finish reading all 4 parts, I am sure you will be able to explain multiple technologies to anyone as if you had invented them. If not, reach out to me. I will only charge a nominal fee!

Change is the only constant

Whether we like it or not, Elon Musk is changing the world. Whether it is the space program, auto or solar, he has already rocked the boat. It is alright if you don't like him, and you know the SEC does not either. I know you are dreaming about owning a Tesla! But you don't like him. I don't like him either, since he has yet to offer a high-end Tesla for the price of a Chevy Cavalier! I am not greedy, just wishing for a small gift he can afford! OK. If you don't like Musk, substitute someone you truly believe is a game changer. Maybe it is you!

So many companies are following the game changers' footsteps, jumping on the bandwagon at least to stay competitive and grow their business. If they don't, they will become history. Have you read the book "Good to Great" by Jim Collins? Even among those "great" companies, think about how many have become history since the book was written.

Just like Musk and other game changers, there are new technologies in the computer world making huge disruptions while others become history. The first two and a half parts of this blog series pave the way to talk about those disruptive technologies of the past, the present and, to some extent, the future. Without the background and the history, it would be hard to explain the benefits of the new technology advancements, and it is difficult for me to capture it all in one blog anyway.

New technologies in the computing world are quietly gaining big momentum, not only in terms of technology advancements but also in the opportunities they provide for tremendous business growth and speed to market. I strongly believe there is no use for technology if it does not help the business and the consumers. I am happy to be proven wrong, but remember this and you will thank my boss one day, since he hammered it into my head and I thank him every day.

In Parts 3 and 4 of the blog, I would like to share my views on some ground-breaking technology that I fell in love with after investing quite a bit of time debating, comparing and contrasting it with many other similar technologies I have used to date.

I love to learn new technologies just like you, even though I may not be as smart as you. Hey, I am catching up with you, just like I have been trying to catch up with my wife's intelligence for the last 20+ years! Next piece of wisdom: read and learn, and never stop!

About the title of the topic: I know it is too generic. I debated between "cloud databases" and "cloud data warehouses" and chose "cloud databases" to make it easy to start with. My wife glanced at my blog and the word "cloud" caught her eye. Her comment was, "You guys spoiled the earth and now you are not leaving the cloud alone either!" She comes from an agricultural family (I am from a farming family too, but she has the better background and title) and is a strong environmentalist. She has reason to worry! At least for me, she is always right. (I have to get my food, guys.)


Need for speed

"Dream is not what you see in sleep; a dream is something that does not let you sleep." – Dr. APJ Abdul Kalam

I always dreamed about an advancement in technology where the storage, the compute and the services that manage the platform could scale independently while preserving all the great benefits I have today. Read the keyword: independently. You may be wondering why. You may be thinking I am fooling you by claiming I am "that" visionary. Not at all. It was not because I had "visionary" power, but a result of frustration. Do I call my wife a visionary just because she told me ten years back that one day cars would drive themselves? I think she was frustrated with my driving and with being continuously tossed around inside the car. The poor girl did not realize that I give her the experience of a lifetime every time she gets into the car. Since she cared more about "life", she was communicating nicely. Jokes aside, years back she said we would soon have self-driving cars. I adamantly said it was not possible. I became the loser again!

I admit I am not the visionary, and if anyone thought they caught me, sorry, better luck next time.

I was dreaming because I got frustrated when I ran out of storage or CPU: calling admins to increase the storage or CPU allocations, struggling to explain the need for more resources, their escalations about the cost of storage or of allocating more CPU power, and the limitations they had (one guy told me they can't just go to Best Buy, buy a hard disk and add it! He was venting his frustration too!). In addition to the resource issues, production jobs would run slow, and the entire team had to stay up all night to run the jobs while others were partying as if there were no tomorrow! I know some of you reading this blog were partying while we were staring at the green screen.

Ironically, when the job failed at the tail end, just before we all decided to leave at 5:00 in the morning (which always happens), it sent us into yet another sleepless night, and it never stopped… The "frustration" list goes on. Does this sound familiar to you as well?

Am I saying that all the past technologies made my life worse? No. After all, I am human, and one of the bad ones! Generally I have less appreciation for what I am enjoying and always look out for the thing I don't have. (Just kidding, guys!)

Setting up the stage

Let us talk about a couple of ground-breaking technologies of past decades and why they are not as interesting as they used to be, even though they are all good. I agree these technologies created a paradigm shift and continue to exist, if not dominate.

  1. Symmetric Multi-Processing (SMP), which I call the shared-"something" architecture for the benefit of the readers, and
  2. Massively Parallel Processing (MPP), aka the shared-nothing architecture

For sure, these technologies paved the way for a lot of the technical advancements we enjoy now. No complaints at all. It is just like the model of car we loved 10 years back not meeting our needs anymore. You loved it, but you have to move on too!

Symmetric Multi-Processing (SMP)

For semi-technologists, what is SMP, or "shared something"? It is a big topic, but let me try to explain it in a few sentences. If I don't do it enough justice, just ignore it and let us not debate, since you already know it better. Deal?!

I come from a farming family, and I am prouder to say that than to describe what I do now, even though my parents made me stay away from farming. It was a big sacrifice they made to discourage me from getting into it, because they knew farming was not going to make our life better.

I will take an example from the life that made me who I am. Our dad had PERMANENT employees on our farm, and they were shared between all activities. Some were strong and could do every job, including transporting large quantities of items; some were thin but fast and good at specialties like climbing coconut trees; and some of the ladies were super good at seeding paddy plants. Of course we had bullocks, cows and goats contributing to many of the farming activities. They were all shared.

  1. The issue was that once they started working on an activity, most of the time we could not stop them even if some higher-priority work came in (too many details as to why), and even if we did, the work might need to be restarted from scratch. For example, if someone is halfway up a 100-foot coconut tree to pluck the ripe coconuts, bringing them down to handle something else before they are done means lost productivity and a redo.


2. During harvest season we needed more resources (human / animal / farming equipment), but we were limited in all of them. So we could not get the harvest done faster to sell it in the market sooner and generate higher cash value. Speed to market was limited, and the loss was more than we could bear. We could do nothing about it. We were limited.

3. When there was not enough rain, or during the slow season of the harvest, we still had to maintain all the resources (humans, animals and equipment) even though they were barely used.

I hear you saying "enough farming" and asking what the relation is between farming and computers. I care about farming because we all live by its courtesy (a different story for a different time). But to answer you: go back and read ONLY the highlighted text in the context of computers. You will see SMP.

Now a little more technical, but still aimed at semi- and non-technologists and not limited to them. So what is SMP?

  • All your processors (compute) have access to all of the expensive shared disk (for technologists, something like SAN or NAS; for non-technologists, assume the files you create need a place to be stored, and that storage is shared by all the processors/CPUs in the huge computer), and
  • The expensive memory (RAM) is also shared, meaning memory is shared by all the processors/CPUs in the huge computer.

This architecture offered the benefit of scaling to reasonably large computers with lots of processing power and allowed all the processors to access all the storage and memory. But it hit a limit on the amount and size of shared resources that can fit in a physical box, and on the internal communication bandwidth (the common bus) across those resources. For non-technologists: you can't make your door wider just because 500 people came to your home for a function. Your door is the limiting factor that prevents people from entering faster.

These boxes were expensive and could not scale when more capacity and faster responses were needed as data growth exploded. Just as building a bigger house costs more, and you don't have unlimited real estate either.

Another issue was that one set of CPUs might be processing a huge data file while the other CPUs sat idle. Remember, the storage is shared, but the data is not distributed across multiple processors so that all of them can operate on a portion of the file in parallel. Simply put, you paid big bucks but under-utilized it. (Debaters, I would like to stay away from the argument about block-level access, which would complicate the understanding for semi-technologists.)

[Image: SMP (shared-something) architecture]

Since memory is also shared, if you have a large number of requests, some requests have to wait when the available memory is taken over by a few processes. Another problem: if the CPUs are taken by less critical jobs, a high-priority job gets its CPU slots in a round-robin manner and therefore takes longer to finish. If you want a high-priority process to run faster, you have to manage the workload (a different ball game), optimize the system by tweaking tons of knobs, and line up many different people to manage your infrastructure, such as Unix admins, DBAs and network engineers. In a nutshell, you are limited in scalability, unable to meet business demand, and stuck with complex infrastructure / data center maintenance even after investing in very expensive hardware.

So what is SMP?

Symmetric Multi-Processing (SMP) is a tightly coupled multiprocessor system where processors share resources – single instances of the Operating System (OS), memory, I/O devices and connected using a common bus.

These limitations caused a considerable number of sleepless nights for me and my team (though for sure I had it better than the previous generation, which spent sleepless weeks with even older technologies). You must have heard the phrase "pizza in the hallway!" I have had more pizzas at my office than with friends and family. I let our female employees go home and work from there, which was convenient for them but also gave them satisfaction and a feeling of being part of the team. I will tell you the real reason now: we got to eat more slices of pizza with fewer people to share! Don't tell them.

Some of my team members' spouses sent me thank-you notes for keeping their husbands at work longer and making their life better at home. Some others demanded babysitters at my cost since I kept their husbands at work. Some demanded cleaning services for the same reason. Until then I didn't know why some of my people kept their desks so unorganized: the poor guys were doing so much cleaning at home and taking it easy at the office.

My wife was in the former category and wished for the same technology to continue for years, so that she didn't need to cook at home or listen to me. She gets peace when I lose peace! Their happiness did not last long.

While we had great benefits with SMP, we still needed a better technology to make our life better, one that would fill the gaps of SMP. God listened to our prayers, and Massively Parallel Processing (MPP) was born! Time to turn the page to Part 2 of the blog.


https://www.linkedin.com/in/manikandasamy/

Apache Avro for all!

I am on a tech steroid, so I keep going on technology topics. I assure you my next topic will be non-technology.

If I fail in my promise, just let it go, since I am spoiled beyond repair!

If you have noticed, all my technology topics stay in the middle ground. Not at the 1,000-foot level or the 10-foot level, but somewhere in the middle. Why? Many of my friends and readers asked me to write blogs they can understand in and out, but not at the coding level. They also do not want to stay at such a high level that they get exposed when they have to explain things to their clients. Fair requests to me!

What happens when you stay at the 10,000-foot level and call yourself a technologist? You know the peeling-the-onion story. I have seen so many people who claimed to be technologists, but they had only one or at most two layers of onion; after that they were not only exposed but also ended up taking the project down the drain.

Back to the topic now. What is this Apache Avro and why is it a topic for discussion? Let me not stir up a war with other big data evangelists. Neither they nor I have time to debate; let us rather focus our energy on using everyone's time wisely. Hey, did you notice that I just avoided criticism from those people?!

Let me make it very simple. I am biased, and if you don't want to join me, wait till you read the blog completely. I am sure I will have an army behind me by the time you finish reading. Am I overconfident?

At the end, if you still feel that I am biased, read my second paragraph again! This may be more of an article than a blog; I call it a lengthy blog that violates the rules of blogging. Someone told me "rules are meant to be broken." Just making fun; don't break the rules. My intention is not to violate the blog rules, but also not to lose your (and my) train of thought by breaking this into two blogs.

On a big data platform there are many file storage formats available, and each has its own strengths and weaknesses. Like your cameras! You have so many choices, and we can't say one camera is better than the others, or that there is one camera that meets and beats every other camera in EVERY aspect, because no such "ONE" exists. But we can pick one or two top-of-the-line cameras that provide the most benefits and functionality. Of course, yes. I am sure you are with me! Why did I pick the camera as an example? I love photography; I have all the equipment, such as a professional-grade camera, lenses, flash and what not. My camera expenses grow steadily and exponentially year over year, but the quality of my pictures has not! Don't tell my wife, since she will either put a limit on my spending or, if she is in a bad mood, curb it completely. I am writing a blog for you and hope you will not kill my hobby. We have a gentlemen's agreement!

Just as I have a choice of camera, I chose one or two file storage formats that I believe offer greater benefits than the others in most common implementations. So which camera do you choose among top brands like Canon, Nikon and Sony? If you are not into cameras, which car do you choose among Mercedes, BMW and Porsche? Don't get offended if those brands are not the ones you like.

My pick is the Apache Avro storage format over the others (compared to Text, Parquet, ORC, RC, etc.) when I am WRITING (ingesting). Make a note: this is for "WRITE", not for "READ". Note the upper-case letters.

Am I saying that for faster reading you need to use another format? Absolutely, but that is a discussion for another topic, not now. If you are going to challenge why I would use one format for writing and another for reading, that is yet another lengthy topic. To keep this blog shorter, I assume you all know the reasons and I will not dive into the read format now. I will write another blog on the file format for reading and my choice!


Let us talk about the benefits.

1. Along with all the other benefits it offers, this format provides the best, if not superb, write (ingest) performance

2. It works very well with Snappy compression. I love Snappy compression. Remember, you need a compression technology that consumes fewer CPU cycles

3. It stores the data in a binary format but also "embeds the schema in it!" There is no need for an external schema to understand the data file. Of course, a few other formats also provide this capability, but Avro has it too!

4. But the biggest bang for the buck comes from its capability to accommodate "schema evolution". Some people mistakenly call it "evaluation", but it is "evolution". As of today, in my view no other format supports schema evolution to the extent Avro does. This alone made me take a side with Avro. Unless you are put in a situation like mine, delivering an enterprise-scale big data implementation measured in petabytes and ingesting 25K to 30K entities per day, you may not understand the rationale and my vote for Avro. To know how difficult it is to play quarterback, you have to play; only then will you know the reality. Unfortunately or fortunately, I was put in that situation. If you want to try other formats, by all means go and play with all those players, the tackles (300+ pounds), linebackers (245+ pounds) and cornerbacks (200+ pounds), I mean your clients, hungrily waiting to crush you. Or use others' experience, stay on the sideline as the coach (like Bill Belichick), and make the right call for the team!

Believe it or not, many enterprise systems will warrant support for schema evolution for data ingestion and also for consumption. Trust me, the source system from which you consume data can decide to change the structure of any file/table (aka entity) at any time based on changing requirements. In a nutshell, schema changes from a source system can happen multiple times over its life cycle.

What is the challenge?

You have an application written with an interface agreement between your source system. As per the interface agreement, the source system will deliver files with “X” number attributes/fields separated by comma or by delimited fields (in case of files) and in case of tables, you have called out select statement with the agreed upon fields.. (Don’t ask me why can’t we use select * for tables. If you are asking this question, you have to go back to the fundamentals of data extraction principles. For now, I assume you understand why we don’t use select *).

What happens source system decided to remove (delete) a field which they don’t want to carry it forward.?. (It happens). For file based source, your application fails since you are expecting “n” number of fields but you suddenly getting “n-1” or “n-m” fields. For tables also it fails generally but some variations it can be accommodated and let us not solve the exceptions now.

What happens when a field is added? For a file-based source, your application fails since you are expecting "n" fields but suddenly get "n+1" or "n+m" fields. For tables, your application may not fail, but it will not bring in the new field (since your select statement has only the known fields and not the newly added one).

What happens when a field is renamed? For a file-based source, your application may or may not fail, but it will still use the OLD name. For tables it generally fails too, though some variations can be accommodated; again, let us not solve the exceptions now. The tiny sketch below shows how brittle the typical delimited-file contract is.
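To make the pain visible, here is a small, hypothetical illustration of the file-based case. The feed content and field names are made up; the point is only that positional parsing breaks the moment the source adds a field.

```python
# Hypothetical delimited feed: the interface agreement says exactly three fields.
import csv
import io

old_feed = "1,Alice,NY\n2,Bob,TX\n"          # the agreed "n" fields
new_feed = "1,Alice,NY,2024-01-01\n"         # the source quietly added a field (n+1)

def load(feed):
    rows = []
    for cust_id, name, state in csv.reader(io.StringIO(feed)):  # expects exactly 3 fields
        rows.append({"id": int(cust_id), "name": name, "state": state})
    return rows

print(load(old_feed))                         # works as agreed
try:
    print(load(new_feed))
except ValueError as err:
    print("ingestion failed:", err)           # too many values to unpack -- the contract broke
```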

What is schema evolution?

Schema evolution refers to the problem of evolving a database schema to adapt it to a change in the modeled reality. The problem is not limited to the modification of the schema. It, in fact, affects the data stored under the given schema and the queries (and thus the applications) posed on that schema. (Source: https://en.wikipedia.org/wiki/Schema_evolution)

In other words, in simplified form, schema evolution is the term for how the data store and/or application behaves when the schema is changed after data has already been written using an older version of that schema.

  • It is a file storage format that protects all your consuming applications from any of these source system changes, even when a new file is generated with additions, deletions, and renaming of attributes, or any combination of those events (a tiny schema-resolution sketch follows this list).
  • It also allows newly developed applications to consume OLD and NEW data without any impact.
  • There is NO impact to your existing applications running in production, even if the source system changes its structure.
  • The enterprise will love you, since you have designed a system that allows the file structure to change based on business needs, but NONE of the consuming applications have to change!
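Here is a minimal sketch of that protection, again using fastavro purely for illustration (Spark's Avro reader performs the same writer-schema/reader-schema resolution). The entity, field names, and default value are assumptions of mine, not anything from a real system.

```python
# Minimal sketch of Avro schema resolution: an OLD file read through a NEW schema.
from io import BytesIO
from fastavro import writer, reader, parse_schema

# Version 1 of the entity -- what the source used to deliver.
writer_schema = parse_schema({
    "type": "record", "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "fax", "type": "string"},   # field the source later drops
    ],
})

# Version 2 -- what today's consuming application expects:
# "fax" is gone, and "email" was added with a default so OLD files still resolve.
reader_schema = parse_schema({
    "type": "record", "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": "unknown@example.com"},
    ],
})

buf = BytesIO()
writer(buf, writer_schema, [{"id": 1, "name": "Alice", "fax": "555-0100"}])
buf.seek(0)

# The OLD file is read through the NEW schema -- no code change, no failure.
for rec in reader(buf, reader_schema=reader_schema):
    print(rec)   # {'id': 1, 'name': 'Alice', 'email': 'unknown@example.com'}
# (The Avro spec also handles renamed fields via "aliases" in the reader schema.)
```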

Another question: what if I am getting the same data from two different sources (one from Equifax, and if you have not already, go check whether your data was compromised, and another from Experian)? You used to have a common format, but now Equifax made a change (due to the identity theft they had!), added some fields, removed some, and renamed some, while Experian continues to deliver the old format. You have ONE consuming application (you wrote it 5 years ago with the assumption that the format was common between Equifax and Experian, and you did not have the intelligence we have now to predict there would be a data breach and the format would change!). Now your application will break, you don't have time to fix it, and yet somehow it should be able to handle both the OLD (Experian) and NEW (Equifax) files. You came to the right place: Apache Avro can help and save you. Go back and read the "What is the challenge?" section again. Then send me a blank check backed by sufficient money in your bank account! No, you don't have to, but I need you to return a favor: start contributing back to our community.

Can you build schema evolution without using Avro's built-in capabilities? Of course, but why would you do that? Do you want to prove your technical mastery? We need people like you who are very smart, but we would rather use your time to do something big by effectively using the tools that are already provided by the industry and backed by so many great people like you. So many Apache committers have engineered and continue to support Apache Avro, and we don't want you to develop yet another workaround. Your potential can be used for something bigger, and I encourage you to use the one that comes free from Apache. Did I say "free"? Yes, it is free, and who doesn't love free goodies? So far I have put it nicely. The not-so-nice way is: who will support your engineering marvel when you are sick, on vacation, or gone; who will enhance it; and who will keep it compatible when new formats come in or a new version of the source is released?

If you are going to engineer your own engine, I am not going to stop you; good luck. But how many people will buy the "car you designed in your garage" over cars engineered and well tested by the manufacturers? There are so many reasons why you buy a car from the manufacturer; I don't need to tell you why.

If you are still adamant, note that I am not talking about car enthusiasts who design for a specific goal and purpose. Exceptions are not counted or considered for broader use. I hope you are all with me! After taking the pains to write this blog, I assume you are all on the side that deals with reality. Without asking you, I am assuming you are with me so far!

I see the next question coming: can I have samples to explain schema evolution (the life cycle)? I can, but you will find tons of details on the internet, since so many selfless contributors have authored examples. You have better choices than me!

The purpose of this blog is: a) for you to be able to talk to anyone about Avro's benefits, b) to make recommendations based on your client's needs if they fall into the cases I have explained (the majority will, a minority will not), c) to generate interest so that you do further research and work through the examples (pushing you to learn), and d) to differentiate yourself from others (you are a doer and NOT just a talker).

Am I saying to ALWAYS use the Avro format? Hey, I don't get any kickback! Certainly not. It would be like recommending you use your Mercedes-Benz as a garbage truck. I did not mean garbage is bad; what I meant is that a garbage truck is designed to collect garbage more effectively and efficiently than your car, so your car is NOT the right choice of technology for garbage collection. Likewise, your situation may warrant other file formats, so never get hung up on Avro if it does not fit. ONE SIZE DOES NOT FIT ALL.

We are at the end of the blog, and did anyone think about which camera brand I support? If you have not, you were very focused on the "subject" and ignored my interests. I am disappointed, guys! No, I am not. I am extremely happy that you focused on what is needed and filtered out the noise. I win.

If you thought about it and also remembered not just the cameras and cars but also the technology details, you are the best! Coming back to cameras, I support Canon. Cars? I am neutral! Why do I like Canon? The major factors are picture quality, handling, support, and brand, but also, after investing quite a bit in this system, I can't move to another brand. I know I would have to live in my garage, or somewhere else, if I made another investment in another brand. I don't want to ruin my quiet life. You know how it works if you are married: you can buy more gold and diamonds of the same type, but nothing of your own interest. You know who I am talking about! If you like other brands, have fun; I am not asking you to change. I understand you will have the same problem as mine!

Before I conclude, are you with me? Do you see the value of Avro? If not, either re-read this or do some research. Even after that, if you are not a fan of Avro, that is fine; we need competition. There is no fun without competition.

Finally, just for fun: one of the photos from my visit to Boston. You can see the photographer! Who else could make a picture of beautiful Boston look this ugly other than me? Now you know why my photography skills have not improved even with high-end cameras!

[Photo: Boston in winter]

https://www.linkedin.com/in/manikandasamy/

BTW,

  1. While every caution has been taken to provide my readers with the most accurate information and honest analysis, please use your discretion before making any decisions based on the information in this blog. The author will not compensate you in any way whatsoever if you ever happen to suffer a loss/inconvenience/damage because of, or while making use of, the information in this blog.
  2. Whether you like it or dislike it, post your comments. The former motivates me to share more with our community, and the latter helps me learn from you.
  3. Pardon me for the grammar mistakes.

Apache Hadoop Yarn for fun!

Technology terms are generally easy to understand (I lied), but not all the time. I have noticed the facial expressions of the receivers (you can see they are either trying hard or making funny faces) when I explained some technologies to a group of technologists and non-technologists. The moment I noticed those expressions, I related the technology to real-life scenarios to help them understand the topic. The good part is that they understood and were able to relate to the technology concepts; the bad part is that after they left, they forgot the technology and remembered only the story. Hey, don't blame me.

Nowadays, I have started capturing it all in writing so people can go back and relate to it again, and I hope they take the time to remember the technology too. Will you?

A lot of people have asked me about YARN, and even after explaining it multiple times, it became hard for them to remember after a few weeks or months. So how about having some fun while learning? Let us try.

You are a client and you need consultants (CPU cores / memory). You call the vendor's resource manager (the YARN ResourceManager), who is the MASTER. The vendor's resource manager knows all the details: how many managers he has, how many people report to each manager, what skills they have, where they are located, and so on. You understand that without him, nothing can be done. What happens if he goes on vacation? The company is dead unless it has a backup. So remember, you have to configure the YARN ResourceManager to run in High Availability (HA) mode; a small sketch of checking the HA state follows.
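Just so the HA point does not stay abstract, here is a small sketch of checking which ResourceManager is currently active through its REST API. The hostnames and the default web port 8088 are placeholders of mine, not from any real cluster.

```python
# Sketch: ask each ResourceManager in an HA pair for its HA state over the REST API.
# Assumptions: hypothetical hostnames, default RM web port 8088, `requests` installed.
import requests

RM_HOSTS = ["rm1.example.com", "rm2.example.com"]   # placeholder ResourceManager pair

for host in RM_HOSTS:
    try:
        info = requests.get(f"http://{host}:8088/ws/v1/cluster/info", timeout=5).json()
        print(host, "->", info["clusterInfo"].get("haState", "n/a"))   # ACTIVE or STANDBY
    except requests.RequestException as exc:
        print(host, "-> unreachable:", exc)
```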

Remember you don’t want your resource manager to do all work for you since you will be overwhelming him (the previous version of Hadoop did that and realized soon; then created new architecture YARN with a resource manager role which does specific jobs and not everything).

Let us walk through the life cycle of the process.

YARN and COMMON LIFE

  1. You reach out to the vendor's resource manager with your requirement for consultants and the request (the job).
  2. The resource manager accepts the request and sends the job to one of his reporting managers (the YARN ApplicationMaster, launched in a single container) to deal with the details and nature of the job.
  3. The reporting manager (ApplicationMaster) acknowledges (registers with the ResourceManager), looks at the client's job requirement, qualifies it, and understands that you need 5 consultants (CPU cores) at a bill rate of $150 each (memory).
  4. The reporting manager (ApplicationMaster) contacts (negotiates with) the resource manager (YARN ResourceManager), asking for 5 consultants (CPU cores) at a bill rate of $150 each (memory). (Remember, the centralized resource manager is the one who knows whether resources are available or not, since we do not want different staffing managers committing the same resource to multiple clients.)
  5. The resource manager (YARN ResourceManager) checks the availability of the resources (cores and memory) and, if available, grants the reporting manager (ApplicationMaster) permission, tells it in which departments (containers on which nodes) the resources are available, and marks those departments as allocated for that reporting manager to access.
  6. Now the ApplicationMaster reaches out (container requests) to the supervisors (NodeManagers) who actually manage the resources of each department, asking them to provide the resources (cores, memory) to get the work (job) completed. The supervisor (NodeManager) grants the resources. As soon as the resources are allocated, the supervisors (NodeManagers) notify the resource manager so that the ResourceManager does not double-book the resources.
  7. The reporting manager (ApplicationMaster) gets the resources and owns the responsibility of monitoring the allocation until the contract expires. From now on, the ApplicationMaster (reporting manager) connects DIRECTLY with the client, since it has everything it needs. If any employee falls ill, it is its responsibility to go back to the resource manager (ResourceManager) for a replacement, WITHOUT letting the client know you are scrambling around and failing. Your organization's goal is to provide services to the client and manage the dynamics without affecting the client's requests. The Hadoop platform is very good at doing that, with a high level of fault tolerance.
  8. After all the work is completed and the client contract is done, the ApplicationMaster (reporting manager) notifies the ResourceManager to release the resources (container releases) and also releases itself (unregisters) so that it becomes available again.
  9. Now the resources are back in the pool, and all is well.

 

YARN ALONE

Let us have the same sequence technically now.

  1. A client program submits the application.
  2. The ResourceManager allocates a specified container to start the ApplicationMaster.
  3. The ApplicationMaster, on boot-up, registers with the ResourceManager.
  4. The ApplicationMaster negotiates with the ResourceManager for appropriate resource containers.
  5. On successful container allocation, the ApplicationMaster contacts the NodeManager to launch the container.
  6. Application code is executed within the container, and the ApplicationMaster is informed of the execution status.
  7. During execution, the client communicates directly with the ApplicationMaster or the ResourceManager to get status, progress updates, etc. (see the small status-check sketch after this list).
  8. Once the application is complete, the ApplicationMaster unregisters with the ResourceManager and shuts down, allowing its own container to be reclaimed.
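For step 7, here is a hedged little sketch of asking the ResourceManager for the status of running applications over its REST API. The host, port, and the decision to filter on RUNNING applications are my own assumptions for illustration.

```python
# Sketch: list RUNNING applications and their progress via the ResourceManager REST API.
# Assumptions: placeholder RM address; `requests` installed.
import requests

RM = "http://rm1.example.com:8088"   # hypothetical (active) ResourceManager

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}, timeout=10)
apps = resp.json()

for app in ((apps.get("apps") or {}).get("app") or []):
    print(app["id"], app["name"], app["state"], f"{app['progress']:.0f}%")
```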

[Figure: YARN application workflow]

Picture courtesy: Apache.org

COMMON LIFE ALONE

Let us have the same sequence without technology now.

  • A client submits the request.
  • The central resource manager allocates a specified reporting manager to take care of the request.
  • The reporting manager acknowledges the central resource manager.
  • The reporting manager negotiates with the central resource manager for appropriate resources.
  • On successful resource allocation, the reporting manager contacts the department supervisors to provide the resources.
  • Resources are allocated to the client, and the reporting manager is informed of the status.
  • During the client engagement, the client communicates directly with the reporting manager or the resource manager for progress updates, etc.
  • Once the client engagement is complete, the reporting manager unregisters with the resource manager and makes himself/herself available again, along with the other resources.

I don’t know if whether my blog simplified or complicated, I had a fun to learn and remember atleast. I also wanted to thanks ALL of the apache committers, authors, bloggers and every one who has helped me to learn these concepts to service my clients, my company and the community.

https://www.linkedin.com/in/manikandasamy/

BTW,

  1. While every caution has been taken to provide my readers with the most accurate information and honest analysis, please use your discretion before making any decisions based on the information in this blog. The author will not compensate you in any way whatsoever if you ever happen to suffer a loss/inconvenience/damage because of, or while making use of, the information in this blog.
  2. Whether you like it or dislike it, post your comments. The former motivates me to share more with our community, and the latter helps me learn from you.
  3. Pardon me for the grammar mistakes.

Don’t justify or defend!

I would like to step away from technology and leadership blogs for a moment to talk about a topic we encounter regularly. I may be totally off, and I am not expecting anyone to follow my suggestions if you feel otherwise.

[Image: defense]

This was a discussion topic among friends of mine who have not been fortunate enough to become entrepreneurs!

One of my friends asked me how I managed the situation when my boss told me that I was being defensive or trying to justify. Did I immediately tell him, "I have no idea; what are you talking about?" Nope, I did not say that. I am not going to lie or make myself look good in front of you all. I have absolutely had my share of that experience, and I admit it to you all.

In those situations, I had to bite the bullet and take it with a grain of salt, even though it was a bitter experience. But it also gave me an opportunity to put myself in their shoes and be mindful of my team, my organization, and my clients. It helped me become a better person.

If things are tough, remember that every flower had to go through a whole lot of dirt to get there. So do not grieve about a bitter experience. The present is slipping by while you are regretting the past and worrying about the future. Regret will not prevent tomorrow’s sorrows; it will only rob today of its strength. – Barbara Johnson

But in some cases I was not happy at all, since I knew I was not justifying or being defensive; I was being victimized just for explaining a reality he or she did not want to hear. In those cases I got frustrated and vented to my friends, since I don't like hitting a punching bag as the medium to vent. (The truth is, I don't have a punching bag! Physically, hitting a punching bag produces a response in your body that helps relieve tension.) As far as I remember, I have not used phrases such as "don't be defensive" or "don't justify" with any of my colleagues or team members, for the same reasons: I know how much it hurts, even when they are at fault. I believe a verbal wound is as bad as a physical one, and often worse.

In the cases where I did not agree, I found that the person who said it was, obviously, someone superior to me, and they just wanted to remind me that they are the boss and have the right and freedom to advise, even when they knew they were not right. Don't underestimate the power of ego. They also did not have an open mind to accept or appreciate others' input.

Did it hurt me? What was my reaction? Yes, it did hurt when I was suppressed for being right. In those cases I did one of three things.

  1. Vented to my friends (it was their destiny, not mine!).
  2. Let it go, or pretended like nothing happened.
  3. Stayed off their radar, since I knew they would chase me out if I did not stay away from them.

I also did one more. Hey, I lied that I followed only three; I actually followed four. My lucky number is three!

I sought help from my better half (my wife). Who is going to give better suggestions than your spouse? Hey, if you have not bought the diamond necklace she asked for last week, you had better go to the store now and surprise her before you seek the advice. I can't guarantee the results. She may ask you not to go against your boss because she is worried you may lose your job and there will be no more diamond necklaces; OR, if you don't buy the necklace, she may give you a suggestion that aggravates the issue even more, since she wants you to get into more trouble with your boss (tit for tat?). Either way, you don't win!!! Jokes apart, my wife is my best advocate, and she gave me the right suggestions whenever I sought help. All I had to do was use my ears and mouth proportionately. Actually, I used only my ears!

Do you think staying away or letting it go is not a good strategy? I am not asking you to not be courageous; I speak my mind all the time. I want you to understand that it is not about courage but about being thoughtful. Take your own time to think without emotion. Don't react immediately. "Bravery without wisdom is foolishness." (Lindsay Tia)

Cut your losses if you know it will affect your and your team's progress. Why add more fuel to a burning fire over something that is worth nothing? To me, it is not the worst strategy at all, if not the best. There are a couple of reasons. The reaction from your boss may be a) due to his bad mood (he got his share from his wife in the morning because he went home last night fully drunk and tried to "justify" to her why he drank too much), OR b) due to the blood bath he took from his boss or customers (he has to pass it on to others), OR c) because he is upset with someone else, OR d) because he really is egotistical. So retaliating without understanding the situation is not a good option.

"Mature workers are less impulsive, less reactive, more creative and more centered." (Deepak Chopra)

So I always tried (I said I tried; I am not trying to tell you I am perfect) to redirect my focus to something else, or tried to stay with a team of positive people to keep myself motivated. Trust me, a Harvard Business School (HBS) study found that seating complementary colleagues together has real positive benefits, and that matches my experience too. They also found that toxic workers, who end up being fired for misbehavior, tend to drag down the people around them. Do you agree now why I tried to stay with a team of positive people?

You may think that I should have fought for my rights and proven to them that I was right. I choose my fights carefully, and only if I believe I have been pushed into a corner and cannot tolerate the impact. After all, everyone has a tolerance limit. In those kinds of situations, I did not take the hard fight, but I made them realize they were wrong by doing the right things. Hey, I followed a 5th tactic now. I see you are saying I am a liar. You already told me I was a liar when I had my fourth strategy; what difference does one more lie make? A lie is a lie anyway! Just kidding.

I have a story, a real story, for the 5th strategy. There was a situation where my superior ignored my recommendations and advised me to do something that was not going to work. When I tried to explain, he told me not to "defend" my ideas, yet he "justified" his own case! When your boss defends or justifies, it is wisdom, but the same courtesy is not extended to you. My boss used the word "defensive" purposefully, just to keep me quiet and get me to agree to his ideas. I told my boss that things would happen exactly the way I had explained, but I also told him that I was willing to listen and execute his ideas. Guess what? The execution happened exactly the way I had explained to my boss. I made him realize he was wrong by doing the right things. There was no appreciation from my boss, just a big silence. I prefer this as a better way of teaching a lesson than fighting with them directly, since that would only aggravate them even more.

I also had a 6th one, where I gave it back to them when I realized their continuous torture had tested my patience. When I did that, I knew I was burning bridges, but for me, who I am and the values I believe in became more important than my job. I also knew that my reputation would keep me out of trouble with the next level of leadership, since they knew who I am. Nothing happened to me, but the boss had to step back; he did not have a choice. This was my last option.

Enough of my story; it was my turn to ask about my friend's experience, and I was curious why he had asked me that question. I was pretty sure it was not a random question; something was bothering him. His story was really bad, and I was speechless. I could not offer any suggestions.

His boss told him that he was not happy with the way my friend had communicated a certain message to his clients. My friend tried to explain to his boss that he had not opened his mouth to say anything to the client. But his boss told him not to justify his actions and asked him to choose his words carefully in the future. My friend asked me: how can someone choose their words carefully when no such words were ever spoken? His second question: how can his boss say he is justifying, when he is simply trying to communicate that he never said any of the words his boss mentioned?

Listening to my friend's issue, a story came to my mind. Once, a minister gave his king an opinion that turned out to be wrong, and which the king did not like at all. So the king ordered the minister to be thrown to the wild dogs. The minister pleaded with the king: "I have served you for 10 years and you do this? Please give me 10 days before you throw me to those dogs." During those 10 days the minister fed the dogs every day, bathed them, and provided them all sorts of comforts. On the day of execution, the minister was put in a cage and the hungry dogs were let out. To the surprise of the king and everyone else, the dogs wagged their tails and licked the minister's feet. The king was baffled at what he saw. He asked the minister how, but the minister said he would give the answer only if he were pardoned. The king pardoned him, since he wanted to know the secret behind this miracle. The minister said, "I served the dogs for 10 days and they did not forget my service... but I served you for 10 years and you forgot it all at my first mistake."

The king got angry and ordered the minister to be thrown to the crocodiles this time! Moral of the story: if you have bosses like my friend's and they have decided to come after you, you have no choice. (You do have a choice if you have a good reputation and support from the next level of leadership; but that is a different story and a different topic.)

Not all bosses are like that. I have been lucky for the most part.

You are hired to deal with conflict resolution with your team, your customers, and your bosses. Why would you want to use your trump card for a small issue you can handle by yourself? But always seek help if your boss's behavior continues. Many of you may not agree with my suggestions. You may even suggest going to your boss's supervisors. It is an option, and an excellent one, but only as a last resort.

The bottom line is that you have to have a great sense of satisfaction; you should feel you are accomplishing something every day, and more importantly, you have to enjoy your job. At the very least, your job should not be stressful, even if it is not a pleasure.

"Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do." (Steve Jobs)

If you have other suggestions, feel free to share.