Got big data ingestion problems? The solution might already be in your network: syslog-ng delivers log data directly to Hadoop, Elasticsearch, MongoDB and Kafka


As I wrote in a previous blog, we have been working to make syslog-ng a key component of the data pipeline feeding big data tools like distributed file systems, NoSQL databases, and queuing and messaging systems. While mostly associated with the syslog protocol, syslog-ng has long moved on from system logging to managing application logs, traditional database logs, and any text-based log data. syslog-ng’s latest developments reflect the recent changes in data management.

The massive increase in the amount of data generated, and the changes in how enterprises of every kind, from social media to energy companies, manage that data, have spawned a whole new set of tools. For lack of a better term, the moniker Big Data has been applied to these new tools, which companies use to solve problems and answer questions that were previously too expensive or complicated to tackle. Big data is usually characterized by the four Vs: volume, velocity, variety, and veracity. Taken on its own, no single V poses a significant challenge, but combined they can overwhelm traditional data tools. What are these new data tools?


  • Distributed File Systems – When talking about distributed file systems, there is really one major solution: Hadoop. It provides a cost-effective, scalable, redundant alternative to the storage and processing bottlenecks of traditional database server and storage array configurations. Hadoop’s popularity has launched a whole ecosystem of tools for specific use cases.
  • NoSQL Databases – While there are many types of NoSQL databases with a variety of data structures, they all share the goal of providing more flexible, horizontally scalable data management. Two of the most popular NoSQL databases are Elasticsearch and MongoDB. Both are document stores, one type of NoSQL database, offering flexible data models and multi-tenancy.
  • Messaging Systems – The first hurdle companies face in deploying a big data solution is delivering the right data at the right time for analysis. This becomes very difficult when the velocity, the speed at which data is generated, is high. Struggling with fast-moving data streams consumed in real time by multiple systems, developers at LinkedIn created Kafka, a message broker for real-time data sets. Like other big data tools, Kafka has a flexible data model and multi-tenancy for scalability.


These new tools complement rather than replace existing data management tools, as no single tool can meet the needs of all use cases. For example, the data needs of engineers often differ greatly from those of business analysts. One of the most interesting characteristics of big data is the reliance on interconnected point solutions that solve narrow use cases but integrate with existing data management and IT systems. This interoperability creates hybrid data management systems that meet the disparate and evolving needs of data consumers.


Big Data In(di)gestion

As with any data management deployment, getting data from its source to a consumer in the right format at the right time is more difficult than it appears. This is where the fourth V, veracity, comes into play. Incomplete or poor-quality data often causes problems downstream and can lead to faulty analysis. The sheer amount of data, combined with the large number and wide variety of data sources, makes big data ingestion complicated.

Schema-less data ingestion, or schema on read, is often promoted as the answer to data ingestion problems. In theory, this provides ultimate flexibility when managing data. In practice, though, pre-processing data before it is delivered to big data tools can yield big dividends. Simple filtering, classification, parsing, and transformations can reduce the disparity between various data structures and reduce processing times down the line. Even in very flexible data management systems like Hadoop, adding some structure and metadata can make it easier to access and process data later on. syslog-ng’s Pattern Database can classify data in real time based on message content.
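To illustrate, here is a minimal sketch of real-time classification with the pattern database; the pattern file path, source, and destination names are placeholders, assuming they are defined elsewhere in the configuration:

    # Sketch: classify messages against a pattern database before forwarding.
    # The pattern file path, source, and destination names are placeholders.
    parser p_patterndb {
        db-parser(file("/etc/syslog-ng/patterndb.xml"));
    };

    log {
        source(s_network);
        parser(p_patterndb);    # adds classification macros such as ${.classifier.class}
        destination(d_bigdata);
    };

The classification result can then be used in filters and templates to tag or route messages before they reach the big data destination.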


Simplify your data pipeline

Much of the data used in big data deployments is generated by legacy sources like network and security devices. Nearly all of these devices send data over syslog. For decades, the syslog protocol has been the standard way of capturing event data. Due to its flexibility, it is still relevant: hundreds if not thousands of devices and applications can send data using syslog, and even web servers like Nginx support it.
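Collecting this data is typically a matter of opening the standard listeners. A basic sketch of a syslog-ng source that accepts both legacy and new-style syslog from the network follows; the ports and transports shown are common defaults, not requirements:

    # Sketch: accept syslog over the network from devices and applications.
    # Ports and transports are illustrative defaults.
    source s_network {
        network(transport("udp") port(514));    # legacy BSD syslog (RFC 3164)
        syslog(transport("tcp") port(601));     # IETF syslog (RFC 5424)
    };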

As the first next-generation syslog daemon, syslog-ng naturally handles both syslog message formats, RFC 3164 and RFC 5424, in both its Open Source Edition and Premium Edition. Both versions also natively support other event types, including Apache web server logs and CSV files. In addition, the Premium Edition natively supports Windows event logs, SQL database logs, and SNMP traps. Recently, we added support for JavaScript Object Notation (JSON), the primary data structure for application data transfer, to both versions.
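As an example, parsing JSON-formatted application logs into name-value pairs takes a single parser declaration. The sketch below assumes the json-parser() module is installed; the prefix is an arbitrary namespace choice:

    # Sketch: turn JSON payloads into syslog-ng name-value pairs.
    # The ".json." prefix is an arbitrary namespace choice.
    parser p_json {
        json-parser(prefix(".json."));
    };

The parsed fields can then be referenced as macros (for example, ${.json.user}) in filters, templates, and destinations.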

In addition to these data types, syslog-ng can parse any text-based file, giving it almost unlimited scope for processing events from custom applications, sensor data, or other data streams. With syslog-ng you can collect and pre-process event data from legacy systems as well as custom applications and sensors, and stream it directly to big data tools like Hadoop, Elasticsearch, MongoDB, and Kafka. Combining, correlating, and jointly analyzing data from different sources and from different parts of your organization and operation yields much better insights. The good news is that syslog-ng is probably already installed in your environment, as the Open Source Edition has more than one million users. While some of these new destinations require the Java Runtime Environment, they only require minor changes to existing configuration files.
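For example, a rough sketch of Elasticsearch and Kafka destinations using the Java-based drivers (available from around syslog-ng OSE 3.7/3.8; exact driver and option names vary between versions) might look like this. All hostnames, index and topic names, and library paths below are placeholders:

    # Sketch: Java-based big data destinations (JRE required).
    # Hostnames, index/topic names, and library paths are placeholders.
    destination d_elasticsearch {
        elasticsearch2(
            index("syslog-${YEAR}.${MONTH}.${DAY}")
            type("messages")
            cluster("my-es-cluster")
            client-mode("http")
            server("es-node.example.com")
        );
    };

    destination d_kafka {
        kafka(
            client-lib-dir("/opt/kafka/libs")
            kafka-bootstrap-servers("kafka.example.com:9092")
            topic("syslog")
        );
    };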


Flexibly route data

Different data consumers have different goals, even if they use the same data. Security teams have different analytics needs than Operations or DevOps, not to mention Marketing and Business Intelligence. This usually leads to hybrid data management systems consisting of different technologies, as no single solution can meet all of the business objectives.

syslog-ng can flexibly route data in real time to multiple destinations. Given the time-sensitive nature of security data, security teams often store long-term data in Hadoop but need to process security events in real time using their SIEM. Moreover, the tools required to analyze security event data require different skills than those required to manage data in big data tools. Many organizations use traditional SQL databases alongside Hadoop for different purposes. In these cases, syslog-ng can deliver data to each destination without needing a separate data ingestion tool.
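A sketch of such a fan-out might look like the following, assuming the filter and destinations have been defined elsewhere in the configuration (the names are placeholders):

    # Sketch: route the same security-relevant events to several consumers.
    # Filter and destination names are placeholders defined elsewhere.
    log {
        source(s_network);
        filter(f_security_events);
        destination(d_hdfs);             # long-term storage in Hadoop
        destination(d_siem);             # real-time analysis by the security team
        destination(d_elasticsearch);    # ad-hoc search and dashboards
    };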


More to come

Watch this space: we are working on new sources, destinations, and parsing capabilities to make syslog-ng a reliable, real-time data pipeline that collects data from more sources and delivers it to even more data management and analytics tools. If you have a question or an interesting use case, let us know on our forum at https://syslog-ng.org/questions/ or on Twitter @sngOSE.