DataCleaner

Software Details:
Version: 4.0.9 updated
Upload Date: 27 Sep 15
Developer: -
Distribution Type: Freeware
Downloads: 0

DataCleaner is an open source and totally free solution for organizations and businesses wishing to increase and measure the quality of their data.

With DataCleaner, users will be able to profile, compare, validate data against business rules, and monitor the progression of these measurements over time.

Among its features are data monitoring, data profiling and DQ analysis, data cleansing and enrichment, duplicate detection and merging, customer data quality, and super-fast ETLightweight (Extract-Transform-Load) processing.

To learn more about DataCleaner's functions and capabilities, as well as how to work with it, please refer to http://eobjects.dk/docs

What is new in this release:

  • Improvements and new features:
  • We've made it possible to create and drop tables via the desktop UI of DataCleaner. Note that the term "table" here actually covers more than just relational database tables. It also includes Sheets in MS Excel datastores, Collections in MongoDB, Document types in CouchDB and ElasticSearch and so on... Basically all datastore types that support write-operations, except single-table datastores such as CSV datastores, support this functionality! The functionality is exposed via:
  • "Create table" enabled via the right-click menu of schemas in the tree on the left side of the application.
  • "Create table" enabled also via table-selection inputs in components such as Insert into table, Table lookup and Update table.
  • "Drop table" enabled via the right-click menu of tables in the tree on the left side of the application.
  • We've added the (optional) capability of specifying your Salesforce.com web service Endpoint URL. This allows you to use DataCleaner to connect to sandbox environments of Salesforce.com as well as to your own custom endpoints.
  • The ElasticSearch support has been improved, allowing custom mappings as well as reusing the ElasticSearch datastore definitions now also for searching and indexing.
  • The sampling of records and selection of potential duplicates in the Duplicate detection function has been improved, leading to faster configuration because the decisions made during the training session are more representative.
  • The Duplicate detection model file format has been updated which has removed the need for a separate 'reference' file in order to save past training decisions. Compatibility with the old format has been retained, but using the new format adds many benefits for the user experience.
  • Bugfixes:
  • A thread starvation issue was fixed in DataCleaner monitor. The impact of this issue was great, but it happened only in rare and very customized cases. If a custom listener object on the DataCleaner monitor threw an error, a resource was never freed and kept occupying a thread from the Quartz scheduling pool on the server. If this happened many times, the server could eventually run out of threads in that pool.
  • The vertical menu on the result screen is now doing a proper job of displaying the labels of the components that have results. This makes it easier to recognize which menu item points to what result item.
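
The thread starvation fix above follows a general pattern: a pooled resource has to be released even when a listener fails. The sketch below illustrates that pattern in Python. It is not DataCleaner's actual Java/Quartz code, and `run_job`, `failing_listener` and `pool_slots` are hypothetical names chosen for the illustration.

```python
import threading

# Hypothetical stand-in for a fixed-size scheduler thread pool (2 slots).
POOL_SIZE = 2
pool_slots = threading.BoundedSemaphore(POOL_SIZE)


def run_job(job, listeners):
    """Run a job and notify listeners, always giving the pool slot back."""
    pool_slots.acquire()
    try:
        result = job()
        for listener in listeners:
            try:
                listener(result)
            except Exception as exc:
                # A misbehaving custom listener must not abort the cleanup.
                print(f"listener failed: {exc}")
        return result
    finally:
        # Without this release, every failing run would leak one slot.
        pool_slots.release()


def failing_listener(result):
    # Simulates a custom listener that always throws.
    raise RuntimeError("custom listener error")


# Run more jobs than there are slots; no run blocks because slots are returned.
results = [run_job(lambda: "done", [failing_listener]) for _ in range(5)]
```

If the `finally` block were missing and the listener error propagated, the third run in this example would wait forever for a free slot, which is the starvation scenario the release note describes.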

What is new in version 3.5.7:

  • The 'Synonym lookup' transformation now has an option to look up every token of the input. This is useful if you're doing replacement of synonyms within the values of a long text field.
  • Blocking execution of DataCleaner jobs through the monitor's web service could sometimes fail because of a bug in the blocking thread. This issue has been fixed.
  • An improvement was made in the way jobs and the sequence of components are closed / cleaned up after execution.
  • The JNLP / Java WebStart version of DataCleaner was affected by a bug in the Java runtime that caused certain JAR files not to be recognized by the WebStart launcher under certain circumstances. This issue has been fixed by making slight modifications to those JAR files.
  • A few dead links in the documentation were fixed.
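
To illustrate the difference between looking up a whole value and looking up every token, here is a minimal Python sketch. The synonym catalog, the function names and the word-based tokenization are assumptions made for the example. They do not reflect how DataCleaner implements the 'Synonym lookup' transformation.

```python
import re

# Hypothetical synonym catalog: each synonym maps to its master term.
SYNONYMS = {"dk": "Denmark", "danmark": "Denmark", "uk": "United Kingdom"}


def lookup_whole_value(value):
    """Replace the value only if the entire value is a known synonym."""
    return SYNONYMS.get(value.lower(), value)


def lookup_every_token(value):
    """Replace each word-like token that is a known synonym."""
    def replace(match):
        token = match.group(0)
        return SYNONYMS.get(token.lower(), token)

    return re.sub(r"\w+", replace, value)


text = "Shipped from DK to UK last week"
whole = lookup_whole_value(text)      # unchanged: the full text is not a synonym
per_token = lookup_every_token(text)  # "DK" and "UK" are replaced in place
```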

What is new in version 3.5.4:

  • It is now possible to hide output columns of transformations. Hiding does not affect the processing flow at all. It simply removes the columns from the user interface, which can make for a cleaner experience when interacting with other components.
  • A new web service has been added to the monitoring web application, which provides a way to poll the status of the execution of a particular job.
  • A bug was fixed, causing the HTML report to fail for certain analysis types when no records had been processed.
  • Six other minor bugs have been addressed.

What is new in version 3.5.1:

  • Capture changed records:
  • A new filter was added to enable incremental processing of records that have not been processed before, e.g. for profiling or copying only modified records. The new filter's name is Capture changed records, referring to the concept of Change data capture.
  • Queued execution of jobs:
  • The DataCleaner monitor will now queue the execution of the same job, if it is triggered multiple times. This ensures that you don't accidentally run the same job concurrently which may lead to all sorts of issues, depending on what the job does.
  • Minor bugfixes:
  • Several bugfixes were implemented.
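
The idea behind the Capture changed records filter can be sketched as follows: only records modified since the previous run pass through, and the latest modification time seen becomes the starting point for the next run. This is a simplified Python illustration. The `modified` field and the way the high-water mark is stored are assumptions, since the release notes do not say how the filter keeps its state.

```python
from datetime import datetime


def capture_changed_records(records, last_run):
    """Return records modified after the previous run, plus the new high-water mark."""
    changed = [r for r in records if last_run is None or r["modified"] > last_run]
    new_mark = max((r["modified"] for r in changed), default=last_run)
    return changed, new_mark


records = [
    {"id": 1, "modified": datetime(2013, 1, 1)},
    {"id": 2, "modified": datetime(2013, 2, 1)},
    {"id": 3, "modified": datetime(2013, 3, 1)},
]

# First run: nothing has been processed yet, so every record passes.
first_batch, mark = capture_changed_records(records, None)

# A new record arrives; the second run only picks up that one.
records.append({"id": 4, "modified": datetime(2013, 4, 1)})
second_batch, mark = capture_changed_records(records, mark)
```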

What is new in version 3.5:

  • Several wizards are now available for registering datastores, including file-upload to the server for CSV files, database connection entry, guided registration of Salesforce.com credentials and more.
  • The job building wizards have also been extended with several enhanced features: selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs and a new job wizard for firing Pentaho Data Integration jobs (read more below).
  • You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
  • Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
  • Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
  • A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.

What is new in version 3.1.2:

  • We've added a web service in the monitoring application for getting a (list of) metric values. This makes the monitoring even more usable as a key infrastructure component, as a way to monitor data (quality) and expose the results to third party applications.
  • The 'Table lookup' component has been improved by adding join semantics as a configurable property. Using the join semantics you can tweak if you wish the lookup to work semantically like a LEFT JOIN or an INNER JOIN.
  • The EasyDQ components have been upgraded, adding further configuration options and a richer deduplication result interface.
  • Performance improvements have been a specific focus of this release. Improvements have been made in the engine of DataCleaner to further utilize a streaming processing approach in certain corner cases that were not covered previously.
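
The join semantics option of 'Table lookup' can be pictured with a small Python sketch. With LEFT JOIN semantics, rows without a match are kept and the looked-up value is empty. With INNER JOIN semantics, those rows are dropped. The function, the `join` parameter and the `country_name` output column are hypothetical and only serve the illustration.

```python
def table_lookup(rows, lookup_table, key, join="LEFT"):
    """Enrich each row with a looked-up value under LEFT or INNER JOIN semantics."""
    if join not in ("LEFT", "INNER"):
        raise ValueError("join must be 'LEFT' or 'INNER'")
    output = []
    for row in rows:
        match = lookup_table.get(row[key])
        if match is None and join == "INNER":
            # INNER JOIN: rows without a match are discarded.
            continue
        # LEFT JOIN: unmatched rows are kept with an empty lookup value.
        output.append({**row, "country_name": match})
    return output


countries = {"DK": "Denmark", "NL": "Netherlands"}
customers = [
    {"name": "Anna", "country": "DK"},
    {"name": "Bob", "country": "XX"},
]

left_rows = table_lookup(customers, countries, "country", join="LEFT")
inner_rows = table_lookup(customers, countries, "country", join="INNER")
```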

What is new in version 3.1.1:

  • The date and time related analysis options have been expanded, adding distribution analyzers for week numbers, months and years. All analyzers related to date and time are now grouped within a submenu called "Date and time" under "Analyze".
  • An optional "descriptive statistics" option has been added to the Number analyzer and the Date/time analyzer. This option adds additional metrics to the results of these analyzers, such as Median, Skewness, percentiles and Kurtosis. These metrics are optional since their memory footprint is somewhat larger than the existing metrics.
  • The lines in the timeline charts of the monitoring web application now have small dots in them. This is especially useful for charts with few (or even only one) observations in them - to point out exactly where the observation points are.
  • The query parser used for ad-hoc queries has also been substantially improved. Queries can now contain DISTINCT clauses, *-wildcards and subqueries, and the parser is tolerant of text-case differences.
  • Two new transformers have been added for generating UUIDs and for generating timestamps.
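
One plausible reason why metrics such as the median cost more memory than the existing ones is that they cannot be computed from running totals alone. The Python sketch below contrasts the two approaches. It is an assumption-based illustration with hypothetical class names, not a description of how the Number analyzer is implemented.

```python
import statistics


class StreamingStats:
    """Count, min, max and mean, computed in constant memory."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = None
        self.maximum = None

    def add(self, value):
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def mean(self):
        return self.total / self.count


class DescriptiveStats(StreamingStats):
    """Also retains every value so that the median can be computed at the end."""

    def __init__(self):
        super().__init__()
        self.values = []

    def add(self, value):
        super().add(value)
        self.values.append(value)  # memory grows with the number of records

    def median(self):
        return statistics.median(self.values)


basic = StreamingStats()
full = DescriptiveStats()
for v in [3, 1, 4, 1, 5]:
    basic.add(v)
    full.add(v)
```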

What is new in version 3.1:

  • Metric formulas - elaborated Data Quality KPIs:
  • It is now possible to build much more elaborate Data Quality KPIs in DataCleaner's monitoring web application. The user interface allows you to build complex formulas in a spreadsheet-like formula style; using variables collected by DataCleaner jobs.
  • Metric formulas can combine any number of metrics, constants and operations, as long as it can be expressed in a mathematical equation.
  • For instance - measure the rate of duplicate records in percentage of the total record count. Or measure the amount of product codes that conform to a set of multiple string patterns.
  • Ad-hoc querying - of any datastore:
  • With DataCleaner 3.1 you can now perform ad-hoc queries to any datastore! Queries can be expressed in plain SQL and will be applied to databases as well as files, NoSQL databases and more, providing a truly helpful query mechanism to extend into your discovery and data profiling experience.
  • The query option is also available through a web service to monitoring users with the ADMIN role. The query is provided as a HTTP parameter or POST body, and the result is provided as an XHTML table.
  • Value matcher - a new analysis option:
  • Often times you have a firm idea on which values should be allowed and expected for a particular field. In DataCleaner there's always been the Value Distribution analysis option which would help you assert your assumptions. In DataCleaner 3.1 though, you have a more precise offering - the Value matcher. This analysis option allows you to specify a set of expected values and then perform a value distribution like analysis, specifically to validate and identify unexpected values.
  • Copying, deleting and management of jobs:
  • Management of jobs and results in the DataCleaner monitor application has been improved greatly. You can now click a job in the Scheduling page of the monitor, and find management options available for operations such as renaming, copying, deleting and more. Each operation respects the linkages to other artifacts in the monitor, such as analysis results, schedules and more. This means that management of the monitoring repository has become a lot easier and more mature.
  • Manage data quality history:
  • Sometimes you're facing situations where you actually want to do monitoring with historic data! It might be that you have historic dumps or backups of databases, which you wish to show and tell the story of. You can now do the analysis of this historic data, upload it to the DataCleaner monitor, and using a new web service, set a historic date for that particular analysis result. This means that your timelines will properly plot the results using their intended date, even though you may have collected the results at a later point in time.
  • Clustered scheduler support (EE only):
  • The scheduler of DataCleaner monitor has been externalized, so that it can be replaced by the means of simple configuration. In the Enterprise Edition (EE) of DataCleaner, we provide a clustered scheduler, providing the ability to load balance and distribute your executions across a cluster of machines.
  • Single-signon (SSO) using CAS (EE only):
  • In the Enterprise Edition (EE) of DataCleaner we now provide a single-signon option for the monitor application. Now DataCleaner can be an integrated part of your IT infrastructure, also security-wise.
  • ... And a lot more:
  • The above is just a summary. More than thirty issues have been resolved in this release. We have solved several requests coming from the forums and community, and we encourage everyone to use this medium as a vehicle for change. We're very happy to make the development of DataCleaner be heavily influenced by the streams in the community.
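
The Value matcher idea described above can be sketched in a few lines of Python: count how often each expected value occurs and report everything else as unexpected. The function name and the shape of the result are assumptions made for this illustration. They are not DataCleaner's actual output format.

```python
from collections import Counter


def value_matcher(values, expected):
    """Count occurrences of expected values and collect unexpected ones."""
    counts = Counter(values)
    # Expected values that never occur are reported with a count of 0.
    matched = {value: counts.get(value, 0) for value in expected}
    unexpected = {value: n for value, n in counts.items() if value not in expected}
    return matched, unexpected


genders = ["M", "F", "M", "X", "", "F", "M"]
matched, unexpected = value_matcher(genders, expected=["M", "F", "U"])
```

In this example `matched` shows that "U" never occurs, while `unexpected` singles out "X" and the empty string as values to investigate.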

What is new in version 3.0.3:

  • Adds a service for renaming jobs in the monitoring repository.
  • You can access this as a RESTful Web service or interactively in the UI.
  • A Web service was added for changing the historic date of an analysis result in the monitoring repository.
  • The Web application has been made compatible with legacy JSF containers.
  • Caching of configuration in the Web application was greatly improved, leading to faster page load and job initialization times.

What is new in version 3.0.2:

  • When triggering a job in the monitoring web application, the panel auto-refreshes every second to get the latest state of the execution.
  • File-based datastores (such as CSV or Excel spreadsheets) with absolute paths are now correctly resolved in the monitoring web application.
  • The "Select from key/value map" transformer now supports nested select expressions like "Address.Street" or "orderlines[0].product.name".
  • The table lookup mechanism has been optimized for performance, using prepared statements when running against JDBC databases.
  • Administrators can now download file-based datastores directly from the "Datastores" page.
  • Exception handling in the monitoring web application has been improved a bit, making the error messages more precise and intuitive.
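
Nested select expressions such as "Address.Street" or "orderlines[0].product.name" can be resolved by walking the structure one step at a time. The Python sketch below shows one way to do that against nested dictionaries and lists. The `select` function and its choice to return `None` for missing paths are assumptions, not the behavior of the "Select from key/value map" transformer itself.

```python
import re

# Matches either a key name or a bracketed list index, e.g. "orderlines" or "[0]".
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def select(data, expression):
    """Resolve an expression like 'orderlines[0].product.name' step by step."""
    current = data
    for key, index in _TOKEN.findall(expression):
        try:
            current = current[int(index)] if index else current[key]
        except (KeyError, IndexError, TypeError):
            # Assumption: a path that cannot be followed yields None.
            return None
    return current


record = {
    "Address": {"Street": "Main St"},
    "orderlines": [{"product": {"name": "Widget"}}],
}

street = select(record, "Address.Street")
product = select(record, "orderlines[0].product.name")
missing = select(record, "orderlines[3].product.name")
```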
