One-stop machine learning platform turns health care data into insights



Over the past few decades, hospitals and other health care centers have put considerable time and effort into digitizing health care data, turning hurriedly scribbled doctors’ notes into durable, accessible sources of information. But collecting these data was only half the challenge. Converting the records into actual insights takes still more time and effort. Machine learning is becoming more widely used in health care and is helping patients and hospitals in many ways. Common use cases include automating medical billing, supporting hospital decision making, and assisting medical diagnosis.

Cardea, an open-source software system built by researchers and software engineers at MIT’s Data to AI Lab (DAI Lab), is designed to help hospitals work with electronic health records (EHR). By running hospital data through a set of machine learning models, the system could help hospitals plan for events as large as a global pandemic and as small as no-show appointments. During the Covid-19 pandemic, the challenges hospitals face include managing suspected and confirmed Covid-19 patients, protecting and caring for health care workers, and anticipating increases in case numbers.

Automated for the people

Cardea belongs to the field of automated machine learning, also known as AutoML. Machine learning is now essential in many fields and is used for everything from vaccine development to image recognition. The goal of AutoML is to automate the building of machine learning and deep learning models, reducing or eliminating the need for skilled data scientists and cutting the time and effort involved. An AutoML system takes labeled training data (observations tagged with a label or class) as input and returns an optimized model as output.

Instead of requiring data scientists to design and code an entire machine learning model, AutoML systems like the Cardea platform present existing models, along with a catalog explaining how they work and what they do. Users can accomplish their goals by mixing and matching machine learning modules.
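To make the "labeled data in, optimized model out" idea concrete, here is a minimal sketch in plain Python. It is not Cardea's code: the catalog, the two toy models, and the function names are all hypothetical, and a real AutoML system searches far richer models and hyperparameters. The sketch only shows the shape of the process: fit every entry in a catalog, score each on held-out data, and hand back the best one.

```python
import random


def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    most_common = max(set(y), key=y.count)
    return lambda row: most_common


def fit_nearest_centroid(X, y):
    """Predict the label whose average feature vector is closest."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        def distance(label):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
        return min(centroids, key=distance)

    return predict


# Hypothetical catalog: a fitting function plus a plain-language explanation.
CATALOG = {
    "majority": (fit_majority, "Always predicts the most frequent label."),
    "nearest_centroid": (fit_nearest_centroid, "Predicts the class with the closest mean."),
}


def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def auto_select(X, y, holdout=0.25, seed=0):
    """Fit every catalog entry; return (name, model, score) of the best on held-out rows."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(len(idx) * holdout))
    val, train = idx[:n_val], idx[n_val:]
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
    X_va, y_va = [X[i] for i in val], [y[i] for i in val]
    best = None
    for name, (fit, _description) in CATALOG.items():
        model = fit(X_tr, y_tr)
        score = accuracy(model, X_va, y_va)
        if best is None or score > best[2]:
            best = (name, model, score)
    return best


# Invented labeled data: [days_waited, prior_no_shows] -> label
X = [[1, 0], [2, 0], [3, 0], [2, 1], [30, 3], [25, 2], [40, 4], [35, 3]]
y = ["fulfilled"] * 4 + ["noshow"] * 4
name, model, score = auto_select(X, y)
print(name, score)
```

The user never writes a model here. They supply labeled rows and receive the best-scoring entry, which is the division of labor AutoML aims for.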

Step by step


  • The automatic data assembler loads data from its raw format and converts it into an entity set representation.
  • Data labeling labels the data and assigns a time point to each instance. The result records (1) the time point up to which data may be used to create features and (2) the encoded label of the prediction task. This is the main phase of Cardea.
  • Featurization generates a feature matrix using Featuretools. The feature matrix consists of features generated automatically from the data.
  • Finally, the modeling component uses MLBlocks and MLPrimitives to build, train, and tune the machine learning pipeline.
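The "entity set representation" from the first step can be pictured as a group of tables plus the links between them. The sketch below is a plain-Python illustration under that assumption, not Featuretools or Cardea code. The table and column names echo the Cardea output shown later in this article, while the rows and both helper functions are invented.

```python
# Toy entity set: tables ("entities") plus foreign-key links ("relationships").
entities = {
    "Address": [
        {"object_id": 1, "city": "City A"},
        {"object_id": 2, "city": "City B"},
    ],
    "Patient": [
        {"identifier": 101, "gender": "F", "address": 1},
        {"identifier": 102, "gender": "M", "address": 2},
        {"identifier": 103, "gender": "F", "address": 1},
    ],
}

# Each link is (child table, child column, parent table, parent column).
relationships = [("Patient", "address", "Address", "object_id")]


def describe(entities, relationships):
    """Summarize the entity set as a list of text lines."""
    lines = [f"{name} [Rows: {len(rows)}, Columns: {len(rows[0])}]"
             for name, rows in entities.items()]
    lines += [f"{child}.{col} -> {parent}.{key}"
              for child, col, parent, key in relationships]
    return lines


def dangling_references(entities, relationships):
    """Return (table, row index, value) for links that point at no parent row."""
    problems = []
    for child, col, parent, key in relationships:
        parent_keys = {row[key] for row in entities[parent]}
        for i, row in enumerate(entities[child]):
            if row[col] not in parent_keys:
                problems.append((child, i, row[col]))
    return problems


summary = describe(entities, relationships)
problems = dangling_references(entities, relationships)
print("\n".join(summary))
print("dangling references:", problems)
```

Keeping the links explicit is what lets later steps follow a patient to their address or appointments when generating features.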

The steps above outline the process. So how does it work in practice to turn a mass of data into useful predictions? Cardea takes the user’s data through a pipeline, with choices and safeguards at every step. The data are first handled by a data assembler, which ingests the information users provide. The developers built Cardea to work with Fast Healthcare Interoperability Resources (FHIR), the current industry standard for electronic health care records.

Hospitals vary in exactly how they use FHIR, so Cardea has been built to adapt quickly and seamlessly to different situations and different datasets. If there are any inconsistencies within the data, Cardea’s data auditor points them out so that they can be fixed or dismissed.
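As a rough picture of what a data audit involves, the sketch below checks a few appointment records for missing fields and impossible dates. It is a hypothetical illustration only: the field names, the two checks, and the `audit` function are assumptions, not Cardea's auditor.

```python
from datetime import date


def audit(records, required=("patient_id", "scheduled", "appointment")):
    """Return (row index, problem) pairs; an empty list means nothing was flagged."""
    findings = []
    for i, rec in enumerate(records):
        for field in required:
            if rec.get(field) is None:
                findings.append((i, f"missing {field}"))
        sched, appt = rec.get("scheduled"), rec.get("appointment")
        if sched is not None and appt is not None and sched > appt:
            findings.append((i, "scheduled after appointment date"))
    return findings


# Invented records: row 1 lacks a patient, row 2 was booked after it took place.
records = [
    {"patient_id": 1, "scheduled": date(2016, 4, 29), "appointment": date(2016, 5, 2)},
    {"patient_id": None, "scheduled": date(2016, 4, 27), "appointment": date(2016, 4, 29)},
    {"patient_id": 3, "scheduled": date(2016, 5, 10), "appointment": date(2016, 5, 3)},
]

findings = audit(records)
print(findings)

# The user decides what to do: fix the flagged rows, or dismiss them.
flagged = {i for i, _ in findings}
clean = [rec for i, rec in enumerate(records) if i not in flagged]
```

The auditor only reports. Whether a flagged row is repaired or dropped stays with the user, which matches the "fix or dismiss" choice described above.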

Cardea is user-friendly: it asks users what they want to find out. They may want to look up patient details, learn how many ICU beds remain in the hospital, or estimate how long a patient might stay. These may seem like small questions, but they matter in day-to-day hospital management, especially now, during the Covid-19 pandemic, when users may not be able to visit the hospital in person. Users can choose between different models, and the AutoML system then uses the dataset and models to learn patterns from previous patients’ details and predict what could happen in the case at hand.

Currently, Cardea is set up to help with only four types of resource-allocation questions. But because the pipeline incorporates so many different models, it can easily be adapted to other scenarios that might arise. As Cardea continues to develop, the goal is for stakeholders eventually to be able to use it to solve many prediction problems within the health care sector.

To measure accuracy, the researchers tested the system against users of a popular data science platform and found that it outperformed 90 percent of them. They also tested its efficacy by asking data analysts to use Cardea to make predictions on a demo health care dataset. They concluded that Cardea significantly improved the analysts’ efficiency. For example, the analysts said that feature engineering took only five minutes instead of an average of two hours.

Trust the process

Hospital workers are often tasked with making high-stakes, critical decisions, particularly in a pandemic. They therefore need to be able to trust tools like Cardea; only then can the researchers keep building the system, improving it, and adding new features. It is not enough for workers to enter some numbers, press a button, and get an answer. They need to understand the models and what is happening inside the AutoML system.

To build in even more transparency, Cardea’s next step is a model audit. Like all predictive tools, machine learning models have strengths and weaknesses. By laying these out, Cardea lets the user decide whether to accept a model’s results or to start again with a new one.

2020 was dominated by the Covid-19 pandemic, which made it a hard year for the health sector. But Cardea was also released to the public that year, which was good news for the sector. Because Cardea is open-source software, users can integrate their own tools and ideas into it. The team also went to great lengths to ensure that the software is not only free but also understandable and easy to use. Users can build on Cardea according to their own plans, which helps reproducibility: predictions made with Cardea models can be understood and easily checked by other individuals.

The research team also plans to make Cardea more accessible to non-experts by adding more data visualizers and detailed explanations, providing an even deeper view. The researchers hope that people will adopt the system and start contributing to it. With the help of the community, it can become more powerful.

How to use Cardea

Install with pip

The easiest way to install Cardea is with pip:

pip install cardea

This command installs the latest stable release from PyPI.

Load the core class:

from cardea import Cardea

cardea = Cardea()

Downloading, Unzipping and Loading

To test Cardea, we need a medical dataset. Here we use a pre-processed version of the Kaggle dataset Medical Appointment No Shows. To use this dataset, download the data from the link and unzip it into the root directory, or run the command:

curl -O && unzip -d kaggle

To load the data from the root directory, pass the data to the loader using the following command:


To verify the loaded data, we can view the loaded entity set, which should produce the following output:

Entityset: kaggle


    Address [Rows: 81, Columns: 2]

    Appointment_Participant [Rows: 6100, Columns: 2]

    Appointment [Rows: 110527, Columns: 5]

    CodeableConcept [Rows: 4, Columns: 2]

    Coding [Rows: 3, Columns: 2]

    Identifier [Rows: 227151, Columns: 1]

    Observation [Rows: 110527, Columns: 3]

    Patient [Rows: 6100, Columns: 4]

    Reference [Rows: 6100, Columns: 1]

  Relationships: -> Reference.identifier

    Appointment.participant -> Appointment_Participant.object_id

    CodeableConcept.coding -> Coding.object_id

    Observation.code -> CodeableConcept.object_id

    Observation.subject -> Reference.identifier

    Patient.address -> Address.object_id

The entity set is composed of entities and relationships. You can now select a prediction problem by specifying the name of its class; the result is returned as label_times.

label_times = cardea.select_problem('MissedAppointment')

Here label_times records, for each instance in the dataset:

  1. The label corresponding to the instance
  2. The time index that indicates the timespan allowed for calculating features that pertain to that instance

label_times looks like this:

cutoff_time         instance_id        label

0 2015-11-10 07:13:56       5030230            noshow

1 2015-12-03 08:17:28       5122866            fulfilled

2 2015-12-07 10:40:59       5134197            fulfilled

3 2015-12-07 10:42:42       5134220            noshow

4 2015-12-07 10:43:01       5134223            noshow
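The cutoff time is easiest to understand through a small example. The sketch below is a plain-Python illustration, not what Cardea or Featuretools does internally. The two cutoff times and instance IDs are copied from the table above, while the appointment history, the "patient" column, and the `prior_no_shows` feature are invented. The point is that a feature for an instance may only use events that happened before that instance's cutoff time.

```python
from datetime import datetime

# Made-up appointment history: (patient, timestamp, outcome).
history = [
    ("p1", datetime(2015, 10, 1, 9, 0), "noshow"),
    ("p1", datetime(2015, 11, 2, 9, 0), "fulfilled"),
    ("p1", datetime(2015, 12, 20, 9, 0), "noshow"),  # after p1's cutoff
    ("p2", datetime(2015, 12, 5, 14, 0), "noshow"),  # after p2's cutoff
]

# Two rows shaped like label_times; "patient" is a hypothetical extra column
# so that the toy history can be matched to each instance.
label_times = [
    {"cutoff_time": datetime(2015, 11, 10, 7, 13, 56),
     "instance_id": 5030230, "label": "noshow", "patient": "p1"},
    {"cutoff_time": datetime(2015, 12, 3, 8, 17, 28),
     "instance_id": 5122866, "label": "fulfilled", "patient": "p2"},
]


def prior_no_shows(row, history):
    """Count the patient's no-shows strictly before the row's cutoff time."""
    return sum(
        1
        for patient, when, outcome in history
        if patient == row["patient"]
        and when < row["cutoff_time"]
        and outcome == "noshow"
    )


features = {row["instance_id"]: prior_no_shows(row, history) for row in label_times}
print(features)  # {5030230: 1, 5122866: 0}
```

Events dated after the cutoff are ignored even though they exist in the data. Without that rule, a feature could quietly include information from the future and make the model look better than it is.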

Cardea then extracts features through automated feature engineering, using the label_times of the problem you aim to solve:

feature_matrix = cardea.generate_features(label_times[:1000])

Featurization might take a while, depending on the size of the data: as the data grows, the time it takes increases considerably. With the features in hand, we can split the data into training and testing sets.

y = list(feature_matrix.pop('label'))

X = feature_matrix.values

X_train, X_test, y_train, y_test = cardea.train_test_split(X, y, test_size=0.2, shuffle=True)
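For readers new to the idea, the sketch below shows in plain Python what a shuffled 80/20 split does. It is an illustration only, not Cardea's implementation: the function name and the `seed` parameter are hypothetical additions.

```python
import random


def simple_train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):
    """Split rows and labels together so that each row keeps its own label."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    indices = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Ten toy rows; each label is the parity of its row's single value.
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = simple_train_test_split(
    X, y, test_size=0.2, shuffle=True, seed=42
)
print(len(X_train), len(X_test))  # 8 2
```

Shuffling the indices rather than the rows is what keeps X and y aligned. The test rows are set aside so that the model is later scored on data it has never seen.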

We now have a feature matrix from the featurization process, containing numerical values alongside their corresponding labels. With the feature matrix properly divided, we can train our machine learning pipeline, optimizing hyperparameters to find the most accurate model.

cardea.select_pipeline('Random Forest')

cardea.fit(X_train, y_train)

y_pred = cardea.predict(X_test)

Finally, we can evaluate the performance of the machine learning model with the following command:

cardea.evaluate(X, y, test_size=0.2, shuffle=True)

The scoring metrics in the output depend on the type of problem, as shown:


{'Accuracy': 0.75,

 'F1 Macro': 0.5098039215686274,

 'Precision': 0.5183001719479243,

 'Recall': 0.5123528436411872}
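Accuracy and macro-averaged scores can tell quite different stories when one class is much more common than the other. The sketch below computes the four metrics by hand in plain Python on invented labels. It assumes macro averaging for precision and recall and treats an undefined ratio as 0, and it is not Cardea's evaluation code.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Assumption: a ratio with a zero denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "Accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "F1 Macro": sum(f1s) / n,
        "Precision": sum(precisions) / n,
        "Recall": sum(recalls) / n,
    }


# Invented example: six kept appointments, two no-shows, and a model that
# always predicts "fulfilled".
y_true = ["fulfilled"] * 6 + ["noshow"] * 2
y_pred = ["fulfilled"] * 8
scores = macro_scores(y_true, y_pred)
print(scores)
```

In this toy case the accuracy is 0.75 even though the model never identifies a single no-show, while the macro F1 is only about 0.43. A gap between accuracy and macro scores is therefore worth checking against how balanced the labels are.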



Cardea is an AutoML system that helps users by providing solutions for managing hospital and patient data from electronic records. Users can adapt or train Cardea models for their own needs. As users come to trust Cardea and share their feedback with its developers, the developers can take the software system to the next level.

Innovature's experienced software engineer with a demonstrated history of working in information technology. Skilled in Java, web design, Spring Boot, and Spring Framework.
