Getting started with Clickhouse
Clickhouse is a blazingly fast OLAP system with insanely low latency and high throughput. Let's get started with it in this post.

Agenda
This is a getting started guide for Clickhouse.
Clickhouse is an analytical database rather than a transactional database. We want to keep this post focused on practical rather than theoretical aspects, so we are deliberately skipping a deep comparison between OLAP and OLTP systems.
Like most OLAP databases, Clickhouse too has a columnar architecture.
As this post is geared towards Clickhouse, we will highlight the places where Clickhouse, or columnar databases in general, shine over traditional row-oriented databases.
There are also places where traditional databases are a necessity and columnar databases would fail miserably.
We will install Clickhouse using Docker. Do not worry if you aren't conversant with advanced Docker concepts; we will keep Docker usage to a minimum.
Installing Clickhouse
The easiest way to get a Clickhouse server running is through the Docker image.
Issue the docker run command:
docker run --name learn-clickhouse yandex/clickhouse-server
We have used the Docker image named yandex/clickhouse-server and named our Docker container learn-clickhouse. This should start a Clickhouse server.
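The command above keeps the server attached to your terminal. If you would rather have your prompt back, Docker's standard -d (detached) flag works here as well; stop the foreground container first (Ctrl+C) if you have already started it:
docker run -d --name learn-clickhouse yandex/clickhouse-server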
Connecting to Clickhouse server
A program called clickhouse-client is needed to connect to a running Clickhouse server. The Docker container for Clickhouse also ships with clickhouse-client, hence we will use that rather than installing clickhouse-client on our system.
To use clickhouse-client from the Docker container, we need shell access to the container. Use the following command to get shell access to the running container.
docker exec -it learn-clickhouse bash
This should give us bash access to the Docker container. Issue the following command in the Docker container.
/# clickhouse-client
Pay attention that /# is the terminal prompt. Only type clickhouse-client on the terminal.
You should get a response similar to the following:
ClickHouse client version 20.6.3.28 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.6.3 revision 54436.
This confirms that the Clickhouse server is running and that we are able to connect to it. We can run a simple query to be certain that things are set up correctly.
select 1;
You should get a response similar to the following:
SELECT 1
┌─1─┐
│ 1 │
└───┘
1 rows in set. Elapsed: 0.008 sec.
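As an aside, you don't strictly need the intermediate bash shell; docker exec can invoke the client directly in one step. This is just a convenience, the rest of the post uses the two-step flow:
docker exec -it learn-clickhouse clickhouse-client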
Loading test data
We want to load a reasonably large dataset into the Clickhouse server to showcase its capabilities.
We will use the Yandex.Metrica dataset. Yandex.Metrica is a web analytics service, and Clickhouse was originally created at Yandex to power it.
This dataset contains hits and visits across the different sites which Yandex.Metrica tracks. It is in tsv format, i.e. tab-separated values. Download the following compressed tsv files:
https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz
https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz
Uncompress both these files, so that you have hits_v1.tsv and visits_v1.tsv in your Downloads folder.
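If you prefer doing this from the terminal, the following sketch should work, assuming curl and the xz utilities are installed on your machine:
$ cd ~/Downloads
# download both compressed dumps
$ curl -O https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz
$ curl -O https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz
# decompress in place, leaving hits_v1.tsv and visits_v1.tsv behind
$ unxz hits_v1.tsv.xz visits_v1.tsv.xz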
As we are using Docker to run our Clickhouse server, any data we load into the server will be lost when the Docker container is destroyed. To avoid that, we will use a Docker volume to persist the loaded data beyond the lifecycle of the Docker container.
Stop and remove the existing container:
$ docker stop learn-clickhouse
$ docker rm learn-clickhouse
Create the container again, but this time with a Docker volume.
$ docker run --name learn-clickhouse --volume learn-clickhouse:/var/lib/clickhouse --volume /Users/akshar/Downloads/:/downloads yandex/clickhouse-server
Clickhouse's data directory is /var/lib/clickhouse and we want to persist the Clickhouse server's data in a Docker volume named learn-clickhouse. Hence we have used --volume learn-clickhouse:/var/lib/clickhouse.
Also, we want the downloaded tsv files to be accessible in the container, so we can load them into the Clickhouse server. Hence we have used --volume /Users/akshar/Downloads/:/downloads. If your downloaded files are in /home/downloads, then you should use --volume /home/downloads/:/downloads instead.
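You can verify that the named volume was created, and see where Docker keeps its data on the host, using the standard volume subcommands:
$ docker volume ls
$ docker volume inspect learn-clickhouse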
Let's get bash access to this container and confirm that we are able to connect to the server.
docker exec -it learn-clickhouse bash
Confirm that a downloads folder is present and has the tsv files.
/# ls downloads/*.tsv
downloads/hits_v1.tsv downloads/visits_v1.tsv
We need to create the tables for hits and visits before we can load the data. Let's first create the database for these tables.
/# clickhouse-client
Issue the create database command on the clickhouse client.
CREATE DATABASE IF NOT EXISTS tutorial
Confirm that a database named tutorial has been created by issuing the command show databases.
show databases;
Output should look similar to the following:
SHOW DATABASES
┌─name───────────────────────────┐
│ _temporary_and_external_tables │
│ default │
│ system │
│ tutorial │
└────────────────────────────────┘
4 rows in set. Elapsed: 0.005 sec.
Let's create the table hits_v1 which will store the hits data. This create table command has been copied verbatim from the Clickhouse tutorial.
CREATE TABLE tutorial.hits_v1
(
`WatchID` UInt64,
`JavaEnable` UInt8,
`Title` String,
`GoodEvent` Int16,
`EventTime` DateTime,
`EventDate` Date,
`CounterID` UInt32,
`ClientIP` UInt32,
`ClientIP6` FixedString(16),
`RegionID` UInt32,
`UserID` UInt64,
`CounterClass` Int8,
`OS` UInt8,
`UserAgent` UInt8,
`URL` String,
`Referer` String,
`URLDomain` String,
`RefererDomain` String,
`Refresh` UInt8,
`IsRobot` UInt8,
`RefererCategories` Array(UInt16),
`URLCategories` Array(UInt16),
`URLRegions` Array(UInt32),
`RefererRegions` Array(UInt32),
`ResolutionWidth` UInt16,
`ResolutionHeight` UInt16,
`ResolutionDepth` UInt8,
`FlashMajor` UInt8,
`FlashMinor` UInt8,
`FlashMinor2` String,
`NetMajor` UInt8,
`NetMinor` UInt8,
`UserAgentMajor` UInt16,
`UserAgentMinor` FixedString(2),
`CookieEnable` UInt8,
`JavascriptEnable` UInt8,
`IsMobile` UInt8,
`MobilePhone` UInt8,
`MobilePhoneModel` String,
`Params` String,
`IPNetworkID` UInt32,
`TraficSourceID` Int8,
`SearchEngineID` UInt16,
`SearchPhrase` String,
`AdvEngineID` UInt8,
`IsArtifical` UInt8,
`WindowClientWidth` UInt16,
`WindowClientHeight` UInt16,
`ClientTimeZone` Int16,
`ClientEventTime` DateTime,
`SilverlightVersion1` UInt8,
`SilverlightVersion2` UInt8,
`SilverlightVersion3` UInt32,
`SilverlightVersion4` UInt16,
`PageCharset` String,
`CodeVersion` UInt32,
`IsLink` UInt8,
`IsDownload` UInt8,
`IsNotBounce` UInt8,
`FUniqID` UInt64,
`HID` UInt32,
`IsOldCounter` UInt8,
`IsEvent` UInt8,
`IsParameter` UInt8,
`DontCountHits` UInt8,
`WithHash` UInt8,
`HitColor` FixedString(1),
`UTCEventTime` DateTime,
`Age` UInt8,
`Sex` UInt8,
`Income` UInt8,
`Interests` UInt16,
`Robotness` UInt8,
`GeneralInterests` Array(UInt16),
`RemoteIP` UInt32,
`RemoteIP6` FixedString(16),
`WindowName` Int32,
`OpenerName` Int32,
`HistoryLength` Int16,
`BrowserLanguage` FixedString(2),
`BrowserCountry` FixedString(2),
`SocialNetwork` String,
`SocialAction` String,
`HTTPError` UInt16,
`SendTiming` Int32,
`DNSTiming` Int32,
`ConnectTiming` Int32,
`ResponseStartTiming` Int32,
`ResponseEndTiming` Int32,
`FetchTiming` Int32,
`RedirectTiming` Int32,
`DOMInteractiveTiming` Int32,
`DOMContentLoadedTiming` Int32,
`DOMCompleteTiming` Int32,
`LoadEventStartTiming` Int32,
`LoadEventEndTiming` Int32,
`NSToDOMContentLoadedTiming` Int32,
`FirstPaintTiming` Int32,
`RedirectCount` Int8,
`SocialSourceNetworkID` UInt8,
`SocialSourcePage` String,
`ParamPrice` Int64,
`ParamOrderID` String,
`ParamCurrency` FixedString(3),
`ParamCurrencyID` UInt16,
`GoalsReached` Array(UInt32),
`OpenstatServiceName` String,
`OpenstatCampaignID` String,
`OpenstatAdID` String,
`OpenstatSourceID` String,
`UTMSource` String,
`UTMMedium` String,
`UTMCampaign` String,
`UTMContent` String,
`UTMTerm` String,
`FromTag` String,
`HasGCLID` UInt8,
`RefererHash` UInt64,
`URLHash` UInt64,
`CLID` UInt32,
`YCLID` UInt64,
`ShareService` String,
`ShareURL` String,
`ShareTitle` String,
`ParsedParams` Nested(
Key1 String,
Key2 String,
Key3 String,
Key4 String,
Key5 String,
ValueDouble Float64),
`IslandID` FixedString(16),
`RequestNum` UInt32,
`RequestTry` UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
Do not get bogged down in the create table command for now. In this post, our purpose is to show how fast selecting data from Clickhouse is. We will dissect the create table command later.
Let's create the other table, for visits, and name it visits_v1.
CREATE TABLE tutorial.visits_v1
(
`CounterID` UInt32,
`StartDate` Date,
`Sign` Int8,
`IsNew` UInt8,
`VisitID` UInt64,
`UserID` UInt64,
`StartTime` DateTime,
`Duration` UInt32,
`UTCStartTime` DateTime,
`PageViews` Int32,
`Hits` Int32,
`IsBounce` UInt8,
`Referer` String,
`StartURL` String,
`RefererDomain` String,
`StartURLDomain` String,
`EndURL` String,
`LinkURL` String,
`IsDownload` UInt8,
`TraficSourceID` Int8,
`SearchEngineID` UInt16,
`SearchPhrase` String,
`AdvEngineID` UInt8,
`PlaceID` Int32,
`RefererCategories` Array(UInt16),
`URLCategories` Array(UInt16),
`URLRegions` Array(UInt32),
`RefererRegions` Array(UInt32),
`IsYandex` UInt8,
`GoalReachesDepth` Int32,
`GoalReachesURL` Int32,
`GoalReachesAny` Int32,
`SocialSourceNetworkID` UInt8,
`SocialSourcePage` String,
`MobilePhoneModel` String,
`ClientEventTime` DateTime,
`RegionID` UInt32,
`ClientIP` UInt32,
`ClientIP6` FixedString(16),
`RemoteIP` UInt32,
`RemoteIP6` FixedString(16),
`IPNetworkID` UInt32,
`SilverlightVersion3` UInt32,
`CodeVersion` UInt32,
`ResolutionWidth` UInt16,
`ResolutionHeight` UInt16,
`UserAgentMajor` UInt16,
`UserAgentMinor` UInt16,
`WindowClientWidth` UInt16,
`WindowClientHeight` UInt16,
`SilverlightVersion2` UInt8,
`SilverlightVersion4` UInt16,
`FlashVersion3` UInt16,
`FlashVersion4` UInt16,
`ClientTimeZone` Int16,
`OS` UInt8,
`UserAgent` UInt8,
`ResolutionDepth` UInt8,
`FlashMajor` UInt8,
`FlashMinor` UInt8,
`NetMajor` UInt8,
`NetMinor` UInt8,
`MobilePhone` UInt8,
`SilverlightVersion1` UInt8,
`Age` UInt8,
`Sex` UInt8,
`Income` UInt8,
`JavaEnable` UInt8,
`CookieEnable` UInt8,
`JavascriptEnable` UInt8,
`IsMobile` UInt8,
`BrowserLanguage` UInt16,
`BrowserCountry` UInt16,
`Interests` UInt16,
`Robotness` UInt8,
`GeneralInterests` Array(UInt16),
`Params` Array(String),
`Goals` Nested(
ID UInt32,
Serial UInt32,
EventTime DateTime,
Price Int64,
OrderID String,
CurrencyID UInt32),
`WatchIDs` Array(UInt64),
`ParamSumPrice` Int64,
`ParamCurrency` FixedString(3),
`ParamCurrencyID` UInt16,
`ClickLogID` UInt64,
`ClickEventID` Int32,
`ClickGoodEvent` Int32,
`ClickEventTime` DateTime,
`ClickPriorityID` Int32,
`ClickPhraseID` Int32,
`ClickPageID` Int32,
`ClickPlaceID` Int32,
`ClickTypeID` Int32,
`ClickResourceID` Int32,
`ClickCost` UInt32,
`ClickClientIP` UInt32,
`ClickDomainID` UInt32,
`ClickURL` String,
`ClickAttempt` UInt8,
`ClickOrderID` UInt32,
`ClickBannerID` UInt32,
`ClickMarketCategoryID` UInt32,
`ClickMarketPP` UInt32,
`ClickMarketCategoryName` String,
`ClickMarketPPName` String,
`ClickAWAPSCampaignName` String,
`ClickPageName` String,
`ClickTargetType` UInt16,
`ClickTargetPhraseID` UInt64,
`ClickContextType` UInt8,
`ClickSelectType` Int8,
`ClickOptions` String,
`ClickGroupBannerID` Int32,
`OpenstatServiceName` String,
`OpenstatCampaignID` String,
`OpenstatAdID` String,
`OpenstatSourceID` String,
`UTMSource` String,
`UTMMedium` String,
`UTMCampaign` String,
`UTMContent` String,
`UTMTerm` String,
`FromTag` String,
`HasGCLID` UInt8,
`FirstVisit` DateTime,
`PredLastVisit` Date,
`LastVisit` Date,
`TotalVisits` UInt32,
`TraficSource` Nested(
ID Int8,
SearchEngineID UInt16,
AdvEngineID UInt8,
PlaceID UInt16,
SocialSourceNetworkID UInt8,
Domain String,
SearchPhrase String,
SocialSourcePage String),
`Attendance` FixedString(16),
`CLID` UInt32,
`YCLID` UInt64,
`NormalizedRefererHash` UInt64,
`SearchPhraseHash` UInt64,
`RefererDomainHash` UInt64,
`NormalizedStartURLHash` UInt64,
`StartURLDomainHash` UInt64,
`NormalizedEndURLHash` UInt64,
`TopLevelDomain` UInt64,
`URLScheme` UInt64,
`OpenstatServiceNameHash` UInt64,
`OpenstatCampaignIDHash` UInt64,
`OpenstatAdIDHash` UInt64,
`OpenstatSourceIDHash` UInt64,
`UTMSourceHash` UInt64,
`UTMMediumHash` UInt64,
`UTMCampaignHash` UInt64,
`UTMContentHash` UInt64,
`UTMTermHash` UInt64,
`FromHash` UInt64,
`WebVisorEnabled` UInt8,
`WebVisorActivity` UInt32,
`ParsedParams` Nested(
Key1 String,
Key2 String,
Key3 String,
Key4 String,
Key5 String,
ValueDouble Float64),
`Market` Nested(
Type UInt8,
GoalID UInt32,
OrderID String,
OrderPrice Int64,
PP UInt32,
DirectPlaceID UInt32,
DirectOrderID UInt32,
DirectBannerID UInt32,
GoodID String,
GoodName String,
GoodQuantity Int32,
GoodPrice Int64),
`IslandID` FixedString(16)
)
ENGINE = CollapsingMergeTree(Sign)
PARTITION BY toYYYYMM(StartDate)
ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
SAMPLE BY intHash32(UserID)
Let's load the data in both these tables. Issue the following commands from the Docker container shell.
/# cd downloads/
/downloads# clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv
/downloads# clickhouse-client --query "INSERT INTO tutorial.visits_v1 FORMAT TSV" --max_insert_block_size=100000 < visits_v1.tsv
Let's run clickhouse-client and confirm that data has been inserted into these tables.
/# clickhouse-client
On the clickhouse client terminal, switch to database tutorial by issuing the command use tutorial.
:) use tutorial;
Notice that :) is the Clickhouse terminal prompt. The command is use tutorial.
Let's see the number of rows in table visits_v1.
:) select count(*) from visits_v1;
Output should look similar to:
┌─count()─┐
│ 1681881 │
└─────────┘
This confirms that around 1.7 million rows have been added to this table.
Running a few queries
Let's run a query to select the column Hits from table visits_v1.
:) select Hits from visits_v1;
Output would look similar to:
1681881 rows in set. Elapsed: 0.052 sec. Processed 1.68 million rows, 6.73 MB (32.08 million rows/s., 128.31 MB/s.)
In this case, Clickhouse processed close to 30 million rows per second.
In a row-based database, scanning 10 million records, where each record consumes 100 bytes, would require reading 1 GB of data. At an I/O speed of 200 MB/s, this would take 5 seconds.
Scanning 30 million rows would need close to 15 seconds in such a setup, whereas we got a rate of 32 million rows/second with Clickhouse.
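You can redo this back-of-envelope arithmetic directly in clickhouse-client; the figures below are the assumptions from the paragraph above, not measurements:
-- bytes to scan divided by I/O bandwidth gives seconds of disk time
SELECT
    (10000000 * 100) / (200 * 1000 * 1000) AS seconds_for_10m_rows,
    (30000000 * 100) / (200 * 1000 * 1000) AS seconds_for_30m_rows;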
Let's run a query to find the maximum Hits.
:) select max(Hits) from visits_v1;
Output:
┌─max(Hits)─┐
│ 3296 │
└───────────┘
1 rows in set. Elapsed: 0.021 sec. Processed 1.68 million rows, 6.73 MB (80.07 million rows/s., 320.29 MB/s.)
As you can notice, Clickhouse was able to process 1.68 million rows in about 0.02 seconds. A rate of close to 80 million rows/second is highly unlikely in a relational database with a row-based architecture.
At this point, we should also clarify that Clickhouse, or OLAP in general, is not a replacement for a relational database like MySQL. In an ecommerce application, or any application needing a high number of transactions per second, a relational database is irreplaceable. OLAP is not a replacement for a transactional database, but something which can beautifully complement one.
Aggregate functions
Let's count the rows of hits_v1:
:) select Count(1) from hits_v1;
Output:
┌─count(1)─┐
│ 8873898 │
└──────────┘
1 rows in set. Elapsed: 0.018 sec.
It took 18 milliseconds to count around 9 million rows. This speed is highly unlikely in a relational database.
This is possible because of the columnar architecture of Clickhouse. In a columnar architecture, far fewer data blocks have to be read, and hence far less I/O has to be performed compared to a row-based relational database.
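If you are curious how compact this storage actually is, Clickhouse exposes part-level metadata in the system.parts table. The query below is a sketch; system table layouts can vary slightly across Clickhouse versions:
-- total rows and on-disk size of the active parts of each table
SELECT
    table,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = 'tutorial' AND active
GROUP BY table;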
Let's find all distinct client IPs.
:) select distinct ClientIP from visits_v1;
Output:
197002 rows in set. Elapsed: 0.070 sec. Processed 1.68 million rows, 6.73 MB (24.08 million rows/s., 96.34 MB/s.)
Again, blazingly fast :). In a relational database, this speed could only be expected if there were an index on the column ClientIP.
Let's find the distinct count.
:) select Count(Distinct(ClientIP)) from visits_v1;
Output:
1 rows in set. Elapsed: 0.038 sec. Processed 1.68 million rows, 6.73 MB (43.72 million rows/s., 174.90 MB/s.)
Super fast, again.
Let's find the maximum age.
:) select max(Age) from hits_v1;
Output:
1 rows in set. Elapsed: 0.016 sec. Processed 8.87 million rows, 8.87 MB (539.55 million rows/s., 539.55 MB/s.)
As we noticed, Clickhouse is insanely fast at providing aggregation results.
Let's do an aggregate query and find the sum of hits grouped by client IPs.
:) select ClientIP, sum(Hits) from visits_v1 group by ClientIP;
Output:
197002 rows in set. Elapsed: 0.194 sec. Processed 1.68 million rows, 13.46 MB (8.66 million rows/s., 69.29 MB/s.)
Let's order the total hits in decreasing order:
:) select ClientIP, sum(Hits) as TotalHits from visits_v1 group by ClientIP order by TotalHits desc;
Output:
197002 rows in set. Elapsed: 0.108 sec. Processed 1.68 million rows, 13.46 MB (15.57 million rows/s., 124.58 MB/s.)
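In practice you would rarely pull all 197,002 groups; you would cap the result with a limit. For example, the 10 client IPs with the most hits would be the same query, just limited:
:) select ClientIP, sum(Hits) as TotalHits from visits_v1 group by ClientIP order by TotalHits desc limit 10;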
Selecting different columns
Now that we have established how fast Clickhouse is at processing select and aggregation queries, let's see how Clickhouse's behaviour changes for different data access patterns.
There is a column named Sign in visits_v1. Its datatype is Int8, which means every value of Sign takes 8 bits, i.e. 1 byte.
Let's run a query to select the Sign column for all rows.
:) select Sign from visits_v1;
Output:
Showed first 10000.
1681881 rows in set. Elapsed: 0.038 sec. Processed 1.68 million rows, 1.68 MB (44.16 million rows/s., 44.16 MB/s.)
The output shows that 1.68 million rows were read, amounting to 1.68 MB. Hence, 1 byte per row.
Let's select the column Referer in visits_v1.
:) select Referer from visits_v1;
Output:
1681881 rows in set. Elapsed: 0.927 sec. Processed 1.68 million rows, 217.75 MB (1.81 million rows/s., 234.81 MB/s.)
You would notice the difference in Elapsed time. Selecting Referer required about 24 times more time than selecting Sign (0.927 sec vs 0.038 sec).
This happened because the Referer column is a String of arbitrary length, whereas Sign is of a fixed length, and that too of only 1 byte.
217.75 MB of data was read, in contrast to 1.68 MB last time, which shows that this column holds much more data; more bytes have to be read per row, and hence fewer rows are read per second.
In a row-based database, selecting any column would take more or less the same time, since the entire row is read from disk and selective columns can't be read in isolation.
Hence, when selecting columns from a columnar database, we need to be mindful of the columns we are selecting. Adding one more column to the select clause can cause a significant increase in the query processing time.
However, it still seems pretty fast. It would be hard to get a throughput of around 2 million rows per second for a varchar column in a relational database.
One of the reasons for this high throughput in a columnar database is the very effective compression it allows: values of the same column are stored next to each other and tend to compress well. In a row-based database, compression isn't as effective.
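You can see this compression at work for the two columns we just queried. system.columns reports per-column on-disk sizes; again a sketch, since exact column names in system tables can differ across versions:
-- compressed vs uncompressed size of Sign and Referer on disk
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = 'tutorial' AND table = 'visits_v1' AND name IN ('Sign', 'Referer');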
Selecting a single column vs multiple columns
Let's select a single field:
:) select Hits from visits_v1;
Output would look like:
1681881 rows in set. Elapsed: 0.102 sec. Processed 1.68 million rows, 6.73 MB (16.54 million rows/s., 66.17 MB/s.)
This suggests that around 16 million rows were processed per second.
Let's select the Referer column in addition to Hits.
:) select Hits, Referer from visits_v1;
Output would look similar to:
1681881 rows in set. Elapsed: 1.118 sec. Processed 1.68 million rows, 224.48 MB (1.50 million rows/s., 200.69 MB/s.)
This time Clickhouse was able to process only 1.5 million rows per second.
As we add more columns to the select clause, more data blocks have to be read, which increases the seek time. Also, as more data is being read, i.e. for two columns in contrast to one, the amount of I/O increases too. The additional seek time as well as the additional read time contribute to the degraded performance.
The more columns you need to read, the less benefit you’ll get from using column-oriented storage.
Applying a filter
Let's apply a filter on a column and select only the filtered column:
:) select ClientIP from hits_v1 where RegionID=57;
Output:
99424 rows in set. Elapsed: 0.174 sec. Processed 8.87 million rows, 68.67 MB (50.89 million rows/s., 393.75 MB/s.)
Clickhouse was able to process close to 50 million rows per second.
Let's query additional columns beyond the one on which we applied the filter. We applied the filter on RegionID, but let's also select other columns like WatchID, JavaEnable, etc.
:) select WatchID, JavaEnable, GoodEvent, CounterID, ClientIP, RegionID, UserID, CounterClass, ResolutionWidth, ResolutionHeight from hits_v1 where RegionID=57;
Output:
99424 rows in set. Elapsed: 0.918 sec. Processed 8.87 million rows, 291.42 MB (9.67 million rows/s., 317.49 MB/s.)
As you would notice, there was almost a 5-fold decrease in the number of rows processed per second.
Let's add a few more columns:
:) select WatchID, JavaEnable, GoodEvent, CounterID, ClientIP, RegionID, UserID, CounterClass, ResolutionWidth, ResolutionHeight, UserAgentMajor, UserAgentMinor, IsMobile, SearchEngineID, WindowClientWidth, WindowClientHeight from hits_v1 where RegionID=57;
Output:
99424 rows in set. Elapsed: 1.395 sec. Processed 8.87 million rows, 369.04 MB (6.36 million rows/s., 264.58 MB/s.)
We are noticing that as we query more columns apart from the one we filtered on, the processing speed decreases.
Ordering on a column
Let's write a query to order by column Income
.
:) select Income from hits_v1 order by Income;
Output:
8873898 rows in set. Elapsed: 0.934 sec. Processed 8.87 million rows, 8.87 MB (9.50 million rows/s., 9.50 MB/s.)
Clickhouse was able to process around 10 million rows per second. However, it's highly unlikely that you would need all the rows; real queries almost always apply a limit.
Let's modify the query and fetch only the first 100 rows ordered by Income.
:) select Income from hits_v1 order by Income limit 100;
Output:
100 rows in set. Elapsed: 0.024 sec. Processed 8.87 million rows, 8.87 MB (368.59 million rows/s., 368.59 MB/s.)
Clickhouse was able to process close to 370 million rows per second.
As an exercise, try increasing the limit from 100 to 1000 and then to 5000, as sketched below. You would notice that as the limit is increased, the performance degrades. It is something to be mindful of.
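For reference, the exercise queries would be:
:) select Income from hits_v1 order by Income limit 1000;
:) select Income from hits_v1 order by Income limit 5000;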
Up until now, we only selected the column on which we applied the order clause. Let's also select a few more columns in addition to the column on which order is applied.
:) select Age, Sex, Income, ClientIP from hits_v1 order by Income limit 100;
Output:
100 rows in set. Elapsed: 0.113 sec. Processed 8.87 million rows, 62.12 MB (78.29 million rows/s., 548.00 MB/s.)
As more columns apart from the one on which the order is applied are added, the throughput decreases.
Resources
The following are some highly informative resources on columnar databases:
- https://www.stitchdata.com/columnardatabase/
- https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
- https://news.ycombinator.com/item?id=3524437
We plan to write many more posts around columnar databases and analytics in the coming days. Stay tuned!
The Essential Plug
UrbanPiper is a growing team and we are looking for smart engineers to work with us on problems in the sphere of data, scale and high-throughput integrations.
If you are interested to be a part of our team, feel free to write in to us at [email protected]