Wednesday, March 27, 2013

Improving a PostgreSQL report performance: Part 2 - Temporary Table

The report I was working on still did not live up to expectations. There was something else going on. I had to dig a little deeper.

The report was generated by XTraReports and I have no authority to edit it. The ReportDataSource contains functions for retrieving datasets for main report and subreport.

public class ReportDataSource : BaseDataSource
{
 public DataSet GetPrimaryReportData(DateTime fromDate, DateTime toDate)
 {
  string commandText = "select * from myreportfunction;";
  var reportDataSet = ExecuteQuery("ReportDataSet", commandText, new[] { "MainDataSet" });
  return reportDataSet;
 }

 public DataSet GetSubReportData(DateTime fromDate, DateTime toDate)
 {
  string commandText = String.Format("select * from myreportsubfunction");
  return ExecuteQuery("SubReportDataSet", commandText, new[] { "SubDataSet" });
 }
}

And here's how the PostgreSQL functions looked like.

The myreportsubfunction is the one I worked on already so now it looked like the following

CREATE OR REPLACE FUNCTION myreportsubfunction(IN from timestamp without time zone, IN to timestamp without time zone)
  RETURNS TABLE(Name1 character varying, Name2 character varying) AS
$BODY$
BEGIN
RETURN QUERY EXECUTE
'select Column1 as Name1, Column2 as Name2
from sometable tbl
inner join ...
where ...
and ...
and $1 <= somedate
and $2 >= somedate
group by ...
order by ...;' USING $1, $2;
END
$BODY$
  LANGUAGE plpgsql VOLATILE

And there was the myreportfunction

CREATE FUNCTION myreportfunction ( FROMdate timestamp without time zone, todate timestamp without time zone )
 RETURNS TABLE ( name character varying, somevalue1 integer, somevalue2 real ) AS $body$
 SELECT
    something,
    sum(somevalue1)::int as somevalue1,
    sum(somevalue2)::real as somevalue2
 FROM myreportsubfunction($1, $2)      group by something;
  $body$ LANGUAGE sql;

What's going on here? Well, looks like first the myreportfunction is called, it calls the myreportsubfunction and returns aggregated results. But then the myreportsubfunction is called separately and essentially executes the same huge query again! No wonder the performance is nowhere near acceptable. Anyway, to satisfy the report requirements, I need to have the aggregated data first, which means that I need to save the results of the huge complex query, aggregate the results and return them for the main report, and then return the saved results of the query as subreport, or "detailed" data. My approach is to use the temporary table.

Here is what the functions will do:

myreportfunction

  • if a temptable exists, drop it
  • create a temptable
  • run a complex query and save results in the temptable
  • run a query that returns the aggregated data for the report

myreportsubfunction

  • if the temptable exists, return everything from the table, then drop the table

And the resulting PostgreSQL code

myreportfunction

CREATE OR REPLACE FUNCTION myreportfunction(IN fromdate timestamp without time zone, IN todate timestamp without time zone)
  RETURNS TABLE("name" character varying, somevalue1 character varying, somevalue2 character varying) AS
$BODY$
BEGIN
DROP TABLE IF EXISTS temptable;
CREATE TEMPORARY TABLE temptable("name" character varying, somevalue1 character varying, somevalue2 character varying);
DELETE FROM temptable;

EXECUTE '
insert into temptable(name, somevalue1, somevalue2)  
select Column1 as Name1, Column2 as Name2
from sometable tbl
inner join ...
where ...
and ...
and $1 <= somedate
and $2 >= somedate
group by ...
order by ...;' USING $1, $2;

RETURN QUERY EXECUTE 'SELECT name, somevalue1, somevalue2 FROM temptable group by name;';
END
$BODY$
  LANGUAGE plpgsql VOLATILE

myreportsubfunction

CREATE OR REPLACE FUNCTION myreportsubfunction(IN timestamp without time zone, IN timestamp without time zone)
  RETURNS TABLE(name character varying, somevalue1 integer, somevalue2 real) AS
$BODY$
BEGIN
IF EXISTS (SELECT 1 FROM temptable) THEN 
 RETURN QUERY EXECUTE'select * from temptable';
 DELETE FROM temptable;
 DROP TABLE IF EXISTS temptable;
END IF;
END
$BODY$
  LANGUAGE plpgsql VOLATILE

Now hoping for the performance improvement by at least 50% ...

References

CREATE TABLE
Improving a PostgreSQL report performance: Part 1 - RETURN QUERY EXECUTE
by . Also posted on my website

Sunday, March 24, 2013

Implementing a simple logging engine with MVC 4

I was investigating the simple logging mechanism for the MVC application. First I came up with some requirements for the logging engine:

High-level requirements:

  • Create logging engine that can log exceptions and other messages and store them in the database
  • Display a filterable grid of all messages

Logging engine

  • Should allow logging of exceptions, including handled and unhandled
  • Should allow logging of custom messages

Filterable grid

  • Should allow paged display of all exceptions and log messages in the database
  • Should allow the option to filter messages based on the date logged and severity

I started with these simple classes that will allow handling of messages and exception:

//actual entry, each will correspond to a line in the grid
public class LogEntry
{
 public int Id { get; set; }
 public DateTime TimeStamp { get; set; }
 public string Path { get; set; }
 public string RawUrl { get; set; }
 public string Source { get; set; }
 public string Message { get; set; }
 public string StackTrace { get; set; }
 public string TargetSite { get; set; }

 public int TypeId { get; set; }
 public virtual LogType LogType { get; set; }
}

//a "helper" class for types like Warning, Information etc.
public class LogType
{
 public int LogTypeId { get; set; }
 public string Type { get; set; }
}

//finally, an enum of all supported log message types
public enum LogTypeNames
{
 All = 0,
 Info = 1,
 Warn = 2,
 Debug = 3,
 Error = 4,
 Fatal = 5,
 Exception = 6
}

These will be reflected in two database tables - the main table for saving all log messages, and a helper table to keep names of message severity levels.

public DbSet<LogType> LogTypes { get; set; }
public DbSet<LogEntry> LogEntries { get; set; }

Next, it is time to mention logging of handled and unhandled exceptions, which can be divided into handled and unhandled exceptions.

Develop mechanism for logging exceptions:

1. Log unhandled exceptions

Unhandled exceptions are, well, exceptions that are not handled in the source code. First, the site web.config must be modified to add a line in the section:

Here's how it works: a method is called in Global.asax file:

public static void RegisterGlobalFilters(GlobalFilterCollection filters)
{
    filters.Add(new HandleErrorAttribute());
}

It registers the HandleErrorAttribute as global action filter. The HandleErrorAttribute checks the customErrors mode, and if it is off, shows the "yellow screen of death". If it is on, the Error view is rendered and a Model is passed to it containing exceptions stack trace. Therefore, an Error.cshtml should be added under Views/Shared, and in its simplest form may look as follows

@using Recipes.Logging
@using Recipes.Models
@{
    Layout = null;
    ViewBag.Title = "Error";
    Logger.WriteEntry(Model.Exception);
}

<!DOCTYPE html>
<html>
<head>
    <title>Error</title>
</head>
<body>
    <h2>
        Sorry, an error occurred while processing your request. The details of the error were logged.
    </h2>
</body>
</html>

For simplicity, all log messages - exceptions, handled and unhandled, and all other custom messages - will be saved in a single database table.

2. Log handled exceptions

The handled exceptions are caught by code and handled directly. The following is the Logger class which handles exceptions and custom messages and saves them to the database:

public static class Logger
{
 public static void WriteEntry(Exception ex)
 {
  LogEntry entry = BuildExceptionLogEntry(ex);
  SaveLogEntry(entry);        
 }

 public static void WriteEntry(string mesage, string source, int logType)
 {
  LogEntry entry = BuildLogEntry(mesage, source, logType);
  SaveLogEntry(entry);
 }

 private static void SaveLogEntry(LogEntry entry)
 {
  using (RecipesEntities context = new RecipesEntities())
  {
   context.LogEntries.Add(entry);
   context.SaveChanges();
  }
 }

 private static LogEntry BuildLogEntry(string message, string source, int logType)
 {
  LogEntry entry = BuildLogEntryTemplate();

  entry.Message = message;
  entry.Source = source;
  entry.LogType = GetLogEntryType((LogTypeNames)logType);
  entry.TypeId = logType;

  return entry;
 }

 private static LogEntry BuildExceptionLogEntry(Exception x)
 {
  Exception logException = GetInnerExceptionIfExists(x);
  LogEntry entry = BuildLogEntryTemplate();

  entry.Message = logException.Message;
  entry.Source = logException.Source ?? string.Empty;
  entry.StackTrace = logException.StackTrace ?? string.Empty;
  entry.TargetSite = logException.TargetSite == null ? string.Empty : logException.TargetSite.ToString();
  entry.LogType = GetLogEntryType(LogTypeNames.Exception);
  entry.TypeId = (int) LogTypeNames.Exception;

  return entry;
 }

 private static LogEntry BuildLogEntryTemplate()
 {
  return new LogEntry
       {
        Path = HttpContext.Current.Request.Path,
        RawUrl = HttpContext.Current.Request.RawUrl,
        TimeStamp = DateTime.Now,
       };
 }

 public static string BuildExceptionMessage(Exception x)
 {
  Exception logException = GetInnerExceptionIfExists(x);

  string strErrorMsg = Environment.NewLine + "Error in Path :" + HttpContext.Current.Request.Path;
  // Get the QueryString along with the Virtual Path
  strErrorMsg += Environment.NewLine + "Raw Url :" + HttpContext.Current.Request.RawUrl;
  // Get the error message
  strErrorMsg += Environment.NewLine + "Message :" + logException.Message;
  // Source of the message
  strErrorMsg += Environment.NewLine + "Source :" + logException.Source;
  // Stack Trace of the error
  strErrorMsg += Environment.NewLine + "Stack Trace :" + logException.StackTrace;
  // Method where the error occurred
  strErrorMsg += Environment.NewLine + "TargetSite :" + logException.TargetSite;
  return strErrorMsg;
 }

 private static LogType GetLogEntryType(LogTypeNames name)
 {
  return new LogType{LogTypeId = (int)name, Type = name.ToString()};
 }

 private static Exception GetInnerExceptionIfExists(Exception x)
 {
  if (x.InnerException != null)
   return x.InnerException;
  return x;
 }
}

With this basic structure in place, I can start adding user interface for displaying the log. I decided to only have two views, Index for the main grid which contains all log messages, and a Details for a detailed information about a single message. Details will be linked from the line in a grid that corresponds to a log message.

Index view.

The view will have several main parts, wrapped in a form.

@using (Html.BeginForm("Index", "Logging", new { CurrentPageIndex = 1 }, FormMethod.Get, new { id = "myform" }))
{

}

First is the div that shows the number of records found and gives an option to choose how many records per page will be displayed.

<div class="grid-header">
    <div class="grid-results">
        <div class="inner">
            <span style="float: left">
                @string.Format("{0} records found. Page {1} of {2}", Model.LogEvents.TotalItemCount, Model.LogEvents.PageNumber, Model.LogEvents.PageCount)
            </span>

            <span style="float: right">
                Show @Html.DropDownListFor(model => model.PageSize, new SelectList(FormsHelper.PagingPageSizes, "Value", "Text", Model.PageSize), new { onchange = "document.getElementById('myform').submit()" }) results per page
            </span>
            
            <div style="clear: both"></div>
        </div> <!-- inner -->
    </div>  <!-- grid-results -->
 </div>  <!-- grid-header -->

The second allows to filter records by date logged and severity

 <div class="grid-filter">        
    <div class="inner">
        Level : @Html.DropDownList("LogLevel", new SelectList(FormsHelper.LogLevels, "Value", "Text"))

        For : @Html.DropDownList("Period", new SelectList(FormsHelper.CommonTimePeriods, "Value", "Text"))
        
        <input id="btnGo" name="btnGo" type="submit" value="Apply Filter" />                      
    </div>
 </div>   

Next is the "pager" div, which allows navigation if multiple pages are reqiured

  <div class="paging">
    <div class="pager">
        @Html.Pager(ViewData.Model.LogEvents.PageSize, ViewData.Model.LogEvents.PageNumber, ViewData.Model.LogEvents.TotalItemCount, new { LogType = ViewData["LogType"], Period = ViewData["Period"], PageSize = ViewData["PageSize"] })
    </div>
 </div>

Finally, the main part is the actual grid which displays the messages.

 @if (Model.LogEvents.Count() == 0)
 {
 <p>No results found</p>
 }
 else
 {
 <div class="grid-container">
 <table class="grid">
    <tr>
        <th></th>
        <th>#</th>
        <th>Source</th>
        <th>Date</th>
        <th style='white-space: nowrap;'>Time ago</th>
        <th>Message</th>
        <th>Type</th>
    </tr>

 @{int i = 0;}
     @foreach (var item in Model.LogEvents)
     {
     <tr class="@(i++ % 2 == 1 ? "alt" : "")">
     <td>
        @Html.ActionLink("Details", "Details", new { id = item.Id.ToString(), loggerProviderName = "Go To Details" /*item.LoggerProviderName*/ })
     </td>
     <td>
        @i.ToString()
     </td>
     <td>
        @item.Source
     </td>
     <td style='white-space: nowrap;'>
        @String.Format("{0:g}", item.TimeStamp.ToLocalTime())
     </td>
     <td style='white-space: nowrap;'>
        @item.TimeStamp.ToLocalTime().TimeAgoString()
     </td>
     <td>
        <pre>@item.Message.WordWrap(80)</pre>
     </td>
     <td>
        @item.LogType.Type
     </td>
     </tr>
     }

 </table>
 </div> <!-- grid-container -->
}

A few points of interest:

The Index method in the controller returns a ViewModel. By default, all configurable parameters (page size, time period, page number and log level) are not set, and all log messages are displayed with the default page size of 20 entries. When a value is set in the UI and the form is submitted, a corresponding parameter is passed to the controller.

public ActionResult Index(string Period = null, int? PageSize = null, int? page = null, string LogLevel = null)
{
 string defaultPeriod = Session["Period"] == null ? "All" : Session["Period"].ToString();
 string defaultLogLevel = Session["LogLevel"] == null ? "All" : Session["LogLevel"].ToString();

 LoggingIndexModel model = new LoggingIndexModel();

 model.Period = Period ?? defaultPeriod;
 model.LogLevel = LogLevel ?? defaultLogLevel;
 model.CurrentPageIndex = page.HasValue ? page.Value - 1 : 0;
 model.PageSize = PageSize.HasValue ? PageSize.Value : 20;

 TimePeriod timePeriod = TimePeriodHelper.GetUtcTimePeriod(model.Period);

 model.LogEvents = repository.GetByDateRangeAndType(model.CurrentPageIndex, model.PageSize, timePeriod.Start, timePeriod.End, model.LogLevel);

 ViewData["Period"] = model.Period;
 ViewData["LogLevel"] = model.LogLevel;
 ViewData["PageSize"] = model.PageSize;

 Session["Period"] = model.Period;
 Session["LogLevel"] = model.LogLevel;

 return View(model);
}

GetByDateRangeAndType does the work for selecting appropriate set of log messages from the database.

public IPagedList<LogEntry> GetByDateRangeAndType(int pageIndex, int pageSize, DateTime start, DateTime end, string logLevel)
{
 IQueryable<LogEntry> list;
 IPagedList<LogEntry> pagedList;

 list = db.LogEntries.Where(e =>
   (e.TimeStamp >= start && e.TimeStamp <= end));

 if (logLevel != LogTypeNames.All.ToString())
 {
  list = list.Where(e => e.LogType.Type.ToLower() == logLevel.ToLower());
 }

 list = list.OrderByDescending(e => e.TimeStamp);
 pagedList = new PagedList<LogEntry>(list, pageIndex, pageSize);
 return pagedList;
}

The data is returned in the form of a PagedList which is implemented as follows:

public interface IPagedList<T> : IList<T>
{
 int PageCount { get; }
 int TotalItemCount { get; }
 int PageIndex { get; }
 int PageNumber { get; }
 int PageSize { get; }

 bool HasPreviousPage { get; }
 bool HasNextPage { get; }
 bool IsFirstPage { get; }
 bool IsLastPage { get; }
}

Main part of the PagedList class:

public class PagedList<T> : List<T>, IPagedList<T>
{
 public PagedList(IEnumerable<T> source, int index, int pageSize)
  : this(source, index, pageSize, null)
 {
 }

 #region IPagedList Members

 public int PageCount { get; private set; }
 public int TotalItemCount { get; private set; }
 public int PageIndex { get; private set; }
 public int PageNumber { get { return PageIndex + 1; } }
 public int PageSize { get; private set; }
 public bool HasPreviousPage { get; private set; }
 public bool HasNextPage { get; private set; }
 public bool IsFirstPage { get; private set; }
 public bool IsLastPage { get; private set; }

 #endregion

 protected void Initialize(IQueryable<T> source, int index, int pageSize, int? totalCount)
 {
  //### argument checking
  if (index < 0)
  {
   throw new ArgumentOutOfRangeException("PageIndex cannot be below 0.");
  }
  if (pageSize < 1)
  {
   throw new ArgumentOutOfRangeException("PageSize cannot be less than 1.");
  }

  //### set source to blank list if source is null to prevent exceptions
  if (source == null)
  {
   source = new List<T>().AsQueryable();
  }

  //### set properties
  if (!totalCount.HasValue)
  {
   TotalItemCount = source.Count();
  }
  PageSize = pageSize;
  PageIndex = index;
  if (TotalItemCount > 0)
  {
   PageCount = (int)Math.Ceiling(TotalItemCount / (double)PageSize);
  }
  else
  {
   PageCount = 0;
  }
  HasPreviousPage = (PageIndex > 0);
  HasNextPage = (PageIndex < (PageCount - 1));
  IsFirstPage = (PageIndex <= 0);
  IsLastPage = (PageIndex >= (PageCount - 1));

  //### add items to internal list
  if (TotalItemCount > 0)
  {
   AddRange(source.Skip((index) * pageSize).Take(pageSize).ToList());
  }
 }
}

PagedList uses some helper methods from the Pager class to render HTML and generate links to other pages of the log.

public class Pager
{
 .....

 /// <summary>
 /// Rendes HTML to display a "pager" control (used at the top and bottom of the list of logged messages)
 /// </summary>
 /// <returns>String of HTML</returns>
 public string RenderHtml()
 {
  int pageCount = (int)Math.Ceiling(totalItemCount / (double)pageSize);
  const int nrOfPagesToDisplay = 10;

  var sb = new StringBuilder();

  // Previous
  if (currentPage > 1)
  {
   sb.Append(GeneratePageLink("<", this.currentPage - 1));
  }
  else
  {
   sb.Append("<span class=\"disabled\"><</span>");
  }

  int start = 1;
  int end = pageCount;

  if (pageCount > nrOfPagesToDisplay)
  {
   int middle = (int)Math.Ceiling(nrOfPagesToDisplay / 2d) - 1;
   int below = (currentPage - middle);
   int above = (currentPage + middle);

   if (below < 4)
   {
    above = nrOfPagesToDisplay;
    below = 1;
   }
   else if (above > (pageCount - 4))
   {
    above = pageCount;
    below = (pageCount - nrOfPagesToDisplay);
   }

   start = below;
   end = above;
  }

  if (start > 3)
  {
   sb.Append(GeneratePageLink("1", 1));
   sb.Append(GeneratePageLink("2", 2));
   sb.Append("...");
  }
  for (int i = start; i <= end; i++)
  {
   if (i == currentPage)
   {
    sb.AppendFormat("<span class=\"current\">{0}</span>", i);
   }
   else
   {
    sb.Append(GeneratePageLink(i.ToString(), i));
   }
  }
  if (end < (pageCount - 3))
  {
   sb.Append("...");
   sb.Append(GeneratePageLink((pageCount - 1).ToString(), pageCount - 1));
   sb.Append(GeneratePageLink(pageCount.ToString(), pageCount));
  }

  // Next
  if (currentPage < pageCount)
  {
   sb.Append(GeneratePageLink(">", (currentPage + 1)));
  }
  else
  {
   sb.Append("<span class=\"disabled\">></span>");
  }
  return sb.ToString();
 }

 /// <summary>
 /// Generates a link to a page
 /// </summary>
 /// <param name="linkText">Text displayed on the page</param>
 /// <param name="pageNumber">Number of the page the link leads to</param>
 /// <returns></returns>
 private string GeneratePageLink(string linkText, int pageNumber)
 {
  var pageLinkValueDictionary = new RouteValueDictionary(linkWithoutPageValuesDictionary) {{"page", pageNumber}};
  var virtualPathData = RouteTable.Routes.GetVirtualPath(this.viewContext.RequestContext, pageLinkValueDictionary);

  if (virtualPathData != null)
  {
   const string linkFormat = "<a href=\"{0}\">{1}</a>";
   return String.Format(linkFormat, virtualPathData.VirtualPath, linkText);
  }
  return null;
 }
}

Details view.

There is nothing special about the details view - it's a usual MVC view that displays entity data.

@model Recipes.Models.LogEntry
@{
    ViewBag.Title = "Details";
}

<link href="@Url.Content("~/Content/logging.css")" rel="stylesheet" type="text/css" />

<h2>Details</h2>

<p>        
    @Html.ActionLink("Back to List", "Index")
</p>

<fieldset>
    <legend>Fields</legend>
        
    <div class="display-label">Id</div>
    <div class="display-field">@Model.Id</div>
        
    <div class="display-label">LogDate</div>
    <div class="display-field">@String.Format("{0:g}", Model.TimeStamp)</div>
        
    <div class="display-label">Source</div>
    <div class="display-field">@Model.Source</div>
        
    <div class="display-label">Type</div>
    <div class="display-field">@Model.LogType.Type</div>
        
    <div class="display-label">Message</div>
    <div class="display-field">
        <pre>@Model.Message.WordWrap(80)</pre>
    </div>
        
    <div class="display-label">StackTrace</div>
    <div class="display-field">@Model.StackTrace</div>                      
        
</fieldset>

<p>        
    @Html.ActionLink("Back to List", "Index")
</p>

Details Controller

public ActionResult Details(string loggerProviderName, string id)
{
 LogEntry logEvent = repository.GetById(id);

 return View(logEvent);
}

The final result is represented in the image below:

MVC logging engine

References

Processing Unhandled Exceptions (C#)
Logging in MVC Part 1- Elmah (and other posts of the series)
MVC Basic Site: Step 2 - Exceptions Management
How is Error.cshtml called in ASP.NET MVC?
by . Also posted on my website

Wednesday, March 20, 2013

Improving a PostgreSQL report performance: Part 1 - RETURN QUERY EXECUTE

I was working on optimising a report which had a very poor performance. The "heart" of the report was a fairly complex query which I will briefly refer to as follows

select Column1 as Name1, Column2 as Name2
from sometable tbl
inner join ...
where ...
and ...
and $1 <= somedate
and $2 >= somedate
group by ...
order by ...;

In fact, the query joined seven tables and several WHERE conditions, grouping on four fields and finally sorting the results. I went through the usual stuff with analyzing the query plan, verifying that all required indexes were in place (a couple of joins on particularly large tables, unfortunately, were on the 'varchar' fields, but modifying the database at this stage is out of the question so I had to do what I could). Eventually what limited amount of tricks I had at my disposal was depleted, and the performance only slightly improved. However, when I measured the performance of the report when called from the application and compared it to running the query directly against the database, there was a significant overhead in case of the report. When the report was ran from code, it sent the following query to the database:

select * from myreportsubfunction ('2013-03-13', '2013-03-20');

And the myreportsubfunction was declared similar to the following:

CREATE OR REPLACE FUNCTION myreportsubfunction(IN from timestamp without time zone, IN to timestamp without time zone)
  RETURNS TABLE(Name1 character varying, Name2 character varying) AS
$BODY$
select Column1 as Name1, Column2 as Name2
from sometable tbl
inner join ...
where ...
and ...
and $1 <= somedate
and $2 >= somedate
group by ...
order by ...;
$BODY$
  LANGUAGE sql VOLATILE

So - what's the trick here? The function seems to return the result of the query, but takes way much longer to execute compared to the raw query. And here is the reason: when the database prepares a query plan for the function, it does not know anything about the parameters. The result is likely to be a bad query plan, especially if the query is complex. The solution is to change sql language to plpgsql and make use of the RETURN QUERY EXECUTE command. Now the myreportsubfunction looks like the following:

CREATE OR REPLACE FUNCTION myreportsubfunction(IN from timestamp without time zone, IN to timestamp without time zone)
  RETURNS TABLE(Name1 character varying, Name2 character varying) AS
$BODY$
BEGIN
RETURN QUERY EXECUTE
'select Column1 as Name1, Column2 as Name2
from sometable tbl
inner join ...
where ...
and ...
and $1 <= somedate
and $2 >= somedate
group by ...
order by ...;' USING $1, $2;
END
$BODY$
  LANGUAGE plpgsql VOLATILE

The function now takes as much time to run as the "raw" query, significantly improving the performance.

References

39.5. Basic Statements
Postgresql Slow on custom function, php but fast if directly input on psql using text search with gin index
by . Also posted on my website

Sunday, March 10, 2013

Fixing PostgreSQL index bloating with scheduled REINDEX via pgAgent

There is a problem generally called 'index bloating' that occurs in PostgreSQL under certain circumstances when there are continuous inserts and deletes happening in a table. It is described as follows "B-tree index pages that have become completely empty are reclaimed for re-use. However, there is still a possibility of inefficient use of space: if all but a few index keys on a page have been deleted, the page remains allocated. Therefore, a usage pattern in which most, but not all, keys in each range are eventually deleted will see poor use of space. For such usage patterns, periodic reindexing is recommended".

This looks exactly like the problem I came across, when a table with ~2K rows had an index of over 120MB and another table with ~80K rows had 4 indexes on it, with total size approaching 3GB. The AUTOVACUUM was running as configured by default but apparently not enough to prevent index bloating.

Eventually, I decided configuring a REINDEX to run monthly on said tables. pgAgent is the job scheduler to use with PostgreSQL, but with PostgreSQL 9.1 I could install it following the documentation - the tables were created and the service was running, but I could not find any UI for it. So here's an example script that I used to create a scheduled job.

SET search_path = pgagent;

INSERT INTO pga_jobclass VALUES (6, 'Scheduled Tasks');

INSERT INTO pga_job VALUES (5, 6, 'TableReindex', 'Reindex tables', '', true, 
 '2013-03-27 10:00:00.000+11', --date created
 '2013-03-07 10:00:00.000+11', --date changed
 NULL, NULL, NULL);

INSERT INTO pga_schedule VALUES (3, 5, 'TableReindexSchedule', 'Reindex tables', 
 true, --enabled
 '2013-03-27 10:00:00.000+11', --start date
 NULL, --end (never)
 '{t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f}', --minutes: 't' for run on the first minute of an hour
 '{t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f}', --hours: 't' to run at 3 AM
 '{t,t,t,t,t,t,t}', -- weekdays: don't care, all false
 '{t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f}', -- monthdays: 't' to run on the first day
 '{t,t,t,t,t,t,t,t,t,t,t,t}'); -- months: all true to run on the first day on each month

INSERT INTO pga_jobstep VALUES (5, 5, 'TableReindexInfo', '', true, 's', 'REINDEX TABLE mytable1;REINDEX TABLE mytable2;', '', '@@DATABASE_NAME@@', 'f', NULL);

To verify, I checked the pga_job table and found '2013-04-01 03:00:00+11' in jobnextrun column - that's when I want to run it, at 3 AM on the first of next month.

I still have a question though - I tried using the

'{f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t}', -- monthdays: 't' 

to run on the first day pattern, because the last value is supposedly used for setting the "last day" - run on the last day of the month.

This, however, returns me quite a bunch of errors which suggest that PostgreSQL has some troubles calculating the next time the job will run.

ERROR:  value "32768" is out of range for type smallint
CONTEXT:  PL/pgSQL function "pga_next_schedule" line 1254 at assignment
SQL statement "SELECT                                                    MIN(pgagent.pga_next_schedule(jscid, jscstart, jscend, jscminutes, jschours, jscweekdays, jscmonthdays, jscmonths))
               FROM pgagent.pga_schedule
              WHERE jscenabled AND jscjobid=OLD.jobid"
PL/pgSQL function "pga_job_trigger" line 24 at SQL statement
SQL statement "UPDATE pgagent.pga_job
           SET jobnextrun = NULL
         WHERE jobenabled AND jobid=NEW.jscjobid"
PL/pgSQL function "pga_schedule_trigger" line 60 at SQL statement

********** Error **********

ERROR: value "32768" is out of range for type smallint
SQL state: 22003
Context: PL/pgSQL function "pga_next_schedule" line 1254 at assignment
SQL statement "SELECT                                                    MIN(pgagent.pga_next_schedule(jscid, jscstart, jscend, jscminutes, jschours, jscweekdays, jscmonthdays, jscmonths))
               FROM pgagent.pga_schedule
              WHERE jscenabled AND jscjobid=OLD.jobid"
PL/pgSQL function "pga_job_trigger" line 24 at SQL statement
SQL statement "UPDATE pgagent.pga_job
           SET jobnextrun = NULL
         WHERE jobenabled AND jobid=NEW.jscjobid"
PL/pgSQL function "pga_schedule_trigger" line 60 at SQL statement

The simplest way to reproduce it is to run an UPDATE similar to the following

UPDATE pgagent.pga_schedule
SET
jscmonthdays = '{f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t}'
WHERE jscid = 3

Since I don't really care about running it on a first or last day, I won't dig deep into it. Could be a bug in PostgreSQL for all I know.

References:

23.2. Routine Reindexing
How To Install PgAgent On Windows (PostgreSQL Job Scheduler)
Setting up PgAgent and Doing Scheduled Backups
Automated Backup on Windows
by . Also posted on my website